Overview | AXP Documentation

AI agents are a new kind of user for your product. AXP helps you find out how well they can actually use it.

Unlike eval tools that score model output against datasets, AXP runs sandboxed experiments that measure whether agents can operate your real product surfaces (CLIs, SDKs, APIs, web apps, MCP servers, and docs) in realistic, complex environments.

The goal is to use those experiments to observe agents and turn agent behavior into repeatable product evidence.

It is built for product, DevRel, and engineering teams improving how agents use their product, and for teams evaluating competing products before they adopt them.

Use AXP to test new features, improvements, and bug fixes before you release them. You can answer questions like:

How do agents use your product?
Do changes to your product make agents better or worse at completing tasks?
Which models or coding agents perform best with your product, and which perform worst?

You can also run AXP experiments to evaluate competing products that you are considering adopting, buying, or investing in. You can answer questions like:

How does your product perform against an agent inventing a solution from scratch?
How does your product perform against your competition?
In the contexts your customers work in, do agents suggest your product?

These questions map to common experiment patterns: discovery and install, core user flows, product optimization, interface comparison, competitive analysis, and product marketing. See Experiment Design.

How AXP Works

Design an experiment: choose the question, variants, and tests.
Run your experiment: execute each variant in its own sandbox.
Analyze results: compare variants and decide what to change.
Iterate: feed each result into your next experiment.

AXP has three main pieces:

Design an experiment: define the task, the variants you compare (different agents, models, environments, or product versions), and the tests (application and introspection) that measure success.
Run your experiment: execute each variant in its own sandbox on the AXP platform with axp run, or in local Docker with axp local run. Each run captures logs, file changes, test results, token usage, and cost.
Analyze results: compare data across runs and variants so you can see what worked, what failed, and what to change next.

experiment.yaml

# yaml-language-server: $schema=https://docs.514.ai/schema/experiment.v2.schema.yaml
schema_version: 2
id: add-healthcheck
name: "Add a healthcheck endpoint"
agents:
  - name: claude
    model: claude-sonnet-4-6
prompts:
  - id: healthcheck
    prompt: |
      Add GET /health. It should return JSON with { "ok": true }.
environments:
  - name: current-cli
    setup: "true"
  - name: next-cli
    setup: |
      npm install -g @acme/cli@next
tests:
  application:
    - name: healthcheck-responds
      script: |
        curl -fsS http://localhost:3000/health | jq -e '.ok == true'
  introspection: []
limits:
  max_turns: 25
  max_time_seconds: 300
  max_cost_usd: 0.50

Experiments

The spec that defines the task, variants, setup, and tests you want an agent to run.
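The application test in the spec above is a shell script, so one natural reading is that a check passes when its script exits with status 0. The source does not state how AXP scores scripts, so treat the following as a minimal sketch under that assumption. The `run_check` helper is hypothetical and not part of the AXP CLI, and the stand-in scripts replace the `curl | jq` check so the sketch runs without a server on localhost:3000.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of AXP: run a test script and report pass/fail
# from its exit status (0 = pass, anything else = fail). This scoring rule is
# an assumption, not documented AXP behavior.
run_check() {
  local name=$1 script=$2
  if bash -c "$script" >/dev/null 2>&1; then
    echo "$name: pass"
  else
    echo "$name: fail"
  fi
}

# Stand-in scripts so the sketch runs without a real service.
ok_result=$(run_check always-ok 'true')
bad_result=$(run_check always-bad 'exit 3')

echo "$ok_result"
echo "$bad_result"
```

Under this reading, the `jq -e` flag in the sample test matters: it makes jq exit non-zero when the expression evaluates to false or null, so the pipeline fails if the endpoint does not return `"ok": true`.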

terminal

❯ axp run --watch -j 6 ./examples/experiments/git-init-commit.yaml
2026-06-11T21:22:01.957682Z  INFO axp.command{command="run" cli_version="0.3.22-rp" channel="prod" platform_url_kind="production"}:axp.run{repeat=1 jobs=6 dry_run=false agent.driver="acp"}: axp: running variants with concurrency 6
[trunk-branch] ▶ started
[main-branch] ▶ started
[main-branch]   Let me initialize a git repository, create a README.md, commit it, and write the SHA to a file.
[main-branch] 🔧 Terminal
[trunk-branch]   Let me complete this task step by step:
[trunk-branch]   1. Initialize a new git repository at /workspace/repo
[trunk-branch]   2. Create README.md with exactly "hello from $AXP_VARIANT_ID" (literal string, not expanded)
[trunk-branch]   3. Commit with message "initial commit"
[trunk-branch]   4. Write the full 40-char commit SHA to /workspace/sha.txt
[trunk-branch] 🔧 Terminal
[trunk-branch]    ↳ update
[trunk-branch] ⏸ permission: execute — auto-allowed
[trunk-branch]    ↳ update
[trunk-branch]    ✓ completed
[main-branch]    ↳ update
[main-branch] ⏸ permission: execute — auto-allowed
[main-branch]    ↳ update
[main-branch]    ✓ completed
[trunk-branch] 🔧 Write
[main-branch] ↻ cost=$0.0336
[trunk-branch]    ↳ update
[main-branch]   Done. Commit SHA `ecba1736b2f2dbe07dbb50a61ee896907a77ad8d` written to `/workspace/sha.txt`.
[trunk-branch] ⏸ permission: edit — auto-allowed
[trunk-branch]    ↳ update
[trunk-branch]    ✓ completed
[main-branch] ✓ done — status=fail cost=$0.0336 turns=4 tokens=3/282
[trunk-branch] 🔧 Terminal
[trunk-branch]    ↳ update
[trunk-branch] ⏸ permission: execute — auto-allowed
[trunk-branch]    ↳ update
[trunk-branch]    ✓ completed
[trunk-branch] 🔧 Terminal
[trunk-branch]    ↳ update
[trunk-branch]    ↳ update
[trunk-branch]    ✓ completed

Runs

A completed execution of an experiment, including variant outputs, costs, logs, and artifacts.
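When a run's terminal output has been saved to a file, the per-variant summary lines can be pulled out with standard tools. The sketch below assumes the summary line format seen in the sample output (`[variant] ✓ done — status=… cost=$… turns=… tokens=…`). That format is taken from one example and may change between CLI versions, and the temporary log file here is a hypothetical stand-in for real saved output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical saved output; the lines are copied from the sample run.
log=$(mktemp)
cat > "$log" <<'EOF'
[main-branch] ↻ cost=$0.0336
[main-branch] ✓ done — status=fail cost=$0.0336 turns=4 tokens=3/282
[trunk-branch]    ✓ completed
EOF

# Only summary lines carry "status=..."; print variant, status, and cost for each.
summary=$(sed -n 's/^\[\([^]]*\)\].* status=\([a-z]*\) cost=\$\([0-9.]*\) .*/\1 \2 \3/p' "$log")
echo "$summary"

rm -f "$log"
```

Note that in the sample output the agent reports "Done" while the variant still ends with status=fail, so the summary line, not the agent's own message, is the thing to check.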

Example results dashboard: average cost $0.57, average wall clock 5m00s, 321 tool failures, 100% test pass rate. Chart: cost vs wall clock, grouped by variant (wall-clock axis from 0m00s to 13m20s).

Results

The summaries you generate from runs to compare variants and decide what to change next.
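To make the idea of a summary concrete, the sketch below computes an average cost and a test pass rate from a few rows of per-variant data. It is not AXP's own analysis tooling: the row layout (variant, status, cost) and the values are illustrative assumptions rather than real results.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative rows only (variant, status, cost in USD); not real AXP results.
rows='variant-a fail 0.0336
variant-b pass 0.0500'

# Average the cost column and compute the share of rows whose status is "pass".
stats=$(printf '%s\n' "$rows" | awk '
  { total += $3; if ($2 == "pass") passed++ }
  END { printf "avg_cost=%.4f pass_rate=%d%%", total / NR, 100 * passed / NR }
')
echo "$stats"
```

Grouping the same calculation by variant, agent, or model is what lets you compare one change against another instead of reading individual logs.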

Installation

The fastest way to get started is the official AXP CLI installer:

bash <(curl -fsSL https://dl.514.ai/install.sh) axp
axp --help

Platform runs require an account (currently closed alpha). Request access, then sign in with axp auth login. See Installation for details.

Supported Coding Agents

AXP currently supports Anthropic Claude Code, OpenAI Codex, and Cursor.

Need support for another coding agent? Ask us to add it.

Next steps

Install AXP and sign in.
Run your first experiment in Getting Started.
Design your own with Experiment Design.
