Experiment Design

Design AXP experiments that produce useful product guidance.

AXP experiments work best when they answer a product question and point to a clear next step. This experiment design guide helps you move from observed agent behavior to a product change, docs change, regression test, or follow-up experiment.

The experiment loop

  1. Find a starting point. Start from observed agent behavior, product analytics, support feedback, prior runs, or a gut feel about where agents may hit friction.
  2. Ask a research question. Turn the starting point into a question that compares surfaces or variants. For example: "Do agents complete onboarding more reliably with an MCP tool than with docs alone?"
  3. Place a bet. Write the hypothesis in plain language before running the experiment. For example: "An agent with MCP will have a lower failure rate for onboarding and use fewer tokens than an agent without it."
  4. Define the experiment. Choose the setup, the variants, and the tests.
  5. Smoke test and refine. Run just enough trials to test the experiment itself. Fix ambiguous prompts, brittle tests, broken setup, or variants that do not isolate the thing you meant to study.
  6. Scale the experiment. Spend the larger budget across the full variant matrix.
  7. Analyze and decide. Read the outcomes, friction, cost, traces, and failure modes. End with product guidance or a narrower follow-up question.

For AXP concepts and YAML structure, see Experiments, Runs, Results, and the experiment YAML reference.

Find a starting point

You do not need a perfect observation before you start. A useful starting point can be a prior run, analytics, support feedback, a customer question, a product area you suspect is confusing, or a workflow you want agents to complete more reliably.

Examples we will follow through the rest of this page:

  • "I get visitors to my getting started page and downloads, but not much usage." (call this install friction)
  • "My CLI sessions have a lot of help command calls." (call this help loops)

Ask a research question

Good research questions are derived from your starting point. They name the product surface you want to study, the task the agent is trying to complete, the context it runs in, the outcome you care about, and the friction you want to reduce.

Examples:

  • Which onboarding surface helps agents complete installation most reliably and with the fewest retries?
  • Which CLI help or example structure helps agents complete setup with fewer repeated help calls and failed commands?

After the first run, ask whether the results actually teach you about the research question. If not, the test may be measuring the wrong thing, the prompt may be too artificial, or the variants may not isolate the contrast.

Place bets before running

Bets turn the research question into a testable hypothesis. Write down what you expect before running the experiment; good bets are concrete enough to be wrong:

  • "The exact install command will beat improved docs on install success, but improved docs will produce a tighter distribution of token usage during install."
  • "Task-oriented CLI examples will reduce repeated help calls more than renamed commands."

Avoid vague bets like "schema matters" or "docs should help." A good bet helps you design the experiment and makes surprises visible when results disagree with expectations.

Ask your coding agent to place a bet too; its expectations can reveal a different theory of the product, and the disagreement is often useful.

Define the experiment

Define three things: the setup every variant shares, the variants you compare, and the tests that score the result. In AXP these map onto the experiment YAML; see Experiments and the experiment YAML reference.

The setup

The setup is the scenario every variant runs through:

  • Task: what the agent is trying to do (the prompt).
  • Environment: the data, files, services, credentials, and runtime the agent gets, as defined by the experiment's environments and products and their setup.

Keep the setup stable unless the setup itself is the thing you are studying. If every variant runs through a different world, the results are harder to interpret.

You do not need to define friction metrics yourself. AXP captures wall-clock time, tokens, cost, tool calls, and failures for every run. Use limits (max_turns, max_time_seconds, max_cost_usd) to cap each run before you scale.

The variants

Variants are the changes you want to measure. You do not write each one out by hand. You declare the variables you want to compare (agents, prompts, environments, and products), and AXP runs every combination as a variant. Change one variable at a time so the contrast stays clean. For each variant, make clear:

  • what changed
  • what stayed constant
  • which research question it informs
  • which test outcomes should move if the bet is right
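
As a rough illustration of how declared variables expand into variants, the sketch below enumerates every combination with Python's itertools.product. The variable values are hypothetical placeholders loosely based on the install friction example, and this is neither AXP's own implementation nor its YAML syntax.

```python
from itertools import product

# Hypothetical variable values; the names are illustrative, not AXP identifiers.
variables = {
    "agent": ["agent-a", "agent-b"],
    "product": ["current-docs", "improved-docs", "exact-install-command"],
}


def expand_variants(variables):
    """Return one dict per combination of the declared variable values."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


variants = expand_variants(variables)
for variant in variants:
    print(variant)
print(len(variants), "variants")
```

Two agents and three products give six variants. Adding a third variable with two values would double that to twelve, which is one reason to change a single variable at a time.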

For install friction, compare products that swap the install surface: current docs as a control, improved getting started docs, clearer install output, common install-pattern docs, and an exact install command.

For help loops, compare products again: current CLI help, task-oriented examples, renamed commands, and a skill or MCP tool.

The tests

Tests should measure the task, not coach the agent.

  • Application tests check the artifact, output, file, command, endpoint, or product behavior the agent was supposed to produce.
  • Introspection tests check the trace to understand how the agent got there, such as retries, failed commands, unsafe commands, or tool-use patterns.

Keep scaffolding in tests, not in prompts.
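
To make the difference between application and introspection tests concrete, here is a minimal sketch of one test of each kind for the help loops example. The trace format (a list of dicts with command and exit_code keys), the function names, the mycli command, and the config.toml artifact are all assumptions made for illustration; AXP's actual test interface and trace schema may differ.

```python
from pathlib import Path


def application_test(workdir: Path) -> bool:
    """Application test: did the agent produce the expected artifact?

    Assumed criterion: setup leaves a non-empty config file behind.
    """
    config = workdir / "config.toml"  # hypothetical artifact name
    return config.is_file() and config.stat().st_size > 0


def is_help_call(command: str) -> bool:
    """Treat `tool --help` and `tool help ...` as help invocations."""
    tokens = command.split()
    return "--help" in tokens or "help" in tokens[1:]


def introspection_test(trace: list[dict], max_help_calls: int = 2) -> dict:
    """Introspection test: how did the agent get there?

    Counts help calls and failed commands in the assumed trace format.
    """
    help_calls = sum(1 for step in trace if is_help_call(step["command"]))
    failed = sum(1 for step in trace if step["exit_code"] != 0)
    return {
        "help_calls": help_calls,
        "failed_commands": failed,
        "passed": help_calls <= max_help_calls,
    }


# Invented trace for a hypothetical CLI called `mycli`.
example_trace = [
    {"command": "mycli --help", "exit_code": 0},
    {"command": "mycli init", "exit_code": 1},
    {"command": "mycli help init", "exit_code": 0},
    {"command": "mycli init --name demo", "exit_code": 0},
]
result = introspection_test(example_trace)
print(result)
```

Neither function asks the agent to change how it answers. The application test inspects what was left on disk, and the introspection test reads the trace after the fact.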

Smoke test and refine

Before spending a real budget, smoke test the experiment itself, not the agent. Validate it with axp validate, then run one variant once with axp run --variant <id> --repeat 1. Use --mock to exercise the setup, tests, and variants with no model spend.

Use the smoke test to check:

  • Does the setup work?
  • Do the tests work?
  • Do the results help answer the research question?
  • Are variants going through the intended paths?
  • Is the agent cheating or exploiting scaffolding?
  • Are failures real signal, or broken scaffolding?

Fix ambiguous prompts, brittle tests, broken setup, or leaky scaffolding, then smoke test again. When the setup works, the variants isolate the contrast, and the tests catch real success and failure, you are ready to scale.

Scale the experiment

Spend the larger budget: drop the --variant filter to run the full matrix, and raise --repeat for more trials per variant.
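
Before scaling, a quick back-of-the-envelope calculation can show the worst-case spend. The sketch below assumes that the total number of runs is variants times repeats and that every run is capped by max_cost_usd. The numbers are made up for illustration.

```python
def worst_case_budget(num_variants: int, repeat: int, max_cost_usd: float) -> float:
    """Upper bound on spend if every run hits its per-run cost cap."""
    return num_variants * repeat * max_cost_usd


# Illustrative numbers only: 6 variants, 20 trials each, capped at $0.50 per run.
total_runs = 6 * 20
budget = worst_case_budget(num_variants=6, repeat=20, max_cost_usd=0.50)
print(f"{total_runs} runs, at most ${budget:.2f}")
```

Actual spend is usually lower, since most runs finish before they reach the cap, but the bound is a useful check before raising --repeat.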

Analyze and decide

Start with the primary research question, then drill into variants, runs, tests, and traces. Read individual attempts in Runs and compare variants in Results.

Useful analysis patterns:

  • Compare distributions across variants, not just averages. A variant with the same mean but a tighter spread may be easier to trust.
  • Look for interaction effects between the things you varied. For example, docs might help one model more than another, or a CLI example might only help when paired with a specific prompt.
  • Ask your coding agent to find outliers in the results, then inspect the traces for those runs. Outliers often reveal the failure mode you actually need to fix.
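
Comparing distributions and finding outliers can both be sketched with the Python standard library. The example below summarizes token usage per variant by mean, median, and interquartile range (IQR), then flags runs that fall more than 1.5 times the IQR outside the quartiles. The data and variant names are invented, and the input shape is an assumption rather than AXP's results format.

```python
from statistics import mean, quantiles

# Invented token counts per run, keyed by hypothetical variant name.
tokens_by_variant = {
    "current-docs": [900, 1000, 600, 1900, 850, 1000, 800, 950],
    "improved-docs": [980, 1010, 1000, 1020, 990, 1000, 1005, 995],
}


def summarize(values):
    """Summarize one variant's runs and flag outliers with the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"mean": mean(values), "median": median, "iqr": iqr, "outliers": outliers}


summaries = {name: summarize(values) for name, values in tokens_by_variant.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Both invented variants have the same mean, but the second has a much smaller IQR and no flagged runs. The flagged run in the first variant is the one whose trace is worth reading first.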

When presenting results, include the research question, variants, test criteria, main evidence, and the decision you are making.

Reference: common experiment patterns

  • Discovery and install: Where do agents hit bottlenecks when discovering or onboarding onto a product?
  • Core user flows: Where do agents succeed or fail at important jobs-to-be-done?
  • Product optimization: Which variation of an API, CLI command, tool description, or interface helps agents succeed faster?
  • Interface comparison: Do agents perform better with docs, skills, MCP tools, APIs, or another interface?
  • Competitive analysis: What alternatives do agents reach for, and what can you learn from their agent experience?
  • Product marketing: Can you generate evidence that agents are more effective with your product?

Reference: common failure modes

Tests leak into prompts

Bad:

Answer each question as JSON with keys q1, q2, q3.
Use the exact sentinel INSUFFICIENT_DATA.

That prompt tests whether the agent follows a harness protocol, not whether it can solve the user's task. Ask the task naturally, then make tests robust enough to parse natural answers.
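
One way to keep the protocol out of the prompt is to make the test tolerant of natural phrasing. The sketch below checks a free-form answer for an expected number and for common ways of saying that the data is insufficient. The function names and the phrase list are illustrative assumptions; a real test would need patterns tuned to the task.

```python
import re

# Assumed phrasings an agent might use instead of a fixed sentinel.
INSUFFICIENT_PATTERNS = [
    r"insufficient (data|information)",
    r"(not|n't) enough (data|information)",
    r"can(not|'t) (be determined|determine|tell)",
]


def says_insufficient(answer: str) -> bool:
    """True if the answer states, in natural language, that data is lacking."""
    text = answer.lower()
    return any(re.search(pattern, text) for pattern in INSUFFICIENT_PATTERNS)


def mentions_number(answer: str, expected: float, tolerance: float = 0.01) -> bool:
    """True if any number in the answer is within tolerance of the expected value."""
    numbers = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", answer)
    return any(
        abs(float(number.replace(",", "")) - expected) <= tolerance
        for number in numbers
    )


print(says_insufficient("There isn't enough information in the logs to say."))
print(mentions_number("The total comes to 1,250 requests per day.", 1250))
```

With checks like these, the prompt can stay as a plain question, and the parsing burden sits in the test where it belongs.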

Testing too many things

Every extra variable multiplies the number of variants. Keep the first experiment focused on the decision you need to make.

Making everything a variant

Not every difference belongs in one variant matrix. If you want to test meaningfully different setups, tasks, or worlds, separate experiments are often clearer.