Experiments | AXP Documentation

Experiments let you understand how agents use your products. When you define an experiment, you specify:

the agents (each includes a model) that will be evaluated in the experiment
the tasks you want the agents to try to accomplish, in the form of prompts
the product(s) or product version(s) that you are trying to test
the environments that you want agents to operate in
how you will measure the success of the agent, with tests

Each of these is an axis, and you can give any axis a list of values (several agents, prompts, products, or environments) to see how changing one aspect affects the agent's performance. When you run the experiment, AXP computes the cross product of those values and runs the task across every resulting variant.

For the exact field names, types, and validation rules, see the experiment YAML reference.

For guidance on choosing the product question, variants, tests, and success criteria, see Experiment Design.

Variants

A variant is one fully resolved combination of an agent (and its model), prompt, product, and environment. It is the exact configuration that a single run executes inside an isolated sandbox. Variants let you compare the performance of different configurations against each other.

You don't write variants by hand. AXP produces them from the cross product of the axes described above, so every combination of the values you declared becomes its own variant. Declaring two agents and two products with a single prompt and a single environment, for example, gives you four variants.
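The expansion can be sketched in a few lines of Python. This is only an illustration of how a cross product yields variants, not AXP's implementation; the axis values and the `expand_variants` helper are hypothetical placeholders.

```python
from itertools import product

# Hypothetical axis values; these names are placeholders, not AXP identifiers.
axes = {
    "agent": ["agent-a", "agent-b"],
    "prompt": ["install-and-configure"],
    "product": ["product-v1", "product-v2"],
    "environment": ["linux"],
}


def expand_variants(axes):
    """Return one dict per combination of axis values (the cross product)."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


variants = expand_variants(axes)
print(len(variants))  # 2 agents x 1 prompt x 2 products x 1 environment
for variant in variants:
    print(variant)
```

Adding a value to any axis multiplies the number of variants, so the total grows quickly as axes gain values.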

Tests

Tests define how you measure the agent's success at the task in your experiment. Tests are defined as scripts inside the experiment YAML, and each one is automatically executed inside the sandbox after the agent finishes its work. A test passes when its script exits 0 and fails when the script exits with an error, so write each script to inspect the agent's work and exit with an error when the result is wrong. You can define two types of tests: application tests and introspection tests.
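The exit-status convention can be illustrated with a small Python sketch. It assumes a Python interpreter is available and uses a hypothetical `run_test_script` helper that stands in for whatever actually runs your scripts; it is not AXP's runner.

```python
import subprocess
import sys


def run_test_script(source: str) -> bool:
    """Run a script in a child process; it passes only if it exits with status 0."""
    result = subprocess.run([sys.executable, "-c", source], capture_output=True)
    return result.returncode == 0


# A script that exits 0 counts as a pass.
passing = "import sys; sys.exit(0)"
# sys.exit with a message writes it to stderr and exits with status 1.
failing = "import sys; sys.exit('expected output is missing')"

print(run_test_script(passing))
print(run_test_script(failing))
```

An uncaught exception also ends a Python script with a non-zero status, so a failed `assert` is enough to fail a test written this way.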

Application tests

Application tests check the resulting state: the files, commands, endpoints, or product behavior the agent was supposed to produce. Use them to verify that the answer or work the agent completed inside the sandbox is correct.
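As a rough sketch, the checking logic of an application test might look like the following. The `settings.json` file, its `api_url` key, and both helper functions are hypothetical, and a temporary directory stands in for the sandbox.

```python
import json
import tempfile
from pathlib import Path


def find_problems(workdir: Path) -> list[str]:
    """Return a description of each problem found in the agent's output."""
    settings = workdir / "settings.json"  # hypothetical file the task asks for
    if not settings.is_file():
        return ["settings.json was not created"]
    try:
        data = json.loads(settings.read_text())
    except json.JSONDecodeError:
        return ["settings.json is not valid JSON"]
    if not isinstance(data, dict) or "api_url" not in data:
        return ["settings.json has no api_url key"]
    return []


def exit_status(problems: list[str]) -> int:
    """Map the findings to a process exit status: 0 passes, 1 fails."""
    return 1 if problems else 0


# Demonstrate against a throwaway directory standing in for the sandbox.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    before = find_problems(workdir)
    (workdir / "settings.json").write_text(json.dumps({"api_url": "http://localhost:8080"}))
    after = find_problems(workdir)

print(before, exit_status(before))
print(after, exit_status(after))
```

In a real test script you would run the check against the location where the agent worked and end with `sys.exit(exit_status(problems))`, so that the script's exit status carries the result.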

Introspection tests

Introspection tests let you inspect the agent's trace to check how it completed the task. Use them to catch a variant that reached the right answer the wrong way, or one that took a particularly circuitous path to get there.
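The sketch below shows the kind of check an introspection test might perform. The trace shape used here (a list of steps, each with a `tool` field) is an assumption made for illustration and is not AXP's trace format; the tool names and the `trace_problems` helper are likewise hypothetical.

```python
def trace_problems(trace, forbidden_tools, max_steps):
    """Flag a path that is too long or that relies on tools it should not use."""
    problems = []
    if len(trace) > max_steps:
        problems.append(f"took {len(trace)} steps, expected at most {max_steps}")
    for index, step in enumerate(trace):
        tool = step.get("tool")
        if tool in forbidden_tools:
            problems.append(f"step {index} used forbidden tool {tool!r}")
    return problems


# Hypothetical trace: three steps, one of which uses a tool we want to rule out.
trace = [
    {"tool": "read_docs"},
    {"tool": "web_search"},
    {"tool": "write_file"},
]

print(trace_problems(trace, forbidden_tools={"web_search"}, max_steps=5))
print(trace_problems(trace, forbidden_tools=set(), max_steps=2))
```

A test script built on a check like this would exit with an error whenever the returned list is non-empty.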

Every experiment needs at least one test defined, either an application test or an introspection test. See the tests reference for scripting details.

Limits

Set limits in your experiment YAML file to prevent a single run from looping or spending too much time or money. You can cap the number of turns, the wall-clock time, and the cost of a run. When a run exceeds its wall-clock or reported-cost limit, AXP aborts the agent and marks the run timeout or cost_cap, respectively; turn limits are passed to the agent where supported.
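The relationship between limits and run outcomes can be modeled roughly as follows. This is a simplified sketch, not AXP's enforcement logic: the `Limits` class, the `run_status` function, and the `completed` status are hypothetical, and only `timeout` and `cost_cap` come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Limits:
    max_seconds: Optional[float] = None  # wall-clock cap; None means unlimited
    max_cost: Optional[float] = None     # reported-cost cap; None means unlimited


def run_status(samples: Iterable[tuple[float, float]], limits: Limits) -> str:
    """Classify a run from ordered (elapsed_seconds, reported_cost) snapshots."""
    for elapsed, cost in samples:
        if limits.max_seconds is not None and elapsed > limits.max_seconds:
            return "timeout"
        if limits.max_cost is not None and cost > limits.max_cost:
            return "cost_cap"
    return "completed"  # hypothetical label for a run that stayed within its limits


samples = [(10.0, 0.05), (60.0, 0.40), (130.0, 0.90)]
print(run_status(samples, Limits(max_seconds=120)))
print(run_status(samples, Limits(max_cost=0.25)))
print(run_status(samples, Limits(max_seconds=300, max_cost=1.00)))
```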
