Experiment YAML
Reference for the AXP experiment YAML schema (v2) — axes, agents and models, setups, extensions, tests, limits, secrets, files, MCP servers, and validation rules.
Experiment YAML is the authoring format passed to axp run and axp local run. The current parser accepts schema_version: 2 and rejects unknown fields at every level of the document.
v2 replaces v1's hand-assembled matrix.variant[] with independent axes (agents, prompts, and optional environments / products) that the harness cross-multiplies into variants automatically, plus a recursive extensions mechanism for the cases a clean cross product can't express. Only agent and prompt are required — every resolved variant must carry both — and each may be supplied at the top level or inside an extension. The model is not its own axis: it rides with the agent (agents[].model), so an agent and its model are chosen together.
Schema URL
Use the schema URL in authored experiment files for editor validation:
```yaml
# yaml-language-server: $schema=https://docs.514.ai/schema/experiment.v2.schema.yaml
```

Published schema URLs:
- Latest supported schema: https://docs.514.ai/schema/experiment.schema.yaml
- Version-pinned v2 schema: https://docs.514.ai/schema/experiment.v2.schema.yaml

axp schema prints the latest supported schema to stdout.
```yaml
# yaml-language-server: $schema=https://docs.514.ai/schema/experiment.v2.schema.yaml
schema_version: 2
id: clickhouse-cli-or-not
name: "Should agents use the ClickHouse CLI?"
description: |
  Context used when analyzing which arm produced the better outcome.
agents:
  - name: claude
    model: claude-sonnet-4-6
prompts:
  - id: analysis
    prompt: |
      Read /workspace/task.md and write the result to /workspace/report.json.
environments:
  - name: workspace
    setup: |
      # stage the shared fixture both arms use
      printf 'event,n\na,1\n' > /workspace/events.csv
# Two arms that differ in product setup + a prompt suffix, expressed as extensions.
extensions:
  - id: with-cli
    products:
      - name: clickhouse
        type: CLI
        setup: curl -fsSL https://clickhouse.com/ | sh
    prompts: ["Use the ClickHouse CLI for the analysis."]
  - id: without-cli
    products:
      - name: baseline
        setup: "true"
    prompts: ["Do not use the ClickHouse CLI; use another local method."]
tests:
  application:
    - name: report-exists
      script: |
        test -f /workspace/report.json
limits:
  max_turns: 25
  max_time_seconds: 300
  max_cost_usd: 0.50
```

Top-level fields
| Field | Requirement |
|---|---|
| schema_version | Required. Must be 2. |
| id | Required. Experiment identifier. Must be kebab-case matching ^[a-z0-9][a-z0-9-]*$. |
| name | Required. Human-readable experiment name. Must not be empty after trimming whitespace. |
| description | Optional. Human-readable context AXP can use when analyzing the outcome. If present, must not be empty after trimming. |
| agents | Agent axis. One agent or a list. Each agent is a name (claude / codex / cursor) or a {name, model?} object whose model is a bare model id or a model object. A bare agent name uses the adapter's provider-default model. Optional at the top level (an extension may supply it), but every resolved variant must have an agent. |
| prompts | Prompt axis. One prompt string, a list of strings, or a list of {id, prompt, tags?} objects. A bare string gets a positional id (p0, p1, …). Optional at the top level (an extension may supply it). |
| environments | Optional environment axis. An object {name, setup, version?, commit?, tags?}, a list, or a bare string (sugar for an environment whose setup is that string, positional name e{idx}). setup is bash or setup object(s). Absent → the variant has no environment/setup unless an extension adds one. |
| products | Optional product axis: the target system under test. An object {name, type?, setup, version?, commit?, tags?}, like an environment with an optional type and version/commit. A bare string is sugar for its setup (positional name pr{idx}). Cross-multiplied when present. |
| extensions | Optional. Recursive refinements applied instead of the bare base cross product (narrow / redirect / compose). See Extensions. |
| secrets | Optional. Host environment variable names injected into every variant. The YAML stores names only, never values. Global to the experiment. |
| files | Optional. Local host files / directories staged into every variant's /workspace before setup runs. Global. See Files. |
| tests | Required. Contains application and introspection lists. At least one test must exist across both lists. |
| limits | Required. Hard stops for the run. |
secrets and files are global to the experiment — they apply to every variant. MCP servers and setup checks are not top-level in v2: they belong to setup objects, so a variant exposes the MCP servers and runs the setup checks of the setups its environment and products reference.
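The shorthand forms in the table can be combined. The fragment below is a minimal sketch of the bare-string sugar, not a complete experiment (tests and limits are omitted); the prompt text and the setup command are illustrative, and the positional ids in the comments follow the naming described in the table.

```yaml
agents: claude                    # bare agent name: provider-default model
prompts:
  - "Refactor the parser."        # positional id p0
  - "Document the parser."        # positional id p1
environments: "touch /workspace/ready"   # bare string: setup is this string, positional name e0
```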
Axes and variants
The harness cross-multiplies the axes into variants automatically:
```
variants = agents × prompts × environments × products
```

environments and products are optional coordinates (when absent, the variant simply has no environment/product). Each resolved variant gets a derived variant_id of <agent>[__<model>][__<effort>][__<context_window_size>][__thinking][__fast]__<prompt_id>[__<environment_id>][__<product_id>][__<extension_path…>] and a human-readable tag whose parts are joined with ·. The coordinate ids (agent, model and its controls, prompt, environment, product) are recorded so results can be grouped and filtered along them.
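As a rough illustration of the cross product, the fragment below declares two agents, two prompts, and one environment, which should resolve to 2 × 2 × 1 = 4 variants. The ids in the trailing comments are derived by hand from the pattern above, assuming the environment id is the environment's name and the model segment is the model id verbatim; treat them as a sketch rather than guaranteed harness output.

```yaml
agents:
  - claude                        # bare name: no model segment in the variant_id
  - name: codex
    model: gpt-5
prompts:
  - id: refactor
    prompt: "Refactor the parser."
  - id: document
    prompt: "Document the parser."
environments:
  - name: workspace
    setup: "true"
# Expected variant ids under the assumptions above:
#   claude__refactor__workspace
#   claude__document__workspace
#   codex__gpt-5__refactor__workspace
#   codex__gpt-5__document__workspace
```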
Agents and models
Each agents entry pairs a coding agent with the model it runs. The model is not a separate axis — it is chosen together with the agent:
- A bare agent name (claude) uses the adapter's provider-default model; the resolved model coordinate is empty and no model segment appears in the variant_id.
- An agent object carries {name, model?}. model is either a bare model id (claude-opus-4-8) or a model object with optional controls:

| Model field | Meaning |
|---|---|
| name | Required (when model is an object). The model id, e.g. claude-opus-4-8, gpt-5. Wired to the agent as MODEL. |
| effort | Optional reasoning-effort control: one of low / medium / high / x-high / max. Wired as LEVEL_OF_EFFORT; each adapter honours what it supports. |
| context_window_size | Optional context-window size hint (a string, e.g. 1M). Wired as CONTEXT_WINDOW; informational for fixed-context hosted models. |
| thinking | Optional boolean. Enable thinking mode if the model supports it. Wired as THINKING; adds a thinking segment to the variant_id when true. |
| fast | Optional boolean. Enable fast mode if the model supports it. Wired as FAST; adds a fast segment to the variant_id when true. |

Each control is single-valued (it does not sweep into multiple variants); to compare two models or two effort levels, list two agent entries.
```yaml
agents:
  - name: claude
    model:
      name: claude-opus-4-8
      effort: high
      context_window_size: 1M
      thinking: true
  - name: codex
    model: gpt-5   # bare model id, provider defaults for the rest
prompts: "Refactor the parser."
```

Product type
A product may declare a type describing the surface under test. It defaults to Other and is recorded as a result dimension. Allowed values: CLI, MCP, API, Skill, SDK, Schema, Docs, Marketing, Agents.md, Other.
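A short sketch of how type might be used to label two surfaces under test. The product names and the no-op setup scripts are illustrative placeholders, not part of any real experiment.

```yaml
products:
  - name: docs-site
    type: Docs          # one of the allowed values listed above
    setup: "true"
  - name: plain-baseline
    setup: "true"       # no type given, so it is recorded as Other
```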
Variant isolation
Each variant runs in its own sandbox with its own /workspace, and variants execute in parallel. Nothing crosses the boundary between variants: there is no shared filesystem, no shared environment, no implicit ordering. A file written by one variant's setup is invisible to every other variant.
Do:
- Produce any fixture state inside this variant's own setup (clone, build, scaffold, whatever this variant needs).
- Fetch external artifacts (registries, S3, git remotes) from setup if they need to come from outside the harness.

Don't:
- Wait on a path under /workspace/... expecting another variant to create it. It never will, and the wait loop will hang until the setup exec timeout fires and fails the variant.
- Assume any side effect from another variant's setup, setup_checks, or test scripts is visible here.

If you find yourself wanting to share state across variants, bake it into the base image, materialise it in each variant's setup independently, or restructure the experiment so each variant is genuinely self-contained.
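One way to keep variants self-contained is to have every arm materialise the shared fixture in its own setup. The sketch below reuses the fixture command from the example at the top of this page; the product names and the extra index file are hypothetical.

```yaml
products:
  - name: with-index
    setup: |
      # each variant writes its own copy of the fixture
      printf 'event,n\na,1\n' > /workspace/events.csv
      touch /workspace/events.idx
  - name: without-index
    setup: |
      # repeated on purpose: nothing written by with-index is visible here
      printf 'event,n\na,1\n' > /workspace/events.csv
```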
Setups
An environment's or product's setup is a bash string, a list of strings, or one or more setup objects — and a setup object is a first-class thing that owns the per-variant resources around its script. A bare string is sugar for a setup object whose script is that string (with a positional name, s{idx}).
| Setup field | Requirement |
|---|---|
| name | Required for the object form. Kebab-case; recorded as a result dimension. A bare-string setup gets a positional name. |
| script | Required. Bash run in the variant sandbox before setup_checks and the agent. |
| description | Optional. |
| tags | Optional. |
| files | Optional. Host files staged into /workspace before this setup's script runs (same shape as top-level files). |
| secrets | Optional. Host env-var names injected for setup_checks and the agent of variants that resolve this setup. |
| mcp_servers | Optional. MCP servers exposed to the agent for variants that resolve this setup. |
| setup_checks | Optional. Setup checks run after this setup's script but before the agent. |

A variant builds its merged setup from the setups referenced by its product and then by its environment, in declaration order. Their scripts run sequentially, and their mcp_servers / setup_checks / secrets / files are unioned (experiment-global secrets / files are unioned in too).
```yaml
environments:
  - name: workspace
    setup:
      - name: stage-fixture
        script: printf 'event,n\na,1\n' > /workspace/events.csv
        secrets: [GITHUB_TOKEN]
        mcp_servers:
          - name: fixture-sentinel
            type: stdio
            command: /workspace/fixture-mcp.py
        setup_checks:
          - name: fixture-present
            script: test -f /workspace/events.csv
```

Extensions
When a clean cross product can't express the variants you want — co-varying a product with a prompt, or adding a new dimension only to some arms — use a top-level extensions list. Each extension is a node in a recursive tree; walking root→leaf:
- redirects axes: an extension's agents / environments / products, when present, replace that axis for its subtree (narrow to a subset, or point at a different set). This is how a required axis can be supplied only in an extension.
- contributes a prompt: the extension's prompts text is appended to every inherited base prompt (joined by a blank line); the base prompt_id is kept. When there is no top-level prompt, the extension's prompt stands alone as the variant's prompt under a synthesized prompt_id (p0).
- unions tags.

A leaf extension (one with no nested extensions) emits the cross product of its accumulated axes. When an experiment declares any extensions, only extension-derived variants are emitted — the bare base cross product is not. Each extension-derived variant records a resolved_extend_id (the extension id-path joined by ::, e.g. sec-edgar::aapl), and its variant_id appends that path with __.
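Nesting can be sketched as follows. The sec-edgar and aapl ids come from the example id-path above; the msft sibling, the product definition, and the prompt texts are hypothetical. Under the rules just described, only the two leaves should emit variants, with resolved_extend_id values sec-edgar::aapl and sec-edgar::msft.

```yaml
agents: claude
prompts:
  - id: summarize
    prompt: "Summarize the latest filing into /workspace/report.json."
extensions:
  - id: sec-edgar
    # redirects the product axis for the whole subtree
    products: [{ name: edgar, type: API, setup: "true" }]
    extensions:
      - id: aapl
        prompts: ["Use ticker AAPL."]   # appended to the base prompt; prompt_id stays summarize
      - id: msft
        prompts: ["Use ticker MSFT."]
```

The parent sec-edgar node is not a leaf, so it contributes its product to both children but emits no variant of its own.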
| Field | Requirement |
|---|---|
| id | Required. Kebab-case. Unique among siblings within the same extensions list. |
| description | Optional. Must not be empty when present. |
| tags | Optional. Unioned into the resolved variant's tags. |
| agents / environments / products | Optional. When present, redirect that axis for the subtree. |
| prompts | Optional. Contributes suffix text appended to the inherited prompt (it does not replace the prompt axis). |
| extensions | Optional. Nested extensions compose recursively. |

```yaml
# One shared task, two products, each with its own prompt suffix.
prompts:
  - id: analyze
    prompt: "Read /workspace/task.md and write /workspace/report.json."
extensions:
  - id: with-cli
    products: [{ name: cli, type: CLI, setup: "curl …/clickhouse | sh" }]
    prompts: ["Use the ClickHouse CLI for the analysis."]
  - id: without-cli
    products: [{ name: no-cli, setup: "true" }]
    prompts: ["Do not use the ClickHouse CLI; use another local method."]
```

Tests
tests.application checks the resulting application state, such as files, commands, or endpoints.
tests.introspection is a separate test kind for trace-oriented checks (for example, inspecting the agent trace at $AXP_TRACE_PATH). Their logs are written separately from application test logs.
Each test has:
| Field | Requirement |
|---|---|
| name | Required. Must be kebab-case matching ^[a-z0-9][a-z0-9-]*$. |
| script | Required. Shell script to execute. Must not be empty after trimming whitespace. |

At least one test is required across application and introspection. Test names must be globally unique across both kinds. Test scripts are streamed over stdin, so the agent never sees them.
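A minimal sketch with one test of each kind. The test names are illustrative, and the introspection script only checks that a non-empty file exists at $AXP_TRACE_PATH; it makes no assumption about the trace format.

```yaml
tests:
  application:
    - name: report-not-empty
      script: |
        test -s /workspace/report.json
  introspection:
    - name: trace-not-empty
      script: |
        test -s "$AXP_TRACE_PATH"
```

Because names must be unique across both kinds, the two tests use different names even though they live in separate lists.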
Setup checks
setup_checks are bash preflight checks that run inside the variant sandbox after the setup script but before the agent, to fail fast on a broken environment instead of burning agent tokens. They are declared on a setup object (not at the top level). Each is a {name, script} entry; they run in declaration order and the first non-zero exit short-circuits the variant (exit_reason: setup_check_failed). Unlike the setup script, setup checks run with the variant's resolved secrets injected.
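Since setup checks see the resolved secrets, one plausible use is to confirm a credential is present before the agent starts. In this sketch the setup name and check names are hypothetical, and GITHUB_TOKEN is assumed to be supplied on the host at run time.

```yaml
environments:
  - name: workspace
    setup:
      - name: clone-fixture
        script: "true"
        secrets: [GITHUB_TOKEN]
        setup_checks:
          - name: token-present          # runs first; a non-zero exit stops the variant
            script: test -n "$GITHUB_TOKEN"
          - name: workspace-writable     # only runs if the previous check passed
            script: test -w /workspace
```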
Limits
| Field | Requirement |
|---|---|
| limits.max_turns | Required. Must be greater than 0. Passed to the agent as its turn cap. |
| limits.max_time_seconds | Required. Must be greater than 0. Enforced by the harness as the wall-clock timeout for the agent invocation. |
| limits.max_cost_usd | Required. Must be greater than 0. Parsed and persisted in resolved variant YAML; this build does not abort a run when reported cost reaches the value. |

Secrets
Declare host environment variable names with top-level secrets (global to the experiment). The YAML stores names only, never values. A setup object may declare additional secrets, which are unioned into the variants that resolve it.
```yaml
secrets:
  - GITHUB_TOKEN
  - DATABASE_URL
```

Secret names must match ^[A-Z_][A-Z0-9_]*$. Lowercase names, names beginning with a digit, names containing hyphens or spaces, and empty names are rejected. Duplicates within the list are rejected.
Reserved names cannot be declared as experiment secrets:
- ANTHROPIC_API_KEY
- ANTHROPIC_BASE_URL
- OPENAI_API_KEY
- OPENAI_BASE_URL
- CURSOR_API_KEY
- MODEL
- MAX_TURNS
- IS_SANDBOX
- TRACEPARENT
- any name beginning with AXP_, CLAUDE_CODE_, CODEX_, CURSOR_, or OTEL_

Secrets are injected only by local runs (axp local run, via --env / --env-file). The platform path (axp run) does not yet deliver secrets and rejects any experiment that declares them at submit. See Secrets and auth.
Files
Stage local host files or directories into every variant's /workspace before setup runs — the path for local, uncommitted builds (CLIs, MCP servers, fixtures) to enter a sandbox. Top-level files is global to the experiment; a setup object may also declare files scoped to the variants that resolve it.
```yaml
files:
  - name: my-cli                 # kebab-case handle; required when source is omitted
    source: ../build/mycli       # host path; relative paths resolve against this YAML's directory
    dest: tools/mycli            # workspace-relative destination
  - source: https://example.com/fixtures/data.bin   # or an http(s) URL to a public artifact
    sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855   # optional integrity check
    dest: fixtures/data.bin
```

Rules and behaviour:
- A directory source extracts its contents under dest/; a file source lands as a single file at dest. No bind mounts are involved: files are copied into the sandbox before it starts working.
- A source may be an http:// / https:// URL to a publicly fetchable artifact (no authentication or custom headers). The runner downloads it on the host before staging, and it lands as a single file at dest. An archive is not auto-extracted, so unpack it in setup if you need its contents. Other URL schemes (ftp://, file://, …) are rejected at validate time.
- An optional sha256 (64 hex characters) verifies the staged bytes against the digest and fails staging on mismatch, for a local file source or a URL download alike. It is not valid for a directory source. On the CLI, --file-sha256 NAME=HASH attaches or overrides a named entry's digest; ad-hoc --file SOURCE::DEST entries have no name, so they can't carry one.
- Sandboxes are Linux containers. A macOS-native binary staged from your laptop will not execute there, so cross-compile WIP binaries to a Linux target before staging them.
- dest must be workspace-relative: absolute paths, .., .axp-bridge components, and :: are rejected at validate time. A duplicate dest within a scope is rejected.
- An entry may omit source, in which case it must have a name and the source is bound at run time: --file NAME=SOURCE, where SOURCE is a host path or an http(s) URL. The same flag overrides the source of any named entry. --file SOURCE::DEST stages an ad-hoc entry into every variant.
- Directory walks honor .axpignore (gitignore syntax). The real .gitignore is deliberately not consulted: build outputs are usually gitignored, and staging a local build is the point. .git/ is always excluded; symlinks inside the walk are skipped and counted, while a source that is itself a symlink is followed. Owner-executable files stay executable in the sandbox.
- Missing or unbound sources abort the run at preflight, listing every problem at once, but only for variants actually scheduled to run. --dry-run instead renders the staging table with MISSING / UNBOUND annotations and never fails on them.
- A staging failure mid-run rolls the variant up as status=error / exit_reason=staging_failed; the rest of the run keeps going. Per-entry records of what actually staged land in staging.json inside the variant artifact directory.
- Platform runs stage files automatically. axp run builds one deterministic tar per entry locally, uploads only the digests the platform doesn't already have (content-addressed), and the in-sandbox bridge downloads, digest-verifies, and extracts each entry before setup. Staging works the same on axp run and axp local run.
- Sources are read with the invoking user's permissions and may point anywhere on the host. Treat an experiment YAML like a script you run: review files: entries from untrusted sources before running.

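The run-time binding rule can be sketched with a name-only entry. The entry name and paths below reuse the ones from the example above; only the --file NAME=SOURCE flag shape is taken from the rules, and the rest of the command line is left out because it is not specified here.

```yaml
files:
  - name: my-cli        # no source: must be bound when the run is launched
    dest: tools/mycli
# Bind at run time with:   --file my-cli=../build/mycli
# Without a binding, preflight reports the entry as unbound
# (or --dry-run shows it with an UNBOUND annotation).
```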
MCP servers
Experiments can expose Model Context Protocol (MCP) servers to the coding agent. They are declared on a setup object (not at the top level), so a variant gets the MCP servers of the setups its environment and products reference. AXP passes the list through ACP session/new after setup has run, so stdio commands may point at files produced by setup. Names must be unique across the servers a variant resolves.
Every MCP server entry is tagged by transport via type:
```yaml
environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: fixture-sentinel
            type: stdio
            command: /workspace/fixture-mcp.py
            args: []
          - name: axp
            type: http
            url: http://localhost:3001/mcp
```

Three transports are supported: stdio (command, optional args), http (url), and sse (url).
Forwarding secrets to MCP servers
Stdio entries may declare an env list and HTTP/SSE entries may declare headers. Both reference experiment-declared secrets by name; values are resolved at runtime and never appear in the experiment YAML.
```yaml
secrets:
  - GITHUB_TOKEN
  - SUPABASE_SERVICE_ROLE_KEY
environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: gh-mcp
            type: stdio
            command: /workspace/gh-mcp
            env:
              - GITHUB_TOKEN                            # sugar for {name: GITHUB_TOKEN, from: GITHUB_TOKEN}
              - { name: GH_AUTH, from: GITHUB_TOKEN }   # rename on the way in
          - name: supabase-mcp
            type: http
            url: https://example/mcp
            headers:
              - { name: Authorization, value: "Bearer ${SUPABASE_SERVICE_ROLE_KEY}" }
```

Rules enforced at axp validate time:
- Every stdio env[*].from and every ${NAME} placeholder in a header value must reference a name visible to the variant (experiment-global secrets or a secret declared on a setup the variant resolves).
- Bare $NAME (no curlies) is treated as a literal; only ${NAME} is a placeholder. Unterminated ${ and empty ${} are rejected.
- HTTP header names are unique per server, case-insensitively.
- command / args / env are only valid for type: stdio; url / headers are only valid for type: http / sse.

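The placeholder rules can be illustrated with an sse entry. This is a sketch: the server name, URL, and header names are hypothetical, and it assumes SUPABASE_SERVICE_ROLE_KEY is declared as a secret visible to the variant.

```yaml
secrets:
  - SUPABASE_SERVICE_ROLE_KEY
environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: events
            type: sse
            url: https://example/sse
            headers:
              # ${NAME} is a placeholder, resolved from the declared secret at runtime
              - { name: X-Api-Key, value: "${SUPABASE_SERVICE_ROLE_KEY}" }
              # bare $USD has no curlies, so it is passed through as a literal
              - { name: X-Note, value: "budget in $USD" }
```

A second header named x-api-key on the same server would be rejected, because header names are compared case-insensitively.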
Leak surface. Resolved secret values are written verbatim into the session/new JSON-RPC frame, which is teed to agent-events.jsonl before any redaction. They will appear in that artifact and in any debug bundle. Treat artifacts as sensitive whenever an experiment forwards secrets to MCP servers.
Validation rules
AXP validates experiment YAML when loading it. Validation runs in layers: first the versioned JSON Schema validates the parsed data model, then Rust semantic validation checks cross-field invariants, and finally the experiment is resolved to confirm the variant set is well-formed. An experiment is invalid if:
- the YAML contains a field not defined by the schema
- schema_version is anything other than 2
- id, axis ids (prompt id, environment / product name, setup name, extension id), or test names are not kebab-case
- an agent's model id contains ::
- a declared axis (agents, prompts, environments, products) is present but empty, or has a duplicate id / name
- the experiment resolves to zero variants: agents and prompts are required, so each must be supplied at the top level or via an extension
- a resolved variant has an empty prompt (no top-level prompt and no extension supplying one)
- two resolved variants collide on variant_id (make extension ids unique within each list)
- both tests.application and tests.introspection are empty, or a test name is duplicated
- a secret name is invalid, duplicated, or reserved by the harness
- a files entry omits both source and name, has a bad name / source / sha256, or a dest that is absolute, contains .. / . / .axp-bridge components or ::, or duplicates another dest in scope
- an MCP server references a secret not visible to the variant, mixes stdio-only and endpoint-only fields, or has duplicate env / header names or a malformed ${...} placeholder
- any limit is not greater than zero

YAML syntax boundaries
The experiment data model is JSON-compatible even though the authoring file is YAML.
- YAML comments are allowed.
- YAML anchors and aliases are allowed when the resolved value is JSON-compatible (handy for keeping repeated setup or prompt blocks DRY).
- Custom YAML tags are unsupported.
- Non-string mapping keys are unsupported.
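As an example of the anchor and alias allowance, the sketch below shares one setup script between two products. The anchor name and product names are illustrative; the alias resolves to a plain string, so the resolved value stays JSON-compatible.

```yaml
products:
  - name: cli-a
    type: CLI
    setup: &stage-fixture |
      printf 'event,n\na,1\n' > /workspace/events.csv
  - name: cli-b
    type: CLI
    setup: *stage-fixture   # same script text as cli-a, still run separately in each variant
```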