Experiment YAML

Reference for the AXP experiment YAML schema (v2) — axes, agents and models, setups, extensions, tests, limits, secrets, files, MCP servers, and validation rules.

Experiment YAML is the authoring format passed to axp run and axp local run. The current parser accepts schema_version: 2 and rejects unknown fields at every level of the document.

v2 replaces v1's hand-assembled matrix.variant[] with independent axes (agents, prompts, and optional environments / products) that the harness cross-multiplies into variants automatically, plus a recursive extensions mechanism for the cases a clean cross product can't express. Only agent and prompt are required — every resolved variant must carry both — and each may be supplied at the top level or inside an extension. The model is not its own axis: it rides with the agent (agents[].model), so an agent and its model are chosen together.

Schema URL

Use the schema URL in authored experiment files for editor validation:

# yaml-language-server: $schema=https://docs.514.ai/schema/experiment.v2.schema.yaml

Published schema URLs:

  • Latest supported schema: https://docs.514.ai/schema/experiment.schema.yaml
  • Version-pinned v2 schema: https://docs.514.ai/schema/experiment.v2.schema.yaml

axp schema prints the latest supported schema to stdout.

A complete v2 experiment file, for reference:

# yaml-language-server: $schema=https://docs.514.ai/schema/experiment.v2.schema.yaml
schema_version: 2
id: clickhouse-cli-or-not
name: "Should agents use the ClickHouse CLI?"
description: |
  Context used when analyzing which arm produced the better outcome.

agents:
  - name: claude
    model: claude-sonnet-4-6

prompts:
  - id: analysis
    prompt: |
      Read /workspace/task.md and write the result to /workspace/report.json.

environments:
  - name: workspace
    setup: |
      # stage the shared fixture both arms use
      printf 'event,n\na,1\n' > /workspace/events.csv

# Two arms that differ in product setup + a prompt suffix — expressed as extensions.
extensions:
  - id: with-cli
    products:
      - name: clickhouse
        type: CLI
        setup: curl -fsSL https://clickhouse.com/ | sh
    prompts: ["Use the ClickHouse CLI for the analysis."]
  - id: without-cli
    products:
      - name: baseline
        setup: "true"
    prompts: ["Do not use the ClickHouse CLI; use another local method."]

tests:
  application:
    - name: report-exists
      script: |
        test -f /workspace/report.json

limits:
  max_turns: 25
  max_time_seconds: 300
  max_cost_usd: 0.50

Top-level fields

| Field | Requirement |
| --- | --- |
| schema_version | Required. Must be 2. |
| id | Required. Experiment identifier. Must be kebab-case matching ^[a-z0-9][a-z0-9-]*$. |
| name | Required. Human-readable experiment name. Must not be empty after trimming whitespace. |
| description | Optional. Human-readable context AXP can use when analyzing the outcome. If present, must not be empty after trimming. |
| agents | Agent axis. One agent or a list. Each agent is a name (claude / codex / cursor) or a {name, model?} object whose model is a bare model id or a model object. A bare agent name uses the adapter's provider-default model. Optional at the top level (an extension may supply it), but every resolved variant must have an agent. |
| prompts | Prompt axis. One prompt string, a list of strings, or a list of {id, prompt, tags?} objects. A bare string gets a positional id (p0, p1, …). Optional at the top level (an extension may supply it). |
| environments | Optional environment axis. An object {name, setup, version?, commit?, tags?}, a list, or a bare string (sugar for an environment whose setup is that string, positional name e{idx}). setup is bash or setup object(s). Absent → the variant has no environment/setup unless an extension adds one. |
| products | Optional product axis, the target system under test: {name, type?, setup, version?, commit?, tags?}, like an environment with an optional type and version/commit. A bare string is sugar for its setup (positional name pr{idx}). Cross-multiplied when present. |
| extensions | Optional. Recursive refinements applied instead of the bare base cross product (narrow / redirect / compose). See Extensions. |
| secrets | Optional. Host environment variable names injected into every variant. The YAML stores names only, never values. Global to the experiment. |
| files | Optional. Local host files / directories staged into every variant's /workspace before setup runs. Global. See Files. |
| tests | Required. Contains application and introspection lists. At least one test must exist across both lists. |
| limits | Required. Hard stops for the run. |

secrets and files are global to the experiment — they apply to every variant. MCP servers and setup checks are not top-level in v2: they belong to setup objects, so a variant exposes the MCP servers and runs the setup checks of the setups its environment and products reference.
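
As a sketch of the shorthand forms listed in the table above, the three fragments below show the same prompts axis written as a string, a list of strings, and a list of objects. They are separated by --- purely for illustration (an experiment file contains a single prompts key), and the ids refactor and add-tests are hypothetical.

```yaml
# Form 1: a single string; per the table it gets the positional id p0.
prompts: "Refactor the parser."
---
# Form 2: a list of strings; positional ids p0 and p1.
prompts:
  - "Refactor the parser."
  - "Add unit tests for the parser."
---
# Form 3: explicit objects with author-chosen ids (hypothetical names).
prompts:
  - id: refactor
    prompt: "Refactor the parser."
  - id: add-tests
    prompt: "Add unit tests for the parser."
```

Explicit ids are probably the better choice once results are grouped by prompt, since positional ids shift when the list is reordered.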

Axes and variants

The harness cross-multiplies the axes into variants automatically:

variants = agents × prompts × environments × products

environments and products are optional coordinates (when absent, the variant simply has no environment/product). Each resolved variant gets a derived variant_id of the form

<agent>[__<model>][__<effort>][__<context_window_size>][__thinking][__fast]__<prompt_id>[__<environment_id>][__<product_id>][__<extension_path…>]

and a human-readable tag whose parts are joined with " · ". The coordinate ids (agent, model and its controls, prompt, environment, product) are recorded so results can be grouped and filtered along them.
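
A minimal sketch of the cross product, assuming two agents, two prompts, and one environment (all names are illustrative). The variant_id values in the comments are worked out by hand from the variant_id template, not captured from a real run.

```yaml
agents:
  - claude                 # bare name: no model segment in the variant_id
  - name: codex
    model: gpt-5
prompts:
  - id: fix-bug
    prompt: "Fix the failing test in /workspace."
  - id: add-docs
    prompt: "Document the public API in /workspace."
environments:
  - name: workspace
    setup: "true"
# 2 agents x 2 prompts x 1 environment = 4 variants, expected to be:
#   claude__fix-bug__workspace
#   claude__add-docs__workspace
#   codex__gpt-5__fix-bug__workspace
#   codex__gpt-5__add-docs__workspace
```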

Agents and models

Each agents entry pairs a coding agent with the model it runs. The model is not a separate axis — it is chosen together with the agent:

  • A bare agent name (claude) uses the adapter's provider-default model; the resolved model coordinate is empty and no model segment appears in the variant_id.
  • An agent object carries {name, model?}. model is either a bare model id (claude-opus-4-8) or a model object with optional controls:

| Model field | Meaning |
| --- | --- |
| name | Required (when model is an object). The model id, e.g. claude-opus-4-8, gpt-5. Wired to the agent as MODEL. |
| effort | Optional reasoning-effort control: one of low / medium / high / x-high / max. Wired as LEVEL_OF_EFFORT; each adapter honours what it supports. |
| context_window_size | Optional context-window size hint (a string, e.g. 1M). Wired as CONTEXT_WINDOW; informational for fixed-context hosted models. |
| thinking | Optional boolean. Enable thinking mode if the model supports it. Wired as THINKING; adds a thinking segment to the variant_id when true. |
| fast | Optional boolean. Enable fast mode if the model supports it. Wired as FAST; adds a fast segment to the variant_id when true. |

Each control is single-valued (it does not sweep into multiple variants); to compare two models or two effort levels, list two agent entries.

agents:
  - name: claude
    model:
      name: claude-opus-4-8
      effort: high
      context_window_size: 1M
      thinking: true
  - name: codex
    model: gpt-5            # bare model id, provider defaults for the rest
prompts: "Refactor the parser."
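
To compare effort levels rather than models, this section suggests listing two agent entries. A sketch of that, assuming two entries with the same agent name are accepted when their model controls differ (the source does not state this explicitly):

```yaml
agents:
  - name: claude
    model:
      name: claude-opus-4-8
      effort: low
  - name: claude
    model:
      name: claude-opus-4-8
      effort: high
prompts: "Refactor the parser."
# Expected variant_id prefixes, following the variant_id template:
#   claude__claude-opus-4-8__low
#   claude__claude-opus-4-8__high
```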

Product type

A product may declare a type describing the surface under test. It defaults to Other and is recorded as a result dimension. Allowed values: CLI, MCP, API, Skill, SDK, Schema, Docs, Marketing, Agents.md, Other.
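
For illustration, a products axis mixing typed and untyped entries might look like the fragment below. The product names are hypothetical; only the type values come from the allowed list.

```yaml
products:
  - name: clickhouse-cli
    type: CLI
    setup: curl -fsSL https://clickhouse.com/ | sh
  - name: plain-docs
    type: Docs
    setup: "true"
  - name: untyped          # no type given, so it defaults to Other
    setup: "true"
```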

Variant isolation

Each variant runs in its own sandbox with its own /workspace, and variants execute in parallel. Nothing crosses the boundary between variants: there is no shared filesystem, no shared environment, no implicit ordering. A file written by one variant's setup is invisible to every other variant.

Do:

  • Produce any fixture state inside this variant's own setup (clone, build, scaffold — whatever this variant needs).
  • Fetch external artifacts (registries, S3, git remotes) from setup if they need to come from outside the harness.

Don't:

  • Wait on a path under /workspace/... expecting another variant to create it — it never will, and the wait loop will hang until the setup exec timeout fires and fails the variant.
  • Assume any side effect from another variant's setup, setup_checks, or test scripts is visible here.

If you find yourself wanting to share state across variants, bake it into the base image, materialise it in each variant's setup independently, or restructure the experiment so each variant is genuinely self-contained.
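
A sketch of a self-contained setup, where every variant produces its own fixture instead of waiting on another variant. The repository URL and the make target are hypothetical, and the fragment assumes git and make exist in the sandbox image.

```yaml
environments:
  - name: repo-checkout
    setup: |
      # Each variant clones and builds its own copy; nothing is shared
      # between sandboxes, so there is no path to wait on.
      git clone --depth 1 https://example.com/acme/fixture-repo.git /workspace/repo
      cd /workspace/repo && make build
```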

Setups

An environment's or product's setup is a bash string, a list of strings, or one or more setup objects. A setup object owns the per-variant resources around its script: files, secrets, MCP servers, and setup checks. A bare string is sugar for a setup object whose script is that string (with a positional name, s{idx}).

| Setup field | Requirement |
| --- | --- |
| name | Required for the object form. Kebab-case; recorded as a result dimension. A bare-string setup gets a positional name. |
| script | Required. Bash run in the variant sandbox before setup_checks and the agent. |
| description | Optional. |
| tags | Optional. |
| files | Optional. Host files staged into /workspace before this setup's script runs (same shape as top-level files). |
| secrets | Optional. Host env-var names injected for setup_checks and the agent of variants that resolve this setup. |
| mcp_servers | Optional. MCP servers exposed to the agent for variants that resolve this setup. |
| setup_checks | Optional. Setup checks run after this setup's script but before the agent. |

A variant builds its merged setup from the setups referenced by its product, then those referenced by its environment, in declaration order. The scripts run sequentially, and the mcp_servers / setup_checks / secrets / files of those setups are unioned (experiment-global secrets / files are unioned in too).

environments:
  - name: workspace
    setup:
      - name: stage-fixture
        script: printf 'event,n\na,1\n' > /workspace/events.csv
        secrets: [GITHUB_TOKEN]
        mcp_servers:
          - name: fixture-sentinel
            type: stdio
            command: /workspace/fixture-mcp.py
        setup_checks:
          - name: fixture-present
            script: test -f /workspace/events.csv
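
The three setup forms described in this section can be compared side by side. This is a sketch with hypothetical environment names; the positional names in the comments (s0, s1) are inferred from the s{idx} pattern rather than taken from real output.

```yaml
environments:
  # Bare string: sugar for one setup object with a positional name.
  - name: string-form
    setup: printf 'event,n\na,1\n' > /workspace/events.csv
  # List of strings: presumably one positional setup per entry (s0, s1).
  - name: list-form
    setup:
      - mkdir -p /workspace/data
      - printf 'event,n\na,1\n' > /workspace/data/events.csv
  # Object form: an explicit name, with room for files, secrets,
  # mcp_servers, and setup_checks alongside the script.
  - name: object-form
    setup:
      - name: stage-fixture
        script: printf 'event,n\na,1\n' > /workspace/events.csv
```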

Extensions

When a clean cross product can't express the variants you want (co-varying a product with a prompt, or adding a new dimension only to some arms), use a top-level extensions list. Each extension is a node in a recursive tree. Walking from root to leaf, each extension:

  • redirects axes — an extension's agents / environments / products, when present, replace that axis for its subtree (narrow to a subset, or point at a different set). This is how a required axis can be supplied only in an extension.
  • contributes a prompt — the extension's prompts text is appended to every inherited base prompt (joined by a blank line); the base prompt_id is kept. When there is no top-level prompt, the extension's prompt stands alone as the variant's prompt under a synthesized prompt_id (p0).
  • unions tags.

A leaf extension (one with no nested extensions) emits the cross product of its accumulated axes. When an experiment declares any extensions, only extension-derived variants are emitted — the bare base cross product is not. Each extension-derived variant records a resolved_extend_id (the extension id-path joined by ::, e.g. sec-edgar::aapl), and its variant_id appends that path with __.
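
As a sketch of supplying the required axes only inside extensions, the fragment below has no top-level agents or prompts. The extension ids are hypothetical, and the ids in the comments are derived by hand from the rules above.

```yaml
# No top-level agents or prompts: each extension supplies both.
extensions:
  - id: claude-arm
    agents: [claude]
    prompts: ["Summarize /workspace/events.csv into /workspace/report.json."]
  - id: codex-arm
    agents: [codex]
    prompts: ["Summarize /workspace/events.csv into /workspace/report.json."]
# With no inherited base prompt, each extension prompt stands alone
# under the synthesized prompt_id p0. Expected variant ids:
#   claude__p0__claude-arm
#   codex__p0__codex-arm
```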

| Field | Requirement |
| --- | --- |
| id | Required. Kebab-case. Unique among siblings within the same extensions list. |
| description | Optional. Must not be empty when present. |
| tags | Optional. Unioned into the resolved variant's tags. |
| agents / environments / products | Optional. When present, redirect that axis for the subtree. |
| prompts | Optional. Contributes suffix text appended to the inherited prompt (it does not replace the prompt axis). |
| extensions | Optional. Nested extensions compose recursively. |

# One shared task, two products, each with its own prompt suffix.
prompts:
  - id: analyze
    prompt: "Read /workspace/task.md and write /workspace/report.json."
extensions:
  - id: with-cli
    products: [{ name: cli, type: CLI, setup: "curl …/clickhouse | sh" }]
    prompts: ["Use the ClickHouse CLI for the analysis."]
  - id: without-cli
    products: [{ name: no-cli, setup: "true" }]
    prompts: ["Do not use the ClickHouse CLI; use another local method."]
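
Nested extensions can be sketched the same way. The ids sec-edgar and aapl echo the resolved_extend_id example in this section; the msft arm, the product name, and the prompt texts are hypothetical. The fragment assumes prompt suffixes accumulate down the tree, so a leaf variant's prompt is the base prompt plus each ancestor's suffix.

```yaml
prompts:
  - id: analyze
    prompt: "Read /workspace/task.md and write /workspace/report.json."
extensions:
  - id: sec-edgar
    products: [{ name: edgar, type: API, setup: "true" }]
    prompts: ["Use SEC EDGAR filings as the data source."]
    extensions:
      - id: aapl           # resolved_extend_id: sec-edgar::aapl
        prompts: ["Analyze AAPL."]
      - id: msft           # resolved_extend_id: sec-edgar::msft
        prompts: ["Analyze MSFT."]
# Only the two leaves emit variants; sec-edgar itself is not a leaf.
```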

Tests

tests.application checks the resulting application state, such as files, commands, or endpoints.

tests.introspection is a separate test kind for trace-oriented checks (for example, inspecting the agent trace at $AXP_TRACE_PATH). Introspection test logs are written separately from application test logs.

Each test has:

| Field | Requirement |
| --- | --- |
| name | Required. Must be kebab-case matching ^[a-z0-9][a-z0-9-]*$. |
| script | Required. Shell script to execute. Must not be empty after trimming whitespace. |

At least one test is required across application and introspection. Test names must be globally unique across both kinds. Test scripts are streamed over stdin, so the agent never sees them.
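
A sketch combining both test kinds. The test names are hypothetical, the second application test assumes python3 is available in the sandbox, and the introspection test checks only that a trace file exists and is non-empty, since the trace format is not described here.

```yaml
tests:
  application:
    - name: report-exists
      script: |
        test -f /workspace/report.json
    - name: report-is-json
      script: |
        # Assumes python3 is installed in the sandbox image.
        python3 -m json.tool /workspace/report.json > /dev/null
  introspection:
    - name: trace-not-empty
      script: |
        # Makes no assumption about the trace format.
        test -s "$AXP_TRACE_PATH"
```

Names stay unique across both lists, as required above.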

Setup checks

setup_checks are bash preflight checks that run inside the variant sandbox after the setup script but before the agent, to fail fast on a broken environment instead of burning agent tokens. They are declared on a setup object (not at the top level). Each is a {name, script} entry; they run in declaration order and the first non-zero exit short-circuits the variant (exit_reason: setup_check_failed). Unlike the setup script, setup checks run with the variant's resolved secrets injected.
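
As an illustration of ordering and secret injection, the setup below declares two checks. The check names are hypothetical; the second check relies on the statement above that setup checks run with resolved secrets injected.

```yaml
environments:
  - name: workspace
    setup:
      - name: stage-fixture
        script: printf 'event,n\na,1\n' > /workspace/events.csv
        secrets: [GITHUB_TOKEN]
        setup_checks:
          - name: fixture-present
            script: test -s /workspace/events.csv
          - name: token-injected     # reached only if fixture-present exits 0
            script: test -n "$GITHUB_TOKEN"
```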

Limits

| Field | Requirement |
| --- | --- |
| limits.max_turns | Required. Must be greater than 0. Passed to the agent as its turn cap. |
| limits.max_time_seconds | Required. Must be greater than 0. Enforced by the harness as the wall-clock timeout for the agent invocation. |
| limits.max_cost_usd | Required. Must be greater than 0. Parsed and persisted in resolved variant YAML; this build does not abort a run when reported cost reaches the value. |

Secrets

Declare host environment variable names with top-level secrets (global to the experiment). The YAML stores names only, never values. A setup object may declare additional secrets, which are unioned into the variants that resolve it.

secrets:
  - GITHUB_TOKEN
  - DATABASE_URL

Secret names must match ^[A-Z_][A-Z0-9_]*$. Lowercase names, names beginning with a digit, names containing hyphens or spaces, and empty names are rejected. Duplicates within the list are rejected.

Reserved names cannot be declared as experiment secrets:

  • ANTHROPIC_API_KEY
  • ANTHROPIC_BASE_URL
  • OPENAI_API_KEY
  • OPENAI_BASE_URL
  • CURSOR_API_KEY
  • MODEL
  • MAX_TURNS
  • IS_SANDBOX
  • TRACEPARENT
  • any name beginning with AXP_, CLAUDE_CODE_, CODEX_, CURSOR_, or OTEL_
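
The naming and reservation rules can be summarized with a few examples. The accepted names are hypothetical; the rejected ones are listed in comments with the rule each would break.

```yaml
secrets:
  - GITHUB_TOKEN         # accepted
  - _INTERNAL_KEY_2      # accepted: leading underscore, digits after the first character
# Each of the following would be rejected if listed:
#   github_token         lowercase
#   2FA_SEED             begins with a digit
#   MY-SERVICE-KEY       contains a hyphen
#   OPENAI_API_KEY       reserved name
#   AXP_DEBUG            reserved prefix AXP_
```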

Secrets are injected only by local runs (axp local run, via --env / --env-file). The platform path (axp run) does not yet deliver secrets and rejects, at submit time, any experiment that declares them. See Secrets and auth.

Files

Stage local host files or directories into every variant's /workspace before setup runs — the path for local, uncommitted builds (CLIs, MCP servers, fixtures) to enter a sandbox. Top-level files is global to the experiment; a setup object may also declare files scoped to the variants that resolve it.

files:
  - name: my-cli            # kebab-case handle; required when source is omitted
    source: ../build/mycli  # host path; relative paths resolve against this YAML's directory
    dest: tools/mycli       # workspace-relative destination
  - source: https://example.com/fixtures/data.bin  # or an http(s) URL to a public artifact
    sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  # optional integrity check
    dest: fixtures/data.bin

Rules and behaviour:

  • A directory source extracts its contents under dest/; a file source lands as a single file at dest. No bind mounts are involved — files are copied into the sandbox before it starts working.
  • A source may be an http:// / https:// URL to a publicly fetchable artifact (no authentication or custom headers). The runner downloads it on the host before staging, and it lands as a single file at dest — an archive is not auto-extracted, so unpack it in setup if you need its contents. Other URL schemes (ftp://, file://, …) are rejected at validate time.
  • An optional sha256 (64 hex characters) verifies the staged bytes against the digest and fails staging on mismatch — for a local file source or a URL download alike. It is not valid for a directory source. On the CLI, --file-sha256 NAME=HASH attaches or overrides a named entry's digest; ad-hoc --file SOURCE::DEST entries have no name, so they can't carry one.
  • Sandboxes are Linux containers. A macOS-native binary staged from your laptop will not execute there — cross-compile WIP binaries to a Linux target before staging them.
  • dest must be workspace-relative: absolute paths, .., .axp-bridge components, and :: are rejected at validate time. A duplicate dest within a scope is rejected.
  • An entry may omit source, in which case it must have a name and the source is bound at run time: --file NAME=SOURCE where SOURCE is a host path or an http(s) URL. The same flag overrides the source of any named entry. --file SOURCE::DEST stages an ad-hoc entry into every variant.
  • Directory walks honor .axpignore (gitignore syntax). The real .gitignore is deliberately not consulted — build outputs are usually gitignored, and staging a local build is the point. .git/ is always excluded; symlinks inside the walk are skipped and counted, while a source that is itself a symlink is followed. Owner-executable files stay executable in the sandbox.
  • Missing or unbound sources abort the run at preflight, listing every problem at once — but only for variants actually scheduled to run. --dry-run instead renders the staging table with MISSING / UNBOUND annotations and never fails on them.
  • A staging failure mid-run rolls the variant up as status=error / exit_reason=staging_failed; the rest of the run keeps going. Per-entry records of what actually staged land in staging.json inside the variant artifact directory.
  • Platform runs stage files automatically. axp run builds one deterministic tar per entry locally, uploads only the digests the platform doesn't already have (content-addressed), and the in-sandbox bridge downloads, digest-verifies, and extracts each entry before setup. Staging works the same on axp run and axp local run.
  • Sources are read with the invoking user's permissions and may point anywhere on the host. Treat an experiment YAML like a script you run: review files: entries from untrusted sources before running.
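
A sketch tying several of these rules together: a run-time-bound entry, a directory source, and a downloaded archive that setup unpacks. The paths and the URL are hypothetical.

```yaml
files:
  - name: my-cli           # no source: bind at run time with --file my-cli=SOURCE
    dest: tools/mycli
  - source: ./fixtures     # directory: contents are extracted under fixtures/
    dest: fixtures
  - source: https://example.com/fixtures/bundle.tar.gz   # hypothetical URL
    dest: vendor/bundle.tar.gz

environments:
  - name: workspace
    setup: |
      # URL downloads land as a single file, so unpack the archive here.
      mkdir -p /workspace/vendor/bundle
      tar -xzf /workspace/vendor/bundle.tar.gz -C /workspace/vendor/bundle
```

Leaving my-cli unbound would abort the run at preflight, or show as UNBOUND under --dry-run, per the rules above.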

MCP servers

Experiments can expose Model Context Protocol (MCP) servers to the coding agent. They are declared on a setup object (not at the top level), so a variant gets the MCP servers of the setups its environment and products reference. AXP passes the list through ACP session/new after setup has run, so stdio commands may point at files produced by setup. Names must be unique across the servers a variant resolves.

Every MCP server entry is tagged by transport via type:

environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: fixture-sentinel
            type: stdio
            command: /workspace/fixture-mcp.py
            args: []
          - name: axp
            type: http
            url: http://localhost:3001/mcp

Three transports are supported: stdio (command, optional args), http (url), and sse (url).
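
The sse transport takes the same shape as http. A minimal sketch, with a hypothetical server name and endpoint:

```yaml
environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: events-stream              # hypothetical server name
            type: sse
            url: http://localhost:3002/sse   # hypothetical endpoint
```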

Forwarding secrets to MCP servers

Stdio entries may declare an env list and HTTP/SSE entries may declare headers. Both reference experiment-declared secrets by name; values are resolved at runtime and never appear in the experiment YAML.

secrets:
  - GITHUB_TOKEN
  - SUPABASE_SERVICE_ROLE_KEY

environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: gh-mcp
            type: stdio
            command: /workspace/gh-mcp
            env:
              - GITHUB_TOKEN                            # sugar for {name: GITHUB_TOKEN, from: GITHUB_TOKEN}
              - { name: GH_AUTH, from: GITHUB_TOKEN }   # rename on the way in
          - name: supabase-mcp
            type: http
            url: https://example/mcp
            headers:
              - { name: Authorization, value: "Bearer ${SUPABASE_SERVICE_ROLE_KEY}" }

Rules enforced at axp validate time:

  • Every stdio env[*].from and every ${NAME} placeholder in a header value must reference a name visible to the variant (experiment-global secrets or a secret declared on a setup the variant resolves).
  • Bare $NAME (no curlies) is treated as a literal; only ${NAME} is a placeholder. Unterminated ${ and empty ${} are rejected.
  • HTTP header names are unique per server, case-insensitively.
  • command / args / env are only valid for type: stdio; url / headers are only valid for type: http / sse.
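
The placeholder rules above can be illustrated with one server entry. The secret, server, and header names are hypothetical; the commented-out headers show forms that would presumably fail axp validate.

```yaml
secrets:
  - API_TOKEN
environments:
  - name: workspace
    setup:
      - name: tools
        script: "true"
        mcp_servers:
          - name: remote-mcp
            type: http
            url: https://example/mcp
            headers:
              - { name: Authorization, value: "Bearer ${API_TOKEN}" }  # placeholder, resolved at runtime
              - { name: X-Note, value: "costs $API_TOKEN" }            # bare $NAME is sent literally
              # Rejected at validate time if uncommented:
              # - { name: authorization, value: "x" }           # duplicates Authorization case-insensitively
              # - { name: X-Bad, value: "Bearer ${API_TOKEN" }  # unterminated ${
              # - { name: X-Empty, value: "${}" }               # empty placeholder
```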

Leak surface. Resolved secret values are written verbatim into the session/new JSON-RPC frame, which is teed to agent-events.jsonl before any redaction. They will appear in that artifact and in any debug bundle. Treat artifacts as sensitive whenever an experiment forwards secrets to MCP servers.

Validation rules

AXP validates experiment YAML when loading it. Validation runs in layers: first the versioned JSON Schema validates the parsed data model, then Rust semantic validation checks cross-field invariants, and finally the experiment is resolved to confirm the variant set is well-formed. An experiment is invalid if:

  • the YAML contains a field not defined by the schema
  • schema_version is anything other than 2
  • id, axis ids (prompt id, environment / product name, setup name, extension id), or test names are not kebab-case
  • an agent's model id contains ::
  • a declared axis (agents, prompts, environments, products) is present but empty, or has a duplicate id / name
  • the experiment resolves to zero variants (agents and prompts are required, so each must be supplied at the top level or via an extension)
  • a resolved variant has an empty prompt (no top-level prompt and no extension supplying one)
  • two resolved variants collide on variant_id (make extension ids unique within each list)
  • both tests.application and tests.introspection are empty, or a test name is duplicated
  • a secret name is invalid, duplicated, or reserved by the harness
  • a files entry omits both source and name, has a bad name / source / sha256, or a dest that is absolute, contains .. / . / .axp-bridge components or ::, or duplicates another dest in scope
  • an MCP server references a secret not visible to the variant, mixes stdio-only and endpoint-only fields, or has duplicate env / header names or a malformed ${...} placeholder
  • any limit is not greater than zero
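
For contrast with the failure cases above, here is a sketch of what is probably close to the smallest file that passes every rule. The id, name, prompt, and test are hypothetical.

```yaml
schema_version: 2
id: minimal-example
name: "Minimal example"
agents: claude             # a single bare agent name
prompts: "Write the word ok to /workspace/out.txt."
tests:
  application:
    - name: out-exists
      script: test -f /workspace/out.txt
limits:
  max_turns: 5
  max_time_seconds: 120
  max_cost_usd: 0.25
```

It resolves to one variant: one agent, one prompt, no environment or product.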

YAML syntax boundaries

The experiment data model is JSON-compatible even though the authoring file is YAML.

  • YAML comments are allowed.
  • YAML anchors and aliases are allowed when the resolved value is JSON-compatible (handy for keeping repeated setup or prompt blocks DRY).
  • Custom YAML tags are unsupported.
  • Non-string mapping keys are unsupported.
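
A short sketch of the anchor and alias case. The anchor sits on the first use of the value, since an extra top-level key holding shared definitions would be rejected as an unknown field. The environment names are hypothetical.

```yaml
environments:
  - name: env-a
    setup: &stage-fixture |
      printf 'event,n\na,1\n' > /workspace/events.csv
  - name: env-b
    setup: *stage-fixture    # alias resolves to the same plain string
```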