AXI: Agent eXperience Interface

10 design principles for building agent-ergonomic CLI tools. 100% task success at the lowest cost—beating both MCP and raw CLI.

Success Rate / Avg Cost per Task
  gh-cli          86%   $0.054
  gh-mcp          87%   $0.148
  gh-mcp-search   82%   $0.147
  gh-mcp-code     84%   $0.101
  gh-axi         100%   $0.050

Try it now

gh-axi is a wrapper around the gh CLI, optimized for agent ergonomics.

npm install -g gh-axi

Add to your CLAUDE.md or AGENTS.md:

Use `gh-axi` (replacement for `gh` CLI) for all GitHub operations.

Build your own AXI

Install the AXI skill to get the design guidelines and scaffolding:

npx skills add kunchenguid/axi

1. Introduction

AI agents interact with external services through two dominant paradigms. The first is shell-based CLI execution: the agent runs commands like gh issue list and parses the text output. The second is structured tool protocols like MCP (Model Context Protocol), where the agent invokes typed tool functions through the hosting framework’s native tool-calling interface.

Both approaches have significant problems for agents:

  • CLI: verbose, metadata-sparse output. Human-oriented CLIs produce tab-separated tables, ANSI formatting, and per-item descriptions that waste token budget. Worse, they omit critical metadata—a label list returns 30 items with no total count, and the agent reports 30 when the actual answer is 126.
  • MCP: massive schema overhead. MCP tool definitions consume 137K–176K input tokens per task in our benchmark, compared to 46–47K for CLI conditions. This 3–4× context inflation translates directly to 2–3× higher cost per task ($0.10–0.15 vs. $0.05).
  • Both: poor discoverability. CLI agents must guess subcommands or read --help. MCP agents with lazy loading waste turns on tool discovery. Neither provides in-context guidance on what to do next.

Recent debate has framed this as “MCP vs. CLI” [2], but the real question is neither which protocol nor which transport, but rather what design principles make an agent-tool interface effective.

AXI is a set of 10 principles for agent-ergonomic CLI design that treats token budget as a first-class constraint. It achieves the reliability advantages MCP promises (structured output, discoverability) at the cost profile of a plain CLI.

Contributions:

  1. Ten design principles for agent-ergonomic CLI tools, organized into four categories.
  2. A reference implementation (gh-axi) demonstrating the principles on the GitHub platform.
  3. A 425-run benchmark comparing five agent-tool interface conditions, with trajectory-level analysis showing where and why each interface fails.

2. Related Work

Model Context Protocol (MCP)

Anthropic’s Model Context Protocol [1] provides a standardized way to connect AI models to external tools through structured, typed tool definitions. While MCP offers type-safe schemas and discoverable tool catalogs, the schema overhead is substantial. The evaluation below quantifies this cost: MCP tool definitions consume 3–4× more context tokens than equivalent CLI conditions, translating to 2–3× higher cost.

MCP vs. CLI benchmarks

Mao and Pradhan [2] benchmarked 756 runs comparing raw APIs, CLI, and native MCP across GitHub, Linear, and Singapore Bus APIs. They found native MCP achieved 91.7% success vs. 83.3% for CLI, with CLI using 2.9× more billed tokens. However, their CLI was auto-generated from API specs, not hand-designed for agents. They note: “What this does not settle is whether a hand-crafted, agent-first CLI could close the remaining gap.” AXI answers this question directly—an agent-first CLI not only closes the gap but surpasses MCP on every metric.

Code execution with MCP

Jones and Kelly [3] identified two scaling problems with direct MCP tool calls: tool definitions overload the context window, and intermediate results consume additional tokens as they pass through the model between calls. They proposed presenting MCP tools as a code API that the agent programs against, so that data flows through the execution environment rather than the context window. Varda and Pai [4] independently arrived at the same insight, coining the term “Code Mode” and reporting that LLMs handle complex, multi-step tool interactions better when writing TypeScript than when making direct tool calls. The mcp-with-code-mode condition in this benchmark evaluates this approach: it is the cheapest MCP condition ($0.101/task vs. $0.147–0.148), confirming that code execution reduces MCP overhead, but it remains 2× more expensive than AXI and introduces new failure modes from runtime errors in generated code.

3. The 10 Principles

  1. Token-efficient output — Use TOON format for ~40% token savings over JSON
  2. Minimal default schemas — 3–4 fields per list item, not 10+
  3. Content truncation — Truncate large text fields with size hints and escape hatches
  4. Content first — Prefer showing actual data, not a wall of help text
  5. Contextual disclosure — Append relevant next-step commands after output, not all upfront
  6. Consistent way to get help — Concise per-subcommand reference for when agents need it
  7. Pre-computed fields — Include aggregated statuses that eliminate round trips
  8. Definitive empty states — Explicit “0 results” rather than ambiguous empty output
  9. Graceful error handling — Idempotent mutations, structured errors, no interactive prompts
  10. Output discipline — stdout for data, stderr for debug; clean exit codes

Output Efficiency

1. Token-efficient output

Use TOON (Token-Optimized Object Notation) format instead of JSON or tab-separated tables. TOON omits braces, quotes, and commas, yielding approximately 40% token savings over equivalent JSON while remaining unambiguous to LLMs.

Conventional (JSON)
[{"number":42,"title":"Fix login bug","state":"open",
  "author":"alice","labels":["bug","P1"]},
 {"number":43,"title":"Add dark mode","state":"open",
  "author":"bob","labels":["feature"]}]
AXI (TOON)
issues[2]{number,title,state}:
  42,Fix login bug,open
  43,Add dark mode,open
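As an illustration, the header-plus-rows shape above can be produced in a few lines. This is a hedged sketch, not the real TOON specification: the flat-record restriction and the comma-quoting rule are assumptions.

```python
def to_toon(name, items, fields):
    """Render a list of dicts as a TOON-style block: a header with the
    item count and field names, then one comma-separated row per item."""
    lines = [f"{name}[{len(items)}]{{{','.join(fields)}}}:"]
    for item in items:
        row = []
        for f in fields:
            v = str(item[f])
            # Quote values containing the delimiter (an assumed rule).
            row.append(f'"{v}"' if "," in v else v)
        lines.append("  " + ",".join(row))
    return "\n".join(lines)

issues = [
    {"number": 42, "title": "Fix login bug", "state": "open"},
    {"number": 43, "title": "Add dark mode", "state": "open"},
]
print(to_toon("issues", issues, ["number", "title", "state"]))
```

The savings come from emitting field names once in the header rather than repeating keys, braces, and quotes on every record.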
2. Minimal default schemas

Return 3–4 fields per list item by default, not 10+. Agents rarely need all available fields and can request additional ones explicitly via a --fields flag.

3. Content truncation

Truncate large text fields to a configurable limit, appending a size hint such as (truncated, 2847 chars total — use --full to see complete body). This prevents a single verbose response from consuming the agent’s context budget while preserving enough content for most tasks.
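A minimal sketch of this policy; the limit, hint wording, and `--full` escape hatch mirror the example above, but the helper itself is hypothetical:

```python
def truncate_field(text, limit=1000):
    """Truncate a large text field, appending a size hint and an
    escape hatch so the agent knows how to retrieve the full content."""
    if len(text) <= limit:
        return text
    hint = f"... (truncated, {len(text)} chars total -- use --full to see complete body)"
    return text[:limit] + hint
```

The size hint matters as much as the truncation: the agent learns both that content was cut and exactly how to get the rest.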

Discoverability

4. Content first

Running a command with no arguments should display live, actionable data rather than help text.

Conventional: no-args shows help
$ gh issue
Work with GitHub issues.

USAGE
  gh issue <command> [flags]

AVAILABLE COMMANDS
  close, create, list, view, ...
AXI: no-args shows live state
$ gh-axi issue
count: 14 of 8771 total
issues[14]{number,title,state}:
  51815,"[Bug]: Telegram plugin fails...",open
  ...
help[2]:
  Run `gh-axi issue view <number>`
  Run `gh-axi issue create --title "..."`
5. Contextual disclosure

Append help[] lines after output, suggesting logical next steps as complete, copy-pasteable commands. This eliminates tool-discovery turns and guides agents through multi-step workflows.
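A sketch of how a command might attach these hints; the `with_help` helper and its exact format are illustrative, not part of gh-axi:

```python
def with_help(output, suggestions):
    """Append a help[] block of copy-pasteable next-step commands
    after the data output, mirroring the no-args example above."""
    if not suggestions:
        return output
    lines = [output, f"help[{len(suggestions)}]:"]
    lines += [f"  Run `{cmd}`" for cmd in suggestions]
    return "\n".join(lines)
```

Keeping the hints after the data (rather than in a help screen) means they cost tokens only when they are contextually relevant.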

6. Consistent way to get help

Each subcommand should offer a concise --help flag as a fallback when contextual hints are insufficient.

Completeness

7. Pre-computed fields

Include aggregate or derived fields that eliminate round trips. The most impactful example is totalCount: always report the total number of items, not just the page size. Other examples include computed CI status summaries (e.g., “27 passed, 0 failed, 10 skipped”) inline in PR views.

Conventional: no total count
$ gh label list
bug    Something isn't working    #d73a4a
docs   Improvements or additions  #0075ca
... (30 rows -- default page, no total)
AXI: total count + CI pre-computed
$ gh-axi label list
count: 126
labels[126]{name}:
  bug
  docs
  ...

$ gh-axi pr view 51772
pull_request:
  title: "refactor(plugins): route Telegram..."
  state: merged
  checks: "27 passed, 0 failed, 10 skipped"
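The inline checks summary above can be derived from per-job conclusions in one pass. A sketch, assuming each job record carries a `conclusion` field as in the GitHub API:

```python
from collections import Counter

def summarize_checks(jobs):
    """Collapse per-job CI conclusions into a one-line summary that can
    be embedded in a PR view, saving a separate checks round trip."""
    counts = Counter(job["conclusion"] for job in jobs)
    return (f"{counts.get('success', 0)} passed, "
            f"{counts.get('failure', 0)} failed, "
            f"{counts.get('skipped', 0)} skipped")

jobs = [{"conclusion": "success"}] * 27 + [{"conclusion": "skipped"}] * 10
print(summarize_checks(jobs))  # 27 passed, 0 failed, 10 skipped
```

The server-side tool pays this aggregation cost once; without it, the agent pays for it on every task, in extra turns and tokens.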
8. Definitive empty states

When a query returns no results, output an explicit zero-result message rather than empty output. Agents cannot distinguish “no output” from “command failed silently” without this signal.
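A sketch of the corresponding output shape; the exact wording is an assumption:

```python
def render_empty(name):
    """Definitive empty state: an explicit zero count plus an empty
    list header, never blank output the agent must interpret."""
    return f"count: 0\n{name}[0]: (no results)"
```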

Robustness

9. Graceful error handling

Mutations should be idempotent, errors should be structured and written to stdout (not stderr), and commands must never prompt for interactive input.

10. Output discipline

Reserve stdout for structured data and stderr for debug/log output. Use clean exit codes: 0 for success, 1 for errors.
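Both robustness principles can be sketched together; the `emit` helper is illustrative:

```python
import sys

def emit(data=None, error=None, debug=()):
    """Output discipline: structured data and structured errors go to
    stdout, debug logs to stderr, and the exit code is 0 or 1."""
    for line in debug:
        print(line, file=sys.stderr)  # never mixed into parseable output
    if error is not None:
        print(f"error: {error}")      # structured error on stdout, per Principle 9
        return 1
    print(data)
    return 0
```

Putting structured errors on stdout keeps the agent's parsing path uniform: it reads one stream for both success and failure, and the exit code disambiguates.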

4. Experimental Setup

Conditions

The benchmark evaluates five agent-tool interface conditions:

Condition            Interface                             Description
axi                  gh-axi                                AXI-compliant wrapper over gh CLI
cli                  gh CLI                                Raw GitHub CLI baseline
mcp-no-toolsearch    GitHub MCP                            All tool schemas loaded upfront
mcp-with-toolsearch  GitHub MCP + ToolSearch               Tools discovered on demand via ToolSearch
mcp-with-code-mode   TypeScript wrappers over GitHub MCP   Agent writes .ts scripts to call tools

Task design

The benchmark includes 17 tasks across four categories:

  • Simple lookups (4 tasks): list_open_issues, view_pr, list_releases, repo_overview—single-command tasks.
  • Multi-step investigations (7 tasks): issue_then_comments, pr_then_checks, release_then_body, run_then_jobs, ci_failure_investigation, pr_review_prep, merged_pr_ci_audit—tasks requiring 2–5 sequential commands.
  • Aggregate/count tasks (4 tasks): list_labels, bug_triage_search, weekly_catchup, find_fix_for_bug—tasks requiring counts or aggregations.
  • Error handling (2 tasks): invalid_issue, nonexistent_repo—tasks where the target does not exist.

Infrastructure

Each run uses a fresh shallow clone of openclaw/openclaw. Agent: Claude Sonnet 4.6. Judge: Claude Sonnet 4.6. Five repeats per condition–task pair: 17 × 5 × 5 = 425 total runs.

Metrics

  • Task success (binary): determined by an LLM judge comparing the agent’s answer against a reference answer and rubric.
  • Cost (USD): total API cost, computed from input/output token counts at published rates.
  • Duration (seconds): wall-clock time from task start to final response.
  • Turns: number of tool invocations the agent makes.

5. Results

Aggregate — 425 runs, 17 tasks × 5 repetitions per condition

Condition      Success   Avg Cost   Avg Duration   Avg Turns
gh-cli         86%       $0.054     17.4s          3
gh-mcp         87%       $0.148     34.2s          6
gh-mcp-search  82%       $0.147     41.1s          8
gh-mcp-code    84%       $0.101     43.4s          7
gh-axi         100%      $0.050     15.7s          3

AXI achieves 100% task success at the lowest average cost ($0.050/task) and shortest average duration (15.7s). The raw CLI baseline achieves 86% success at comparable cost. All three MCP conditions achieve 82–87% success at 2–3× the cost and 2–3× the duration.

Per-task breakdown

Success rate (out of 5 runs), average cost, and average duration per condition–task pair.

Task                     axi (✓ $ ⏱)    cli (✓ $ ⏱)    mcp (✓ $ ⏱)    mcp-search (✓ $ ⏱)    mcp-code (✓ $ ⏱)
Simple lookups
list_open_issues 5/5 .028 10 5/5 .024 8 5/5 .100 25 5/5 .046 18 5/5 .086 37
view_pr 5/5 .039 8 5/5 .038 9 5/5 .045 9 5/5 .031 12 5/5 .072 28
list_releases 5/5 .039 11 5/5 .038 9 5/5 .096 8 5/5 .055 11 5/5 .035 18
repo_overview 5/5 .038 9 5/5 .038 10 5/5 .056 9 5/5 .042 14 5/5 .033 16
Multi-step investigations
issue_then_comments 5/5 .049 13 5/5 .046 11 5/5 .065 14 5/5 .050 20 5/5 .090 39
pr_then_checks 5/5 .050 14 5/5 .055 15 5/5 .112 32 5/5 .077 35 5/5 .087 46
release_then_body 5/5 .053 14 5/5 .052 12 5/5 .060 11 5/5 .041 16 5/5 .103 43
run_then_jobs 5/5 .048 13 5/5 .050 14 4/5 .063 24 0/5 .061 24 5/5 .047 20
ci_failure_investigation 5/5 .065 21 5/5 .061 22 2/5 .758 177 2/5 .568 151 5/5 .194 94
pr_review_prep 5/5 .055 19 5/5 .070 25 5/5 .100 32 5/5 .087 31 5/5 .091 35
merged_pr_ci_audit 5/5 .064 26 5/5 .076 42 3/5 .340 63 3/5 .614 136 5/5 .205 104
Aggregate/count tasks
list_labels 5/5 .044 11 0/5 .041 10 0/5 .319 108 0/5 .353 125 0/5 .041 18
bug_triage_search 5/5 .088 47 3/5 .114 52 5/5 .092 21 5/5 .088 19 5/5 .132 53
weekly_catchup 5/5 .050 15 0/5 .046 14 5/5 .123 17 5/5 .185 37 0/5 .132 53
find_fix_for_bug 5/5 .067 21 5/5 .091 26 5/5 .084 15 5/5 .093 29 5/5 .124 43
Error handling
invalid_issue 5/5 .038 8 5/5 .038 8 5/5 .053 8 5/5 .051 10 5/5 .167 69
nonexistent_repo 5/5 .039 8 5/5 .039 10 5/5 .052 8 5/5 .050 10 1/5 .069 21

✓ = successes/5 runs. $ = avg cost (USD). ⏱ = avg duration (seconds).

6. Analysis

Finding 1: AXI achieves 100% reliability at lowest cost

AXI is the only condition that passes all 85 runs (100% success) at an average cost of $0.050/task—7% less than CLI’s $0.054 and 66% less than the cheapest MCP condition ($0.101 for mcp-with-code-mode). AXI also has the fastest average duration (15.7s vs. 17.4s for CLI, 34–43s for MCP). The reliability advantage is most pronounced on aggregate tasks (where pre-computed totalCount fields eliminate pagination errors) and complex investigations (where contextual suggestions prevent wrong turns).

Finding 2: MCP conditions are 2–3× more expensive

The MCP cost premium is not uniform—it concentrates on complex tasks where schema overhead compounds across many turns. On ci_failure_investigation, mcp-no-toolsearch averages $0.758/task across 15 turns and 177 seconds, vs. AXI’s $0.065 across 3 turns and 21 seconds—a 12× cost difference. The agent re-sends the full tool schema catalog (176K tokens) on every turn; by turn 15, schema tokens dominate the context.

On simple tasks where agents need few turns, MCP’s overhead is more modest: view_pr costs $0.045 for MCP vs. $0.039 for AXI (1.2×).
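The compounding effect can be sketched with a toy cost model. The per-turn growth figure and the CLI preamble size below are illustrative assumptions, not benchmark measurements; only the 176K schema figure comes from the evaluation:

```python
def cumulative_input_tokens(turns, fixed_overhead, per_turn_growth):
    """Total input tokens across a task when a fixed overhead (tool
    schemas or preamble) is re-sent on every turn along with the
    growing conversation history."""
    total, conversation = 0, 0
    for _ in range(turns):
        total += fixed_overhead + conversation
        conversation += per_turn_growth
    return total

# Assumed numbers: a 176K-token schema catalog re-sent over 15 turns
# vs. a ~1K-token CLI preamble over 3 turns, each turn adding ~4K
# tokens of conversation.
mcp = cumulative_input_tokens(15, 176_000, 4_000)
axi = cumulative_input_tokens(3, 1_000, 4_000)
```

Because the fixed overhead is multiplied by the turn count, schema cost scales linearly with turns even before the conversation itself grows, which is why the premium concentrates on many-turn tasks.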

Finding 3: ToolSearch saves upfront tokens but spends them on extra turns

ToolSearch starts with a smaller context (~50K vs. ~83K tokens) but needs 1–2 extra turns per task for tool discovery. Each extra turn re-sends the growing conversation, so the savings are consumed by accumulation.

                   mcp (eager)   mcp-search
Avg input tokens   175,757       153,621
Avg turns          6             8
Total cost         $12.59        $12.45
Success rate       87%           82%

On merged_pr_ci_audit, mcp-with-toolsearch takes 24 turns at $0.614/task vs. mcp-no-toolsearch’s 18 turns at $0.340—1.8× more expensive and 2.2× slower, with identical success (3/5). ToolSearch also introduces a new failure mode: on run_then_jobs, the agent cannot find the right tool via search (0/5), while eager loading succeeds 4/5.

The takeaway: lazy tool loading is a net negative when the agent needs most of the tools anyway. The 2-turn discovery overhead per task exceeds the context savings, and the indirection introduces a new failure mode (can’t find the right tool). Eager-loading pays a fixed upfront cost but gives the agent immediate access to every tool, which is more reliable and no more expensive in practice.

Finding 4: Code-mode is cheapest MCP but still 2× AXI

The mcp-with-code-mode condition, where the agent writes TypeScript scripts calling typed wrapper functions, achieves the lowest cost among MCP conditions ($0.101/task). Writing code amortizes schema costs by batching multiple API calls per script. But the code-generation overhead makes it the slowest condition (43.4s avg, the longest of all five).

On ci_failure_investigation, mcp-with-code-mode costs $0.194/task—3× AXI’s $0.065—but achieves 5/5 success, outperforming both direct MCP conditions (2/5 each). Code-mode also introduces a unique failure mode: unhandled runtime errors cause 4/5 failures on nonexistent_repo, where the agent writes scripts that throw instead of gracefully reporting the error.

Finding 5: AXI’s advantages by task type

AXI’s advantages vary by task complexity:

Simple lookups. All conditions achieve 100% success. AXI and CLI cost $0.024–0.039/task; MCP costs $0.031–0.100/task. The 1.5–3× MCP premium is pure schema overhead.

Multi-step investigations. AXI and CLI both achieve 100% success on most tasks. The cost difference emerges on the hardest tasks: ci_failure_investigation costs $0.065 for AXI vs. $0.061 for CLI (both 5/5), but $0.194–0.758 for MCP conditions (2–5/5). AXI’s pre-computed job status fields (checks: "27 passed, 0 failed") eliminate the need for separate job-listing API calls that MCP agents must make turn by turn.

Aggregate tasks. This is where AXI separates from all conditions. list_labels: AXI 5/5 at $0.044; every other condition 0/5. weekly_catchup: AXI 5/5 at $0.050; CLI 0/5; mcp-with-code-mode 0/5. The pre-computed totalCount field is the difference—without it, agents report the default page size (30) as the total.

Error handling. All CLI-based conditions handle errors well. mcp-with-code-mode scores only 1/5 on nonexistent_repo ($0.069/task) because generated scripts throw unhandled exceptions rather than reporting structured errors.

7. Case Studies

Case Study A: ci_failure_investigation

Identify the 5 most recent failed CI runs and report which jobs failed in each.

gh-axi 3 turns · $0.065 · 21s · 5/5
Turn 1 (agent → tool):
$ gh-axi run list --status failure --limit 5

Turn 1 (tool → agent):
runs[5]{id,title,status,conclusion,workflow,branch}:
  23386042824,"fix(daemon): repair stale LaunchAgent...",completed,failure,Install Smoke,...
  23385977677,"feat(auth): add models auth clean...",completed,failure,CI,...
  ... (3 more)
help[1]:
  Run `gh-axi run view <id>` to view details

Turn 2: Agent views all 5 runs in parallel

Turn 2 (tool → agent, for each run):
run:
  id: 23385977677
  conclusion: failure
  workflow: CI
jobs[29]{name,status,conclusion}:
  check,completed,failure
  check-additional,completed,failure
  build-smoke,completed,failure
  ... (26 more with pass/skip/fail)
help[2]:
  Run `gh-axi run view 23385977677 --log-failed` to see failure logs

Turn 3: Agent summarizes patterns.
gh mcp (eager loading) 15 turns · $0.758 · 177s · 2/5
Turn 1: list_workflow_runs(status:"failure",per_page:5)
  → 5 run objects (~2KB JSON each)
Turn 2: list_workflow_run_jobs(run_id:23385977677)
  → 29 job objects (~1.5KB JSON each)
  (context now: 176K schema + growing conversation)
Turn 3: list_workflow_run_jobs(run_id:23385977675)
  → 2 job objects
Turn 4–6: ... (remaining runs)
  (context exceeding 400K tokens by turn 6)
Turn 7–15: Agent re-reads earlier results,
  attempts to correlate, loses track of which
  jobs belong to which run.

Result: 2/5 runs produce correct answer.
3/5 runs misattribute job failures or omit runs.

AXI completes in 3 turns. The key advantage: each run view response includes the full job-level breakdown (jobs[29]{name,status,conclusion}) as structured data, plus a suggestion to view failure logs. The agent can immediately identify which jobs failed without additional API calls. MCP requires separate API calls for run listing and job listing (no inline job status). Each call re-sends the full schema catalog. By turn 6, the context exceeds 400K tokens, and the agent begins confusing job data across runs.

Case Study B: merged_pr_ci_audit

List the 10 most recently merged PRs and report the CI status of each.

gh-axi 5 turns · $0.064 · 26s · 5/5
Turn 1:
$ gh-axi pr list --state merged --limit 10
count: 10 of 3549
pull_requests[10]{number,title,state,...}:
  51772,"refactor(plugins): route Telegram...",merged
  ...
help: Run `gh-axi pr view <number>`

Turns 2–5: views each PR (some in parallel)
pr:
  number: 51772
  state: merged
  checks: "27 passed, 0 failed, 10 skipped"
             ^ pre-computed — no extra API call

Agent summarizes: “6 out of 10 fully green CI.”
gh CLI 4 turns · $0.076 · 42s · 5/5
Turn 1:
$ gh pr list --state merged --limit 10 \
    --json number,title,author,mergedAt,statusCheckRollup
  → 10 PR objects with nested check arrays (~2KB JSON each)

Turns 2–3: parse JSON, count pass/fail per PR
  Some runs call `gh pr checks` again to double-check

Turn 4: Agent summarizes results.
  19% more expensive, 62% slower — but succeeds.
gh mcp (with ToolSearch) 24 turns · $0.614 · 136s · 3/5
Turns 1–3: ToolSearch for PR listing tool
Turns 4–6: list merged PRs, paginate
Turns 7–24: for each PR, search for check/status
  tool, invoke it, parse response.
  Context grows to >300K tokens.
  Agent loses track of PR-to-check mapping.

Result: 3/5 runs succeed. 2/5 misreport CI status.
9.6× more expensive than AXI.

AXI’s pre-computed checks field provides a one-line CI summary inline in each PR view. The agent never needs to call a separate checks or workflow API. CLI also succeeds (5/5) by using the --json flag to get structured output in a single call, but costs 19% more ($0.076 vs. $0.064) and takes 62% longer (42s vs. 26s). The CLI agent must know to use --json with statusCheckRollup—a non-obvious field name that requires prior knowledge of the gh CLI’s JSON schema.

8. Limitations

  • Single target repository. All tasks target openclaw/openclaw. Results may vary on repositories with different sizes or structures.
  • Single agent model. Only Claude Sonnet 4.6 is evaluated. Other models may exhibit different sensitivities to output format.
  • Read-only tasks. The benchmark includes only read operations. Mutations introduce additional challenges around idempotency.
  • LLM judge. Task success is determined by an LLM judge, which may introduce scoring biases.
  • GitHub-specific implementation. While the AXI principles are domain-general, this evaluation is specific to GitHub.
  • Sub-optimal MCP baseline. The MCP conditions use the official GitHub MCP server, not a hand-optimized one. A carefully designed MCP server with fewer, better-composed tools could narrow the gap.

9. Conclusion

The debate between CLI and MCP as agent-tool interfaces misses the deeper question: what design principles make any interface effective for agents? This evaluation shows that a principled CLI design (AXI) outperforms both raw CLI and MCP on every metric—success, cost, duration, and turns.

AXI achieves 100% reliability at $0.050/task, compared to 86% at $0.054 for raw CLI and 82–87% at $0.101–0.148 for MCP. The cost gap widens dramatically on complex tasks: ci_failure_investigation costs $0.065 for AXI vs. $0.758 for MCP (12×), with AXI achieving 100% success vs. MCP’s 40%.

The 10 AXI principles provide a concrete, testable framework for designing agent-ergonomic interfaces that are both reliable and cost-efficient.

References

  1. Anthropic. Model Context Protocol specification. Technical report, 2024.
  2. H. Mao and R. Pradhan. “MCP vs CLI is the wrong fight.” Smithery Blog, March 2026.
  3. A. Jones and C. Kelly. “Code execution with MCP: Building more efficient agents.” Anthropic Engineering Blog, November 2025.
  4. K. Varda and S. Pai. “Code Mode: The better way to use MCP.” Cloudflare Blog, September 2025.