10 design principles for building agent-ergonomic CLI tools. 100% task success at the lowest cost—beating both MCP and raw CLI.
gh-axi is a wrapper around the gh CLI,
optimized for agent ergonomics.
```
npm install -g gh-axi
```
Add to your CLAUDE.md or AGENTS.md:
```
Use `gh-axi` (replacement for `gh` CLI) for all GitHub operations.
```
Install the AXI skill to get the design guidelines and scaffolding:
```
npx skills add kunchenguid/axi
```
AI agents interact with external services through two dominant
paradigms. The first is shell-based CLI execution:
the agent runs commands like gh issue list and parses
the text output. The second is
structured tool protocols like MCP (Model Context
Protocol), where the agent invokes typed tool functions through the
hosting framework’s native tool-calling interface.
Both approaches have significant problems for agents. CLI agents burn turns parsing free-form text and reading --help output. MCP agents with lazy loading waste turns on tool discovery. Neither provides in-context guidance on what to do next.
Recent debate has framed this as “MCP vs. CLI” [2], but the real question is neither which protocol nor which transport, but rather what design principles make an agent-tool interface effective.
AXI is a set of 10 principles for agent-ergonomic CLI design that treat token budget as a first-class constraint. AXI achieves the reliability advantages MCP promises (structured output, discoverability) at the cost profile of a CLI.
Contributions:
- The 10 AXI design principles for agent-ergonomic CLI tools.
- A 425-run benchmark comparing five agent-tool interface conditions on real GitHub tasks.
- A reference implementation (gh-axi) demonstrating the principles on the GitHub platform.
Anthropic’s Model Context Protocol [1] provides a standardized way to connect AI models to external tools through structured, typed tool definitions. While MCP offers type-safe schemas and discoverable tool catalogs, the schema overhead is substantial. The evaluation below quantifies this cost: MCP tool definitions consume 3–4× more context tokens than equivalent CLI conditions, translating to 2–3× higher cost.
Mao and Pradhan [2] benchmarked 756 runs comparing raw APIs, CLI, and native MCP across GitHub, Linear, and Singapore Bus APIs. They found native MCP achieved 91.7% success vs. 83.3% for CLI, with CLI using 2.9× more billed tokens. However, their CLI was auto-generated from API specs, not hand-designed for agents. They note: “What this does not settle is whether a hand-crafted, agent-first CLI could close the remaining gap.” AXI answers this question directly—an agent-first CLI not only closes the gap but surpasses MCP on every metric.
Jones and Kelly [3] identified two
scaling problems with direct MCP tool calls: tool definitions
overload the context window, and intermediate results consume
additional tokens as they pass through the model between calls. They
proposed presenting MCP tools as a code API that the agent programs
against, so that data flows through the execution environment rather
than the context window. Varda and Pai
[4] independently arrived at the
same insight, coining the term “Code Mode” and reporting
that LLMs handle complex, multi-step tool interactions better when
writing TypeScript than when making direct tool calls. The
mcp-with-code-mode
condition in this benchmark evaluates this approach: it is the
cheapest MCP condition ($0.101/task vs. $0.147–0.148),
confirming that code execution reduces MCP overhead, but it remains
2× more expensive than AXI and introduces new failure modes
from runtime errors in generated code.
Use TOON (Token-Optimized Object Notation) format instead of JSON or tab-separated tables. TOON omits braces, quotes, and commas, yielding approximately 40% token savings over equivalent JSON while remaining unambiguous to LLMs.
JSON:

```json
[{"number":42,"title":"Fix login bug","state":"open",
  "author":"alice","labels":["bug","P1"]},
 {"number":43,"title":"Add dark mode","state":"open",
  "author":"bob","labels":["feature"]}]
```

TOON:

```
issues[2]{number,title,state}:
42,Fix login bug,open
43,Add dark mode,open
```
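As a sketch of how the encoding above can be produced, here is a minimal TOON serializer for uniform object arrays. The function name and the quoting rule (quote only values containing a comma) are illustrative assumptions, not part of any published TOON library.

```typescript
type Row = Record<string, string | number>;

// Emit a TOON block: a header naming the array, its length, and the
// field list, followed by one comma-separated line per row.
function toToon(name: string, rows: Row[], fields: string[]): string {
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  const lines = rows.map((row) =>
    fields
      .map((f) => {
        const v = String(row[f] ?? "");
        // Quote only when the value itself contains a comma.
        return v.includes(",") ? `"${v}"` : v;
      })
      .join(",")
  );
  return [header, ...lines].join("\n");
}

const issues = [
  { number: 42, title: "Fix login bug", state: "open" },
  { number: 43, title: "Add dark mode", state: "open" },
];
console.log(toToon("issues", issues, ["number", "title", "state"]));
```

The header carries the count and schema once, so the per-row lines need no braces, keys, or quotes.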
Return 3–4 fields per list item by default, not 10+. Agents
rarely need all available fields and can request additional ones
explicitly via a --fields flag.
Truncate large text fields to a configurable limit, appending a
size hint such as
(truncated, 2847 chars total — use --full to see complete
body). This prevents a single verbose response from consuming the
agent’s context budget while preserving enough content for
most tasks.
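A minimal sketch of this truncation rule, assuming a default 500-character limit (the actual default is configurable) and mirroring the hint format shown above:

```typescript
// Truncate a text field and append a size hint so the agent knows
// both that content was cut and how to retrieve the full body.
function truncateField(text: string, limit = 500): string {
  if (text.length <= limit) return text;
  return (
    text.slice(0, limit) +
    `\n(truncated, ${text.length} chars total — use --full to see complete body)`
  );
}
```

The hint matters as much as the cut: without the total size and the `--full` suggestion, the agent cannot tell a short field from a truncated one.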
Running a command with no arguments should display live, actionable data rather than help text.
Before (raw `gh`):

```
$ gh issue
Work with GitHub issues.

USAGE
  gh issue <command> [flags]

AVAILABLE COMMANDS
  close, create, list, view, ...
```

After (`gh-axi`):

```
$ gh-axi issue
count: 14 of 8771 total
issues[14]{number,title,state}:
51815,"[Bug]: Telegram plugin fails...",open
...
help[2]:
  Run `gh-axi issue view <number>`
  Run `gh-axi issue create --title "..."`
```
Append help[] lines after output, suggesting logical
next steps as complete, copy-pasteable commands. This eliminates
tool-discovery turns and guides agents through multi-step
workflows.
Each subcommand should offer a concise --help flag as
a fallback when contextual hints are insufficient.
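A sketch of the help-line convention, with the `help[N]:` formatting assumed from the examples above and the function name illustrative:

```typescript
// Append copy-pasteable next-step suggestions after the data block,
// so the agent never spends a turn on tool discovery.
function withHelp(output: string, suggestions: string[]): string {
  if (suggestions.length === 0) return output;
  const help = [
    `help[${suggestions.length}]:`,
    ...suggestions.map((s) => `  Run \`${s}\``),
  ];
  return [output, ...help].join("\n");
}
```

Suggestions are complete commands, not command names: the agent can execute them verbatim on the next turn.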
Include aggregate or derived fields that eliminate round trips.
The most impactful example is totalCount: always
report the total number of items, not just the page size. Other
examples include computed CI status summaries (e.g., “27
passed, 0 failed, 10 skipped”) inline in PR views.
Raw `gh` (no total):

```
$ gh label list
bug    Something isn't working     #d73a4a
docs   Improvements or additions   #0075ca
... (30 rows -- default page, no total)
```

`gh-axi`:

```
$ gh-axi label list
count: 126
labels[126]{name}:
bug
docs
...
```

```
$ gh-axi pr view 51772
pull_request:
  title: "refactor(plugins): route Telegram..."
  state: merged
  checks: "27 passed, 0 failed, 10 skipped"
```
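The pre-computed checks summary can be derived once, server-side, from job conclusions. A minimal sketch; the `Job` shape and function name are assumptions, not gh-axi internals:

```typescript
type Job = { conclusion: "success" | "failure" | "skipped" };

// Collapse per-job conclusions into the one-line CI summary shown
// inline in PR views, eliminating a separate job-listing round trip.
function checksSummary(jobs: Job[]): string {
  const count = (c: Job["conclusion"]) =>
    jobs.filter((j) => j.conclusion === c).length;
  return `${count("success")} passed, ${count("failure")} failed, ${count("skipped")} skipped`;
}
```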
When a query returns no results, output an explicit zero-result message rather than empty output. Agents cannot distinguish “no output” from “command failed silently” without this signal.
Mutations should be idempotent, errors should be structured and written to stdout (not stderr), and commands must never prompt for interactive input.
Reserve stdout for structured data and stderr for debug/log output. Use clean exit codes: 0 for success, 1 for errors.
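The output-discipline principles above can be sketched together. The `CommandResult` shape and function names here are illustrative, not gh-axi's actual internals:

```typescript
type CommandResult = { stdout: string; exitCode: number };

// Zero-result queries emit an explicit signal, never empty output,
// so the agent can distinguish "no matches" from a silent failure.
function renderList(name: string, rows: string[]): CommandResult {
  if (rows.length === 0) {
    return { stdout: `${name}[0]: no results`, exitCode: 0 };
  }
  return {
    stdout: [`count: ${rows.length}`, `${name}[${rows.length}]:`, ...rows].join("\n"),
    exitCode: 0,
  };
}

// Errors are structured data on stdout with a clean exit code of 1;
// stderr stays reserved for debug/log output.
function renderError(message: string): CommandResult {
  return { stdout: `error: ${message}`, exitCode: 1 };
}
```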
The benchmark evaluates five agent-tool interface conditions:
| Condition | Interface | Description |
|---|---|---|
| axi | gh-axi | AXI-compliant wrapper over gh CLI |
| cli | gh CLI | Raw GitHub CLI baseline |
| mcp-no-toolsearch | GitHub MCP | All tool schemas loaded upfront |
| mcp-with-toolsearch | GitHub MCP + ToolSearch | Tools discovered on demand via ToolSearch |
| mcp-with-code-mode | TypeScript wrappers over GitHub MCP | Agent writes .ts scripts to call tools |
The benchmark includes 17 tasks across four categories:

- Simple lookups: list_open_issues, view_pr, list_releases, repo_overview—single-command tasks.
- Multi-step investigations: issue_then_comments, pr_then_checks, release_then_body, run_then_jobs, ci_failure_investigation, pr_review_prep, merged_pr_ci_audit—tasks requiring 2–5 sequential commands.
- Aggregate/count tasks: list_labels, bug_triage_search, weekly_catchup, find_fix_for_bug—tasks requiring counts or aggregations.
- Error handling: invalid_issue, nonexistent_repo—tasks where the target does not exist.
Each run uses a fresh shallow clone of
openclaw/openclaw. Agent: Claude Sonnet 4.6. Judge:
Claude Sonnet 4.6. Five repeats per condition–task pair: 17
× 5 × 5 = 425 total runs.
| Condition | Success | Avg Cost | Avg Duration | Avg Turns |
|---|---|---|---|---|
| cli | 86% | $0.054 | 17.4s | 3 |
| mcp-no-toolsearch | 87% | $0.148 | 34.2s | 6 |
| mcp-with-toolsearch | 82% | $0.147 | 41.1s | 8 |
| mcp-with-code-mode | 84% | $0.101 | 43.4s | 7 |
| axi | 100% | $0.050 | 15.7s | 3 |
AXI achieves 100% task success at the lowest average cost ($0.050/task) and shortest average duration (15.7s). The raw CLI baseline achieves 86% success at comparable cost. All three MCP conditions achieve 82–87% success at 2–3× the cost and 2–3× the duration.
Success rate (out of 5 runs), average cost, and average duration per condition–task pair.
| Task | axi ✓ | $ | ⏱ | cli ✓ | $ | ⏱ | mcp ✓ | $ | ⏱ | mcp-search ✓ | $ | ⏱ | mcp-code ✓ | $ | ⏱ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Simple lookups | |||||||||||||||
| list_open_issues | 5/5 | .028 | 10 | 5/5 | .024 | 8 | 5/5 | .100 | 25 | 5/5 | .046 | 18 | 5/5 | .086 | 37 |
| view_pr | 5/5 | .039 | 8 | 5/5 | .038 | 9 | 5/5 | .045 | 9 | 5/5 | .031 | 12 | 5/5 | .072 | 28 |
| list_releases | 5/5 | .039 | 11 | 5/5 | .038 | 9 | 5/5 | .096 | 8 | 5/5 | .055 | 11 | 5/5 | .035 | 18 |
| repo_overview | 5/5 | .038 | 9 | 5/5 | .038 | 10 | 5/5 | .056 | 9 | 5/5 | .042 | 14 | 5/5 | .033 | 16 |
| Multi-step investigations | |||||||||||||||
| issue_then_comments | 5/5 | .049 | 13 | 5/5 | .046 | 11 | 5/5 | .065 | 14 | 5/5 | .050 | 20 | 5/5 | .090 | 39 |
| pr_then_checks | 5/5 | .050 | 14 | 5/5 | .055 | 15 | 5/5 | .112 | 32 | 5/5 | .077 | 35 | 5/5 | .087 | 46 |
| release_then_body | 5/5 | .053 | 14 | 5/5 | .052 | 12 | 5/5 | .060 | 11 | 5/5 | .041 | 16 | 5/5 | .103 | 43 |
| run_then_jobs | 5/5 | .048 | 13 | 5/5 | .050 | 14 | 4/5 | .063 | 24 | 0/5 | .061 | 24 | 5/5 | .047 | 20 |
| ci_failure_investigation | 5/5 | .065 | 21 | 5/5 | .061 | 22 | 2/5 | .758 | 177 | 2/5 | .568 | 151 | 5/5 | .194 | 94 |
| pr_review_prep | 5/5 | .055 | 19 | 5/5 | .070 | 25 | 5/5 | .100 | 32 | 5/5 | .087 | 31 | 5/5 | .091 | 35 |
| merged_pr_ci_audit | 5/5 | .064 | 26 | 5/5 | .076 | 42 | 3/5 | .340 | 63 | 3/5 | .614 | 136 | 5/5 | .205 | 104 |
| Aggregate/count tasks | |||||||||||||||
| list_labels | 5/5 | .044 | 11 | 0/5 | .041 | 10 | 0/5 | .319 | 108 | 0/5 | .353 | 125 | 0/5 | .041 | 18 |
| bug_triage_search | 5/5 | .088 | 47 | 3/5 | .114 | 52 | 5/5 | .092 | 21 | 5/5 | .088 | 19 | 5/5 | .132 | 53 |
| weekly_catchup | 5/5 | .050 | 15 | 0/5 | .046 | 14 | 5/5 | .123 | 17 | 5/5 | .185 | 37 | 0/5 | .132 | 53 |
| find_fix_for_bug | 5/5 | .067 | 21 | 5/5 | .091 | 26 | 5/5 | .084 | 15 | 5/5 | .093 | 29 | 5/5 | .124 | 43 |
| Error handling | |||||||||||||||
| invalid_issue | 5/5 | .038 | 8 | 5/5 | .038 | 8 | 5/5 | .053 | 8 | 5/5 | .051 | 10 | 5/5 | .167 | 69 |
| nonexistent_repo | 5/5 | .039 | 8 | 5/5 | .039 | 10 | 5/5 | .052 | 8 | 5/5 | .050 | 10 | 1/5 | .069 | 21 |
✓ = successes/5 runs. $ = avg cost (USD). ⏱ = avg duration (seconds).
AXI is the only condition that passes all 85 runs (100% success)
at an average cost of $0.050/task—7% less than CLI’s
$0.054 and 66% less than the cheapest MCP condition ($0.101 for
mcp-with-code-mode). AXI also has the fastest average
duration (15.7s vs. 17.4s for CLI, 34–43s for MCP). The
reliability advantage is most pronounced on aggregate tasks (where
pre-computed totalCount fields eliminate pagination
errors) and complex investigations (where contextual suggestions
prevent wrong turns).
The MCP cost premium is not uniform—it concentrates on
complex tasks where schema overhead compounds across many turns.
On ci_failure_investigation,
mcp-no-toolsearch averages $0.758/task across 15
turns and 177 seconds, vs. AXI’s $0.065 across 3 turns and
21 seconds—a 12× cost difference. The
agent re-sends the full tool schema catalog (176K tokens) on every
turn; by turn 15, schema tokens dominate the context.
On simple tasks where agents need few turns, MCP’s overhead
is more modest:
view_pr costs $0.045 for MCP vs. $0.039 for AXI
(1.2×).
ToolSearch starts with a smaller context (~50K vs. ~83K tokens) but needs 1–2 extra turns per task for tool discovery. Each extra turn re-sends the growing conversation, so the savings are consumed by accumulation.
| | mcp (eager) | mcp-search |
|---|---|---|
| Avg input tokens | 175,757 | 153,621 |
| Avg turns | 6 | 8 |
| Total cost | $12.59 | $12.45 |
| Success rate | 87% | 82% |
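The accumulation effect can be made concrete with a back-of-envelope model. The base context sizes (~83K eager, ~50K with ToolSearch) and turn counts come from the text above; the per-turn conversation growth of 10K tokens is an assumption:

```typescript
// Model: turn t re-sends the base context plus everything accumulated
// so far, so total input tokens grow quadratically with turn count.
function totalInputTokens(
  baseContext: number,
  turns: number,
  perTurnGrowth: number
): number {
  let total = 0;
  for (let t = 0; t < turns; t++) {
    total += baseContext + t * perTurnGrowth;
  }
  return total;
}

// Eager loading: larger base, fewer turns. ToolSearch: smaller base,
// but extra discovery turns that each re-send the conversation.
const eager = totalInputTokens(83_000, 6, 10_000); // 648,000
const lazy = totalInputTokens(50_000, 8, 10_000); // 680,000
```

Under these (assumed) growth numbers, the lazy condition ends up sending more input tokens despite its smaller base, which matches the near-identical costs in the table.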
On merged_pr_ci_audit,
mcp-with-toolsearch takes 24 turns at $0.614/task vs.
mcp-no-toolsearch’s 18 turns at
$0.340—1.8× more expensive and 2.2× slower, with
identical success (3/5). ToolSearch also introduces a new failure
mode: on run_then_jobs, the agent cannot find the
right tool via search (0/5), while eager loading succeeds 4/5.
The takeaway: lazy tool loading is a net negative when the agent needs most of the tools anyway. The 2-turn discovery overhead per task exceeds the context savings, and the indirection introduces a new failure mode (can’t find the right tool). Eager-loading pays a fixed upfront cost but gives the agent immediate access to every tool, which is more reliable and no more expensive in practice.
The mcp-with-code-mode condition, where the agent
writes TypeScript scripts calling typed wrapper functions,
achieves the lowest cost among MCP conditions ($0.101/task).
Writing code amortizes schema costs by batching multiple API calls
per script. But the code-generation overhead makes it the
slowest condition (43.4s avg, the longest of all five).
On ci_failure_investigation,
mcp-with-code-mode costs $0.194/task—3×
AXI’s $0.065—but achieves 5/5 success, outperforming
both direct MCP conditions (2/5 each). Code-mode also introduces a
unique failure mode: unhandled runtime errors cause 4/5 failures
on nonexistent_repo, where the agent writes scripts
that throw instead of gracefully reporting the error.
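The 4/5 nonexistent_repo failures suggest generated scripts should run inside a harness that converts thrown errors into the structured error output AXI prescribes. A hypothetical sketch; `runGenerated` and its contract are assumptions, not part of any evaluated condition:

```typescript
// Run an agent-generated script body, reporting any thrown error as
// structured data instead of crashing the turn.
async function runGenerated(script: () => Promise<string>): Promise<string> {
  try {
    return await script();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    return `error: ${msg}`;
  }
}
```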
AXI’s advantages vary by task complexity:
Simple lookups. All conditions achieve 100% success. AXI and CLI cost $0.024–0.039/task; MCP costs $0.031–0.100/task. The 1.5–3× MCP premium is pure schema overhead.
Multi-step investigations.
AXI and CLI both achieve 100% success on most tasks. The cost
difference emerges on the hardest tasks:
ci_failure_investigation costs $0.065 for AXI vs.
$0.061 for CLI (both 5/5), but $0.194–0.758 for MCP
conditions (2–5/5). AXI’s pre-computed job status
fields (checks: "27 passed, 0 failed") eliminate the
need for separate job-listing API calls that MCP agents must make
turn by turn.
Aggregate tasks.
This is where AXI separates from all conditions.
list_labels: AXI 5/5 at $0.044; every other condition
0/5. weekly_catchup: AXI 5/5 at $0.050; CLI 0/5;
mcp-with-code-mode 0/5. The pre-computed
totalCount field is the difference—without it,
agents report the default page size (30) as the total.
Error handling.
All CLI-based conditions handle errors well.
mcp-with-code-mode scores only 1/5 on
nonexistent_repo ($0.069/task) because generated
scripts throw unhandled exceptions rather than reporting
structured errors.
Task: "Identify the 5 most recent failed CI runs and report which jobs failed in each."
The `gh-axi` agent:

Turn 1 (agent → tool):

```
$ gh-axi run list --status failure --limit 5
```

Turn 1 (tool → agent):

```
runs[5]{id,title,status,conclusion,workflow,branch}:
23386042824,"fix(daemon): repair stale LaunchAgent...",completed,failure,Install Smoke,...
23385977677,"feat(auth): add models auth clean...",completed,failure,CI,...
... (3 more)
help[1]:
  Run `gh-axi run view <id>` to view details
```

Turn 2: Agent views all 5 runs in parallel.

Turn 2 (tool → agent, for each run):

```
run:
  id: 23385977677
  conclusion: failure
  workflow: CI
  jobs[29]{name,status,conclusion}:
    check,completed,failure
    check-additional,completed,failure
    build-smoke,completed,failure
    ... (26 more with pass/skip/fail)
help[2]:
  Run `gh-axi run view 23385977677 --log-failed` to see failure logs
```
Turn 3: Agent summarizes patterns.
The mcp-no-toolsearch agent on the same task:

```
Turn 1: list_workflow_runs(status:"failure", per_page:5)
        → 5 run objects (~2KB JSON each)
Turn 2: list_workflow_run_jobs(run_id:23385977677)
        → 29 job objects (~1.5KB JSON each)
        (context now: 176K schema + growing conversation)
Turn 3: list_workflow_run_jobs(run_id:23385977675)
        → 2 job objects
Turns 4–6: ... (remaining runs)
        (context exceeding 400K tokens by turn 6)
Turns 7–15: Agent re-reads earlier results, attempts to
        correlate, loses track of which jobs belong to which run.

Result: 2/5 runs produce a correct answer.
        3/5 runs misattribute job failures or omit runs.
```
AXI completes in 3 turns. The key advantage: each
run view
response includes the full job-level breakdown
(jobs[29]{name,status,conclusion}) as structured data,
plus a suggestion to view failure logs. The agent can immediately
identify which jobs failed without additional API calls. MCP
requires separate API calls for run listing and job listing (no
inline job status). Each call re-sends the full schema catalog. By
turn 6, the context exceeds 400K tokens, and the agent begins
confusing job data across runs.
Task: "List the 10 most recently merged PRs and report the CI status of each."
The `gh-axi` agent:

Turn 1:

```
$ gh-axi pr list --state merged --limit 10
count: 10 of 3549
pull_requests[10]{number,title,state,...}:
51772,"refactor(plugins): route Telegram...",merged
...
help: Run `gh-axi pr view <number>`
```

Turns 2–5: views each PR (some in parallel):

```
pr:
  number: 51772
  state: merged
  checks: "27 passed, 0 failed, 10 skipped"
  ^ pre-computed — no extra API call
```

The agent summarizes: "6 out of 10 fully green CI."
The cli agent:

Turn 1:

```
$ gh pr list --state merged --limit 10 \
  --json number,title,author,mergedAt,statusCheckRollup
→ 10 PR objects with nested check arrays (~2KB JSON each)
```

Turns 2–3: parse JSON, count pass/fail per PR; some runs call `gh pr checks` again to double-check.

Turn 4: Agent summarizes results. 19% more expensive and 62% slower than AXI, but it succeeds.
The mcp-with-toolsearch agent:

```
Turns 1–3: ToolSearch for PR listing tool
Turns 4–6: list merged PRs, paginate
Turns 7–24: for each PR, search for check/status
        tool, invoke it, parse response.
        Context grows to >300K tokens.
        Agent loses track of PR-to-check mapping.

Result: 3/5 runs succeed. 2/5 misreport CI status.
        9.6× more expensive than AXI.
```
AXI’s pre-computed checks field provides a
one-line CI summary inline in each PR view. The agent never needs to
call a separate checks or workflow API. CLI also succeeds (5/5) by
using the --json flag to get structured output in a
single call, but costs 19% more ($0.076 vs. $0.064) and takes 62%
longer (42s vs. 26s). The CLI agent must know to use
--json with statusCheckRollup—a
non-obvious field name that requires prior knowledge of the
gh CLI’s JSON schema.
The benchmark targets a single repository (openclaw/openclaw). Results may vary on repositories with different sizes or structures.
The debate between CLI and MCP as agent-tool interfaces misses the deeper question: what design principles make any interface effective for agents? This evaluation shows that a principled CLI design (AXI) outperforms both raw CLI and MCP on every metric—success, cost, duration, and turns.
AXI achieves 100% reliability at $0.050/task, compared to 86% at
$0.054 for raw CLI and 82–87% at $0.101–0.148 for MCP.
The cost gap widens dramatically on complex tasks:
ci_failure_investigation
costs $0.065 for AXI vs. $0.758 for MCP (12×), with AXI
achieving 100% success vs. MCP’s 40%.
The 10 AXI principles provide a concrete, testable framework for designing agent-ergonomic interfaces that are both reliable and cost-efficient.