The Coding Agent Shootout and the Best Harness We Could Not Read

This is the fourth and final post in our series on AI agent architecture. The first three were about how agents are built — we compared three of them, dissected what a harness does, and looked at skills and memory. This one answers the question everybody actually asks first: which coding agent should I use?

We read the source of five open coding agents end to end — OpenAI's Codex CLI (the Rust harness is open even though the models aren't), Pi, OpenCode, Moonshot's Kimi Code, and Google's Gemini CLI. Below is the honest shootout: strengths, weaknesses, and a ranking. But there's an asterisk we owe you up front, so here it is in plain sight — the best harness in the field belongs to a tool we deliberately did not put under the microscope: Claude Code. We'll get to why at the end.

(We're leaving Hermes Agent mostly out of this one. It can code, but it's an assistant-first agent, and judging it as a coding tool would be judging a Swiss Army knife by its corkscrew.)

How to judge a coding agent

A coding agent's quality is the product of two things that multiply rather than add: the model doing the reasoning, and the harness that decides what the model sees, what it can touch, and what happens when it's wrong. A great model in a sloppy harness flails on long tasks; a great harness can't make a weak model reason past its ceiling. So the axes that matter are:

Model strength on hard tasks — can it actually fix the gnarly bug, not just the toy one?
Harness reliability — does it survive a forty-minute run: compaction, recovery from loops, checkpoints?
Safety — what stops it running rm -rf or leaking a secret?
Extensibility & ownership — can you bend it, script it, run your own models?
Cost and friction — what does a day of heavy use actually cost?

No single agent wins all five. That's the whole point.
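
To make "multiply rather than add" concrete, here is a small sketch. The scores are made-up numbers on a 0 to 1 scale, not measurements of any real agent, and the function names are ours.

```python
# Illustrative only: the scores below are invented, not benchmark results.

def additive(model: float, harness: float) -> float:
    """Average of the two scores: a weak term is partly hidden by a strong one."""
    return (model + harness) / 2


def multiplicative(model: float, harness: float) -> float:
    """Product of the two scores: a weak term drags the whole result down."""
    return model * harness


# (model score, harness score) for three hypothetical pairings
pairings = {
    "great model, sloppy harness": (0.9, 0.3),
    "weak model, great harness": (0.3, 0.9),
    "good model, good harness": (0.7, 0.7),
}

for name, (m, h) in pairings.items():
    print(f"{name}: average={additive(m, h):.2f} product={multiplicative(m, h):.2f}")
```

Under an average, the lopsided pairings look nearly as good as the balanced one (0.60 against 0.70). Under a product they score roughly half as much (0.27 against 0.49), which is closer to how these tools feel on a long task.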

The contenders

OpenAI Codex CLI — the strongest engineer in the room. Codex pairs the best coding models available (the GPT-5.x-Codex family) with a harness that takes the hard problems seriously. It's the only one of the five where sandboxing is architecture, not an afterthought: every command runs inside an OS-enforced jail (Seatbelt on macOS, bubblewrap on Linux) with network denied by default, and the model can request escalation as part of its tool schema. The core is a protocol-first Rust engine, so one codebase drives the terminal, the IDE extension, codex exec and the cloud — and because OpenAI trains the models with this harness (they speak its custom patch format natively), model and harness reinforce each other. It's also the most token-efficient of the lot. Weaknesses: the models are closed and get deprecated on OpenAI's schedule, the cloud model isn't selectable, and its extension story (hooks, skills, plugins) is younger than its rivals'.
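
A rough sketch of what "escalation as part of the tool schema" can look like. This is not Codex's actual schema or code: the ShellCall fields and the decide function are hypothetical names we made up, and the real enforcement is done by the OS sandbox, not by a Python check.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RUN_SANDBOXED = "run inside the sandbox"
    ASK_USER = "ask the user to approve escalation"


@dataclass(frozen=True)
class ShellCall:
    """Hypothetical tool-call shape; the field names are ours, not Codex's."""
    command: tuple
    request_escalation: bool = False  # the model asks to leave the sandbox
    justification: str = ""           # its stated reason for asking


def decide(call: ShellCall) -> Decision:
    """Default to the jail; only an explicit request is put in front of the user."""
    if call.request_escalation:
        return Decision.ASK_USER
    return Decision.RUN_SANDBOXED
```

The design point is that the model does not have to fail against the sandbox and guess why. It can say up front that a command needs more access, and the harness turns that into an approval prompt.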

Gemini CLI — the most feature-complete, and free. Google's CLI is the most generous on-ramp in the field: a real free tier, a 1-million-token context window, and a genuinely impressive harness — loop detection that injects corrective guidance instead of just giving up, automatic fallback to a faster model on quota exhaustion, shadow-git checkpointing with /rewind, an A2A server for multi-agent setups, a VS Code companion, skills, hooks and subagents. It has quietly absorbed the best ideas from everyone. Weaknesses: it's at its best on Gemini models, the feature surface is sprawling, and on the hardest reasoning tasks community sentiment still places it a notch below Codex and Claude.
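
Loop detection with corrective guidance can be sketched in a few lines. This is our own minimal version, not Gemini CLI's implementation: the class name, the threshold of three, and the wording of the message are all assumptions.

```python
from collections import deque
from typing import Optional


class LoopDetector:
    """Flags a run of identical tool calls. Threshold and wording are assumptions."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.recent: deque = deque(maxlen=threshold)

    def observe(self, tool: str, args: str) -> Optional[str]:
        """Record one tool call; return a guidance message if the agent is looping."""
        self.recent.append((tool, args))
        looping = len(self.recent) == self.threshold and len(set(self.recent)) == 1
        if not looping:
            return None
        self.recent.clear()  # give the model a fresh window after the nudge
        return (
            f"You have called {tool} with the same arguments "
            f"{self.threshold} times. Stop and try a different approach."
        )
```

A harness would call observe after every tool call and, when it returns a message, append that message to the conversation instead of aborting the run.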

OpenCode — the architecture astronaut's pick, in a good way. OpenCode is built server-first: an HTTP server with an OpenAPI spec, and the TUI, desktop app, web UI and GitHub bot are all thin generated-SDK clients of it. It's aggressively provider-agnostic (bring any model), expresses "plan mode" and sub-agents as plain permission rulesets rather than hard-coded modes, ships git-based checkpoint/revert, and is mid-flight on an event-sourced V2 redesign. Weaknesses: that rewrite means churn, there's no sandbox, and your experience is only as good as the model you plug in.
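
The "modes are just permission rulesets" idea is easy to show in miniature. The rule names, patterns and actions below are illustrative and are not OpenCode's configuration format.

```python
from fnmatch import fnmatch

# Each "mode" is only data: an ordered list of (tool pattern, action) rules.
# These rulesets are invented for illustration.
BUILD = [("*", "allow")]
PLAN = [("read", "allow"), ("grep", "allow"), ("glob", "allow"), ("*", "deny")]
REVIEW_SUBAGENT = [("read", "allow"), ("bash", "ask"), ("*", "deny")]


def resolve(ruleset, tool: str) -> str:
    """Return the action of the first rule whose pattern matches the tool name."""
    for pattern, action in ruleset:
        if fnmatch(tool, pattern):
            return action
    return "deny"  # nothing matched: fail closed
```

With this shape, adding a new mode or a restricted sub-agent means writing another list, not another code path in the agent loop.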

Kimi Code — the cleanest layering, and a cost story. Moonshot's CLI is the best-factored codebase of the group: three crisp layers — kosong (LLM abstraction), kaos (OS abstraction), agent-core (the loop) — with event-sourced persistence you can replay in a visual debugger and cache-aware compaction that weighs prompt-cache economics before squashing history. Kimi models are strong coders and cheap, and it speaks ACP for editors like Zed. (A fun detail: its TUI is built on Pi's own UI library.) Weaknesses: it's tuned best for Kimi models, it's newer, and the community around it is still small.
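
Event-sourced persistence is what makes that kind of replay possible: the session is stored as an append-only log, and state is rebuilt by folding over it. Here is a minimal sketch under our own assumptions. The event types and field names are hypothetical, not Kimi Code's log format.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SessionState:
    messages: List[str] = field(default_factory=list)
    tokens_used: int = 0


def apply(state: SessionState, event: dict) -> SessionState:
    """Fold one event into the state. Event types here are hypothetical."""
    if event["type"] == "message_added":
        state.messages.append(event["text"])
        state.tokens_used += event["tokens"]
    elif event["type"] == "history_compacted":
        state.messages = [event["summary"]]
        state.tokens_used = event["tokens"]
    return state


def replay(log_lines: List[str], upto: Optional[int] = None) -> SessionState:
    """Rebuild state from the append-only log, optionally stopping after `upto` events."""
    state = SessionState()
    for line in log_lines[:upto]:
        state = apply(state, json.loads(line))
    return state
```

Because nothing is overwritten, stopping the replay early shows the session exactly as it was before a compaction, which is the property a step-through debugger needs.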

Pi — not really a product; a masterclass. Pi ships zero models and refuses features on principle — no sandbox, no permission popups, no plan mode — pushing everything to a TypeScript extension system. What it does ship is the cleanest minimal agent core we read (about 2,700 lines), the best multi-provider abstraction (40-plus providers behind one interface), and a session-as-a-tree design that makes branching and time-travel free. Weaknesses as a daily driver: no safety rails, no cross-session memory, a bus factor of roughly one, and quality that depends entirely on the model you bring. You run Pi to understand agents, or to build your own on top of it — which plenty of people, including the OpenClaw and Kimi Code teams, have done.
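
The session-as-a-tree idea deserves a sketch, because it explains why branching costs nothing. This is our simplified model, not Pi's data structures: Entry and SessionTree are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    id: int
    parent: Optional[int]  # None marks the root of the session
    text: str


class SessionTree:
    """Append-only tree of entries; a branch is just a path from a leaf to the root."""

    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def append(self, text: str, parent: Optional[int] = None) -> int:
        """Add an entry under `parent` and return its id."""
        entry_id = len(self.entries)
        self.entries[entry_id] = Entry(entry_id, parent, text)
        return entry_id

    def context(self, leaf: int) -> List[str]:
        """Walk parent pointers from a leaf to rebuild that branch's history."""
        path = []
        node: Optional[int] = leaf
        while node is not None:
            entry = self.entries[node]
            path.append(entry.text)
            node = entry.parent
        return path[::-1]
```

Branching is appending a second child to an earlier entry, and time-travel is calling context on an older leaf. Neither operation copies or rewrites history.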

| Agent | Strongest at | Weakest at | Open? |
| --- | --- | --- | --- |
| Codex CLI | Model strength, sandboxing, reliability | Closed models, younger extensibility | Harness yes, models no |
| Gemini CLI | Free tier, features, recovery, 1M context | Best only on Gemini; sprawling | Fully open |
| OpenCode | Client/server architecture, ownership | Mid-rewrite churn, no sandbox | Fully open |
| Kimi Code | Clean layering, event sourcing, cost | Kimi-first, young, small community | Fully open |
| Pi | Minimal core, provider abstraction | No safety, no memory, bus factor ~1 | Fully open |

A note on lineage: who built on whom

It's tempting to read these as five isolated inventions. They aren't. This is a small field whose teams read — and reuse — each other's code, and the family tree is worth tracing because it tells you where the reusable ideas actually live.

The clearest root is Pi. Its "toolkit, not product" philosophy isn't just rhetoric — other agents are literally built from its parts. Kimi Code did not write its own terminal UI; it depends on @earendil-works/pi-tui, Pi's own TUI library, and built its component layer on top (you can see it right there in apps/kimi-code/package.json). And OpenClaw, one of the most popular open personal assistants, is built on Pi's agent packages — its whole agent stack is Pi underneath. So while Pi loses a head-to-head as a daily driver, it quietly wins as infrastructure: when you use Kimi Code or OpenClaw, you're partly using Pi.

Then there's the softer lineage of shared conventions, where an idea from one agent becomes a standard everyone adopts:

AGENTS.md, the project-instructions file, originated with Codex and is now read by Pi, OpenCode, Gemini CLI (under its own name, GEMINI.md) and others.
The agentskills.io SKILL.md format is shared, independently, by Pi, Hermes, OpenCode, Kimi Code and Gemini CLI — five teams converging on one open standard.
Shadow-git checkpointing (snapshot before risky edits, /rewind to undo) shows up in both OpenCode and Gemini CLI using the same underlying git trick.
The read / edit / grep / glob / bash tool ergonomics that all five share trace largely back to Claude Code — which is part of why, as we'll see, it casts a long shadow over a comparison it isn't formally in.
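
For the shadow-git item above, one common way to get that effect is to point git at a separate repository directory while using the project as the work tree, so checkpoints never touch the project's own .git. The sketch below only builds the commands. The helper names are ours, and we are not claiming OpenCode or Gemini CLI run these exact invocations.

```python
from pathlib import Path
from typing import List


def shadow_git(shadow_dir: Path, work_tree: Path, *args: str) -> List[str]:
    """Build a git command that stores history in shadow_dir, not the project's .git."""
    return ["git", f"--git-dir={shadow_dir}", f"--work-tree={work_tree}", *args]


def snapshot_cmds(shadow_dir: Path, work_tree: Path, label: str) -> List[List[str]]:
    """Commands to record the current working tree as one checkpoint commit."""
    return [
        shadow_git(shadow_dir, work_tree, "add", "--all"),
        shadow_git(shadow_dir, work_tree, "commit", "--allow-empty", "-m", label),
    ]


def rewind_cmds(shadow_dir: Path, work_tree: Path, commit: str) -> List[List[str]]:
    """Commands to restore files to the versions saved in a checkpoint."""
    return [shadow_git(shadow_dir, work_tree, "checkout", commit, "--", ".")]
```

Each list can be passed to subprocess.run once the shadow directory has been initialised with git init. One limit of this sketch: the checkout restores files that existed at the checkpoint but does not delete files created afterwards, so a real implementation needs an extra cleanup step.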

The lesson for anyone building an agent: you are not starting from a blank page. The good patterns are already open, already converging, and in Pi's case already packaged as libraries you can import.

The verdict among the five

Of the agents we actually read the code for, Codex is the best coding agent — and it isn't especially close on the axis that matters most for serious work. Hard engineering is where the model-times-harness product is most punishing, and Codex maxes both terms: the strongest coding models in the industry, co-designed with a harness that sandboxes by default and is built to grind through long autonomous runs. If your test is "fix the difficult bug in an unfamiliar codebase without supervision and without rm -rf-ing my home directory," Codex is the safe bet.

That doesn't make it the right pick for everyone. If you want zero cost and maximum features, Gemini CLI. If you want to own the whole stack and run your own models, OpenCode or Pi. If you live in Kimi's ecosystem and care about cost, Kimi Code. "Best" is conditional on what you're optimising for — but if you force a single ranking on raw coding capability, Codex tops the five.

The asterisk: the best harness we couldn't read

Here's the honesty the title promised. We left Claude Code out of the code study for a simple reason — its harness isn't open source the way the others are, so there was nothing to dissect at the same depth. But leaving it out of the code study is not the same as leaving it out of the conversation, because by reputation, by community blind-comparison, and by the simple fact that everyone else keeps adopting its conventions, Claude Code arguably has the best harness in the field.

Consider the evidence that doesn't require reading its source:

It set the conventions the others copied. The ergonomics of the modern coding-agent tool suite — the read / edit / grep / glob / bash design, the AGENTS.md-style project memory, plan mode, sub-agents, the skills format, MCP support — Claude Code either originated or popularised much of this. When you see five independent harnesses converge on the same patterns, a lot of those patterns trace back to one place.
Blind output comparisons keep favouring it. In the community write-ups that pit agents head-to-head on real tasks, Claude Code's output is preferred more often than not — frequently over Codex — even when Codex uses fewer tokens. The recurring verdict is "Codex is more efficient, Claude Code is more right."
Its harness sophistication is visible from the outside. Sub-agent orchestration, a genuinely good permission UX, hooks, output styles, tight IDE integration, careful context management — the surface area you interact with is the most polished in the category, and polish in a harness is exactly what determines how much of the model's capability you actually realise.

So the fair summary is not "Codex is best, full stop." It's this: Codex and Claude Code share the top tier, trading blows (Codex wins on raw model strength, sandboxing and token efficiency; Claude Code wins on harness craft and output quality), and the other four open agents serve different masters (cost, ownership, learnability, ecosystem). The reason Codex "wins" our shootout is partly an artefact of the rules: we ranked the agents whose code we could read, and the strongest harness in the field wasn't one of them.

There's a small irony worth admitting: the research behind this entire series — cloning six repos, reading tens of thousands of lines, drafting these posts — was itself done with a coding agent. The tools have gotten good enough to study, and explain, each other.

What to actually use

You want the strongest autonomous coding agent and don't need to own it: Codex, or Claude Code. Try both on your own hard task and trust your eyes over any benchmark — these two are close enough that fit-to-your-workflow decides it.
You want maximum capability at zero cost: Gemini CLI.
You want to own the stack, pick your models, or self-host: OpenCode (as a product) or Pi (as a foundation).
You're cost-sensitive or already in Moonshot's ecosystem: Kimi Code.
You're building your own agent: read Pi for the core, Codex for sandboxing, OpenCode for the client/server split, Kimi Code for event sourcing, Gemini CLI for recovery engineering — and study how Claude Code behaves, because it's the reference the others are quietly chasing.

The models will keep leapfrogging each other quarter by quarter; the rankings above have a short shelf life. What won't change is the lesson under all four posts in this series: the model sets the ceiling, but the harness decides how close to it you get to live. Choose the harness you'll actually enjoy operating — that, more than any benchmark, is what determines whether the agent helps you.

That closes our four-part series on AI agent architecture. All six harnesses we studied are open source; if these posts made you curious, the code is the best teacher we found.

By eric
