
Last Update: June 13, 2026
BY
eric
Keywords
In the previous post we compared three very different AI agents and arrived at a claim that deserves its own article: the model is a commodity; the harness is the product. This post is the deep dive we promised.
Since then we've gone further and read the source code of six agent harnesses end to end: Pi, Hermes Agent, OpenCode, Moonshot's Kimi Code, OpenAI's Codex CLI (the Rust harness is fully open even though the models are not), and Google's Gemini CLI. Six independent teams, four languages, wildly different products — and, as we'll see, they have all converged on the same set of organs. That convergence is the strongest evidence there is for what a harness actually has to do.
The naive agent is four lines long
Strip everything away and an AI agent is this:
while not done:
response = model(context)
results = execute(response.tool_calls)
context += response + results
You can write that in an afternoon, point it at a frontier model, and it will genuinely work — for about ten minutes. Then reality arrives. The context window fills up. A tool dumps 80,000 tokens of build log into the conversation. The model edits the wrong file because two files have similar names. A network blip kills a forty-minute run. The model gets stuck retrying the same failing command forever. The user wants to say "no, not that way" while it's working. And someone, eventually, asks the agent to rm -rf something it shouldn't.
Every line of the six codebases we read exists because of that second ten minutes. The model supplies the reasoning per turn; the harness decides what the model sees, what it can touch, what survives, and what happens when things go wrong. Those four decisions are most of an agent, and none of them are AI.
Job 1: Deciding what the model sees
The single biggest misconception about agents is that the model "has" the conversation. It doesn't. Every turn, the harness constructs what the model sees from scratch — and context is a scarce, expensive, attention-diluting resource. The engineering here has three layers.
Construction. The system prompt is assembled from parts: identity, tool descriptions, project instruction files (the AGENTS.md convention Codex originated, Gemini's GEMINI.md hierarchy), date, working directory. Skills — procedural knowledge files — are injected by name and one-line description only; the agent reads the full file on demand. This "progressive disclosure" pattern, now an open standard (agentskills.io) adopted by Pi, Gemini CLI and others, is how an agent carries a large library of know-how at a standing cost of a few dozen tokens each.
Compaction. When the window fills, history must be squashed — and doing this badly is the number one cause of agents "forgetting" what they were doing. The mature harnesses treat it as a real algorithm, not a hand-wave. Pi only cuts at valid cut points (never in the middle of a tool call/result pair), summarises into a structured schema — goal, constraints, progress, decisions, next steps — and carries cumulative lists of every file read and modified across any number of squashes, so "what have I touched" is never lost. Kimi Code goes further with cache-aware micro-compaction that weighs the cost of invalidating the prompt cache before deciding what to squash.
Cache discipline. That last point deserves emphasis because it's invisible from the outside: LLM providers cache prompt prefixes, and a harness that rewrites earlier context casually pays full price for every turn. Hermes keeps its system prompt byte-stable for the lifetime of a conversation as a hard design invariant; Codex's own contributor guide codifies prompt-cache rules. Context isn't just curated — it's curated append-only whenever possible.
Job 2: The tool contract
Tools are where the model's text meets the real world, and the contract is fussier than it looks.
Output is budgeted. Every harness truncates tool output with hard limits — Pi caps file reads at 2,000 lines or 50KB, grep at 100 matches, with the full output spilled to a temp file the model can fetch if it actually needs it. Without this, one chatty command destroys the context budget (see Job 1).
Edits are engineered. The deceptively hard tool is "edit a file." The naive approach — let the model rewrite the whole file — is slow and destructive. The six harnesses showcase three generations of answers: exact string replacement with fuzzy fallback (Pi normalises smart quotes, Unicode dashes and trailing whitespace before giving up); a custom patch grammar the model is trained on (Codex's apply_patch); and self-correcting replace, where a failed match triggers an LLM side-call to repair the edit (Gemini CLI). Each is a harness answering the same question: how do you make a probabilistic text generator perform a deterministic operation reliably?
Execution is stateful. Codex's "unified exec" gives the model a persistent PTY session it can write stdin to — so it can run a REPL, answer an interactive prompt, keep a dev server running. Gemini CLI parses shell commands with tree-sitter to block command substitution and enforce allowlists before anything runs. The shell tool in a mature harness is closer to a tiny operating system interface than a subprocess call.
Scheduling matters. Tool calls within a turn run in parallel by default in Pi — unless any tool in the batch declares itself sequential, because parallel file edits to the same file are a race. Kimi Code has a conflict-aware scheduler for exactly this. These are classic concurrency problems, just with a language model as the caller.
Job 3: The safety boundary
Here the six harnesses genuinely diverge, and the divergence is instructive because it's philosophical, not technical.
Codex treats sandboxing as architecture. Every command runs inside an OS-enforced sandbox — Seatbelt on macOS, bubblewrap + seccomp on Linux — with the filesystem scoped and network denied by default. The clever part is the layering: the sandbox is the technical boundary, the approval policy is the human boundary, and they are configured independently. The model can request escalation, and that request is itself part of the tool schema.
Pi and Hermes refuse to pretend. Pi ships no sandbox and no permission popups at all, on the stated grounds that in-process "sandboxing" of an agent that runs your toolchain is security theatre — real isolation comes from the OS or a container, so use one. Hermes's SECURITY.md says the same thing in writing: OS-level isolation is the only boundary; everything else is best-effort. You may disagree with shipping that default, but the honesty is clarifying — and it's the correct analysis even for the harnesses that do more.
OpenCode and Gemini CLI turn policy into data. OpenCode's "plan mode" is not a mode at all — it's a permission ruleset (edit: deny, plus a carve-out directory for writing plans). Sub-agents are permission rulesets too. Gemini CLI has a TOML policy engine behind its approval modes. The lesson: don't hard-code behavioural modes; express them as declarative policy and the modes fall out for free.
Job 4: Surviving — persistence and recovery
A forty-minute agent run accumulates state that must outlive crashes, restarts, and the model's own mistakes.
Sessions are logs, not snapshots. The convergent design is append-only: Pi's session is a JSONL tree (every entry has an id and parentId, so branching and time-travel are free); Kimi Code event-sources everything to a wire.jsonl you can literally replay in a visual debugger; OpenCode's v2 redesign is an event-sourced SQLite core. Nobody who has operated an agent at scale stores "current state" — they store what happened and derive state from it.
The escape hatch is git. Both OpenCode and Gemini CLI maintain a shadow git repository (using git's objects/info/alternates mechanism, so it's cheap even on huge repos) and snapshot before risky operations — giving the user /rewind and the agent a guilt-free undo. The agent will sometimes do the wrong thing; the harness's job is to make the wrong thing cheap.
Failure is classified, not caught. Mature loops distinguish rate limits from auth failures from transient network errors, with different policies for each — exponential backoff here, model fallback there (Gemini CLI automatically falls back to Flash on quota exhaustion), a hard stop elsewhere. And the most human failure mode — the agent stuck in a loop, trying the same fix forever — gets dedicated machinery: Gemini CLI detects loops and injects corrective guidance rather than just aborting; OpenCode escalates doom-loops into a permission ask, putting the human back in the loop exactly when the model has stopped making progress.
Steering without breaking. Users want to redirect a running agent without killing it. Pi's answer is two queues with different semantics: steering messages are injected between turns (the agent course-corrects mid-task), follow-ups are delivered only when it would otherwise stop. A small design, but it's the difference between an agent you collaborate with and one you can only fire-and-forget.
Job 5: The engine/face split
The last organ is structural. Every mature harness has separated the engine (the loop, tools, state) from the face (whatever the user is looking at) behind an explicit protocol: Codex's core speaks a typed submission-queue/event-queue protocol, and the TUI, the IDE extension, codex exec, and the SDK are all just clients of it; OpenCode is an HTTP server with an OpenAPI spec, and the terminal UI, desktop app, web UI and GitHub bot are all generated-SDK clients; Gemini CLI routes tool confirmations through a serialisable message bus, which is what lets the same agent run interactively, headless in CI, or behind its Agent2Agent server.
This is the harness decision with the highest long-term leverage, because every future surface — IDE plugin, chat bot, CI job, another agent — is a client you haven't written yet.
Convergent evolution, and the two real differences
Six teams, working independently, in Rust, TypeScript, and Python, shipped the same five organs:
- Context engineering — constructed views, progressive disclosure, structured compaction, cache discipline
- A tool contract — schemas, output budgets, engineered edits, concurrency rules
- A policy boundary — permissions and/or sandboxing, increasingly expressed as data
- Append-only persistence with cheap undo — event logs plus git checkpoints
- An engine/face protocol — one core, many clients
When independent evolution produces the same anatomy, that anatomy is the job description. If you are building an agent, this list is your roadmap.
What actually differentiates the six is just two things. Sandboxing depth — only Codex treats OS-level isolation as load-bearing architecture; everyone else delegates it to you. And what persists across sessions — only Hermes ships a learning loop (self-written skills, a curating background process, searchable memory of every past conversation); the rest start each session from zero plus instruction files. Our bet, for what it's worth: the second gap is where the next round of competition happens, because per-session capability is converging fast.
Why the model doesn't make this go away
The obvious objection: models keep getting better — won't a sufficiently smart model just not need all this scaffolding?
The evidence so far says no, for a structural reason: the harness governs the realised fraction of the model's capability, and that fraction is set by information logistics, not intelligence. A smarter model still cannot reason about a file that was truncated out of its context, still cannot remember a decision that compaction discarded, still cannot un-delete a directory, and still burns money if its cache is invalidated every turn. Model progress raises the ceiling; the harness determines how close to the ceiling you operate. If anything, better models raise the stakes on the harness — a model capable of an eight-hour autonomous run is only useful inside a harness whose compaction, checkpointing and recovery can survive eight hours.
The strongest proof is the vendors themselves. OpenAI co-designs the Codex models with the Codex harness — the models are trained to use its specific patch grammar and escalation protocol — and the same model measurably performs differently in different harnesses. If the harness were incidental, none of that investment would make sense.
What this means in practice
If you're choosing an agent: stop comparing models and start comparing harnesses. Ask how it compacts, what it does when an edit fails, whether you can rewind, what's sandboxed, and what survives a restart. Those answers predict your daily experience far better than a benchmark score.
If you're building one — as we are — the encouraging news is that the harness is the one layer where a small team can genuinely compete. You cannot train a frontier model; you can absolutely write a better loop. Pi's entire agent runtime is about 2,700 lines of TypeScript. The five organs above are a few thousand lines of careful, testable, boring code — and all six of the codebases we studied are open. Read Pi for the cleanest minimal core, Codex for sandboxing and the protocol-first split, OpenCode for the client/server architecture, Kimi Code for event sourcing, Gemini CLI for recovery engineering, and Hermes for memory and self-improvement.
The model reasons. Everything else is the harness — and everything else, it turns out, is most of the work.
Next in the series: skills and memory — how an agent captures knowledge, why a human still has to write some of it down, and what "self-improving" actually means in code.
Previous Article
Jun 13, 2026





Comments (0)
Leave a Comment