How a Coding Agent Actually Writes Code

We called the previous post the last in the series. That was the telescope view — what agents are, what a harness does, how they learn, which one wins. But a developer doesn't experience an agent as an architecture diagram. You experience it one edit at a time: it changes three lines, runs the tests, reads the failure, tries again. So here is the microscope view — how a coding agent actually turns an instruction into a working, tested change, held up against how a human programmer does the same thing.

The human inner loop

Picture your own setup. You have an editor where you type, a terminal where you compile and run, and some way to see the result — a test runner's green bar, a printed value, a web UI you reload in a browser. Around it sits the big lifecycle everyone draws on whiteboards: idea → plan → architecture → design → write. But that's the outer loop. The thing you actually do all day is the inner loop, and it's tight: write a little, run it, look at what happened, adjust. Seconds per cycle.

Two habits define how humans fill that loop. First, we mostly build small and grow — a few lines, run, a few more — keeping a working program at almost every step, rather than generating a whole finished file and hoping. Second, we choose a direction: vertically (one feature sliced end-to-end through every layer until it runs) or horizontally (one layer built out across features). Either way the program is alive in front of us the whole time. We glance at the screen and instantly know if it's broken.

That last point is the one that matters most for what follows, because it's exactly the sense an agent doesn't have.

The agent inner loop

An agent runs the same shape of loop — instruction in, change out, observe, repeat — but through a fundamentally narrower keyhole. Its loop is: the harness sends the model the context, the model emits a tool call (edit this file, run this command), the harness executes it and feeds the text result back, and the model decides what to do next. That's it. The agent has no eyes. It cannot glance at a screen. Its only senses are the text that tools return — file contents, stdout, stderr, an exit code. Everything a human takes in visually — the rendered web page, the debugger paused on a line, the red squiggle under a typo — has to be converted into text and handed back, or it doesn't exist for the agent at all.
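The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular harness's implementation: `ToolCall`, `execute`, `scripted_model` and `agent_loop` are hypothetical names, and the "model" is a scripted stand-in so the example runs without an API. The point it shows is that everything the model learns about the world arrives as a string.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str      # "run" or "done" in this sketch
    argument: str  # a shell command, or the final message


def execute(call: ToolCall) -> str:
    """Run a tool call and return everything the model will 'see', as text."""
    proc = subprocess.run(call.argument, shell=True, capture_output=True, text=True)
    return f"exit_code={proc.returncode}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"


def scripted_model(transcript: list[str]) -> ToolCall:
    """Hypothetical stand-in for a model: one command, then it stops."""
    if len(transcript) == 1:
        return ToolCall("run", "echo hello")
    return ToolCall("done", "finished")


def agent_loop(instruction: str, model=scripted_model, max_turns: int = 10) -> list[str]:
    """Send context, receive a tool call, execute it, feed the text back, repeat."""
    transcript = [f"user: {instruction}"]
    for _ in range(max_turns):
        call = model(transcript)
        if call.name == "done":
            transcript.append(f"assistant: {call.argument}")
            break
        # The tool result is the agent's only sense: plain text appended to context.
        transcript.append(f"tool_result: {execute(call)}")
    return transcript
```

Calling `agent_loop("say hello")` returns a three-entry transcript: the instruction, one tool result containing the exit code and captured output, and the final message. A real harness would add edit tools, output limits and permission checks around the same skeleton.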

This single fact explains almost everything distinctive about how agents code. It's why output truncation matters (an 80,000-line log dumped whole blinds the agent as surely as no output at all), why edit-acceptance is fiddly, why verification has to be deliberate, and why debugging, the most visual and interactive thing a human does, is the hardest part to translate.

It also shifts the economics of "small and grow." Each model round-trip costs real tokens and seconds, so an agent can't afford the human's keystroke-level incrementalism; it tends to write a whole function or file in one shot, then verify. The good ones pull back toward small steps anyway — write, run, read, fix — because a 200-line blob that fails gives the model almost nothing to reason about, whereas a 20-line change that fails points right at the problem. The tension between "big chunks are cheaper per round-trip" and "small chunks are debuggable" is the central rhythm of agentic coding.

The crux: turning intent into a changed file

Here is the most under-appreciated micro-mechanic in the whole stack: how does the model's intention to change code become an actually-changed file on disk? This step — call it output acceptance — is where coding agents most visibly differ, and where they most often fail. The five harnesses we read take three distinct approaches:

Exact string replacement. The model supplies the old text and the new text; the tool finds the old verbatim and swaps it. This is what Claude Code's Edit and Pi's edit do. It's surgical and the resulting diff is trivially reviewable, but it's brittle: if the model's remembered snippet is off by a space, the match fails. Pi softens this with a fuzzy fallback that normalises smart quotes, Unicode dashes and trailing whitespace before giving up.
A patch grammar. The model emits a structured diff in a format it was trained on — Codex's *** Begin Patch … *** End Patch apply_patch syntax. This handles multi-hunk edits cleanly and reads like a real patch, at the cost of the model having to produce exactly-valid patch syntax.
Self-correcting replace. Gemini CLI's replace tool, on a failed match, fires a follow-up model call to repair its own edit rather than just erroring — trading tokens for a higher landing rate.
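To make the first approach concrete, here is a minimal sketch of exact string replacement with a normalising fallback. It is written for illustration under stated assumptions and is not the code of Claude Code or Pi: `normalise` and `apply_edit` are hypothetical names, and the fallback is deliberately crude in that it writes back the normalised file rather than mapping the match onto the original bytes.

```python
def normalise(text: str) -> str:
    """Fold smart quotes and Unicode dashes to ASCII; strip trailing whitespace per line."""
    table = str.maketrans({
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en dash, em dash
    })
    return "\n".join(line.rstrip() for line in text.translate(table).split("\n"))


def apply_edit(content: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` identifies exactly one location."""
    count = content.count(old)
    if count == 1:
        return content.replace(old, new, 1)
    if count > 1:
        raise ValueError(f"old text matches {count} times; add more surrounding context")
    # Fuzzy fallback (simplified): retry the match on normalised text.
    norm_content, norm_old = normalise(content), normalise(old)
    if norm_content.count(norm_old) == 1:
        return norm_content.replace(norm_old, new, 1)
    raise ValueError("old text not found; re-read the file and try again")
```

The important property is the failure behaviour: an ambiguous or missing match raises an error the model can read and react to, rather than changing the wrong place.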

A failed edit is the single most common micro-failure in agentic coding, and each of those designs is one harness's answer to it. It's also exactly where the "native programmer versus scripter" observation enters, but that deserves its own section below.

Calling the tools: compile, run, build, spawn

Once a file is changed, the agent needs hands to do something with it, and those hands are a single tool: shell execution (Bash in Claude Code and Pi, exec/unified-exec in Codex, a PTY shell in Gemini and Kimi Code). Through that one door it compiles, runs the test suite, starts a build, installs a dependency, spawns a server.

The interesting wrinkles are all about state and blindness:

Statefulness. A naive subprocess.run can't hold a REPL open or answer an interactive prompt. Codex's "unified exec" gives the model a persistent PTY it can write stdin to; the mature shells track background processes so the agent can start a dev server and keep working while it runs.
Truncation. Every harness caps tool output (Kimi Code's read tool stops at 1,000 lines or 100 KB; Pi spills overflow to a temp file). Without it, one chatty build destroys the context budget.
The blind spot, literally. The agent can npm run dev and start a web UI — but it cannot look at the page. So it curls the endpoint, greps the logs, or, increasingly, drives a headless browser and feeds itself a screenshot through a vision model. Watching an agent start a server it fundamentally cannot see is the clearest reminder of the text-keyhole problem.
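The truncation point can be sketched as follows. This is an assumption-laden illustration rather than any harness's actual policy: `truncate_output` and `run_tool` are hypothetical names, the 200-line default is arbitrary, and keeping the head and tail is one design choice among several. Spilling the full output to a temporary file follows the general idea attributed to Pi above.

```python
import subprocess
import tempfile
from typing import Optional


def truncate_output(text: str, max_lines: int = 200, spill_path: Optional[str] = None) -> str:
    """Keep the first and last lines of long output and mark what was dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    keep = max(max_lines // 2, 1)
    omitted = len(lines) - 2 * keep
    where = f"; full output saved to {spill_path}" if spill_path else ""
    marker = f"[... {omitted} lines omitted{where} ...]"
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])


def run_tool(command: str, max_lines: int = 200) -> str:
    """Run a shell command and return exit code plus capped output as one string."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    spill_path = None
    if len(output.splitlines()) > max_lines:
        # Overflow goes to disk so the agent can grep it later if it needs to.
        with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as spill:
            spill.write(output)
        spill_path = spill.name
    return f"exit_code={proc.returncode}\n" + truncate_output(output, max_lines, spill_path)
```

Keeping both ends is a judgement call: build tools often print the command at the top and the actual error at the bottom, so dropping the middle tends to lose the least.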

Tests and verification

Because the agent can't eyeball correctness, running things is its substitute for seeing — which makes tests and verification not a virtue but a necessity.

On creating tests, agents are comfortable both ways: writing the test first and coding to green (some skills enforce test-first explicitly), or writing tests after to lock in behaviour. Their real edge is that they are tireless about running the full suite, the linter and the type-checker on every change — the chores humans skip when tired.

Their matching weakness is that a language model will cheerfully announce "done" without having run anything. This is why mature harnesses and skills bolt on a verification discipline: re-read the file you just edited, run the build, run the tests, and only then claim success — evidence before assertion. The agent's superpower (re-run the world twenty times without complaint) and its failure mode (declare victory unverified) are two sides of the same coin, and verification gates exist to keep it on the right side.
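A verification gate of the kind described above might look like this minimal sketch. The names `CheckResult` and `verify` are hypothetical, and the commands in the usage note are placeholders for whatever a given project actually runs; the only idea being shown is that "done" is derived from exit codes, not from the model's say-so.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    command: list[str]
    exit_code: int
    output: str


def verify(checks: list[list[str]]) -> tuple[bool, list[CheckResult]]:
    """Run each check in order; succeed only if every one exits with code 0."""
    results: list[CheckResult] = []
    for command in checks:
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(CheckResult(command, proc.returncode, proc.stdout + proc.stderr))
        if proc.returncode != 0:
            # Stop at the first failure: its output is the evidence to read next.
            return False, results
    return True, results
```

A harness could call, for example, `verify([[sys.executable, "-m", "pytest", "-q"]])` (assuming the project uses pytest) and allow the agent to report success only when the first element of the result is `True`.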

Debugging: the widest gap of all

Now the part of the inner loop that translates worst. Here is the canonical human bug hunt, in *nix muscle memory:

1. grep the error string across the tree.
2. Open the file at the offending line, read the surrounding context.
3. Set a breakpoint.
4. Run, and when execution pauses, step through it (step over, step into, step out), watching variables change.
5. Catch the moment the state goes wrong, and you've found the cause.

An agent does steps 1 and 2 almost exactly as a human does: grep the error, read the context around the line. Then the methods fork hard, because steps 3 and 4 are mostly unavailable to it. Agents rarely drive an interactive debugger; they can't easily sit at a paused frame and inspect live state. (Some can speak the Debug Adapter Protocol through an MCP server, but it's the exception, not the daily path.) So instead of stepping, the agent debugs the way we all did before debuggers existed, just faster: it inserts print/log statements, re-runs, reads the output, reasons about it, removes them. Printf debugging at machine speed.

The deep difference is one of substrate. A human debugs a live, paused, inspectable program — you can poke it while it's frozen mid-thought. An agent debugs a dead transcript of what the program already emitted. It loses the ability to interrogate a running frame, and it compensates with two things humans can't match: it will happily re-run the failing case fifty times with different instrumentation, and it can hold the entire call path in context at once rather than stepping through it line by line. Slower per insight, but relentless and wide.
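The instrument, re-run, read cycle can be sketched as below. Everything here is illustrative: `BUGGY` is a made-up program with a deliberate bug, and `instrument` and `run_source` are hypothetical helpers, not tools any of the harnesses ship. The probe is inserted at the indentation of the line it follows, so this simple version only works after a plain statement, not after a line that opens a block.

```python
import os
import subprocess
import sys
import tempfile

# A made-up program with a bug: `total = v` should be `total += v`.
BUGGY = '''\
def mean(values):
    total = 0
    for v in values:
        total = v
    return total / len(values)

print(mean([2, 4, 6]))
'''


def instrument(source: str, after_line: int, expr: str) -> str:
    """Insert a print of `expr` after the given 1-indexed line, matching its indentation."""
    lines = source.splitlines()
    target = lines[after_line - 1]
    indent = target[: len(target) - len(target.lstrip())]
    probe = f'{indent}print("DEBUG", {expr!r}, "=", {expr}, flush=True)'
    return "\n".join(lines[:after_line] + [probe] + lines[after_line:]) + "\n"


def run_source(source: str) -> str:
    """Run the source in a fresh interpreter and return its combined text output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
    try:
        proc = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
        return proc.stdout + proc.stderr
    finally:
        os.unlink(f.name)
```

Running `run_source(instrument(BUGGY, 4, "total"))` yields the lines `DEBUG total = 2`, `DEBUG total = 4`, `DEBUG total = 6` and then `2.0`. The transcript shows the running total being overwritten rather than accumulated, which is the kind of evidence an agent reasons from in place of a paused frame.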

The observation: native programmer vs. scripter

Which brings us to the thing you actually notice after watching these agents work for a while: they have different programming temperaments, and Claude Code behaves the most like a native *nix programmer.

By "native" we mean it reaches, by default, for the same surgical tools a seasoned terminal user does — grep to locate, a targeted read of the exact line range, a precise string-replace edit, the right small command for the job — and it leaves behind a clean, reviewable trail of minimal diffs. Watching it is like watching someone who has grep, sed, cut, head and tail in their fingers and uses them with intent.

The contrast: when faced with a bulk change — rename this symbol across forty files, rewrite every import — some agents (Codex and Gemini among them, in our observation) are more inclined to write a throwaway script to do it: a python3 -c "...", a perl -pi -e, a sed -i one-liner run through the shell, rather than a series of surgical edits. It works, and for genuinely mechanical mass-edits it's even the right call. But it has a cost: a thrown-away script's effect is harder to review than a diff — you have to reconstruct what it did — and a slightly wrong regex can silently mangle forty files at once, where a string-replace edit simply fails loudly on the one it couldn't match.
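The review-and-safety trade-off above can be seen in a small sketch. The file contents and the names `FILES`, `scripted_rename` and `surgical_edit` are invented for illustration; the regex behaviour is standard Python `re`.

```python
import re

# Invented two-file "project" for the illustration.
FILES = {
    "models.py": "username = 'ada'\n",
    "views.py": "user = load()\nprint(user)\n",
}


def scripted_rename(files: dict[str, str], pattern: str, replacement: str) -> dict[str, str]:
    """Bulk regex rewrite across every file; it never reports what it touched."""
    return {path: re.sub(pattern, replacement, text) for path, text in files.items()}


def surgical_edit(text: str, old: str, new: str) -> str:
    """Exact replacement that refuses to act unless `old` matches exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)


# A slightly wrong pattern silently rewrites `username` as well.
sloppy = scripted_rename(FILES, r"user", "account")
# Word boundaries make the same script correct.
careful = scripted_rename(FILES, r"\buser\b", "account")
```

Here `sloppy["models.py"]` becomes `accountname = 'ada'` with no warning, while `careful` leaves that file untouched. By contrast, `surgical_edit(FILES["models.py"], "user = load()", "account = load()")` raises a `ValueError`, because the text it was told to replace is not in that file.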

Two honest caveats, because this is the part most easily overstated. First, it's a tendency, not a wall: every one of these agents can use native tools, and every one can write a script; the difference is which it reaches for first, and that's shaped as much by the model's temperament as by the harness's tool design (a patch-grammar editor nudges toward batch thinking; a string-replace editor nudges toward surgical thinking). Second, these behaviours drift from model version to model version, so any snapshot like this one has a short shelf life. The durable point isn't "Claude Code wins the bug hunt" — it's that there is a real, observable spectrum between surgical and scripted code-editing, and it changes how reviewable and how safe the agent's work is. When you pick an agent, that temperament is something you can feel within an hour, and it matters more to the day-to-day than any benchmark.

There's a fitting irony to end on: this post about how agents wield grep and surgical edits was itself drafted by one, reaching for exactly those tools as it went. The microscope, it turns out, works on the thing holding it.

Where micro meets macro

Across five posts the same theme keeps surfacing from different distances. The harness (macro) decides which tools exist and how their results come back; the model (micro) decides which tool to reach for and how big a bite to take. A great coding experience is when those align: a native-feeling toolset in the harness, and a model with the judgement to be surgical when a scalpel is called for and to script when a bulldozer is. The model sets the ceiling; the harness and the editing temperament decide how close to it you actually get, one edit, one test, one bug at a time.

This is a micro-level companion to our four-part series on AI agent architecture. All five harnesses behind these observations are open source; the best way to see the difference between a surgical edit and a thrown-away script is to watch two agents solve the same task and read what each left behind.
