preloader
post-thumb

Last Update: June 14, 2026


BYauthor-thumberic

|Loading...

Keywords

The first five posts in this series were the telescope and the microscope — what agents are, what a harness does, how they learn, which one wins, and how one actually writes a line of code. All of it was reading — code and behaviour. This post is different: we stopped reading and ran the experiment.

The question: on cheap, local hardware — a 4GB laptop GPU — what actually makes a small model a good coding agent? Not "which frontier model wins" (you can't run those at home), but: given a model that fits in 4 GB, which harness choices matter? Edit format? Retry? Task decomposition? A planner model? We built a measurement harness, wrote a suite of coding tasks with automated pass/fail oracles, and ran roughly 900 trials. The results overturned most of our intuitions.

The setup

A gaming laptop with an RTX 3050 (4 GB VRAM) serving models via Ollama; a separate machine running the agent and holding the code. The harness runs each task in a fresh copy of its workspace, lets the model attempt it, then grades it with a hidden pytest oracle — the test is the only judge, so every result is automated and reproducible.

The task suite is 28 self-contained coding tasks: add a function, fix a bug, small refactors, implement-from-spec, multi-file changes, and a deliberately hard tier (a bowling scorer, a circular buffer, rational-number arithmetic, a 24-hour clock, a book-store discount optimiser, variable-length-quantity encoding). Each oracle was checked two ways — it must fail on the starting code and pass on a correct solution. Calibrated this way, the simplest configuration solves about 71%, leaving real room to measure whether the fancy techniques help.

We varied one lever at a time around that baseline (qwen2.5-coder:3b, whole-file edits, single-shot), ran each task/config combination three times to smooth out the noise of a non-deterministic model, and recorded success rate, tokens, and wall-time.

The results

Config
Success
Sec/run
What changed vs baseline
baseline (coder-3b, whole-file)
71%
2.6
reference
retry (run test, feed failure back)
70%
8.6
no change
decomposition (plan then execute)
64%
5.3
hurt
role-split (planner + editor)
66%
134
hurt + 50x slower
model: qwen3:1.7b
43%
50
much worse coder
model: granite3.3:2b
39%
2.2
much worse coder
edit format: diff
24%
5.9
far worse
edit format: native tool-calls
0%
2.1
impossible

Read that table for a minute, because almost every row is a surprise.

Finding 1: the edit format is the dominant lever

How the model is asked to express a change matters more than anything else. Whole-file rewrite: 71%. Search/replace diff: 24%. Native function-calling: 0%.

That 0% deserves a pause. We gave the model proper OpenAI-style tools and asked it to call them — and across 84 attempts it never once produced a usable tool call. It knew what to do: it emitted perfectly correct JSON like {"name": "read_file", "arguments": {"path": "config.py"}} — but as a markdown code block in its message text, never in the structured tool_calls field the protocol requires. A native-tool agent (Codex, Pi, anything function-calling-based) sees an empty tool-call list and does nothing. The small coder model literally cannot drive a native-tool agent.

Diff (24%) is a softer version of the same lesson: the model produces well-formed search/replace blocks, but on longer files the "search" text drifts from the source by a character and the edit won't apply. Whole-file editing sidesteps both problems — there's nothing to match, nothing to format. For a small model, make the easy thing the only thing.

Finding 2: "best coder" is not "best tool-caller"

We tried three models that fit in 4 GB. The coding-specialised qwen2.5-coder:3b (71%) crushed the newer general models qwen3:1.7b (43%) and granite3.3:2b (39%) at producing correct edits.

Here's the twist: qwen3:1.7b is the better tool-caller — it can emit native tool_calls and drive an agent loop where the coder model can't — yet it's the worse coder. Tool-calling ability and coding ability are different axes, and a model can be strong on one and weak on the other. If your harness leans on native tool-calling, you're implicitly selecting for the tool-caller and possibly against the coder. (qwen3:1.7b is also strikingly slow and verbose here — ~50 s/run, ten times the output tokens — despite being told not to "think.")

Finding 3: the clever agentic scaffolding didn't help — and mostly hurt

This is the one that stings, because it's the stuff everyone reaches for.

  • Retry (run the test, feed the failure back, let it fix): neutral (70% vs 71%). Retry recovers near-misses. But this suite's failures are the hard tier — bowling, VLQ — which are capability-bound: the 3B can't write a correct bowling scorer no matter how many times you show it the failing test. Retry can't buy capability the model doesn't have.
  • Decomposition (plan the task into steps, then execute): −7 points. The planning step adds tokens and structure a small model handles poorly; a slightly wrong plan actively misleads it.
  • Role-split (a stronger qwen3:4b planner directing the coder): −5 points and ~50× slower. On 4 GB the two models can't co-reside, so the GPU thrashes swapping them in and out. A separate planner didn't help and cost two minutes a task.
  • Stacking the "improvements" together regressed below baseline, because the negative levers compound.

The pattern: orchestration tricks that help a frontier model coordinate do not rescue a small model whose failures are a capability ceiling, and several make it worse by adding surface area for a limited model to trip over.

The follow-up: can we craft tool-calls the small model can do?

Finding 1 left a loose end. Native tool-calling scored 0% only because the model emits tool intent as text, not in the tool_calls field. So we asked: what if we meet the model where it is — build our own tool-call protocol that parses the JSON-in-content it naturally produces — and give it a real agent loop (read → edit → run → fix)?

We did. A lenient parser (accepting fenced JSON, <tool_call> tags, or bare JSON), a small tool set, and a loop that hides the grading test until the end. Two results:

  1. A custom text protocol rescues tool-calling: 0% → ~56%. Meeting the model where it is works where the rigid native protocol is impossible. The 3B genuinely drives a multi-step loop.
  2. But the loop still loses to single-shot — and its value depends on the model. On the strong coder it hurt (71% single-shot → 56% in the loop: more turns, more chances to go wrong). On the weaker-but-tool-capable qwen3:1.7b it helped (43% → 58%). Neither beat plain coder-3b doing one whole-file edit.

So you can engineer a small model into being an agent at all — and that's a genuinely useful capability — but the engineering doesn't beat the simplest possible setup.

What it adds up to

For a small model on scoped coding tasks, the high-leverage choices are the editor model and the edit format — not the agentic scaffolding. The recommended 4 GB configuration is almost embarrassingly plain: qwen2.5-coder:3b, whole-file edits, single-shot. 71% at under three seconds a task. Everything fancier was neutral, slower, or worse.

This both confirms and sharpens the series' running thesis. We've said all along that the harness, not the model, is where the engineering lives — and that's still true: the harness's edit-format and model choices made the difference between 0% and 71%. But there's a hard limit the experiment made vivid. A harness can fix problems of interface — give the model an edit format and a tool-call protocol it can actually use — and the payoff is enormous. It cannot fix problems of capability — when the model simply can't write the bowling scorer, no amount of retry, planning, or multi-model orchestration conjures the answer, and the cleverer the scaffolding, the more often it backfires. With a small model you feel that ceiling directly, and the winning move is to stop adding machinery and get out of the model's way.

There's a nice corollary for anyone running local AI on modest hardware: you don't need an elaborate agent framework to get real value from a 4 GB GPU. You need the right small coder model, whole-file editing, and the discipline to keep it simple.

The harness for this study — the task suite, the levers, the custom tool-call loop, and the full results — is open source. This experiment grew out of trying to give a real coding agent a local brain; the surprise was how little the brain needed around it.

Comments (0)

Leave a Comment
Your email won't be published. We'll only use it to notify you of replies to your comment.
Loading comments...
Previous Article
post-thumb

Jun 15, 2026

What a Native Tool Call Actually Is

An LLM only ever outputs text — so how does an AI agent run a tool? The answer is the native tool call, a precise, trained, parseable protocol that most people hand-wave over. Here is exactly what it is, what happens on the wire, and why a small model that knows what to do still cannot make one.

Next Article
post-thumb

Jun 13, 2026

How a Coding Agent Actually Writes Code

The first four posts in this series took the telescope view of AI agents. This one takes the microscope: how a coding agent turns an instruction into a working, tested change — how it edits, compiles, runs, tests and debugs — and why Claude Code behaves like a native *nix programmer while others reach for a throwaway script.

agico

We transform visions into reality. We specializes in crafting digital experiences that captivate, engage, and innovate. With a fusion of creativity and expertise, we bring your ideas to life, one pixel at a time. Let's build the future together.

Copyright ©  2026  TYO Lab