The Small Model's Real Job Is Dispatch, Not Work

The 4 GB experiment ended on an honest wall: a small local model fits in a 4 GB laptop GPU and solves easy coding tasks, but it hits a ceiling on hard ones and cannot drive a native tool call at all. The natural reaction is to shrug and reach for a cloud model.

But that conclusion assumed the small model's job is to do the work. A reader pushed back with a better question: what if its job is to decide who does the work? Judging a task is cheaper than solving it. So we ran another experiment — same 4 GB laptop, same hidden-grader harness — to find two things: where a small model's limit actually is across general tasks, and whether a small model can earn its keep as a router instead of a worker.

The setup

The harness is the same one from the coding study: each task runs against a hidden grader, so every result is automated and reproducible. The new task suite is 48 general tasks — three domains (reasoning, extraction, writing) × four difficulty tiers (easy → extreme) × four tasks each — graded by exact answer, number match, JSON-field match, or a constraint checker (word counts, lipograms, acrostics). We ran six models × 48 tasks × 3 repetitions = 864 trials, plus a separate batch of open-ended writing prompts judged by hand.
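To make the constraint-checker graders concrete, here is a minimal sketch of what such checks might look like. The function names and the exact rules (whitespace-separated word counts, one acrostic letter per non-empty line) are assumptions for illustration, not the harness's actual code.

```python
def word_count_ok(text: str, expected: int) -> bool:
    """True when the text has exactly `expected` whitespace-separated words."""
    return len(text.split()) == expected


def is_lipogram(text: str, banned: str) -> bool:
    """True when the banned letter never appears, in either case."""
    return banned.lower() not in text.lower()


def is_acrostic(text: str, word: str) -> bool:
    """True when the first letters of the non-empty lines spell `word`."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    initials = "".join(line[0] for line in lines)
    return initials.lower() == word.lower()
```

Checks like these are binary and mechanical, which is what lets a writing task sit in the same automated harness as an arithmetic one.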

A quick but important detour: our first run was contaminated by the grader, not the models. Tasks failed because a model answered "$1,240.50" where the grader wanted "1240.50", or "The Widget Pro" where it wanted "Widget Pro": correct extraction, punished for formatting. That is the same trap the coding study taught us to watch for, and a calibration pass caught it. We normalised currency symbols, commas, and a leading article before comparison. Measure the capability, never the formatting.
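A normalisation pass of this kind can be sketched in a few lines. The names `normalise` and `answers_match` are hypothetical, and the rule of stripping commas only from answers that are otherwise plain numbers is an assumption made here so that commas inside text answers survive.

```python
import re

# Leading article followed by whitespace, e.g. "The Widget Pro" -> "Widget Pro".
_LEADING_ARTICLE = re.compile(r"^(?:the|an|a)\s+", re.IGNORECASE)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def normalise(answer: str) -> str:
    """Strip formatting noise that says nothing about capability."""
    text = _LEADING_ARTICLE.sub("", answer.strip())
    # Drop currency symbols and commas only when what remains is a plain
    # number, so commas inside ordinary text answers are left alone.
    stripped = re.sub(r"[$€£,]", "", text)
    if _NUMBER.fullmatch(stripped):
        return stripped
    return text


def answers_match(model_answer: str, expected: str) -> bool:
    """Compare two answers after normalising both sides."""
    return normalise(model_answer) == normalise(expected)
```

Normalising both the model output and the expected value keeps the comparison symmetric, so a grader key written with a currency symbol does not fail a model that omits it.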

Finding 1: reasoning-tuning beats size

Here is the capability matrix — overall, and broken out by domain:

| Model | Overall | Extraction | Reasoning | Writing |
|---|---|---|---|---|
| qwen3:4b | 94% | 100% | 94% | 88% |
| qwen3:1.7b | 88% | 98% | 90% | 77% |
| granite3.3:2b | 72% | 94% | 50% | 73% |
| qwen2.5-coder:7b | 72% | 88% | 71% | 56% |
| qwen2.5-coder:3b | 69% | 94% | 54% | 58% |
| qwen2.5-coder:1.5b | 63% | 88% | 50% | 52% |

Read the top two rows against the bottom four. qwen3:1.7b scores 88% and beats qwen2.5-coder:7b at 72% — while being four times smaller. The coder models, which won the coding study, are the weakest generalists here. The lesson is blunt: capability is not monotonic in size, and it is not even consistent across model families. A model tuned for reasoning runs circles around a bigger model tuned for code the moment you step off the code.

That detail matters more than it looks, and it comes back to bite us later: the biggest model in your fleet is not necessarily the best one.
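The non-monotonic claim can be checked mechanically against the overall scores. The sketch below lists every pair where a smaller model outscores a bigger one; the sizes and scores are the ones reported in this article, while the helper name `size_inversions` is made up for illustration.

```python
# (model, size in billions of parameters, overall score in %)
FLEET = [
    ("qwen3:4b", 4.0, 94),
    ("qwen3:1.7b", 1.7, 88),
    ("granite3.3:2b", 2.0, 72),
    ("qwen2.5-coder:7b", 7.0, 72),
    ("qwen2.5-coder:3b", 3.0, 69),
    ("qwen2.5-coder:1.5b", 1.5, 63),
]


def size_inversions(fleet):
    """Pairs (smaller, bigger) where the smaller model scores strictly higher."""
    return [
        (small, big)
        for small, small_size, small_score in fleet
        for big, big_size, big_score in fleet
        if small_size < big_size and small_score > big_score
    ]


biggest = max(FLEET, key=lambda row: row[1])[0]
best = max(FLEET, key=lambda row: row[2])[0]

for small, big in size_inversions(FLEET):
    print(f"{small} beats the larger {big}")
print(f"biggest: {biggest}, best: {best}")
```

If size predicted capability, the inversion list would be empty and `biggest` would equal `best`.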

Finding 2: the limit is real, and it has a shape

Small models are not uniformly weak — they fail in specific, predictable places:

Easy and medium extraction and reasoning are solved. A 1.5B model pulls fields out of text and does two-step arithmetic at near-100%. This is genuinely usable work.
Hard reasoning caps at ~75% even for the best model. The classic trick — "you just overtook the person in second place; what position are you in?" — still catches them.
Lipograms are the single hardest thing we measured. "Describe the sky without using the letter e" sits at 50% across the entire fleet. Writing under a hard mechanical constraint needs planning the models do not have.

So the small model has a real, shaped limit. Which sets up the actual question.

The pivot: can it route?

If a cheap model cannot reliably do hard work, can it reliably sort hard work — keep the easy stuff and hand the hard stuff up to a bigger model? We measured three routing signals, from the most idealised to the most deployable.

Signal 1 — the oracle router (the ceiling)

First, the upper bound. Imagine a perfect router that always sends each task to the cheapest model that can solve it. It cannot exist (it would need to know the answer in advance), but it tells you how much routing could possibly save. Using model size as a cost proxy:

Coverage is 98%: some model in the fleet solves all but one of the 48 tasks.
The oracle router's mean cost is 1.79B params vs 7B for "always use the biggest" — 74% cheaper at identical coverage.
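For readers who want to reproduce the oracle numbers on their own results, here is a minimal sketch. The task names and solver sets are invented toy data, and averaging cost over solved tasks only is an assumption; the article does not say how the unsolved task was charged.

```python
from __future__ import annotations

# Model size in billions of parameters, used as the cost proxy.
SIZES_B = {
    "qwen2.5-coder:1.5b": 1.5,
    "qwen3:1.7b": 1.7,
    "qwen3:4b": 4.0,
    "qwen2.5-coder:7b": 7.0,
}


def oracle_route(solvers: set[str]) -> str | None:
    """Cheapest model that solves the task, or None if no model does."""
    return min(solvers, key=SIZES_B.__getitem__) if solvers else None


def oracle_summary(results: dict[str, set[str]]) -> tuple[float, float]:
    """Return (coverage, mean cost in B params over the solved tasks)."""
    routed = [oracle_route(solvers) for solvers in results.values()]
    solved = [model for model in routed if model is not None]
    if not solved:
        return 0.0, 0.0
    coverage = len(solved) / len(results)
    mean_cost = sum(SIZES_B[model] for model in solved) / len(solved)
    return coverage, mean_cost


def saving(mean_cost: float, baseline: float) -> float:
    """Fractional saving against always using the baseline model."""
    return 1 - mean_cost / baseline


# Toy data (hypothetical): which models solved which task.
TOY_RESULTS = {
    "easy-extract": {"qwen2.5-coder:1.5b", "qwen3:1.7b", "qwen3:4b"},
    "medium-write": {"qwen3:1.7b", "qwen3:4b"},
    "hard-reason": {"qwen3:4b"},
    "extreme-lipogram": set(),
}

coverage, mean_cost = oracle_summary(TOY_RESULTS)
print(f"toy coverage {coverage:.0%}, toy mean cost {mean_cost:.2f}B")
print(f"reported saving: {saving(1.79, 7.0):.0%}")
```

The last line reruns the article's own arithmetic: a mean cost of 1.79B against a 7B baseline is a saving of 1 - 1.79/7, which rounds to 74%.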

And the cumulative coverage as you add tiers, cheapest first, is the punchline:

| Models available up to | Size (B) | Cumulative coverage |
|---|---|---|
| qwen2.5-coder:1.5b | 1.5 | 65% |
| qwen3:1.7b | 1.7 | 90% |
| granite3.3:2b | 2.0 | 94% |
| qwen2.5-coder:3b | 3.0 | 94% |
| qwen3:4b | 4.0 | 98% |
| qwen2.5-coder:7b | 7.0 | 98% |

Adding the 1.7B model jumps coverage from 65% to 90%. Everything above it buys 8 more points, and the 3B and 7B coders buy nothing for general work. The whole strategy falls out of one row: run qwen3:1.7b as the workhorse, escalate the ~10% it can't handle to qwen3:4b, and never load the 7B at all.
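The cumulative curve is simple to compute once you know which models solved which task. The sketch below uses invented toy data and a hypothetical helper name; it shows the shape of the calculation, not the study's actual per-task results.

```python
def cumulative_coverage(results, sizes):
    """Coverage after each model is added, cheapest first.

    results: task name -> set of models that solved it
    sizes:   model name -> size in billions of parameters
    """
    available = set()
    rows = []
    for model in sorted(sizes, key=sizes.get):
        available.add(model)
        covered = sum(1 for solvers in results.values() if solvers & available)
        rows.append((model, covered / len(results)))
    return rows


# Toy data (hypothetical), not the 48-task suite.
TOY_SIZES = {"qwen2.5-coder:1.5b": 1.5, "qwen3:1.7b": 1.7, "qwen3:4b": 4.0}
TOY_RESULTS = {
    "easy-extract": {"qwen2.5-coder:1.5b", "qwen3:1.7b", "qwen3:4b"},
    "medium-write": {"qwen3:1.7b", "qwen3:4b"},
    "hard-reason": {"qwen3:4b"},
    "extreme-lipogram": set(),
}

for model, coverage in cumulative_coverage(TOY_RESULTS, TOY_SIZES):
    print(f"{model:20s} {coverage:.0%}")
```

A tier whose row repeats the previous coverage figure adds nothing for this workload, which is the test the 3B and 7B coders fail in the real table.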

Signal 2 — self-consistency (proactive, and deployable)

The oracle cheats by peeking at the grader. A real router has no grader at dispatch time — it only sees the model's own output. So here is a signal it can use: sample the small model a few times and watch whether it agrees with itself. Confident, stable answers are accepted; answers that wobble across samples get escalated.
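In code, the signal is a small loop. This is a sketch under stated assumptions: `sample` stands in for whatever function calls the small model, and requiring all samples to agree (rather than a majority) is a choice made here for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def route_by_self_consistency(
    prompt: str,
    sample: Callable[[str], str],
    k: int = 5,
    key: Callable[[str], str] = str.strip,
) -> Tuple[Optional[str], bool]:
    """Sample the small model k times; return (answer, escalate).

    `key` maps a raw output to the value that must agree across samples.
    """
    keys = [key(sample(prompt)) for _ in range(k)]
    top, votes = Counter(keys).most_common(1)[0]
    if votes == k:
        return top, False  # stable: accept the small model's answer
    return None, True      # wobbly: hand the task to a stronger model


# Usage with stand-in samplers (no real model involved).
stable = lambda prompt: "8"
wobble = iter(["8", "9", "8", "8", "8"])

print(route_by_self_consistency("How many days?", stable))
print(route_by_self_consistency("How many days?", lambda p: next(wobble)))
```

The `key` parameter is where the definition of "agree" lives, which matters once the wording of an answer starts to vary.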

The appeal is that it sidesteps the obvious failure mode of asking a model "is this hard?" — a weak model is exactly the one that can't tell a hard question is hard, and will cheerfully label it easy. Instability is harder to fake than a self-rating. We sampled qwen3:1.7b five times per task and scored two definitions of "agree":

| Agreement defined as | Accept precision | Escalation rate | Missed errors |
|---|---|---|---|
| Same text (any wording differs = disagree) | 100% | 60% | 0 |
| Same answer (8 = "8 days" = "the answer is 8") | 97% | 38% | 1 |

"Accept precision" is the chance an accepted answer is actually right; "missed errors" are confident-but-wrong answers that slip through. The first row is remarkable: when this model agrees with itself word-for-word, it is right every single time — zero bad accepts. The catch is it over-escalates, because it counts "8" and "8 days" as a disagreement. Comparing the answer instead of the wording cuts wasted escalation from 60% down to 38%, at the cost of exactly one confident-wrong slip. That is not a bug to fix — it is a dial. Turn it toward text-agreement when mistakes are expensive, toward answer-agreement when compute is.

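The two definitions of "agree", and the three metrics scored against them, can be sketched as follows. The trials are invented toy data, so the printed numbers illustrate the trade-off rather than reproduce the study; extracting the last number as the answer is also an assumption.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def text_key(output: str) -> str:
    """Strict agreement: outputs must match exactly, apart from outer whitespace."""
    return output.strip()


def answer_key(output: str) -> str:
    """Loose agreement: compare the last number in the output, if there is one."""
    numbers = _NUMBER.findall(output)
    return numbers[-1] if numbers else output.strip().lower()


def score_router(trials, key):
    """trials: list of (samples, expected). Returns routing metrics for one key."""
    accepted = correct = 0
    for samples, expected in trials:
        if len({key(s) for s in samples}) != 1:
            continue  # samples disagree, so the task is escalated
        accepted += 1
        # Grading always uses the loose key, so wording is never punished.
        if answer_key(samples[0]) == answer_key(expected):
            correct += 1
    return {
        "accept_precision": correct / accepted if accepted else None,
        "escalation_rate": 1 - accepted / len(trials),
        "missed_errors": accepted - correct,
    }


# Toy trials (hypothetical): (model samples, expected answer).
TOY_TRIALS = [
    (["8", "8", "8"], "8"),                      # agrees on text, right
    (["8", "8 days", "the answer is 8"], "8"),   # agrees on answer only, right
    (["12", "12.", "12"], "15"),                 # agrees on answer only, wrong
    (["3", "4", "3"], "3"),                      # disagrees either way
]

print("same text:  ", score_router(TOY_TRIALS, text_key))
print("same answer:", score_router(TOY_TRIALS, answer_key))
```

On the toy data the strict key escalates more and lets nothing wrong through, while the loose key escalates less and admits one confident error: the same dial, in miniature.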
Signal 3 — feedback escalation, and the corpus it leaves behind

The first two signals act before the user sees anything. The third one is the human in the loop: a thumbs-down re-routes the same question to a stronger model, and — just as importantly — every disliked question is saved to a deduplicated corpus for later.

Here is where Finding 1 comes back to bite. The naive way to "escalate to a stronger model" is to pick a bigger one, and that is wrong. Escalating qwen3:1.7b's failures by size would send them to granite3.3:2b or qwen2.5-coder:7b, both of which score lower on general tasks (72% overall against 88%). Escalation has to follow measured capability, not parameter count. Routed by measured capability to qwen3:4b instead, 2 of the 5 disliked answers were recovered.
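The difference between the two escalation rules fits in a few lines. The sizes and overall scores below are the ones reported in this article; the function names, and the tie-break of picking the cheapest model that measures stronger, are assumptions for the sketch.

```python
# model -> (size in billions of parameters, overall score in %)
FLEET = {
    "qwen3:4b": (4.0, 94),
    "qwen3:1.7b": (1.7, 88),
    "granite3.3:2b": (2.0, 72),
    "qwen2.5-coder:7b": (7.0, 72),
    "qwen2.5-coder:3b": (3.0, 69),
    "qwen2.5-coder:1.5b": (1.5, 63),
}


def escalate_by_size(current):
    """Naive rule: the next model up by parameter count."""
    size = FLEET[current][0]
    bigger = [m for m, (s, _) in FLEET.items() if s > size]
    return min(bigger, key=lambda m: FLEET[m][0]) if bigger else None


def escalate_by_capability(current):
    """Measured rule: the cheapest model that scores higher than the current one."""
    score = FLEET[current][1]
    stronger = [m for m, (_, sc) in FLEET.items() if sc > score]
    return min(stronger, key=lambda m: FLEET[m][0]) if stronger else None


print(escalate_by_size("qwen3:1.7b"))        # granite3.3:2b
print(escalate_by_capability("qwen3:1.7b"))  # qwen3:4b
```

A per-domain score would be a natural refinement of the same rule, since a model can lead on extraction and trail on writing.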

The other 3 are the interesting part. They were the marble-fraction trick and both lipograms: the exact hard tail from Finding 2, the tasks no tier in the fleet solves reliably. Escalation can't fix those, and that is precisely why the corpus matters: it automatically collects the questions that routing alone will never rescue. That pile is your eval set, your fine-tuning data, and your prompt-engineering to-do list: the questions worth a human's attention, surfaced for free.
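A deduplicated corpus needs very little machinery. The class below is a hypothetical sketch: the name `HardCaseCorpus`, the in-memory storage, and deduplicating on a case- and whitespace-insensitive hash of the question are all assumptions, since the article only says the corpus is deduplicated.

```python
import hashlib


class HardCaseCorpus:
    """Collects thumbs-down questions, one entry per distinct question."""

    def __init__(self):
        self._items = {}

    @staticmethod
    def _fingerprint(question: str) -> str:
        # Collapse case and whitespace so trivial variants count as duplicates.
        canonical = " ".join(question.lower().split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def record_dislike(self, question: str, model: str) -> bool:
        """Store the question; return True only if it was new."""
        key = self._fingerprint(question)
        if key in self._items:
            self._items[key]["dislikes"] += 1
            self._items[key]["models"].add(model)
            return False
        self._items[key] = {"question": question, "dislikes": 1, "models": {model}}
        return True

    def __len__(self) -> int:
        return len(self._items)

    def hardest(self):
        """Entries ordered by how often they were disliked, most first."""
        return sorted(self._items.values(), key=lambda e: e["dislikes"], reverse=True)
```

Tracking which models drew the dislike means a question that fails at every tier stands out from one that only the workhorse missed.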

A note on the writing

The open-ended writing prompts told the same story from a different angle. Every model, down to 1.5B, writes fluent prose. What separates them is discipline — obeying the instruction. The small coders wrote four lines when asked for a two-line rhyme, three sentences when asked for two. Only two of six models produced an actual rhyme. The best writer, qwen3:4b, also had the worst obedience, cheerfully appending an emoji checklist to a "short email." Fluency is free at 1.5B; following the constraint is the thing that scales with reasoning-tuning — the same axis the whole routing argument rides on.

The takeaway

The small model's ceiling is real, but the framing was wrong. Asked to be the worker, a 4 GB model is a mediocre generalist. Asked to be the dispatcher, it is excellent — and the architecture writes itself:

qwen3:1.7b as the always-on workhorse, covering ~90% of general traffic on hardware you already own.
Self-consistency as the escalator — when the small model disagrees with itself, hand the task up. A tunable dial between cost and safety.
Escalate by measured capability, not size — and capture every thumbs-down into a hard-case corpus, because that corpus is where the next round of real improvement comes from.

Stop asking the small model to do the work. Ask it to route the work, and it will save you most of your compute while quietly building the dataset that makes everything downstream better.

This closes the small-model thread of the series — from which agent wins, to what a tool call really is, to the 4 GB model's real job. Next, we turn the lens back on the big agents.
