
Last Update: June 15, 2026


By eric


The 4 GB experiment ended on an honest wall: a small local model fits in a 4 GB laptop GPU and solves easy coding tasks, but it hits a ceiling on hard ones and cannot drive a native tool call at all. The natural reaction is to shrug and reach for a cloud model.

But that conclusion assumed the small model's job is to do the work. A reader pushed back with a better question: what if its job is to decide who does the work? Judging a task is cheaper than solving it. So we ran another experiment — same 4 GB laptop, same hidden-grader harness — to find two things: where a small model's limit actually is across general tasks, and whether a small model can earn its keep as a router instead of a worker.

The setup

The harness is the same one from the coding study: each task runs against a hidden grader, so every result is automated and reproducible. The new task suite is 48 general tasks — three domains (reasoning, extraction, writing) × four difficulty tiers (easy → extreme) × four tasks each — graded by exact answer, number match, JSON-field match, or a constraint checker (word counts, lipograms, acrostics). We ran six models × 48 tasks × 3 repetitions = 864 trials, plus a separate batch of open-ended writing prompts judged by hand.
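
The size of the grid is easy to check by enumerating it. The sketch below only reproduces the counts; the variable names are illustrative and nothing here is the harness itself.

```python
from itertools import product

# Domains, tiers, and model tags as named in the article.
DOMAINS = ["reasoning", "extraction", "writing"]
TIERS = ["easy", "medium", "hard", "extreme"]
TASKS_PER_CELL = 4
MODELS = [
    "qwen3:4b", "qwen3:1.7b", "granite3.3:2b",
    "qwen2.5-coder:7b", "qwen2.5-coder:3b", "qwen2.5-coder:1.5b",
]
REPETITIONS = 3

# One task per (domain, tier, index) cell; one trial per (model, task, repetition).
tasks = list(product(DOMAINS, TIERS, range(TASKS_PER_CELL)))
trials = list(product(MODELS, tasks, range(REPETITIONS)))

print(len(tasks))   # 48
print(len(trials))  # 864
```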

A quick but important detour: our first run was contaminated by the grader, not the models. Tasks failed because a model answered $1,240.50 where the grader wanted 1240.50, or The Widget Pro where it wanted Widget Pro: correct extraction, punished for formatting. That is the same trap the coding study taught us to watch for, and a calibration pass caught it. We normalised currency symbols, commas, and a leading article before comparison. Measure the capability, never the formatting.
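
A normalisation pass of this kind can be sketched in a few lines. This is an illustration under assumptions, not the grader's actual code: the function names, the set of currency symbols, and the choice to strip commas only between digits are all ours.

```python
import re

def normalise(answer: str) -> str:
    """Strip formatting that should not affect grading (illustrative rules)."""
    text = answer.strip()
    # Drop a single leading article: "The Widget Pro" -> "Widget Pro".
    text = re.sub(r"^(the|a|an)\s+", "", text, flags=re.IGNORECASE)
    # Drop currency symbols (assumed set; extend as needed).
    text = re.sub(r"[$€£]", "", text)
    # Drop thousands separators, i.e. commas that sit between two digits.
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return text

def matches(model_answer: str, expected: str) -> bool:
    """Compare two answers after normalising both sides."""
    return normalise(model_answer) == normalise(expected)

print(normalise("$1,240.50"))                    # 1240.50
print(matches("The Widget Pro", "Widget Pro"))   # True
```

Normalising both sides, rather than only the model's answer, keeps the comparison symmetric if the expected answers themselves carry formatting.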

Finding 1: reasoning-tuning beats size

Here is the capability matrix — overall, and broken out by domain:

| Model | Overall | Extraction | Reasoning | Writing |
|---|---|---|---|---|
| qwen3:4b | 94% | 100% | 94% | 88% |
| qwen3:1.7b | 88% | 98% | 90% | 77% |
| granite3.3:2b | 72% | 94% | 50% | 73% |
| qwen2.5-coder:7b | 72% | 88% | 71% | 56% |
| qwen2.5-coder:3b | 69% | 94% | 54% | 58% |
| qwen2.5-coder:1.5b | 63% | 88% | 50% | 52% |

Read the top two rows against the bottom four. qwen3:1.7b scores 88% and beats qwen2.5-coder:7b at 72% — while being four times smaller. The coder models, which won the coding study, are the weakest generalists here. The lesson is blunt: capability is not monotonic in size, and it is not even consistent across model families. A model tuned for reasoning runs circles around a bigger model tuned for code the moment you step off the code.

That detail matters more than it looks, and it comes back to bite us later: the biggest model in your fleet is not necessarily the best one.

Finding 2: the limit is real, and it has a shape

Small models are not uniformly weak — they fail in specific, predictable places:

  • Easy and medium extraction and reasoning are solved. A 1.5B model pulls fields out of text and does two-step arithmetic at near-100%. This is genuinely usable work.
  • Hard reasoning caps at ~75% even for the best model. The classic trick — "you just overtook the person in second place; what position are you in?" — still catches them.
  • Lipograms are the single hardest thing we measured. "Describe the sky without using the letter e" sits at 50% across the entire fleet. Writing under a hard mechanical constraint needs planning the models do not have.
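
The writing constraints are mechanical, which is what makes failures like the lipogram easy to score automatically. A minimal sketch of the three checks mentioned in the setup, assuming our own function names and the simplest reasonable definitions (whitespace-separated words, first letter of each non-empty line):

```python
def is_lipogram(text: str, banned: str) -> bool:
    """True if the banned letter never appears, ignoring case."""
    return banned.lower() not in text.lower()

def word_count_ok(text: str, lo: int, hi: int) -> bool:
    """True if the whitespace-separated word count is within [lo, hi]."""
    return lo <= len(text.split()) <= hi

def is_acrostic(text: str, word: str) -> bool:
    """True if the first letters of the non-empty lines spell the word."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    initials = "".join(ln[0] for ln in lines)
    return initials.lower() == word.lower()

print(is_lipogram("A vast dark void with bright dots", "e"))  # True
print(is_lipogram("The sky is blue", "e"))                    # False
```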

So the small model has a real, shaped limit. Which sets up the actual question.

The pivot: can it route?

If a cheap model cannot reliably do hard work, can it reliably sort hard work — keep the easy stuff and hand the hard stuff up to a bigger model? We measured three routing signals, from the most idealised to the most deployable.

Signal 1 — the oracle router (the ceiling)

First, the upper bound. Imagine a perfect router that always sends each task to the cheapest model that can solve it. It cannot exist (it would need to know the answer in advance), but it tells you how much routing could possibly save. Using model size as a cost proxy:

  • Coverage 98% — some model in the fleet solves all but one task.
  • The oracle router's mean cost is 1.79B params vs 7B for "always use the biggest" — 74% cheaper at identical coverage.
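
The oracle is simple to state in code. The sketch below uses a made-up four-task solve matrix purely to show the mechanics; it is not the study's data, and averaging the cost over solved tasks only is our assumption about how the mean is defined. The last lines check the reported saving from the two figures quoted above.

```python
from typing import Dict, Optional, Set, Tuple

# Model size in billions of parameters, used as the cost proxy.
SIZE_B = {
    "qwen2.5-coder:1.5b": 1.5, "qwen3:1.7b": 1.7, "granite3.3:2b": 2.0,
    "qwen2.5-coder:3b": 3.0, "qwen3:4b": 4.0, "qwen2.5-coder:7b": 7.0,
}

def oracle_route(solvers: Set[str]) -> Optional[str]:
    """Cheapest model that solves the task, or None if no model does."""
    if not solvers:
        return None
    return min(solvers, key=lambda m: SIZE_B[m])

def oracle_summary(solved_by: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Return (coverage, mean cost over the covered tasks)."""
    routes = [oracle_route(s) for s in solved_by.values()]
    covered = [r for r in routes if r is not None]
    coverage = len(covered) / len(routes)
    mean_cost = sum(SIZE_B[r] for r in covered) / len(covered) if covered else 0.0
    return coverage, mean_cost

# Toy solve matrix (hypothetical): task id -> models that solved it.
toy = {
    "t1": {"qwen2.5-coder:1.5b", "qwen3:4b"},
    "t2": {"qwen3:1.7b"},
    "t3": {"qwen3:4b"},
    "t4": set(),
}
coverage, mean_cost = oracle_summary(toy)
print(f"toy coverage={coverage:.2f}, toy mean cost={mean_cost:.2f}B")

# Reported figures: 1.79B mean oracle cost vs 7B for always-biggest.
saving = 1 - 1.79 / 7.0
print(f"reported saving={saving:.0%}")  # 74%
```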

And the cumulative coverage as you add tiers, cheapest first, is the punchline:

| Models available up to | Size (B params) | Cumulative coverage |
|---|---|---|
| qwen2.5-coder:1.5b | 1.5 | 65% |
| qwen3:1.7b | 1.7 | 90% |
| granite3.3:2b | 2.0 | 94% |
| qwen2.5-coder:3b | 3.0 | 94% |
| qwen3:4b | 4.0 | 98% |
| qwen2.5-coder:7b | 7.0 | 98% |

Adding the 1.7B model jumps coverage from 65% to 90%. Everything above it buys 8 more points, and the 3B and 7B coders buy nothing for general work. The whole strategy falls out of one row: run qwen3:1.7b as the workhorse, escalate the ~10% it can't handle to qwen3:4b, and never load the 7B at all.
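
A cumulative-coverage table like this one comes from a short loop: sort the models by size, add them one at a time, and count the tasks that at least one available model solves. The sketch below runs on a hypothetical four-task solve matrix, not the study's results; a model that adds nothing shows up as a repeated coverage value.

```python
from typing import Dict, List, Set, Tuple

def cumulative_coverage(
    solved_by: Dict[str, Set[str]], size_b: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Coverage after each model is added, cheapest first."""
    available: Set[str] = set()
    rows = []
    for model in sorted(size_b, key=size_b.get):
        available.add(model)
        # A task is covered if any available model solved it.
        covered = sum(1 for solvers in solved_by.values() if solvers & available)
        rows.append((model, covered / len(solved_by)))
    return rows

size_b = {
    "qwen2.5-coder:1.5b": 1.5, "qwen3:1.7b": 1.7, "granite3.3:2b": 2.0,
    "qwen2.5-coder:3b": 3.0, "qwen3:4b": 4.0, "qwen2.5-coder:7b": 7.0,
}
# Toy solve matrix (hypothetical): task id -> models that solved it.
toy = {
    "t1": {"qwen2.5-coder:1.5b"},
    "t2": {"qwen3:1.7b"},
    "t3": {"qwen3:1.7b", "qwen3:4b"},
    "t4": {"qwen3:4b"},
}
rows = cumulative_coverage(toy, size_b)
for model, cov in rows:
    print(f"{model:<20} {cov:.0%}")
```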

Signal 2 — self-consistency (proactive, and deployable)

The oracle cheats by peeking at the grader. A real router has no grader at dispatch time — it only sees the model's own output. So here is a signal it can use: sample the small model a few times and watch whether it agrees with itself. Confident, stable answers are accepted; answers that wobble across samples get escalated.

The appeal is that it sidesteps the obvious failure mode of asking a model "is this hard?" — a weak model is exactly the one that can't tell a hard question is hard, and will cheerfully label it easy. Instability is harder to fake than a self-rating. We sampled qwen3:1.7b five times per task and scored two definitions of "agree":

| Agreement defined as | Accept precision | Escalation rate | Missed errors |
|---|---|---|---|
| Same text (any wording differs = disagree) | 100% | 60% | 0% |
| Same answer (8 = "8 days" = "the answer is 8") | 97% | 38% | 2% |

"Accept precision" is the chance an accepted answer is actually right; "missed errors" are confident-but-wrong answers that slip through. The first row is remarkable: when this model agrees with itself word-for-word, it is right every single time — zero bad accepts. The catch is it over-escalates, because it counts "8" and "8 days" as a disagreement. Comparing the answer instead of the wording cuts wasted escalation from 60% down to 38%, at the cost of exactly one confident-wrong slip. That is not a bug to fix — it is a dial. Turn it toward text-agreement when mistakes are expensive, toward answer-agreement when compute is.

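The dial can be sketched as a routing function that takes the samples plus a pluggable definition of "agree". Everything named here is illustrative: the function names are ours, requiring unanimous agreement is an assumption (the text above does not state a threshold), and the answer extractor is a crude stand-in that keeps only the last number in the output.

```python
import re
from typing import Callable, List, Tuple

def text_key(output: str) -> str:
    """Strict agreement: any difference in wording counts as disagreement."""
    return output.strip()

def answer_key(output: str) -> str:
    """Loose agreement: compare the last number found, else the lowercased text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return numbers[-1] if numbers else output.strip().lower()

def route(samples: List[str], key: Callable[[str], str]) -> Tuple[str, str]:
    """Accept the first sample if all samples share one key, else escalate."""
    if len({key(s) for s in samples}) == 1:
        return "accept", samples[0]
    return "escalate", ""

samples = ["8", "8 days", "the answer is 8", "8", "8 days"]
print(route(samples, text_key))    # ('escalate', '')
print(route(samples, answer_key))  # ('accept', '8')
```

Swapping `text_key` for `answer_key` is the whole dial: the same samples escalate under the strict definition and are accepted under the loose one.
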
Signal 3 — feedback escalation, and the corpus it leaves behind

The first two signals act before the user sees anything. The third one is the human in the loop: a thumbs-down re-routes the same question to a stronger model, and — just as importantly — every disliked question is saved to a deduplicated corpus for later.

Here is where Finding 1 comes back to bite. The naive way to "escalate to a stronger model" is to pick a bigger one — and that is wrong. Escalating qwen3:1.7b's failures by size would send them to granite or the 7B coder, both of which are worse at general tasks. Escalation has to follow measured capability, not parameter count. Routed correctly to qwen3:4b, the bigger model recovered 2 of the 5 disliked answers.
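
The difference between the two escalation rules shows up directly when both are run against the Finding 1 numbers. A minimal sketch, assuming the overall score is the capability measure and that the target should be the cheapest model that measured strictly better; the function names are ours.

```python
from typing import Optional

# Overall scores from the capability matrix, and sizes in billions of params.
OVERALL = {
    "qwen3:4b": 0.94, "qwen3:1.7b": 0.88, "granite3.3:2b": 0.72,
    "qwen2.5-coder:7b": 0.72, "qwen2.5-coder:3b": 0.69, "qwen2.5-coder:1.5b": 0.63,
}
SIZE_B = {
    "qwen3:4b": 4.0, "qwen3:1.7b": 1.7, "granite3.3:2b": 2.0,
    "qwen2.5-coder:7b": 7.0, "qwen2.5-coder:3b": 3.0, "qwen2.5-coder:1.5b": 1.5,
}

def next_by_size(current: str) -> Optional[str]:
    """Naive rule: the next larger model, whatever its measured score."""
    bigger = [m for m in SIZE_B if SIZE_B[m] > SIZE_B[current]]
    return min(bigger, key=SIZE_B.get) if bigger else None

def next_by_capability(current: str) -> Optional[str]:
    """Cheapest model whose measured score is strictly higher."""
    better = [m for m in OVERALL if OVERALL[m] > OVERALL[current]]
    return min(better, key=SIZE_B.get) if better else None

print(next_by_size("qwen3:1.7b"))        # granite3.3:2b
print(next_by_capability("qwen3:1.7b"))  # qwen3:4b
```

A per-domain score could replace `OVERALL` when the router knows what kind of task it is looking at.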

The other 3 are the interesting part. They were the marble-fraction trick and both lipograms — the exact hard tail from Finding 2, the tasks no tier in the fleet can solve. Escalation can't fix those, and that is precisely why the corpus matters: it automatically collects the questions that routing alone will never rescue. That pile is your eval set, your fine-tuning data, and your prompt-engineering to-do list — the questions worth a human's attention, surfaced for free.
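
A deduplicated corpus needs little more than a stable key per question. The sketch below is one way to do it, not a description of the actual implementation: the class name, the record fields, and the dedup rule (lowercase, collapse whitespace, hash) are all assumptions.

```python
import hashlib
import json
import re

def question_key(question: str) -> str:
    """Stable key: lowercase, collapse whitespace, then hash."""
    canonical = re.sub(r"\s+", " ", question.strip().lower())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class HardCaseCorpus:
    """In-memory store of disliked questions, one entry per unique question."""

    def __init__(self) -> None:
        self._items = {}

    def record_dislike(self, question: str, model: str, answer: str) -> bool:
        """Store the question; return False if it was already present."""
        key = question_key(question)
        if key in self._items:
            self._items[key]["dislikes"] += 1
            return False
        self._items[key] = {
            "question": question, "model": model, "answer": answer, "dislikes": 1,
        }
        return True

    def __len__(self) -> int:
        return len(self._items)

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to use as an eval or tuning file."""
        return "\n".join(json.dumps(item) for item in self._items.values())

corpus = HardCaseCorpus()
corpus.record_dislike("Describe the sky without using the letter e", "qwen3:1.7b", "...")
corpus.record_dislike("describe the sky  without using the letter E", "qwen3:4b", "...")
print(len(corpus))  # 1
```

Keeping a dislike count per entry gives a cheap priority order for the questions that deserve attention first.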

A note on the writing

The open-ended writing prompts told the same story from a different angle. Every model, down to 1.5B, writes fluent prose. What separates them is discipline — obeying the instruction. The small coders wrote four lines when asked for a two-line rhyme, three sentences when asked for two. Only two of six models produced an actual rhyme. The best writer, qwen3:4b, also had the worst obedience, cheerfully appending an emoji checklist to a "short email." Fluency is free at 1.5B; following the constraint is the thing that scales with reasoning-tuning — the same axis the whole routing argument rides on.

The takeaway

The small model's ceiling is real, but the framing was wrong. Asked to be the worker, a 4 GB model is a mediocre generalist. Asked to be the dispatcher, it is excellent — and the architecture writes itself:

  • qwen3:1.7b as the always-on workhorse, covering ~90% of general traffic on hardware you already own.
  • Self-consistency as the escalator — when the small model disagrees with itself, hand the task up. A tunable dial between cost and safety.
  • Escalate by measured capability, not size — and capture every thumbs-down into a hard-case corpus, because that corpus is where the next round of real improvement comes from.

Stop asking the small model to do the work. Ask it to route the work, and it will save you most of your compute while quietly building the dataset that makes everything downstream better.

This closes the small-model thread of the series — from which agent wins, to what a tool call really is, to the 4 GB model's real job. Next, we turn the lens back on the big agents.


Copyright ©  2026  TYO Lab