
Last Update: June 17, 2026
BY
eric
The dispatch experiment ended on a clean idea and a loose end. The idea: a small local model earns its keep not by doing the work but by routing it — a 1.7B workhorse covers about 90% of a general workload, and you escalate the hard ~10% tail to something more capable. The loose end: escalate to what? "More capable" almost always means reaching past the laptop for a cloud API — which means a bill, a network round-trip, and your prompts leaving the building.
This post closes that loop with a model that landed in the middle of that open question and that the internet spent a week arguing about: VibeThinker-3B. And it does it on the cheapest hardware in the series so far.
You already own the hardware
Here is the part worth saying plainly. The machine in these experiments is a gaming laptop with an RTX 3050 and 4 GB of VRAM — not a workstation, not a rented H100, a laptop you might have bought for games years ago. Most of us already have a card roughly like this, gathering dust under a pile of Chrome tabs. The interesting question is not "what can a $30,000 server do." It is: what can the thing you already own do, right now, for free?
And there is a second half to that question that people skip: what does it cost to run? I sampled the GPU's power draw during inference. Idle, it sits near 14 W; flat-out on a hard problem it averages about 50 W and peaks at 60 W. That is a bright lightbulb, not a space heater, and roughly an order of magnitude under a desktop 4090 (~450 W) or a datacenter accelerator (~700 W). A hard answer that takes ~50 W for ~20 seconds is about 0.3 watt-hours, so a thousand of them come to roughly 0.3 kWh: a few cents of electricity at typical residential rates, far less than a cup of coffee. You do not need to worry about your electricity bill. That changes the calculus of "just keep it running locally."
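The arithmetic behind that energy figure is easy to check. This is a back-of-envelope sketch using the 50 W and 20 second figures from the measurement above; the electricity price is an illustrative assumption, not something measured for this post.

```python
# Back-of-envelope energy cost per answer.
AVG_POWER_W = 50.0          # average GPU draw under load (from the post)
SECONDS_PER_ANSWER = 20.0   # rough duration of one hard answer (from the post)
PRICE_PER_KWH = 0.15        # ASSUMPTION: illustrative electricity price in USD


def watt_hours(power_w: float, seconds: float) -> float:
    """Energy in watt-hours for a constant power draw over a duration."""
    return power_w * seconds / 3600.0


per_answer_wh = watt_hours(AVG_POWER_W, SECONDS_PER_ANSWER)
thousand_kwh = per_answer_wh * 1000 / 1000.0   # 1,000 answers, converted Wh -> kWh
thousand_cost = thousand_kwh * PRICE_PER_KWH

print(f"{per_answer_wh:.2f} Wh per answer")
print(f"{thousand_kwh:.2f} kWh per 1,000 answers")
print(f"${thousand_cost:.2f} per 1,000 answers at ${PRICE_PER_KWH}/kWh")
```

Swap in your own tariff and a longer duration for long reasoning runs; the result stays small either way.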
So: a card you already own, drawing a lightbulb of power. The only thing missing was a model worth pointing it at.
What VibeThinker-3B is
VibeThinker-3B is a 3-billion-parameter reasoning model from WeiboAI (Weibo's AI lab), and — conveniently for this series — it is a fine-tune of the very same Qwen2.5-3B base we have been running all along. That makes it an unusually fair comparison: same architecture, same footprint, different training.
The claims are what made it go viral: 94.3 on AIME 2026, 80.2 Pass@1 on LiveCodeBench v6, and parity with much larger frontier reasoning models on verifiable benchmarks — from a model small enough to fit in 4 GB. Predictably, the benchmarks got argued over (generous pass@k budgets, among other things), so the honest move is the one this series always makes: stop reading the leaderboard and run it ourselves.
One caveat up front, straight from the model card: this is a specialist. It is trained for verifiable reasoning — competition math, coding, STEM — where an answer can be checked. Its own authors say that for broad open-domain knowledge you still want a general model. That caveat is not a footnote; it is exactly why this model fits the escalation slot and not the workhorse slot.
The test
Same laptop, same spirit as the 4 GB experiment: a hidden, automated grader is the only judge. I pulled the Q4_K_M GGUF, applied the paper's recommended sampling (temperature 1.0, top-p 0.95), and ran a four-task battery chosen so every answer is mechanically checkable — two competition-math problems, one genuinely hard AIME-level problem, and a classic LeetCode coding task:
Four for four. I ran the generated code against a brute-force oracle on a batch of strings, and it matched everywhere. The whole time, the model sat 100% on the GPU, using 2.3–2.7 GB of the 4 GB and generating at 50–63 tokens per second. No CPU offload, no thrashing. The asterisk on the AIME row is the one real gotcha, and it is worth a section of its own.
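For readers who have not used an oracle check before, here is a minimal sketch of the idea. The post does not name the coding task, so the problem below (longest substring without a repeated character) is a hypothetical stand-in, and `candidate` plays the role of the model-written solution.

```python
import random


def brute_force(s: str) -> int:
    """Oracle: try every substring, keep the longest with no repeated character."""
    best = 0
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if len(set(s[i:j])) == j - i:
                best = max(best, j - i)
    return best


def candidate(s: str) -> int:
    """HYPOTHETICAL stand-in for the model's answer: a linear sliding window."""
    last_seen = {}
    start = best = 0
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1   # jump past the previous occurrence
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best


def first_mismatch(trials: int = 500, seed: int = 0):
    """Return the first random string where the two disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("abc") for _ in range(rng.randint(0, 12)))
        if candidate(s) != brute_force(s):
            return s
    return None


print(first_mismatch())
```

A tiny alphabet and short strings are deliberate: they force many repeated characters, which is where an off-by-one in the fast version tends to show up.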
The gotcha: token budget, not VRAM, is the wall
On my first AIME attempt the model failed — but not because it could not solve the problem. With a 16K-token context it generated 16,299 tokens of reasoning and ran straight into the context wall mid-thought, never reaching an answer. Reasoning models think in the open, and a hard problem is a long think.
I raised the context to 32K and re-ran the exact same prompt. This time it converged — 601, correct — in about 12,000 tokens. And here is the lovely part: even at 32K context it still sat entirely on the GPU, at 2.7 GB.
That is the counter-intuitive lesson for this class of model on small hardware: the binding constraint is the token budget, not the VRAM. The KV cache (quantized, with flash attention) is cheap; what these models need is room to think. If you run a reasoning model on a 4 GB card and it "fails," check whether it actually finished before you blame the model.
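A rough estimate shows why the KV cache stays cheap. This sketch assumes the Qwen2.5-3B base uses 36 layers, 2 key/value heads and a head dimension of 128; those figures are an assumption taken from the published Qwen2.5 configuration, not from this post, and the post does not say which cache quantization was used. The bytes-per-element values follow the GGML block layouts (34 bytes per 32 values for q8_0, 18 bytes per 32 values for q4_0).

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# ASSUMPTION: architecture figures for the Qwen2.5-3B base.
LAYERS, KV_HEADS, HEAD_DIM = 36, 2, 128

BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,   # 32 int8 values plus a 2-byte scale per block
    "q4_0": 18 / 32,   # 32 4-bit values plus a 2-byte scale per block
}

for ctx in (16_384, 32_768):
    for name, width in BYTES_PER_ELEM.items():
        gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, width) / 2**30
        print(f"ctx={ctx:>6}  cache={name:<5} ~{gib:.2f} GiB")
```

With only two key/value heads, doubling the context adds a fraction of a gigabyte rather than several, which is consistent with the model staying on the GPU at 32K.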
The head-to-head
Solving four problems is nice, but the real question is comparative: against the models we already run on this laptop, is VibeThinker actually better, or just different? I ran the identical battery against qwen2.5-coder:3b (our coding workhorse, same 3B size) and qwen3:4b (the reasoning-tuned model one size up).
Read that table for a minute, because each column tells a different story.
qwen2.5-coder:3b is the fastest thing here and stays fully on the GPU — but it has no reasoning chain, so it answers from the hip and falls apart on anything non-trivial. It missed the dominoes problem (a recurrence a reasoning model finds in seconds) and was nowhere near the AIME answer. Superb at code, useless at hard math. Exactly the profile that made it our editor in the earlier studies, and exactly why it cannot be the escalation target.
qwen3:4b can reason — notice it got the dominoes problem the coder missed. But it is 4B, and 4B does not fit in 4 GB. It spills onto the CPU and collapses to 3–18 tokens per second. The AIME problem ran for two hours and twelve minutes and still did not finish inside a generous token cap. It is the right kind of model at the wrong size for this card — technically capable, practically unusable here.
VibeThinker-3B is the only column that is both right and runnable. It reasons like the 4B but fits and flies like the 3B. That is the whole point: it gives you qwen3:4b's reasoning at qwen2.5-coder:3b's speed and footprint — because it is the 3B, just trained to think.
The payoff: the tail finally has a target
Bring it back to the dispatch thesis. That study argued you should route by measured capability, not by parameter count: let a small workhorse handle the bulk, and escalate the hard tail to whatever model actually clears it. The unsatisfying gap was that "whatever clears it" pointed off the laptop and into the cloud.
VibeThinker-3B fills that gap for one specific, valuable slice of the tail: hard, verifiable reasoning. A math or coding problem your 1.7B workhorse flubs and your 4B can't run in time, this 3B specialist solves — locally, on the GPU you already own, for a lightbulb of electricity. You escalate sideways into a specialist instead of up into the cloud.
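The sideways escalation can be sketched in a few lines. This is an illustration of the routing idea, not the dispatch study's actual code: `dispatch`, `Solver` and `Checker` are hypothetical names, and the solvers are passed in as plain callables so the sketch runs without any model loaded.

```python
from typing import Callable, Optional, Tuple

Solver = Callable[[str], Optional[str]]   # returns an answer, or None if it gave up
Checker = Callable[[str, str], bool]      # (problem, answer) -> did it verify?


def dispatch(problem: str, workhorse: Solver, specialist: Solver,
             check: Checker) -> Tuple[str, Optional[str]]:
    """Try the cheap model first; escalate only when its answer fails the check."""
    answer = workhorse(problem)
    if answer is not None and check(problem, answer):
        return "workhorse", answer
    answer = specialist(problem)
    if answer is not None and check(problem, answer):
        return "specialist", answer
    return "unsolved", None


# Toy stand-ins: the workhorse guesses wrong, the specialist gets it right.
route, answer = dispatch(
    "2x12 domino tilings",
    workhorse=lambda p: "144",
    specialist=lambda p: "233",
    check=lambda p, a: a == "233",
)
print(route, answer)
```

The design only works for the slice of the tail the post describes: problems where a mechanical `check` exists. Without a verifier there is nothing to trigger the escalation.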
It is not a general assistant. Ask it for open-domain knowledge or to drive an agent's tool calls and it will lose to qwen3 — it is a scalpel, not a Swiss Army knife. But a scalpel is exactly what an escalation slot wants. The dispatch post said the small model's job is to know who should do the work. This is one more capable "who" — and remarkably, it is one that still fits in 4 GB.
Reproduce it yourself
The whole point of this series is that you can run it on the machine you already have. Here is the entire setup.
Pull the quantized model straight from Hugging Face into Ollama (no separate download step needed):
ollama pull hf.co/mradermacher/VibeThinker-3B-GGUF:Q4_K_M
Wrap it in a Modelfile so the recommended sampling and a generous context are baked in. The 32K context is the important part — give the reasoner room to think:
FROM hf.co/mradermacher/VibeThinker-3B-GGUF:Q4_K_M
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 0
PARAMETER num_ctx 32768
Save that as VibeThinker.Modelfile, then build the model from it:
ollama create vibethinker3b -f VibeThinker.Modelfile
Then point a problem at it and confirm it lands fully on the GPU:
curl -s http://localhost:11434/api/generate -d '{
  "model": "vibethinker3b",
  "prompt": "In how many ways can a 2x12 rectangle be tiled by 1x2 dominoes? End with the final integer on its own line.",
  "stream": false
}' | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
ollama ps # PROCESSOR should read 100% GPU
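To grade the dominoes answer mechanically, you can compute the expected value from the standard recurrence for tiling a 2 x n board and compare it with the last line of the response. This is a minimal sketch rather than the hidden grader used in the post: `final_integer` and `grade` are hypothetical helpers, and they assume the model obeyed the prompt and ended with a bare integer.

```python
import re


def domino_tilings(n: int) -> int:
    """Tilings of a 2 x n board: t(n) = t(n-1) + t(n-2), with t(0) = t(1) = 1."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def final_integer(response: str):
    """Return the integer on the last non-empty line, or None if there is none."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"-?\d+", lines[-1])
    return int(match.group()) if match else None


def grade(response: str, n: int = 12) -> bool:
    """True only when the response ends with the correct tiling count."""
    return final_integer(response) == domino_tilings(n)


print(domino_tilings(12))
```

A response that trails off mid-reasoning has no final integer, so `grade` returns False for it instead of guessing.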
If a hard problem comes back without a final answer, you have hit the token wall, not a reasoning failure — raise num_ctx and run it again.
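That check can be automated from the JSON that /api/generate returns. The sketch below reads the `done_reason`, `prompt_eval_count` and `eval_count` fields of the response; how Ollama reports a full context may differ between versions, so treat the `"length"` test as an assumption and the token-count comparison as the fallback. `hit_token_wall` and `next_num_ctx` are hypothetical helper names.

```python
def hit_token_wall(resp: dict, num_ctx: int, margin: int = 128) -> bool:
    """Heuristic: did generation stop because it ran out of room?"""
    # ASSUMPTION: a length-limited stop is reported as done_reason == "length".
    if resp.get("done_reason") == "length":
        return True
    # Fallback: prompt plus generated tokens sit within `margin` of the window.
    used = resp.get("prompt_eval_count", 0) + resp.get("eval_count", 0)
    return used >= num_ctx - margin


def next_num_ctx(num_ctx: int, ceiling: int = 32_768) -> int:
    """Double the context for the retry, capped at a ceiling the card can hold."""
    return min(num_ctx * 2, ceiling)


# Illustrative values only, not a captured response.
resp = {"done_reason": "stop", "prompt_eval_count": 100, "eval_count": 16_284}
if hit_token_wall(resp, num_ctx=16_384):
    print("token wall hit, retry with num_ctx =", next_num_ctx(16_384))
```

The `num_ctx` value for the retry can be passed per request in the `options` object, so there is no need to rebuild the Modelfile for each attempt.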
The series has gone from reading agents to running them, and from "what fits in 4 GB" to "what's worth pointing 4 GB at." Next, the obvious question once you have a workhorse and a specialist on the same laptop: can a small model reliably decide, on its own, which of the two a given problem needs — and hand off without a human in the loop?
Jun 15, 2026




