
Last Update: May 12, 2026
By eric
I've been building a local AI inference server at home. Not for any grand reason — mostly curiosity, a bit of privacy paranoia about sending everything to OpenAI, and the guilty pleasure of watching a model load onto my own GPU. After a few weeks of downloading models, writing benchmark scripts, and debugging scoring bugs at midnight, I have real numbers to share.
This is not a theoretical comparison. Every number here came from actual inference on my machine.
The Hardware
- CPU: Intel Core i7-12700F (12 cores, 20 threads)
- RAM: 64GB DDR4
- GPU: NVIDIA GeForce RTX 3060 — 12GB VRAM
- OS: Debian 13
- Inference engine: llama.cpp in router (multi-model) mode
The RTX 3060 with 12GB VRAM is the constraint that shapes everything. It's a solid consumer card — fast, quiet, widely available — but 12GB means you need to think carefully about which models you can load, at what quantization, and at what context length.
The Setup: llama.cpp Router Mode
Rather than running a separate server process for each model, I use llama.cpp's --models-dir router mode. Point it at a directory of GGUF files, set --models-max 2 (keep up to 2 models hot in VRAM at once), and it handles loading and unloading automatically per request. It exposes an OpenAI-compatible API at http://localhost:8080/v1.
The systemd service looks like this:
[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 --port 8080 \
--models-dir /data/models \
--models-max 2 \
--threads 16 \
--ctx-size 8192 \
--n-gpu-layers 99 \
--cont-batching \
--flash-attn auto \
--log-disable
--n-gpu-layers 99 pushes every layer it can onto the GPU. --flash-attn auto enables Flash Attention when the model supports it. Models that don't fit entirely in VRAM will spill some layers to CPU — that's slow, and it shows up in the benchmark numbers.
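A quick sanity check that the router actually sees your GGUF files is to hit the /v1/models endpoint. A minimal Python example against the server configured above (field names follow the standard OpenAI list-models schema):

```python
import requests

# Ask the router which models it discovered in --models-dir
resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```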
The Benchmark
I wrote a small Python script that tests each model on three task categories:
- GSM8K — 10 grade-school math word problems. Tests arithmetic reasoning.
- MMLU — 10 multiple-choice questions across medicine, history, philosophy, science, and CS. Tests general knowledge.
- HumanEval — 5 Python coding tasks (sum_to_n, palindrome check, fizzbuzz, count vowels, flatten). Tests code generation.
Each request uses temperature=0 for determinism. Token speed (t/s) is measured from the timings.predicted_per_second field in the llama.cpp response.
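The driver is just a loop over models and task categories. A condensed sketch of how mine is organized; the TASKS entries here are illustrative stand-ins rather than the actual benchmark questions, score() is a hypothetical dispatcher for the per-category checks, and chat() is the request helper shown in the next section:

```python
# Condensed benchmark driver. TASKS entries are illustrative stand-ins,
# not the real benchmark questions; score() is a hypothetical dispatcher.
TASKS = {
    "gsm8k": [("A book costs $4. How much do 7 books cost? Answer with a number.", "28")],
    "mmlu":  [("Which organ produces insulin? A) Liver B) Pancreas C) Kidney D) Spleen. Answer A-D.", "B")],
    "code":  [("Write a Python function sum_to_n(n) that returns 1 + 2 + ... + n.", "sum_to_n")],
}

def run_model(model):
    results, speeds = {}, []
    for category, items in TASKS.items():
        correct = 0
        for prompt, expected in items:
            # Thinking models need a bigger token budget (see the next section)
            budget = 2048 if category == "code" else 1024
            text, tps = chat(model, prompt, max_tokens=budget)
            speeds.append(tps)
            correct += score(category, text, expected)
        results[category] = correct / len(items)
    results["t/s"] = sum(speeds) / len(speeds)
    return results
```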
Handling Thinking Models
Several models (Qwen3, DeepSeek-R1-Distill, Qwen3.5) are thinking or reasoning models that generate an internal chain of thought before giving a final answer. This created three real scoring headaches:
- llama.cpp routes Qwen3's thinking to a separate reasoning_content field in the API response, leaving content empty. A naive scorer that reads only content scores every Qwen3 response as 0%.
- 256 max_tokens is nowhere near enough. Thinking models spend most of their budget on reasoning and never emit a final answer. I bumped the limit to 1024 tokens for GSM8K/MMLU and 2048 for coding tasks.
- MMLU first-match failure. A reasoning model might mention option "A" in its thinking ("let's rule out A...") before concluding "the answer is B". Using re.search to grab the first letter match picks up the wrong answer, so I switched to re.findall and take the last [ABCD] match.
The fixed request helper and MMLU scorer:
import re
import requests

# The router's OpenAI-compatible chat endpoint (from the server config above)
BASE_URL = "http://localhost:8080/v1/chat/completions"

def chat(model, prompt, max_tokens=512):
    r = requests.post(BASE_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }, timeout=120)
    data = r.json()
    msg = data["choices"][0]["message"]
    # Qwen3 thinking models: answer is in reasoning_content, content is empty
    text = msg.get("content") or msg.get("reasoning_content", "")
    tps = data.get("timings", {}).get("predicted_per_second", 0)
    return text, tps

def score_mmlu(response, answer):
    stripped = strip_thinking(response)  # remove <think>...</think>
    for text in [stripped, response]:  # fall back to the full text if stripping leaves nothing
        matches = re.findall(r'\b([ABCD])\b', text.strip().upper())
        if matches and matches[-1] == answer:
            return True
    return False
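strip_thinking isn't shown above. Here is a minimal version, assuming the standard <think>...</think> tags (and dropping an unterminated block when the model hits the token limit mid-thought):

```python
import re

def strip_thinking(text):
    # Remove complete <think>...</think> blocks
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # If the closing tag never arrived (token limit hit mid-thought), drop the rest too
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()
```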
Getting these details right took most of the iteration time. The first run showed Qwen3-8B at 0% on everything — not because it's a bad model, but because the scoring was broken.
The Results
All models run as Q4_K_M quantization unless noted. 14B models were tested at --ctx-size 4096 (reduced from 8192) to fit the KV cache alongside the ~8.4GB weights within 12GB VRAM.
| Model | Params | GSM8K | MMLU | Code | Avg | t/s |
|-------|--------|-------|------|------|-----|-----|
| Meta-Llama-3.1-8B-Instruct-Q4_K_M | 8B | 80% | 80% | 100% | 87% | 64.5 |
| Qwen2.5-Coder-14B-Instruct-Q4_K_M | 14B | 80% | 80% | 100% | 87% | 35.8 |
| Llama-3.2-3B-Instruct-Q4_K_M | 3B | 70% | 80% | 100% | 83% | 128.3 |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | 7B | 40% | 100% | 100% | 80% | 69.3 |
| Qwen_Qwen3-14B-Q4_K_M | 14B | 50% | 70% | 100% | 73% | 33.4 |
| Llama-3.2-1B-Instruct-Q8_0 | 1B | 50% | 50% | 100% | 67% | 211.9 |
| Qwen_Qwen3-8B-Q4_K_M | 8B | 30% | 70% | 100% | 67% | 59.5 |
| DeepSeek-R1-Distill-Qwen-7B-Q4_K_M | 7B | 80% | 50% | 20% | 50% | 66.2 |
| Qwen3.5-9B-Q4_K_M | 9B | 20% | 70% | 20% | 37% | 49.9 |
| Qwen3.5-4B-Q4_K_M | 4B | 30% | 70% | 0% | 33% | 77.1 |
| MiMo-VL-7B-RL-q4_k_m | 7B | 10% | 50% | 20% | 27% | 65.6 |
| MiMo-7B-RL-Q4_K_M | 7B | 10% | 60% | 0% | 23% | 64.2 |
| THUDM_GLM-Z1-9B-0414-Q4_K_M | 9B | 0% | 0% | 0% | 0% | 51.3 † |

Models that failed to run:

- DeepSeek-R1-Distill-Qwen-14B-Q4_K_M (14B): wouldn't load at ctx=4096
- openai_gpt-oss-20b-Q4_K_M (20B): ~11.5GB, exceeds 12GB VRAM
- Kimi-Linear-48B-A3B-BPW2.6 (48B): corrupt download

† GLM-Z1 uses non-standard thinking tags that our scorer doesn't strip. Its real capability is unknown from this run.
What Surprised Me
The 3B Llama Punches Hard
Llama-3.2-3B-Instruct at 128 t/s and 83% average is the sleeper hit. It's nearly as capable as the 8B model, runs at twice the speed, and fits comfortably in VRAM. If you're building something latency-sensitive — a coding assistant, a quick-reply bot — this is the model I'd reach for first.
Qwen2.5-Coder-7B Is Surprisingly Well-Rounded
This model scored 100% on both MMLU and Code, and 40% on math. Named "Coder" but it's not just good at code — it's one of the most reliable general-purpose models in the lineup. At 69 t/s it's fast too.
DeepSeek-R1-7B Is a Math Specialist, Not a Generalist
80% on GSM8K math — tied with the best models in the field. But 50% on MMLU and only 20% on coding. The reasoning focus pays off for math and pays a price everywhere else. Use it when you specifically need step-by-step numerical reasoning.
20B Models Don't Fit
The openai_gpt-oss-20b model weighs in at about 11.5GB at Q4_K_M quantization, which leaves only about 500MB for the KV cache on a 12GB card: not viable. It did load partially with CPU offload, but responses were too slow to be usable and often wrong. If you want a 20B+ model, you either need more VRAM or a heavily quantized version (Q2 or below), which carries its own quality tradeoffs.
14B Models Need Careful Context Sizing
At --ctx-size 8192, 14B Q4_K_M models fail to load — the weights (~8.4GB) plus the KV cache exceed 12GB. Dropping to --ctx-size 4096 solved it for Qwen3-14B and Coder-14B. DeepSeek-R1-14B refused to cooperate even at 4096 — likely a memory fragmentation or architecture-specific issue.
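For intuition about where the memory goes: the KV cache grows linearly with context length, layer count, and the model's KV head dimensions, and it sits on top of the weights, compute buffers, and any second model kept hot under --models-max 2. A rough back-of-envelope helper; the layer and head numbers below are illustrative placeholders for a ~14B GQA model, not measured values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    # K and V caches: per layer, per token, n_kv_heads * head_dim elements each (fp16 = 2 bytes)
    return n_layers * 2 * ctx * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
# Illustrative placeholder architecture for a ~14B GQA model:
print(kv_cache_bytes(48, 8, 128, 8192) / GIB)  # ~1.5 GiB at ctx 8192
print(kv_cache_bytes(48, 8, 128, 4096) / GIB)  # ~0.75 GiB at ctx 4096
```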
GLM-Z1 Needs a Different Scorer
GLM-Z1 almost certainly isn't as bad as our 0% score suggests. It uses thinking token formats that aren't <think>...</think> tags, so our strip function leaves garbage in the text. This is on the benchmark, not the model. Something to fix in a future run.
Kimi-Linear-48B: So Close, Yet So Far
Kimi-Linear-48B is a Mixture-of-Experts model with 48B total parameters but only 3B active per token, so it should run at reasonable speed, and at BPW2.6 quantization the file is about 8.1GB, small enough to fit in 12GB VRAM. We downloaded it twice. The first download (via hf download) produced a file with 6.9MB of zeros at the start, i.e. an invalid GGUF magic. The second download, via wget on the direct URL, is in progress. Watch this space.
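Corruption like this is cheap to detect up front: every valid GGUF file starts with the four ASCII bytes "GGUF". A small Python check, equivalent to the od one-liner in the takeaways below:

```python
import sys

def looks_like_gguf(path):
    # Valid GGUF files begin with the 4-byte magic b"GGUF" (0x47 0x47 0x55 0x46)
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {'OK' if looks_like_gguf(path) else 'invalid GGUF header'}")
```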
The Hy-MT1.5 Translation Models
We also downloaded the full set of Hy-MT1.5 translation models from Tencent — 1.8B and 7B variants in multiple quantizations. These aren't benchmarked here because GSM8K/MMLU/HumanEval aren't the right tests for a translation model. A separate post will cover those with bilingual evaluation tasks.
Practical Takeaways
If you're setting up a similar server, here's what I'd do differently from day one:
1. Start with Llama-3.2-3B. It's the best quality-per-token-per-second model in this lineup. Keep it hot in VRAM for low-latency tasks.
2. Add Qwen2.5-Coder-7B for general use. It's well-rounded across knowledge and code, and fast enough for interactive use.
3. Use llama.cpp router mode from the start. Running one server that hot-swaps models is far better than spinning up individual servers. You get an OpenAI-compatible API, automatic model loading, and a simple /v1/models endpoint to see what's available.
4. Budget VRAM carefully for 14B models. You need --ctx-size 4096 (not 8192) to fit 14B Q4_K_M models. That's fine for most tasks, but be aware of the tradeoff.
5. Validate GGUF downloads before benchmarking. Check the first four bytes: they should be 47 47 55 46 (ASCII "GGUF"). A quick od -A x -t x1z model.gguf | head -1 saves hours of debugging.
6. Fix your thinking-model scorer before running anything. The reasoning_content split in llama.cpp, the max_tokens floor, and the last-match MMLU scoring all need to be right or you'll think every thinking model is broken.
What's Next
- Re-test GLM-Z1 with a proper thinking-tag stripper
- Kimi-Linear-48B, if the second download produces a valid file
- Translation quality tests for Hy-MT1.5
- A comparison post once larger VRAM cards (24GB+) are in reach
The benchmark script is straightforward Python using the OpenAI-compatible API — if you want to run it against your own setup, the core logic is about 100 lines. Happy to share it.
Tested on Debian 13, llama.cpp built from source (May 2026), RTX 3060 12GB, i7-12700F, 64GB RAM.