
Last Update: May 12, 2026


By eric


I've been building a local AI inference server at home. Not for any grand reason — mostly curiosity, a bit of privacy paranoia about sending everything to OpenAI, and the guilty pleasure of watching a model load onto my own GPU. After a few weeks of downloading models, writing benchmark scripts, and debugging scoring bugs at midnight, I have real numbers to share.

This is not a theoretical comparison. Every number here came from actual inference on my machine.

The Hardware

  • CPU: Intel Core i7-12700F (12 cores, 20 threads)
  • RAM: 64GB DDR4
  • GPU: NVIDIA GeForce RTX 3060 — 12GB VRAM
  • OS: Debian 13
  • Inference engine: llama.cpp in router (multi-model) mode

The RTX 3060 with 12GB VRAM is the constraint that shapes everything. It's a solid consumer card — fast, quiet, widely available — but 12GB means you need to think carefully about which models you can load, at what quantization, and at what context length.

The Setup: llama.cpp Router Mode

Rather than running a separate server process for each model, I use llama.cpp's --models-dir router mode. Point it at a directory of GGUF files, set --models-max 2 (keep up to 2 models hot in VRAM at once), and it handles loading and unloading automatically per request. It exposes an OpenAI-compatible API at http://localhost:8080/v1.

The systemd service looks like this:

ini
[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    --host 0.0.0.0 --port 8080 \
    --models-dir /data/models \
    --models-max 2 \
    --threads 16 \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    --cont-batching \
    --flash-attn auto \
    --log-disable

--n-gpu-layers 99 pushes every layer it can onto the GPU. --flash-attn auto enables Flash Attention when the model supports it. Models that don't fit entirely in VRAM will spill some layers to CPU — that's slow, and it shows up in the benchmark numbers.
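
Once the service is running, anything that speaks the OpenAI API can talk to it. Here's a minimal sanity check with plain requests (a sketch; the model id is just whatever GGUF file the router reports):

python
import requests

BASE = "http://localhost:8080/v1"

# Ask the router which GGUF files it can serve
models = requests.get(f"{BASE}/models").json()["data"]
print([m["id"] for m in models])

# One chat request; the router loads the model on demand
r = requests.post(f"{BASE}/chat/completions", json={
    "model": models[0]["id"],   # or any specific id from the list above
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
}, timeout=120)
print(r.json()["choices"][0]["message"]["content"])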

The Benchmark

I wrote a small Python script that tests each model on three task categories:

  • GSM8K — 10 grade-school math word problems. Tests arithmetic reasoning.
  • MMLU — 10 multiple-choice questions across medicine, history, philosophy, science, and CS. Tests general knowledge.
  • HumanEval — 5 Python coding tasks (sum_to_n, palindrome check, fizzbuzz, count vowels, flatten). Tests code generation.

Each request uses temperature=0 for determinism. Token speed (t/s) is measured from the timings.predicted_per_second field in the llama.cpp response.
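
Under the hood, every category is the same loop: send the prompt, score the completion, record tokens per second. A sketch of that driver, assuming tasks is a list of (prompt, expected) pairs and score_fn is the per-category scorer (the real chat() and scoring functions are shown further down):

python
# Sketch of the per-category driver loop. `tasks` holds (prompt, expected)
# pairs, `score_fn` is that category's scorer, and chat() is the request
# helper defined later in this post.
def run_category(model, tasks, score_fn, max_tokens=1024):
    correct, speeds = 0, []
    for prompt, expected in tasks:
        text, tps = chat(model, prompt, max_tokens=max_tokens)
        correct += bool(score_fn(text, expected))
        speeds.append(tps)
    return correct / len(tasks), sum(speeds) / len(speeds)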

Handling Thinking Models

Several models — Qwen3, DeepSeek-R1-Distill, Qwen3.5 — are thinking or reasoning models that generate an internal chain of thought before giving a final answer. This created three real scoring headaches:

  1. llama.cpp routes Qwen3's thinking to a separate reasoning_content field in the API response, leaving content empty. A naive scorer reading only content scores every Qwen3 response as 0%.

  2. 256 max_tokens is nowhere near enough. Thinking models use most of their budget on reasoning and never output a final answer. I bumped to 1024 tokens for GSM8K/MMLU and 2048 for coding tasks.

  3. MMLU first-match failure. A reasoning model might mention option "A" in its thinking ("let's rule out A...") before concluding "the answer is B". Using re.search to find the first letter match gets the wrong answer. I switched to re.findall and take the last [ABCD] match.

The fixed scoring functions:

python
import re
import requests

# The router's OpenAI-compatible chat endpoint (see the systemd unit above)
BASE_URL = "http://localhost:8080/v1/chat/completions"

def chat(model, prompt, max_tokens=512):
    r = requests.post(BASE_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }, timeout=120)
    data = r.json()
    msg = data["choices"][0]["message"]
    # Qwen3 thinking models: answer is in reasoning_content, content is empty
    text = msg.get("content") or msg.get("reasoning_content", "")
    tps = data.get("timings", {}).get("predicted_per_second", 0)
    return text, tps

def score_mmlu(response, answer):
    stripped = strip_thinking(response)   # remove <think>...</think>
    for text in [stripped, response]:     # fallback to full text if strip leaves nothing
        matches = re.findall(r'\b([ABCD])\b', text.strip().upper())
        if matches and matches[-1] == answer:
            return True
    return False
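
For completeness, strip_thinking is nothing fancy: a regex pass over <think>...</think> blocks. A minimal version looks roughly like this (it also drops a dangling <think> block left by a model that ran out of tokens mid-thought):

python
def strip_thinking(text):
    # Remove closed <think>...</think> blocks, then anything after an
    # unclosed <think> left behind when the token budget ran out
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    return text.strip()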

Getting these details right took most of the iteration time. The first run showed Qwen3-8B at 0% on everything — not because it's a bad model, but because the scoring was broken.

The Results

All models were run at Q4_K_M quantization unless noted. 14B models were tested at --ctx-size 4096 (reduced from 8192) to fit the KV cache alongside the ~8.4GB weights within 12GB VRAM.

Model                                    Params  GSM8K   MMLU   Code    Avg    t/s
─────────────────────────────────────────────────────────────────────────────────
Meta-Llama-3.1-8B-Instruct-Q4_K_M        8B     80%    80%   100%    87%   64.5
Qwen2.5-Coder-14B-Instruct-Q4_K_M       14B     80%    80%   100%    87%   35.8
Llama-3.2-3B-Instruct-Q4_K_M             3B     70%    80%   100%    83%  128.3
Qwen2.5-Coder-7B-Instruct-Q4_K_M         7B     40%   100%   100%    80%   69.3
Qwen_Qwen3-14B-Q4_K_M                   14B     50%    70%   100%    73%   33.4
Llama-3.2-1B-Instruct-Q8_0               1B     50%    50%   100%    67%  211.9
Qwen_Qwen3-8B-Q4_K_M                     8B     30%    70%   100%    67%   59.5
DeepSeek-R1-Distill-Qwen-7B-Q4_K_M       7B     80%    50%    20%    50%   66.2
Qwen3.5-9B-Q4_K_M                        9B     20%    70%    20%    37%   49.9
Qwen3.5-4B-Q4_K_M                        4B     30%    70%     0%    33%   77.1
MiMo-VL-7B-RL-q4_k_m                     7B     10%    50%    20%    27%   65.6
MiMo-7B-RL-Q4_K_M                        7B     10%    60%     0%    23%   64.2
THUDM_GLM-Z1-9B-0414-Q4_K_M              9B      0%     0%     0%     0%   51.3 †
─────────────────────────────────────────────────────────────────────────────────
DeepSeek-R1-Distill-Qwen-14B-Q4_K_M     14B    FAIL — wouldn't load at ctx=4096
openai_gpt-oss-20b-Q4_K_M               20B    FAIL — ~11.5GB, exceeds 12GB VRAM
Kimi-Linear-48B-A3B-BPW2.6              48B    FAIL — corrupt download

† GLM-Z1 uses non-standard thinking tags that my scorer doesn't strip. Its real capability is unknown from this run.

What Surprised Me

The 3B Llama Punches Hard

Llama-3.2-3B-Instruct at 128 t/s and 83% average is the sleeper hit. It's nearly as capable as the 8B model, runs at twice the speed, and fits comfortably in VRAM. If you're building something latency-sensitive — a coding assistant, a quick-reply bot — this is the model I'd reach for first.

Qwen2.5-Coder-7B Is Surprisingly Well-Rounded

This model scored 100% on both MMLU and Code, and 40% on math. It's named "Coder," but it's not just good at code — it's one of the most reliable general-purpose models in the lineup. At 69 t/s it's fast too.

DeepSeek-R1-7B Is a Math Specialist, Not a Generalist

80% on GSM8K math — tied with the best models in the field. But 50% on MMLU and only 20% on coding. The reasoning focus pays off for math and pays a price everywhere else. Use it when you specifically need step-by-step numerical reasoning.

20B Models Don't Fit

The openai_gpt-oss-20b model is about 11.5GB at Q4_K_M quantization. That leaves only 500MB for the KV cache — not viable. It loaded partially with CPU offload, but responses were unusably slow and wrong. If you want a 20B+ model, you either need more VRAM or a heavily quantized version (Q2 or less), which carries its own quality tradeoffs.

14B Models Need Careful Context Sizing

At --ctx-size 8192, 14B Q4_K_M models fail to load — the weights (~8.4GB) plus the KV cache and compute buffers exceed 12GB. Dropping to --ctx-size 4096 solved it for Qwen3-14B and Coder-14B. DeepSeek-R1-14B refused to cooperate even at 4096 — likely a memory fragmentation or architecture-specific issue.
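
The KV cache is the part that scales with context length, so it's worth a back-of-envelope estimate. The numbers below assume roughly Qwen2.5-14B-class GQA dimensions and an f16 cache; they're illustrative only, and llama.cpp also allocates compute buffers on top of this, which is part of what pushes ctx 8192 over the edge:

python
# Rough KV cache size for an f16 cache. Defaults approximate a
# Qwen2.5-14B-class GQA layout (48 layers, 8 KV heads, head dim 128)
# and ignore compute buffers and anything else resident in VRAM.
def kv_cache_gib(ctx, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return ctx * per_token / 2**30

for ctx in (4096, 8192):
    print(f"ctx={ctx}: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
# ctx=4096: ~0.75 GiB, ctx=8192: ~1.50 GiB, on top of the ~8.4GB of weights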

GLM-Z1 Needs a Different Scorer

GLM-Z1 almost certainly isn't as bad as my 0% score suggests. It uses thinking tags that aren't the standard <think>...</think> format, so my strip function leaves the chain of thought in the scored text. This is on the benchmark, not the model. Something to fix in a future run.

Kimi-Linear-48B: So Close, Yet So Far

Kimi-Linear-48B is a Mixture-of-Experts model with 48B total parameters but only 3B active per token, so it should run at reasonable speed, and at BPW2.6 quantization the ~8.1GB file should fit in 12GB VRAM. I downloaded it twice. The first download (via hf download) produced a file with 6.9MB of zeros at the start — invalid GGUF magic. The second download, via wget against the direct URL, is in progress. Watch this space.

The Hy-MT1.5 Translation Models

I also downloaded the full set of Hy-MT1.5 translation models from Tencent — 1.8B and 7B variants in multiple quantizations. They aren't benchmarked here because GSM8K/MMLU/HumanEval aren't the right tests for a translation model. A separate post will cover them with bilingual evaluation tasks.

Practical Takeaways

If you're setting up a similar server, here's what I'd do differently from day one:

1. Start with Llama-3.2-3B. It has the best quality-to-speed tradeoff in this lineup. Keep it hot in VRAM for low-latency tasks.

2. Add Qwen2.5-Coder-7B for general use. It's well-rounded across knowledge and code, and fast enough for interactive use.

3. Use llama.cpp router mode from the start. Running one server that hot-swaps models is far better than spinning up individual servers. You get an OpenAI-compatible API, automatic model loading, and a simple /v1/models endpoint to see what's available.

4. Budget VRAM carefully for 14B models. You need --ctx-size 4096 (not 8192) to fit 14B Q4_K_M models. That's fine for most tasks, but be aware of the tradeoff.

5. Validate GGUF downloads before benchmarking. Check the first four bytes: they should be 47 47 55 46 (ASCII "GGUF"). A quick od -A x -t x1z model.gguf | head -1 saves hours of debugging; the same check in Python is sketched after this list.

6. Fix your thinking-model scorer before running anything. The reasoning_content split in llama.cpp, the max_tokens floor, and the last-match MMLU scoring all need to be right or you'll think every thinking model is broken.
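
The GGUF magic-byte check from point 5 in Python form (a sketch that only validates the header, not the rest of the file; the path is just an example):

python
def looks_like_gguf(path):
    # A valid GGUF file starts with the 4-byte magic b"GGUF" (47 47 55 46)
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

print(looks_like_gguf("/data/models/example-model.gguf"))  # example path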

What's Next

  • Re-test GLM-Z1 with a proper thinking-tag stripper
  • Kimi-Linear-48B, if the second download produces a valid file
  • Translation quality tests for Hy-MT1.5
  • A comparison post once larger VRAM cards (24GB+) are in reach

The benchmark script is straightforward Python using the OpenAI-compatible API — if you want to run it against your own setup, the core logic is about 100 lines. Happy to share it.


Tested on Debian 13, llama.cpp built from source (May 2026), RTX 3060 12GB, i7-12700F, 64GB RAM.
