
Last Update: June 15, 2026


By eric

In our 4 GB experiment the most jarring result was a flat 0%: a small coder model, given proper tools and asked to use them, never once produced a usable native tool call — across 84 attempts. It clearly knew what to do; it just couldn't do the one specific thing the protocol required.

That result only makes sense if you know exactly what a "native tool call" is. Most explanations wave at it — "the model calls a function" — which hides the part that actually matters. So let's be precise: what is a native tool call, what happens on the wire, and why is it a thing a model can simply fail to produce?

Start from the uncomfortable fact: an LLM only outputs text

A language model does exactly one thing: given some text, it predicts more text, one token at a time. It cannot open a file, run a command, or call an API. It has no hands. So when you hear "the AI used a tool," something had to bridge the gap between text the model wrote and an action your computer took.

That something is the harness — the program you write around the model. One note on pronouns, because the split is the whole point of this post: throughout, "you" and "your code" mean that harness, never the model itself. The model only ever produces text; "you" are the code that reads it and acts. The model asks; you do. (A "coding agent" as a product — Claude Code, Cursor, Codex — is the two halves together; here, "you" is the harness half.)

With that straight, the question becomes a practical one: how does the harness turn the model's text into a real action? It turns out there are only two ways to build that bridge.

Do-it-yourself (prompt-based). You write in the system prompt: "When you want to read a file, reply with READ: <path> and nothing else." The model writes READ: config.py; your code watches the output, recognises the pattern, runs the read, and pastes the result back into the conversation. This works, but you invented the format, you parse it, and the model is just following instructions in prose. The model has no special knowledge that READ: means anything.
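To make the do-it-yourself bridge concrete, here is a minimal sketch of the harness side in Python. The READ: convention is the one described above; the function names (parse_read_request, handle_reply) and the error string are invented for illustration, and a real harness would also need to sandbox the path.

```python
import re
from pathlib import Path
from typing import Optional

# The made-up convention: the whole reply must be "READ: <path>" and nothing else.
READ_PATTERN = re.compile(r"^READ:\s*(?P<path>\S.*?)\s*$")


def parse_read_request(model_output: str) -> Optional[str]:
    """Return the requested path if the entire reply is a READ command, else None."""
    match = READ_PATTERN.match(model_output.strip())
    return match.group("path") if match else None


def handle_reply(model_output: str, root: Path) -> Optional[str]:
    """Run the read if the model asked for one; return the text to paste back."""
    path = parse_read_request(model_output)
    if path is None:
        return None  # ordinary prose: nothing for the harness to do
    target = root / path  # note: no path sandboxing in this sketch
    if not target.is_file():
        return f"ERROR: {path} not found"
    return target.read_text(encoding="utf-8")
```

Everything here is your responsibility: the pattern, the strictness ("and nothing else"), and the error text the model sees when the file is missing.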

Native tool calling. Instead of inventing a format, you use a standardised protocol that the API provider defines and the model was specifically trained to speak. This is the thing people mean by "native tool calls" (a.k.a. "function calling"). It's worth seeing the whole exchange, because the magic is in the structure.

What actually happens on the wire

Step 1 — you declare the tools in the request. Alongside the messages, you send a list of tool definitions: a name, a description, and a JSON Schema for the arguments.

json
{
  "model": "...",
  "messages": [{"role": "user", "content": "What's in config.py?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "read_file",
      "description": "Read a file from disk.",
      "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]
      }
    }
  }]
}
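If you build that request in code rather than by hand, it is just nested dictionaries. A small sketch, assuming the OpenAI-style shape shown above; function_tool, build_request, and the model name "example-model" are placeholders, not part of any library.

```python
import json
from typing import Any


def function_tool(
    name: str, description: str, properties: dict[str, Any], required: list[str]
) -> dict[str, Any]:
    """Build one tool definition: a name, a description, and a JSON Schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def build_request(model: str, user_text: str, tools: list[dict[str, Any]]) -> dict[str, Any]:
    """Assemble the request body: the messages plus the declared tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
    }


read_file_tool = function_tool(
    "read_file", "Read a file from disk.", {"path": {"type": "string"}}, ["path"]
)
request = build_request("example-model", "What's in config.py?", [read_file_tool])
print(json.dumps(request, indent=2))
```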

Step 2 — the model replies with a structured tool call, not prose. This is the crux. The response does not put the request in the text content field. It puts it in a separate, dedicated tool_calls field:

json
{
  "role": "assistant",
  "content": null,
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "read_file",
      "arguments": "{\"path\": \"config.py\"}"
    }
  }]
}

Notice: content is null. The model isn't talking; it's requesting an action, in a machine-readable slot with a name, JSON arguments, and a unique id.

Step 3 — your code executes it. Your harness reads tool_calls, sees read_file, parses the arguments, actually reads the file, and captures the contents. The model did not run anything — it asked, and your code obeyed.
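Step 3 is the only part that touches your machine, so it is worth sketching. The version below assumes the OpenAI-style tool_calls entry shown in Step 2; the TOOLS registry, execute_tool_call, and the error strings are illustrative names, not a real library's API.

```python
import json
from typing import Any, Callable


def read_file(path: str) -> str:
    """The real action: this runs in the harness, not in the model."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Registry mapping declared tool names to the Python functions that implement them.
TOOLS: dict[str, Callable[..., str]] = {"read_file": read_file}


def execute_tool_call(call: dict[str, Any], tools: dict[str, Callable[..., str]] = TOOLS) -> str:
    """Execute one tool_calls entry and return the text to send back to the model."""
    name = call["function"]["name"]
    if name not in tools:
        return f"ERROR: unknown tool {name!r}"
    try:
        # "arguments" arrives as a JSON-encoded string, so it must be decoded first.
        arguments = json.loads(call["function"]["arguments"])
    except json.JSONDecodeError as exc:
        return f"ERROR: arguments are not valid JSON: {exc}"
    if not isinstance(arguments, dict):
        return "ERROR: arguments must be a JSON object"
    try:
        return tools[name](**arguments)
    except (TypeError, OSError) as exc:
        # Wrong argument names or a missing file become text the model can react to.
        return f"ERROR: {exc}"
```

Returning errors as strings, instead of raising, is a design choice: it lets the model see what went wrong and try again on the next turn.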

Step 4 — you hand the result back, tied to the call's id.

json
{"role": "tool", "tool_call_id": "call_abc123", "content": "DEBUG = True\nPORT = 8080"}

Step 5 — the model continues, now with the file contents in context — either answering the user, or emitting another tool call. Loop until it stops calling tools. That loop is the entire mechanism behind every coding agent you've used.
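The five steps fit in a short loop. The sketch below is self-contained: scripted_chat is a stand-in that replays the exchange from this post instead of calling a real API, fake_disk replaces the filesystem, and run_agent and max_turns are names chosen for illustration.

```python
import json
from typing import Any, Callable

Message = dict[str, Any]


def run_agent(
    chat: Callable[[list[Message]], Message],
    tools: dict[str, Callable[..., str]],
    user_text: str,
    max_turns: int = 8,
) -> str:
    """Loop: ask the model, run any tool calls, feed results back, repeat."""
    messages: list[Message] = [{"role": "user", "content": user_text}]
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content") or ""  # no tool call: the model is done
        for call in calls:
            arguments = json.loads(call["function"]["arguments"])
            output = tools[call["function"]["name"]](**arguments)
            # Step 4: the result goes back tied to the call's id.
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": output})
    raise RuntimeError("model kept calling tools past max_turns")


def scripted_chat(messages: list[Message]) -> Message:
    """Stand-in for a model: call read_file once, then answer from the result."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "PORT is 8080."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "read_file", "arguments": "{\"path\": \"config.py\"}"},
        }],
    }


fake_disk = {"config.py": "DEBUG = True\nPORT = 8080"}
answer = run_agent(scripted_chat, {"read_file": lambda path: fake_disk[path]}, "What's in config.py?")
print(answer)  # prints: PORT is 8080.
```

The max_turns cap is an addition of this sketch, there so a model that never stops calling tools cannot spin forever.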

(Anthropic's API uses a slightly different shape, with tool_use content blocks and a tool_result block instead of a tool_calls array and a tool role, but the pattern is the same: structured request, you execute, structured result back.)
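To see how close the two shapes are, here is a sketch that maps the OpenAI-style messages from Steps 2 and 4 onto Anthropic-style blocks. The field names (tool_use, input, tool_result, tool_use_id) follow Anthropic's documented format as I understand it; the converter functions themselves are hypothetical, and reusing the same id across providers is only for illustration.

```python
import json
from typing import Any


def to_tool_use_block(call: dict[str, Any]) -> dict[str, Any]:
    """Map one tool_calls entry to a tool_use content block."""
    return {
        "type": "tool_use",
        "id": call["id"],
        "name": call["function"]["name"],
        # Here the arguments travel as a parsed object, not a JSON string.
        "input": json.loads(call["function"]["arguments"]),
    }


def to_tool_result_message(tool_message: dict[str, Any]) -> dict[str, Any]:
    """Map a role=tool message to a user message carrying a tool_result block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_message["tool_call_id"],
            "content": tool_message["content"],
        }],
    }
```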

So what does "native" actually mean?

Three things, and all three matter:

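  1. A standardised wire format. The tool_calls field is part of the API contract, and agent frameworks such as Codex, Pi, Cursor, and LangChain all read it. You don't invent a format or scrape calls out of prose: the envelope (name, arguments, id) always arrives in the same place, and the only parsing left is JSON-decoding the arguments string.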
  2. The model was trained to emit it. During fine-tuning, the model was taught a special internal format for tool calls — usually delimited by control tokens (Qwen and many others wrap calls in <tool_call> … </tool_call> tags; other model families use other schemes). It learned when to call a tool versus when to talk, and to put valid JSON in the arguments. This is a learned skill, not a parsing trick.
  3. The serving layer parses (and often constrains) it. The inference server applies the model's chat template, watches for those control tokens in the stream, and converts them into the clean tool_calls field you receive. Some servers go further and use constrained decoding — a grammar that forces the model to emit only tokens that keep the JSON valid.

Put together: the model speaks a format it was trained on, the server translates it into a standard field, and your code reads that field. "Native" means the whole stack agrees on the protocol, end to end.
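The third item, the serving layer's translation step, can be sketched in a few lines. This is a simplified illustration, not any particular server's implementation: it assumes the <tool_call> tag convention mentioned above with a JSON body of the form {"name": ..., "arguments": {...}}, and the generated call ids are made up.

```python
import json
import re
import uuid
from typing import Any

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_raw_completion(raw: str) -> dict[str, Any]:
    """Turn raw model text into an assistant message with a tool_calls field."""
    tool_calls = []
    for body in TOOL_CALL_RE.findall(raw):
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed body: dropped here; a real server must decide what to do
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:12]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # Re-encode so the client receives arguments as a JSON string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Whatever is left outside the tags is ordinary text content.
    text = TOOL_CALL_RE.sub("", raw).strip()
    message: dict[str, Any] = {"role": "assistant", "content": text or None}
    if tool_calls:
        message["tool_calls"] = tool_calls
    return message
```

A model that never emits the tags gives this parser nothing to find, so the message comes back with no tool_calls field at all.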

| | Prompt-based (DIY) | Native tool call |
|---|---|---|
| Format | you invent it | standard, defined by the API |
| Where the call appears | in the text content | in the structured tool_calls field |
| Who parses it | your code, with regex/string matching | the model server, reliably |
| Model awareness | just following prose instructions | trained on the format; knows when to call |
| Arguments | whatever the model writes | JSON validated against a schema |
| Multiple/parallel calls | you design it yourself | built in |
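What "validated against a schema" means for the arguments can be shown with a deliberately tiny checker. This is a sketch covering only required keys and a few primitive types, not a full JSON Schema validator; check_arguments and its messages are invented for illustration.

```python
import json
from typing import Any

# JSON Schema type names mapped to Python types (a deliberately small subset).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_arguments(arguments_json: str, parameters: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the arguments pass."""
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(arguments, dict):
        return ["arguments must be a JSON object"]
    problems = [
        f"missing required argument: {key}"
        for key in parameters.get("required", [])
        if key not in arguments
    ]
    for key, value in arguments.items():
        declared = parameters.get("properties", {}).get(key, {}).get("type")
        expected = JSON_TYPES.get(declared)
        if expected is None:
            continue  # undeclared key or unsupported type: ignored in this sketch
        # bool is a subclass of int in Python, so only accept it for "boolean".
        is_bool_mismatch = isinstance(value, bool) and declared != "boolean"
        if is_bool_mismatch or not isinstance(value, expected):
            problems.append(f"{key}: expected {declared}")
    return problems
```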

Why a model can fail to make one — the 0%

Here's the punchline that the experiment delivered. Native tool calling is a learned behaviour. A model that wasn't trained heavily on the format will, when handed tools, do the intuitively reasonable thing: it writes the tool call as text, in the content field, often as a tidy JSON code block:

text
I'll read the file.
```json
{"name": "read_file", "arguments": {"path": "config.py"}}
```

That is correct intent in the wrong place. The JSON itself is perfect, but it arrives as text in the content field, not as a tool_calls entry. The harness looks at the tool_calls field, finds it empty, and does nothing. The model "called the tool" the way a person points at a door instead of opening it. That is precisely why qwen2.5-coder:3b scored 0/84 on native tools while sailing through whole-file editing: it can write the JSON, it just hasn't been trained to put it in the structured channel. Tool-calling is a separate capability from coding: a model can be excellent at one and incapable of the other.
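The failure mode is easy to detect once you know to look for it. The sketch below sorts a reply into three buckets; classify_reply and the bucket labels are invented for this post, and the check only covers the fenced-JSON case shown above.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # three backticks, built this way so the listing embeds cleanly
FENCED_JSON_RE = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def classify_reply(message: dict[str, Any]) -> str:
    """Return "native", "call_in_text", or "prose" for one assistant message."""
    if message.get("tool_calls"):
        return "native"  # the structured channel was used
    content = message.get("content") or ""
    for candidate in FENCED_JSON_RE.findall(content):
        try:
            payload = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            return "call_in_text"  # right intent, wrong place
    return "prose"
```

A harness that only handles the "native" bucket treats "call_in_text" exactly like "prose", which is how a model that knows what to do can still score zero.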

The fix: if the call is in the text, read it from the text

The natural response to that 0% is: the JSON is sitting right there — why not just grab it and run it? You can. And doing exactly that is prompt-based tool calling — the do-it-yourself bridge from the top of this post. You stop trusting the tool_calls field, scrape the {"name": ..., "arguments": ...} object out of the message content, and execute it yourself. In our experiment that one change took the small model from 0% to ~56%: it could suddenly drive a real read → edit → run loop.

So native and DIY aren't rival technologies — they're two implementations of one idea: turn the model's output into an executed function. Native puts the call in a guaranteed field the model was trained to fill; DIY scrapes it from wherever the model happened to put it. The trade-off is clean:

  • Native — rock-solid (standard field, schema-checked arguments, call IDs, the model's trained sense of when to call) — but only works if the model was trained for the format.
  • DIY scraping — works with any model that can write JSON — but you own all the fragility. "Just grab the JSON" stays trivial right up until the model wraps it in a code fence one turn, in <tool_call> tags the next, bare in prose the next, emits two of them, or writes malformed JSON. A real scraper is a careful little parser, not a one-line regex.
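Here is roughly what that careful little parser can look like. Rather than matching a regex, this sketch tries to decode a JSON object at every opening brace using the standard library's JSONDecoder.raw_decode, so fences, tags, and surrounding prose are simply skipped. The function name and the accepted shape (a string name plus an object for arguments) are assumptions for illustration, not the scraper used in the experiment.

```python
import json
from typing import Any

_DECODER = json.JSONDecoder()


def scrape_tool_calls(content: str) -> list[dict[str, Any]]:
    """Find every {"name": ..., "arguments": {...}} object embedded in text."""
    calls: list[dict[str, Any]] = []
    position = 0
    while True:
        start = content.find("{", position)
        if start == -1:
            break
        try:
            payload, end = _DECODER.raw_decode(content, start)
        except json.JSONDecodeError:
            position = start + 1  # not valid JSON from here; try the next brace
            continue
        position = end  # skip the whole object so its inner braces are not rescanned
        if (
            isinstance(payload, dict)
            and isinstance(payload.get("name"), str)
            and isinstance(payload.get("arguments"), dict)
        ):
            calls.append({"name": payload["name"], "arguments": payload["arguments"]})
    return calls
```

This handles the code fence, the <tool_call> tags, the bare-in-prose case, and multiple calls in one reply, and it silently skips malformed JSON. It still leaves decisions to you, such as whether to run two calls or only the first, and what to tell the model when nothing parses.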

And the honest punchline from the experiment: scraping rescued the capability (0% → 56%), but the small model still scored higher — 71% — by skipping tools entirely and just writing the whole file. For a single edit, the deepest fix isn't a smarter parser; it's not asking for a tool call at all.
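For comparison, the no-tools path needs very little harness. The sketch below assumes one plausible whole-file convention: you ask the model to reply with the complete new file in a single fenced code block, and the harness writes it out. That convention and the function names are assumptions for illustration; the experiment's exact edit format is not shown here.

```python
import re
from pathlib import Path
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way so the listing embeds cleanly
CODE_BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_whole_file(reply: str) -> Optional[str]:
    """Return the file body if the reply holds exactly one fenced block."""
    blocks = CODE_BLOCK_RE.findall(reply)
    if len(blocks) != 1:
        return None  # missing or ambiguous: refuse rather than guess
    return blocks[0] + "\n"


def apply_whole_file_edit(reply: str, target: Path) -> bool:
    """Overwrite target with the model's version of the file, if there is one."""
    new_text = extract_whole_file(reply)
    if new_text is None:
        return False
    target.write_text(new_text, encoding="utf-8")
    return True
```

There is no call to detect and no arguments to decode: the model only has to do the thing it is already good at, which is writing code.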

The one-paragraph version

A native tool call is not the AI running sed, and it's not the model "deciding" to do something magical. It is a structured, schema-validated request that the model emits in a dedicated field — a field it was specifically trained to produce and that the inference server reliably parses — saying "please run function X with arguments Y." Your code runs it and feeds the result back. The whole loop of read → edit → run → fix that powers every coding agent is just this protocol, repeated. And because it's a trained behaviour living in a specific channel, it's something a model can know the answer for and still completely fail to deliver — which is exactly why, for small local models, you're often better off skipping it and letting the model just write the file.


Copyright ©  2026  TYO Lab