
Last Update: June 13, 2026


By eric


In the first post we compared three agents; in the second we read six harnesses and found they had all evolved the same five organs, differing in only two places. One was how much they sandbox. The other was what survives between sessions — and we bet that's where the next round of competition happens, because raw per-session capability is converging fast.

This post is about that second gap. It has two halves that get conflated constantly: skills (knowledge the agent can apply) and memory (facts that persist). We'll look at what each actually is in the code, why a capable model still needs both, why a human still has to write some of it by hand, and what "self-improving" means once you delete the marketing copy. Same six codebases as before, plus a news story you've probably seen.

A skill is just a file the agent reads when it needs to

Start with the mechanism, because it's simpler than the word suggests. Across Pi, Hermes, OpenCode, Kimi Code, Gemini CLI and Codex, a "skill" is the same thing: a directory containing a SKILL.md file — a bit of YAML frontmatter (a name, a one-line description) followed by markdown instructions, optionally bundled with scripts/, templates/ and references/. All six landed on a shared open format, agentskills.io, independently enough that the convergence is the story.
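
To make the format concrete, here is a minimal sketch of reading one skill file. It assumes the frontmatter is a flat block of `key: value` lines between two `---` delimiters and handles only that subset of YAML. The function name `parse_skill` and the sample skill are hypothetical and not taken from any of the six harnesses.

```python
from typing import Dict, Tuple

# A made-up skill, used only for illustration.
SAMPLE_SKILL = """\
---
name: release-prod
description: Cut and ship a production release of the web app.
---
# Release to production

1. Bump the changelog.
2. Run `make release ENV=prod`.
"""


def parse_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, markdown body).

    Assumption: frontmatter is flat `key: value` lines between two `---`
    lines. Real YAML is richer; a production parser would use a YAML library.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter delimiter")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("missing closing frontmatter delimiter") from None

    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        meta[key.strip()] = value.strip()

    # Everything after the closing delimiter is the instruction body.
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

Calling `parse_skill(SAMPLE_SKILL)` yields the two frontmatter fields as a dictionary and the markdown instructions as a separate string, which is the split the rest of the mechanism depends on.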

The clever part is how a skill enters the model's attention. You cannot paste a hundred procedures into every prompt: it would blow the context budget and dilute the model's focus (the context-engineering problem from the last post). So skills use progressive disclosure: only each skill's name and one-line description sit in the system prompt, a few dozen tokens apiece, and the agent loads the full file on demand when a task matches. Pi invokes a skill with /skill:name; Hermes ships 73 skills in its prompt index and reads the body via skill_view; Gemini gates the load behind a consent prompt. In every case, an agent can carry a library of hundreds of procedures at a standing cost of almost nothing, paying the full token price only for the one it actually opens.
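
Progressive disclosure is easy to sketch. The version below is an illustration under assumptions, not any harness's actual code: the `Skill` and `SkillLibrary` names, the index line format, and the two example skills are all hypothetical. It shows the two access paths, a short index that is always in the prompt and a full body that is fetched only on request.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # one line; part of the standing prompt
    body: str         # full instructions; never part of the standing prompt


class SkillLibrary:
    """Hypothetical in-memory library keyed by skill name."""

    def __init__(self, skills: Iterable[Skill]) -> None:
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}

    def prompt_index(self) -> str:
        """The only skill text that sits in every system prompt."""
        ordered = sorted(self._skills.values(), key=lambda s: s.name)
        return "\n".join(f"- {s.name}: {s.description}" for s in ordered)

    def load(self, name: str) -> str:
        """Return the full body, paid for only when a task matches."""
        return self._skills[name].body


library = SkillLibrary([
    Skill("release-prod", "Cut and ship a production release.", "step\n" * 200),
    Skill("pdf-summary", "Summarise a PDF into bullet points.", "step\n" * 200),
])
```

Using character counts as a crude stand-in for tokens, `library.prompt_index()` here is roughly a hundred characters, while each `library.load(...)` call returns a thousand. The standing cost scales with the number of skills times one line, not with the size of the procedures.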

Why a capable model still needs them

The obvious objection: today's models can already write a deploy script or summarise a PDF. Why bother? Because capability and reliability are different things, and skills buy four things the raw model doesn't have:

Knowledge it cannot possibly know. The model knows how to deploy a web app. It does not know that yours is make release ENV=prod, that you bump the changelog first, and that staging DNS takes five minutes so don't panic. None of that is in any training set. A skill is where private, environment-specific knowledge lives — and in practice that's most of what real skills contain.

Variance reduction. Ask a model to do a ten-step task cold and it succeeds most of the time, a different way each run. A skill pins the procedure — the exact commands, the order, the known failure modes — turning a fuzzy capability into a repeatable one. It's the difference between a brilliant temp and an employee with the operations manual.

Bundled executables, not regenerated code. A skill ships its scripts/. Instead of re-deriving a 200-line conversion script every time (slow, token-hungry, occasionally subtly wrong), the model runs the vetted one that worked last month. Its job shrinks from authoring the solution to invoking it.

Punching above the model's weight. A mid-tier model with a precise skill routinely beats a frontier model improvising, because the hard part — knowing the right procedure — has been moved out of the weights and into a file. That's a direct cost lever: routine work can run on cheaper models if the skill carries the expertise.

The honest caveat: for a one-off task well within the model's competence, a skill is pure overhead. Skills pay off on the recurring, local, and high-variance work — which is exactly why every harness puts "add a skill" as the cheapest rung on its capability ladder, below tools, plugins and core changes.

Memory, surprisingly, is also just files

If skills are how to do things, memory is what's true — your name, your stack, the decision you made last Tuesday. And here the codebases delivered the most counterintuitive finding of the whole study: the industry is converging on memory being plain markdown files the agent edits.

The sharpest example is Gemini CLI. Earlier versions had a save_memory tool. It's gone — the current system prompt literally states "There is no save_memory tool" and instructs the model to edit GEMINI.md / MEMORY.md directly, like a person updating a notes file. Hermes does the same with a thin memory tool writing to a hard-capped MEMORY.md (2,200 characters) and a separate USER.md profile (1,375 characters) — the caps are deliberate, forcing the agent to curate rather than hoard. Context files like AGENTS.md/GEMINI.md are the read side of the same idea: human- or agent-authored facts, loaded into every relevant prompt (Gemini even does it just-in-time, injecting a directory's GEMINI.md only when a tool touches that directory).
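
A hard cap is simple to enforce in code. The sketch below uses the 2,200-character figure quoted above for MEMORY.md, but everything else is an assumption: the post does not say how Hermes reacts when a write would exceed the cap, so rejecting the write and asking for curation is just one plausible policy, and `append_memory` and `MemoryFullError` are hypothetical names.

```python
from pathlib import Path

MEMORY_CAP = 2200  # characters, the MEMORY.md cap quoted above


class MemoryFullError(Exception):
    """Raised when a write would push the file past its cap."""


def append_memory(path: Path, fact: str, cap: int = MEMORY_CAP) -> None:
    """Append one fact as a markdown bullet, refusing to exceed the cap.

    Assumed policy: reject the write so the caller has to rewrite or drop
    older entries first. The file is left untouched on rejection.
    """
    current = path.read_text(encoding="utf-8") if path.exists() else ""
    updated = current + f"- {fact.strip()}\n"
    if len(updated) > cap:
        raise MemoryFullError(
            f"{len(updated)} chars would exceed the {cap}-char cap; "
            "rewrite or remove older entries first"
        )
    path.write_text(updated, encoding="utf-8")
```

The point of failing loudly rather than truncating is that the agent is pushed to decide what is still worth keeping, which is the curation pressure the cap exists to create.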

Why did files win over databases and vector stores? Because they're inspectable, editable, diffable, version-controllable, and they fail in obvious ways. You can open your agent's memory in a text editor and see exactly what it thinks it knows — and fix it. The one prominent exception proves the rule: Hermes can plug in Honcho, a dialectic user-modelling service that builds a theory-of-mind of the user behind the scenes. It's more powerful and far less legible — and notably it's an optional provider, not the default. For most purposes, a capped markdown file the agent rewrites is the whole feature.

Why a human still has to write some of it

Now the news story. Through 2026 there were reports of laid-off engineers being asked, as a condition of leaving, to write up their skills — to serialise what they knew into the kind of files we've just described. It sounds dystopian, and partly it is, but it also points at something real about the limits of self-extraction.

An agent can only learn from what flowed through its own context — its transcripts, its tool outputs, its own successes and failures. That's a tiny, biased slice of what's actually known. The departing engineer holds the rest, and four kinds of it can never be mined from logs:

  • Tacit knowledge with no digital trace. Why service X must restart before service Y, learned during a 2 a.m. outage. Which "refactorable-looking" module is secretly load-bearing. It lives in a head, not a file.
  • Negative knowledge. The migration that was tried and rolled back; the flag that caused a near-miss. Failures get cleaned up, not documented, so they leave nothing for an agent to learn from. The most valuable expertise is often what not to do — and it's invisible.
  • Distillation versus induction. When an agent writes a skill from experience, it's inducing a rule from the handful of episodes it happened to live through, unable to tell essential from incidental. A veteran writes the rule from a thousand episodes, exception clauses included.
  • Cost of learning the hard way. An agent could rediscover much of this by trial and error — but in production those trials are outages and corrupted data. Human-authored skills are one-shot transfer at zero production risk.

So the realistic division of labour — and, again, the one the codebases actually implement — is three-part: humans seed, agents accumulate, something curates. Humans write the foundational skills carrying tacit and hard-won knowledge; the agent appends what it learns within its own experience; and a review step stops the library filling with confident-but-wrong inductions. The grim irony of the layoff story is that the engineers are being asked to do the seeding step for their own replacement — and the quality of that handoff depends entirely on knowledge that was never extractable by force in the first place.

What "self-improving" actually means in code

This is the phrase that sells agents and means the least — so here is what it reduces to once you read the implementation. Only two of the six harnesses ship a genuine learning loop, and neither is magic.

Hermes is the fullest version. Its loop has three moving parts, all visible in agent/curator.py and tools/skill_usage.py:

  1. Write. After a complex task, a post-task nudge encourages the agent to crystallise the workflow into a new skill via a skill_manage tool. Skills it later reuses get patched in place — "skills self-improve during use." Every load, view and patch bumps a counter in a .usage.json file. That usage signal is the raw material for everything downstream.
  2. Age. Pure bookkeeping, no AI: a skill goes active → stale after 30 days unused → archived after 90, and any use resurrects it. Nothing is ever deleted, only moved aside.
  3. Curate. A background "librarian" wakes roughly weekly, but only when the agent has been idle for hours, and forks a headless review agent over the agent-created skills. Its prompt mandates consolidation over pruning — merge five narrow skills into one umbrella skill, demote details into references/ — and it must emit a structured plan that's reconciled against the actual audit trail. Before it touches anything it snapshots every skill and cron job for rollback.

Read that list again and notice what "self-improving" turned out to be: a usage counter, a state machine on timestamps, and a periodic LLM review pass with a backup-and-rollback safety net. It's genuinely valuable — the agent's knowledge compounds across months instead of resetting every session — but it's engineering, not emergence.
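
The counter and the state machine are small enough to sketch in full. This is an illustration of the idea, not the contents of tools/skill_usage.py: the record layout, the function names, and the choice to treat exactly 30 or 90 idle days as already past the threshold are all assumptions. The 30-day and 90-day figures come from the description above.

```python
from datetime import datetime, timedelta
from typing import Dict

STALE_AFTER = timedelta(days=30)    # unused this long: active -> stale
ARCHIVE_AFTER = timedelta(days=90)  # unused this long: stale -> archived


def record_use(usage: Dict[str, dict], skill: str, now: datetime) -> None:
    """Bump the counter and refresh the timestamp on a load, view or patch."""
    entry = usage.setdefault(skill, {"count": 0, "last_used": now.isoformat()})
    entry["count"] += 1
    entry["last_used"] = now.isoformat()


def skill_state(usage: Dict[str, dict], skill: str, now: datetime) -> str:
    """Derive the state from timestamps alone; no model call is involved."""
    idle = now - datetime.fromisoformat(usage[skill]["last_used"])
    if idle >= ARCHIVE_AFTER:
        return "archived"
    if idle >= STALE_AFTER:
        return "stale"
    return "active"
```

Because the state is computed from `last_used` rather than stored, resurrection needs no special case: one call to `record_use` resets the clock and the skill reads as active again. The records hold only integers and ISO strings, so the whole dictionary can be written to a JSON file as is.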

Gemini CLI shows the safer, human-gated variant. Its "Auto Memory" runs a background miner over the conversation transcript, extracts reusable procedures, and writes them out as draft SKILL.md files into a project-local inbox — where a human reviews and approves them before they ever become active. That's the three-part division of labour made literal: the agent proposes, the human disposes.
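
The propose-then-approve flow can be modelled as two directories and a move. The sketch below is a guess at the shape, not Gemini CLI's implementation: the directory layout and the function names `propose_skill`, `approve_skill` and `active_skills` are hypothetical. The property it demonstrates is that a draft is invisible to the agent until a human promotes it.

```python
import shutil
from pathlib import Path
from typing import List


def propose_skill(inbox: Path, name: str, skill_md: str) -> Path:
    """Agent side: write a draft that nothing loads yet."""
    draft_dir = inbox / name
    draft_dir.mkdir(parents=True, exist_ok=True)
    (draft_dir / "SKILL.md").write_text(skill_md, encoding="utf-8")
    return draft_dir


def approve_skill(inbox: Path, active: Path, name: str) -> Path:
    """Human side: promote a reviewed draft into the active library."""
    source = inbox / name
    if not (source / "SKILL.md").is_file():
        raise FileNotFoundError(f"no draft named {name!r} in the inbox")
    target = active / name
    if target.exists():
        raise FileExistsError(f"skill {name!r} is already active")
    active.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


def active_skills(active: Path) -> List[str]:
    """Only approved skills are visible when the prompt index is built."""
    if not active.is_dir():
        return []
    return sorted(p.name for p in active.iterdir() if (p / "SKILL.md").is_file())
```

Keeping the gate as a filesystem move means the review step leaves an ordinary, diffable trail: what was proposed sits in one directory, and what was accepted sits in another.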

| Capability | Where it lives | Who writes it | Self-improves? |
| --- | --- | --- | --- |
| Skills (procedures) | SKILL.md files (agentskills.io) | Human-seeded, agent-extended | Hermes: yes, via curator. Others: manual |
| Memory (facts) | Capped markdown (MEMORY.md / USER.md) | Mostly the agent, editable by hand | Re-curated on a cap, not grown |
| User model | Honcho service (optional) | Inferred from interaction | Yes, but opaque |
| Context files | AGENTS.md / GEMINI.md | Human, loaded per-project/dir | No (static input) |

The part nobody markets: curation and trust

Two unglamorous truths fall out of all this. First, a knowledge store that only grows will rot. Without the aging-and-consolidation machinery, a self-writing agent fills its own library with near-duplicates and stale procedures until the index is noise. Half of Hermes's "learning loop" is really a forgetting loop: the state machine that ages skills out and the curator that merges them. Memory's character caps reflect the same instinct: a small, curated store beats a large, sprawling one.

Second, a skill is executable knowledge, which makes it an attack surface. A SKILL.md can carry scripts/ and shell snippets; an agent that installs skills from the internet is running other people's instructions with your permissions. Hermes takes this seriously enough to ship a dedicated scanner (tools/skills_guard.py) that runs 90-plus regex checks over any skill before install — for exfiltration, prompt injection ("ignore previous instructions"), reverse shells, credential reads, even invisible Unicode — and refuses anything that scores "dangerous." The moment knowledge becomes self-writing and shareable, provenance and review stop being optional.
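
A regex-based pre-install scan looks roughly like the sketch below. The four rules are illustrative stand-ins written for this post, not the patterns in tools/skills_guard.py, and the scoring is an assumption: here any single hit is treated as "dangerous", whereas a real scanner with 90-plus checks would presumably weigh them.

```python
import re
from typing import List

# Illustrative rules only; each name and pattern is hypothetical.
RULES = {
    "prompt_injection": re.compile(
        r"ignore\s+(all\s+)?previous\s+instructions", re.I
    ),
    "pipe_to_shell": re.compile(
        r"\b(curl|wget)\b[^\n|]*\|\s*(ba)?sh\b", re.I
    ),
    "credential_read": re.compile(
        r"(\.ssh/id_[a-z0-9]+|\.aws/credentials)", re.I
    ),
    # Zero-width and byte-order-mark characters that can hide text.
    "invisible_unicode": re.compile("[\u200b\u200c\u200d\u2060\ufeff]"),
}


def scan_skill(text: str) -> List[str]:
    """Return the names of every rule that matches, sorted for stable output."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))


def verdict(text: str) -> str:
    """Assumed policy: a single hit is enough to refuse the install."""
    return "dangerous" if scan_skill(text) else "ok"
```

A scan like this should cover both the SKILL.md body and every file under scripts/, since either can carry the payload. Regexes catch the obvious cases cheaply; they are a first filter before human review, not a replacement for it.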

What this means

For agent users: the question that predicts whether an agent gets better over time isn't "how smart is the model" — it's "what survives between sessions, and who curates it." An agent with great memory hygiene and a few well-written skills will, within a domain, quietly outperform a smarter agent with amnesia. And you should write some skills yourself; the highest-value ones encode exactly the local, tacit knowledge the model can never infer.

For agent builders — us — the takeaways are concrete and cheap to implement: store skills and memory as plain files, not a database; use progressive disclosure so the library costs nothing until used; cap and curate, because a forgetting loop matters as much as a learning one; gate self-written and imported skills behind review; and design for the three-part division of labour from day one — humans seed, the agent accumulates, a curator keeps it honest. None of it requires a better model. All of it is the difference between an agent that resets every morning and one that's been working for you for a year.

This is the third post in our series on AI agent architecture, drawn from reading six open-source agent harnesses. If you build agents, the source is the best teacher we found — start with Pi for the minimal core and Hermes for the learning loop.



Copyright ©  2026  TYO Lab