
Between February and April 2026, four independent teams landed on the same term: harness engineering. Hashimoto (Feb 5). OpenAI/Lopopolo (Feb 11). HumanLayer (March 12, term credited to Viv). Böckeler on martinfowler.com (April). One shared conclusion: the harness around the model has become more important than the model itself.
I felt it firsthand. Opus 4.7 shipped, and several of my harnesses fell apart immediately. The new model interprets instructions a bit differently, and the skills were written for the old model's habits. I spent half a day combing through them. The paradox: a smarter model breaks the result if the harness stays the same.
What is a harness
HumanLayer's formula: agent = model + harness. The harness hides at least six layers: prompts, MCP servers, skills, sub-agents, hooks, and back-pressure via CLAUDE.md / AGENTS.md.
The model is a swappable engine, the harness is the whole car.
Why this is the main trend of 2026
The delta between frontier models is shrinking and reasoning is plateauing. A number to back it up: the fresh APEX-Agents benchmark shows pass@1 = 24% on the top models, meaning 76% of tasks fail on the first try. Most failures are orchestration, context management, and wrong tool use.
A quote from the author of the HumanLayer Skill Issue post: “it’s not a model problem. It’s a configuration problem.”
ETH Zurich (arXiv 2602.11988) tested AGENTS.md / CLAUDE.md files. LLM-generated CLAUDE.md files are often worse than no file at all. Human-written ones add about 4% to task success but increase inference cost by 20%+. HumanLayer recommends keeping the file under 60 lines: only build/test commands, no essays.
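For reference, a sketch of what an under-60-line CLAUDE.md in that spirit can look like. The commands and conventions below are hypothetical placeholders, not from any of the cited posts:

```markdown
# Build & test
- Install: `pnpm install`
- Build: `pnpm build`
- Test: `pnpm test`
- Lint: `pnpm lint`

# Conventions
- TypeScript strict mode; no `any`
- Never commit directly to `main`
```

Commands the agent actually needs, nothing it can infer from the code itself.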
4 non-obvious layers that are the harness
Tool calling is four layers, not one: HTTP → Jinja2 chat template → constrained decoding (xgrammar) → Hermes parser. When something breaks, people fix the prompt. But the problem is usually in the parser or the template.
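To make the last layer concrete, here is a minimal sketch of a Hermes-style tool-call parser (the tag format and the `get_weather` call are illustrative, not from any specific stack). The point: if this regex or the JSON inside it misfires, the call silently vanishes, and no prompt tweak will bring it back.

```python
import json
import re

def parse_tool_calls(model_output: str) -> list[dict]:
    """Extract Hermes-style <tool_call> blocks from raw model output."""
    calls = []
    for payload in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>",
                              model_output, re.DOTALL):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Malformed JSON here points at a template/decoding bug,
            # not at instruction-following.
            pass
    return calls

raw = '<tool_call>{"name": "get_weather", "arguments": {"city": "Zurich"}}</tool_call>'
print(parse_tool_calls(raw))
```

Debugging at this layer means diffing raw model output against the parser, not rewriting the system prompt.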
Tool Search from Anthropic (November 24, 2025, shipped with Opus 4.5) is a direct answer to bloated context. A typical MCP setup burns ~55k tokens of baseline just on tool definitions; Tool Search cuts that by ~85%, from 55k to ~3k. The pain threshold starts when tool definitions eat more than 10k tokens, and degradation is already noticeable at 30-50 tools.
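You can eyeball where you are relative to that threshold with a crude estimate: ~4 characters per token over the serialized definitions. This is a heuristic sketch, not a real tokenizer, and the 40 verbose tools below are invented:

```python
import json

def estimate_tool_tokens(tool_defs: list[dict]) -> int:
    """Rough token estimate for tool definitions: ~4 chars per token."""
    chars = sum(len(json.dumps(t)) for t in tool_defs)
    return chars // 4

# 40 hypothetical tools with verbose ~1k-char descriptions
tools = [
    {"name": f"tool_{i}", "description": "x" * 1000,
     "inputSchema": {"type": "object"}}
    for i in range(40)
]
baseline = estimate_tool_tokens(tools)
print(baseline, "tokens spent on definitions before the first user message")
```

Forty verbose tools already push past the 10k-token mark, which matches where the degradation reportedly starts.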
Sub-agents as a context firewall. I have two clear cases where they pay off: a task that is uniform, polished, and packaged into a skill, and work with lots of repetitive actions over large data volumes. Concretely: parsing Telegram channels, parsing YouTube, scanning repositories, data processing. Wherever there is no repetition and a design conversation is needed, a sub-agent gets in the way more than it helps.
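For illustration, roughly what one of those parsing sub-agents could look like as a Claude Code agent file. The agent itself is invented, and the frontmatter fields follow Claude Code's sub-agent format as I understand it, so treat this as a sketch:

```markdown
---
name: channel-parser
description: Parses a Telegram channel export into normalized JSON. Use for bulk, repetitive extraction only.
tools: Read, Write, Bash
---

You parse exported channel dumps. For each message, emit one JSON line
with date, author, text, and links. Do not summarize. Do not discuss design.
```

The narrow description is the firewall: the main conversation never sees the dump, only the result.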
Token hygiene through deterministic scripts. My biggest saving came not from prompt engineering but from not pushing everything through the LLM: generate new JSON and markdown from JSON without the model, and drop down to Sonnet or Haiku when the task allows. As for CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50: it is a real env var, but the main conversation often ignores it; it only works reliably for sub-agents. As for `git status --porcelain`: folk wisdom. It saves some tokens, but the promised 10x never showed up in production for me.
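The "markdown from JSON without the model" point deserves a concrete sketch. The digest format and data below are invented; the technique is just ordinary string formatting, which costs zero tokens per run:

```python
import json

def report_from_json(payload: str) -> str:
    """Render a markdown digest from structured data with zero LLM calls."""
    items = json.loads(payload)
    lines = ["# Channel digest", ""]
    for item in items:
        lines.append(f"- **{item['title']}** ({item['views']} views)")
    return "\n".join(lines)

data = json.dumps([
    {"title": "Harness engineering", "views": 1200},
    {"title": "Tool Search", "views": 800},
])
print(report_from_json(data))
```

Anything with a fixed shape belongs in a script like this; the model should only see the parts that genuinely need judgment.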
My harness right now
The point is that the stack is flat, no magic:
- CLAUDE.md on two levels - user and project
- A full set of skills: user-level and project-level
- MCP servers, CLI integrations
- Brainstorming skill from superpowers - in almost every task
- Deterministic scripts for processing instead of LLM-on-line where possible
- Tests and hooks
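Hooks are the deterministic end of that list. A sketch of wiring a check into `.claude/settings.json`, assuming the current Claude Code hook schema; the matcher and the test command are placeholders:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "pnpm test" }
        ]
      }
    ]
  }
}
```

The effect: every edit gets checked by code, not by the model's self-assessment.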
I have exactly one anti-case, and it is recent: Opus 4.7. Nothing in the code broke; the instructions broke. I rewrote three skills from scratch and combed through five more, about six hours of work. The cure: simplify the wording, remove ambiguity. One pass like this per model upgrade is normal.
Caveman as an industry punchline
Julius Brussee, a 19-year-old Leiden University student, dropped the Caveman project on GitHub on April 4, 2026. The system prompt: "talk like a caveman." It forces the model to write only content words: no politeness, no connectors. 37k stars in a week (as of April).
The -75% output-prose-tokens number sounds good, but the author himself honestly wrote: preliminary, not a rigorous benchmark. On his own evals the range is 22-87%, averaging 65% output reduction, and that is output prose only, not the whole session.
The moral isn’t Caveman. The industry is hunting for harness tricks with such intensity that viral projects get built by 19-year-old students over a weekend.
Outsourcing the harness to the cloud
Anthropic rolled out Managed Agents: a ready-made sandbox, auth, tool execution, hours of autonomous work. A logical step: why assemble it yourself if Anthropic will do it for you?
I am not outsourcing yet. Once you understand how it works, it is easier to do it yourself. Managed Agents are for a different audience: people who do not understand the internals or cannot build them themselves.
What I got out of this
The harness is your product. If you are building an agent for yourself, the harness is what makes your workflow different from someone else's. Give away the harness and you give away control and the ability to fix things.
This lines up with my niche “build for yourself instead of subscribing”: instead of $30/month for someone’s agent-as-a-service - your own harness that you rebuild over a weekend for a new model.
The main engineering lever of 2026 is not the next model; it is the layer between you and the model. I am not sure how long this holds: in six months some framework may eat the harness. But right now, that is where the leverage is.
Sources
- HumanLayer: Skill Issue - Harness Engineering for Coding Agents
- ETH Zurich: Evaluating AGENTS.md (arXiv 2602.11988)
- APEX-Agents benchmark (arXiv)
- Mitchell Hashimoto: My AI Adoption Journey - Engineer the Harness
- OpenAI / Ryan Lopopolo: Harness Engineering
- Anthropic: Advanced Tool Use (Tool Search)
- Julius Brussee: Caveman (GitHub)
- Birgitta Böckeler on martinfowler.com: Harness Engineering