5 Maturity Levels of AI Agents: Where Are You?

Sergey Golubev 2026-03-11 7 min read

agents maturity levels improve claude code

Three independent discussions in the AI community collided in one week and formed a single framework. One was about how a single line in a prompt improves output quality. Another was about automating agent creation. A third was about deterministic agent testing.

Turns out, they all describe the same ladder: five maturity levels of AI agents.

Level 1: Manual Control

You write a prompt, the agent returns a result, you review it. If the output is bad, you tweak the prompt and try again.

Anyone who’s used ChatGPT or Claude knows this. Classic scenario: you ask the agent to write an email, get a draft, read it, tweak the prompt (“make it shorter”, “be more specific”), get a new version, read again.

What it looks like:

You: "Write an email to a client about a delay"
Agent: [generates text]
You: [reads] "Too formal, redo it"
Agent: [new version]
You: [reads] "This works"

You’re the sole quality inspector. The agent can’t ship without you.

The problem: it doesn’t scale. Every task needs your attention. 10 agents = 10 reviews.

Level 2: Reflection (Self-Review)

You add to the instructions: “Check your output before sending and fix any mistakes.”

A real example: a product engineer at a startup added one phrase to their presentation-generation prompt - “Generate previews of all slides, review them yourself, and fix any issues.” Quality improved noticeably. One phrase - and the agent starts catching its own errors.

Andrew Ng calls reflection one of the four core agentic patterns. In his Agentic AI course he demonstrates that an agent that checks itself performs better even without changing the underlying model.

What it looks like:

System prompt:
"You are an email-writing agent.
BEFORE sending:
1. Check that the tone matches the audience
2. Make sure there's a concrete CTA
3. If length > 200 words - trim it
Only then return the result."

Why it works:

There’s a pattern called Self-Critique Agent: draft → critique → revise. The agent writes a draft, then critiques itself against a checklist, then produces one revised version.
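The draft → critique → revise loop can be sketched in a few lines. The `llm` helper below is a hypothetical stand-in for whatever chat-completion call you actually use; it is stubbed here so the sketch runs without an API key:

```python
# Minimal Self-Critique Agent sketch: draft -> critique -> revise.
# `llm` is a stand-in for any model call (OpenAI, Anthropic, etc.).

def llm(prompt: str) -> str:
    # Stub: replace with a real model call in practice.
    return f"[model output for: {prompt[:40]}...]"

CHECKLIST = """\
1. Does the tone match the audience?
2. Is there a concrete CTA?
3. Is it under 200 words?"""

def self_critique_agent(task: str) -> str:
    draft = llm(f"Write: {task}")
    critique = llm(
        f"Critique this draft against the checklist:\n{CHECKLIST}\n\nDraft:\n{draft}"
    )
    # Exactly one revision pass, as the pattern prescribes.
    revised = llm(
        f"Revise the draft to address the critique.\nDraft:\n{draft}\nCritique:\n{critique}"
    )
    return revised

result = self_critique_agent("an email to a client about a delay")
```

The key design choice is that the critique step sees an explicit checklist, not just "find problems": the same concrete criteria from the system prompt above become the rubric the agent grades itself against.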

Interesting observation: even when the self-review catches nothing, the overall error rate drops. As if the instruction "check your work" improves generation quality before the review even happens. A Hawthorne effect for AI: when a model knows it will be evaluated, it performs better.

Try it now: add “Before responding, review your answer for errors and fix them” to your next prompt. The result will surprise you.

Level 3: Agent Testing

Now the agent operates in a deterministic environment with a set of automated checks. You can reproduce behavior, understand where it broke, and fix it systematically.

This is the approach used in AI agent competitions:

  1. Set up a virtual environment where you control everything (deterministic simulation)
  2. Add entropy (randomness) so the agent can’t memorize answers. Save the seed.
  3. Describe the scenario - populate the environment with details
  4. Know the correct answers - because you built the environment
  5. Write checks - compare expected behavior against agent actions

Key insight: the agent can be non-deterministic, but the testing is deterministic. Fix the seed + environment = reproducible tests.

Real-world example:

Testing an order-processing agent. Create a virtual environment with 10 test orders. For each, we know: what status is correct, what email to send, what amount to charge. Run the agent 100 times with different seeds. Measure accuracy.
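Under those assumptions, a minimal harness might look like the sketch below. The environment and the agent are toy stand-ins I've invented for illustration; a real agent call would replace `agent_process`:

```python
import random

# Deterministic environment: a seeded RNG generates the test orders,
# so the same seed always reproduces the same scenario.
def make_orders(seed: int, n: int = 10) -> list[dict]:
    rng = random.Random(seed)
    orders = []
    for i in range(n):
        in_stock = rng.random() > 0.3
        orders.append({
            "id": i,
            "in_stock": in_stock,
            # We built the environment, so we know the correct answer.
            "expected_status": "confirmed" if in_stock else "backordered",
        })
    return orders

def agent_process(order: dict) -> str:
    # Stand-in for the (possibly non-deterministic) agent under test.
    return "confirmed" if order["in_stock"] else "backordered"

def run_eval(seeds: range) -> float:
    checks = [
        agent_process(o) == o["expected_status"]
        for seed in seeds
        for o in make_orders(seed)
    ]
    return sum(checks) / len(checks)  # accuracy across all runs

# Same seed -> same orders: any failure is reproducible.
assert make_orders(7) == make_orders(7)
accuracy = run_eval(range(100))  # 100 seeded runs, 10 orders each
```

When a run fails, you rerun with that one seed and get the exact same scenario back, which is what turns "the agent sometimes breaks" into a debuggable bug report.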

Microsoft Research is advancing this with Agent-Pex - a tool for automated evaluation of agentic traces. It parses prompts and traces, extracts checkable rules, and automatically determines whether they’ve been violated.

The engineering shift: not “let’s try a different prompt” but “let’s write a test case.” Like unit tests, but for AI agents.

Level 4: Meta-Agents

An agent creates or improves other agents.

A typical path to this level:

  1. Ask the agent to solve a task. Learn yourself how it should be solved.
  2. Show colleagues, iterate.
  3. Show the client, iterate.
  4. Repeat with a second and third client.
  5. Bring in Claude Code or another agent and have it run the workflow 10 times to extract quality criteria.
  6. Launch an agent to run agents and refine the instruction until it’s 10 out of 10.
  7. Ask the agent to create a workflow for adapting workflows.

Step 6 is where the meta-agent appears: one agent tests another and edits its instructions.
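Step 6 can be sketched as a bounded loop: the meta-agent runs the worker, scores the output against quality criteria, and rewrites the worker's instruction until the score hits the target. Every function here is a hypothetical stand-in, not a real framework's API:

```python
# Toy meta-agent loop: one agent tests another and edits its instructions.

def run_worker(instruction: str, task: str) -> str:
    # Stand-in for executing a worker agent with this system prompt.
    return f"output produced under: {instruction}"

def score(output: str) -> int:
    # Stand-in for automated quality criteria (0..10). In this toy
    # version, a more detailed instruction yields a higher score.
    return min(10, output.count(";") + 5)

def revise_instruction(instruction: str, output: str, s: int) -> str:
    # Stand-in for the meta-agent editing the instruction from failures.
    return instruction + "; be more specific"

def refine(instruction: str, task: str, target: int = 10) -> str:
    for _ in range(10):  # bounded, so the loop always terminates
        output = run_worker(instruction, task)
        if score(output) >= target:
            return instruction
        instruction = revise_instruction(instruction, output, score(output))
    return instruction

final = refine("Write client emails", "delay notice")
```

The bound on iterations matters in practice: an unbounded "refine until perfect" loop can burn tokens indefinitely when the scoring criteria are unreachable.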

How it works:

Researchers describe this as a “Meta-Agent that designs other agents automatically from task description.” The meta-agent:

  • Analyzes the task
  • Selects tools
  • Configures memory
  • Sets up the planner
  • Creates the worker agent

AOrchestra (a framework from researchers in China) does this through an orchestrator: it receives a task, creates a sub-agent for it, passes context, selects tools and model, and runs it. Then creates the next sub-agent for the next step.
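The orchestrator pattern described above might look roughly like this. This is my simplified sketch of the idea, not AOrchestra's actual API; the roles and tools are invented for illustration:

```python
# Toy orchestrator: spawn a sub-agent per step and thread context
# from one sub-agent to the next.
from dataclasses import dataclass

@dataclass
class SubAgent:
    role: str
    tools: list

    def run(self, context: str) -> str:
        # Stand-in for invoking a model configured with role + tools.
        return f"{context} -> [{self.role} done]"

def orchestrate(task: str, steps: list[tuple[str, list]]) -> str:
    context = task
    for role, tools in steps:
        agent = SubAgent(role=role, tools=tools)  # created on demand
        context = agent.run(context)              # output feeds next step
    return context

trace = orchestrate("process refund", [
    ("planner", []),
    ("executor", ["payments_api"]),
    ("verifier", ["audit_log"]),
])
```

The point of creating sub-agents on demand, rather than wiring a fixed pipeline, is that the orchestrator can pick a different role, toolset, or model for each step of each new task.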

Level 5: Self-Adaptation

The agent autonomously adapts its architecture and context to new conditions. Not just creating agents - changing itself.

Microsoft Research introduced ACE (Agentic Context Engineering) - a framework where contexts are “evolving playbooks”: they accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.

Key difference from Level 4:

| Level 4 (Meta-agents) | Level 5 (Self-adaptation) |
| --- | --- |
| Creates other agents | Modifies its own architecture |
| Fixed meta-agent | Evolving agent |
| Works from a template | Learns from execution feedback |

ACE delivers results: +10.6% on agentic benchmarks, +8.6% on finance tasks. Importantly, adaptation works without labelled supervision - the agent learns from natural execution feedback.

AWS, in its agentic-systems guide, notes that self-reflection requires evaluation at every stage: reasoning, tool use, memory, and action execution. At Level 5, this evaluation is built into the agent's operational loop.

Diagnostic: Where Are You?

Check yourself:

  • Level 1: You tweak prompts after every result. Every agent run requires your attention.
  • Level 2: Your agents self-check before returning output. You’ve added “review your result” to the system prompt.
  • Level 3: You have tests for your agents. You can reproduce a bug with the same seed.
  • Level 4: Agents create agents. A meta-agent coordinates sub-agents.
  • Level 5: Agents adapt autonomously. The system evolves without your involvement.

What I Learned

The trend is clear: we’re moving from “prompt engineering” to “agent engineering.”

Prompt engineering is “how to ask to get a good answer.” Agent engineering is “how to build a system that improves quality on its own.”

The first skill is becoming table stakes - like knowing how to Google. The second is the competitive edge.

Good news: starting is easy. Level 2 is available right now, no tooling or infrastructure needed. Add one line - “check yourself” - to your next prompt. Minimum effort, measurable result.

Level 3 requires engineering thinking - thinking in tests, not prompts. Levels 4 and 5 are advanced territory where few people operate today. But that’s where the trend is heading.

Sources

  1. Andrew Ng - Agentic AI Course - 4 core patterns including Reflection
  2. Self-Critique Agent Pattern - draft → critique → revise architecture
  3. Simulation for Agentic Evaluation - deterministic agent testing
  4. Agent-Pex - Microsoft Research - automated evaluation of agentic traces
  5. ACE - Agentic Context Engineering - self-adapting agents from Microsoft
  6. AOrchestra - Agentic Orchestration - automatic sub-agent creation
  7. Amazon AWS - Evaluating AI Agents - self-reflection in production