
Three independent discussions in the AI community collided in one week - and formed a single framework. One was about how a single line in a prompt improves output quality. Another about automating agent creation. A third about deterministic agent testing.
Turns out, they all describe the same ladder: five maturity levels of AI agents.
Level 1: Manual Control
You write a prompt, the agent returns a result, you review it. Bad output - tweak the prompt and try again.
Anyone who’s used ChatGPT or Claude knows this. Classic scenario: you ask the agent to write an email, get a draft, read it, tweak the prompt (“make it shorter”, “be more specific”), get a new version, read again.
What it looks like:
You: "Write an email to a client about a delay"
Agent: [generates text]
You: [reads] "Too formal, redo it"
Agent: [new version]
You: [reads] "This works"
You’re the sole quality inspector. The agent can’t ship without you.
The problem: it doesn’t scale. Every task needs your attention. 10 agents = 10 reviews.
Level 2: Reflection (Self-Review)
You add to the instructions: “Check your output before sending and fix any mistakes.”
A real example: a product engineer at a startup added one phrase to their presentation-generation prompt - “Generate previews of all slides, review them yourself, and fix any issues.” Quality improved noticeably. One phrase - and the agent starts catching its own errors.
Andrew Ng calls reflection one of the four core agentic patterns. In his Agentic AI course he demonstrates that an agent that checks itself performs better even without changing the underlying model.
What it looks like:
System prompt:
"You are an email-writing agent.
BEFORE sending:
1. Check that the tone matches the audience
2. Make sure there's a concrete CTA
3. If length > 200 words - trim it
Only then return the result."
Why it works:
There’s a pattern called Self-Critique Agent: draft → critique → revise. The agent writes a draft, then critiques itself against a checklist, then produces one revised version.
Interesting observation: even when the self-review catches nothing, the overall error rate drops. It's as if the instruction “check your work” improves generation quality before the review even happens - a Hawthorne effect for AI: when a model knows it will be evaluated, it performs better.
Try it now: add “Before responding, review your answer for errors and fix them” to your next prompt. The result will surprise you.
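In code, the draft → critique → revise pattern is a thin loop around your model client. A minimal sketch, where `call_llm` stands in for whatever function sends a prompt to your model and returns text - the names here are illustrative, not any specific library's API:

```python
def self_critique(call_llm, task, checklist):
    """Draft -> critique -> revise: one revision pass, per the Self-Critique pattern."""
    draft = call_llm(f"Task: {task}\nWrite a first draft.")
    items = "\n".join(f"- {item}" for item in checklist)
    critique = call_llm(
        f"Critique this draft against the checklist:\n{items}\n\n"
        f"Draft:\n{draft}\n\nList concrete problems, or reply OK."
    )
    if critique.strip().upper() == "OK":
        return draft  # checklist passed: ship the draft as-is
    return call_llm(
        f"Revise the draft to fix these problems:\n{critique}\n\nDraft:\n{draft}"
    )
```

The checklist from the system prompt above (tone, CTA, length) slots straight in as `checklist`. The point of writing it this way: the critique is an explicit call you can log and inspect, not a hope buried in one prompt.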
Level 3: Agent Testing
Now the agent operates in a deterministic environment with a set of automated checks. You can reproduce behavior, understand where it broke, and fix it systematically.
This is the approach used in AI agent competitions:
- Set up a virtual environment where you control everything (deterministic simulation)
- Add entropy (randomness) so the agent can’t memorize answers. Save the seed.
- Describe the scenario - populate the environment with details
- Know the correct answers - because you built the environment
- Write checks - compare expected behavior against agent actions
Key insight: the agent can be non-deterministic, but the testing is deterministic. Fix the seed + environment = reproducible tests.
Real-world example:
Say you're testing an order-processing agent. You create a virtual environment with 10 test orders. For each one you know what status is correct, what email should be sent, and what amount to charge. Run the agent 100 times with different seeds and measure accuracy.
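A minimal version of that harness fits in a page. Everything below is a sketch with made-up names (`Order`, `make_environment`); the property that matters is that the environment is a pure function of the seed, so any failing run can be replayed exactly:

```python
import random
from dataclasses import dataclass

@dataclass
class Order:
    order_id: int
    amount: float
    expected_status: str  # known in advance - we built the environment

def make_environment(seed: int) -> list[Order]:
    """Deterministic simulation: same seed -> same 10 orders, every run."""
    rng = random.Random(seed)  # all entropy comes from the saved seed
    return [
        Order(
            order_id=i,
            amount=round(rng.uniform(10, 500), 2),
            expected_status="paid" if rng.random() > 0.3 else "refund",
        )
        for i in range(10)
    ]

def run_suite(agent, seeds) -> float:
    """The agent may be non-deterministic; the harness is not. Returns accuracy."""
    correct = total = 0
    for seed in seeds:
        for order in make_environment(seed):
            total += 1
            if agent(order) == order.expected_status:
                correct += 1
    return correct / total
```

When a run fails, you re-run `make_environment` with the same seed and step through the exact same orders - that is the "fix the seed + environment = reproducible tests" equation in practice.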
Microsoft Research is advancing this with Agent-Pex - a tool for automated evaluation of agentic traces. It parses prompts and traces, extracts checkable rules, and automatically determines whether they’ve been violated.
The engineering shift: not “let’s try a different prompt” but “let’s write a test case.” Like unit tests, but for AI agents.
Level 4: Meta-Agents
An agent creates or improves other agents.
A typical path to this level:
1. Ask the agent to solve a task. Learn for yourself how it should be solved.
2. Show colleagues, iterate.
3. Show the client, iterate.
4. Repeat with a second and third client.
5. Bring in Claude Code or another agent and have it run the workflow 10 times to extract quality criteria.
6. Launch an agent to run agents and refine the instruction until it's 10 out of 10.
7. Ask the agent to create a workflow for adapting workflows.
Step 6 is where the meta-agent appears: one agent tests another and edits its instructions.
How it works:
Researchers describe this as a “Meta-Agent that designs other agents automatically from task description.” The meta-agent:
- Analyzes the task
- Selects tools
- Configures memory
- Sets up the planner
- Creates the worker agent
AOrchestra (a framework from researchers in China) does this through an orchestrator: it receives a task, creates a sub-agent for it, passes context, selects tools and model, and runs it. Then creates the next sub-agent for the next step.
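The orchestrator loop described above can be sketched structurally - this is not AOrchestra's actual API, and `SubAgent` and `make_agent` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    role: str
    tools: list[str]            # tools the meta-agent selected for this step
    run: Callable[[str], str]   # the step's model-backed worker

def orchestrate(steps, make_agent, context=""):
    """For each step: create a sub-agent, hand it the accumulated context, run it."""
    for step in steps:
        agent = make_agent(step)  # meta-agent configures tools/model per step
        context = agent.run(f"{context}\n{step}".strip())
    return context
```

The design choice worth noticing: sub-agents are created per step and discarded, while the context threads through. The meta-agent's job is entirely in `make_agent` - analyzing the step and configuring the worker.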
Level 5: Self-Adaptation
The agent autonomously adapts its architecture and context to new conditions. Not just creating agents - changing itself.
ACE (Agentic Context Engineering) is a framework where contexts are “evolving playbooks”: they accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
Key difference from Level 4:
| Level 4 (Meta-agents) | Level 5 (Self-adaptation) |
|---|---|
| Creates other agents | Modifies its own architecture |
| Fixed meta-agent | Evolving agent |
| Works from a template | Learns from execution feedback |
ACE delivers results: +10.6% on agentic benchmarks, +8.6% on finance tasks. Importantly, adaptation works without labelled supervision - the agent learns from natural execution feedback.
AWS, in its agentic systems guide, writes that self-reflection requires evaluation at every stage - reasoning, tool use, memory, action execution. At Level 5, this evaluation is built into the agent's operational loop.
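The generation → reflection → curation cycle can be expressed as one function over an evolving playbook. A hedged sketch, not ACE's implementation - `generate`, `reflect`, and `curate` stand in for model-backed components:

```python
def adapt_step(playbook, task, generate, reflect, curate):
    """One pass of the evolving-playbook loop: act, extract lessons, fold them in."""
    output, trace = generate(task, playbook)  # act using current strategies
    lessons = reflect(task, output, trace)    # learn from execution feedback, no labels
    playbook = curate(playbook, lessons)      # merge new strategies, drop duplicates
    return playbook, output
```

What separates this from a Level 2 self-check is the curation step: the playbook accumulates across tasks instead of resetting with each prompt, so strategies survive and compound.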
Diagnostic: Where Are You?
Check yourself:
- Level 1: You tweak prompts after every result. Every agent run requires your attention.
- Level 2: Your agents self-check before returning output. You’ve added “review your result” to the system prompt.
- Level 3: You have tests for your agents. You can reproduce a bug with the same seed.
- Level 4: Agents create agents. A meta-agent coordinates sub-agents.
- Level 5: Agents adapt autonomously. The system evolves without your involvement.
What I Learned
The trend is clear: we’re moving from “prompt engineering” to “agent engineering.”
Prompt engineering is “how to ask to get a good answer.” Agent engineering is “how to build a system that improves quality on its own.”
The first skill is becoming table stakes - like knowing how to Google. The second is the competitive edge.
Good news: starting is easy. Level 2 is available right now, no tooling or infrastructure needed. Add one line - “check yourself” - to your next prompt. Minimum effort, measurable result.
Level 3 requires engineering thinking - thinking in tests, not prompts. Levels 4 and 5 are advanced territory where few people operate today. But that’s where the trend is heading.
Sources
- Andrew Ng - Agentic AI Course - 4 core patterns including Reflection
- Self-Critique Agent Pattern - draft → critique → revise architecture
- Simulation for Agentic Evaluation - deterministic agent testing
- Agent-Pex - Microsoft Research - automated evaluation of agentic traces
- ACE - Agentic Context Engineering - contexts as evolving playbooks
- AOrchestra - Agentic Orchestration - automatic sub-agent creation
- Amazon AWS - Evaluating AI Agents - self-reflection in production