
Gave Gemini 3.1 Pro the same skill I run in Claude Code. Asked it to process 300 posts. Claude then wrote a formal review - 4.6 out of 10.
Honestly, I expected worse from the experiment. I got an interesting result instead.
The Task
I built a telegram-channel-processor skill - a 5-stage pipeline: SETUP → PARSE → CALIBRATE → PROCESS → EXPORT.
The skill determines post relevance through LLM classification. Output: a structured Knowledge Base with elements - quotes, insights, statistics, tools.
The skill was written for Claude Code. But since skills are an open standard, I just copied it into .gemini/skills/ and gave it the task: process one Telegram channel, 300 posts.
What Gemini Did
It processed only 1 chunk out of 7 via the LLM.
For the remaining six, it wrote a Python script, auto_process.py, that does keyword matching via regex. It called this "automated processing" - without mentioning it had switched from LLM to regex. It just decided that's what it needed to do.
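I didn't keep the script, but a keyword-matching fallback like auto_process.py typically boils down to something like this sketch (the keyword list, function names, and sample posts here are my own illustration, not Gemini's actual code):

```python
import re

# Hypothetical keyword list - the real script's patterns are unknown.
KEYWORDS = re.compile(
    r"\b(insight|statistic|tool|quote|data|research)\b",
    re.IGNORECASE,
)

def keep_post(text: str) -> bool:
    """Regex 'classification': keep any post containing a keyword."""
    return bool(KEYWORDS.search(text))

posts = [
    "New research shows a surprising statistic about churn.",
    "Happy birthday to our community manager!",
    "Check out this tool for parsing Telegram exports.",
]
kept = [p for p in posts if keep_post(p)]
```

The problem is visible immediately: any post that merely *mentions* a keyword survives, regardless of whether it actually contains a usable insight. That surface-level matching is exactly what inflates retention.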
Results:
- 259 out of 300 posts marked as "useful" (86% retention rate)
- Normal rate with LLM classification: 30-50%
- 591 Knowledge Base elements - produced by regex, not semantic analysis
- Critical file `extracted_channel_name.json` never created - 5+ manual "Continue" prompts required from me
- HTTP 503 MODEL_CAPACITY_EXHAUSTED mid-run
An 86% retention rate is a red flag. Regex passes almost everything through; LLM classification cuts out the irrelevant. The resulting knowledge base contains fundamentally different data.
The Score
I asked Claude to write a formal review across 7 criteria:
| Criterion | Score |
|---|---|
| Autonomy | 4/10 |
| Stability | 3/10 |
| Skill compliance | 5/10 |
| Data quality | 4/10 |
| Data integrity | 3/10 |
| Handling limits | 6/10 |
| Communication | 7/10 |
Average score: 4.6 / 10
Communication - 7/10. Gemini honestly explained the architectural constraints after finishing. Didn’t pretend everything went according to plan.
Why This Happened
Gemini physically couldn’t do it any other way.
Output limit: 32K tokens. Generating JSON for hundreds of posts via LLM under that limit is impossible. Claude Code handles this through sub-agents: each task runs in its own context, and they run in parallel. Gemini CLI has no sub-agents.
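The sub-agent workaround is essentially chunking: split the posts so each chunk's JSON output fits under the budget, then classify every chunk with its own LLM call. A minimal sketch (the function names and chunk size of 43 are my assumptions - 300 posts at ~43 per chunk gives the 7 chunks mentioned above):

```python
from typing import Callable

def chunk(posts: list[str], size: int) -> list[list[str]]:
    """Split posts into fixed-size chunks."""
    return [posts[i:i + size] for i in range(0, len(posts), size)]

def process_all(
    posts: list[str],
    classify_chunk: Callable[[list[str]], list[dict]],
    chunk_size: int = 43,
) -> list[dict]:
    """Classify every chunk via an LLM call. With sub-agents, each
    chunk could run in its own context in parallel; without them,
    all chunks must squeeze through one context and one output limit."""
    results: list[dict] = []
    for c in chunk(posts, chunk_size):
        results.extend(classify_chunk(c))  # one LLM call per chunk
    return results
```

The point is that each chunk is an independent unit of work. Sub-agents make that independence real; a single-context agent has to fake it, and that's where the regex shortcut came from.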
Regex is a rational fallback given those constraints. Two problems though: it didn’t warn about the switch, and an 86% retention rate didn’t register as an anomaly.
503 MODEL_CAPACITY_EXHAUSTED is a separate pain point: at peak load, Gemini 3.1 Pro is simply unavailable at the capacity this kind of task needs.
What I Took from Gemini
Gemini suggested storing rejected posts as ID-only, without full content. This genuinely saves tokens when passing context between pipeline stages. Took the idea - updated my main skill for Claude Code.
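The idea is simple to express in code. A sketch of what that compaction might look like (the field names and schema here are my own, not the skill's actual format): keep full content only for accepted posts, and reduce rejected ones to bare IDs before handing state to the next stage.

```python
def compact_stage_output(classified: list[dict]) -> dict:
    """Keep full records for relevant posts; store rejected posts
    as IDs only, so later pipeline stages don't carry dead weight."""
    accepted = [p for p in classified if p["relevant"]]
    rejected_ids = [p["id"] for p in classified if not p["relevant"]]
    return {"accepted": accepted, "rejected_ids": rejected_ids}

stage_out = compact_stage_output([
    {"id": 1, "relevant": True, "text": "Long useful post..."},
    {"id": 2, "relevant": False, "text": "Off-topic chatter..."},
])
```

Rejected posts stay auditable (you can always re-fetch them by ID), but they no longer inflate the context passed between stages.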
What I Learned
Same skill, two agents, different results - not because one is “smarter.” Because of architecture.
Sub-agents in Claude Code aren’t just a feature. They’re the ability to break a task into independent chunks and honestly process each one via LLM. Without them, the agent is forced into compromises that break the result.
Going to keep testing the same skill across different engines. Benchmarks are one thing. A real task is another.