I gave Gemini 3.1 Pro my skill. Got 4.6 out of 10

Sergey Golubev 2026-03-12 3 min read

Tested the same 5-stage skill in Claude Code and Gemini CLI

Gave Gemini 3.1 Pro the same skill I run in Claude Code. Asked it to process 300 posts. Claude then wrote a formal review - 4.6 out of 10.

Honestly, I expected the experiment to go worse. Instead, it got interesting.

The Task

I built a telegram-channel-processor skill - a 5-stage pipeline: SETUP → PARSE → CALIBRATE → PROCESS → EXPORT.

The skill determines post relevance through LLM classification. Output: a structured Knowledge Base with elements - quotes, insights, statistics, tools.

The skill was written for Claude Code. But since skills are an open standard, I just copied it into .gemini/skills/ and gave it the task: process one Telegram channel, 300 posts.
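Because skills are just folders of files, porting one is a plain copy. A minimal sketch, assuming the skill lives under `.claude/skills/` in the Claude Code project (the source path is an assumption; adjust to your layout):

```shell
# Demo setup so this snippet runs standalone; in a real project the
# skill folder already exists under .claude/skills/ (path assumed).
mkdir -p .claude/skills/telegram-channel-processor

# The actual port: copy the skill folder into Gemini CLI's skill directory.
mkdir -p .gemini/skills
cp -r .claude/skills/telegram-channel-processor .gemini/skills/
ls .gemini/skills
```

No translation layer, no adapter: the open-standard bet is that the same files work in both agents.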

What Gemini Did

Via LLM, it processed 1 chunk out of 7.

For the remaining six, it wrote a Python script auto_process.py with keyword-matching via regex. Called it “automated processing” - without mentioning it had switched from LLM to regex. Just decided that’s what it needed to do.
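The actual `auto_process.py` wasn't published, but a keyword-matching classifier of this kind is easy to sketch, and the sketch shows why it over-retains: any post containing any keyword anywhere passes. The keyword list below is hypothetical, purely for illustration:

```python
import re

# Hypothetical keyword list; the real auto_process.py was not shown.
KEYWORDS = re.compile(r"(tool|insight|data|AI|model|prompt)", re.IGNORECASE)

def classify(post: str) -> bool:
    """Mark a post 'useful' if any keyword appears anywhere in its text."""
    return bool(KEYWORDS.search(post))

posts = [
    "New AI model released today",        # matches "AI", "model"
    "Happy birthday to our channel!",     # no keyword: rejected
    "A prompt trick that saves tokens",   # matches "prompt"
]
kept = [p for p in posts if classify(p)]
print(len(kept), "of", len(posts), "kept")
```

A regex has no notion of *why* a keyword appears, so a post that merely mentions "model" in passing scores the same as a genuine insight. That is the mechanism behind the 86% retention rate below.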

Results:

  • 259 out of 300 posts marked as “useful” (retention rate 86%)
  • Normal rate with LLM classification: 30-50%
  • 591 Knowledge Base elements - produced by regex, not semantic analysis
  • Critical file extracted_channel_name.json - not created
  • 5+ manual “Continue” prompts required from me
  • HTTP 503 MODEL_CAPACITY_EXHAUSTED mid-run

Retention rate 86% is a red flag. Regex passes almost everything through. LLM classification cuts out the irrelevant. The resulting knowledge base contains fundamentally different data.

The Score

I asked Claude to write a formal review across 7 criteria:

Criterion          Score
Autonomy           4/10
Stability          3/10
Skill compliance   5/10
Data quality       4/10
Data integrity     3/10
Handling limits    6/10
Communication      7/10

Average score: 4.6 / 10
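The headline number is just the unweighted mean of the seven criteria:

```python
scores = {
    "Autonomy": 4, "Stability": 3, "Skill compliance": 5,
    "Data quality": 4, "Data integrity": 3,
    "Handling limits": 6, "Communication": 7,
}
avg = sum(scores.values()) / len(scores)  # 32 / 7
print(round(avg, 1))  # 4.6
```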

Communication - 7/10. Gemini honestly explained the architectural constraints after finishing. Didn’t pretend everything went according to plan.

Why This Happened

Gemini physically couldn’t do it any other way.

Output limit: 32K tokens. Generating JSON for hundreds of posts via LLM with that limit - impossible. Claude Code handles this through sub-agents: each task in its own context, running in parallel. Gemini CLI has no sub-agents.
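The chunking itself is trivial; the hard part is that each chunk's JSON output must fit the 32K-token budget, and only sub-agents let the chunks run in separate contexts. A minimal sketch of the split (300 posts into 7 chunks, as in this run; chunk size is derived, not specified by the skill):

```python
import math

def chunk(posts, n_chunks):
    """Split posts into n_chunks roughly equal slices for per-chunk processing."""
    size = math.ceil(len(posts) / n_chunks)
    return [posts[i:i + size] for i in range(0, len(posts), size)]

chunks = chunk(list(range(300)), 7)
print(len(chunks), [len(c) for c in chunks])
```

In Claude Code, each of these slices can go to its own sub-agent with a fresh context and output budget; in Gemini CLI, all seven would have to flow through one context, which is exactly where the fallback kicked in.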

Regex is a rational fallback given those constraints. Two problems though: it didn’t warn about the switch, and an 86% retention rate didn’t register as an anomaly.

503 MODEL_CAPACITY_EXHAUSTED is a separate pain point. At peak load, Gemini 3.1 Pro is simply unavailable at the required level.

What I Took from Gemini

Gemini suggested storing rejected posts as ID-only, without full content. This genuinely saves tokens when passing context between pipeline stages. Took the idea - updated my main skill for Claude Code.
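The idea is simple to express in code. A sketch of the compaction step (field names like `useful` and `id` are illustrative, not from the actual skill):

```python
import json

def compact(results):
    """Keep full content for accepted posts; rejected posts shrink to bare IDs."""
    return {
        "accepted": [r for r in results if r["useful"]],
        "rejected_ids": [r["id"] for r in results if not r["useful"]],
    }

results = [
    {"id": 1, "useful": True,  "text": "A long post worth keeping..."},
    {"id": 2, "useful": False, "text": "An equally long post that was rejected..."},
]
full = json.dumps(results)
slim = json.dumps(compact(results))
print(len(slim), "<", len(full))
```

With a normal 30-50% retention rate, half or more of the payload passed between stages is rejected posts, so dropping their bodies is a real saving, while the IDs still let a later stage audit what was cut.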

What I Learned

Same skill, two agents, different results - not because one is “smarter.” Because of architecture.

Sub-agents in Claude Code aren’t just a feature. They’re the ability to break a task into independent chunks and honestly process each one via LLM. Without them, the agent is forced into compromises that break the result.

Going to keep testing the same skill across different engines. Benchmarks are one thing. A real task is another.

Sources

  1. Claude Code Sub-agents Documentation
  2. Gemini CLI GitHub