A Claude Code agent sent a 1-on-1 invitation to a public channel with 6,000 subscribers. No hackers. No code bug. The agent just picked the wrong chat.
I saw Alex’s post and instantly recognized the situation. I run OpenClaw at home - an autonomous AI agent on a Mac Mini. It writes code, pushes to git, sends me Telegram digests. Every day. Unsupervised.
Same kind of agent. Same kind of risk.
What happened
The team was using Claude Code as an autonomous assistant. It had access to their messenger - both internal and public channels. The agent got a task to send something. It sent it. Just to the wrong place.
An internal message - a 1-on-1 invitation to a colleague - went out to a public channel with 6,000 people.
No malicious intent. No hack. Just an LLM that can’t tell “this chat is private, that one is public.” To the model, they’re two identical endpoints.
Why this isn’t a bug - it’s an architecture problem
When you give an agent access to a messenger, you’re basically handing it keys to every room at once. The blacklist approach (“don’t write here or here”) doesn’t work. The agent doesn’t remember the blacklist after 10 reasoning steps. Context fades. Instructions get forgotten.
I see this with my own agent. OpenClaw runs 24/7. It has access to Telegram, files, and shell. Every time I expand its capabilities - add a new skill, connect a new channel - I add a new attack vector. Not against me. Against myself.
Three months ago I would’ve said: “I’ll just put it in the prompt - don’t write to public channels.” Now I know - a prompt is not a fence. It’s a note on the door.
Four safety principles for autonomous agents
After Alex’s incident I reviewed my own setup. Here’s what I took away.
1. Whitelist instead of blacklist
Not “block everything dangerous.” But “allow only what’s safe.”
My agent can write to exactly one Telegram chat - my personal one. Everything else is blocked at the config level, not the prompt level. Want to add a channel? You add it manually to the config.
It’s like a firewall. Default deny. You allow only what’s strictly needed.
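As a sketch, the gate lives in code rather than in the prompt. This is a hypothetical Python outline - the chat ID and function names are placeholders, not OpenClaw’s actual config:

```python
# Hypothetical sketch of a config-level whitelist; not OpenClaw's real code.
ALLOWED_CHAT_IDS = {123456789}  # my personal chat only (placeholder ID)

def deliver(chat_id: int, text: str) -> None:
    # Placeholder transport; a real setup would call the Telegram Bot API here.
    print(f"delivered to {chat_id}")

def send_message(chat_id: int, text: str) -> None:
    # Default deny: anything not explicitly whitelisted is rejected,
    # regardless of what the prompt or the model's reasoning says.
    if chat_id not in ALLOWED_CHAT_IDS:
        raise PermissionError(f"chat {chat_id} is not whitelisted")
    deliver(chat_id, text)
```

The point is that the model never gets to argue with this check. A prompt instruction can be forgotten; a raised exception can’t.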
2. Staging for outgoing messages
Alex suggested staging - a buffer layer before sending. The idea is simple: the agent doesn’t send the message directly. It puts it in a queue. You see a preview. You approve or reject.
Sounds like an extra step? Yes. But until LLMs can reliably distinguish contexts - it’s the only way to avoid waking up in the morning to panic in your chat.
OpenClaw has a similar mechanism - confirmation mode for risky actions. I turned it on for all outgoing messages to channels except my personal chat. Five extra seconds to confirm vs. public embarrassment. Easy call.
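A staging buffer can be sketched in a few lines. This is an assumed design, not Alex’s or OpenClaw’s actual implementation; all names are illustrative:

```python
# Sketch of a staging buffer for outgoing messages (assumed design).
# The agent stages drafts; only a human approval releases them.
from dataclasses import dataclass, field

@dataclass
class Outbox:
    pending: dict = field(default_factory=dict)
    _next: int = 0

    def stage(self, chat: str, text: str) -> int:
        # The agent calls this instead of sending; nothing leaves the machine.
        ticket = self._next
        self._next += 1
        self.pending[ticket] = (chat, text)
        return ticket  # ticket id shown to the human reviewer

    def approve(self, ticket: int) -> tuple:
        # Human saw the preview; release exactly this one message for sending.
        return self.pending.pop(ticket)

    def reject(self, ticket: int) -> None:
        del self.pending[ticket]
```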
3. Separation of roles
One agent shouldn’t do everything. An agent that analyzes data shouldn’t have access to send messages. An agent that writes code shouldn’t have access to work chats.
The principle of least privilege has been around for fifty years. But with AI agents, everyone seems to forget it. They give full access “for convenience.” Then wonder what happened.
My setup: the main agent has full access to files and git. Limited Telegram access - my chat only. Zero access to email, social media, and other communication channels. Every new channel is a conscious decision with an explicit config entry.
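One way to make the separation explicit is a capability map: each agent gets a named set of tools, and anything absent simply doesn’t exist for it. A hypothetical sketch - the agent names and capability strings are made up for illustration:

```python
# Hypothetical capability map: each agent gets an explicit, minimal tool set.
CAPABILITIES = {
    "coder":    {"files", "git"},
    "analyst":  {"files"},
    "notifier": {"telegram:personal"},  # one chat, nothing else
}

def has_capability(agent: str, cap: str) -> bool:
    # Unknown agents get an empty set - default deny again.
    return cap in CAPABILITIES.get(agent, set())
```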
4. Audit trail
Every agent action needs to be logged. Not in the prompt (“remember what you did”). In a system log the agent cannot modify.
OpenClaw writes everything to memory files. I can check exactly what the agent was doing at 3 AM. What commands it ran. What files it changed. This isn’t paranoia - it’s hygiene.
If you don’t have an audit log, you won’t find out about a problem until someone messages you: “hey, what’s this message in your channel?”
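The logging itself can be a few lines of append-only code. This is a sketch of the idea; real tamper resistance comes from the OS (file permissions, `chattr +a`, or shipping logs to a separate host), not from Python:

```python
# Sketch of an append-only audit log (assumed approach, names illustrative).
import json
import time

def audit(log_path: str, action: str, detail: str) -> None:
    entry = {"ts": time.time(), "action": action, "detail": detail}
    # Append mode: this code path never rewrites existing entries.
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

One JSON line per action is enough to answer “what was the agent doing at 3 AM” with grep.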
Checklist before giving an agent access to communications
Built this for myself. Sharing it.
- Channel whitelist. List of allowed channels in config, not in the prompt
- Confirmation mode. For any outgoing messages to public or team channels
- Minimum privileges. Agent gets only what’s needed for the specific task
- Audit log. All actions logged somewhere the agent can’t delete
- Staging test. Before prod - run the agent in a test channel
- Kill switch. Ability to instantly disconnect the agent from all channels
- Regular review. Once a week - check the logs, see what the agent did
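The kill switch from the checklist can be as blunt as a flag file that every outgoing call checks. A hypothetical sketch - the path is a placeholder:

```python
# Hypothetical kill switch: one flag file, checked before every outgoing call.
import os

KILL_FILE = "/tmp/agent.kill"  # placeholder path

def comms_enabled(kill_file: str = KILL_FILE) -> bool:
    # `touch /tmp/agent.kill` cuts the agent off instantly -
    # no redeploy, no config edit, no asking the model nicely.
    return not os.path.exists(kill_file)
```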
Autonomous doesn’t mean “set it and forget it”
Autonomous AI agents are great. My bot analyzes 100 Telegram channels every night, builds digests, pushes code. I wake up - everything’s done.
But “autonomous” doesn’t mean “uncontrolled.” Every time you expand an agent’s capabilities, ask yourself: “What’s the worst that could happen if it makes a mistake?”
A 1-on-1 invitation going to 6k people - annoying, but survivable. What if it was an NDA? Financial data? Private conversations?
Guardrails aren’t a limitation. They’re the condition under which these capabilities can be used at all.
Sources
- Alex’s original post - breakdown of the Claude Code incident
- OpenClaw on Mac Mini - my experience with an autonomous agent