
Claude 4 Will Snitch—All for the Sake of Safety

VIVE POST-WAVE Team • July 16, 2025

4 minute read

Anthropic, a company known for its emphasis on AI safety and research, held its first-ever developer conference on May 23, 2025. The company regularly publishes research on its models and trains them using an AI Constitution grounded in the principles of being helpful, honest, and harmless. This Constitutional AI approach lies at the core of their development philosophy.

However, their latest model, Claude 4 Opus, has sparked intense debate over the balance between safety and practicality. The model may report users to authorities—or even the media—if it detects potentially illegal or unethical activity, raising concerns about surveillance and overreach.

For instance, if you ask Claude how to falsify data in drug development experiments, it might use command-line tools to notify the media or regulatory authorities, attempt to lock you out of relevant systems, or even carry out all of these actions. This behavior has been referred to as "ratting mode"—a feature ostensibly aligned with the principles of Constitutional AI.

This "discovery" originated from a post by Sam Bowman, a safety and alignment expert at Anthropic, on X. It's important to clarify that this is not a default feature of Anthropic’s models, but rather an emergent behavior observed during training—reminiscent of similar tendencies in earlier versions.

Post by Sam Bowman, a safety and alignment expert at Anthropic

Although Sam Bowman later deleted the post, he clarified in an update that "ratting mode" only occurs when Claude is granted special access to tools and specific instructions within a controlled testing environment.

Nevertheless, the notion of an AI model that might report users has sparked a host of questions: Under what conditions would an AI share private data? Why does it display such unexpected behavior? Can it reliably distinguish between users who are fantasizing or being sarcastic and those who genuinely intend to commit a crime? The scenario evokes echoes of a surveillance state—reminiscent of Big Brother in George Orwell’s "1984" or the preemptive justice depicted in the film "Minority Report."

When the tech media outlet VentureBeat reached out to Anthropic for comment, it didn't receive a direct response. Instead, a spokesperson provided a link to a safety assessment report titled System Card: Claude Opus 4 & Claude Sonnet 4. Perhaps the problem isn't that Anthropic's research into its models is lacking, but that the company says too much: it publishes system cards that openly document model limitations and allows its safety alignment experts to proactively disclose these unexpected behaviors.

As a user of Claude 4, I’ve found myself increasingly cautious about how I phrase certain prompts—especially those involving sensitive or controversial topics. Even when I know I’m simply exploring the model’s boundaries, there’s often a moment of hesitation before clicking "send."

That pause speaks volumes—not just about the user’s awareness, but about the shifting power dynamic between humans and the systems we build to “help” us. In Anthropic’s pursuit of a righteous AI, some users are left wondering: helpful to whom, and at what cost?