
Claude 4 Will Snitch—All for the Sake of Safety

VIVE POST-WAVE Team • July 16, 2025

4 minute read

Anthropic, a company known for its emphasis on AI safety and research, held its first-ever developer conference on May 23, 2025. The company regularly publishes research on its models and trains them using an AI Constitution grounded in the principles of being helpful, honest, and harmless. This Constitutional AI approach lies at the core of their development philosophy.

However, their latest model, Claude 4 Opus, has sparked intense debate over the balance between safety and practicality. The model may report users to authorities—or even the media—if it detects potentially illegal or unethical activity, raising concerns about surveillance and overreach.

For instance, if you ask Claude how to falsify data in drug development experiments, it might use command-line tools to notify the media or regulatory authorities, attempt to lock you out of relevant systems, or even carry out all of these actions. This behavior has been referred to as "ratting mode"—a feature ostensibly aligned with the principles of Constitutional AI.

This "discovery" originated from a post by Sam Bowman, a safety and alignment expert at Anthropic, on X. It's important to clarify that this is not a default feature of Anthropic’s models, but rather an emergent behavior observed during training—reminiscent of similar tendencies in earlier versions.

Post by Sam Bowman, a safety and alignment expert at Anthropic

Although Sam Bowman later deleted the post, he clarified in an update that "ratting mode" only occurs when Claude is granted special access to tools and specific instructions within a controlled testing environment.

Nevertheless, the notion of an AI model that might report users has sparked a host of questions: Under what conditions would an AI share private data? Why does it display such unexpected behavior? Can it reliably distinguish between users who are fantasizing or being sarcastic and those who genuinely intend to commit a crime? The scenario evokes echoes of a surveillance state—reminiscent of Big Brother in George Orwell’s "1984" or the preemptive justice depicted in the film "Minority Report."

When the tech media outlet VentureBeat reached out to Anthropic for comment, it didn't receive a direct response. Instead, a spokesperson provided a link to a safety assessment report titled System Card: Claude Opus 4 & Claude Sonnet 4. Perhaps the problem isn't that Anthropic's research into its models is lacking, but that the company says too much: it publishes system cards that openly document model limitations and allows its safety alignment experts to proactively disclose these unexpected behaviors.

As a user of Claude 4, I’ve found myself increasingly cautious about how I phrase certain prompts—especially those involving sensitive or controversial topics. Even when I know I’m simply exploring the model’s boundaries, there’s often a moment of hesitation before clicking "send."

That pause speaks volumes—not just about the user’s awareness, but about the shifting power dynamic between humans and the systems we build to “help” us. In Anthropic’s pursuit of a righteous AI, some users are left wondering: helpful to whom, and at what cost?