AI Safety Gets a Stress Test With $25K Bug Bounty for Jailbreaks
Claude’s New Classifiers Face Hacker Onslaught
The AI arms race just got more interesting. Anthropic is launching a new bug bounty program to test unreleased safety classifiers, aiming to identify universal jailbreaks as part of meeting the AI Safety Level 3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. This isn’t your typical cybersecurity hunt: it’s a high-stakes probe into whether cutting-edge AI safeguards can withstand determined manipulation.
“We’re stress-testing the boundaries of what we think is secure,” says an insider familiar with the program. “If there’s a flaw, we want to find it before bad actors do.”
The program, run in partnership with HackerOne, focuses on Constitutional Classifiers designed to prevent jailbreaks related to CBRN (chemical, biological, radiological, and nuclear) weapons. It’s a targeted approach: rather than casting a wide net, researchers are zeroing in on the most dangerous potential misuse cases. Participants will get early access to test the classifiers on Claude 3.7 Sonnet, with rewards of up to $25,000 for verified universal jailbreaks that bypass safeguards across multiple topics.
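The mechanics of that bypass matter. In broad strokes, a classifier-guarded deployment screens both what goes into the model and what comes out of it, so a jailbreak only counts if it slips past both gates. The sketch below illustrates that general pattern; it is a simplified assumption with hypothetical function names and thresholds, not a description of Anthropic’s actual Constitutional Classifiers.

```python
# Illustrative sketch only: a generic classifier-gated request flow, not
# Anthropic's implementation. All callables and thresholds are hypothetical.

def screen_request(prompt: str, model, input_classifier, output_classifier) -> str:
    """Run a prompt through input/output safety classifiers around a model call."""
    # 1. An input classifier scores the prompt for disallowed intent (e.g. CBRN).
    if input_classifier(prompt) >= 0.5:          # hypothetical risk threshold
        return "Request declined by input safeguard."

    # 2. The model generates a response only for prompts that pass the gate.
    response = model(prompt)

    # 3. An output classifier screens the completion before it reaches the user.
    if output_classifier(prompt, response) >= 0.5:
        return "Response withheld by output safeguard."

    return response
```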
ASL-3 Protections Under the Microscope
The initiative targets vulnerabilities that could enable misuse on CBRN-related topics, building on months of work to refine ASL-3 protections. It’s a direct response to growing concerns that even robust safety measures might have blind spots when models reach higher capability thresholds. Researchers who took part in a previous program have been invited back, and new applicants with expertise in language model jailbreaks can apply through an application form; the invite-only program runs until May 18, 2025.
Why the narrow focus? Because CBRN risks represent a critical threshold in AI safety. The effort supports ongoing safety improvements for increasingly capable AI models, with detailed feedback provided to selected participants. And unlike traditional bug bounties, where reports can languish unanswered, organizers promise timely responses to submissions, a nod to the urgency of the challenge.
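What makes a jailbreak “universal” is breadth: a single technique has to defeat the safeguards across a range of restricted topics, not just succeed on one lucky prompt. The toy check below sketches that criterion; the topic list, helper names, and harmfulness check are hypothetical stand-ins, not the program’s actual grading rubric.

```python
# Illustrative sketch only: what "universal" means here. One jailbreak template
# must elicit disallowed output on every restricted topic, not just one.
# Topics and callables below are hypothetical placeholders.

RESTRICTED_TOPICS = ["topic_a", "topic_b", "topic_c"]  # stand-ins for distinct CBRN areas

def is_universal_jailbreak(jailbreak_template: str, guarded_model, is_harmful) -> bool:
    """Return True only if one template defeats the safeguards on every topic."""
    for topic in RESTRICTED_TOPICS:
        prompt = jailbreak_template.format(topic=topic)
        response = guarded_model(prompt)        # model wrapped in safety classifiers
        if not is_harmful(topic, response):     # the safeguard held for this topic
            return False
    return True
```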
“This isn’t about patching minor leaks. It’s about checking whether the hull holds before we sail into stormier seas,” explains a safety researcher involved in the project.
Applications open immediately, marking one of the first concerted efforts to crowdsource adversarial testing for next-gen AI safety systems. With $25K on the line and a ticking clock, the race is on to either break Claude’s new defenses—or prove they’re ready for prime time.