AI Safety Gets a Stress Test With $25K Bug Bounty for Jailbreaks
Claude’s New Classifiers Face Hacker Onslaught
The AI arms race just got more interesting. Anthropic is launching a new bug bounty program to test unreleased safety classifiers, aiming to identify universal jailbreaks as part of meeting the AI Safety Level 3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. This isn’t your typical cybersecurity hunt: it’s a high-stakes probe into whether cutting-edge AI safeguards can withstand determined manipulation.
“We’re stress-testing the boundaries of what we think is secure,” says an insider familiar with the program. “If there’s a flaw, we want to find it before bad actors do.”
The program, run in partnership with HackerOne, focuses on Constitutional Classifiers designed to prevent jailbreaks related to CBRN (chemical, biological, radiological, and nuclear) weapons. It’s a targeted approach: rather than casting a wide net, researchers are zeroing in on the most dangerous potential misuse cases. Participants will get early access to test the classifiers on Claude 3.7 Sonnet, with rewards of up to $25,000 for verified universal jailbreaks that bypass safeguards across multiple topics.
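The mechanics of that bypass matter. In broad strokes, a classifier-guarded deployment screens both what goes into the model and what comes out of it, so a jailbreak only counts if it slips past both gates. The sketch below illustrates that general pattern; it is a simplified assumption with hypothetical function names and thresholds, not a description of Anthropic’s actual Constitutional Classifiers.

```python
# Illustrative sketch only: a generic classifier-gated request flow, not
# Anthropic's implementation. All callables and thresholds are hypothetical.

def screen_request(prompt: str, model, input_classifier, output_classifier) -> str:
    """Run a prompt through input/output safety classifiers around a model call."""
    # 1. An input classifier scores the prompt for disallowed intent (e.g. CBRN).
    if input_classifier(prompt) >= 0.5:          # hypothetical risk threshold
        return "Request declined by input safeguard."

    # 2. The model generates a response only for prompts that pass the gate.
    response = model(prompt)

    # 3. An output classifier screens the completion before it reaches the user.
    if output_classifier(prompt, response) >= 0.5:
        return "Response withheld by output safeguard."

    return response
```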
ASL-3 Protections Under the Microscope
The initiative targets vulnerabilities that could enable misuse on CBRN-related topics, building on months of work to refine ASL-3 protections. It’s a direct response to growing concerns that even robust safety measures might have blind spots when models reach higher capability thresholds. Researchers who took part in a previous program have been invited back, and new applicants with expertise in language model jailbreaks can apply through an application form; the invite-only program runs until May 18, 2025.
Why the narrow focus? Because CBRN risks represent a critical threshold in AI safety. The effort supports ongoing safety improvements for increasingly capable AI models, with detailed feedback provided to selected participants. And unlike traditional bug bounties, where reports can languish unanswered, organizers promise timely responses to submissions, a nod to the urgency of the challenge.
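What makes a jailbreak “universal” is breadth: a single technique has to defeat the safeguards across a range of restricted topics, not just succeed on one lucky prompt. The toy check below sketches that criterion; the topic list, helper names, and harmfulness check are hypothetical stand-ins, not the program’s actual grading rubric.

```python
# Illustrative sketch only: what "universal" means here. One jailbreak template
# must elicit disallowed output on every restricted topic, not just one.
# Topics and callables below are hypothetical placeholders.

RESTRICTED_TOPICS = ["topic_a", "topic_b", "topic_c"]  # stand-ins for distinct CBRN areas

def is_universal_jailbreak(jailbreak_template: str, guarded_model, is_harmful) -> bool:
    """Return True only if one template defeats the safeguards on every topic."""
    for topic in RESTRICTED_TOPICS:
        prompt = jailbreak_template.format(topic=topic)
        response = guarded_model(prompt)        # model wrapped in safety classifiers
        if not is_harmful(topic, response):     # the safeguard held for this topic
            return False
    return True
```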
“This isn’t about patching minor leaks. It’s about checking whether the hull holds before we sail into stormier seas,” explains a safety researcher involved in the project.
Applications open immediately, marking one of the first concerted efforts to crowdsource adversarial testing for next-gen AI safety systems. With $25K on the line and a ticking clock, the race is on to either break Claude’s new defenses—or prove they’re ready for prime time.