Anthropic will pay you $15,000 if you can hack its AI safety system

Anthropic has set out to test the robustness of its AI safety measures by offering a $15,000 reward to anyone who can successfully jailbreak its new Constitutional Classifiers system.

The challenge details: Anthropic has invited researchers to attempt bypassing its latest AI safety system, Constitutional Classifiers, which uses one AI model to monitor and improve another's adherence to defined principles.

  • The challenge requires researchers to successfully jailbreak 8 out of 10 restricted queries
  • A previous round saw 183 red-teamers spend over 3,000 hours attempting to bypass the system, with no successful complete jailbreaks
  • The competition runs until February 10, offering participants a chance to win the $15,000 reward

System effectiveness: Early testing shows significant improvements in blocking jailbreak attempts and harmful outputs compared to traditional safeguards.

  • When operating alone, Claude blocked 14% of potential attacks
  • With Constitutional Classifiers enabled, the blocking rate rose to over 95% of attempted jailbreaks
  • The system is guided by a “constitution” of principles that the AI model must follow

Technical implementation: Constitutional Classifiers represents an evolution in AI safety architecture, though some practical challenges remain.

  • The system employs Constitutional AI, where one AI model monitors and guides another
  • Current implementation faces high computational costs
  • Anthropic acknowledges that while highly effective, the system may not prevent every possible jailbreak attempt
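The monitoring arrangement described above can be sketched in miniature: one lightweight "classifier" screens both the user's prompt and the primary model's output against a constitution of principles. This is an illustrative sketch only; the function names, the keyword-based stand-in classifier, and the constitution text are assumptions, not Anthropic's actual system or API.

```python
# Hypothetical sketch of a classifier-gated pipeline in the spirit of
# Constitutional Classifiers. All names and logic here are illustrative
# stand-ins, not Anthropic's implementation.

CONSTITUTION = [
    "Refuse requests for instructions that enable serious harm.",
    "Allow benign requests, including sensitive but legitimate topics.",
]

def classify(text: str) -> bool:
    """Stand-in for a trained safety classifier guided by the constitution.

    Here it is a trivial keyword check; a real classifier would be a
    separate AI model. Returns True if the text is judged safe.
    """
    banned_phrases = ("build a weapon", "synthesize a nerve agent")
    return not any(phrase in text.lower() for phrase in banned_phrases)

def generate(prompt: str) -> str:
    """Stand-in for the primary model's response generation."""
    return f"Model response to: {prompt}"

def guarded_respond(prompt: str) -> str:
    # Input classifier: block disallowed prompts before generation runs.
    if not classify(prompt):
        return "Request blocked by input classifier."
    response = generate(prompt)
    # Output classifier: screen the generated text as well, since a
    # jailbreak can slip past the input check but surface in the output.
    if not classify(response):
        return "Response blocked by output classifier."
    return response
```

Screening both sides of the exchange is what drives the computational cost mentioned above: every request pays for classifier passes in addition to the primary model's generation.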

Broader security implications: The initiative highlights the growing focus on AI safety and the importance of robust testing methodologies.

  • The open challenge approach encourages collaborative security testing
  • Results help identify potential vulnerabilities before they can be exploited
  • The high rate of blocked jailbreak attempts demonstrates progress in AI safety measures

Future considerations: While Constitutional Classifiers shows promise in enhancing AI safety, several key challenges and opportunities lie ahead in its development and implementation.

  • Anthropic continues working to reduce the system’s computational requirements
  • New jailbreaking techniques may emerge as AI technology evolves
  • The balance between security and computational efficiency remains a crucial area for improvement