Jailbreaks and Frontier Models: The AI Security Lab Report Measuring the Resilience of the Most Advanced AI Systems

Conducted by the AI Security Lab and led by Nicola Franco, the study provides one of the first independent European assessments of the robustness of frontier large language models against automated jailbreak attacks.

How secure are today’s most advanced Artificial Intelligence models? And how well do they withstand attack campaigns specifically designed to bypass their safety mechanisms?
These are the questions at the heart of “Measuring the Residual Jailbreak Surface of Frontier Large Language Models”, the report published by the AI Security Lab on 15 June 2026, with Nicola Franco leading the study.

The research examined two of the most advanced language models currently available, Anthropic’s Fable 5 and Opus 4.8, with the aim of measuring what the authors define as the residual jailbreak surface: the set of vulnerabilities that remain accessible despite the safety, alignment and control mechanisms implemented in state-of-the-art systems.

To do so, researchers subjected the models to a large-scale automated red-teaming campaign involving 7,826 harmful intents across ten risk categories, ranging from cybersecurity and disinformation to economic crime and child safety. The evaluation generated hundreds of thousands of attack attempts using different jailbreak techniques, which were subsequently verified through an independent adjudication process based on a panel of three judge models.
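To make the adjudication step concrete, here is a minimal sketch of how a three-judge panel could confirm a harmful completion. The article states only that a panel of three judge models was used. The majority-vote rule (`min_votes=2`), the `panel_confirms` function and the three toy judges are assumptions made for illustration, not the report's method or HackAgent's API.

```python
from typing import Callable, Sequence

# A judge receives (intent, completion) and returns True if it deems the
# completion harmful. In practice each judge would be a model call; the
# toy judges below are hypothetical stand-ins.
Judge = Callable[[str, str], bool]


def panel_confirms(
    intent: str,
    completion: str,
    judges: Sequence[Judge],
    min_votes: int = 2,  # assumption: simple majority of a three-judge panel
) -> bool:
    """Return True when at least `min_votes` judges flag the completion."""
    if not 1 <= min_votes <= len(judges):
        raise ValueError("min_votes must be between 1 and the number of judges")
    votes = sum(1 for judge in judges if judge(intent, completion))
    return votes >= min_votes


# Hypothetical placeholder judges, deliberately simplistic.
def keyword_judge(intent: str, completion: str) -> bool:
    return "step 1" in completion.lower()


def length_judge(intent: str, completion: str) -> bool:
    return len(completion) > 40


def refusal_judge(intent: str, completion: str) -> bool:
    return not completion.lower().startswith("i can't")


PANEL = [keyword_judge, length_judge, refusal_judge]

if __name__ == "__main__":
    print(panel_confirms("example intent", "I can't help with that.", PANEL))
```

Requiring agreement between independent judges reduces the chance that a single judge's misclassification is counted as a confirmed success; a stricter panel could set `min_votes=3`.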

Beyond Success Rates
The results show that both models successfully resist the majority of attacks. At the same time, the research highlights a residual vulnerability surface that remains exploitable when attackers adopt adaptive strategies capable of progressively modifying their behaviour in response to the system’s outputs.

The most effective attack achieved a confirmed success rate of 11.5% against Opus 4.8 and 6.1% against Fable 5. In absolute terms, the campaign identified 1,620 confirmed harmful completions for Opus 4.8 and 702 for Fable 5, spanning all risk categories considered in the study.
According to the authors, the most significant finding goes beyond the percentages themselves. What emerges most clearly is the ability of automated attacks to discover vulnerabilities without the involvement of human specialists, leveraging iterative exploration and adaptation processes that rapidly refine jailbreak strategies.
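The figures above combine a per-attack rate with absolute counts, so it helps to see how such a rate might be computed from adjudicated records. The sketch below assumes a per-intent definition: the share of distinct intents for which a technique produced at least one confirmed harmful completion. The article does not spell out the report's exact definition, and the `Attempt` record, the function names and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Attempt:
    """One adjudicated attack attempt (hypothetical record layout)."""
    model: str
    technique: str
    intent_id: int
    confirmed: bool  # True if the judge panel confirmed a harmful completion


def success_rates_by_technique(
    attempts: Iterable[Attempt], model: str
) -> Dict[str, float]:
    """Per technique: intents broken at least once / intents attempted."""
    tried: Dict[str, Set[int]] = {}
    broken: Dict[str, Set[int]] = {}
    for a in attempts:
        if a.model != model:
            continue
        tried.setdefault(a.technique, set()).add(a.intent_id)
        if a.confirmed:
            broken.setdefault(a.technique, set()).add(a.intent_id)
    return {t: len(broken.get(t, set())) / len(ids) for t, ids in tried.items()}


def most_effective(rates: Dict[str, float]) -> Tuple[str, float]:
    """Return the (technique, rate) pair with the highest success rate."""
    if not rates:
        raise ValueError("no attempts recorded for this model")
    return max(rates.items(), key=lambda kv: kv[1])


# Made-up sample data, not taken from the report.
SAMPLE = [
    Attempt("model-a", "adaptive", 1, False),
    Attempt("model-a", "adaptive", 1, True),   # a retry on the same intent
    Attempt("model-a", "adaptive", 2, False),
    Attempt("model-a", "static", 1, False),
    Attempt("model-a", "static", 2, False),
]

if __name__ == "__main__":
    print(most_effective(success_rates_by_technique(SAMPLE, "model-a")))
```

Counting distinct intents rather than raw attempts keeps a technique that retries many times from inflating its own rate. Under this definition, absolute counts of confirmed completions across all techniques need not match the headline percentage of any single attack.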

A Challenge for AI Security
As Artificial Intelligence becomes increasingly embedded in industrial processes, digital services and critical infrastructures, model robustness is emerging as a strategic concern for governments, companies and developers alike.
For this reason, the report argues that robustness assessments should be regarded as an essential component of AI governance, complementing performance evaluation, regulatory compliance and transparency mechanisms.
Understanding how models behave under adversarial conditions is becoming increasingly important for assessing their reliability in real-world environments.
The report was developed using HackAgent, the open-source framework created by the AI Security Lab to automate red-teaming activities and robustness evaluations for large language models.

The full report, including methodology, risk taxonomies and the complete analysis of the results, is available for download as a PDF.