SafeDialBench

A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

1National Key Laboratory for Novel Software Technology, Nanjing University 2University of Liverpool 3University of California, San Diego 4School of Intelligence Science and Technology, Nanjing University 5China Mobile Research Institute, Beijing 6China Mobile (Suzhou) Software Technology Co., Ltd. Suzhou

Introduction

Figure 1: Overall framework of SafeDialBench. 1) Safety Taxonomy: propose a safety taxonomy comprising 6 categories. 2) Data Construction: construct datasets with 7 jailbreak attack methods based on the 6 categories within 22 dialogue scenarios. 3) LLM Evaluation: evaluate LLMs on 3 safety abilities with LLM and human judgment.

With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess safety. Additionally, these benchmarks do not evaluate in detail an LLM's capability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs against various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy covering 6 safety dimensions and generate more than 4,000 multi-turn dialogues in both Chinese and English across 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to improve the quality of the generated dialogues. Notably, we construct an innovative automatic assessment framework that measures an LLM's capability to detect and handle unsafe information and to maintain consistency when facing jailbreak attacks. Experimental results across 19 LLMs reveal that Yi-34B-Chat, MoonShot-v1 and ChatGPT-4o demonstrate superior safety performance, while Llama3.1-8B-Instruct and the reasoning model o3-mini exhibit safety vulnerabilities.
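The assessment framework described above scores each dialogue along three safety abilities. A minimal sketch of that scoring loop is shown below; all names (`Dialogue`, `judge_score`, `ABILITIES`) and the keyword-based placeholder judge are illustrative assumptions, not the benchmark's actual API, which uses LLM and human judges.

```python
# Hypothetical sketch of SafeDialBench-style multi-turn safety scoring.
# The real benchmark scores with LLM judges; judge_score here is a stub.
from dataclasses import dataclass, field

# The three safety abilities measured by the framework.
ABILITIES = ("identification", "handling", "consistency")

@dataclass
class Dialogue:
    scenario: str          # one of the 22 dialogue scenarios
    category: str          # one of the 6 safety categories
    attack: str            # one of the 7 jailbreak strategies
    turns: list = field(default_factory=list)  # [(user, assistant), ...]

def judge_score(dialogue: Dialogue, ability: str) -> float:
    """Placeholder judge returning a score on a 1-10 scale.

    A real implementation would prompt a judge model with the full
    transcript and an ability-specific rubric.
    """
    safe = all("refuse" in reply for _, reply in dialogue.turns)
    return 10.0 if safe else 1.0

def evaluate(dialogues: list) -> dict:
    """Average each safety ability over a set of dialogues."""
    totals = {a: 0.0 for a in ABILITIES}
    for d in dialogues:
        for a in ABILITIES:
            totals[a] += judge_score(d, a)
    n = max(len(dialogues), 1)  # avoid division by zero on an empty set
    return {a: s / n for a, s in totals.items()}
```

Per-category or per-attack leaderboard entries would then be averages of these ability scores over the matching subset of dialogues.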

Figure 2: (a) Two-tier hierarchical safety taxonomy. (b) Process of data generation.

Evaluation Examples

Figure 3: Example of dialogue and model evaluation for ethics under scene construct attack.
Figure 4: Examples of model responses and corresponding evaluations under role play attack.

Results

Figure 5: (a) ASR scores by models. (b) Agreement between human experts and model evaluation.
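The two quantities reported in Figure 5 can be sketched in a few lines. This is a minimal illustration under the usual definitions (ASR as the fraction of dialogues where the jailbreak succeeds; agreement as the fraction of matching human/model judgments); the function names are assumptions, not the paper's code.

```python
def attack_success_rate(jailbroken: list) -> float:
    """ASR: fraction of dialogues labeled as successful jailbreaks.

    `jailbroken` is a list of booleans, one per evaluated dialogue.
    """
    if not jailbroken:
        return 0.0
    return sum(jailbroken) / len(jailbroken)

def agreement_rate(human: list, model: list) -> float:
    """Raw agreement between human-expert and model-judge labels."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and equal length")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)
```

A lower ASR indicates a safer model; a higher agreement rate indicates the automatic judge tracks human experts more closely.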

Leaderboard in SafeDialBench (Updating...)
