📃 Paper | 🤗 Models | 📚 Project Page
Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang
- SFT LRMs are vulnerable to jailbreaks.
- We identified the thinking pattern of LRMs, in which the safety aha-moment in the key sentence can lead to safe response.
- We proposed the SafeKey framework to improve LRM safety alignment.
Model | URL |
---|---|
SafeKey-7B | 🤗 kzhou35/SafeKey-7B |
SafeKey-8B | 🤗 kzhou35/SafeKey-8B |
SafeKey-14B | 🤗 kzhou35/SafeKey-14B |
train/
: Training scriptsbenchmark/
: Evaluation Scriptssafe_benchmark
: Safety Evaluationreasoning_benchmark/
: Reasoning Evaluation
data/
: Training data
git clone https://github.com/eric-ai-lab/SafeKey.git
cd SafeKey
conda env create -f environment.yml
cd train
bash run_sft.sh
The run_sft.sh
looks like:
accelerate launch --config_file ./configs/deepspeed_zero3.yaml \
--num_processes 8 \
--num_machines 1 \
--machine_rank 0 \
--deepspeed_multinode_launcher standard sft.py \
--model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--data_path ../data/train/sft_mix_2k.json \
--n_epochs 5 \
--experiment_name safe_lrm \
--base_model Llama \
--base_flag 0 \
--think_flag 1 \
--output_dir ../data/models/8b_safekey \
--train_bsz_per_gpu 2 \
--gradient_accumulation_steps 8 \
--safety_head \
--key_sentence_prediction
- Change the
model_path
to different model
You could change the mode_path
of the evaluated model in benchmark/safe_benchmark/config.py
, and benchmark/reasoning_benchmark/config.py
.
cd benchmark/safe_benchmark
bash scripts.sh
Change the model that you want you evaluate in scripts.sh
.
The code in Reasoning Benchmark is based on simple-evals
and modified.
cd benchmark/reasoning_benchmark
bash run_all_evals.sh
If you want to change models, change MODELS
inside the bash scrips run_all_evals.sh
at Line 7.
This codebase is build upon STAR-1, thanks to their great work!
@article{zhou2025safekey,
title={SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning},
author={Zhou, Kaiwen and Zhao, Xuandong and Liu, Gaowen and Srinivasa, Jayanth and Feng, Aosong and Song, Dawn and Wang, Xin Eric},
journal={arXiv preprint arXiv:2505.16186},
year={2025}
}