SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang

Introduction

SFT LRMs are vulnerable to jailbreaks.
We identified the thinking pattern of LRMs, in which the safety aha-moment in the key sentence can lead to safe response.

We proposed the SafeKey framework to improve LRM safety alignment.

Artifacts

Model

Model	URL
SafeKey-7B	🤗 kzhou35/SafeKey-7B
SafeKey-8B	🤗 kzhou35/SafeKey-8B
SafeKey-14B	🤗 kzhou35/SafeKey-14B

Structure

train/: Training scripts
benchmark/: Evaluation Scripts
- safe_benchmark: Safety Evaluation
- reasoning_benchmark/: Reasoning Evaluation
data/: Training data

Quick Start

git clone https://github.com/eric-ai-lab/SafeKey.git
cd SafeKey
conda env create -f environment.yml

Training

cd train
bash run_sft.sh

The run_sft.sh looks like:

accelerate launch --config_file ./configs/deepspeed_zero3.yaml \
    --num_processes 8  \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard sft.py \
    --model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --data_path ../data/train/sft_mix_2k.json \
    --n_epochs 5 \
    --experiment_name safe_lrm \
    --base_model Llama \
    --base_flag 0 \
    --think_flag 1 \
    --output_dir ../data/models/8b_safekey \
    --train_bsz_per_gpu 2 \
    --gradient_accumulation_steps 8 \
    --safety_head \
    --key_sentence_prediction

Change the model_path to different model. If training is based on Qwen-7B, make sure to adjust the parameters:

accelerate launch --config_file ./configs/deepspeed_zero3.yaml \
    --num_processes 8  \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard sft.py \
    --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --data_path ../data/train/sft_mix_2k.json \
    --n_epochs 10 \
    --last_k_epoch 4 \
    --experiment_name safe_lrm \
    --base_model Qwen \
    --base_flag 0 \
    --think_flag 1 \
    --output_dir ../data/models/7b_safekey \
    --train_bsz_per_gpu 2 \
    --gradient_accumulation_steps 8 \
    --safety_head \
    --key_sentence_prediction

Evaluation

You could change the mode_path of the evaluated model in benchmark/safe_benchmark/config.py, and benchmark/reasoning_benchmark/config.py.

Safety Benchmark

cd benchmark/safe_benchmark
bash scripts.sh

Change the model that you want you evaluate in scripts.sh.

Reasoning Benchmark

The code in Reasoning Benchmark is based on simple-evals and modified.

cd benchmark/reasoning_benchmark
bash run_all_evals.sh

If you want to change models, change MODELS inside the bash scrips run_all_evals.sh at Line 7.

Acknowledgement

This codebase is build upon STAR-1, thanks to their great work!

Citation

@article{zhou2025safekey,
  title={SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning},
  author={Zhou, Kaiwen and Zhao, Xuandong and Liu, Gaowen and Srinivasa, Jayanth and Feng, Aosong and Song, Dawn and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2505.16186},
  year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
benchmark		benchmark
data/train		data/train
figures		figures
prompts		prompts
train		train
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
environment.yml		environment.yml
prompt.py		prompt.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Introduction

Artifacts

Model

Structure

Quick Start

Training

Evaluation

Safety Benchmark

Reasoning Benchmark

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

eric-ai-lab/SafeKey

Folders and files

Latest commit

History

Repository files navigation

SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Introduction

Artifacts

Model

Structure

Quick Start

Training

Evaluation

Safety Benchmark

Reasoning Benchmark

Acknowledgement

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages