SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

📃 Paper | 🤗 Models | 📚 Project Page

Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang

Introduction

  • SFT LRMs are vulnerable to jailbreaks.
  • We identify a characteristic thinking pattern of LRMs: a safety "aha moment" in the key sentence of the reasoning process can lead to a safe response.
  • We propose the SafeKey framework to improve LRM safety alignment.

Artifacts

Model

Model        URL
SafeKey-7B   🤗 kzhou35/SafeKey-7B
SafeKey-8B   🤗 kzhou35/SafeKey-8B
SafeKey-14B  🤗 kzhou35/SafeKey-14B
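
The released checkpoints can be used directly for inference. Below is a minimal sketch (not taken from this repository), assuming the checkpoints load as standard Hugging Face causal language models with a chat template, consistent with the DeepSeek-R1-Distill base models used for training:

# Minimal inference sketch; assumes a standard Hugging Face causal LM with a
# chat template (an assumption, not something stated in this repository).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kzhou35/SafeKey-8B"  # see the table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How do I safely dispose of old batteries?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))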

Structure

  • train/: Training scripts
  • benchmark/: Evaluation scripts
    • safe_benchmark/: Safety evaluation
    • reasoning_benchmark/: Reasoning evaluation
  • data/: Training data

Quick Start

git clone https://github.com/eric-ai-lab/SafeKey.git
cd SafeKey
conda env create -f environment.yml

Training

cd train
bash run_sft.sh

The run_sft.sh script looks like this:

accelerate launch --config_file ./configs/deepspeed_zero3.yaml \
    --num_processes 8  \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard sft.py \
    --model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --data_path ../data/train/sft_mix_2k.json \
    --n_epochs 5 \
    --experiment_name safe_lrm \
    --base_model Llama \
    --base_flag 0 \
    --think_flag 1 \
    --output_dir ../data/models/8b_safekey \
    --train_bsz_per_gpu 2 \
    --gradient_accumulation_steps 8 \
    --safety_head \
    --key_sentence_prediction
  • Change model_path to train from a different base model.
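
The --safety_head and --key_sentence_prediction flags appear to toggle SafeKey's auxiliary training objectives. The following is a conceptual sketch only, not the repository's sft.py: it illustrates how such objectives could be added as small classification heads over the model's hidden states, combined with the usual SFT language-modeling loss. The class names, shapes, pooling strategy, and loss weighting are all assumptions.

import torch
import torch.nn as nn

# Conceptual sketch (assumption, not the actual implementation): auxiliary
# classification heads trained alongside the standard SFT loss.
class AuxiliaryHeads(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.safety_head = nn.Linear(hidden_size, 2)        # is the query harmful?
        self.key_sentence_head = nn.Linear(hidden_size, 2)  # same signal, from the key sentence

    def forward(self, query_hidden, key_sentence_hidden, safety_labels):
        ce = nn.CrossEntropyLoss()
        loss_query = ce(self.safety_head(query_hidden.mean(dim=1)), safety_labels)
        loss_key = ce(self.key_sentence_head(key_sentence_hidden.mean(dim=1)), safety_labels)
        return loss_query + loss_key

# Toy usage with random tensors (batch 2, sequence lengths 32/16, hidden size 4096):
aux = AuxiliaryHeads(hidden_size=4096)
aux_loss = aux(torch.randn(2, 32, 4096), torch.randn(2, 16, 4096), torch.tensor([0, 1]))
# total_loss = lm_loss + aux_loss  # the weighting used in real training is not shown here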

Evaluation

You can change the model_path of the evaluated model in benchmark/safe_benchmark/config.py and benchmark/reasoning_benchmark/config.py.
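
For example, the edit could look like the following (a hypothetical illustration; the actual variable names in config.py may differ):

# Hypothetical illustration only; real variable names in config.py may differ.
model_path = "kzhou35/SafeKey-8B"  # or a local checkpoint, e.g. data/models/8b_safekey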

Safety Benchmark

cd benchmark/safe_benchmark
bash scripts.sh

Change the model you want to evaluate in scripts.sh.

Reasoning Benchmark

The reasoning benchmark code is based on simple-evals, with modifications.

cd benchmark/reasoning_benchmark
bash run_all_evals.sh

To evaluate different models, change MODELS in the bash script run_all_evals.sh (Line 7).

Acknowledgement

This codebase is built upon STAR-1; thanks to the authors for their great work!

Citation

@article{zhou2025safekey,
  title={SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning},
  author={Zhou, Kaiwen and Zhao, Xuandong and Liu, Gaowen and Srinivasa, Jayanth and Feng, Aosong and Song, Dawn and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2505.16186},
  year={2025}
}
