This repository contains the code for reproducing experiments from our ICLR 2026 paper on statistical detection of LLM model degradations using McNemar's test. We provide tools to detect whether accuracy changes in optimized models are due to actual degradation or evaluation noise.
We recommend using uv: https://docs.astral.sh/uv/getting-started/installation/
uv venv ~/venv_accuracy_paper --python 3.12 --seed
source ~/venv_accuracy_paper/bin/activate
pip install vllm==0.10.0 lm-eval[math,ifeval,sentencepiece]==0.4.8
We generally recommend using our permutation-based tests; see Appendix D. For binary data they give results equivalent to the direct McNemar tests, but they generalize to non-binary data and multiple reruns.
If you run the same samples through the model multiple times (e.g., with non-zero temperature), first average the score for each example.
Then organize the per-sample scores in CSV files, one file per task and model.
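For example, if you stored one row per (prompt, rerun) pair, a minimal pandas sketch of the averaging and per-task CSV export could look as follows (the file names are hypothetical; only the prompt_id and score columns are prescribed by the format described below):

import pandas as pd

# Hypothetical raw file with one row per (prompt_id, rerun); columns: prompt_id, score
raw = pd.read_csv("raw_scores_math_model_a.csv")

# Average the score of each example over its reruns
per_example = raw.groupby("prompt_id", as_index=False)["score"].mean()

# One CSV per task+model, containing exactly the prompt_id and score columns
per_example.to_csv("math_model_a.csv", index=False)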
Next, run the script with the following arguments:
python continuous_aggregation_script.py model_paths.json ./output_dir/
Inputs:
model_paths.json: JSON file mapping model names to their task CSV files
output_dir: Directory for output files
CSV Format: Each CSV should contain prompt_id and score columns.
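The exact JSON schema is defined by continuous_aggregation_script.py; as a purely hypothetical illustration of a model-to-task-to-CSV mapping, such a file could be generated like this:

import json

# Hypothetical structure: model name -> task name -> per-sample CSV path.
# Check continuous_aggregation_script.py for the schema it actually expects.
model_paths = {
    "baseline_model": {"math": "csvs/math_baseline.csv", "gsm8k": "csvs/gsm8k_baseline.csv"},
    "optimized_model": {"math": "csvs/math_optimized.csv", "gsm8k": "csvs/gsm8k_optimized.csv"},
}

with open("model_paths.json", "w") as f:
    json.dump(model_paths, f, indent=2)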
Outputs:
summary.csv: Pooled and per-task accuracies with p-values from permutation tests
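To make the statistical idea concrete, here is a minimal, self-contained sketch of a paired sign-flip permutation test on per-example score differences; it illustrates the principle only and is not the exact implementation used in continuous_aggregation_script.py:

import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for H0: models A and B have the same mean per-example score."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Under H0 the sign of each paired difference is exchangeable, so flip signs at random
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # +1 in numerator and denominator keeps the p-value strictly positive
    return (1 + np.sum(permuted >= observed)) / (n_permutations + 1)

# Example: averaged per-example scores of two models on the same prompts
print(paired_permutation_pvalue([1, 0, 1, 1, 0.5], [1, 0, 0, 1, 0.25]))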
For binary scores, we also provide an aggregation script that performs a statistical analysis on top of LM-Evaluation Harness runs:
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
When running lm-eval, it is crucial to store the per-example results using the --log_samples flag (see the example scripts).
Inputs:
task_metrics.json: Defines which metrics to extract for each task
checkpoint_list.json: Lists models and their result paths
output_dir: Directory for output files
Outputs:
model_comparison.csv: Detailed metrics for all tasks and models
summary_with_stderr.csv: Category-level metrics with standard errors
summary_without_stderr.csv: Category-level metrics without standard errors
model_differences.csv: Contingency table values and disagreement ratios
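For intuition on how the contingency-table values lead to a test decision, here is a minimal sketch of an exact McNemar test on paired binary scores (a standard binomial formulation for illustration, not necessarily the exact variant implemented in aggregation_script.py):

from scipy.stats import binomtest

def mcnemar_exact_pvalue(correct_a, correct_b):
    """Exact McNemar test on paired binary outcomes of two models."""
    # Only discordant pairs matter: examples where exactly one model is correct
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a == 1 and b == 0)
    if n01 + n10 == 0:
        return 1.0  # no disagreements, nothing to test
    # Under H0 (equal accuracy) each discordant pair favors either model with probability 1/2
    return binomtest(n10, n01 + n10, p=0.5, alternative="two-sided").pvalue

# Example: per-example correctness (1 = correct) of two checkpoints on the same prompts
print(mcnemar_exact_pvalue([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 1, 0]))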
Update paths in:
checkpoint_list.json: Replace /path/to/your/results with your actual results directory
llm_experiments/*.sh: Replace /path/to/your/hf_cache, your_huggingface_token_here, and output paths with your actual paths/token
Scripts are in the llm_experiments/ directory. Example:
cd llm_experiments/
bash llama_3_paper.sh
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
Scripts are in the plots_and_synthetic/ directory:
cd plots_and_synthetic/
python test_power_plot.py # Figure: Test power analysis
python pvalue_heatmap_comparison.py # Figure: P-value heatmaps
python test_power_vs_tasks.py # Figure: Power vs number of tasks
python intro_figure.py # Figure: Introductory example
Scripts are in the dataset_selection/ directory for seed/temperature analysis:
cd dataset_selection/
# Run evaluations with different seeds
bash mmlu_dataset_ablation_temp0.3.sh
bash mmlu_dataset_ablation_true_models.sh
# Analyze flip patterns across seeds
python seed_flip_analysis.py task_metrics.json results_dir1/ results_dir2/ [results_dir3/ ...]
Output: success_counts.json and success_histogram.pdf - an analysis of which documents consistently succeed or fail across different model runs
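As a rough illustration of what such a flip analysis does (this is not the logic of seed_flip_analysis.py, and the JSON layout is hypothetical), one can count per-document successes across runs and plot their distribution:

import json
import matplotlib.pyplot as plt

# Hypothetical input: one list of per-document binary scores per run, same document order
runs = [
    [1, 1, 0, 1, 0],  # seed/temperature setting 1
    [1, 0, 0, 1, 1],  # seed/temperature setting 2
    [1, 1, 0, 1, 0],  # seed/temperature setting 3
]

# Number of runs in which each document was solved
success_counts = [sum(scores) for scores in zip(*runs)]
with open("success_counts.json", "w") as f:
    json.dump(success_counts, f)

# Documents solved in every run or in none are stable; everything in between "flips"
plt.hist(success_counts, bins=range(len(runs) + 2), align="left", rwidth=0.8)
plt.xlabel("number of runs in which the document was solved")
plt.ylabel("number of documents")
plt.savefig("success_histogram.pdf")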
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file for details.
If you find our work useful or use our tests, you can cite our paper:
@inproceedings{
anonymous2026when,
title={When {LLM}s get significantly worse: A statistical approach to detect model degradations},
author={Jonas Kübler and Kailash Budhathoki and Matthäus Kleindessner and Xiong Zhou and Junming Yin and Ashish Khetan and George Karypis},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cM3gsqEI4K}
}