This repository contains the code for reproducing experiments from our ICLR 2026 paper on statistical detection of LLM model degradations using McNemar's test. We provide tools to detect whether accuracy changes in optimized models are due to actual degradation or evaluation noise.
We recommend using uv: https://docs.astral.sh/uv/getting-started/installation/
uv venv ~/venv_accuracy_paper --python 3.12 --seed
source ~/venv_accuracy_paper/bin/activate
pip install vllm==0.10.0 lm-eval[math,ifeval,sentencepiece]==0.4.8
We generally recommend using our permutation-based tests; see Appendix D. For binary data they give results equivalent to the direct McNemar tests, but they generalize to non-binary data and multiple reruns.
If you run the same samples through the model multiple times (e.g., with non-zero temperature), first average the score for each example.
Then organize the per-sample scores in CSV files, one file per task and model.
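For example, if you stored one row per (prompt, rerun) pair, a minimal pandas sketch of the averaging and per-task CSV export could look as follows (the file names are hypothetical; only the prompt_id and score columns are prescribed by the format described below):

import pandas as pd

# Hypothetical raw file with one row per (prompt_id, rerun); columns: prompt_id, score
raw = pd.read_csv("raw_scores_math_model_a.csv")

# Average the score of each example over its reruns
per_example = raw.groupby("prompt_id", as_index=False)["score"].mean()

# One CSV per task+model, containing exactly the prompt_id and score columns
per_example.to_csv("math_model_a.csv", index=False)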
Next, run the script with the following arguments:
python continuous_aggregation_script.py model_paths.json ./output_dir/
Inputs:
model_paths.json: JSON file mapping model names to their task CSV files
output_dir: Directory for output files
CSV Format: Each CSV should contain prompt_id and score columns.
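The exact JSON schema is defined by continuous_aggregation_script.py; as a purely hypothetical illustration of a model-to-task-to-CSV mapping, such a file could be generated like this:

import json

# Hypothetical structure: model name -> task name -> per-sample CSV path.
# Check continuous_aggregation_script.py for the schema it actually expects.
model_paths = {
    "baseline_model": {"math": "csvs/math_baseline.csv", "gsm8k": "csvs/gsm8k_baseline.csv"},
    "optimized_model": {"math": "csvs/math_optimized.csv", "gsm8k": "csvs/gsm8k_optimized.csv"},
}

with open("model_paths.json", "w") as f:
    json.dump(model_paths, f, indent=2)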
Outputs:
summary.csv: Pooled and per-task accuracies with p-values from permutation tests
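To make the statistical idea concrete, here is a minimal, self-contained sketch of a paired sign-flip permutation test on per-example score differences; it illustrates the principle only and is not the exact implementation used in continuous_aggregation_script.py:

import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for H0: models A and B have the same mean per-example score."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Under H0 the sign of each paired difference is exchangeable, so flip signs at random
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # +1 in numerator and denominator keeps the p-value strictly positive
    return (1 + np.sum(permuted >= observed)) / (n_permutations + 1)

# Example: averaged per-example scores of two models on the same prompts
print(paired_permutation_pvalue([1, 0, 1, 1, 0.5], [1, 0, 0, 1, 0.25]))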
For binary scores, we also provide an aggregation script that performs a statistical analysis on top of LM-Evaluation Harness runs:
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
When running lm-eval, it is crucial to store the per-example results using the --log_samples flag (see the example scripts).
Inputs:
task_metrics.json: Defines which metrics to extract for each task
checkpoint_list.json: Lists models and their result paths
output_dir: Directory for output files
Outputs:
model_comparison.csv: Detailed metrics for all tasks and models
summary_with_stderr.csv: Category-level metrics with standard errors
summary_without_stderr.csv: Category-level metrics without standard errors
model_differences.csv: Contingency table values and disagreement ratios
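For intuition on how the contingency-table values lead to a test decision, here is a minimal sketch of an exact McNemar test on paired binary scores (a standard binomial formulation for illustration, not necessarily the exact variant implemented in aggregation_script.py):

from scipy.stats import binomtest

def mcnemar_exact_pvalue(correct_a, correct_b):
    """Exact McNemar test on paired binary outcomes of two models."""
    # Only discordant pairs matter: examples where exactly one model is correct
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a == 1 and b == 0)
    if n01 + n10 == 0:
        return 1.0  # no disagreements, nothing to test
    # Under H0 (equal accuracy) each discordant pair favors either model with probability 1/2
    return binomtest(n10, n01 + n10, p=0.5, alternative="two-sided").pvalue

# Example: per-example correctness (1 = correct) of two checkpoints on the same prompts
print(mcnemar_exact_pvalue([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 1, 0]))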
Update paths in:
checkpoint_list.json: Replace /path/to/your/results with your actual results directory
llm_experiments/*.sh: Replace /path/to/your/hf_cache, your_huggingface_token_here, and output paths with your actual paths/token
Scripts are in the llm_experiments/ directory. Example:
cd llm_experiments/
bash llama_3_paper.sh
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
Scripts are in the plots_and_synthetic/ directory:
cd plots_and_synthetic/
python test_power_plot.py # Figure: Test power analysis
python pvalue_heatmap_comparison.py # Figure: P-value heatmaps
python test_power_vs_tasks.py # Figure: Power vs number of tasks
python intro_figure.py # Figure: Introductory example
Scripts are in the dataset_selection/ directory for seed/temperature analysis:
cd dataset_selection/
# Run evaluations with different seeds
bash mmlu_dataset_ablation_temp0.3.sh
bash mmlu_dataset_ablation_true_models.sh
# Analyze flip patterns across seeds
python seed_flip_analysis.py task_metrics.json results_dir1/ results_dir2/ [results_dir3/ ...]
Output: success_counts.json and success_histogram.pdf - an analysis of which documents consistently succeed or fail across different model runs
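As a rough illustration of what such a flip analysis does (this is not the logic of seed_flip_analysis.py, and the JSON layout is hypothetical), one can count per-document successes across runs and plot their distribution:

import json
import matplotlib.pyplot as plt

# Hypothetical input: one list of per-document binary scores per run, same document order
runs = [
    [1, 1, 0, 1, 0],  # seed/temperature setting 1
    [1, 0, 0, 1, 1],  # seed/temperature setting 2
    [1, 1, 0, 1, 0],  # seed/temperature setting 3
]

# Number of runs in which each document was solved
success_counts = [sum(scores) for scores in zip(*runs)]
with open("success_counts.json", "w") as f:
    json.dump(success_counts, f)

# Documents solved in every run or in none are stable; everything in between "flips"
plt.hist(success_counts, bins=range(len(runs) + 2), align="left", rwidth=0.8)
plt.xlabel("number of runs in which the document was solved")
plt.ylabel("number of documents")
plt.savefig("success_histogram.pdf")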
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file for details.
If you find our work useful or use our tests, you can cite our paper:
@inproceedings{
anonymous2026when,
title={When {LLM}s get significantly worse: A statistical approach to detect model degradations},
author={Jonas Kübler and Kailash Budhathoki and Matthäus Kleindessner and Xiong Zhou and Junming Yin and Ashish Khetan and George Karypis},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cM3gsqEI4K}
}