
amazon-science/LLM-Accuracy-Stats

When LLMs get significantly worse: A statistical approach to detect model degradations

ICLR 2026 · License: CC BY-NC 4.0 · Python 3.12 · arXiv

This repository contains the code for reproducing experiments from our ICLR 2026 paper on statistical detection of LLM model degradations using McNemar's test. We provide tools to detect whether accuracy changes in optimized models are due to actual degradation or evaluation noise.

Installation

We recommend using uv: https://docs.astral.sh/uv/getting-started/installation/

uv venv ~/venv_accuracy_paper --python 3.12 --seed
source ~/venv_accuracy_paper/bin/activate
pip install vllm==0.10.0 lm-eval[math,ifeval,sentencepiece]==0.4.8

General Usage

We generally recommend using our permutation-based tests (see Appendix D). For binary data they give results equivalent to the direct McNemar test, but they generalize to non-binary scores and to multiple reruns.
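
The scripts in this repository implement these tests; for intuition only, here is a minimal sketch of a paired sign-flip permutation test on per-example score differences (the function and variable names are illustrative, not part of the repo's API):

import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores."""
    rng = np.random.default_rng(seed)
    # Align the arrays so that index i refers to the same prompt for both models.
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())

    # Under the null hypothesis of no accuracy difference, the sign of each
    # paired difference is exchangeable, so we randomly flip signs and recompute.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null_stats = np.abs((signs * diffs).mean(axis=1))

    # The +1 in numerator and denominator keeps the p-value away from exactly 0.
    return (1 + np.sum(null_stats >= observed)) / (n_permutations + 1)

For binary 0/1 scores, examples on which both models agree contribute a zero difference and drop out of the statistic, which is the same intuition behind McNemar's test.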

If you run the same samples through the model multiple times (e.g., with non-zero temperature), first average the scores for each example!

Next, organize the per-sample scores in CSV files, one file per task and model.

Then run the script with the following arguments:

python continuous_aggregation_script.py model_paths.json ./output_dir/

Inputs:

  • model_paths.json: JSON file mapping model names to their task CSV files
  • output_dir: Directory for output files

CSV Format: Each CSV should contain prompt_id and score columns.
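
As an illustrative example of producing such a file (the DataFrame contents and output filename below are hypothetical), average over reruns first and then write the two required columns:

import pandas as pd

# Hypothetical raw results: one row per (prompt, rerun) with a 0/1 or continuous score.
raw = pd.DataFrame({
    "prompt_id": [0, 0, 1, 1, 2, 2],
    "run":       [1, 2, 1, 2, 1, 2],
    "score":     [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

# Average over reruns per example, as recommended above, then keep prompt_id and score.
per_example = raw.groupby("prompt_id", as_index=False)["score"].mean()
per_example.to_csv("my_model__my_task.csv", index=False)

The resulting file paths are what model_paths.json should point to.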

Outputs:

  • summary.csv: Pooled and per-task accuracies with p-values from permutation tests

Discrete Score Evaluation (LM-Eval Harness)

For binary scores, we also provide an aggregation script that performs the statistical analysis on top of LM-Evaluation Harness runs:

python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/

When running lm-eval, it is crucial to store the results for each example using the --log_samples flag (see the example scripts).
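
For example, an lm-eval invocation might look like the following (the model and task are placeholders; see the scripts in llm_experiments/ for the exact settings used in the paper):

lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks gsm8k \
    --log_samples \
    --output_path ./results/llama_3_1_8b_instruct/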

Inputs:

  • task_metrics.json: Defines which metrics to extract for each task
  • checkpoint_list.json: Lists models and their result paths
  • output_dir: Directory for output files

Outputs:

  • model_comparison.csv: Detailed metrics for all tasks and models
  • summary_with_stderr.csv: Category-level metrics with standard errors
  • summary_without_stderr.csv: Category-level metrics without standard errors
  • model_differences.csv: Contingency table values and disagreement ratios
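
For intuition on the contingency table values: the exact McNemar test only uses the discordant pairs, i.e., examples where exactly one of the two models is correct. A minimal sketch of the idea (not the repo's implementation; it assumes scipy is available and uses illustrative names):

import numpy as np
from scipy.stats import binomtest

def mcnemar_exact_pvalue(correct_a, correct_b):
    """Exact McNemar test on paired binary correctness vectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(~a & b))  # model B correct, model A wrong
    n10 = int(np.sum(a & ~b))  # model A correct, model B wrong
    if n01 + n10 == 0:
        return 1.0             # no disagreements: no evidence of a difference
    # Under the null, each discordant example is equally likely to favor either model.
    return binomtest(min(n01, n10), n=n01 + n10, p=0.5).pvalue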

Running Paper Experiments

LLM Evaluation

Update paths in:

  • checkpoint_list.json: Replace /path/to/your/results with your actual results directory
  • llm_experiments/*.sh: Replace /path/to/your/hf_cache, your_huggingface_token_here, and the output paths with your actual paths/token (a shell sketch for the bulk replacement follows this list)
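
An illustrative way to do the bulk replacement from the shell (the right-hand-side values are placeholders for your own directories and token; GNU sed syntax):

# Substitute your own results directory, HF cache path, and HF token on the right.
sed -i 's|/path/to/your/results|/home/me/results|g' checkpoint_list.json
sed -i 's|/path/to/your/hf_cache|/home/me/hf_cache|g' llm_experiments/*.sh
sed -i 's|your_huggingface_token_here|hf_your_token|g' llm_experiments/*.sh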

Scripts in llm_experiments/ directory. Example:

cd llm_experiments/
bash llama_3_paper.sh

Statistical Analysis

python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/

Generate Figures for Synthetic Experiments and General Insights

Scripts in plots_and_synthetic/ directory:

cd plots_and_synthetic/
python test_power_plot.py              # Figure: Test power analysis
python pvalue_heatmap_comparison.py    # Figure: P-value heatmaps  
python test_power_vs_tasks.py          # Figure: Power vs number of tasks
python intro_figure.py                 # Figure: Introductory example

Dataset Selection Analysis

Scripts in dataset_selection/ directory for seed/temperature analysis:

cd dataset_selection/
# Run evaluations with different seeds
bash mmlu_dataset_ablation_temp0.3.sh
bash mmlu_dataset_ablation_true_models.sh

# Analyze flip patterns across seeds
python seed_flip_analysis.py task_metrics.json results_dir1/ results_dir2/ [results_dir3/ ...]

Outputs: success_counts.json and success_histogram.pdf, an analysis of which documents consistently succeed or fail across different model runs.
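
To illustrate the idea behind the flip analysis (this is not seed_flip_analysis.py; the per-seed CSV files and their layout are hypothetical):

import json
from collections import Counter

import pandas as pd

# Hypothetical per-seed result files with prompt_id and a binary score column,
# one file per rerun of the same model and task.
runs = [pd.read_csv(f"seed_{s}/my_task.csv") for s in (0, 1, 2)]

# Count, for every document, in how many runs it was answered correctly.
success_counts = Counter()
for df in runs:
    for pid, score in zip(df["prompt_id"], df["score"]):
        success_counts[pid] += int(score > 0.5)

# Documents with count 0 or len(runs) are stable; intermediate counts flip across seeds.
histogram = Counter(success_counts.values())
print(json.dumps({str(k): v for k, v in sorted(histogram.items())}, indent=2))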

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file for details.

Citation

If you find our work useful or use our tests, please cite our paper:

@inproceedings{
  anonymous2026when,
  title={When {LLM}s get significantly worse: A statistical approach to detect model degradations},
  author={Jonas Kübler and Kailash Budhathoki and Matthäus Kleindessner and Xiong Zhou and Junming Yin and Ashish Khetan and George Karypis},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=cM3gsqEI4K}
}
