This is the official replication package for our paper:

AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests

AssertFlip is a system for automatically generating bug-reproducing tests from natural-language bug reports.
- Python 3.10+
- Docker
- conda (used inside Docker containers)
Install dependencies:

```bash
pip install -e .
```
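As a quick sanity check after installation, something like the following confirms the package is importable (the top-level package name `assertflip` is an assumption here; check the repo's packaging metadata for the actual name):

```python
# Hypothetical smoke test after `pip install -e .`.
# The package name "assertflip" is assumed and may differ in the actual repo.
import importlib.util

print("assertflip importable:", importlib.util.find_spec("assertflip") is not None)
```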
The file scripts/.env is already created; open it and fill in your own credentials:

```
AZURE_API_KEY=your_azure_api_key
AZURE_API_BASE=https://your_azure_endpoint
AZURE_API_VERSION=2024-05-01-preview
```
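Before launching a run, a minimal check along these lines can confirm the credentials are visible to Python (this sketch assumes the file is loaded with python-dotenv; the repo's scripts may load it differently):

```python
# Minimal sketch: verify the Azure credentials in scripts/.env are set.
# Assumes python-dotenv; the repo's scripts may load the file differently.
import os
from dotenv import load_dotenv

load_dotenv("scripts/.env")
for key in ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```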
Default (used in the paper):

```bash
python scripts/run_parallel.py
```

This uses:
- Agentless localization
- Pass-invert strategy
- 10 regeneration attempts
- 10 refinement attempts
- LLM validation enabled
- Planner enabled

Configuration is controlled in scripts/config.py.
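For orientation, a hypothetical excerpt of scripts/config.py mirroring these defaults is sketched below. Only DATASET_PATH and max_regeneration_retries are names confirmed elsewhere in this README; the remaining fields are illustrative placeholders:

```python
# Illustrative sketch of scripts/config.py, not the actual file.
# DATASET_PATH and max_regeneration_retries appear elsewhere in this README;
# the other names are hypothetical placeholders for the paper defaults.
DATASET_PATH = "datasets/SWT_Verified_Agentless_Test_Source_Skeleton.json"
max_regeneration_retries = 10   # regeneration attempts (paper default)
max_refinement_retries = 10     # hypothetical: refinement attempts
llm_validation_enabled = True   # hypothetical: LLM validation toggle
planner_enabled = True          # hypothetical: planner toggle
```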
All datasets are in the datasets folder. These are the exact files used in our experiments:
- SWT_Verified_Agentless_Test_Source_Skeleton.json (default for Verified)
- SWT_Verified_Test_Source_Skeleton.json (perfect localization dataset)
- SWT_Lite_Agentless_Test_Source_Skeleton.json (default for Lite)
- SWT_Lite_Agentless_Unique_Only.json (default for the 188 unique Lite instances)
To switch datasets, change DATASET_PATH in scripts/config.py.
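To take a quick look at a dataset file before switching, something like the following works (no particular schema is assumed here; the printed structure depends on the actual file):

```python
# Quick inspection of a dataset file; the top-level structure (list vs. dict)
# and per-instance fields depend on the actual file contents.
import json

with open("datasets/SWT_Verified_Agentless_Test_Source_Skeleton.json") as f:
    data = json.load(f)

print(type(data).__name__, "with", len(data), "entries")
```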
Regeneration Ablation (0 or 5 attempts)

Edit this line in scripts/config.py:

```python
max_regeneration_retries = 1  # for no regenerations
# or
max_regeneration_retries = 5  # for the 5-regeneration ablation
```
Then run:

```bash
python scripts/run_parallel.py
```

For the LLM-validation and planner ablations, run the corresponding scripts instead of the default one:

```bash
python scripts/run_parallel_without_validation_ablation.py
python scripts/run_parallel_without_planner_ablation.py
```
For the perfect-localization ablation, change the dataset in scripts/config.py to:

```python
DATASET_PATH = "datasets/SWT_Verified_Test_Source_Skeleton.json"
```

Then run the default script again:

```bash
python scripts/run_parallel.py
```
To generate preds.json from results:

```bash
python scripts/generate_preds_phases.py --results-dir results/
```
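To sanity-check the generated file, you can inspect its shape. SWT-Bench predictions generally follow the SWE-bench record schema (instance_id, model_name_or_path, model_patch), but verify the exact keys against the SWT-Bench documentation:

```python
# Sanity-check preds.json. The expected per-record keys follow the SWE-bench
# convention; confirm them against the SWT-Bench documentation.
import json

with open("preds.json") as f:
    preds = json.load(f)

records = list(preds.values()) if isinstance(preds, dict) else preds
print(len(records), "predictions")
print(sorted(records[0]))  # e.g. instance_id, model_name_or_path, model_patch
```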
We also include our original prediction files in the preds_files folder for direct use.
The previous step produces predictions in SWT-Bench format. You can then evaluate them by following the SWT-Bench instructions: https://github.com/logic-star-ai/swt-bench
We also provide:
- Full prediction outputs in preds_files/
- Full SWT-Bench evaluation results for each reported run in evaluation_results_on_SWT_Bench/
If you use this codebase, datasets, or experiments in your research, please cite our paper:
```bibtex
@article{khatib2025assertflip,
  title={AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests},
  author={Khatib, Lara and Mathews, Noble Saji and Nagappan, Meiyappan},
  journal={arXiv preprint arXiv:2507.17542},
  year={2025}
}
```
This project uses components from the open-source test generator CoverUp, licensed under the Apache 2.0 License.