A small benchmark suite for login-checking data structures (linear search, binary search, hash set, Bloom filter, Cuckoo filter).
This repository includes scripts to generate synthetic username datasets and to benchmark lookup time and memory/space usage across several structures.
- Python 3.8+ (this project was developed with Poetry)
- Poetry (preferred), or use your system Python and pip to install the dependencies in pyproject.toml.
Install dependencies with Poetry:

    poetry install

Or install with pip into a virtual environment (PowerShell; the export step still uses Poetry to produce requirements.txt):

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    poetry export -f requirements.txt --without-hashes -o requirements.txt
    pip install -r requirements.txt

Datasets are created by data/generate_dataset.py and saved under data/datasets by default.
Usage example (creates 100 usernames):

    python data/generate_dataset.py --n 100 --out data/datasets --seed 42

Using Poetry (recommended):

    poetry run python data/generate_dataset.py --n 100 --out data/datasets --seed 42

You can generate larger datasets by changing --n. Datasets have not been pushed for size reasons.
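For readers who want a feel for what a seeded generator might look like, here is a minimal sketch. It is not the code in data/generate_dataset.py: the function names `generate_usernames` and `write_dataset`, the username format (6 to 12 lowercase letters and digits), and the `logins_n<count>.txt` file naming are all assumptions made for illustration.

```python
import random
import string
from pathlib import Path
from typing import List


def generate_usernames(n: int, seed: int = 42) -> List[str]:
    """Return n distinct pseudo-random usernames (hypothetical format)."""
    rng = random.Random(seed)  # private RNG: the seed fully determines the output
    alphabet = string.ascii_lowercase + string.digits
    seen = set()
    ordered = []
    while len(ordered) < n:
        length = rng.randint(6, 12)
        name = "".join(rng.choice(alphabet) for _ in range(length))
        if name not in seen:  # skip the rare duplicate so all names are unique
            seen.add(name)
            ordered.append(name)
    return ordered


def write_dataset(names: List[str], out_dir: str) -> Path:
    """Write one username per line; the file name pattern is an assumption."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"logins_n{len(names)}.txt"
    path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return path
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the output reproducible even if other code also draws random numbers.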
Benchmarks are executed by the bench.run_bench module. The following example runs lookups and measures space for several structures and dataset sizes. This is the exact command used for large-scale experiments:

    poetry run python -m bench.run_bench --dataset data/datasets/logins_n10000000.txt --structure linear,binary,hash,bloom,cuckoo --n 100,1000,10000,100000,1000000,10000000 --runs 3 --out results/compare_various_n_lookup_space_10e7.json --seed 42 --measures lookup,space

Flags explained:
- --dataset: path to a newline-separated file with usernames (one per line).
- --structure: comma-separated list of structures to benchmark. Supported: linear, binary, hash, bloom, cuckoo.
- --n: comma-separated list of numbers of items to use from the dataset for each experiment.
- --runs: how many repetitions to average per experiment.
- --out: output JSON file where results will be written.
- --seed: random seed for reproducibility.
- --measures: comma-separated list of measures to collect (e.g., lookup,space).
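The flags above could be parsed roughly as follows with argparse. This is a sketch, not the actual bench.run_bench source: the helper names (`csv_list`, `csv_ints`, `build_parser`, `parse_args`), which flags are required, and the default values are assumptions.

```python
import argparse
from typing import List, Optional

SUPPORTED = ("linear", "binary", "hash", "bloom", "cuckoo")


def csv_list(value: str) -> List[str]:
    """Split 'a,b,c' into ['a', 'b', 'c'], ignoring empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def csv_ints(value: str) -> List[int]:
    """Split '100,1000' into [100, 1000]; argparse reports bad integers."""
    return [int(item) for item in csv_list(value)]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark lookup structures")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--structure", type=csv_list, default=list(SUPPORTED))
    parser.add_argument("--n", type=csv_ints, required=True)
    parser.add_argument("--runs", type=int, default=3)  # assumed default
    parser.add_argument("--out", required=True)
    parser.add_argument("--seed", type=int, default=42)  # assumed default
    parser.add_argument("--measures", type=csv_list, default=["lookup", "space"])
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject structure names outside the supported list early.
    unknown = [name for name in args.structure if name not in SUPPORTED]
    if unknown:
        parser.error(f"unsupported structure(s): {', '.join(unknown)}")
    return args
```

Passing a custom `type` callable lets argparse turn each comma-separated string into a list at parse time, so the rest of the program never handles the raw string.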
Results are written to the specified JSON file and plots can be generated from these results using the visualization helpers in visualization/plot.py.
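To illustrate the kind of measurement that ends up in such a results file, the sketch below times membership lookups for the three exact structures (linear, binary, hash) and prints the averages as JSON. It is not the project's benchmark code: the function names, the half-hits/half-misses query mix, and the flat `{structure: seconds}` output shape are assumptions, and the Bloom and Cuckoo filters are left out to keep it short.

```python
import bisect
import json
import random
import time
from typing import Callable, Dict, List


def make_lookup(structure: str, items: List[str]) -> Callable[[str], bool]:
    """Build a membership test for one of three exact structures."""
    if structure == "linear":
        as_list = list(items)
        return lambda key: key in as_list  # O(n) scan of a list

    if structure == "binary":
        as_sorted = sorted(items)

        def binary_lookup(key: str) -> bool:
            i = bisect.bisect_left(as_sorted, key)  # O(log n) on sorted data
            return i < len(as_sorted) and as_sorted[i] == key

        return binary_lookup

    if structure == "hash":
        as_set = set(items)
        return lambda key: key in as_set  # O(1) expected time

    raise ValueError(f"unsupported structure: {structure}")


def mean_lookup_seconds(lookup: Callable[[str], bool],
                        queries: List[str], runs: int) -> float:
    """Average seconds per lookup over several repetitions."""
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for key in queries:
            lookup(key)
        per_run.append((time.perf_counter() - start) / len(queries))
    return sum(per_run) / len(per_run)


def run_experiment(items: List[str], structures: List[str], runs: int = 3,
                   seed: int = 42, n_queries: int = 200) -> Dict[str, float]:
    rng = random.Random(seed)
    # Half the queries are known members; the rest are assumed absent from items.
    hits = [rng.choice(items) for _ in range(n_queries // 2)]
    misses = [f"missing-{i}" for i in range(n_queries - len(hits))]
    queries = hits + misses
    rng.shuffle(queries)
    return {name: mean_lookup_seconds(make_lookup(name, items), queries, runs)
            for name in structures}


if __name__ == "__main__":
    usernames = [f"user{i}" for i in range(5000)]
    results = run_experiment(usernames, ["linear", "binary", "hash"])
    print(json.dumps(results, indent=2))
```

Timing the whole query batch and dividing by its length, rather than timing each lookup individually, keeps the overhead of `time.perf_counter` from dominating the fast structures.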
Run tests with Poetry:

    poetry run pytest -q

- If you need only a subset of records from a large dataset, use the --n option to limit the number of items used per experiment.
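One way to read only part of a large dataset file is to stop after the first n lines instead of loading everything. The sketch below assumes that "a subset" means the first n usernames; whether bench.run_bench takes the first n lines or a random sample is not stated here, and `load_first_n` is a hypothetical helper name.

```python
from itertools import islice
from typing import List


def load_first_n(path: str, n: int) -> List[str]:
    """Read at most n usernames from a newline-separated file."""
    with open(path, encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

With a 10-million-line file, this keeps memory proportional to n rather than to the size of the whole dataset.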