iamshahd/Login-Checker
Login-Checker-Assignment

A small benchmark suite for login-checking data structures (linear search, binary search, hash set, Bloom filter, Cuckoo filter).

This repository includes scripts to generate synthetic username datasets and to benchmark lookup time and memory/space usage across several structures.
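For context, an AMQ structure such as a Bloom filter answers membership queries with no false negatives but a tunable false-positive rate, trading accuracy for space. A minimal sketch of the idea (not the repository's implementation; the bit-array size and double-hashing scheme are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions derived per item over a fixed bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _indexes(self, item):
        # Derive k positions from one SHA-256 digest (double-hashing trick).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
```

A lookup for a stored username always returns True; a lookup for an absent one returns False with high probability while the filter is sparsely filled.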

Requirements

  • Python 3.8+
  • Poetry (recommended), or your system Python and pip to install the dependencies listed in pyproject.toml.

Install dependencies with Poetry:

poetry install

Or install with pip into a virtual environment:

python -m venv .venv
.\.venv\Scripts\Activate.ps1   # on Linux/macOS: source .venv/bin/activate
poetry export -f requirements.txt --without-hashes -o requirements.txt
pip install -r requirements.txt

Generate datasets

Datasets are created by data/generate_dataset.py and saved under data/datasets by default.

Usage example (creates 100 usernames):

python data/generate_dataset.py --n 100 --out data/datasets --seed 42

Using Poetry (recommended):

poetry run python data/generate_dataset.py --n 100 --out data/datasets --seed 42

You can generate larger datasets by increasing --n. Generated datasets are not committed to the repository because of their size.
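The actual generator lives in data/generate_dataset.py; the following is only a plausible sketch of what such a script does, with hypothetical helper names and an illustrative username scheme:

```python
import random
import string
from pathlib import Path

def generate_usernames(n, seed=42):
    """Generate n distinct pseudo-random usernames (illustrative 8-char scheme)."""
    rng = random.Random(seed)
    usernames = set()
    while len(usernames) < n:
        name = "".join(rng.choices(string.ascii_lowercase + string.digits, k=8))
        usernames.add(name)
    return sorted(usernames)

def write_dataset(n, out_dir, seed=42):
    """Write one username per line to <out_dir>/logins_n<n>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"logins_n{n}.txt"
    path.write_text("\n".join(generate_usernames(n, seed)) + "\n")
    return path
```

Seeding a dedicated random.Random instance makes the dataset reproducible for a given --seed, which is what allows repeated benchmark runs to use identical inputs.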

Run benchmarks

Benchmarks are executed by the bench.run_bench module. The following example runs lookups and measures space for several structures and dataset sizes. This is the exact command used for large-scale experiments:

poetry run python -m bench.run_bench --dataset data/datasets/logins_n10000000.txt --structure linear,binary,hash,bloom,cuckoo --n 100,1000,10000,100000,1000000,10000000 --runs 3 --out results/compare_various_n_lookup_space_10e7.json --seed 42 --measures lookup,space

Flags explained:

  • --dataset: path to a newline-separated file with usernames (one per line).
  • --structure: comma-separated list of structures to benchmark. Supported: linear, binary, hash, bloom, cuckoo.
  • --n: comma-separated list of numbers of items to use from the dataset for each experiment.
  • --runs: how many repetitions to average per experiment.
  • --out: output JSON file where results will be written.
  • --seed: random seed for reproducibility.
  • --measures: comma-separated list of measures to collect (e.g., lookup, space).
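
Conceptually, a lookup benchmark times repeated membership queries against each structure and averages over the requested number of runs. A simplified sketch of that loop (not the internals of bench.run_bench), using a plain hash set as the structure:

```python
import random
import time

def time_lookups(structure, queries, runs=3):
    """Return (average wall-clock seconds, hit count) for membership queries."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        hits = sum(1 for q in queries if q in structure)
        timings.append(time.perf_counter() - start)
    return sum(timings) / runs, hits

rng = random.Random(42)
dataset = [f"user{i}" for i in range(10_000)]
present = rng.sample(dataset, 50)          # queries that should hit
absent = [f"missing{i}" for i in range(50)]  # queries that should miss

hash_set = set(dataset)
avg_seconds, hits = time_lookups(hash_set, present + absent)
```

Mixing present and absent queries matters for the AMQ structures, since misses are where Bloom and Cuckoo filters can return false positives.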

Results are written to the specified JSON file and plots can be generated from these results using the visualization helpers in visualization/plot.py.
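To feed the plotting helpers, the result file has to be parsed back into per-structure series. A small sketch of that step, assuming a hypothetical JSON layout (one record per structure/size experiment; the real schema may differ):

```python
import json

# Hypothetical --out file layout (an assumption, not the repository's schema).
raw = json.dumps({
    "results": [
        {"structure": "hash", "n": 1000, "lookup_s": 1.2e-4},
        {"structure": "bloom", "n": 1000, "lookup_s": 2.5e-4},
    ]
})

def lookup_series(results, structure):
    """Extract (n, lookup time) pairs for one structure, sorted by n."""
    return sorted((r["n"], r["lookup_s"])
                  for r in results["results"] if r["structure"] == structure)

results = json.loads(raw)
series = lookup_series(results, "hash")
```

Each series is then a ready-made x/y sequence for a dataset-size-versus-lookup-time plot.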

Tests

Run tests with Poetry:

poetry run pytest -q

Notes

  • If you need only a subset of records from a large dataset, use the --n option to limit the number of items used per experiment.
