This repository contains the code for a series of attempted enhancements to the original Neural Partitioner from the paper *Unsupervised Space Partitioning for Nearest Neighbor Search*, authored by Abrar Fahim, Mohammed Eunus Ali, and Muhammad Aamir Cheema.
For more detailed documentation and insights into how the various algorithms and models have been implemented, please refer to the implementation report.
Also included is an incomplete implementation of the Hierarchical Navigable Small World (HNSW) algorithm, originally intended as a potential enhancement for the ANN algorithm. The class developed as part of this attempt is included in the repository for reference and future development; you can find it in the attempted-improvements folder.
Begin by taking a look at `main.py`, which serves as the entry point of the code. It relies on the `paths.txt` file to determine the locations of the datasets needed for training. Before proceeding, ensure that all the required dependencies mentioned in `requirements.txt` are installed.
You could, of course, skip all of that by running a ready-to-go setup on Kaggle. More on that in a bit.
Ensure that `paths.txt` specifies the absolute directory paths necessary for the code to function (a hypothetical example of the file is sketched after this list). It should include:

- `paths_to_mnist`: Directory containing the MNIST dataset in `hdf5` format.
- `path_to_sift`: Similar to `paths_to_mnist`, but for the SIFT dataset.
- `path_to_knn_matrix`: Directory where the generated k-NN matrix will be stored.
- `path_to_models`: Directory for saving trained models.
- `path_to_tensors`: Directory for storing processed tensors for faster re-runs.
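The exact format of `paths.txt` is not spelled out here, so the snippet below is only a hypothetical sketch, assuming one entry per line with placeholder directories; adapt it to whatever `main.py` actually parses.

```text
# hypothetical paths.txt -- all directories are placeholders
paths_to_mnist     = /data/ann/mnist
path_to_sift       = /data/ann/sift
path_to_knn_matrix = /data/ann/knn_matrix
path_to_models     = /data/ann/models
path_to_tensors    = /data/ann/tensors
```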
- Python 3.5+
- Compatible with Windows, Linux, macOS
- (Optional, but recommended) GPU support for faster computation
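If you plan on using a GPU, a quick sanity check is to ask PyTorch whether it can see a CUDA device (this assumes PyTorch is among the dependencies in `requirements.txt`; the one-liner is just a convenience, not part of the project's own tooling):

```bash
# prints True if PyTorch can see a CUDA-capable GPU, False otherwise
python -c "import torch; print(torch.cuda.is_available())"
```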
For starters, clone the repository:
git clone git@github.com:mdarm/neural-partitioner.git
cd ./neural-partitioner/src
Before running the code, install the required dependencies by running the following command:
pip install -r src/requirements.txt
Two versions of the workflow are available: running locally and running on Kaggle.
Before running the code, fill in `paths.txt` with the appropriate directories. Next, download the SIFT and/or MNIST datasets from ANN Benchmarks and place them in the respective folders specified in `paths.txt`.
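For reference, the datasets hosted by ANN Benchmarks are plain `hdf5` files, so a download along the following lines should work; the exact file names (`mnist-784-euclidean.hdf5`, `sift-128-euclidean.hdf5`) are an assumption based on the site's usual naming, and the target directories are placeholders, so verify both before running.

```bash
# fetch the hdf5 datasets (verify the URLs on ann-benchmarks.com first)
# and place them in the directories referenced by paths.txt
wget http://ann-benchmarks.com/mnist-784-euclidean.hdf5 -P /data/ann/mnist
wget http://ann-benchmarks.com/sift-128-euclidean.hdf5 -P /data/ann/sift
```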
To execute the code with the default configuration, simply run:
python main.py
For a custom configuration, here's an example command:
python main.py --n_bins 256 --dataset_name mnist --n_trees 3 --metric_distance mahalanobis --model_combine cnn neural linear
We have automated the process of rerunning all the experiments in an accelerated Kaggle notebook. This allows you to easily replicate and explore the results without the hassle of manual setup; for more details see here.
Default parameter values are set in `utils.py`.
- `dataset_name`: Choose the dataset to partition (`mnist` or `sift`).
- `n_bins`: Define the number of bins for dataset partitioning.
- `k_train`: Set the number of neighbors used to build the k-NN matrix.
- `k_test`: Set the number of neighbors for testing the model.
- `n_bins_to_search`: Choose how many bins to search for nearest neighbors.
- `n_epochs`: Specify the number of epochs for training.
- `batch_size`: Set the batch size.
- `lr`: Define the learning rate.
- `n_trees`: Choose the number of trees for the ensemble.
- `n_levels`: Define the number of levels in each tree.
- `tree_branching`: Set the number of children per node.
- `model_type`: Select the model type (`neural`, `linear`, or `cnn`).
- `eta_value`: Balance parameter for the loss function.
- `distance_metric`: Choose the distance metric (`euclidean` or `mahalanobis`).
- `model_combination`: Create an ensemble by combining models in the order provided.
- `pl`: Run the vector quantisation pipeline (executes after the model ensembling).
- `load_knn`: Load the k-NN matrix from file (if available).
- `continue_train`: Continue training from the last checkpoint by loading models from file.
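As a rough illustration of how several of the options above might be combined, the command below resumes training on SIFT while reusing a previously built k-NN matrix. It assumes `--load_knn` and `--continue_train` are simple switches and borrows the flag spellings from the example earlier in this README, so check `utils.py` for the exact argument names.

```bash
# hypothetical invocation: 3-tree ensemble on SIFT, reusing the saved k-NN matrix
# and resuming from the last model checkpoint
python main.py --dataset_name sift --n_bins 16 --n_trees 3 \
               --load_knn --continue_train
```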
The program outputs the loss and accuracy (RNN-recall) metrics, which are highlighted in the plots. Additionally, a summary of the plots is displayed in the command prompt.
The plots show loss per epoch and accuracy (RNN recall) per bin (partition) size for each neural ensemble, for a 16-bin space partition. The points on each line indicate the two partitions to which the queries most likely belong.
The program also reports test accuracy, mean candidate set size, and the average query time with its standard deviation, for various combinations of the number of models and bins.
Example:
-- READING PATHS --
preparing knn with k = 10
BUILDING TREE
...
----- CALCULATING K-NN RECALL FOR EACH POINT -------
1 models, 1 bins
mean accuracy tensor(0.9272)
mean candidate set size tensor(8918.8584)
average query time: 0.21, standard deviation: 0.03 (miliseconds)
...
All the experiments that were run are stored in the outputs folder, and the files are named using the convention `method-bin_number-dataset`. For example, if the method used was `cnn`, the bin number was `16`, and the dataset was `sift`, the output file would be named `cnn-16-sift.txt`.
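If each of these files simply captures the console output shown above (an assumption), a quick way to compare runs is to grep the summary lines across files, e.g.:

```bash
# hypothetical: compare the mean-accuracy lines of all 16-bin SIFT runs
grep "mean accuracy" outputs/*-16-sift.txt
```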