- Overview
- Motivation, Importance & Scope
- Theoretical Background
- Key Features
- Algorithms & Methodology
- Input & Output Specifications
- Build & Execution
- Configuration & Parameters
- Results & Evaluation
- Code Quality & Best Practices
- Directory Structure
- Installation & Dependencies
- Extensibility & Maintenance
- Authors & License
## Overview

This repository provides a robust, extensible software suite for efficient nearest neighbor search and clustering of high-dimensional vectors and polygonal curves. The project leverages advanced algorithmic techniques (LSH, Hypercube, Fréchet) and is implemented in C/C++ for performance and scalability. It is designed for both academic research and industrial applications that require large-scale, high-dimensional data analysis.
## Motivation, Importance & Scope

High-dimensional vectors and polygonal curves are central to a wide range of modern applications:
- Finance: Portfolio modeling, risk analysis, and market pattern recognition
- Bioinformatics: Protein structure comparison, gene expression analysis
- Geospatial Analytics: Trajectory clustering, route optimization, and movement pattern analysis
- Computer Vision & Graphics: Shape matching, object recognition, and gesture analysis
The analysis of such data is challenging due to the "curse of dimensionality" and the complexity of geometric similarity. Standard search and clustering algorithms become computationally prohibitive as data size and dimensionality increase, and they often fail to capture the true structure of geometric data such as curves.
This project directly addresses these challenges by providing:
- Scalable, high-performance algorithms for approximate nearest neighbor search and clustering in high-dimensional vector and curve spaces
- Advanced similarity measures (e.g., Fréchet distance) that accurately reflect geometric relationships
- A modular, extensible C/C++ platform suitable for both research and industrial deployment, enabling rapid experimentation and robust integration
By enabling efficient and accurate analysis of vectors and curves, this suite empowers data-driven decision-making and innovation in fields where geometric and high-dimensional data are critical.
The algorithms and tools implemented here were used in conjunction with the project kondim23/time-series-dl-analysis to validate the effectiveness of time series compression techniques. By applying nearest neighbor search and clustering to both original and compressed representations, experiments demonstrated that the compressed time series preserved the essential structure of the data, yielding matching results in search and clustering tasks. This confirms the practical utility of these algorithms for evaluating and benchmarking time series compression methods.
## Theoretical Background

### Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing is a technique for efficient approximate nearest neighbor search in high-dimensional spaces. It uses hash functions designed so that similar items are more likely to land in the same hash bucket. This enables sublinear query times for similarity search with metrics such as L2 (Euclidean) for vectors and the Fréchet distance for curves.
### Hypercube Search

The Hypercube algorithm projects high-dimensional data onto the vertices of a binary hypercube using random projections. Search is performed by probing the cube's vertices in a controlled manner, balancing accuracy and efficiency. This method is particularly effective for very high-dimensional vector spaces.
### k-means++ Clustering

k-means++ is an improved initialization method for the k-means clustering algorithm. It selects initial centroids so as to maximize their spread, leading to faster convergence and better clustering quality. The assignment step can be performed exactly (Lloyd's algorithm) or approximately (using LSH or Hypercube), and the update step computes new centroids as mean vectors or mean curves.
### Fréchet Distance

Fréchet distance is a measure of similarity between curves that captures both the location and the ordering of their points. The discrete version considers only the sequence of sampled points, while the continuous version allows more flexible alignment along the curves. It is particularly well suited to geometric and trajectory analysis.
## Key Features

- Efficient Search: LSH and Hypercube for vectors; LSH for Discrete/Continuous Fréchet on curves
- Advanced Clustering: k-means++ initialization, Lloyd's, LSH, and Hypercube-based assignment; mean vector/curve updates
- Strict Input/Output Compliance: Adheres to well-defined formats for reproducibility and integration
- Command-Line Tools: Rich configuration and automation support
- Code Quality: Modular, maintainable, and thoroughly documented
## Algorithms & Methodology

- Locality Sensitive Hashing (LSH): Accelerates approximate nearest neighbor search in high-dimensional spaces using similarity-preserving hash functions. Supports both L2 (vector) and Fréchet (curve) metrics.
- Hypercube Search: Projects data into binary space for rapid search using random projections and controlled probing.
- k-means++ Clustering: Enhanced initialization for k-means, with assignment via Lloyd's, LSH, or Hypercube, and update via mean vector or mean curve.
- Fréchet Distance: Measures similarity between curves in both discrete and continuous forms, enabling advanced geometric analysis.
- Search:
  - LSH (L2) and Hypercube for vectors
  - LSH (Discrete/Continuous Fréchet) for curves
- Clustering:
  - Initialization: k-means++
  - Assignment: Lloyd's (Classic), LSH (Vector/Fréchet), Hypercube
  - Update: mean as vector or curve (with assignment compatibility)
## Input & Output Specifications

### Input Format

Tab-separated, one vector or curve per line:

    item_id1 X11 X12 ... X1d
    item_id2 X21 X22 ... X2d
    ...
    item_idN XN1 XN2 ... XNd
### Search Output Format

    Query: <item_id>
    Algorithm: LSH_Vector | Hypercube | LSH_Frechet_Continuous | LSH_Frechet_Discrete
    Approximate Nearest Neighbor: <item_id>
    True Nearest Neighbor: <item_id>
    distanceApproximate: <double>
    distanceTrue: <double>
    tApproximateAverage: <double>
    tTrueAverage: <double>
    MAF: <double>
### Clustering Output Format

    Algorithm: <Assignment Method>
    CLUSTER-1 {size: <int>, centroid: [coordinates or points]}
    ...
    CLUSTER-k {size: <int>, centroid: [coordinates or points]}
    clustering_time: <double>
    Silhouette: [s1, ..., sk, stotal]   # if -silhouette used
    CLUSTER-1 {centroid, item_idA, ...} # if -complete used
## Build & Execution

- Compile with `make` (produces executables in `bin/`):

      make

- Run according to the assignment specifications:

      ./bin/search -i <input file> -q <query file> -k <int> -L <int> -M <int> -probes <int> -o <output file> -algorithm <LSH | Hypercube | Frechet> -metric <discrete | continuous> -delta <double>
      ./bin/cluster -i <input file> -c <config file> -o <output file> -update <Frechet | Vector> -assignment <Classic | LSH | Hypercube | LSH_Frechet> -complete -silhouette
      ./bin/test

- Note: The algorithms are randomized; multiple runs may be needed to obtain the expected results.
## Configuration & Parameters

Cluster configuration file:

    number_of_clusters: <int>
    number_of_vector_hash_tables: <int>     # Default: 3
    number_of_vector_hash_functions: <int>  # Default: 4
    max_number_M_hypercube: <int>           # Default: 10
    number_of_hypercube_dimensions: <int>   # Default: 3
    number_of_probes: <int>                 # Default: 2

## Results & Evaluation

- Clustering: Silhouette score, cluster sizes, centroids, and assignment details
- Search: Approximation ratios, timing, and neighbor lists
- Reproducibility: Strict adherence to input/output formats for reliable benchmarking
## Directory Structure

    project_1_NN_Clustering/
    ├── bin/       # Compiled executables (search, cluster, test)
    ├── obj/       # Compiled object files
    ├── include/   # Header files
    ├── src/       # Source code
    ├── test/      # Unit test source and headers
    ├── data/      # Datasets
    ├── README.md  # Project documentation
    └── ...
## Installation & Dependencies

- GCC or Clang (C++11 or newer)
- Make
## Extensibility & Maintenance

- Modular codebase: add new distance metrics, clustering methods, or algorithmic modules with minimal changes
- Well-documented interfaces and configuration files
- Designed for reproducibility, benchmarking, and future research or production deployment