- Overview
- Motivation, Importance & Scope
- Theoretical Background
- Key Features
- Algorithms & Methodology
- Input & Output Specifications
- Build & Execution
- Configuration & Parameters
- Results & Evaluation
- Code Quality & Best Practices
- Directory Structure
- Installation & Dependencies
- Extensibility & Maintenance
- Authors & License
## Overview

This repository provides a robust, extensible software suite for efficient nearest neighbor search and clustering of high-dimensional vectors and polygonal curves. The project leverages advanced algorithmic techniques (LSH, Hypercube, Fréchet) and is implemented in C/C++ for performance and scalability. It is designed for both academic research and industrial applications that require large-scale, high-dimensional data analysis.
## Motivation, Importance & Scope

High-dimensional vectors and polygonal curves are central to a wide range of modern applications:
- Finance: Portfolio modeling, risk analysis, and market pattern recognition
- Bioinformatics: Protein structure comparison, gene expression analysis
- Geospatial Analytics: Trajectory clustering, route optimization, and movement pattern analysis
- Computer Vision & Graphics: Shape matching, object recognition, and gesture analysis
The analysis of such data is challenging due to the "curse of dimensionality" and the complexity of geometric similarity. Standard search and clustering algorithms become computationally prohibitive as data size and dimensionality increase, and they often fail to capture the true structure of geometric data such as curves.
This project directly addresses these challenges by providing:
- Scalable, high-performance algorithms for approximate nearest neighbor search and clustering in high-dimensional vector and curve spaces
- Advanced similarity measures (e.g., Fréchet distance) that accurately reflect geometric relationships
- A modular, extensible C/C++ platform suitable for both research and industrial deployment, enabling rapid experimentation and robust integration
By enabling efficient and accurate analysis of vectors and curves, this suite empowers data-driven decision-making and innovation in fields where geometric and high-dimensional data are critical.
The algorithms and tools implemented here were used in conjunction with the project kondim23/time-series-dl-analysis to validate the effectiveness of time series compression techniques. By applying nearest neighbor search and clustering to both original and compressed representations, experiments demonstrated that the compressed time series preserved the essential structure of the data, yielding matching results in search and clustering tasks. This confirms the practical utility of these algorithms for evaluating and benchmarking time series compression methods.
## Theoretical Background

### Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing is a technique for efficient approximate nearest neighbor search in high-dimensional spaces. It uses hash functions designed so that similar items are more likely to land in the same hash bucket. This enables sublinear query times for similarity search with metrics such as L2 (Euclidean) for vectors and the Fréchet distance for curves.
### Hypercube Search

The Hypercube algorithm projects high-dimensional data onto the vertices of a binary hypercube using random projections. Search is performed by probing the cube's vertices in a controlled manner, balancing accuracy and efficiency. This method is particularly effective for very high-dimensional vector spaces.
### k-means++ Clustering

k-means++ is an improved initialization method for the k-means clustering algorithm. It selects initial centroids so as to maximize their spread, leading to faster convergence and better clustering quality. The assignment step can be performed exactly (Lloyd's algorithm) or approximately (using LSH or Hypercube), and the update step computes new centroids as mean vectors or mean curves.
### Fréchet Distance

Fréchet distance is a measure of similarity between curves that captures both the location and the ordering of their points. The discrete version considers only the sequence of sampled points, while the continuous version allows more flexible alignment along the curves. It is particularly well suited to geometric and trajectory analysis.
## Key Features

- Efficient Search: LSH and Hypercube for vectors; LSH for Discrete/Continuous Fréchet on curves
- Advanced Clustering: k-means++ initialization, Lloyd's, LSH, and Hypercube-based assignment; mean vector/curve updates
- Strict Input/Output Compliance: Adheres to well-defined formats for reproducibility and integration
- Command-Line Tools: Rich configuration and automation support
- Code Quality: Modular, maintainable, and thoroughly documented
## Algorithms & Methodology

- Locality Sensitive Hashing (LSH): Accelerates approximate nearest neighbor search in high-dimensional spaces using similarity-preserving hash functions. Supports both L2 (vector) and Fréchet (curve) metrics.
- Hypercube Search: Projects data into binary space for rapid search using random projections and controlled probing.
- k-means++ Clustering: Enhanced initialization for k-means, with assignment via Lloyd's, LSH, or Hypercube, and update via mean vector or mean curve.
- Fréchet Distance: Measures similarity between curves in both discrete and continuous forms, enabling advanced geometric analysis.
- Search:
  - LSH (L2) and Hypercube for vectors
  - LSH (Discrete/Continuous Fréchet) for curves
- Clustering:
  - Initialization: k-means++
  - Assignment: Lloyd's (Classic), LSH (Vector/Fréchet), Hypercube
  - Update: mean as vector or curve (with assignment compatibility)
## Input & Output Specifications

### Input Format

Tab-separated, one vector or curve per line:

    item_id1 X11 X12 ... X1d
    item_id2 X21 X22 ... X2d
    ...
    item_idN XN1 XN2 ... XNd
### Search Output Format

    Query: <item_id>
    Algorithm: LSH_Vector | Hypercube | LSH_Frechet_Continuous | LSH_Frechet_Discrete
    Approximate Nearest Neighbor: <item_id>
    True Nearest Neighbor: <item_id>
    distanceApproximate: <double>
    distanceTrue: <double>
    tApproximateAverage: <double>
    tTrueAverage: <double>
    MAF: <double>
### Clustering Output Format

    Algorithm: <Assignment Method>
    CLUSTER-1 {size: <int>, centroid: [coordinates or points]}
    ...
    CLUSTER-k {size: <int>, centroid: [coordinates or points]}
    clustering_time: <double>
    Silhouette: [s1, ..., sk, stotal]   # if -silhouette used
    CLUSTER-1 {centroid, item_idA, ...} # if -complete used
## Build & Execution

- Compile with `make` (produces executables in `bin/`):

      make

- Run according to the assignment specifications:

      ./bin/search -i <input file> -q <query file> -k <int> -L <int> -M <int> -probes <int> -o <output file> -algorithm <LSH | Hypercube | Frechet> -metric <discrete | continuous> -delta <double>
      ./bin/cluster -i <input file> -c <config file> -o <output file> -update <Frechet | Vector> -assignment <Classic | LSH | Hypercube | LSH_Frechet> -complete -silhouette
      ./bin/test

- Note: The algorithms are randomized; multiple runs may be needed to obtain the expected results.
## Configuration & Parameters

Cluster configuration file:

    number_of_clusters: <int>
    number_of_vector_hash_tables: <int>     # Default: 3
    number_of_vector_hash_functions: <int>  # Default: 4
    max_number_M_hypercube: <int>           # Default: 10
    number_of_hypercube_dimensions: <int>   # Default: 3
    number_of_probes: <int>                 # Default: 2

## Results & Evaluation

- Clustering: Silhouette score, cluster sizes, centroids, and assignment details
- Search: Approximation ratios, timing, and neighbor lists
- Reproducibility: Strict adherence to input/output formats for reliable benchmarking
## Directory Structure

    project_1_NN_Clustering/
    ├── bin/       # Compiled executables (search, cluster, test)
    ├── obj/       # Compiled object files
    ├── include/   # Header files
    ├── src/       # Source code
    ├── test/      # Unit test source and headers
    ├── data/      # Datasets
    ├── README.md  # Project documentation
    └── ...
## Installation & Dependencies

- GCC or Clang (C++11 or newer)
- Make
## Extensibility & Maintenance

- Modular codebase: add new distance metrics, clustering methods, or algorithmic modules with minimal changes
- Well-documented interfaces and configuration files
- Designed for reproducibility, benchmarking, and future research or production deployment