OneDB is a high-performance, distributed similarity search framework built to handle large-scale multi-modal and multi-metric data. This system enables efficient multi-metric range and k-nearest neighbor (kNN) queries by leveraging flexible indexing strategies and automatic parameter tuning. Designed for use cases across healthcare, e-commerce, and other domains, OneDB supports a variety of data types and distance metrics.
- Multi-Metric Weight Learning: A lightweight learning model captures the importance of different modalities dynamically.
 - Dual-Layer Indexing: A two-level index structure ensures both global partitioning and optimized local search.
 - End-to-End Parameter Tuning: Deep reinforcement learning optimizes system performance automatically.
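
To make the dual-layer idea concrete, here is a minimal, hypothetical sketch of a two-level lookup in plain Scala. It is not OneDB's implementation: the names (`TwoLevelIndexSketch`, `Partition`, `build`, `rangeQuery`), the pivot-based partitioning, and the single Euclidean metric are all assumptions chosen for illustration. The global layer assigns each point to its nearest pivot and records a covering radius; the local layer is reduced to a linear scan to keep the example short.

```scala
// Hypothetical illustration only; none of these names come from OneDB.
object TwoLevelIndexSketch {
  type Point = Vector[Double]

  // Euclidean distance; any distance obeying the triangle inequality would do.
  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One global-layer entry: a pivot, its covering radius, and its points.
  final case class Partition(pivot: Point, radius: Double, points: Vector[Point])

  // Global layer: assign every point to its nearest pivot.
  def build(data: Vector[Point], pivots: Vector[Point]): Vector[Partition] = {
    require(pivots.nonEmpty, "at least one pivot is required")
    data
      .groupBy(p => pivots.minBy(c => euclidean(p, c)))
      .map { case (pivot, pts) => Partition(pivot, pts.map(euclidean(_, pivot)).max, pts) }
      .toVector
  }

  // Range query: skip partitions that cannot contain a result, then scan the rest.
  def rangeQuery(index: Vector[Partition], q: Point, r: Double): Vector[Point] =
    index
      .filter(part => euclidean(q, part.pivot) - part.radius <= r) // global pruning
      .flatMap(_.points.filter(p => euclidean(p, q) <= r))         // local search
}
```

The pruning test relies on the triangle inequality: if the query is farther from a pivot than the covering radius plus `r`, no point in that partition can lie within `r` of the query.
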
 
We recommend IntelliJ IDEA for development, with Spark and Hadoop set up according to the official guide.
- Install Spark and Hadoop dependencies.
- Configure the Maven project with the profiles `hadoop-2.6`, `hive-provided`, and `yarn`.
 - Reimport and generate sources using Maven Projects in IntelliJ.
- Mark the generated source directories as appropriate: `target/scala-2.11/src_managed/main/compiled_avro` for `spark-streaming-flume-sink` and `src/gen/java` for `spark-hive-thriftserver`.
 
To perform similarity search using OneDB's SQL interface:
1. Multi-metric range query: `SELECT * FROM my_table WHERE col IN ODBRANGE(query_point, weight_vector, radius);`
2. Multi-metric kNN query: `SELECT * FROM my_table WHERE col IN ODBKNN(query_point, weight_vector, k);`

For sample code, refer to `examples/src/main/scala/org/apache/spark/examples/sql/onedb/ODBMultiSQLExample.scala`. See `ODBMultiDataFrameExample` for how to run multi-metric similarity search through the DataFrame API.
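
As a rough guide to what the two queries return, the brute-force sketch below mimics them on an in-memory collection. It is an assumption-laden illustration, not OneDB code: the object and method names are hypothetical, every modality is treated as a numeric vector, and the per-modality distances are combined by a weighted sum, which is only one plausible reading of `weight_vector`.

```scala
// Hypothetical, brute-force illustration of multi-metric query semantics.
// Nothing here is OneDB's API; the weighted-sum combination is an assumption.
object MultiMetricSketch {
  // One data object holds one value per modality (here: numeric vectors).
  type Point = Vector[Vector[Double]]
  type Metric = (Vector[Double], Vector[Double]) => Double

  // Two example per-modality distances.
  val l1: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  val l2: Metric = (a, b) => math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Combined distance: sum of weight(i) * metric(i) over all modalities.
  def weightedDistance(metrics: Seq[Metric], weights: Seq[Double]): (Point, Point) => Double = {
    require(metrics.size == weights.size, "one weight per metric is required")
    (p, q) => metrics.indices.map(i => weights(i) * metrics(i)(p(i), q(i))).sum
  }

  // Range query: every object whose combined distance to q is at most radius.
  def rangeQuery(data: Seq[Point], q: Point, radius: Double,
                 d: (Point, Point) => Double): Seq[Point] =
    data.filter(p => d(p, q) <= radius)

  // kNN query: the k objects with the smallest combined distance to q.
  def knnQuery(data: Seq[Point], q: Point, k: Int,
               d: (Point, Point) => Double): Seq[Point] =
    data.sortBy(p => d(p, q)).take(k)
}
```

A distance function built with `MultiMetricSketch.weightedDistance(Seq(l1, l2), Seq(0.5, 0.5))` can be passed to both `rangeQuery` and `knnQuery`. A real deployment would answer these queries through the index rather than by scanning every object.
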
- Core logic: `/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/onedb`
- Catalyst expressions: `/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/onedb`
- Examples: `/spark/examples/src/main/scala/org/apache/spark/examples/sql/onedb`
 
| Dataset | Size | Metric | Link | 
|---|---|---|---|
| Air Quality | 1,150,000 | 13 | Link | 
| Food | 38,757 | 9 | Link | 
| Rental | 113,176 | 5 | Link |

Datasets are stored in TXT format and can be accessed from HDFS or a local directory. Each row represents one data point with dimensions separated by spaces.
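
The row format described above can be read with a few lines of plain Scala. This is a minimal sketch under two assumptions that the description does not confirm: every dimension is numeric, and fields are split on runs of whitespace. `DatasetLineParser` and its methods are hypothetical names, not part of OneDB.

```scala
// Hypothetical helper; assumes every dimension in a row is numeric.
object DatasetLineParser {
  // Split one row on runs of whitespace and parse each field as a Double.
  def parseLine(line: String): Vector[Double] =
    line.trim.split("\\s+").filter(_.nonEmpty).map(_.toDouble).toVector

  // Parse many rows, skipping blank lines.
  def parseLines(lines: Iterator[String]): Vector[Vector[Double]] =
    lines.map(_.trim).filter(_.nonEmpty).map(parseLine).toVector
}
```

For a local file, the rows could be supplied with `scala.io.Source.fromFile(path).getLines()`. A non-numeric field makes `toDouble` throw a `NumberFormatException`, so datasets with textual modalities would need a different parser.
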
| Algorithm | Description | Year | 
|---|---|---|
| DIMS | Distributed similarity search | 2024 | 
| DESIRE | Multi-metric clustering-based indexing | 2022 | 
| Milvus | Vector-based similarity search | 2021 | 

- Build the project: `mvn clean install`
- Launch OneDB with Spark: `spark-submit --class org.apache.spark.examples.sql.onedb.ODBMultiDataFrameExample --master yarn <application-jar>`, replacing `<application-jar>` with the path to the JAR that contains the example class.
- Test SQL queries: use the Spark SQL CLI or connect to a running Spark instance.
 
This project follows the ACM Copyright policy for SIGMOD Conference works. For personal or classroom use, copying is permitted without fee. Redistribution for commercial purposes requires specific permission from ACM.