OneDB: A Distributed Multi-Metric Data Similarity Search System

Overview

OneDB is a high-performance, distributed similarity search framework built to handle large-scale multi-modal and multi-metric data. This system enables efficient multi-metric range and k-nearest neighbor (kNN) queries by leveraging flexible indexing strategies and automatic parameter tuning. Designed for use cases across healthcare, e-commerce, and other domains, OneDB supports a variety of data types and distance metrics.

Key Features:

- Multi-Metric Weight Learning: A lightweight learning model captures the importance of different modalities dynamically.
- Dual-Layer Indexing: A two-level index structure ensures both global partitioning and optimized local search.
- End-to-End Parameter Tuning: Deep reinforcement learning optimizes system performance automatically.
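
To make the "global partitioning, then local search" idea concrete, here is a minimal, generic sketch of a two-layer index. It is not OneDB's actual index: the pivot-based partitioning, the covering radius, and every name in it (`TwoLayerIndexSketch`, `Partition`, `build`, `rangeQuery`) are assumptions chosen for illustration, and it uses a single Euclidean metric for brevity. The pruning rule only needs the triangle inequality, so any metric distance could be substituted.

```scala
object TwoLayerIndexSketch {
  type Point = Array[Double]

  // Euclidean distance; any function satisfying the triangle inequality works.
  def l2(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One global partition: a pivot, the points assigned to it, and the
  // largest pivot-to-point distance (the covering radius).
  final case class Partition(pivot: Point, points: Vector[Point], radius: Double)

  // Global layer (hypothetical): assign every point to its nearest pivot.
  // Requires at least one pivot.
  def build(data: Seq[Point], pivots: Seq[Point]): Vector[Partition] = {
    val grouped = data.groupBy(p => pivots.indices.minBy(i => l2(p, pivots(i))))
    pivots.indices.toVector.map { i =>
      val pts = grouped.getOrElse(i, Seq.empty).toVector
      val r = if (pts.isEmpty) 0.0 else pts.map(p => l2(pivots(i), p)).max
      Partition(pivots(i), pts, r)
    }
  }

  // Range query: skip partitions that the triangle inequality proves are too
  // far away, then scan the surviving partitions (the local layer).
  def rangeQuery(index: Seq[Partition], q: Point, r: Double): Vector[Point] =
    index.toVector
      .filter(part => l2(q, part.pivot) - part.radius <= r)
      .flatMap(part => part.points.filter(p => l2(q, p) <= r))
}
```

A partition is skipped only when `l2(q, pivot) - radius > r`, so the pruned query returns the same points as a full scan would.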

Development Environment

We recommend IntelliJ IDEA for development. Set up Spark and Hadoop by following the official guide.

Steps:

1. Install the Spark and Hadoop dependencies.
2. Configure the Maven project with:
   - Profiles: hadoop-2.6, hive-provided, yarn
3. Reimport and generate sources using Maven Projects in IntelliJ.
4. Mark generated sources as appropriate:
   - target/scala-2.11/src_managed/main/compiled_avro for spark-streaming-flume-sink.
   - src/gen/java for spark-hive-thriftserver.

Example Usage

To perform similarity search using OneDB's SQL interface:

1. Multi-Metric Range Query:

```
SELECT * FROM my_table WHERE col IN ODBRANGE(query_point, weight_vector, radius);
```

2. Multi-Metric kNN Query:

```
SELECT * FROM my_table WHERE col IN ODBKNN(query_point, weight_vector, k);
```
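
As a reading aid for the two queries above, the following brute-force sketch shows what a multi-metric range query and kNN query compute. It rests on one assumption that this README does not state: that the per-metric distances are combined as a weighted sum using `weight_vector`. All names (`MultiMetricSketch`, `weightedDistance`, `rangeQuery`, `knnQuery`) are hypothetical and are not part of OneDB's API. The code scans every point and is meant only to pin down the query semantics, not to be efficient.

```scala
object MultiMetricSketch {
  type Component = Array[Double]                 // a point's value in one metric space
  type Point     = Vector[Component]             // one component per metric space
  type Metric    = (Component, Component) => Double

  // Two example per-space metrics: Manhattan (L1) and Euclidean (L2) distance.
  val l1: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  val l2: Metric = (a, b) => math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // ASSUMPTION: per-space distances are combined as a weighted sum.
  def weightedDistance(metrics: Seq[Metric], weights: Seq[Double]): (Point, Point) => Double =
    (q, o) => metrics.indices.map(i => weights(i) * metrics(i)(q(i), o(i))).sum

  // Range query: every point whose combined distance to q is at most `radius`.
  def rangeQuery(data: Seq[Point], dist: (Point, Point) => Double,
                 q: Point, radius: Double): Seq[Point] =
    data.filter(o => dist(q, o) <= radius)

  // kNN query: the k points with the smallest combined distance to q.
  def knnQuery(data: Seq[Point], dist: (Point, Point) => Double,
               q: Point, k: Int): Seq[Point] =
    data.sortBy(o => dist(q, o)).take(k)
}
```

For example, `weightedDistance(Seq(l2, l1), Seq(0.5, 0.5))` yields a distance function over points that have two components, the first compared with L2 and the second with L1.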

For sample code, refer to examples/src/main/scala/org/apache/spark/examples/sql/onedb/ODBMultiSQLExample.scala.

See ODBMultiDataFrameExample in the same package for how to use the DataFrame API for multi-metric similarity search.

Source Code Structure

Core Logic:
- /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/onedb
Catalyst Expressions:
- /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/onedb
Examples:
- /spark/examples/src/main/scala/org/apache/spark/examples/sql/onedb

Datasets

| Dataset | Size | Metrics | Link |
|---|---|---|---|
| Air Quality | 1,150,000 | 13 | Link |
| Food | 38,757 | 9 | Link |
| Rental | 113,176 | 5 | Link |

Datasets are stored in TXT format and can be accessed from HDFS or a local directory. Each row represents one data point with dimensions separated by spaces.
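
The row format described above can be read with a few lines of plain Scala. This is a minimal sketch that assumes every dimension is numeric; how the dimensions are grouped into metric spaces is not specified here, so the parser returns a flat array. `DatasetLineParser`, `parseLine`, and `parseAll` are hypothetical names, not part of OneDB.

```scala
import scala.util.Try

object DatasetLineParser {
  // Parses one space-separated row into its numeric dimensions.
  // Returns None for blank lines and for rows with a non-numeric token
  // (ASSUMPTION: all dimensions are numeric).
  def parseLine(line: String): Option[Array[Double]] = {
    val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
    if (tokens.isEmpty) None
    else Try(tokens.map(_.toDouble)).toOption
  }

  // Parses many rows, silently dropping the ones that fail to parse.
  def parseAll(lines: Iterator[String]): Vector[Array[Double]] =
    lines.map(parseLine).collect { case Some(point) => point }.toVector
}
```

For a local file, the lines could come from `scala.io.Source.fromFile(path).getLines()`; for data on HDFS, the same `parseLine` function could be applied to the lines of an RDD obtained from `SparkContext.textFile`.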

Compared Algorithms

| Algorithm | Description | Year |
|---|---|---|
| DIMS | Distributed similarity search | 2024 |
| DESIRE | Multi-metric clustering-based indexing | 2022 |
| Milvus | Vector-based similarity search | 2021 |

How to Run

Build the Project:
```
mvn clean install
```

Launch OneDB with Spark:

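```
spark-submit --class org.apache.spark.examples.sql.onedb.ODBMultiDataFrameExample --master yarn <path-to-examples-jar>
```

Replace `<path-to-examples-jar>` with the examples JAR produced by the build; spark-submit expects the application JAR after the options.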

Test SQL Queries: Use Spark SQL CLI or connect to a running Spark instance.

License

This project follows the ACM Copyright policy for SIGMOD Conference works. For personal or classroom use, copying is permitted without fee. Redistribution for commercial purposes requires specific permission from ACM.
