140 changes: 140 additions & 0 deletions rust/cubestore/CLAUDE.md
@@ -0,0 +1,140 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Overview

CubeStore is the Rust-based distributed OLAP storage engine for Cube.js, designed to store and serve pre-aggregations at scale. It's part of the larger Cube.js monorepo and serves as the materialized cache store for rollup tables.

## Architecture Overview

### Core Components

The codebase is organized as a Rust workspace with multiple crates:

- **`cubestore`**: Main CubeStore implementation with distributed storage, query execution, and API interfaces
- **`cubestore-sql-tests`**: SQL compatibility test suite and benchmarks
- **`cubehll`**: HyperLogLog implementation for approximate distinct counting
- **`cubedatasketches`**: DataSketches integration for advanced approximate algorithms
- **`cubezetasketch`**: Theta Sketch implementation for set operations
- **`cuberpc`**: RPC layer for distributed communication
- **`cuberockstore`**: RocksDB wrapper and storage abstraction

### Key Modules in `cubestore/src/`

- **`metastore/`**: Metadata management, table schemas, partitioning, and distributed coordination
- **`queryplanner/`**: Query planning, optimization, and physical execution planning using DataFusion
- **`store/`**: Core storage layer with compaction and data management
- **`cluster/`**: Distributed cluster management, worker pools, and inter-node communication
- **`table/`**: Table data handling, Parquet integration, and data redistribution
- **`cachestore/`**: Caching layer with eviction policies and queue management
- **`sql/`**: SQL parsing and execution layer
- **`streaming/`**: Kafka streaming support and traffic handling
- **`remotefs/`**: Cloud storage integration (S3, GCS, MinIO)
- **`config/`**: Dependency injection and configuration management

## Development Commands

### Building

```bash
# Build all crates in release mode
cargo build --release

# Build all crates in debug mode
cargo build

# Build specific crate
cargo build -p cubestore

# Check code without building
cargo check
```

### Testing

```bash
# Run all tests
cargo test

# Run tests for specific crate
cargo test -p cubestore
cargo test -p cubestore-sql-tests

# Run single test
cargo test test_name

# Run tests with output
cargo test -- --nocapture

# Run integration tests
cargo test --test '*'

# Run benchmarks
cargo bench
```

### Development

```bash
# Format code
cargo fmt

# Check formatting
cargo fmt -- --check

# Run clippy lints
cargo clippy

# Run with debug logging
RUST_LOG=debug cargo run

# Run specific binary
cargo run --bin cubestore

# Watch for changes (requires cargo-watch)
cargo watch -x check -x test
```

### JavaScript Wrapper Commands

```bash
# Build TypeScript wrapper
npm run build

# Run JavaScript tests
npm test

# Lint JavaScript code
npm run lint

# Fix linting issues
npm run lint:fix
```

## Key Dependencies and Technologies

- **DataFusion**: Apache Arrow-based query engine (using Cube's fork)
- **Apache Arrow/Parquet**: Columnar data format and processing
- **RocksDB**: Embedded key-value store for metadata
- **Tokio**: Async runtime for concurrent operations
- **sqlparser-rs**: SQL parsing (using Cube's fork)

## Configuration via Dependency Injection

The codebase uses a custom dependency injection system defined in `config/injection.rs`. Services are configured through the `Injector` and use `Arc<dyn ServiceTrait>` patterns for abstraction.
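
As an illustration of the `Arc<dyn ServiceTrait>` pattern (the trait and struct names below are hypothetical; the real registration API lives in `config/injection.rs` and may differ), a minimal sketch:

```rust
use std::sync::Arc;

// Hypothetical service trait; the actual service traits are spread
// across the codebase and registered with the Injector.
trait ChunkStore: Send + Sync {
    fn chunk_count(&self) -> u64;
}

struct LocalChunkStore;

impl ChunkStore for LocalChunkStore {
    fn chunk_count(&self) -> u64 {
        0 // placeholder implementation
    }
}

// Consumers hold Arc<dyn ChunkStore>, never the concrete type, so tests
// and different cluster roles can swap in alternative implementations.
struct QueryExecutor {
    chunks: Arc<dyn ChunkStore>,
}

fn main() {
    let executor = QueryExecutor {
        chunks: Arc::new(LocalChunkStore),
    };
    assert_eq!(executor.chunks.chunk_count(), 0);
}
```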

## Testing Approach

- Unit tests are colocated with source files using `#[cfg(test)]` modules (a minimal sketch follows this list)
- Integration tests are in `cubestore-sql-tests/tests/`
- SQL compatibility tests use fixtures in `cubestore-sql-tests/src/tests.rs`
- Benchmarks are in `benches/` directories
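
For reference, a colocated unit-test module follows the standard Rust convention (illustrative code, not taken from the repository):

```rust
// Hypothetical function under test.
pub fn total_rows(chunk_sizes: &[u64]) -> u64 {
    chunk_sizes.iter().sum()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sums_chunk_sizes() {
        assert_eq!(total_rows(&[10, 20, 12]), 42);
    }
}
```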

## Important Notes

- This is a Rust nightly project (see `rust-toolchain.toml`)
- Uses custom forks of Arrow/DataFusion and sqlparser-rs for Cube-specific features
- Distributed mode involves router and worker nodes communicating via RPC
- Heavy use of async/await patterns with Tokio runtime
- Parquet files are the primary storage format for data
36 changes: 0 additions & 36 deletions rust/cubestore/Cargo.lock

Some generated files are not rendered by default.

1 change: 0 additions & 1 deletion rust/cubestore/cubestore/Cargo.toml
@@ -74,7 +74,6 @@ tokio-stream = { version = "0.1.15", features=["io-util"] }
 scopeguard = "1.1.0"
 async-compression = { version = "0.3.7", features = ["gzip", "tokio"] }
 tempfile = "3.10.1"
-tarpc = { version = "0.24", features = ["tokio1"] }
 pin-project-lite = "0.2.4"
 paste = "1.0.4"
 memchr = "2"
@@ -43,35 +43,29 @@ impl InfoSchemaTableDef for ColumnsInfoSchemaTableDef {
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<(Column, TablePath)>>) -> ArrayRef>> {
         vec![
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(_, row)| row.schema.get_row().get_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|(_, row)| row.schema.get_row().get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(_, row)| row.table.get_row().get_table_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|(_, row)| row.table.get_row().get_table_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
-                    tables
-                        .iter()
-                        .map(|(column, _)| column.get_name().as_str())
-                        .collect::<Vec<_>>(),
+                Arc::new(StringArray::from_iter_values(
+                    tables.iter().map(|(column, _)| column.get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|(column, _)| column.get_column_type().to_string())
-                        .collect::<Vec<_>>(),
+                        .map(|(column, _)| column.get_column_type().to_string()),
                 ))
             }),
         ]
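
The recurring refactor in this and the following hunks replaces `StringArray::from(... .collect::<Vec<_>>())` with `StringArray::from_iter_values(...)`, building the Arrow array directly from the iterator and skipping the intermediate `Vec` allocation. A standalone sketch of the before/after shapes (assuming the stock `arrow` crate; CubeStore actually uses Cube's fork, whose API may differ in details):

```rust
use arrow::array::{Array, StringArray};

fn main() {
    let names: Vec<String> = vec!["schema_a".into(), "schema_b".into()];

    // Before: materialize a Vec<&str>, then convert it into an array.
    let via_vec =
        StringArray::from(names.iter().map(|s| s.as_str()).collect::<Vec<_>>());

    // After: construct the array straight from the iterator; no temporary
    // Vec, and no .as_str() needed since from_iter_values accepts any
    // AsRef<str> item.
    let via_iter = StringArray::from_iter_values(names.iter());

    assert_eq!(via_vec.len(), via_iter.len());
}
```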
@@ -26,11 +26,8 @@ impl InfoSchemaTableDef for SchemataInfoSchemaTableDef {
 
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<Self::T>>) -> ArrayRef>> {
         vec![Box::new(|tables| {
-            Arc::new(StringArray::from(
-                tables
-                    .iter()
-                    .map(|row| row.get_row().get_name().as_str())
-                    .collect::<Vec<_>>(),
+            Arc::new(StringArray::from_iter_values(
+                tables.iter().map(|row| row.get_row().get_name()),
             ))
         })]
     }
@@ -40,48 +40,38 @@ impl InfoSchemaTableDef for TablesInfoSchemaTableDef {
     fn columns(&self) -> Vec<Box<dyn Fn(Arc<Vec<Self::T>>) -> ArrayRef>> {
         vec![
             Box::new(|tables| {
-                Arc::new(StringArray::from(
-                    tables
-                        .iter()
-                        .map(|row| row.schema.get_row().get_name().as_str())
-                        .collect::<Vec<_>>(),
+                Arc::new(StringArray::from_iter_values(
+                    tables.iter().map(|row| row.schema.get_row().get_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(StringArray::from(
+                Arc::new(StringArray::from_iter_values(
                     tables
                         .iter()
-                        .map(|row| row.table.get_row().get_table_name().as_str())
-                        .collect::<Vec<_>>(),
+                        .map(|row| row.table.get_row().get_table_name()),
                 ))
             }),
             Box::new(|tables| {
-                Arc::new(TimestampNanosecondArray::from(
-                    tables
-                        .iter()
-                        .map(|row| {
-                            row.table
-                                .get_row()
-                                .build_range_end()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(tables.iter().map(
+                    |row| {
+                        row.table
+                            .get_row()
+                            .build_range_end()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
             Box::new(|tables| {
-                Arc::new(TimestampNanosecondArray::from(
-                    tables
-                        .iter()
-                        .map(|row| {
-                            row.table
-                                .get_row()
-                                .seal_at()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(tables.iter().map(
+                    |row| {
+                        row.table
+                            .get_row()
+                            .seal_at()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
         ]
     }
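
For the timestamp columns the values are `Option<i64>`, so these hunks use `from_iter` rather than `from_iter_values`: `PrimitiveArray` implements `FromIterator` over `Option` items, turning each `None` into a null slot. A small sketch under the same stock-`arrow` assumption:

```rust
use arrow::array::{Array, TimestampNanosecondArray};

fn main() {
    // None becomes a null entry in the resulting array, again with no
    // intermediate Vec<Option<i64>> allocation.
    let ts = TimestampNanosecondArray::from_iter([
        Some(1_700_000_000_000_000_000_i64),
        None,
    ]);

    assert_eq!(ts.len(), 2);
    assert_eq!(ts.null_count(), 1);
}
```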
@@ -47,17 +47,14 @@ impl InfoSchemaTableDef for SystemCacheTableDef {
                 ))
             }),
             Box::new(|items| {
-                Arc::new(TimestampNanosecondArray::from(
-                    items
-                        .iter()
-                        .map(|row| {
-                            row.get_row()
-                                .get_expire()
-                                .as_ref()
-                                .map(|t| t.timestamp_nanos())
-                        })
-                        .collect::<Vec<_>>(),
-                ))
+                Arc::new(TimestampNanosecondArray::from_iter(items.iter().map(
+                    |row| {
+                        row.get_row()
+                            .get_expire()
+                            .as_ref()
+                            .map(|t| t.timestamp_nanos())
+                    },
+                )))
             }),
             Box::new(|items| {
                 Arc::new(StringArray::from_iter(