Add composite keys for secondary indexes (refs #180) #254

Open
shrayarora8 wants to merge 2 commits into apache:master from shrayarora8:feature/composite-keys-storage

@shrayarora8 shrayarora8 commented May 12, 2026

What this PR does

This is the first part of work on the composite keys feature from issue #180. It adds the storage layer primitives that make secondary attribute lookups O(log N + M) instead of O(N).

I split the work into 3 phases so each PR stays small and reviewable:

  • Phase 1 (this PR): codec for encoding/decoding composite keys, four new storage methods, a filter for user-facing scans, tests, and a benchmark
  • Phase 2 (future PR): proto + KVExecutor wiring
  • Phase 3 (future PR): client SDK + end-to-end benchmark

Encoding

A composite key ("ck") is a single string with this layout (zero or more attributes, each followed by \0, then the primary key):

"ck"  \0  index_name  \0  [attribute_1  \0  attribute_2  \0  ...]  primary_key
  • The "ck" namespace plus null delimiter (\0) (3 bytes total) keeps these entries from colliding with regular user keys. Even a user key like "ck_balance" is safe: it starts with "ck" but not with "ck\0".
  • \0 is the field separator because it is not allowed in user-supplied strings: the encoder rejects any input field that contains it, so it can never appear inside a field.
  • The value stored alongside the key is empty (""). The key itself encodes everything we need.
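
To make the layout concrete, here is a minimal sketch of an encoder that follows the rules above. The function name EncodeCompositeKeySketch is hypothetical, and the real signatures in composite_key_codec.h may differ; the only behaviour taken from this PR is the field order, the \0 separator, and returning "" when an input contains \0.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the PR's encoder; names and signature are assumptions.
// Returns "" if any field contains the \0 delimiter.
std::string EncodeCompositeKeySketch(const std::string& index_name,
                                     const std::vector<std::string>& attributes,
                                     const std::string& primary_key) {
  auto has_delim = [](const std::string& s) {
    return s.find('\0') != std::string::npos;
  };
  if (has_delim(index_name) || has_delim(primary_key)) return "";
  for (const auto& attr : attributes) {
    if (has_delim(attr)) return "";
  }

  std::string key("ck\0", 3);  // namespace + delimiter, 3 bytes
  key += index_name;
  key.push_back('\0');
  for (const auto& attr : attributes) {
    key += attr;
    key.push_back('\0');  // every attribute is followed by a delimiter
  }
  key += primary_key;  // the primary key is always the last field
  return key;
}
```

Note the explicit length in std::string("ck\0", 3): without it, the constructor would stop at the embedded null byte.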

Why is the value empty?

The composite key is an index entry, not user data. The user's actual data is stored separately under their primary key (e.g. via SetValueWithVersion("user:1", <data>, 0)). The composite key's job is just to record the existence of an (attribute → primary_key) association so that later, when someone asks "give me everyone whose city is Davis", we can find all the matching primary keys without scanning everything. The onus is on the client to create the composite key for each secondary attribute it wants indexed.
The workflow looks like this:

  1. App stores user data: SetValueWithVersion("user:1", <data>, 0)
  2. App creates an index marker: CreateCompositeKey(EncodeCompositeKey("byCity", {"Davis"}, "user:1"))
  3. Later, app queries by city: GetByCompositeKeyPrefix(EncodeCompositeKeyPrefix("byCity", {"Davis"}))
    • that returns all composite keys with prefix "ck\0byCity\0Davis\0"
    • app decodes each one to pull out the primary key
    • app calls GetValue(primary_key) for the data

The composite key string already encodes everything we need (index name, attribute values, primary key), so storing anything in the value would just be duplication. LevelDB sorts by key, not by value, and keeping the value empty also keeps the index cheap on disk.

For a prefix scan, the codec builds the same string up to (and including) the trailing \0 after the last attribute. That makes it a strict byte prefix of every key we want to match.
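
A minimal sketch of the prefix builder and the decoding step of the lookup workflow, under the same layout. EncodePrefixSketch, DecodeSketch, and DecodedKey are hypothetical names chosen for illustration, not the PR's actual API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical prefix builder: same bytes as a full key, minus the primary key.
std::string EncodePrefixSketch(const std::string& index_name,
                               const std::vector<std::string>& attributes) {
  std::string prefix("ck\0", 3);
  prefix += index_name;
  prefix.push_back('\0');
  for (const auto& attr : attributes) {
    prefix += attr;
    prefix.push_back('\0');  // trailing delimiter after the last attribute
  }
  return prefix;
}

struct DecodedKey {
  std::string index_name;
  std::vector<std::string> attributes;
  std::string primary_key;
};

// Hypothetical decoder: splits on \0 after the namespace. Returns false when
// the namespace is missing or fewer than two fields follow it.
bool DecodeSketch(const std::string& key, DecodedKey* out) {
  const std::string ns("ck\0", 3);
  if (key.compare(0, ns.size(), ns) != 0) return false;

  std::vector<std::string> fields;
  std::string::size_type start = ns.size();
  while (true) {
    const std::string::size_type pos = key.find('\0', start);
    if (pos == std::string::npos) {
      fields.push_back(key.substr(start));  // last field: the primary key
      break;
    }
    fields.push_back(key.substr(start, pos - start));
    start = pos + 1;
  }
  if (fields.size() < 2) return false;

  out->index_name = fields.front();
  out->primary_key = fields.back();
  out->attributes.assign(fields.begin() + 1, fields.end() - 1);
  return true;
}
```

In the workflow above, the application would run something like DecodeSketch on each key returned by the prefix scan and then call GetValue with the decoded primary_key.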

Architecture / flow

Composite keys live in the storage layer only for this PR.

Write path:

application
   ↓ CreateCompositeKey(encoded_string)
Storage interface
   ↓
ResLevelDB::CreateCompositeKey  →  leveldb::DB::Put(key, "")
   or
MemoryDB::CreateCompositeKey    →  ck_map_[key] = ""

Read path:

application
   ↓ GetByCompositeKeyPrefix(prefix)
Storage interface
   ↓
ResLevelDB  →  iter.Seek(prefix), walk while memcmp matches  →  O(log N + M)
MemoryDB    →  ck_map_.lower_bound(prefix), walk while matches  →  O(log N + M)

Both backends use sorted-order data structures, so Seek / lower_bound jumps to the first candidate key, and the iteration stops at the first non-match.
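
The MemoryDB side of the read path can be sketched with nothing but std::map. This is an illustration of the lower_bound + walk pattern, not the PR's code; ScanByPrefixSketch and its parameters are assumed names, with ck_map standing in for ck_map_.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch of a sorted-map prefix scan: jump to the first candidate in
// O(log N), then walk forward until a key no longer starts with the prefix.
std::vector<std::string> ScanByPrefixSketch(
    const std::map<std::string, std::string>& ck_map,
    const std::string& prefix) {
  std::vector<std::string> matches;
  for (auto it = ck_map.lower_bound(prefix); it != ck_map.end(); ++it) {
    // compare() is byte-wise and handles embedded \0 bytes correctly.
    if (it->first.compare(0, prefix.size(), prefix) != 0) break;
    matches.push_back(it->first);
  }
  return matches;
}
```

Because the map is sorted, matches come back in key order regardless of insertion order, which is the property the PrefixScanOrdering test relies on.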

UpdateCompositeKey is implemented as an atomic delete + insert. On LevelDB this uses WriteBatch so we never end up in a state where the old key is gone but the new one didn't make it in.
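
As an in-memory analogue of that all-or-nothing behaviour, the sketch below validates before it mutates, so a failed update leaves the map untouched. It assumes a std::map-backed store and a hypothetical function name; it is not the MemoryDB implementation from this PR, and it does not address concurrent access.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch: replace old_key with new_key, or change nothing at all.
// Returns false (and leaves ck_map unchanged) if old_key is absent
// or new_key is empty.
bool UpdateCompositeKeySketch(std::map<std::string, std::string>& ck_map,
                              const std::string& old_key,
                              const std::string& new_key) {
  auto it = ck_map.find(old_key);
  if (it == ck_map.end() || new_key.empty()) return false;
  ck_map.erase(it);       // delete the old marker
  ck_map[new_key] = "";   // insert the new marker with an empty value
  return true;
}
```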

Filter for user-facing scans

GetAllItems() and GetKeyRange() had to be tweaked so they don't return composite-key markers to applications that just want their own data. The filter is:

if (key starts with "ck\0") continue;

I tested explicitly that:

  • composite keys are hidden from GetAllItems
  • a user key called "ck_balance" is NOT hidden (the filter checks "ck\0", not "ck")
  • composite keys are still reachable via GetByCompositeKeyPrefix — they're hidden, not deleted

For MemoryDB, no filter is needed: composite keys live in a separate std::map<std::string, std::string> ck_map_, so they can't leak into GetAllItems by construction.
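
The filter check itself is small enough to sketch in full. IsCompositeKeySketch is a hypothetical helper name; the point is that the comparison covers all three bytes of "ck\0", so ordinary keys that merely start with "ck" pass through.

```cpp
#include <cassert>
#include <string>

// Sketch of the scan filter: true only for keys in the "ck\0" namespace.
bool IsCompositeKeySketch(const std::string& key) {
  // Explicit length so the embedded \0 is part of the prefix.
  static const std::string kPrefix("ck\0", 3);
  return key.compare(0, kPrefix.size(), kPrefix) == 0;
}
```

A scan loop would then skip any key for which this returns true, which corresponds to the `continue` in the filter line above.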

Tests

composite_key_codec_test.cpp (5 tests)

  • RoundTrip — Encode ("byOwner", ["alice", "active"], "user:1"), then decode the result and check I get back the same index name, the attributes in the same order, and the same primary key. Basic correctness check.
  • RejectDelimInInput — If any input field contains \0, encoding must refuse and return "". Otherwise you'd produce a key that can't be decoded. Side note: C-string literals like "in\0dex" truncate at the null byte and never actually trigger this path, so the test builds the strings explicitly with std::string("in") + kCompositeKeyDelim + "dex".
  • PrefixIsStrictBytePrefix — For any (idx, attrs), EncodeCompositeKeyPrefix(idx, attrs) must be a strict byte prefix of EncodeCompositeKey(idx, attrs, any_pk). This is the property that makes the LevelDB Seek + walk pattern correct — if the prefix wasn't an exact byte prefix, Seek could land in the wrong spot.
  • DecodeMalformed — Garbage inputs (missing namespace, only one field after the namespace) return false instead of crashing or returning wrong data.
  • EmptyAttributes — Encoding/decoding still works with zero attributes (just index_name + primary_key). Unusual case but allowed by the format, so it has to work.
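
The side note in RejectDelimInInput is easy to trip over, so here is a small illustration of how the three ways of building such a string differ. The function names are made up for this example.

```cpp
#include <cassert>
#include <string>

// The const char* constructor stops at the first \0, so this yields "in".
std::string FromLiteralSketch() { return std::string("in\0dex"); }

// Concatenating the delimiter explicitly keeps all six bytes.
std::string BuiltExplicitlySketch() {
  return std::string("in") + '\0' + "dex";
}

// Passing an explicit length also keeps the embedded \0.
std::string WithLengthSketch() { return std::string("in\0dex", 6); }
```

Only the last two forms produce an input that actually contains the delimiter and can exercise the rejection path.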

kv_storage_test.cpp (6 new parametrized tests)

Each runs against all three backends — MemoryDB, ResLevelDB, ResLevelDB-with-block-cache — so 18 cases total.

  • CreateAndRetrieveCompositeKey: Insert three Davis users, prefix-scan, verify all three come back.
  • PrefixScanOrdering: Insert keys out of order (SF, Davis-1, Davis-2, NYC), then do a Davis prefix scan. The result must be [Davis-1, Davis-2] in that exact order. This matters for BFT determinism: every honest replica running the same scan against the same state has to produce the same byte-for-byte output, otherwise replicas would diverge during consensus.
  • DeleteRemovesEntry: Create, delete, then prefix-scan and get 0 results. Confirms DeleteCompositeKey actually removes the marker.
  • UpdateIsAtomic: Move user:1 from Davis to SF via UpdateCompositeKey. After the call, the Davis prefix has 0 entries and the SF prefix has 1 entry. On LevelDB this is enforced with WriteBatch.
  • EmptyPrefixScanReturnsNothing: Insert Davis keys, scan for the NYC prefix, get 0 results. Confirms no false positives: the Seek + walk pattern doesn't accidentally pick up keys from other prefixes.
  • GetAllItemsExcludesCompositeKeys: The filter test. Setup: insert two regular user keys (user:1, user:2), one user key that starts with "ck" but is real data (ck_balance, an important edge case), and two composite key markers. Then GetAllItems() must return exactly the 3 real keys (including ck_balance) and exclude the markers. As a sanity check, I also verify the markers are still reachable via GetByCompositeKeyPrefix: they're hidden from the user-facing API, not deleted.

All 48 tests in kv_storage_test pass (30 + 18 cases).

Benchmark

benchmark/storage/composite_key_benchmark.cpp measures the two paths head-to-head. Setup for each (N, selectivity) cell:

  1. Spin up a fresh LevelDB instance in /tmp
  2. Insert N user records with primary keys user:0, user:1, ..., user:N-1. The first selectivity * N records get value "Davis"; the rest get "Other".
  3. For every record, also create a composite-key index entry under byCity. So LevelDB ends up with roughly 2N keys.

The two paths being measured:

  • OLD: GetAllItems() returns every user record, then the benchmark filters value == "Davis" in C++ code. O(N): it has to touch every record.
  • NEW: GetByCompositeKeyPrefix(EncodeCompositeKeyPrefix("byCity", {"Davis"})) returns just the matching index entries. O(log N + M): Seek lands at the prefix in O(log N), then walks M matches.

Sample run on Apple M2 (times are rounded to two decimal places, so 0.00 means under 0.005 ms, not zero):

| Records | Selectivity | OLD (ms) | NEW (ms) | Speedup |
|---|---|---|---|---|
| 1,000 | 1% | 0.39 | 0.00 | 151.9× |
| 1,000 | 10% | 0.37 | 0.01 | 32.0× |
| 10,000 | 1% | 4.44 | 0.02 | 225.3× |
| 10,000 | 10% | 4.43 | 0.13 | 34.9× |
| 100,000 | 1% | 52.60 | 0.15 | 340.8× |
| 100,000 | 10% | 52.20 | 1.37 | 38.2× |

Run it with:
bazel run //benchmark/storage:composite_key_benchmark --copt=-Wno-implicit-function-declaration
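
To show why the two paths scale differently, the sketch below counts how many entries each path visits instead of timing them. It is not the benchmark from this PR: it uses a single std::map as a hypothetical stand-in for the LevelDB instance, and all names are assumptions. The O(log N) cost of the initial lower_bound is not included in the count.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct ScanResult {
  std::size_t matches = 0;  // records whose city is "Davis"
  std::size_t touched = 0;  // entries the scan had to visit
};

// User records and index markers share one sorted map (roughly 2N keys).
// The first davis_count records get "Davis"; the rest get "Other".
std::map<std::string, std::string> BuildStoreSketch(std::size_t n,
                                                    std::size_t davis_count) {
  std::map<std::string, std::string> store;
  for (std::size_t i = 0; i < n; ++i) {
    const std::string pk = "user:" + std::to_string(i);
    const std::string city = i < davis_count ? "Davis" : "Other";
    store[pk] = city;
    store[std::string("ck\0byCity\0", 10) + city + '\0' + pk] = "";
  }
  return store;
}

// OLD path: visit everything, hide index markers, filter on the value.
ScanResult OldPathSketch(const std::map<std::string, std::string>& store) {
  ScanResult r;
  const std::string ns("ck\0", 3);
  for (const auto& kv : store) {
    ++r.touched;
    if (kv.first.compare(0, ns.size(), ns) == 0) continue;
    if (kv.second == "Davis") ++r.matches;
  }
  return r;
}

// NEW path: jump to the prefix, walk matches, stop at the first non-match.
ScanResult NewPathSketch(const std::map<std::string, std::string>& store,
                         const std::string& prefix) {
  ScanResult r;
  for (auto it = store.lower_bound(prefix); it != store.end(); ++it) {
    ++r.touched;
    if (it->first.compare(0, prefix.size(), prefix) != 0) break;
    ++r.matches;
  }
  return r;
}
```

With N = 1,000 and 10% selectivity, both paths find the same 100 records, but the old path visits every key in the store while the new path visits the 100 matches plus the one key that ends the walk.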

Why benchmark at the storage layer

The composite-keys feature is a storage-layer optimization. Consensus, networking, and the executor are unchanged. Consensus adds a constant per-request overhead, so an end-to-end benchmark would be measuring consensus variance more than the storage gains.

End-to-end measurement makes more sense once Phase 2 (executor) and Phase 3 (client) are in place. That benchmark will live in the Phase 3 PR.

Files touched

New files:

  • chain/storage/composite_key_codec.h
  • chain/storage/composite_key_codec.cpp
  • chain/storage/composite_key_codec_test.cpp
  • benchmark/storage/BUILD
  • benchmark/storage/composite_key_benchmark.cpp

Modified files (all in chain/storage/):

  • storage.h — 4 pure virtual methods added to the interface
  • leveldb.h — 4 method declarations
  • leveldb.cpp — 4 method implementations + filter in GetAllItems/GetKeyRange
  • memory_db.h — 4 method declarations + ck_map_ member
  • memory_db.cpp — 4 method implementations
  • kv_storage_test.cpp — 6 new parametrized tests
  • BUILD — wire up the codec library + add deps to existing rules

Every change is contained within chain/storage/ and the new benchmark/storage/ directory. Consensus, networking, the executor, the client SDK, and protobuf definitions are all untouched.

Future work

  • Phase 2: add a KVRequest op for composite keys and route it through KVExecutor
  • Phase 3: client SDK + end-to-end benchmark. Once the executor knows about composite keys, the next step is letting actual applications use them. The client SDK would expose something like client.CreateCompositeKey(index_name, attributes, primary_key) and client.LookupByPrefix(index_name, attribute_prefix). After that's done, we can run a real benchmark over a running ResilientDB cluster and measure end-to-end speedup with real-world overhead baked in.

Refs #180


Implements Phase 1 + 2 of apache#180: codec, four storage methods on both
backends, scan filter, unit tests, and a microbenchmark. Executor and
client wiring follow in separate PRs.
@shrayarora8 shrayarora8 force-pushed the feature/composite-keys-storage branch from b1cfa50 to c8a8315 on May 12, 2026 02:10
@cjcchen
Contributor

cjcchen commented May 25, 2026

Please address the action issues above: build failed / UT failed.

@Bismanpal-Singh
Contributor

@shrayarora8 Please address the build issues before we merge in.
