Add composite keys for secondary indexes (refs #180)#254
Open
shrayarora8 wants to merge 2 commits into
Open
Conversation
) Implements Phase 1 + 2 of apache#180: codec, four storage methods on both backends, scan filter, unit tests, and a microbenchmark. Executor and client wiring follow in separate PRs.
b1cfa50 to
c8a8315
Compare
Contributor
|
Please address the action issues above: build failed/ UT failed. |
Contributor
|
@shrayarora8 Please address the build issues before we merge in. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
This is the first part of work on the composite keys feature from issue #180. It adds the storage layer primitives that make secondary attribute lookups O(log N + M) instead of O(N).
I split the work into 3 phases so each PR stays small and reviewable:
KVExecutorwiringEncoding
A composite key ("ck") is a single string with this layout:
"ck"namespace plus null delimiter (\0) (3 bytes total) keeps these entries from colliding with regular user keys. Even a user key like"ck_balance"is safe, it starts with"ck"but not with"ck\0".\0is used as the field separator because it's illegal in user supplied attribute strings, so it can never appear inside a field. The encoder rejects inputs that contain it.""). The key itself encodes everything we need.Why is the value empty?
The composite key is an index entry, not user data. The user's actual data is stored separately under their primary key (e.g. via
SetValueWithVersion("user:1", <data>, 0)). The composite key's job is just to record the existence of an(attribute → primary_key)association so that later, when someone asks "give me everyone whose city is Davis", we can find all the matching primary keys without scanning everything. The onus is on the client to create the composite key on the secondary attribute (s).The workflow looks like this:
SetValueWithVersion("user:1", <data>, 0)CreateCompositeKey(EncodeCompositeKey("byCity", {"Davis"}, "user:1"))GetByCompositeKeyPrefix(EncodeCompositeKeyPrefix("byCity", {"Davis"}))"ck\0byCity\0Davis\0"GetValue(primary_key)for the dataThe composite key string already encodes everything we need (index name, attribute values, primary key). Storing anything in the value would just be duplication. LevelDB sorts by key, not value . Keeping it empty also keeps the index cheap on disk.
For a prefix scan, the codec builds the same string up to (and including) the trailing
\0after the last attribute. That makes it a strict byte prefix of every key we want to match.Architecture / flow
Composite keys live in the storage layer only for this PR.
Write path:
Read path:
Both backends use sorted-order data structures, so
Seek/lower_boundjumps to the first candidate key, and the iteration stops at the first non-match.UpdateCompositeKeyis implemented as an atomic delete + insert. On LevelDB this usesWriteBatchso we never end up in a state where the old key is gone but the new one didn't make it in.Filter for user-facing scans
GetAllItems()andGetKeyRange()had to be tweaked so they don't return composite-key markers to applications that just want their own data. The filter is:I tested explicitly that:
GetAllItems"ck_balance"is NOT hidden (the filter checks"ck\0", not"ck")GetByCompositeKeyPrefix— they're hidden, not deletedFor
MemoryDB, no filter is needed: composite keys live in a separatestd::map<std::string, std::string> ck_map_, so they can't leak intoGetAllItemsby construction.Tests
composite_key_codec_test.cpp(5 tests)RoundTrip— Encode("byOwner", ["alice", "active"], "user:1"), then decode the result and check I get back the same index name, the attributes in the same order, and the same primary key. Basic correctness check.RejectDelimInInput— If any input field contains\0, encoding must refuse and return"". Otherwise you'd produce a key that can't be decoded. Side note: C-string literals like"in\0dex"truncate at the null byte and never actually trigger this path, so the test builds the strings explicitly withstd::string("in") + kCompositeKeyDelim + "dex".PrefixIsStrictBytePrefix— For any(idx, attrs),EncodeCompositeKeyPrefix(idx, attrs)must be a strict byte prefix ofEncodeCompositeKey(idx, attrs, any_pk). This is the property that makes the LevelDBSeek+ walk pattern correct — if the prefix wasn't an exact byte prefix,Seekcould land in the wrong spot.DecodeMalformed— Garbage inputs (missing namespace, only one field after the namespace) returnfalseinstead of crashing or returning wrong data.EmptyAttributes— Encoding/decoding still works with zero attributes (justindex_name+primary_key). Unusual case but allowed by the format, so it has to work.kv_storage_test.cpp(6 new parametrized tests)Each runs against all three backends — MemoryDB, ResLevelDB, ResLevelDB-with-block-cache — so 18 cases total.
CreateAndRetrieveCompositeKey— Insert three Davis users, prefix-scan, verify all three come back.PrefixScanOrdering— Insert keys out of order (SF, Davis-1, Davis-2, NYC), then do a Davis prefix scan. The result must be[Davis-1, Davis-2]in that exact order. This matters for BFT determinism as every honest replica running the same scan against the same state has to produce the same byte-for-byte output, otherwise replicas would diverge during consensus.DeleteRemovesEntry— Create, delete, prefix-scan returns 0 results. ConfirmsDeleteCompositeKeyactually removes the marker.UpdateIsAtomic— Move user:1 from Davis to SF viaUpdateCompositeKey. After the call, Davis prefix has 0 entries and SF prefix has 1 entry. On LevelDB this is enforced withWriteBatch.EmptyPrefixScanReturnsNothing— Insert Davis keys, scan for NYC prefix, get 0 results. Confirms no false positives — theSeek+ walk pattern doesn't accidentally pick up keys from other prefixes.GetAllItemsExcludesCompositeKeys— The filter test. Setup: insert two regular user keys (user:1,user:2), one user key that starts with"ck"but is real data (ck_balance— important edge case), and two composite key markers. ThenGetAllItems()must return exactly the 3 real keys (includingck_balance) and exclude the markers. As a sanity check, I also verify the markers are still reachable viaGetByCompositeKeyPrefixthey're hidden from the user-facing API, not deleted.All 48 tests in
kv_storage_testpass (30 + 18 cases).Benchmark
benchmark/storage/composite_key_benchmark.cppmeasures the two paths head-to-head. Setup for each(N, selectivity)cell:/tmpuser:0,user:1, ...,user:N-1. The firstselectivity * Nrecords get value"Davis"; the rest get"Other".byCity. So LevelDB ends up with roughly 2N keys.The two paths being measured:
GetAllItems()returns every user record. filtervalue == "Davis"in C++ code. O(N), has to touch every record.GetByCompositeKeyPrefix("byCity\0Davis\0")returns just the matching index entries. O(log N + M),Seeklands at the prefix in O(log N), then walks M matches.Sample run on Apple M2:
| Records | Selectivity | OLD (ms) | NEW (ms) | Speedup |
|---|---|---|---|---|
| 1,000 | 1% | 0.39 | 0.00 | 151.9× |
| 1,000 | 10% | 0.37 | 0.01 | 32.0× |
| 10,000 | 1% | 4.44 | 0.02 | 225.3× |
| 10,000 | 10% | 4.43 | 0.13 | 34.9× |
| 100,000 | 1% | 52.60 | 0.15 | 340.8× |
| 100,000 | 10% | 52.20 | 1.37 | 38.2× |
Run it with:
Why benchmark at the storage layer
The composite-keys feature is a storage-layer optimization. Consensus, networking, and the executor are unchanged. Consensus adds a constant per request overhead, so an end to end benchmark would be measuring consensus variance more than the storage gains.
End-to-end measurement makes more sense once Phase 3 (executor) and Phase 4 (client) are in place. That benchmark will live in the Phase 4 PR.
Files touched
New files:
chain/storage/composite_key_codec.hchain/storage/composite_key_codec.cppchain/storage/composite_key_codec_test.cppbenchmark/storage/BUILDbenchmark/storage/composite_key_benchmark.cppModified files (all in
chain/storage/):storage.h— 4 pure virtual methods added to the interfaceleveldb.h— 4 method declarationsleveldb.cpp— 4 method implementations + filter inGetAllItems/GetKeyRangememory_db.h— 4 method declarations +ck_map_membermemory_db.cpp— 4 method implementationskv_storage_test.cpp— 6 new parametrized testsBUILD— wire up the codec library + add deps to existing rulesEvery change is contained within
chain/storage/and the newbenchmark/storage/directory. Consensus, networking, the executor, the client SDK, and protobuf definitions are all untouched.Future work
KVRequestop for composite keys and route it throughKVExecutorclient.CreateCompositeKey(index_name, attributes, primary_key)andclient.LookupByPrefix(index_name, attribute_prefix). After that's done, we can run a real benchmark over a running ResilientDB cluster and measure end-to-end speedup with real world overhead baked.Refs #180