[Bug] OverflowFile::checkpoint() corrupts PrimaryKeyIndexStorageInfo when no data has been written #6045
Description
When an HNSW vector index is created without inserting any data, the database checkpoint completes successfully but writes corrupted metadata. Reopening the database then fails with an assertion error in hash_index.cpp:487.
Environment
- Kuzu Version: v0.11.1
 - Platform: All platforms (macOS, iOS, Linux, etc.)
 - Language Bindings: C++ (core), Swift (wrapper)
 
Minimal Reproduction
#include <memory>
#include "main/kuzu.h"
using namespace kuzu::main;
int main() {
    // 1. Create database and table with vector column
    auto db = std::make_unique<Database>("test.db");
    auto conn = std::make_unique<Connection>(db.get());
    
    conn->query("CREATE NODE TABLE Item(id STRING PRIMARY KEY, embedding FLOAT[3])");
    
    // 2. Create vector index (no data inserted)
    conn->query("CALL CREATE_VECTOR_INDEX('Item', 'item_idx', 'embedding', metric := 'l2')");
    
    // 3. Close database
    conn.reset();
    db.reset();
    
    // 4. Reopen database
    auto db2 = std::make_unique<Database>("test.db");  // ❌ ASSERTION FAILURE
    
    return 0;
}
Expected: Database reopens successfully
Actual: Assertion failure at hash_index.cpp:487:
KU_ASSERT(hashIndexStorageInfo.overflowHeaderPage == INVALID_PAGE_IDX);
Root Cause
The bug is in src/storage/overflow_file.cpp:236 in OverflowFile::checkpoint():
void OverflowFile::checkpoint(PageAllocator& pageAllocator) {
    KU_ASSERT(fileHandle);
    if (headerPageIdx == INVALID_PAGE_IDX) {
        // ❌ BUG: Allocates page even when no data has been written
        this->headerPageIdx = getNewPageIdx(&pageAllocator);
        headerChanged = true;
    }
    ...
}
What happens:
- VectorIndex creation creates a PrimaryKeyIndex (for STRING primary key id)
- PrimaryKeyIndex creates an OverflowFile (for strings >12 bytes) with headerPageIdx = INVALID_PAGE_IDX
- During checkpoint, OverflowFile::checkpoint() unconditionally allocates a page (e.g., page 1) even though no data has been written
- This sets PrimaryKeyIndexStorageInfo.overflowHeaderPage = 1 (should be INVALID_PAGE_IDX)
- The corrupted metadata is serialized to disk
- On database reopen, the PrimaryKeyIndex constructor hits the assertion:
  if (hashIndexStorageInfo.firstHeaderPage == INVALID_PAGE_IDX) {
      // firstHeaderPage = INVALID, but overflowHeaderPage = 1 ❌
      KU_ASSERT(hashIndexStorageInfo.overflowHeaderPage == INVALID_PAGE_IDX);
  }
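The sequence above can be sketched with a small standalone model. This is only an illustration of the invariant being violated, not Kuzu code: `ToyStorageInfo`, `ToyPageAllocator`, `buggyOverflowCheckpoint`, and `passesReopenCheck` are hypothetical names, and the assumption that the first free page is page 1 simply matches the example in the list.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Toy stand-ins for illustration only; these are NOT Kuzu's real types.
using page_idx_t = std::uint32_t;
constexpr page_idx_t INVALID_PAGE_IDX = std::numeric_limits<page_idx_t>::max();

// Hypothetical, reduced version of the index metadata described in the issue.
struct ToyStorageInfo {
    page_idx_t firstHeaderPage = INVALID_PAGE_IDX;
    page_idx_t overflowHeaderPage = INVALID_PAGE_IDX;
};

// Hypothetical allocator that hands out consecutive page indices.
struct ToyPageAllocator {
    page_idx_t nextPage = 1; // assumption: page 1 is the first free page
    page_idx_t allocate() { return nextPage++; }
};

// Mirrors the reported behavior: a header page is reserved
// even though nothing has been written to the overflow file.
inline void buggyOverflowCheckpoint(ToyStorageInfo& info, ToyPageAllocator& allocator) {
    if (info.overflowHeaderPage == INVALID_PAGE_IDX) {
        info.overflowHeaderPage = allocator.allocate();
    }
}

// The condition the reopen-time assertion enforces: an index with no
// header page must not have an overflow header page either.
inline bool passesReopenCheck(const ToyStorageInfo& info) {
    if (info.firstHeaderPage == INVALID_PAGE_IDX) {
        return info.overflowHeaderPage == INVALID_PAGE_IDX;
    }
    return true;
}

// Walks the sequence for an empty index: create, checkpoint, reopen.
inline bool emptyIndexSurvivesBuggyCheckpoint() {
    ToyStorageInfo info;
    ToyPageAllocator allocator;
    assert(passesReopenCheck(info)); // freshly created metadata is consistent
    buggyOverflowCheckpoint(info, allocator);
    return passesReopenCheck(info);
}
```

In this model `emptyIndexSurvivesBuggyCheckpoint()` returns `false`, which corresponds to the assertion firing when the database is reopened.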
 
Proposed Fix
Modify OverflowFile::checkpoint() to skip checkpoint when no data has been written, following the same pattern as NodeTable, RelTable, and other components:
void OverflowFile::checkpoint(PageAllocator& pageAllocator) {
    KU_ASSERT(fileHandle);
    // Skip checkpoint if no data has been written
    // headerChanged is set to true only when actual string data (>12 bytes) is written
    // via OverflowFileHandle::setStringOverflow()
    if (!headerChanged) {
        return;
    }
    if (headerPageIdx == INVALID_PAGE_IDX) {
        // Reserve a page for the header (only when data has actually been written)
        this->headerPageIdx = getNewPageIdx(&pageAllocator);
    }
    // ... rest of the function
}
Why this fix is correct:
- headerChanged is set to true only in OverflowFileHandle::setStringOverflow() when writing strings >12 bytes
- If headerChanged == false, then pageWriteCache is guaranteed to be empty (no data to flush)
- This follows the same design pattern as other checkpoint methods:
  - NodeTable::checkpoint(): if (!hasChanges) return;
  - RelTable::checkpoint(): if (!hasChanges) return;
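The reasoning above can be checked against a minimal standalone model of the proposed early return. This is a sketch under stated assumptions, not the actual Kuzu implementation: `ToyOverflowFile`, `ToyPageAllocator`, `writeOverflowString`, and `pendingWrites` (standing in for pageWriteCache) are hypothetical, and resetting `headerChanged` after a flush is an assumption of the model.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT Kuzu's real types.
using page_idx_t = std::uint32_t;
constexpr page_idx_t INVALID_PAGE_IDX = std::numeric_limits<page_idx_t>::max();

// Hypothetical allocator that hands out consecutive page indices.
struct ToyPageAllocator {
    page_idx_t nextPage = 1; // assumption: page 1 is the first free page
    page_idx_t allocate() { return nextPage++; }
};

class ToyOverflowFile {
public:
    // Models writing a string too long to be stored inline
    // (>12 bytes in the issue's description).
    void writeOverflowString(const std::string& value) {
        pendingWrites.push_back(value);
        headerChanged = true;
    }

    // Returns true if the checkpoint did any work.
    bool checkpoint(ToyPageAllocator& allocator) {
        if (!headerChanged) {
            // Nothing was written, so there is nothing to flush and
            // headerPageIdx must stay untouched.
            assert(pendingWrites.empty());
            return false;
        }
        if (headerPageIdx == INVALID_PAGE_IDX) {
            // Reserve a header page only once data actually exists.
            headerPageIdx = allocator.allocate();
        }
        pendingWrites.clear(); // stands in for flushing cached pages
        headerChanged = false; // assumption of this model
        return true;
    }

    page_idx_t getHeaderPageIdx() const { return headerPageIdx; }

private:
    page_idx_t headerPageIdx = INVALID_PAGE_IDX;
    bool headerChanged = false;
    std::vector<std::string> pendingWrites;
};
```

With this model, checkpointing a file that never received a write leaves `getHeaderPageIdx()` at `INVALID_PAGE_IDX`, while a file that received one long string gets a header page on its first checkpoint and does no work on a second one.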
 
Impact
Affected scenarios:
- Creating a VectorIndex without inserting data
 - Any table with a STRING primary key whose values are all short (≤12 bytes), so nothing is written to the overflow file
 - Empty databases with indexes
 
Workaround (before fix):
Insert at least one record after creating the vector index, then run a manual CHECKPOINT:
conn->query("CREATE (i:Item {id: 'test', embedding: [1.0, 2.0, 3.0]})");
conn->query("CHECKPOINT");
This triggers hasStorageChanges=1, causing a second checkpoint that re-serializes metadata with correct values.
Additional Context
Design Pattern Violation:
All other components in Kuzu follow the pattern "skip checkpoint when no changes":
- NodeTable::checkpoint() checks hasChanges
- RelTable::checkpoint() checks hasChanges
- OverflowFile::checkpoint() was the only exception that didn't check headerChanged
Benefits of the fix:
- ✅ No more metadata corruption
 - ✅ Eliminates unnecessary disk I/O when reopening databases without changes
 - ✅ Consistent with system-wide design pattern
 - ✅ No workaround needed
 
Files to modify:
- src/storage/overflow_file.cpp (lines 234-254)
Testing
After applying the fix, the minimal reproduction above should work correctly:
auto db2 = std::make_unique<Database>("test.db");  // ✅ Should succeed
I can submit a pull request with this fix if needed.