This repository was archived by the owner on Oct 10, 2025. It is now read-only.

[Bug] OverflowFile::checkpoint() corrupts PrimaryKeyIndexStorageInfo when no data has been written #6045

@1amageek

Description

Creating an HNSW vector index without inserting any data leaves the database in a state where the checkpoint completes successfully but writes corrupted metadata. Reopening the database then fails with an assertion error in hash_index.cpp:487.

Environment

  • Kuzu Version: v0.11.1
  • Platform: All platforms (macOS, iOS, Linux, etc.)
  • Language Bindings: C++ (core), Swift (wrapper)

Minimal Reproduction

#include "main/kuzu.h"
#include <memory>  // std::make_unique
using namespace kuzu::main;

int main() {
    // 1. Create database and table with vector column
    auto db = std::make_unique<Database>("test.db");
    auto conn = std::make_unique<Connection>(db.get());
    
    conn->query("CREATE NODE TABLE Item(id STRING PRIMARY KEY, embedding FLOAT[3])");
    
    // 2. Create vector index (no data inserted)
    conn->query("CALL CREATE_VECTOR_INDEX('Item', 'item_idx', 'embedding', metric := 'l2')");
    
    // 3. Close database
    conn.reset();
    db.reset();
    
    // 4. Reopen database
    auto db2 = std::make_unique<Database>("test.db");  // ❌ ASSERTION FAILURE
    
    return 0;
}

Expected: Database reopens successfully
Actual: Assertion failure at hash_index.cpp:487:

KU_ASSERT(hashIndexStorageInfo.overflowHeaderPage == INVALID_PAGE_IDX);

Root Cause

The bug is in src/storage/overflow_file.cpp:236 in OverflowFile::checkpoint():

void OverflowFile::checkpoint(PageAllocator& pageAllocator) {
    KU_ASSERT(fileHandle);
    if (headerPageIdx == INVALID_PAGE_IDX) {
        // ❌ BUG: Allocates page even when no data has been written
        this->headerPageIdx = getNewPageIdx(&pageAllocator);
        headerChanged = true;
    }
    ...
}

What happens:

  1. VectorIndex creation creates a PrimaryKeyIndex (for STRING primary key id)
  2. PrimaryKeyIndex creates an OverflowFile (for strings >12 bytes) with headerPageIdx = INVALID_PAGE_IDX
  3. During checkpoint, OverflowFile::checkpoint() unconditionally allocates a page (e.g., page 1) even though no data has been written
  4. This sets PrimaryKeyIndexStorageInfo.overflowHeaderPage = 1 (should be INVALID_PAGE_IDX)
  5. The corrupted metadata is serialized to disk
  6. On database reopen, PrimaryKeyIndex constructor hits the assertion:
    if (hashIndexStorageInfo.firstHeaderPage == INVALID_PAGE_IDX) {
        // firstHeaderPage = INVALID, but overflowHeaderPage = 1 ❌
        KU_ASSERT(hashIndexStorageInfo.overflowHeaderPage == INVALID_PAGE_IDX);
    }
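
The sequence above can be reproduced in a small standalone model. This is only a sketch: `ToyIndexStorageInfo`, `ToyOverflowFile`, `INVALID_PAGE`, `isConsistent`, and `checkpointEmptyIndex` are hypothetical stand-ins invented for illustration, not Kuzu's actual types, and the page numbering is assumed to start handing out page 1 as in the example in step 3.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for illustration only; NOT Kuzu's real definitions.
using page_idx_t = std::uint32_t;
constexpr page_idx_t INVALID_PAGE = std::numeric_limits<page_idx_t>::max();

// Toy version of the two fields the reopen-time assertion compares.
struct ToyIndexStorageInfo {
    page_idx_t firstHeaderPage = INVALID_PAGE;
    page_idx_t overflowHeaderPage = INVALID_PAGE;
};

// Toy overflow file that mirrors the behaviour described in the issue:
// checkpoint allocates a header page whenever none exists yet.
struct ToyOverflowFile {
    page_idx_t headerPageIdx = INVALID_PAGE;
    page_idx_t nextFreePage = 1;  // assumption: page 1 is the next free page

    void checkpointUnconditional() {
        if (headerPageIdx == INVALID_PAGE) {
            headerPageIdx = nextFreePage++;
        }
    }
};

// The invariant checked on reopen: an index without a header page
// must not reference an overflow header page either.
bool isConsistent(const ToyIndexStorageInfo& info) {
    if (info.firstHeaderPage == INVALID_PAGE) {
        return info.overflowHeaderPage == INVALID_PAGE;
    }
    return true;
}

// Simulates checkpointing an index that never received any data.
ToyIndexStorageInfo checkpointEmptyIndex() {
    ToyOverflowFile overflow;
    overflow.checkpointUnconditional();  // allocates a page despite no writes
    ToyIndexStorageInfo info;            // firstHeaderPage stays INVALID_PAGE
    info.overflowHeaderPage = overflow.headerPageIdx;
    return info;
}
```

In this model, `isConsistent(checkpointEmptyIndex())` returns false, which corresponds to the failing `KU_ASSERT` on reopen, while a default-constructed `ToyIndexStorageInfo` passes the check.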

Proposed Fix

Modify OverflowFile::checkpoint() to skip checkpoint when no data has been written, following the same pattern as NodeTable, RelTable, and other components:

void OverflowFile::checkpoint(PageAllocator& pageAllocator) {
    KU_ASSERT(fileHandle);
    // Skip checkpoint if no data has been written
    // headerChanged is set to true only when actual string data (>12 bytes) is written
    // via OverflowFileHandle::setStringOverflow()
    if (!headerChanged) {
        return;
    }
    if (headerPageIdx == INVALID_PAGE_IDX) {
        // Reserve a page for the header (only when data has actually been written)
        this->headerPageIdx = getNewPageIdx(&pageAllocator);
    }
    // ... rest of the function
}

Why this fix is correct:

  1. headerChanged is set to true only in OverflowFileHandle::setStringOverflow() when writing strings >12 bytes
  2. If headerChanged == false, then pageWriteCache is guaranteed to be empty (no data to flush)
  3. This follows the same design pattern as other checkpoint methods:
    • NodeTable::checkpoint(): if (!hasChanges) return;
    • RelTable::checkpoint(): if (!hasChanges) return;
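
The dirty-flag guard argued for above can be sketched in isolation. `GuardedOverflowFile`, `writeString`, `INLINE_LIMIT`, and the page counter are hypothetical names for this illustration only; the real `OverflowFile` has a different interface, and the 12-byte threshold is taken from the description in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT Kuzu's real definitions.
using page_idx_t = std::uint32_t;
constexpr page_idx_t INVALID_PAGE = std::numeric_limits<page_idx_t>::max();
constexpr std::size_t INLINE_LIMIT = 12;  // per the issue: strings >12 bytes overflow

class GuardedOverflowFile {
public:
    // Only strings longer than the inline limit reach overflow storage,
    // and only those writes mark the header as changed.
    void writeString(const std::string& s) {
        if (s.size() <= INLINE_LIMIT) {
            return;
        }
        pendingWrites.push_back(s);
        headerChanged = true;
    }

    // Guarded checkpoint: does nothing unless something was written.
    void checkpoint() {
        if (!headerChanged) {
            return;
        }
        if (headerPageIdx == INVALID_PAGE) {
            headerPageIdx = nextFreePage++;  // reserve the header page lazily
        }
        pendingWrites.clear();  // stands in for flushing the write cache
        headerChanged = false;
    }

    page_idx_t headerPage() const { return headerPageIdx; }
    bool hasPendingWrites() const { return !pendingWrites.empty(); }

private:
    page_idx_t headerPageIdx = INVALID_PAGE;
    page_idx_t nextFreePage = 1;  // assumption: page 1 is the next free page
    bool headerChanged = false;
    std::vector<std::string> pendingWrites;
};
```

With this guard, checkpointing a file that has seen no writes, or only short strings, leaves `headerPage()` at `INVALID_PAGE`; the header page is reserved only by the first checkpoint that follows an overflow write.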

Impact

Affected scenarios:

  • Creating a VectorIndex without inserting data
  • Any table with a STRING primary key whose values are all short (≤12 bytes), so nothing is ever written to the overflow file
  • Empty databases with indexes

Workaround (before fix):
Insert at least one record after creating the vector index, then run a manual CHECKPOINT:

conn->query("CREATE (i:Item {id: 'test', embedding: [1.0, 2.0, 3.0]})");
conn->query("CHECKPOINT");

This triggers hasStorageChanges=1, causing a second checkpoint that re-serializes metadata with correct values.

Additional Context

Design Pattern Violation:

All other components in Kuzu follow the pattern "skip checkpoint when no changes":

  • NodeTable::checkpoint() checks hasChanges
  • RelTable::checkpoint() checks hasChanges
  • OverflowFile::checkpoint() was the only exception that didn't check headerChanged

Benefits of the fix:

  1. ✅ No more metadata corruption
  2. ✅ Eliminates unnecessary disk I/O when reopening databases without changes
  3. ✅ Consistent with system-wide design pattern
  4. ✅ No workaround needed

Files to modify:

  • src/storage/overflow_file.cpp (lines 234-254)

Testing

After applying the fix, the minimal reproduction above should work correctly:

auto db2 = std::make_unique<Database>("test.db");  // ✅ Should succeed

I can submit a pull request with this fix if needed.
