Skip to content

Conversation

bgillesp
Copy link
Collaborator

@bgillesp bgillesp commented Oct 6, 2025

PR implements a small binary to construct an HNSW graph from plaintext iris code input, with output serialized to file using the standard "single-graph" binary output format. This binary will be used primarily for upcoming data analysis tasks.

@bgillesp bgillesp self-assigned this Oct 6, 2025
@bgillesp bgillesp requested review from mcalancea and iliailia October 6, 2025 03:23
Copy link
Collaborator

@mcalancea mcalancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality looks very similar to generate_benchmark_data. I know there are some differences, but I feel we should reconcile the abstractions and have a single implementation. init_test_dbs also has overlapping functionality, which adds to the confusion.

Of course, it can be rectified later if data analysis needs this specific implementation now.

}

info!("Opening iris codes input stream");
let file = File::open(args.iris_codes_path.as_path()).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could repurpose the read_json function from py_bindings/io.rs

let mut graph = GraphMem::new();
let prf_seed = (args.hnsw_prf_key as u128).to_le_bytes();

for json_ptxt in stream {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you initialize PlaintextStore, then you can simply call PlaintextStore::generate_graph to do the rest.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to change it slightly to emit info! events.

@bgillesp bgillesp requested a review from mcalancea October 6, 2025 22:55
@bgillesp
Copy link
Collaborator Author

bgillesp commented Oct 6, 2025

Thanks for the input! @mcalancea definitely this is intended as a one-off binary that we need for short-term usage. Agreed that we should eventually have a utility that combines this with the functionality provided by the generate_benchmark_data and init_test_dbs binaries. I did revise to integrate it a bit better with existing utilities to keep the implementation a little lighter.

@bgillesp bgillesp force-pushed the bg/construct-graph-ptxt branch from 478c09b to 934b04f Compare October 7, 2025 00:12
@bgillesp bgillesp changed the base branch from main to dev October 7, 2025 00:12
Copy link
Collaborator

@iliailia iliailia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good idea to use the parallel insert algorithm.

@bgillesp bgillesp changed the title Implement construct-graph-ptxt binary [POP-2992] Implement construct-graph-ptxt binary Oct 7, 2025
@bgillesp bgillesp merged commit 1e0c9c0 into dev Oct 7, 2025
13 of 14 checks passed
@bgillesp bgillesp deleted the bg/construct-graph-ptxt branch October 7, 2025 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants