Skip to content

feat: build BTree index using range based mode#274

Open
fangbo wants to merge 5 commits intolance-format:mainfrom
fangbo:improve-btree
Open

feat: build BTree index using range based mode#274
fangbo wants to merge 5 commits intolance-format:mainfrom
fangbo:improve-btree

Conversation

@fangbo
Copy link
Collaborator

@fangbo fangbo commented Feb 26, 2026

Close #250

@github-actions github-actions bot added the enhancement New feature or request label Feb 26, 2026
@fangbo fangbo force-pushed the improve-btree branch 5 times, most recently from ee78973 to 78aa7b6 Compare February 27, 2026 11:27
@fangbo
Copy link
Collaborator Author

fangbo commented Feb 27, 2026

@jackye1995 @hamersaw This PR is ready. Could you please take a look at it ? Thank you.

*/
def buildIndexType(method: String): IndexType = {
method.toLowerCase match {
method match {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're removing the toLowerCase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def buildForRange(rangeId: Int, rowsIter: Iterator[InternalRow]): Iterator[Unit] = {
// Initialize writer to write data to arrow stream
val allocator = LanceRuntime.allocator()
val data =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only closed if no data is written. Should it be closed in all cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , it should be closed when ArrowStreamWriter is closed. I will fixed it.

Comment on lines +453 to +456
val reader = new ArrowStreamReader(in, allocator)
val stream = ArrowArrayStream.allocateNew(allocator)

val dataset = IndexUtils.openDatasetWithOptions(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if openDatasetWithOptions throws an exception, we never close reader and stream

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create catch! openDatasetWithOptions should be invoked in try block to ensure that reader and stream will be closed finally.

val ident = addIndexExec.ident
val indexName = addIndexExec.indexName
val columns = addIndexExec.columns.toList
val zoneSize = addIndexExec.args.find(_.name == "zone_size").map(_.value.asInstanceOf[Long])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think use of asInstanceOf[Long] is not safe here and can error if the value is actually an Integer

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +485 to +487
stream.close()
reader.close()
dataset.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of these fail and throw an Exception, it would mask the original error.

Comment on lines 41 to 46
/**
* Physical execution of distributed CREATE INDEX (ALTER TABLE ... CREATE INDEX ...) for Lance datasets.
*
* This builds per-fragment indexes with the provided options, merges index metadata
* and commits an index-creation transaction.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still accurate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will modify the comments to distinguish between BTree indexes and other types of indexes.


writer.finish()

// No any row is written
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"No rows are written"?

*
* @param encodedReadOptions Configuration for Lance dataset access, serialized
* @param columns column names to index
* @param method Indexing method to use (e.g., "btree", "ivf_pq")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "ivf_pq" supported here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, "ivf_pq" is built in FragmentIndexTask. "btree" should be removed from the comment.

@fangbo
Copy link
Collaborator Author

fangbo commented Mar 13, 2026

@hamersaw Greatly thanks for your review! I have fixed the comments. Could you please look at it again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve distributed index building for btree

2 participants