feat: build BTree index using range based mode#274
feat: build BTree index using range based mode#274fangbo wants to merge 5 commits intolance-format:mainfrom
Conversation
ee78973 to
78aa7b6
Compare
|
@jackye1995 @hamersaw This PR is ready. Could you please take a look at it ? Thank you. |
| */ | ||
| def buildIndexType(method: String): IndexType = { | ||
| method.toLowerCase match { | ||
| method match { |
There was a problem hiding this comment.
Is there a reason we're removing the toLowerCase?
There was a problem hiding this comment.
When AddIndexExec is built, the method is converted to lower case. See https://github.com/lance-format/lance-spark/pull/274/changes#diff-faa5d16fcbd2d8ced324add5e49a1403acaed197bc93b3ece1e3fc1f4d2fe6d7R43
| def buildForRange(rangeId: Int, rowsIter: Iterator[InternalRow]): Iterator[Unit] = { | ||
| // Initialize writer to write data to arrow stream | ||
| val allocator = LanceRuntime.allocator() | ||
| val data = |
There was a problem hiding this comment.
I think this is only closed if no data is written. Should it be closed in all cases?
There was a problem hiding this comment.
Yes , it should be closed when ArrowStreamWriter is closed. I will fixed it.
| val reader = new ArrowStreamReader(in, allocator) | ||
| val stream = ArrowArrayStream.allocateNew(allocator) | ||
|
|
||
| val dataset = IndexUtils.openDatasetWithOptions( |
There was a problem hiding this comment.
if openDatasetWithOptions throws an exception, we never close reader and stream
There was a problem hiding this comment.
Create catch! openDatasetWithOptions should be invoked in try block to ensure that reader and stream will be closed finally.
| val ident = addIndexExec.ident | ||
| val indexName = addIndexExec.indexName | ||
| val columns = addIndexExec.columns.toList | ||
| val zoneSize = addIndexExec.args.find(_.name == "zone_size").map(_.value.asInstanceOf[Long]) |
There was a problem hiding this comment.
I think use of asInstanceOf[Long] is not safe here and can error if the value is actually an Integer
There was a problem hiding this comment.
We always convert integer type to Long. Please see https://github.com/lance-format/lance-spark/blob/main/lance-spark-3.5_2.12/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/LanceSqlExtensionsAstBuilder.scala#L108
| stream.close() | ||
| reader.close() | ||
| dataset.close() |
There was a problem hiding this comment.
If any of these fail and throw an Exception, it would mask the original error.
| /** | ||
| * Physical execution of distributed CREATE INDEX (ALTER TABLE ... CREATE INDEX ...) for Lance datasets. | ||
| * | ||
| * This builds per-fragment indexes with the provided options, merges index metadata | ||
| * and commits an index-creation transaction. | ||
| */ |
There was a problem hiding this comment.
Is this comment still accurate?
There was a problem hiding this comment.
Yes, I will modify the comments to distinguish between BTree indexes and other types of indexes.
|
|
||
| writer.finish() | ||
|
|
||
| // No any row is written |
| * | ||
| * @param encodedReadOptions Configuration for Lance dataset access, serialized | ||
| * @param columns column names to index | ||
| * @param method Indexing method to use (e.g., "btree", "ivf_pq") |
There was a problem hiding this comment.
Is "ivf_pq" supported here?
There was a problem hiding this comment.
Actually, "ivf_pq" is built in FragmentIndexTask. "btree" should be removed from the comment.
|
@hamersaw Greatly thanks for your review! I have fixed the comments. Could you please look at it again? |
Close #250