Fix: ViewType gc on huge batch would produce bad output #8694
Conversation
| .map(|i| unsafe { self.copy_view_to_buffer(i, &mut data_buf) }) | ||
| .collect(); | ||
| for view in self.views() { | ||
| let len = *view as u32; |
This part is quite slow, but it is correct. I can make it faster (by handling the values via grouping or batching) if required.
I am not sure how you would make this much faster - I think the code needs to find the locations to split in any event
Even if the total buffer size is greater than i32::MAX, a single input buffer may be much smaller than i32::MAX, so this could find the split points batch-by-batch rather than adding small buffers one by one?
I see -- you are saying you could potentially optimize by looking at each input buffer and gcing it individually.
That would be interesting, though it would probably take a lot of care to make it fast.
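The splitting logic being discussed can be sketched roughly as follows. This is an illustrative simplification, not the PR's actual code: `GcCopyGroup` is the type the PR introduces, but `split_into_groups`, its fields' exact semantics, and the `MAX_INLINE_VIEW_LEN` constant here are assumptions for the sketch.

```rust
// Simplified sketch: walk the view lengths and start a new group whenever
// adding the next non-inlined value would push the group's data buffer
// past the cap (i32::MAX bytes in the real code, the Arrow offset limit).

const MAX_INLINE_VIEW_LEN: u32 = 12; // values <= 12 bytes are stored inline in the view

#[derive(Debug, PartialEq)]
struct GcCopyGroup {
    total_buffer_bytes: usize, // bytes of non-inlined data in this group
    total_len: usize,          // number of views covered by this group
}

fn split_into_groups(view_lens: &[u32], max_bytes: usize) -> Vec<GcCopyGroup> {
    let mut groups = Vec::new();
    let mut cur_bytes = 0usize;
    let mut cur_len = 0usize;
    for &len in view_lens {
        // Inlined values contribute no bytes to the data buffer.
        let bytes = if len > MAX_INLINE_VIEW_LEN { len as usize } else { 0 };
        if bytes > 0 && cur_bytes + bytes > max_bytes {
            groups.push(GcCopyGroup { total_buffer_bytes: cur_bytes, total_len: cur_len });
            cur_bytes = 0;
            cur_len = 0;
        }
        cur_bytes += bytes;
        cur_len += 1;
    }
    groups.push(GcCopyGroup { total_buffer_bytes: cur_bytes, total_len: cur_len });
    groups
}

fn main() {
    // Two 60-byte values overflow a 100-byte cap, so a second group starts.
    let groups = split_into_groups(&[60, 5, 60], 100);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[0], GcCopyGroup { total_buffer_bytes: 60, total_len: 2 });
    assert_eq!(groups[1], GcCopyGroup { total_buffer_bytes: 60, total_len: 1 });
}
```

The point of the single pass is that the split locations must be found anyway; the grouping only decides how the copied data is distributed across output buffers.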
@alamb Besides, I hit this bug when I had a 4GiB StringViewArray. arrow-rs treats the offset as u32; however, the Arrow standard uses i32, so I limit it to 2GiB. There are other places that use …
| }; | ||
| vec![gc_copy_group] | ||
| }; | ||
| assert!(gc_copy_groups.len() <= i32::MAX as usize); |
This assertion can be removed; I just want to ensure it would pass.
Maybe change it to a debug assert here.
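For reference, `debug_assert!` checks the invariant in debug and test builds but compiles to nothing in release builds, so it costs nothing in production. A minimal illustration (the function name here is hypothetical):

```rust
fn checked_group_count(groups_len: usize) -> usize {
    // Checked only in debug builds; release builds skip this entirely.
    debug_assert!(groups_len <= i32::MAX as usize);
    groups_len
}

fn main() {
    assert_eq!(checked_group_count(3), 3);
}
```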
alamb
left a comment
Thank you @mapleFU -- this is a good find. I left some comments, let me know what you think
cc @zhuqi-lucas perhaps you have some thoughts
The MIRI test is probably failing due to the massive memory use in https://github.com/apache/arrow-rs/actions/runs/18818674867/job/53690752815?pr=8694. I suggest we don't run that test under miri by disabling it with something like `#[cfg_attr(miri, ignore)] // Takes too long`.
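The attribute is applied on the test itself; a small sketch (the test name and body here are hypothetical, not the PR's actual test):

```rust
#[cfg(test)]
mod tests {
    #[test]
    // Skipped when running `cargo miri test`, executed normally otherwise.
    #[cfg_attr(miri, ignore)] // Takes too long under miri
    fn gc_huge_batch() {
        let big = vec![0u8; 1 << 20];
        assert_eq!(big.len(), 1 << 20);
    }
}

fn main() {}
```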
🤖: Benchmark completed
zhuqi-lucas
left a comment
Nice finding, thank you!
| } | ||
| &groups | ||
| } else { | ||
| one_group.as_slice() |
I've added a `one_group` for the gc-copy-group on the stack; hopefully this avoids an allocation:

```rust
let one_group = [GcCopyGroup {
    total_buffer_bytes: total_large,
    total_len: len,
}];
```
@alamb I've run the new code on macOS (M4 Pro). I don't know whether the performance difference comes from an unstable environment or something else.

This patch: …
main: …
alamb
left a comment
Thank you @mapleFU - I went through this again and it looks good to me.
I would like to run the benchmarks one more time before merging this (I have queued them up on my machine so hopefully they'll post to this PR in a few hours)
🤖: Benchmark completed
| if len > MAX_INLINE_VIEW_LEN { | ||
| if current_length + len > i32::MAX as u32 { | ||
| // Start a new group | ||
| groups.push(GcCopyGroup { |
I think you can preallocate `groups` (with `Vec::with_capacity`) or use `.collect`
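Preallocating with `Vec::with_capacity` avoids repeated grow-and-copy steps during the push loop; a minimal sketch (the function and its capacity argument are illustrative, not the PR's code):

```rust
// One allocation up front instead of repeated reallocations as the
// Vec grows; push never has to reallocate when the estimate is right.
fn build_groups(n: usize) -> Vec<usize> {
    let mut groups = Vec::with_capacity(n);
    for i in 0..n {
        groups.push(i);
    }
    groups
}

fn main() {
    let g = build_groups(4);
    assert_eq!(g, vec![0, 1, 2, 3]);
    assert!(g.capacity() >= 4);
}
```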
| for view_idx in current_view_idx..current_view_idx + gc_copy_group.total_len { | ||
| let view = | ||
| unsafe { self.copy_view_to_buffer(view_idx, group_idx as i32, &mut data_buf) }; | ||
| views_buf.push(view); |
This can use Vec::collect
views_buf is preallocated once; would Vec::collect allocate more than it currently does?
I think using collect doesn't allocate additional memory, but it can be faster than push because collect does the capacity check once where push has to check capacity on each call
However, since both Vecs are being modified at the same time, I couldn't figure out a way to use collect here -- I could use extend
I also think collect is possible here; I'll do it after work tonight.
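One way to use `collect` even though two Vecs are produced at once is to collect tuples and `unzip` them; a sketch with hypothetical element types (not the PR's actual view/data layout):

```rust
fn main() {
    // Each iteration yields a (view, byte) pair; unzip splits the
    // pairs into two Vecs in a single pass over the iterator.
    let (views, bytes): (Vec<u128>, Vec<u8>) =
        (0u8..4).map(|i| (i as u128 * 10, i)).unzip();
    assert_eq!(views, vec![0, 10, 20, 30]);
    assert_eq!(bytes, vec![0, 1, 2, 3]);
}
```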
| views_buf.push(view); | ||
| } | ||
| data_blocks.push(Buffer::from_vec(data_buf)); |
This could use collect as well.
alamb
left a comment
I went over this PR again and I think the benchmarks show it slowing down GC, which is not good. I have some ideas on how to get the performance back (along the lines of what @Dandandan suggested)
Here is a proposal to avoid the performance drop in this PR:

🤖: Benchmark completed

The performance results seem to confirm it avoids the performance regression.
Try and improve GC performance
mapleFU
left a comment
(Let's continue here)
| let data_blocks = vec![data_block]; | ||
| (views_buf, data_blocks) | ||
| } else { | ||
| // slow path, need to split into multiple buffers |
Would it be better if I extracted this into a new function?
I think so!
| (views_buf, data_blocks) | ||
| }; | ||
| // 5) Wrap up buffers |
Should this change?
to what?
| let views_buf: Vec<u128> = (0..len) | ||
| .map(|i| unsafe { self.copy_view_to_buffer(i, &mut data_buf) }) | ||
| .collect(); | ||
| let mut groups = Vec::with_capacity(total_large / (i32::MAX as usize) + 1); |
I think this might not be good; the allocation count on this slow path might not matter?
I think you are referring to this comment.
I didn't quite follow the concern, see mapleFU#1 (comment)
However, that being said, we can revert this part too -- I don't feel strongly about it, and I agree that the allocation likely doesn't matter on the slow path.
alamb
left a comment
I think this PR is now ready to go and doesn't regress performance. I started another benchmark just to make sure
Thanks @mapleFU -- I think we can hone it a bit more if you want (extract the slow path as a function, for example), but we could also do that as a follow-on.
🤖: Benchmark completed
I've just changed two lines here; it's ready to merge for me now.
🤖: Benchmark completed
Thanks again @mapleFU -- sorry this one took so long, but I think it is quite great now.
Which issue does this PR close?
Rationale for this change
Previously, gc() would produce a single buffer. However, for buffer sizes greater than 2GiB it would be buggy, since the buffer offset is a 4-byte signed integer.

What changes are included in this PR?

Add a GcCopyGroup type and do gc per group.
Are these changes tested?
Yes
Are there any user-facing changes?
gc would produce more buffers