This is a follow-up issue tracking improvements to the `map_zip_with` algorithm introduced in #3490.
Per @revans2's review of #3490, there is still scope for performance improvement. For very large columns (with wide maps), `thrust::cuda_cub::_uninitialized_fill()` was observed to take up too much of the query time. This likely stems from the calls to `cudf::detail::label_segments`, which runs over a sequence spanning every key-to-key comparison; that sequence can be huge when each row has a large number of keys.
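For intuition, here is a host-side sketch (a hypothetical simplification, not the actual libcudf kernel) of the cross-product matching the current approach implies for a single row. The nested loop is why the intermediate sequence grows with the product of the two map widths:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch: for one row, every key on the left is
// compared against every key on the right, so the intermediate comparison
// sequence has lhs_keys.size() * rhs_keys.size() entries. A segment-labeling
// pass over that sequence grows quadratically with map width.
std::vector<int> match_indices_cross_product(std::vector<int> const& lhs_keys,
                                             std::vector<int> const& rhs_keys)
{
  // For each left key: the index of the matching right key, or -1 if absent.
  std::vector<int> result(lhs_keys.size(), -1);
  for (std::size_t i = 0; i < lhs_keys.size(); ++i) {
    for (std::size_t j = 0; j < rhs_keys.size(); ++j) {  // O(n*m) comparisons per row
      if (lhs_keys[i] == rhs_keys[j]) {
        result[i] = static_cast<int>(j);
        break;
      }
    }
  }
  return result;
}
```

This per-row O(n·m) cost is negligible for narrow maps but dominates once rows carry many keys, which matches the profile observation above.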
The current implementation is a fairly straightforward algorithm tuned for small-to-medium-sized maps. Other approaches might be worth exploring:
- When matching keys per row, one approach could involve using `cuco`'s hash maps. This might prove faster for key comparison.
- While there is no guarantee that the keys are sorted, perhaps both sides could be sorted and matched with a scan, much like a join. This might make it challenging to match the ordering one sees in Spark. Further, the preprocessing risks eating into the performance gained from scanning.
- Maybe there is a way to avoid materializing the cross-product vector entirely. A quick implementation (with `thrust::upper_bound`) indicated that the current approach is faster.
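As a rough illustration of the first alternative, hash-based matching could look like the following host-side sketch, with `std::unordered_map` standing in for a GPU hash table such as cuco's. The function name and the first-match rule are assumptions for illustration, not the actual design:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the hash-based alternative: build a table from the
// right-side keys, then probe it with the left-side keys. Building is O(m)
// and each probe is expected O(1), so per-row work drops from O(n*m) to
// roughly O(n + m), at the cost of constructing the table.
std::vector<int> match_indices_hash(std::vector<int> const& lhs_keys,
                                    std::vector<int> const& rhs_keys)
{
  std::unordered_map<int, int> table;
  table.reserve(rhs_keys.size());
  // Insert right-side keys; emplace keeps the first occurrence (assumed
  // "first match wins" rule for duplicate keys).
  for (std::size_t j = 0; j < rhs_keys.size(); ++j) {
    table.emplace(rhs_keys[j], static_cast<int>(j));
  }
  std::vector<int> result(lhs_keys.size(), -1);
  for (std::size_t i = 0; i < lhs_keys.size(); ++i) {
    auto const it = table.find(lhs_keys[i]);
    if (it != table.end()) { result[i] = it->second; }
  }
  return result;
}
```

On the GPU, the table build and probe would be bulk operations over all rows at once, but the asymptotic trade-off is the same: extra build cost in exchange for avoiding the quadratic comparison sequence.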
There might be a need to switch between algorithms based on a heuristic: e.g., for maps that are small on average, use the current implementation; for larger maps, switch to the hash-table approach. Some string algorithms in cuDF do exactly this.
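Such a dispatch might be sketched as follows; the enum, function name, and threshold are all hypothetical, purely to illustrate the shape of the heuristic:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size-based dispatch between the two strategies discussed above.
enum class map_zip_algo { cross_product, hash_table };

map_zip_algo choose_algorithm(std::size_t total_keys, std::size_t num_rows)
{
  // Hypothetical tuning point; a real threshold would come from benchmarking.
  constexpr std::size_t avg_keys_threshold = 32;
  if (num_rows == 0) { return map_zip_algo::cross_product; }
  return (total_keys / num_rows <= avg_keys_threshold)
           ? map_zip_algo::cross_product
           : map_zip_algo::hash_table;
}
```

The average map width is cheap to compute from the offsets column (total keys divided by row count), so the dispatch itself adds negligible overhead.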