Investigate improvement to map_zip_with algorithm #3588

@mythrocks

Description

This is a follow-up issue to track performance improvements to the map_zip_with algorithm introduced in #3490.

Per @revans2's review of #3490, there is still scope for performance improvement. For very large columns (with wide maps), it was observed that thrust::cuda_cub::__uninitialized_fill() took up a disproportionate share of the query time. This likely comes from cudf::detail::label_segments, which is invoked over a sequence containing one element per key comparison; that sequence grows as the per-row product of the two sides' key counts, so it can become huge when each row has a large number of keys (see the sizing sketch below).
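
To make the blow-up concrete, here is a minimal sizing sketch. This is illustrative only, not code from #3490; the helper name and the per-row key-count inputs are assumptions. With one element per (lhs key, rhs key) pair in each row, 1M rows with 100 keys a side already yields 10^10 elements to label.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/tuple.h>

#include <cstdint>

// Per-row contribution to the labeled sequence: n1[i] * n2[i] comparison slots.
struct pair_count_fn {
  __device__ int64_t operator()(thrust::tuple<int32_t, int32_t> t) const
  {
    return static_cast<int64_t>(thrust::get<0>(t)) * thrust::get<1>(t);
  }
};

// Total length of the cross-product sequence that label_segments must fill.
int64_t total_comparisons(int32_t const* lhs_keys_per_row,  // device pointer
                          int32_t const* rhs_keys_per_row,  // device pointer
                          int32_t num_rows)
{
  auto zipped = thrust::make_zip_iterator(
    thrust::make_tuple(lhs_keys_per_row, rhs_keys_per_row));
  return thrust::transform_reduce(thrust::device,
                                  zipped,
                                  zipped + num_rows,
                                  pair_count_fn{},
                                  int64_t{0},
                                  thrust::plus<int64_t>{});
}
```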

The current implementation is a fairly straightforward algorithm tuned for small-to-medium-sized maps. Other approaches might be worth exploring:

  1. When matching keys per row, one approach could be to use cuco's hash maps; hashed lookups might prove faster than exhaustively comparing every key pair (see the sketch after this list).
  2. While the keys aren't guaranteed to be sorted, perhaps both sides could be sorted and matched with a scan, much like a sort-merge join. This might make it challenging to preserve the ordering one sees in Spark, and the preprocessing risks eating into the performance gained from scanning.
  3. Maybe the cross-product vector could be avoided entirely, though a quick prototype (using thrust::upper_bound) indicated that the current approach is still faster.
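
For option 1, a rough sketch of what hash-based matching could look like, assuming int32 keys, flattened key arrays with a per-key row label, and the legacy cuco::static_map bulk API. The function name, the inputs, and the composite-key trick are illustrative assumptions, not the actual implementation; since keys only compare equal within a row, the row label is folded into a 64-bit composite key.

```cpp
#include <cuco/static_map.cuh>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <cstdint>

// Keys only match within their own row, so fold the row label into a 64-bit
// composite key. (Assumes int32 keys; real code would need type dispatch or
// row hashing for arbitrary key types.)
__device__ inline int64_t composite_key(int32_t row, int32_t key)
{
  return (static_cast<int64_t>(row) << 32) | static_cast<uint32_t>(key);
}

struct lhs_pair_fn {
  int32_t const* row_labels;
  int32_t const* keys;
  __device__ cuco::pair<int64_t, int32_t> operator()(int32_t i) const
  {
    return cuco::make_pair(composite_key(row_labels[i], keys[i]), i);
  }
};

struct rhs_query_fn {
  int32_t const* row_labels;
  int32_t const* keys;
  __device__ int64_t operator()(int32_t i) const
  {
    return composite_key(row_labels[i], keys[i]);
  }
};

// For each rhs key, writes the flat index of the matching lhs key, or -1 when
// there is no match. (Hypothetical helper; streams/allocators omitted.)
void match_keys_hashed(int32_t const* lhs_rows, int32_t const* lhs_keys, int32_t num_lhs,
                       int32_t const* rhs_rows, int32_t const* rhs_keys, int32_t num_rhs,
                       int32_t* match_indices)
{
  // Row labels are non-negative, so an all-ones composite key is a safe empty
  // sentinel. Capacity is 2x the insert count for ~50% occupancy.
  cuco::static_map<int64_t, int32_t> map{static_cast<std::size_t>(num_lhs) * 2,
                                         cuco::empty_key<int64_t>{-1},
                                         cuco::empty_value<int32_t>{-1}};

  auto pairs = thrust::make_transform_iterator(thrust::make_counting_iterator<int32_t>(0),
                                               lhs_pair_fn{lhs_rows, lhs_keys});
  map.insert(pairs, pairs + num_lhs);

  auto queries = thrust::make_transform_iterator(thrust::make_counting_iterator<int32_t>(0),
                                                 rhs_query_fn{rhs_rows, rhs_keys});
  map.find(queries, queries + num_rhs, match_indices);
}
```

The matched indices would then drive the existing value-merging step; whether the hash build cost actually beats the cross-product comparison is exactly what this issue proposes to measure.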

There might be a need to switch between algorithms based on a heuristic: e.g. for maps that are small on average, use the current implementation; for larger maps, switch to the hash-table approach (a rough sketch follows below). Some strings algorithms in cuDF do exactly this.
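
A rough shape for such a dispatch, with hypothetical entry points and an untuned placeholder threshold (none of these names exist in the codebase):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>

#include <algorithm>
#include <memory>

// Hypothetical entry points for the two algorithms (names assumed):
std::unique_ptr<cudf::column> map_zip_with_cross_product(cudf::lists_column_view const& lhs,
                                                         cudf::lists_column_view const& rhs);
std::unique_ptr<cudf::column> map_zip_with_hashed(cudf::lists_column_view const& lhs,
                                                  cudf::lists_column_view const& rhs);

std::unique_ptr<cudf::column> map_zip_with(cudf::lists_column_view const& lhs,
                                           cudf::lists_column_view const& rhs)
{
  // Placeholder crossover point; the real threshold would come from benchmarks.
  constexpr double avg_keys_threshold = 32.0;

  auto const num_rows = std::max(lhs.size(), 1);
  auto const avg_keys =
    static_cast<double>(lhs.child().size() + rhs.child().size()) / (2.0 * num_rows);

  // Narrow maps: keep the current cross-product comparison.
  // Wide maps: switch to the hash-table approach.
  return avg_keys < avg_keys_threshold ? map_zip_with_cross_product(lhs, rhs)
                                       : map_zip_with_hashed(lhs, rhs);
}
```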
