
Optimizing for duplicate groupby-aggregate operations #8

@tedmiddleton

Description


For example, agg::stddev() involves calculating a mean, and agg::mean() involves calculating a sum. Likewise, agg::corr() ends up calculating two means. In the case of something like

auto gf = fr1.groupby(_1, _2);
gf.aggregate(sum(_3), sum(_4), mean(_3), mean(_4), stddev(_3), corr(_3, _4));

...how many times will we be summing up the elements of each group in _3 and _4? Naively, it would be 4 times for _3 (sum, mean, stddev, and corr) and 3 times for _4 (sum, mean, and corr), but it seems like we should be able to cut that down to once each for _3 and _4.
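One way to see the potential savings: for each group, a single accumulation pass over _3 and _4 yields enough running totals to answer every aggregate in that call. The sketch below only illustrates the arithmetic, not the library's implementation, and the names (group_stats, accumulate, etc.) are made up for this example.

#include <cmath>

// Hypothetical illustration: five running totals per group are enough to
// answer sum, mean, stddev, and corr for the pair (_3, _4) in one pass.
struct group_stats
{
    double n = 0, sum3 = 0, sum4 = 0, sumsq3 = 0, sumsq4 = 0, sumprod = 0;

    void accumulate(double x3, double x4)
    {
        n += 1;
        sum3 += x3;
        sum4 += x4;
        sumsq3 += x3 * x3;
        sumsq4 += x4 * x4;
        sumprod += x3 * x4;
    }

    // Every statistic below reuses the same totals instead of re-walking the group.
    double mean3() const { return sum3 / n; }
    double mean4() const { return sum4 / n; }
    double stddev3() const { return std::sqrt((sumsq3 - n * mean3() * mean3()) / (n - 1)); }
    double stddev4() const { return std::sqrt((sumsq4 - n * mean4() * mean4()) / (n - 1)); }
    double corr34() const
    {
        // Sample covariance from the cross-product total, then normalize.
        double cov = (sumprod - n * mean3() * mean4()) / (n - 1);
        return cov / (stddev3() * stddev4());
    }
};

Note that these textbook one-pass formulas can lose precision through cancellation when values are large relative to their variance; a Welford-style update is more stable but doesn't expose a plain reusable sum in the same way.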

I think the key here is to

  1. do the ops in passes: all sums first, then mins and maxes, then means, then stddevs, then regressions, and then corrs
  2. cache results in a dict keyed by the column names and check the dict before doing the calculation (see the sketch below)
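Here is a minimal sketch of idea 2, assuming a cache keyed by (operation, column index) rather than just the column name; everything in it (agg_cache, agg_key, get_or_compute) is hypothetical and not part of the existing API.

#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical cache keyed by (operation, column index). Each entry holds one
// value per group, so a result computed in an earlier pass can be reused.
enum class op { sum, mean, stddev };
using agg_key = std::pair<op, int>;

struct agg_cache
{
    std::map<agg_key, std::vector<double>> cached;

    // Return the cached per-group values, running compute() only on a miss.
    const std::vector<double>& get_or_compute(
        agg_key key, const std::function<std::vector<double>()>& compute)
    {
        auto it = cached.find(key);
        if (it == cached.end())
            it = cached.emplace(key, compute()).first;
        return it->second;
    }
};

With something like this, mean(_3) would first ask for {op::sum, 3} and only walk the group's elements if that entry is missing, so the sum over _3 is computed once no matter how many later aggregates depend on it.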

Whatever I do here, I have to make sure that I'm not accidentally making it slower with the dictionary lookups themselves.
