
Optimizing for duplicate groupby-aggregate operations #8

@tedmiddleton

Description


For example, agg::stddev() involves calculating a mean, and agg::mean() involves calculating a sum. Likewise, agg::corr() ends up calculating two means. In the case of something like

auto gf = fr1.groupby(_1, _2);
gf.aggregate(sum(_3), sum(_4), mean(_3), mean(_4), stddev(_3), corr(_3, _4));

...how many times will we be summing up the elements of each group in _3 and _4? Naively, it would be 4 times for _3 (sum, mean, stddev, and corr) and 3 times for _4 (sum, mean, and corr), but it seems like we should be able to cut that down to once each for _3 and _4.
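One way to see the potential savings: for each group, a single accumulation pass over _3 and _4 yields enough running totals to answer every aggregate in that call. The sketch below only illustrates the arithmetic, not the library's implementation, and the names (group_stats, accumulate, etc.) are made up for this example.

#include <cmath>

// Hypothetical illustration: five running totals per group are enough to
// answer sum, mean, stddev, and corr for the pair (_3, _4) in one pass.
struct group_stats
{
    double n = 0, sum3 = 0, sum4 = 0, sumsq3 = 0, sumsq4 = 0, sumprod = 0;

    void accumulate(double x3, double x4)
    {
        n += 1;
        sum3 += x3;
        sum4 += x4;
        sumsq3 += x3 * x3;
        sumsq4 += x4 * x4;
        sumprod += x3 * x4;
    }

    // Every statistic below reuses the same totals instead of re-walking the group.
    double mean3() const { return sum3 / n; }
    double mean4() const { return sum4 / n; }
    double stddev3() const { return std::sqrt((sumsq3 - n * mean3() * mean3()) / (n - 1)); }
    double stddev4() const { return std::sqrt((sumsq4 - n * mean4() * mean4()) / (n - 1)); }
    double corr34() const
    {
        // Sample covariance from the cross-product total, then normalize.
        double cov = (sumprod - n * mean3() * mean4()) / (n - 1);
        return cov / (stddev3() * stddev4());
    }
};

Note that these textbook one-pass formulas can lose precision through cancellation when values are large relative to their variance; a Welford-style update is more stable but doesn't expose a plain reusable sum in the same way.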

I think the key here is to

  1. do the ops in passes: all sums first, then mins and maxes, then means, then stddevs, then regressions, and then corrs
  2. cache results in a dict keyed by the column names and check the dict before doing the calculation (see the sketch below)
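Here is a minimal sketch of idea 2, assuming a cache keyed by (operation, column index) rather than just the column name; everything in it (agg_cache, agg_key, get_or_compute) is hypothetical and not part of the existing API.

#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical cache keyed by (operation, column index). Each entry holds one
// value per group, so a result computed in an earlier pass can be reused.
enum class op { sum, mean, stddev };
using agg_key = std::pair<op, int>;

struct agg_cache
{
    std::map<agg_key, std::vector<double>> cached;

    // Return the cached per-group values, running compute() only on a miss.
    const std::vector<double>& get_or_compute(
        agg_key key, const std::function<std::vector<double>()>& compute)
    {
        auto it = cached.find(key);
        if (it == cached.end())
            it = cached.emplace(key, compute()).first;
        return it->second;
    }
};

With something like this, mean(_3) would first ask for {op::sum, 3} and only walk the group's elements if that entry is missing, so the sum over _3 is computed once no matter how many later aggregates depend on it.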

Whatever I do here, I have to make sure that I'm not accidentally making it slower with the dictionary lookups themselves.
