[Data] Add percentiles and statistics aggregations to Ray Data #52588

Open
marwan116 opened this issue Apr 24, 2025 · 2 comments
Labels
enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@marwan116
Contributor

marwan116 commented Apr 24, 2025

Description

I would like to use Ray Data for some of my exploratory data analysis.

One common task is to compute the distribution of a column.

For example, with pandas I would rely on the .describe() method.

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.random.uniform(0, 1, size=10000)})
df["a"].describe()

This returns

count    10000.000000
mean         0.500795
std          0.291086
min          0.000033
25%          0.247651
50%          0.500268
75%          0.753518
max          0.999915
Name: a, dtype: float64

With Ray Data, I can't currently get this convenience out of the box.

I can achieve part of this with built-in aggregations like .count(), .min(), .max(), .mean(), .std().

However, I can't get percentiles, which I need to find outliers and to compute the interquartile range, median, etc.

Suggestion - add a percentile method to the Ray Data API.

ds.percentile("a", [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99], exact=True)

It ideally would come in two flavors:

  1. approximate percentile aggregations - faster, suited to large datasets
  2. exact percentile aggregations - slower but more accurate, suited to small datasets
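
To make the trade-off between the two flavors concrete, here is a minimal pure-Python sketch. It is not Ray Data code and not a proposed implementation. The exact version sorts all values and interpolates linearly (the default rule in NumPy and pandas). The approximate version assumes the column's min and max are already known (for example from the existing .min() and .max() aggregations), builds a fixed-width histogram per block, and merges the histograms. All function names and the bin count are hypothetical, and a real implementation might well use a different sketch (t-digest, KLL, etc.).

```python
import random


def exact_percentile(values, q):
    """Exact percentile for q in [0, 1]: sort everything, interpolate linearly."""
    data = sorted(values)
    if not data:
        raise ValueError("no values")
    pos = q * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def build_histogram(values, low, high, bins):
    """Per-block partial state: counts over fixed-width bins on [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = int((v - low) / width)
        counts[max(0, min(idx, bins - 1))] += 1  # clamp edge values into range
    return counts


def merge_histograms(a, b):
    """Combine two partial states; cheap because only counts are exchanged."""
    return [x + y for x, y in zip(a, b)]


def approx_percentile(counts, low, high, q):
    """Walk the cumulative counts and interpolate inside the matching bin."""
    width = (high - low) / len(counts)
    target = q * sum(counts)
    seen = 0
    for i, c in enumerate(counts):
        if c and seen + c >= target:
            return low + (i + (target - seen) / c) * width
        seen += c
    return high


# Hypothetical data: four "blocks" of uniform samples, as in the pandas example.
rng = random.Random(0)
blocks = [[rng.random() for _ in range(5000)] for _ in range(4)]
all_values = [v for block in blocks for v in block]

BINS = 1000  # assumption: the error is bounded by roughly one bin width
merged = [0] * BINS
for block in blocks:
    merged = merge_histograms(merged, build_histogram(block, 0.0, 1.0, BINS))

for q in (0.25, 0.5, 0.75):
    print(q, exact_percentile(all_values, q), approx_percentile(merged, 0.0, 1.0, q))
```

The exact path needs every value in one place (or a distributed sort), while the approximate path only moves BINS integers per block, which is why it scales better to large datasets.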

And ideally, extend to add a describe method.

ds.describe("a", exact=True)
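
For the non-percentile rows of a describe-style summary, one possible approach is to compute a small partial state per block and merge the states pairwise. The sketch below is plain Python, not Ray Data's implementation. It uses the standard parallel variance update (Welford within a block, Chan et al. across blocks), and the names Moments, summarize_block, merge, and describe are hypothetical. Percentile rows are omitted here.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Mergeable partial state for count, mean, std, min, and max."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean
    minimum: float = math.inf
    maximum: float = -math.inf


def summarize_block(values):
    """Single pass over one block using Welford's update."""
    acc = Moments()
    for v in values:
        acc.count += 1
        delta = v - acc.mean
        acc.mean += delta / acc.count
        acc.m2 += delta * (v - acc.mean)
        acc.minimum = min(acc.minimum, v)
        acc.maximum = max(acc.maximum, v)
    return acc


def merge(a, b):
    """Combine two partial states without revisiting the underlying rows."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    return Moments(
        count=n,
        mean=a.mean + delta * b.count / n,
        m2=a.m2 + b.m2 + delta * delta * a.count * b.count / n,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
    )


def describe(acc):
    """Finalize: sample standard deviation (ddof=1), matching pandas."""
    std = math.sqrt(acc.m2 / (acc.count - 1)) if acc.count > 1 else float("nan")
    return {"count": acc.count, "mean": acc.mean, "std": std,
            "min": acc.minimum, "max": acc.maximum}


# Hypothetical blocks of one column.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0]]
state = Moments()
for block in blocks:
    state = merge(state, summarize_block(block))
summary = describe(state)
print(summary)
```

Because the state is a handful of numbers per block, these rows are cheap to compute in a single pass regardless of whether the percentile rows are exact or approximate.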

Use case

Primarily for data analysis work before or after running ML training or ML inference with Ray.

@x-Tong

x-Tong commented Apr 25, 2025

This looks quite interesting. I'd like to give it a try, though I can't promise to finish quickly.

@wingkitlee0
Contributor

@x-Tong Check out https://docs.ray.io/en/latest/data/api/doc/ray.data.aggregate.AggregateFnV2.html

The approximate percentile is probably the easier one to implement.
