[Data] Add percentiles and statistics aggregations to Ray Data #52588

marwan116 · 2025-04-24T21:36:30Z

Description

I would like to use Ray Data for some of my exploratory data analysis.

One common task is to compute the distribution of a column.

For example, with pandas I would rely on the .describe() method.

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.random.uniform(0, 1, size=10000)})
df["a"].describe()

This returns

count    10000.000000
mean         0.500795
std          0.291086
min          0.000033
25%          0.247651
50%          0.500268
75%          0.753518
max          0.999915
Name: a, dtype: float64

With Ray Data, I can't currently get this convenience out of the box.

I can achieve part of this with built-in aggregations like .count(), .min(), .max(), .mean(), .std().

However, I can't compute percentiles, which I need to find outliers, the interquartile range, the median, etc.

Suggestion - add a percentile method to the Ray Data API.

ds.percentile("a", [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99], exact=True)

It ideally would come in two flavors:

approximate percentile aggregations - faster, suited to large datasets
exact percentile aggregations - slower but accurate, suited to small datasets
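To make the trade-off between the two flavors concrete, here is a minimal single-machine sketch in plain Python. It is not Ray Data code: `exact_percentiles` and `approx_percentiles` are hypothetical helper names, the exact version assumes linear interpolation between sorted values, and the approximate version assumes a fixed-size reservoir sample (one of several possible approximation strategies).

```python
import random


def exact_percentiles(values, qs):
    """Exact percentiles: full sort plus linear interpolation (hypothetical helper)."""
    data = sorted(values)
    if not data:
        raise ValueError("values must be non-empty")
    out = []
    for q in qs:
        pos = q * (len(data) - 1)          # fractional rank in the sorted data
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        frac = pos - lo
        out.append(data[lo] + (data[hi] - data[lo]) * frac)
    return out


def approx_percentiles(values, qs, sample_size=1000, seed=0):
    """Approximate percentiles from a fixed-size reservoir sample (hypothetical helper)."""
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(values):
        if i < sample_size:
            reservoir.append(v)
        else:
            # Keep each new value with probability sample_size / (i + 1).
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = v
    # Memory stays bounded by sample_size regardless of the input length.
    return exact_percentiles(reservoir, qs)
```

The exact version has to sort every value, while the approximate version only ever holds `sample_size` values, which is why it would scale better on large datasets at the cost of some error.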

Ideally, this would also extend to a describe method.

ds.describe("a", exact=True)
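As a rough illustration of what such a summary could contain, the sketch below builds the same fields as the pandas output above using only the Python standard library. `describe_column` is a hypothetical helper, not an existing Ray Data method, and the returned dictionary layout is an assumption.

```python
import statistics


def describe_column(values):
    """Summary statistics in the spirit of pandas' describe() (hypothetical helper).

    Requires at least two values, because the sample standard deviation
    and the quartiles are undefined for fewer.
    """
    data = list(values)
    # method="inclusive" interpolates linearly between the sorted values.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1 denominator)
        "min": min(data),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(data),
    }
```

A distributed version would need to replace the quartile step with one of the percentile aggregations proposed above, since the other fields are already covered by built-in aggregations.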

Use case

Primarily for data analysis work before or after running ML training or ML inference with Ray.


x-Tong · 2025-04-25T05:05:25Z

This looks quite interesting. I want to give it a try, but I can't promise to finish quickly.

wingkitlee0 · 2025-04-25T14:18:13Z

@x-Tong Check out https://docs.ray.io/en/latest/data/api/doc/ray.data.aggregate.AggregateFnV2.html

The approximate percentile is probably easier to implement.
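One reason the approximate variant may be easier is that a small, fixed-size summary can be computed per block and then merged. Below is a minimal sketch of that pattern in plain Python, assuming a fixed-width histogram over a value range `[lo, hi]` that is known in advance. The function names `aggregate_block`, `combine`, and `finalize` are illustrative only and are not meant to reproduce the actual AggregateFnV2 interface.

```python
def aggregate_block(block, lo, hi, num_bins):
    """Count one block's values into fixed-width bins over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in block:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)  # clamp out-of-range values
        counts[idx] += 1
    return counts


def combine(a, b):
    """Merge two per-block histograms by adding their bin counts."""
    return [x + y for x, y in zip(a, b)]


def finalize(counts, lo, hi, qs):
    """Estimate percentiles by interpolating inside the bin that holds each rank."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    width = (hi - lo) / len(counts)
    results = []
    for q in qs:
        target = q * total
        cumulative = 0
        for i, c in enumerate(counts):
            if c > 0 and cumulative + c >= target:
                frac = (target - cumulative) / c
                results.append(lo + (i + frac) * width)
                break
            cumulative += c
    return results
```

The error of this estimate is bounded by the bin width, so accuracy can be traded against memory through `num_bins`. An exact percentile has no comparably small mergeable state, which is what makes it harder to distribute.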

marwan116 added the enhancement (request for new feature and/or capability) and triage (needs triage, e.g. priority, bug/not-bug, and owning component) labels on Apr 24, 2025
