[Data] Avoid merging map ops in cases when it leads to substantial parallelism reduction #52570

alexeykudinkin · 2025-04-24T02:19:56Z

Why are these changes needed?

These changes are needed to prevent cases where Map operator fusion leads to a substantial reduction in parallelism.

Consider the following scenario:

- Upstream MapOp specifies min_rows_per_input_bundle=1
- Downstream MapOp specifies min_rows_per_input_bundle=100

For a dataset of 100 rows in 10 blocks (10 rows per block), fusing these two operators produces a fused operator whose parallelism is just 1 task (determined by the downstream requirement), substantially reducing the upstream operator's parallelism.
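
The arithmetic in this scenario can be sketched with a simplified greedy bundling model. This is only an illustration, not Ray Data's actual bundler: `estimate_num_tasks` is a hypothetical helper that assumes each bundle becomes one task and that blocks are accumulated until the minimum row count is reached.

```python
from typing import List, Optional


def estimate_num_tasks(
    block_row_counts: List[int], min_rows_per_bundle: Optional[int]
) -> int:
    """Estimate task count under a simplified model (hypothetical helper).

    Blocks are accumulated greedily into a bundle until the bundle holds at
    least `min_rows_per_bundle` rows; each bundle is assumed to be one task.
    """
    if min_rows_per_bundle is None:
        # No bundling requirement: assume one task per block.
        return len(block_row_counts)

    num_tasks = 0
    rows_in_bundle = 0
    for rows in block_row_counts:
        rows_in_bundle += rows
        if rows_in_bundle >= min_rows_per_bundle:
            num_tasks += 1
            rows_in_bundle = 0
    if rows_in_bundle > 0:
        # Leftover rows still need a task of their own.
        num_tasks += 1
    return num_tasks


blocks = [10] * 10  # 100 rows split into 10 blocks
upstream_tasks = estimate_num_tasks(blocks, 1)  # upstream requirement alone
fused_tasks = estimate_num_tasks(blocks, 100)   # fused op inherits downstream's
print(upstream_tasks, fused_tasks)
```

Under this model the upstream operator alone would run as 10 tasks, while the fused operator collapses to a single task.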

This is a big issue when we fuse Read ops with subsequent Map operations.

This change:

- Avoids fusion of Read ops with downstream Map ops that have batch_size specified
- Adjusts fusion sequence to avoid fusing operators with substantial reduction in estimated parallelism
- Adds telemetry
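
A minimal sketch of the first rule, assuming a stand-in `OpInfo` record rather than Ray Data's real operator classes. `OpInfo` and `can_fuse` are hypothetical names for illustration only. The parallelism-reduction check from the second item is not modeled here, since its exact logic is not shown in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpInfo:
    """Hypothetical stand-in for an operator; not a Ray Data class."""

    name: str
    is_read: bool = False
    min_rows_per_input_bundle: Optional[int] = None


def can_fuse(upstream: OpInfo, downstream: OpInfo) -> bool:
    # Rule described above: do not fuse a read op with a downstream map op
    # that requests a minimum bundle size, to keep read parallelism intact.
    if upstream.is_read and downstream.min_rows_per_input_bundle is not None:
        return False
    return True


read = OpInfo("Read", is_read=True)
plain_map = OpInfo("Map")
batched_map = OpInfo("MapBatches", min_rows_per_input_bundle=100)

print(can_fuse(read, plain_map))    # read followed by a map without a bundle requirement
print(can_fuse(read, batched_map))  # read followed by a map with a bundle requirement
```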

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>


raulchen · 2025-04-24T17:49:37Z

python/ray/data/_internal/logical/rules/operator_fusion.py

+
+        # Do not fuse read op with downstream map op in case when downstream has
+        # `min_rows_per_input_bundle` specified (to avoid reducing reading parallelism)
+        if upstream_op.is_read_op() and ds_bundle_min_rows_req is not None:


In addition to read ops, I think we should disable fusion as long as the previous map op doesn't preserve num rows (e.g., read, filter, map_batches, flat_map, etc.).

map_batches typically preserves num rows.
But today we don't enforce that.
Related issue: #36295
One option is to enforce that by default, and add a flag to allow violation.

> One option is to enforce that by default, and add a flag to allow violation.

I don't think we can do that anymore with our public API -- I can totally see that being too limiting.

Regardless, preserving num-rows for proper limit push-downs is an important topic, but it is tangential to this change.
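
For reference, the broader rule suggested in this thread could look roughly like the sketch below. It is an interpretation, not code from the PR: the op-kind strings and `should_fuse` are hypothetical, and it assumes the rule would only apply when the downstream op specifies a minimum bundle size.

```python
from typing import Optional

# Hypothetical labels for upstream op kinds that, per the comment above,
# are not guaranteed to preserve the number of rows.
NON_ROW_PRESERVING_KINDS = {"read", "filter", "map_batches", "flat_map"}


def should_fuse(upstream_kind: str, downstream_min_rows: Optional[int]) -> bool:
    # Assumption: with no bundle-size requirement downstream, fusion does not
    # constrain upstream parallelism, so it stays allowed.
    if downstream_min_rows is None:
        return True
    # Otherwise fuse only when the upstream op keeps the row count unchanged.
    return upstream_kind not in NON_ROW_PRESERVING_KINDS


print(should_fuse("filter", 100))
print(should_fuse("map", 100))
```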


alexeykudinkin requested a review from a team as a code owner April 24, 2025 02:19

alexeykudinkin added the "go" label (add ONLY when ready to merge, run all tests) Apr 24, 2025

alexeykudinkin requested review from raulchen and bveeramani April 24, 2025 02:20

alexeykudinkin force-pushed the ak/op-fus-fix-2 branch from f8ab19b to ab58245 Compare April 24, 2025 02:22

alexeykudinkin requested review from a team, sven1977 and simonsays1980 as code owners April 24, 2025 02:22

alexeykudinkin changed the base branch from ak/op-fus-fix to master April 24, 2025 02:23

alexeykudinkin removed request for a team, simonsays1980 and sven1977 April 24, 2025 02:23

alexeykudinkin added 15 commits April 24, 2025 10:48

Abstracted _derive_bundle_min_num_rows

bdc0ffb

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added a warning for substantial increase in `min_num_rows_per_input_bundle`

ce32cef

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Cleaned up message

a3a8573

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added is_read_op method

a5b9a66

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Rebased ExecutionPlan to utilize is_read_op

4d6de46

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Tidying up

cc666ff

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added assertions

f57a12c

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed _derive_bundle_min_num_rows to properly handle case when `min_rows_per_bundled_input` is not specified

bea2c17

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Tidying up

e07950f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added a check to avoid fusing read op with downstream MapOperator whenever it has `min_num_rows_per_input_bundle` specified

55db796

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Avoid fusing operators that could lead to substantial parallelism reduction (by more than > 4x)

f77918f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Clean up dead-code

3ee705d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixing invalid ref

eaa7794

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

fddc740

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated py-doc

7a4b160

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

raulchen reviewed Apr 24, 2025


alexeykudinkin added 2 commits April 24, 2025 10:58

Tidying up

05de56a

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed NPE in _derive_upstream_parallelism_reduction_factor

c90719b

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin added 7 commits April 24, 2025 11:57

Fixed test

c7fcaf2

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added both happy, unhappy paths for read <> map fusion

b75398a

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisited tests to assert whole plan

eb9874e

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added tests for batch-size controlled fusion

7644b89

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Tidying up

d49c5f1

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added py-doc

7598053

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

7e6d122

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin force-pushed the ak/op-fus-fix-2 branch from ab58245 to 7e6d122 Compare April 24, 2025 20:03

Deleted parallelism reduction factor analysis as not substantial enough

ccc7c1f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
