Skip to content

[Data] support num_cpus, memory, concurrency, batch_size for preprocess #52574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

xingyu-long
Copy link
Contributor

@xingyu-long xingyu-long commented Apr 24, 2025

Why are these changes needed?

As title, we'd like to support above parameters for preprocess

Related issue number

Close #52448

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xingyu-long xingyu-long requested a review from a team as a code owner April 24, 2025 05:02
@xingyu-long
Copy link
Contributor Author

cc @richardliaw to review this change set.

Please note that I only update the Tokenizer with above arguments, if we are good with this solution, I will update other subclass of preprocess.

@richardliaw
Copy link
Contributor

OK, thinking through this a little more. I think maybe better if we just expose:

  • num_cpus
  • memory
  • batch_size
  • concurrency

And avoid exposing ray_remote_args and ray_remote_args_fn generally.

This will be simpler to maintain and will allow it to be easily serializable

@xingyu-long
Copy link
Contributor Author

OK, thinking through this a little more. I think maybe better if we just expose:

  • num_cpus
  • memory
  • batch_size
  • concurrency

And avoid exposing ray_remote_args and ray_remote_args_fn generally.

This will be simpler to maintain and will allow it to be easily serializable

Make sense, will update code accordingly. thanks!

@richardliaw
Copy link
Contributor

richardliaw commented Apr 26, 2025 via email

@xingyu-long
Copy link
Contributor Author

xingyu-long commented Apr 26, 2025

Another thought -

  1. This should only be for transform(), not init
  2. for fit_transform() we can add “transform_num_cpus” or something

sure, we can do this too. just updated it. also I was wondering if we should remove function _get_transform_config since we used these configs explicitly, let me know what do you thin k on this. Thanks!

@xingyu-long xingyu-long changed the title [Data] support ray_remote_args and ray_remote_args_fn for preprocess [Data] support num_cpus, memory, concurrency, batch_size for preprocess Apr 26, 2025
@xingyu-long xingyu-long force-pushed the issue_52448_preprocess_remote branch from 9900995 to 7caf5af Compare April 27, 2025 04:41
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 28, 2025
@masoudcharkhabi masoudcharkhabi added data Ray Data-related issues usability labels Apr 28, 2025
Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, left some comments

@xingyu-long
Copy link
Contributor Author

Thanks for reviewing it, @richardliaw, I just updated the changes.

@xingyu-long xingyu-long force-pushed the issue_52448_preprocess_remote branch from 857f6a1 to f0dcb1d Compare April 30, 2025 03:35
@xingyu-long
Copy link
Contributor Author

with this change (f52f057), fixed the test_chain test and also tested it all preprocessor tests.

$ pytest -v -s python/ray/data/tests/preprocessors/
Results (25.86s):
     119 passed

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
@xingyu-long xingyu-long force-pushed the issue_52448_preprocess_remote branch from f52f057 to 831466f Compare April 30, 2025 07:13
Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 small nit

@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Apr 30, 2025
@richardliaw
Copy link
Contributor

oh i just realized, this looks good - you're planning to extend this to all preprocessors right?

@xingyu-long
Copy link
Contributor Author

oh i just realized, this looks good - you're planning to extend this to all preprocessors right?

That was the idea (with constructor approach I tried), however, since we modified common functions in parent class, so we don't have to touch them any more.

I've ran all tests under preprocess test dir. let me know if I am missing any other tests.

btw, I will fix the above comments later today. Thanks!

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
@richardliaw richardliaw merged commit a7f1f24 into ray-project:master May 2, 2025
5 checks passed
vickytsang pushed a commit to ROCm/ray that referenced this pull request May 5, 2025
…ss (ray-project#52574)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

As title, we'd like to support above parameters for preprocess

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Close ray-project#52448

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
GokuMohandas pushed a commit that referenced this pull request May 8, 2025
…ss (#52574)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

As title, we'd like to support above parameters for preprocess

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Close #52448

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests usability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[docs/preprocessors] Support ray_remote_args/ray_remote_args_fn in preprocessors
4 participants