[docs/preprocessors] Support ray_remote_args/ray_remote_args_fn in preprocessors #52448

Open
richardliaw opened this issue Apr 18, 2025 · 3 comments · May be fixed by #52574
Assignees
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability good-first-issue Great starter issue for someone just starting to contribute to Ray P2 Important issue, but not time-critical

Comments

@richardliaw
Contributor

Description

Right now, preprocessors use map_batches, but we can't configure the underlying execution. I've seen users subclass or pull out the preprocessor components to implement this. It would be great to support this in the preprocessor API.

Use case

No response

@richardliaw richardliaw added data Ray Data-related issues enhancement Request for new feature and/or capability good-first-issue Great starter issue for someone just starting to contribute to Ray P2 Important issue, but not time-critical labels Apr 18, 2025
@richardliaw
Contributor Author

Also this one, which is similar to #51379 -- cc @xingyu-long

@xingyu-long
Contributor

xingyu-long commented Apr 20, 2025

Hi @richardliaw, I did some investigating on this.

If I understand this issue correctly, we'd like to support ray_remote_args/ray_remote_args_fn in preprocessors:

class Preprocessor(abc.ABC):

I looked at the code, and it seems users can implement

def _get_transform_config(self) -> Dict[str, Any]:
    """Returns kwargs to be passed to :meth:`ray.data.Dataset.map_batches`.

    This can be implemented by subclassing preprocessors.
    """
    return {}

and then, when map_batches is called, the config is loaded:

kwargs = self._get_transform_config()
if transform_type == BatchFormat.PANDAS:
    return ds.map_batches(
        self._transform_pandas, batch_format=BatchFormat.PANDAS, **kwargs
    )

That is, when implementing a subclass, users can add ray_remote_args/ray_remote_args_fn to the dict returned by _get_transform_config(self):

def _get_transform_config(self) -> Dict[str, Any]:
    return {
        "ray_remote_args": ...,
        "ray_remote_args_fn": ...,
        # other args...
    }
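To make the subclassing route concrete, here is a minimal sketch that runs without Ray. `RecordingDataset`, `SketchPreprocessor`, `LowercaseText`, and `per_task_args` are all hypothetical stand-ins, not Ray classes. One assumption worth checking against the Ray Data docs: as far as I can tell, `Dataset.map_batches` takes `ray_remote_args_fn` as a named parameter and collects any remaining keyword arguments (such as `num_cpus`) as `**ray_remote_args`, so the sketch returns resource options as plain keys rather than nested under a `"ray_remote_args"` key.

```python
from typing import Any, Callable, Dict, List, Optional


class RecordingDataset:
    """Hypothetical stand-in for ray.data.Dataset; it only records the call."""

    def __init__(self, rows: List[dict]):
        self.rows = rows
        self.last_call: Dict[str, Any] = {}

    def map_batches(
        self,
        fn: Callable[[List[dict]], List[dict]],
        *,
        batch_format: Optional[str] = None,
        ray_remote_args_fn: Optional[Callable[[], Dict[str, Any]]] = None,
        **ray_remote_args: Any,
    ) -> "RecordingDataset":
        # Assumed to mirror map_batches: leftover kwargs become remote args.
        out = RecordingDataset(fn(self.rows))
        out.last_call = {
            "batch_format": batch_format,
            "ray_remote_args_fn": ray_remote_args_fn,
            "ray_remote_args": ray_remote_args,
        }
        return out


class SketchPreprocessor:
    """Simplified stand-in for the Preprocessor base class quoted above."""

    def _get_transform_config(self) -> Dict[str, Any]:
        return {}

    def _transform_batch(self, batch: List[dict]) -> List[dict]:
        return batch

    def transform(self, ds: RecordingDataset) -> RecordingDataset:
        # Same pattern as the quoted snippet: load the config, then forward it.
        kwargs = self._get_transform_config()
        return ds.map_batches(self._transform_batch, **kwargs)


def per_task_args() -> Dict[str, Any]:
    # Hypothetical callback; it would be evaluated when tasks are launched.
    return {"num_cpus": 1}


class LowercaseText(SketchPreprocessor):
    def _transform_batch(self, batch: List[dict]) -> List[dict]:
        return [{**row, "text": row["text"].lower()} for row in batch]

    def _get_transform_config(self) -> Dict[str, Any]:
        return {"num_cpus": 2, "ray_remote_args_fn": per_task_args}


result = LowercaseText().transform(RecordingDataset([{"text": "Hello"}]))
```

The downside this illustrates is the one raised in the issue description: every user who wants to tune execution has to write a subclass first.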

Maybe we could add a corresponding setter for the transform config, so that only Preprocessor needs to change? Or are you looking for something different in terms of API design here?
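A rough sketch of what the setter idea could look like, under the assumption that the base class stores the config in an instance attribute. `ConfigurablePreprocessor` and `set_transform_config` are hypothetical names used only for illustration; nothing like them is confirmed to exist in Ray.

```python
from typing import Any, Dict


class ConfigurablePreprocessor:
    """Hypothetical base class that holds map_batches kwargs on the instance."""

    def __init__(self) -> None:
        self._transform_config: Dict[str, Any] = {}

    def set_transform_config(self, **kwargs: Any) -> "ConfigurablePreprocessor":
        # Merge new options into the stored config; return self for chaining.
        self._transform_config.update(kwargs)
        return self

    def _get_transform_config(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the stored config.
        return dict(self._transform_config)


prep = ConfigurablePreprocessor().set_transform_config(num_cpus=2)
config = prep._get_transform_config()
```

With this shape, existing subclasses would not need new constructor arguments, at the cost of making preprocessors mutable after construction.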

Or we could add both arguments, with default values, to the Preprocessor constructor. Would all existing subclasses then need to be updated as well?

For example, with Tokenizer:

>>> from ray.data.preprocessors import Tokenizer
>>> tokenizer = Tokenizer(columns=["text"], ray_remote_args=..., ray_remote_args_fn=...)
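For comparison, a minimal sketch of the constructor approach, again without importing Ray. `CtorConfiguredPreprocessor` and `SketchTokenizer` are hypothetical stand-ins, and the merging logic is an assumption; the linked PR may implement it differently. It also shows why each subclass would need to forward the new arguments to the base constructor.

```python
from typing import Any, Callable, Dict, List, Optional


class CtorConfiguredPreprocessor:
    """Hypothetical base class accepting execution options in __init__."""

    def __init__(
        self,
        ray_remote_args: Optional[Dict[str, Any]] = None,
        ray_remote_args_fn: Optional[Callable[[], Dict[str, Any]]] = None,
    ) -> None:
        # Copy the dict so later changes by the caller do not leak in.
        self._ray_remote_args = dict(ray_remote_args or {})
        self._ray_remote_args_fn = ray_remote_args_fn

    def _get_transform_config(self) -> Dict[str, Any]:
        # Assumed merge: flatten ray_remote_args into the map_batches kwargs.
        config: Dict[str, Any] = dict(self._ray_remote_args)
        if self._ray_remote_args_fn is not None:
            config["ray_remote_args_fn"] = self._ray_remote_args_fn
        return config


class SketchTokenizer(CtorConfiguredPreprocessor):
    """Hypothetical subclass; it must pass the new arguments up to the base."""

    def __init__(self, columns: List[str], **execution_options: Any) -> None:
        super().__init__(**execution_options)
        self.columns = columns


tokenizer = SketchTokenizer(columns=["text"], ray_remote_args={"num_cpus": 2})
default_tokenizer = SketchTokenizer(columns=["text"])
```

Because both arguments default to `None`, existing call sites such as `SketchTokenizer(columns=["text"])` keep working and produce an empty config.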

btw, feel free to assign this issue to me too. Thanks!

@xingyu-long
Contributor

I decided to try the constructor approach above. Feel free to check out the linked PR for details. Thanks!


2 participants