[Ray Data] Enhance the expression for getting a field or slice #57668

@codingl2k1

Description


I want to access individual fields of a struct column. For example, a batch has the following schema:

file: string
audio: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
text: string
speaker_id: int64
chapter_id: int64
id: string

I want to get the bytes field from the audio struct, but Ray Data expressions can't express this. For comparison, here is how Daft handles it:

  • Getting a single value:
import daft
df = daft.from_pydict({"struct": [{"x": 1, "y": 2}, {"x": 3, "y": 4}], "list": [[10, 20], [30, 40]]})
df = df.select(df["struct"]["x"], df["list"][0].alias("first"))
df.show()
╭───────┬───────╮
│ x     ┆ first │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 10    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30    │
╰───────┴───────╯

(Showing first 2 of 2 rows)
  • Getting a slice:
df = daft.from_pydict({"x": [[1, 2, 3], [4, 5, 6, 7], [8]]})
df = df.select(df["x"][1:-1])
df.show()
╭─────────────╮
│ x           │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [2]         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ []          │
╰─────────────╯

(Showing first 3 of 3 rows)
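
To pin down the semantics being requested, the two Daft examples above can be read as ordinary Python indexing applied to each row. This is only a plain-Python restatement of the per-row behavior, using the same input data; it does not use Daft or Ray, and the variable names are illustrative.

```python
# Per-row meaning of df["struct"]["x"], df["list"][0], and df["x"][1:-1],
# written with plain Python containers instead of a dataframe library.
struct_col = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
list_col = [[10, 20], [30, 40]]
nested = [[1, 2, 3], [4, 5, 6, 7], [8]]

# Struct field access: pick one named field from every row.
x = [row["x"] for row in struct_col]

# List element access: pick one position from every row.
first = [row[0] for row in list_col]

# List slicing: drop the first and last element of every row.
sliced = [row[1:-1] for row in nested]

print(x, first, sliced)
```

The values match the Daft output shown above: a single-element list slices to an empty list rather than raising an error.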

Use case

from datasets import load_dataset
import ray.data
from ray.data.expressions import col

dataset = load_dataset("openslr/librispeech_asr", "clean", split="validation")
print(dataset)

ds = ray.data.from_huggingface(dataset)
print(ds.take_batch(batch_size=1, batch_format="pyarrow").schema)
ds = ds.with_column("bytes", col("audio")["bytes"])    # This raises an error in Ray.
ds.show(1)
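
One possible shape for the requested feature is a `__getitem__` method on the expression class that dispatches on the key type. The sketch below is a toy model under stated assumptions: `Expr`, `Col`, `GetField`, `GetIndex`, `GetSlice`, and this `col` helper are hypothetical stand-ins, not Ray Data's actual classes, and `evaluate` works on a single Python dict rather than an Arrow batch.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


class Expr:
    """Hypothetical expression base class (not Ray Data's real type)."""

    # Defining __getitem__ alone would make instances iterable through the
    # legacy sequence protocol; setting __iter__ to None disables that.
    __iter__ = None

    def __getitem__(self, key: Union[str, int, slice]) -> "Expr":
        if isinstance(key, str):
            return GetField(self, key)      # struct field by name
        if isinstance(key, int):
            return GetIndex(self, key)      # list element by position
        if isinstance(key, slice):
            if key.step not in (None, 1):
                raise ValueError("only step-1 slices are modeled here")
            return GetSlice(self, key.start, key.stop)
        raise TypeError(f"unsupported key type: {type(key).__name__}")

    def evaluate(self, row: dict) -> Any:
        raise NotImplementedError


@dataclass
class Col(Expr):
    name: str

    def evaluate(self, row: dict) -> Any:
        return row[self.name]


@dataclass
class GetField(Expr):
    child: Expr
    field: str

    def evaluate(self, row: dict) -> Any:
        return self.child.evaluate(row)[self.field]


@dataclass
class GetIndex(Expr):
    child: Expr
    index: int

    def evaluate(self, row: dict) -> Any:
        return self.child.evaluate(row)[self.index]


@dataclass
class GetSlice(Expr):
    child: Expr
    start: Optional[int]
    stop: Optional[int]

    def evaluate(self, row: dict) -> Any:
        return self.child.evaluate(row)[self.start:self.stop]


def col(name: str) -> Col:
    """Hypothetical stand-in for ray.data.expressions.col."""
    return Col(name)


# Toy row shaped like the schema in this issue, plus an illustrative list.
row = {"audio": {"bytes": b"abc", "path": "a.flac"}, "tokens": [1, 2, 3, 4]}
audio_bytes = col("audio")["bytes"].evaluate(row)
first_token = col("tokens")[0].evaluate(row)
middle_tokens = col("tokens")[1:-1].evaluate(row)
```

A real implementation would presumably evaluate these nodes against Arrow arrays instead of Python dicts, but the key-type dispatch in `__getitem__` would look similar.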

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • community-backlog
  • data: Ray Data-related issues
  • enhancement: Request for new feature and/or capability
  • good-first-issue: Great starter issue for someone just starting to contribute to Ray
  • usability