Open
Labels
P1 (issue that should be fixed within a few weeks), community-backlog, data (Ray Data-related issues), enhancement (request for new feature and/or capability), good-first-issue (great starter issue for someone just starting to contribute to Ray), usability
Description
I want to access individual fields of a struct column. For example, a batch has the following schema:
file: string
audio: struct<bytes: binary, path: string>
child 0, bytes: binary
child 1, path: string
text: string
speaker_id: int64
chapter_id: int64
id: string
I want to get the bytes field from the audio struct, but Ray Data expressions can't handle this. For comparison, here is how Daft does it:
- Getting a single value:
import daft
df = daft.from_pydict({"struct": [{"x": 1, "y": 2}, {"x": 3, "y": 4}], "list": [[10, 20], [30, 40]]})
df = df.select(df["struct"]["x"], df["list"][0].alias("first"))
df.show()
╭───────┬───────╮
│ x     ┆ first │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 10    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 30    │
╰───────┴───────╯
(Showing first 2 of 2 rows)
- Getting a slice:
df = daft.from_pydict({"x": [[1, 2, 3], [4, 5, 6, 7], [8]]})
df = df.select(df["x"][1:-1])
df.show()
╭─────────────╮
│ x           │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [2]         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ []          │
╰─────────────╯
(Showing first 3 of 3 rows)
Use case
from datasets import load_dataset
import ray.data
from ray.data.expressions import col
dataset = load_dataset("openslr/librispeech_asr", "clean", split="validation")
print(dataset)
ds = ray.data.from_huggingface(dataset)
print(ds.take_batch(batch_size=1, batch_format="pyarrow").schema)
ds = ds.with_column("bytes", col("audio")["bytes"]) # This raises an error in Ray.
ds.show(1)
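Until expressions support struct field access, a row-wise function may work as a stopgap. The sketch below is only an illustration, not part of Ray's API: `add_struct_field` is a hypothetical helper, and it assumes that a struct value such as `audio` shows up as a plain Python dict (or `None`) when rows are processed one at a time.

```python
from typing import Any, Dict


def add_struct_field(
    row: Dict[str, Any], struct_col: str, field: str, out_col: str
) -> Dict[str, Any]:
    """Return a copy of `row` with `row[struct_col][field]` stored under `out_col`.

    Hypothetical helper: assumes the struct value is a dict-like object or None.
    """
    struct_value = row.get(struct_col)
    # Propagate nulls instead of raising on a missing struct.
    value = None if struct_value is None else struct_value[field]
    # Build a new dict so the input row is left unmodified.
    return {**row, out_col: value}


# Stand-in for one row of the dataset above (values are made up for illustration).
example_row = {
    "file": "example.flac",
    "audio": {"bytes": b"\x00\x01", "path": "example.flac"},
    "text": "hello",
}
result = add_struct_field(example_row, "audio", "bytes", "bytes")
print(result["bytes"])
```

With Ray Data this could presumably be applied as `ds.map(lambda row: add_struct_field(row, "audio", "bytes", "bytes"))`. It runs per row in Python, so it is likely slower than a native `col("audio")["bytes"]` expression would be, which is the point of this request.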