Fix Audio.cast_storage for CSV-loaded large_string columns by pdTetteh · Pull Request #8121 · huggingface/datasets

pdTetteh · 2026-04-04T12:44:03Z

Summary

This PR fixes Audio.cast_storage() so that audio path columns loaded as large_string can be cast to Audio() correctly.

Motivation

Dataset.from_dict({"audio": [...]}) produces a string column and cast_column("audio", Audio()) works, but load_dataset("csv", ...) produces a large_string column, which currently fails during the cast to the Audio storage type.

Changes made

Added handling for pa.large_string() in Audio.cast_storage()
Cast large string storage to pa.string() before building the Audio struct storage
Added a regression test covering the CSV-loaded audio path case

Testing

Reproduced the issue locally
Ran the targeted regression test for the CSV-loaded large_string case
Ran tests/features/test_audio.py in the local environment

Closes #7970

lhoestq · 2026-04-07T13:51:22Z

Thanks for the investigation and the fix ! This makes me think that maybe load_dataset("csv") should return "string" types instead of "large_string" to be consistent with load_dataset("json") as well as the Dataset.from_xxx methods

lhoestq

The proposed fix looks good to me to enable Audio on large_string types :)

Could you also add the same fix for Video and Pdf ?

lhoestq · 2026-04-07T13:52:11Z

src/datasets/features/audio.py

+        if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type):
+            if pa.types.is_large_string(storage.type):
+                storage = array_cast(storage, pa.string())


you can use the same code as in Image.cast_storage

Suggested change

if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type):

if pa.types.is_large_string(storage.type):

storage = array_cast(storage, pa.string())

if pa.types.is_large_string(storage.type):

try:

storage = storage.cast(pa.string())

except pa.ArrowInvalid as e:

raise ValueError(

f"Failed to cast large_string to string for Image feature. "

f"This can happen if string values exceed 2GB. "

f"Original error: {e}"

) from e

if pa.types.is_string(storage.type):

Alright that's helpful

lhoestq · 2026-04-07T13:52:28Z

src/datasets/features/audio.py

        """Cast an Arrow array to the Audio arrow storage type.
        The Arrow types that can be converted to the Audio pyarrow storage type are:

        - `pa.string()` - it must contain the "path" data


Suggested change

- `pa.string()` - it must contain the "path" data

- `pa.large_string()` - it must contain the "path" data (will be cast to string if possible)

lhoestq · 2026-04-07T13:55:39Z

tests/features/test_audio.py

+
+    from datasets import Audio, load_dataset
+
+    audio_path = tmp_path / "example.wav"


there is a audio_path fixture you can use already

HuggingFaceDocBuilderDev · 2026-04-07T13:57:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

pdTetteh · 2026-04-07T14:52:49Z

Thanks for the review and the suggestions.

I’ll update the PR to:

add the same large_string handling for Video and Pdf
update the docstrings to mention pa.large_string()
switch the audio test to use the existing audio_path fixture
fix the code quality check failure

pdTetteh · 2026-04-07T15:45:15Z

I have updated the PR to:

add the same large_string handling for Video and Pdf
update the docstrings to mention pa.large_string()
keep the Audio fix
add matching regression tests for Audio, Video, and Pdf

I also fixed the code quality issue and reran the targeted feature tests locally.

lhoestq · 2026-04-10T13:23:41Z

Great ! It looks like you didn't commit the change to switch the audio test to use the existing audio_path fixture and to update the Audio docstring, could you add your changes before we merge ?

pdTetteh · 2026-04-10T21:29:05Z

Thanks - I updated the audio regression test to use the existing audio fixture available in the test suite and added the missing Audio docstring update for pa.large_string().

Handle large_string storage in Audio.cast_storage

04163a9

pdTetteh mentioned this pull request Apr 4, 2026

cast_column(..., Audio) fails with load_dataset("csv",) #7970

Open

lhoestq reviewed Apr 7, 2026

View reviewed changes

Support large_string paths for audio, video, and pdf features

71f9b70

Use existing audio fixture and document large_string support for Audio

01deb00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Audio.cast_storage for CSV-loaded large_string columns#8121

Fix Audio.cast_storage for CSV-loaded large_string columns#8121
pdTetteh wants to merge 3 commits intohuggingface:mainfrom
pdTetteh:fix/cast-column-audio-csv-7970

pdTetteh commented Apr 4, 2026

Uh oh!

lhoestq commented Apr 7, 2026

Uh oh!

lhoestq left a comment

Uh oh!

lhoestq Apr 7, 2026

Uh oh!

pdTetteh Apr 7, 2026

Uh oh!

lhoestq Apr 7, 2026

Uh oh!

lhoestq Apr 7, 2026

Uh oh!

pdTetteh Apr 7, 2026

Uh oh!

HuggingFaceDocBuilderDev commented Apr 7, 2026

Uh oh!

pdTetteh commented Apr 7, 2026

Uh oh!

pdTetteh commented Apr 7, 2026

Uh oh!

lhoestq commented Apr 10, 2026

Uh oh!

pdTetteh commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-        if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type):
-            if pa.types.is_large_string(storage.type):
-                storage = array_cast(storage, pa.string())
+        if pa.types.is_large_string(storage.type):
+            try:
+                storage = storage.cast(pa.string())
+            except pa.ArrowInvalid as e:
+                raise ValueError(
+                    f"Failed to cast large_string to string for Image feature. "
+                    f"This can happen if string values exceed 2GB. "
+                    f"Original error: {e}"
+                ) from e
+        if pa.types.is_string(storage.type):


	- `pa.string()` - it must contain the "path" data
	- `pa.large_string()` - it must contain the "path" data (will be cast to string if possible)


		from datasets import Audio, load_dataset

		audio_path = tmp_path / "example.wav"

Conversation

pdTetteh commented Apr 4, 2026

Summary

Motivation

Changes made

Testing

Uh oh!

lhoestq commented Apr 7, 2026

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

lhoestq Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

pdTetteh Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

lhoestq Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

lhoestq Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

pdTetteh Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented Apr 7, 2026

Uh oh!

pdTetteh commented Apr 7, 2026

Uh oh!

pdTetteh commented Apr 7, 2026

Uh oh!

lhoestq commented Apr 10, 2026

Uh oh!

pdTetteh commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants