Skip to content

Fix Audio.cast_storage for CSV-loaded large_string columns#8121

Open
pdTetteh wants to merge 3 commits intohuggingface:mainfrom
pdTetteh:fix/cast-column-audio-csv-7970
Open

Fix Audio.cast_storage for CSV-loaded large_string columns#8121
pdTetteh wants to merge 3 commits intohuggingface:mainfrom
pdTetteh:fix/cast-column-audio-csv-7970

Conversation

@pdTetteh
Copy link
Copy Markdown

@pdTetteh pdTetteh commented Apr 4, 2026

Summary

This PR fixes Audio.cast_storage() so that audio path columns loaded as large_string can be cast to Audio() correctly.

Motivation

Dataset.from_dict({"audio": [...]}) produces a string column and cast_column("audio", Audio()) works, but load_dataset("csv", ...) produces a large_string column, which currently fails during the cast to the Audio storage type.

Changes made

  • Added handling for pa.large_string() in Audio.cast_storage()
  • Cast large string storage to pa.string() before building the Audio struct storage
  • Added a regression test covering the CSV-loaded audio path case

Testing

  • Reproduced the issue locally
  • Ran the targeted regression test for the CSV-loaded large_string case
  • Ran tests/features/test_audio.py in the local environment

Closes #7970

@lhoestq
Copy link
Copy Markdown
Member

lhoestq commented Apr 7, 2026

Thanks for the investigation and the fix ! This makes me think that maybe load_dataset("csv") should return "string" types instead of "large_string" to be consistent with load_dataset("json") as well as the Dataset.from_xxx methods

Copy link
Copy Markdown
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed fix looks good to me to enable Audio on large_string types :)

Could you also add the same fix for Video and Pdf ?

Comment on lines +252 to +254
if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type):
if pa.types.is_large_string(storage.type):
storage = array_cast(storage, pa.string())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the same code as in Image.cast_storage

Suggested change
if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type):
if pa.types.is_large_string(storage.type):
storage = array_cast(storage, pa.string())
if pa.types.is_large_string(storage.type):
try:
storage = storage.cast(pa.string())
except pa.ArrowInvalid as e:
raise ValueError(
f"Failed to cast large_string to string for Image feature. "
f"This can happen if string values exceed 2GB. "
f"Original error: {e}"
) from e
if pa.types.is_string(storage.type):

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright that's helpful

"""Cast an Arrow array to the Audio arrow storage type.
The Arrow types that can be converted to the Audio pyarrow storage type are:

- `pa.string()` - it must contain the "path" data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `pa.string()` - it must contain the "path" data
- `pa.large_string()` - it must contain the "path" data (will be cast to string if possible)


from datasets import Audio, load_dataset

audio_path = tmp_path / "example.wav"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a audio_path fixture you can use already

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pdTetteh
Copy link
Copy Markdown
Author

pdTetteh commented Apr 7, 2026

Thanks for the review and the suggestions.

I’ll update the PR to:

  • add the same large_string handling for Video and Pdf
  • update the docstrings to mention pa.large_string()
  • switch the audio test to use the existing audio_path fixture
  • fix the code quality check failure

@pdTetteh
Copy link
Copy Markdown
Author

pdTetteh commented Apr 7, 2026

I have updated the PR to:

  • add the same large_string handling for Video and Pdf
  • update the docstrings to mention pa.large_string()
  • keep the Audio fix
  • add matching regression tests for Audio, Video, and Pdf

I also fixed the code quality issue and reran the targeted feature tests locally.

@lhoestq
Copy link
Copy Markdown
Member

lhoestq commented Apr 10, 2026

Great ! It looks like you didn't commit the change to switch the audio test to use the existing audio_path fixture and to update the Audio docstring, could you add your changes before we merge ?

@pdTetteh
Copy link
Copy Markdown
Author

Thanks - I updated the audio regression test to use the existing audio fixture available in the test suite and added the missing Audio docstring update for pa.large_string().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cast_column(..., Audio) fails with load_dataset("csv",)

3 participants