Fix Audio.cast_storage for CSV-loaded large_string columns#8121
Fix Audio.cast_storage for CSV-loaded large_string columns#8121pdTetteh wants to merge 3 commits intohuggingface:mainfrom
Conversation
|
Thanks for the investigation and the fix ! This makes me think that maybe load_dataset("csv") should return "string" types instead of "large_string" to be consistent with load_dataset("json") as well as the Dataset.from_xxx methods |
lhoestq
left a comment
There was a problem hiding this comment.
The proposed fix looks good to me to enable Audio on large_string types :)
Could you also add the same fix for Video and Pdf ?
| if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type): | ||
| if pa.types.is_large_string(storage.type): | ||
| storage = array_cast(storage, pa.string()) |
There was a problem hiding this comment.
you can use the same code as in Image.cast_storage
| if pa.types.is_string(storage.type) or pa.types.is_large_string(storage.type): | |
| if pa.types.is_large_string(storage.type): | |
| storage = array_cast(storage, pa.string()) | |
| if pa.types.is_large_string(storage.type): | |
| try: | |
| storage = storage.cast(pa.string()) | |
| except pa.ArrowInvalid as e: | |
| raise ValueError( | |
| f"Failed to cast large_string to string for Image feature. " | |
| f"This can happen if string values exceed 2GB. " | |
| f"Original error: {e}" | |
| ) from e | |
| if pa.types.is_string(storage.type): |
| """Cast an Arrow array to the Audio arrow storage type. | ||
| The Arrow types that can be converted to the Audio pyarrow storage type are: | ||
|
|
||
| - `pa.string()` - it must contain the "path" data |
There was a problem hiding this comment.
| - `pa.string()` - it must contain the "path" data | |
| - `pa.large_string()` - it must contain the "path" data (will be cast to string if possible) |
tests/features/test_audio.py
Outdated
|
|
||
| from datasets import Audio, load_dataset | ||
|
|
||
| audio_path = tmp_path / "example.wav" |
There was a problem hiding this comment.
there is a audio_path fixture you can use already
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Thanks for the review and the suggestions. I’ll update the PR to:
|
|
I have updated the PR to:
I also fixed the code quality issue and reran the targeted feature tests locally. |
|
Great ! It looks like you didn't commit the change to switch the audio test to use the existing audio_path fixture and to update the Audio docstring, could you add your changes before we merge ? |
|
Thanks - I updated the audio regression test to use the existing audio fixture available in the test suite and added the missing |
Summary
This PR fixes
Audio.cast_storage()so that audio path columns loaded aslarge_stringcan be cast toAudio()correctly.Motivation
Dataset.from_dict({"audio": [...]})produces astringcolumn andcast_column("audio", Audio())works, butload_dataset("csv", ...)produces alarge_stringcolumn, which currently fails during the cast to the Audio storage type.Changes made
pa.large_string()inAudio.cast_storage()pa.string()before building the Audio struct storageTesting
large_stringcasetests/features/test_audio.pyin the local environmentCloses #7970