Describe the bug
There are at least two types of Parquet files: one where the list columns are typed as LargeList and one where they are typed as Sequence. (Note: in general there can also be a mix, with some columns being Sequence and others LargeList.) Trying to use both types together leads to a crash:
ValueError: The features can't be aligned because the key input_ids of features {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None) (expected either LargeList(dtype=Value(dtype='int32', id=None), id=None) or Value("null").
This is also an issue when trying to mix a pretokenized dataset with online data processing of a JSONL dataset. All of our code produces Sequence columns by default, so if the pretokenized dataset is a Parquet file with LargeList columns (e.g. from storage systems like Lakehouse), the same crash is triggered.
Platform
Please provide details about the environment you are using, including the following:
Interpreter version: Python 3.13.1
Library version: main branch commit 8821791f3485a639ab1f08314a6edb182e54f108
Sample Code
Steps to reproduce:
Download the zip and uncompress to get the 2 types of parquet files. data.zip
Should not crash. Intelligently look at the column type and cast the output of the data processing to the correct type (usually this means casting the output of the data handlers Sequence, into a LargeList)
Option 1: Casting
AFTER the data handlers are done, and BEFORE the call to concatenate or interleave the datasets, we should check the column types and cast all of them to the same type (e.g. LargeList).

Option 2: Loading the datasets together does the casting automatically
When loading them together, the order is important! When giving the LargeList parquet file first, you get LargeList; when giving the Sequence parquet file first, you get Sequence.
Observed behavior
ValueError: The features can't be aligned because the key input_ids of features {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None) (expected either LargeList(dtype=Value(dtype='int32', id=None), id=None) or Value("null").
Additional context
Ran into this while trying to mix pretokenized replay buffer data with use-case specific JSONL chat datasets.
Example chat dataset in JSONL format (used to create the parquet files): chatdata.zip
The Sequence parquet file was created using the offline processing script with the following data_config.yaml:
# -----------------------------------------
# Data config docs: https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/docs/advanced-data-preprocessing.md
dataprocessor:
  type: default
  streaming: false
  # granite 3.1 8b instruct chat template
  # https://huggingface.co/ibm-granite/granite-3.1-8b-instruct/blob/main/tokenizer_config.json#L188
  chat_template: "{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"Knowledge Cutoff Date: April 2024.\nToday's Date: \" + strftime_now('%B %d, %Y') + \".\nYou are Granite, developed by IBM.\" %}\n {%- if tools and documents %}\n {%- set system_message = system_message + \" You are a helpful AI assistant with access to the following tools. When a tool is required to answer the user's query, respond with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.\n\nWrite the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.\" %}\n {%- elif tools %}\n {%- set system_message = system_message + \" You are a helpful AI assistant with access to the following tools. When a tool is required to answer the user's query, respond with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.\" %}\n {%- elif documents %}\n {%- set system_message = system_message + \" Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.\" %}\n {%- else %}\n {%- set system_message = system_message + \" You are a helpful AI assistant.\" %} \n {%- endif %}\n {%- if 'citations' in controls and documents %}\n {%- set system_message = system_message + '\n\nIn your response, use the symbols <co> and </co> to indicate when a fact comes from a document in the search result, e.g <co>0</co> for a fact from document 0. Afterwards, list all the citations with their corresponding documents in an ordered list.' %}\n {%- endif %}\n {%- if 'hallucinations' in controls and documents %}\n {%- set system_message = system_message + '\n\nFinally, after the response is written, include a numbered list of sentences from the response that are potentially hallucinated and not based in the documents.' %}\n {%- endif %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{{- '<|start_of_role|>system<|end_of_role|>' + system_message + '<|end_of_text|>\n' }}\n{%- if tools %}\n {{- '<|start_of_role|>tools<|end_of_role|>' }}\n {{- tools | tojson(indent=4) }}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- if documents %}\n {{- '<|start_of_role|>documents<|end_of_role|>' }}\n {%- for document in documents %}\n {{- 'Document ' + loop.index0 | string + '\n' }}\n {{- document['text'] }}\n {%- if not loop.last %}\n {{- '\n\n'}}\n {%- endif%}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in loop_messages %}\n {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|start_of_role|>assistant' }}\n {%- if controls %}\n {{- ' ' + controls | tojson()}}\n {%- endif %}\n {{- '<|end_of_role|>' }}\n {%- endif %}\n{%- endfor %}"
datasets:
  - name: tuning_data
    data_paths:
      - './chat_dataset.jsonl'
    data_handlers:
      - name: tokenize_and_apply_chat_template_with_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            conversation_column: "messages"
HarikrishnanBalagopal changed the title from "Crash when trying to use 2 parquet datasets with different types (LargeList and Sequence)" to "bug: Crash when trying to use 2 parquet datasets with different types (LargeList and Sequence)" on May 12, 2025.