[Error during data preprocessing (tokenization)] datasets.builder.DatasetGenerationError #269

@ShanJianSoda

I generated the jsonl file by following the steps,
then ran the following command:

python tokenize_dataset_rows.py ^
--jsonl_path data/alpaca_data.jsonl ^
--save_path data/alpaca ^
--max_seq_length 200 

It fails with this error:

E:\ChatGLM\ChatGLM3\ChatGLM-LoRA>python tokenize_dataset_rows.py ^
More? --jsonl_path data/alpaca_data.jsonl ^
More? --save_path data/alpaca ^
More? --max_seq_length 200
  0%|                                                                                        | 0/52002 [00:00<?, ?it/s]
Generating train split: 0 examples [00:02, ? examples/s]                                     | 0/52002 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1676, in _prepare_split_single
    for key, record in generator:
  File "e:\anaconda3\Lib\site-packages\datasets\packaged_modules\generator\generator.py", line 30, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 31, in read_jsonl
    feature = preprocess(tokenizer, config, example, max_seq_length)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 10, in preprocess
    prompt = example["text"]
             ~~~~~~~^^^^^^^^
KeyError: 'text'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 53, in <module>
    main()
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 46, in main
    dataset = datasets.Dataset.from_generator(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\anaconda3\Lib\site-packages\datasets\arrow_dataset.py", line 1072, in from_generator
    ).read()
      ^^^^^^
  File "e:\anaconda3\Lib\site-packages\datasets\io\generator.py", line 47, in read
    self.builder.download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1712, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
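
The innermost frame above is `KeyError: 'text'` raised by `example["text"]` in `preprocess`, which suggests that the rows in the jsonl file do not have a `text` key. A minimal sketch for checking which keys the rows actually contain is below. The helper name `count_key_sets` and the two sample rows are made up for illustration; the real key names in `data/alpaca_data.jsonl` are not shown in this issue.

```python
import json
from collections import Counter


def count_key_sets(lines):
    """Count how many JSONL rows share each distinct set of top-level keys."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        counts[tuple(sorted(row))] += 1
    return counts


# Hypothetical sample rows, NOT taken from the real dataset.
sample = [
    '{"context": "hi", "target": "hello"}',
    '{"text": "a row that has the key preprocess expects"}',
]

key_sets = count_key_sets(sample)
# Rows that would trigger KeyError: 'text' in preprocess.
missing_text = sum(n for keys, n in key_sets.items() if "text" not in keys)

print(key_sets)
print("rows without 'text':", missing_text)
```

To run it on the real file, pass an open file handle instead of `sample`, for example `count_key_sets(open("data/alpaca_data.jsonl", encoding="utf-8"))`. If no row has a `text` key, either the jsonl was produced in a different format than `tokenize_dataset_rows.py` expects, or `preprocess` needs to read whichever keys the file does contain.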

I searched around but could not find a fix that works.

Could anyone help me out?
