What are the minimum hardware resources needed to fine-tune the model? #15

@thunderbolt-fire

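Many thanks to the author for this work. I am trying to reproduce the model fine-tuning part on a single NVIDIA A100 80GB. I have already reduced the batch size to 1, but training still fails. How should I adjust the settings so that fine-tuning can run on this one card?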

DATASET=PopQA
PER_DEVICE_BATCH_SIZE=1
NUM_DEVICE=1
TOTAL_BATCH_SIZE=1
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_DEVICE/$PER_DEVICE_BATCH_SIZE))
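For reference, the arithmetic in the last line can be sketched in Python as below. The function name `gradient_accumulation_steps` is hypothetical and only mirrors the shell expression above; it is not taken from the repository.

```python
def gradient_accumulation_steps(total_batch_size: int,
                                num_devices: int,
                                per_device_batch_size: int) -> int:
    """Mirror $(($TOTAL_BATCH_SIZE/$NUM_DEVICE/$PER_DEVICE_BATCH_SIZE)).

    Bash arithmetic uses integer division; for positive integers,
    Python's // gives the same result.
    """
    return total_batch_size // num_devices // per_device_batch_size


# Values from the configuration above.
steps = gradient_accumulation_steps(total_batch_size=1,
                                    num_devices=1,
                                    per_device_batch_size=1)
print(steps)
```

With all three values set to 1 the result is 1, so the batch size cannot be lowered any further through these variables.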

Error output:

(flashrag) (base) root@5e17ee1d7b71:~/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/InstructRAG# bash train.sh
/opt/conda/envs/flashrag/lib/python3.11/site-packages/transformers/training_args.py:1913: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.08it/s]
Loading training set from: dataset/PopQA/train.json

===DEBUG Input:
"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDocument 1 (Title: George Claus Rankin): George Claus Rankin Sir George Claus Rankin PC (12 August 1877 \u2013 8 April 1946) was a British judge in India. Rankin was born in Lamington, Lanarkshire, the son of Rev. Robert Rankin. He was educated at the University of Edinburgh and Trinity College, Cambridge. He as admitted at Lincoln's Inn and called to the bar in 1904. He served in the First World War with the Royal Garrison Artillery. He went to India in 1918 and served first as a puisne judge of the High Court of Calcutta, and then as Chief Justice, from 1926 to 1934. While in\n\nDocument 2 (Title: Tom Rankin): a gardener and greenkeeper. Tom Rankin Thomas Charles \"Tom\" Rankin (3 May 1881 \u2013 18 February 1958) was an Australian rules footballer who played with Geelong in the Victorian Football League (VFL). His brother, Edwin (known as \u2018Teddy\u2019) and other members of the Rankin family also played for Geelong. Rankin got off to a promising start in the 1904 and early 1905 seasons, but his career was compromised by serious injuries to his knee and kidney sustained during a match in May 1905. He married Adeline Harrison and raised a family of ten children. After football, he remained in Geelong\n\nDocument 3 (Title: Arthur Rankin): Arthur Rankin Arthur Rankin (1816 \u2013 March 13, 1893) was a surveyor, entrepreneur and political figure in Canada West. Rankin was born in Montreal in 1816, the son of Irish immigrants. He ran away from home and became a cabin boy. In 1835, he returned to Canada, then qualified as a surveyor and moved to the Windsor area. In 1837, he smuggled an escaped slave from Ohio to Upper Canada. He served in the militia during the 1837 Rebellions. In 1843, with nine Ojibwas, he toured Britain with a \"wild west show\" that appeared before Queen Victoria. In 1844, he\n\nDocument 4 (Title: George Rankin): was buried, and was survived by his wife. 
George Rankin Major General George James Rankin, (1 May 1887 \u2013 28 December 1957) was an Australian soldier and politician. He served in both the House of Representatives and the Senate, representing the Country Party of Australia. Rankin was born at Bamawm, Victoria, the tenth child of Irish farmer James Rankin and Sarah, n\u00e9e Gallagher. He attended the local state school and became a farmer. In 1907, he joined the Militia, and was commissioned in the 9th Light Horse Regiment in 1909. He married Annie Isabella Oliver at Rochester, Victoria on 7\n\nDocument 5 (Title: George Rankin): George Rankin Major General George James Rankin, (1 May 1887 \u2013 28 December 1957) was an Australian soldier and politician. He served in both the House of Representatives and the Senate, representing the Country Party of Australia. Rankin was born at Bamawm, Victoria, the tenth child of Irish farmer James Rankin and Sarah, n\u00e9e Gallagher. He attended the local state school and became a farmer. In 1907, he joined the Militia, and was commissioned in the 9th Light Horse Regiment in 1909. He married Annie Isabella Oliver at Rochester, Victoria on 7 July 1912. 
In 1914, he was appointed a\n\nBased on your knowledge and the provided information, answer the question:\n{question}what is George Rankin's occupation?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe documents that are useful to answer the question \"What is George Rankin's occupation?\" are:\n\n* Document 1: George Claus Rankin\n* Document 4: George Rankin\n* Document 5: George Rankin\n\nThese documents provide information about George Rankin's occupations, which are:\n\n* Judge (Document 1)\n* Politician (Document 4 and Document 5)\n\nThe contents of these documents lead to the answer as follows:\n\n* Document 1 states that George Claus Rankin was a British judge in India, serving as a puisne judge of the High Court of Calcutta and then as Chief Justice.\n* Document 4 and Document 5 mention George Rankin as an Australian soldier and politician, serving in both the House of Representatives and the Senate, representing the Country Party of Australia.\n\nThe other documents (Document 2 and Document 3) do not provide information about George Rankin's occupation.\n\nBased on the provided information, the correct answers are:\n\n* Politician\n* Political leader\n* Political figure\n\nThese answers are supported by the contents of Documents 4 and 5, which describe George Rankin's career in politics.<|eot_id|>"===

===DEBUG Target:
tensor([   791,   9477,    430,    527,   5505,    311,   4320,    279,   3488,
           330,   3923,    374,  10058,  19856,    258,    596,  30747,   7673,
           527,   1473,      9,  12051,    220,     16,     25,  10058,  68119,
         19856,    258,    198,      9,  12051,    220,     19,     25,  10058,
         19856,    258,    198,      9,  12051,    220,     20,     25,  10058,
         19856,    258,    271,   9673,   9477,   3493,   2038,    922,  10058,
         19856,    258,    596,  60966,     11,    902,    527,   1473,      9,
         20819,    320,   7676,    220,     16,    340,      9,  16307,  12734,
           320,   7676,    220,     19,    323,  12051,    220,     20,    696,
           791,   8970,    315,   1521,   9477,   3063,    311,    279,   4320,
           439,  11263,   1473,      9,  12051,    220,     16,   5415,    430,
         10058,  68119,  19856,    258,    574,    264,   8013,  11913,    304,
          6890,     11,  13788,    439,    264,  44829,    818,  11913,    315,
           279,   5234,   7301,    315,   3400,  10453,   2629,    323,   1243,
           439,  14681,  12007,    627,      9,  12051,    220,     19,    323,
         12051,    220,     20,   6420,  10058,  19856,    258,    439,    459,
         13673,  27202,    323,  37038,     11,  13788,    304,   2225,    279,
          4783,    315,  40845,    323,    279,  10092,     11,  14393,    279,
         14438,   8722,    315,   8494,    382,    791,   1023,   9477,    320,
          7676,    220,     17,    323,  12051,    220,     18,      8,    656,
           539,   3493,   2038,    922,  10058,  19856,    258,    596,  30747,
           382,  29815,    389,    279,   3984,   2038,     11,    279,   4495,
         11503,    527,   1473,      9,  16307,  12734,    198,      9,  31597,
          7808,    198,      9,  31597,   7216,    271,   9673,  11503,    527,
          7396,    555,    279,   8970,    315,  45890,    220,     19,    323,
           220,     20,     11,    902,   7664,  10058,  19856,    258,    596,
          7076,    304,  11759,     13, 128009]) ==> "The documents that are useful to answer the question \"What is George Rankin's occupation?\" are:\n\n* Document 1: George Claus Rankin\n* Document 4: George Rankin\n* Document 5: George Rankin\n\nThese documents provide information about George Rankin's occupations, which are:\n\n* Judge (Document 1)\n* Politician (Document 4 and Document 5)\n\nThe contents of these documents lead to the answer as follows:\n\n* Document 1 states that George Claus Rankin was a British judge in India, serving as a puisne judge of the High Court of Calcutta and then as Chief Justice.\n* Document 4 and Document 5 mention George Rankin as an Australian soldier and politician, serving in both the House of Representatives and the Senate, representing the Country Party of Australia.\n\nThe other documents (Document 2 and Document 3) do not provide information about George Rankin's occupation.\n\nBased on the provided information, the correct answers are:\n\n* Politician\n* Political leader\n* Political figure\n\nThese answers are supported by the contents of Documents 4 and 5, which describe George Rankin's career in politics.<|eot_id|>"===
Tokenization metadata:
{"num_examples": 12868, "input_ids_avg_len": 1026.9486322660864, "input_ids_max_len": 1688, "input_ids_min_len": 759, "labels_avg_len": 1026.9486322660864, "labels_max_len": 1688, "labels_min_len": 759, "model_max_length": 4096}
/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/InstructRAG/src/finetune.py:132: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py:444: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
  warnings.warn(
  0%|                                                                                           | 0/25736 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]:   File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/InstructRAG/src/finetune.py", line 147, in <module>
[rank0]:     main()
[rank0]:   File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/InstructRAG/src/finetune.py", line 139, in main
[rank0]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/transformers/trainer.py", line 2611, in _inner_training_loop
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/accelerate/optimizer.py", line 178, in step
[rank0]:     self.optimizer.step(closure)
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 140, in wrapper
[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/optim/optimizer.py", line 493, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/optim/adamw.py", line 232, in step
[rank0]:     has_complex = self._init_group(
[rank0]:                   ^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/optim/adamw.py", line 175, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]:                           ^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.25 GiB of which 106.75 MiB is free. Including non-PyTorch memory, this process has 79.14 GiB memory in use. Of the allocated memory 78.17 GiB is allocated by PyTorch, and 151.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|                                                                                           | 0/25736 [00:01<?, ?it/s]
[rank0]:[W702 17:01:28.620971400 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0702 17:01:29.827000 27595 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 27678) of binary: /opt/conda/envs/flashrag/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/flashrag/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-02_17:01:29
  host      : 5e17ee1d7b71
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27678)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
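The traceback fails while AdamW is allocating its `exp_avg_sq` state, and the FSDP warning says it fell back to `NO_SHARD` because the world size is 1, so nothing is sharded. A rough back-of-the-envelope estimate of the memory held by weights, gradients, and the two AdamW moment buffers is sketched below. Two things here are assumptions, not facts from the log: the parameter count of about 8e9 (the chat-template tokens and the 4 checkpoint shards suggest a model of roughly that size, but the issue does not name it) and 4 bytes per value (the log warns that no torch dtype was specified, which may mean fp32). Activations and temporary buffers are ignored.

```python
GIB = 1024 ** 3


def full_finetune_state_gib(num_params: float, bytes_per_value: int = 4) -> dict:
    """Estimate memory (GiB) for unsharded full fine-tuning with AdamW.

    Counts one copy each of weights, gradients, and the two AdamW
    moment buffers (exp_avg, exp_avg_sq). Activations are not included.
    """
    per_copy = num_params * bytes_per_value / GIB
    return {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_exp_avg": per_copy,
        "adam_exp_avg_sq": per_copy,
        "total": 4 * per_copy,
    }


# ASSUMPTION: ~8e9 parameters stored as 4-byte values; neither is stated in the issue.
estimate = full_finetune_state_gib(num_params=8e9, bytes_per_value=4)
gpu_capacity_gib = 79.25  # total capacity reported in the OOM message

for name, gib in estimate.items():
    print(f"{name:>16}: {gib:7.1f} GiB")
print("fits on one GPU:", estimate["total"] <= gpu_capacity_gib)
```

Under these assumptions the four buffers alone come to roughly 119 GiB, which would exceed the 79.25 GiB reported by the error even at batch size 1. If the real parameter count or dtype differs, the numbers change accordingly.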
