Why does training on my own data get killed after only a few epochs? #17030
Unanswered · changluzll asked this question in Q&A · Replies: 1 comment
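My GPU has 32 GB of memory. I set the batch size to 128 and num_workers to 16. Training worked a week ago, but when I tried again today the process keeps getting killed automatically.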
-
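My card is a 5090, and I am using PaddleOCR compiled from source with 50-series support. The first time, I trained the rec v5 pretrained model on about 60 images. Training went smoothly for 75 epochs and produced a trained model. A week later I ran the same code again and found that training is killed after only a few epochs, every time. GPU memory usage is around 10 GB and GPU utilization is around 20%, so this is clearly not a case of the hardware being too weak. I also tried lowering the batch size and the number of worker threads, but training is still killed automatically after a few epochs.
[2025/11/10 12:00:07] ppocr INFO: profiler_options : None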
[2025/11/10 12:00:07] ppocr INFO: train with paddle 3.1.0 and device Place(gpu:0)
[2025/11/10 12:00:07] ppocr INFO: Initialize indexes of datasets:['./train_data/train_list.txt']
[2025/11/10 12:00:07] ppocr INFO: Initialize indexes of datasets:['./train_data/val_list.txt']
W1110 12:00:07.285504 12943 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 12.0, Driver API Version: 13.0, Runtime API Version: 12.9
[2025/11/10 12:00:08] ppocr INFO: train dataloader has 6 iters
[2025/11/10 12:00:08] ppocr INFO: valid dataloader has 1 iters
[2025/11/10 12:00:08] ppocr INFO: load pretrain successful from ./pretrain_models/PP-OCRv5_server_rec_pretrained
[2025/11/10 12:00:08] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2025/11/10 12:00:11] ppocr INFO: epoch: [1/100], global_step: 6, lr: 0.000208, acc: 0.000000, norm_edit_dis: 0.309245, CTCLoss: 66.794888, NRTRLoss: 2.908411, loss: 69.703297, avg_reader_cost: 0.09391 s, avg_batch_cost: 0.31125 s, avg_samples: 6.8, ips: 21.84748 samples/s, eta: 0:05:08, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:12] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:14] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_1
[2025/11/10 12:00:16] ppocr INFO: epoch: [2/100], global_step: 10, lr: 0.000375, acc: 0.081250, norm_edit_dis: 0.689636, CTCLoss: 24.897188, NRTRLoss: 1.920134, loss: 26.701511, avg_reader_cost: 0.44859 s, avg_batch_cost: 0.50373 s, avg_samples: 5.0, ips: 9.92604 samples/s, eta: 0:08:00, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:16] ppocr INFO: epoch: [2/100], global_step: 12, lr: 0.000458, acc: 0.237500, norm_edit_dis: 0.748177, CTCLoss: 16.367727, NRTRLoss: 1.920134, loss: 18.176137, avg_reader_cost: 0.00061 s, avg_batch_cost: 0.02315 s, avg_samples: 1.8, ips: 77.76478 samples/s, eta: 0:06:50, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:18] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:19] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_2
[2025/11/10 12:00:21] ppocr INFO: epoch: [3/100], global_step: 18, lr: 0.000500, acc: 0.312500, norm_edit_dis: 0.838594, CTCLoss: 13.625386, NRTRLoss: 1.777071, loss: 15.545521, avg_reader_cost: 0.36667 s, avg_batch_cost: 0.46468 s, avg_samples: 6.8, ips: 14.63385 samples/s, eta: 0:07:01, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:22] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:23] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_3
[2025/11/10 12:00:25] ppocr INFO: epoch: [4/100], global_step: 20, lr: 0.000500, acc: 0.406250, norm_edit_dis: 0.852917, CTCLoss: 12.437979, NRTRLoss: 1.768700, loss: 14.284549, avg_reader_cost: 0.37056 s, avg_batch_cost: 0.42594 s, avg_samples: 2.6, ips: 6.10417 samples/s, eta: 0:08:21, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:26] ppocr INFO: epoch: [4/100], global_step: 24, lr: 0.000500, acc: 0.599999, norm_edit_dis: 0.875000, CTCLoss: 8.322095, NRTRLoss: 1.668357, loss: 10.085095, avg_reader_cost: 0.00097 s, avg_batch_cost: 0.08416 s, avg_samples: 4.2, ips: 49.90531 samples/s, eta: 0:07:15, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:28] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:29] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_4
[2025/11/10 12:00:31] ppocr INFO: epoch: [5/100], global_step: 30, lr: 0.000499, acc: 0.599999, norm_edit_dis: 0.901042, CTCLoss: 6.642822, NRTRLoss: 1.543158, loss: 8.218270, avg_reader_cost: 0.40208 s, avg_batch_cost: 0.48889 s, avg_samples: 6.8, ips: 13.90913 samples/s, eta: 0:07:17, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:32] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:34] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_5
[2025/11/10 12:00:36] ppocr INFO: epoch: [6/100], global_step: 36, lr: 0.000499, acc: 0.599999, norm_edit_dis: 0.933173, CTCLoss: 3.649432, NRTRLoss: 1.463548, loss: 5.092332, avg_reader_cost: 0.43825 s, avg_batch_cost: 0.52280 s, avg_samples: 6.8, ips: 13.00693 samples/s, eta: 0:07:22, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
[2025/11/10 12:00:37] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/latest
[2025/11/10 12:00:39] ppocr INFO: save model in ./output/PP-OCRv5_server_rec/iter_epoch_6
[2025/11/10 12:00:40] ppocr INFO: epoch: [7/100], global_step: 40, lr: 0.000498, acc: 0.612499, norm_edit_dis: 0.935958, CTCLoss: 3.621450, NRTRLoss: 1.453031, loss: 5.050488, avg_reader_cost: 0.34274 s, avg_batch_cost: 0.40490 s, avg_samples: 3.6, ips: 8.89112 samples/s, eta: 0:07:32, max_mem_reserved: 2669 MB, max_mem_allocated: 2215 MB
Killed
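One thing worth checking: the log ends with the shell's bare "Killed" message, which means the process received SIGKILL. Paddle reports max_mem_reserved of only 2669 MB, so GPU memory does not look like the limit. On Linux a common cause is the kernel's out-of-memory killer acting on host RAM, for example when many dataloader workers each hold a copy of the data. That is an assumption, not something the log confirms. A minimal sketch for watching host RAM while training runs; the function names are made up for illustration and it only reads the standard /proc/meminfo file:

```python
# Sketch: log host RAM (not GPU memory) while training runs, to see whether
# it drops toward zero shortly before the process is killed.
# Linux-only, because it reads /proc/meminfo. Helper names are illustrative.
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of host RAM still available (MemAvailable / MemTotal)."""
    return fields["MemAvailable"] / fields["MemTotal"]


def watch(interval_s=1.0, path="/proc/meminfo"):
    """Print available host RAM once per interval until interrupted."""
    while True:
        with open(path) as f:
            fields = parse_meminfo(f.read())
        print(
            time.strftime("%H:%M:%S"),
            f"MemAvailable={fields['MemAvailable'] / 1024:.0f} MiB",
            f"({available_fraction(fields):.1%} of total)",
        )
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal while training runs shows whether available memory collapses just before the kill. If it does, the kernel log (`dmesg`) normally contains an "Out of memory: Killed process" entry naming the training process.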