[Help] Trouble fine-tuning PaddleOCR for Vietnamese OCR #16624
Unanswered
tttiuem2k3 asked this question in Q&A
Hello, you might want to try the PP-OCRv5 Latin model. Its configuration file is here: https://github.com/PaddlePaddle/PaddleOCR/blob/main/configs/rec/PP-OCRv5/multi_language/latin_PP-OCRv5_mobile_rec.yml
You can download the training weights from the PaddleOCR multilingual recognition documentation: https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/OCR.html#1-ocr-pipeline-introduction
I’m developing an OCR system for Vietnamese invoices and documents.
My goal is to fine-tune the recognition (rec) model in PaddleOCR so it can accurately read Vietnamese text (with tone marks, diacritics, and spaces).
Problem
I suspect the issue is a mismatch between the recognition model and my Vietnamese character set (diacritics are likely OOV).
What I’ve tried
Used latin_PP-OCRv3_mobile_rec.yml with:
Global.character_type: latin
Character.character_dict_path: ./data/vietnamese/vi_vietnam.txt
My vi_vietnam.txt dictionary includes all Vietnamese letters, in both lowercase and uppercase:
"àáạảãăằắặẳẵâầấậẩẫèéẹẻẽêềếệểễ"
"ìíịỉĩòóọỏõôồốộổỗơờớợởỡùúụủũ"
"ưừứựửữỳýỵỷỹđ"
"ÀÁẠẢÃĂẰẮẶẲẴÂẦẤẬẨẪÈÉẸẺẼÊỀẾỆỂỄ"
"ÌÍỊỈĨÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠÙÚỤỦŨ"
"ƯỪỨỰỬỮỲÝỴỶỸĐ"
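For reference, here is a minimal sketch of how a dictionary file like this could be generated so that every entry is a single NFC code point. It assumes the usual PaddleOCR recognition dictionary layout of one character per line. The helper names `build_charset` and `write_dict` are hypothetical, not part of PaddleOCR. The space character is left out on the assumption that it is enabled through `Global.use_space_char` rather than a dictionary line.

```python
import unicodedata

# Lowercase accented letters quoted in the post; uppercase forms are derived.
VI_LOWER = (
    "àáạảãăằắặẳẵâầấậẩẫèéẹẻẽêềếệểễ"
    "ìíịỉĩòóọỏõôồốộổỗơờớợởỡùúụủũ"
    "ưừứựửữỳýỵỷỹđ"
)
ASCII_LOWER = "abcdefghijklmnopqrstuvwxyz"


def build_charset(extra=""):
    """Return a sorted list of single-code-point NFC characters.

    `extra` is for digits, punctuation, currency signs and so on
    (an assumption: choose whatever your invoices actually contain).
    """
    source = unicodedata.normalize("NFC", ASCII_LOWER + VI_LOWER + extra)
    chars = set()
    for ch in source:
        for variant in (ch, ch.upper()):
            nfc = unicodedata.normalize("NFC", variant)
            # Keep only entries that are exactly one code point, since
            # the dictionary holds one character per line.
            if len(nfc) == 1 and not nfc.isspace():
                chars.add(nfc)
    return sorted(chars)


def write_dict(path, chars):
    """Write one character per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(ch + "\n")


if __name__ == "__main__":
    charset = build_charset(extra="0123456789.,:;/-()%")
    write_dict("vi_vietnam.txt", charset)
    print(f"wrote {len(charset)} characters")
```

Regenerating the file this way, rather than typing it by hand, makes it harder for a decomposed (NFD) letter or a duplicate line to slip in unnoticed.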
Dataset: ~25k cropped word images (train_list.txt / val_list.txt format).
Training command:
python3 tools/train.py \
  -c configs/rec/PP-OCRv3/multi_language/latin_PP-OCRv3_mobile_rec.yml \
  -o Global.use_gpu=True \
     Global.epoch_num=15 \
     Global.save_model_dir=./model/ppocrv3_vi \
     Global.character_type=latin \
     Global.character_dict_path=./data/vietnamese/vi_vietnam.txt
Still, accuracy remains near zero and Vietnamese diacritics are ignored or misread.
Questions / Need Advice
Which recognition model is best for Vietnamese?
Should I use PP-OCRv5 multilingual, PP-OCRv5 server, or SVTR instead of the v3 Latin model?
Is there a pretrained checkpoint known to work better for tonal or Unicode languages?
YAML configuration for Vietnamese dictionary
Should I set Character.character_type to "ch" to ensure PaddleOCR actually loads my vi_vietnam.txt?
Which parameters are essential for Vietnamese invoices?
Any example YAML file for training a custom language recognizer?
Training settings
Recommended epoch count for convergence (40–500 epochs?)
Suggested data augmentation for printed invoice text (should I disable heavy distortions)?
How to check OOV or Unicode normalization (NFC) issues before training?
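On the last point, one way to check for OOV characters and normalization problems before training is sketched below. It assumes the label files (train_list.txt / val_list.txt) hold one `image_path<TAB>label` pair per line and that the dictionary has one character per line. The function names `load_dict` and `audit_labels` are hypothetical helpers, not PaddleOCR APIs.

```python
import unicodedata
from collections import Counter


def load_dict(path):
    """Load a one-character-per-line dictionary into a set."""
    with open(path, encoding="utf-8") as f:
        entries = (line.rstrip("\r\n") for line in f)
        return {e for e in entries if e}


def audit_labels(label_path, charset, allow_space=True):
    """Scan a label file and report dictionary and normalization issues.

    Returns (oov, non_nfc, malformed):
      oov       Counter of characters that are not in the dictionary
      non_nfc   line numbers whose label is not NFC-normalized
      malformed line numbers without a tab-separated label
    """
    oov = Counter()
    non_nfc = []
    malformed = []
    with open(label_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\r\n")
            if not line:
                continue
            parts = line.split("\t", 1)
            if len(parts) != 2:
                malformed.append(lineno)
                continue
            label = parts[1]
            if unicodedata.normalize("NFC", label) != label:
                non_nfc.append(lineno)
            for ch in label:
                if ch == " " and allow_space:
                    continue
                if ch not in charset:
                    oov[ch] += 1
    return oov, non_nfc, malformed


if __name__ == "__main__":
    charset = load_dict("./data/vietnamese/vi_vietnam.txt")
    oov, non_nfc, malformed = audit_labels("train_list.txt", charset)
    for ch, count in oov.most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"OOV U+{ord(ch):04X} {name}: {count}")
    print(f"{len(non_nfc)} non-NFC labels, {len(malformed)} malformed lines")
```

If the OOV report is dominated by combining marks (for example U+0323 COMBINING DOT BELOW), the labels are probably stored in decomposed form, and normalizing them to NFC before training would be the first thing to try.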
Goal
I want to build a reliable Vietnamese OCR recognizer for invoices/forms using PaddleOCR that can recognize diacritics, spaces, and case-sensitive text correctly.
If anyone has a working example config or fine-tuned model for Vietnamese, I would really appreciate your guidance or YAML reference.
Thank you so much for your time and help 🙏