
Bharat Scene Text Dataset


Bharat Scene Text Dataset (BSTD) is a large, real-world Indian scene-text dataset covering 13 Indian languages and English. It consists of 6,582 scene-text images, with polygon bounding-box annotations for 120,560 words and ground-truth text annotations for 100,495 cropped words. This dataset is an effort towards scaling scene-text detection and recognition systems to Indian languages. The current version of the dataset can be used for studying scene-text detection and cropped scene-text word recognition.

Release updates:

  • [8/8/24] First Public Release.

Data Statistics:

Scene Text Detection

| Total images | #Total bounding boxes (bb) | #Train images | #Test images |
|---|---|---|---|
| 6,582 | 126,292 | 5,263 (#bb = 94,128) | 1,319 (#bb = 32,164) |

Cropped Word Recognition

| Language | #Total recognition annotations | #Train | #Test |
|---|---|---|---|
| assamese | 4132 | 2627 | 1505 |
| bengali | 6304 | 4936 | 1368 |
| gujarati | 2899 | 1884 | 1015 |
| english | 41696 | 29123 | 12573 |
| hindi | 19773 | 14927 | 4846 |
| kannada | 2928 | 2208 | 720 |
| malayalam | 2940 | 2393 | 547 |
| marathi | 5113 | 3917 | 1196 |
| odia | 4192 | 3148 | 1044 |
| punjabi | 11199 | 8319 | 2880 |
| tamil | 2542 | 2029 | 513 |
| telugu | 2760 | 2215 | 545 |
| others | 19814 | - | - |
| total | 126292 | 77726 | 28752 |

Task 1: Scene text detection

Data Download:

Download the detection.zip from the link (zip file ~17 GB).

Annotations are in BSTD_release_v1.json

File structure

Detection/
│
├── A/
│   ├── image_xx.jpg
│   ├── ...
│   └── image_xx.jpg
├── B/
├── C/
├── ...
├── M/
└── BSTD_release_v1.json

Annotation Format (BSTD_release_v1.json):

Words in the image are annotated as polygons. The annotation file is a JSON file with the following format:

"folderName_image_id": {
    "annotations": 
    {
        "polygon_0":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        },
        ...,
        "polygon_n":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        }
    },
    "url": "url of the image",
    "image_name": "path to the image",
    "split": "train/test split",
    "folderName": "folder of the image"
}
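The annotation file above can be traversed with a few lines of standard-library Python. The helper below is a hypothetical sketch (not part of the repository); the inline `sample` mirrors the documented structure, and with the real data you would load `Detection/BSTD_release_v1.json` instead:

```python
import json  # used to load the real BSTD_release_v1.json file

def iter_polygons(annotations, split=None):
    """Yield (image_name, coordinates, text, language) for every polygon,
    optionally filtered by split ("train" or "test")."""
    for entry in annotations.values():
        if split is not None and entry["split"] != split:
            continue
        for poly in entry["annotations"].values():
            yield (entry["image_name"], poly["coordinates"],
                   poly["text"], poly["script_language"])

# Tiny inline sample mirroring the documented format; with the real data do:
#   with open("Detection/BSTD_release_v1.json", encoding="utf-8") as f:
#       annotations = json.load(f)
sample = {
    "A_image_1": {
        "annotations": {
            "polygon_0": {
                "coordinates": [[10, 12], [80, 12], [80, 40], [10, 40]],
                "text": "STATION",
                "script_language": "english",
            }
        },
        "url": "https://example.org/img.jpg",
        "image_name": "A/image_1.jpg",
        "split": "train",
        "folderName": "A",
    }
}

for name, coords, text, lang in iter_polygons(sample, split="train"):
    print(name, text, lang, len(coords))
```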

Task 2: Cropped word recognition

Data Download:

Download the recognition.zip from the link (zip file ~774 MB).

File structure

Recognition/
│
├── train/
│   ├── assamese/
│   │   ├── X_image_name_xx_xx.jpg
│   │   ├── X_image_name_xx_xx.jpg
│   │   └── X_image_name_xx_xx.jpg
│   ├── bengali/
│   │   ├── ...
│   ├── ...
│   └── urdu/
├── test/
│   ├── assamese/
│   ├── bengali/
│   ├── ...
│   └── urdu/
├── train.csv
└── test.csv

Annotation Format (train.csv / test.csv):

Files: recognition/train.csv and recognition/test.csv

Each row contains comma-separated values in the following order:

path_to_the_cropped_word_image, recognition_annotation, script_language
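Reading these files needs nothing beyond the standard library. The snippet below is a minimal sketch, assuming the documented three-column format; the in-memory `csv_text` stands in for `Recognition/train.csv`, and the row values are hypothetical:

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the documented format; with the real data,
# replace io.StringIO(csv_text) with open("Recognition/train.csv").
csv_text = (
    "train/hindi/A_image_141_3.jpg,shabd,hindi\n"
    "train/english/B_image_12_0.jpg,STATION,english\n"
    "train/english/C_image_7_1.jpg,EXIT,english\n"
)

rows = list(csv.reader(io.StringIO(csv_text)))
# Count annotations per script_language, e.g. to check class balance.
per_language = Counter(lang for _path, _text, lang in rows)
print(per_language)
```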

Data Conversion:

To convert the recognition data into LMDB files, use utils/fetch_lmdb_format_data.py.

Usage
python fetch_lmdb_format_data.py --recognition_folder_path ~bstd/recognition/ --split train --language hindi --output_directory lmdb/hindi/train/real/hindi

For more details on the arguments:

python fetch_lmdb_format_data.py --help

Task 3: Script Identification

For the task of script identification, the dataset has been configured in two ways.

First, 3-way classification: a dataset comprising images from three languages (English, Hindi, and a specific regional language) has been created. This setup allows training and testing a model that classifies these three classes.

| Folder (Lang.) | Regional Language (Name) | #Regional images (Train \| Test) | #English images (Train \| Test) | #Hindi images (Train \| Test) |
|---|---|---|---|---|
| assamese_ | Assamese | 2623 \| 1343 | 2623 \| 1343 | 2623 \| 1343 |
| bengali_ | Bengali | 4968 \| 1161 | 4968 \| 1161 | 4968 \| 1161 |
| gujarati_ | Gujarati | 1956 \| 693 | 1956 \| 693 | 1956 \| 693 |
| kannada_ | Kannada | 2241 \| 693 | 2241 \| 693 | 2241 \| 693 |
| malayalam_ | Malayalam | 2408 \| 567 | 2408 \| 567 | 2408 \| 567 |
| marathi_ | Marathi | 3932 \| 1045 | 3932 \| 1045 | 3932 \| 1045 |
| odia_ | Odia | 3176 \| 1022 | 3176 \| 1022 | 3176 \| 1022 |
| tamil_ | Tamil | 2041 \| 507 | 2041 \| 507 | 2041 \| 507 |
| telugu_ | Telugu | 2227 \| 482 | 2227 \| 482 | 2227 \| 482 |
| hindi_ | Hindi | - \| - | 14855 \| 4034 | 14855 \| 4034 |

This dataset can be downloaded from this link. A script, utils/make_dataset_for_scriptIdentification.py, has also been added so that this dataset can be created directly from the recognition dataset made available in the section above.
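The core of that construction is just filtering the recognition CSV down to three classes. The sketch below is a hypothetical helper (the repository's own logic lives in utils/make_dataset_for_scriptIdentification.py), shown on made-up rows in the documented CSV format:

```python
import csv
import io

def select_for_script_id(rows, regional_language):
    """Keep only the rows needed for the 3-way setup: English, Hindi,
    and one regional language. Hypothetical helper, not the repo script."""
    wanted = {"english", "hindi", regional_language}
    return [(path, lang) for path, _text, lang in rows if lang in wanted]

# Hypothetical rows in the recognition train.csv format.
csv_text = (
    "train/bengali/A_image_1_0.jpg,w1,bengali\n"
    "train/tamil/B_image_2_0.jpg,w2,tamil\n"
    "train/hindi/C_image_3_0.jpg,w3,hindi\n"
    "train/english/D_image_4_0.jpg,w4,english\n"
)
rows = list(csv.reader(io.StringIO(csv_text)))
subset = select_for_script_id(rows, "bengali")
print(subset)  # the tamil row is filtered out
```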

Second, 12-way classification: a dataset has been curated to classify scripts into 12 classes altogether. Each language contains 1,800 images in the train set and 478 images in the test set, all taken from the recognition dataset. The images can be downloaded from here.

How to use

Each folder contains images from three language folders. For example, the folder bengali_ includes cropped word images of Hindi, English, and Bengali. For the test/bengali_ folder, all image paths are listed in test.csv, which includes the correct language tag for each image. Similarly, all images in the train folder under each language-specific folder are listed in train.csv with their respective language tags.

Note: The hindi_ folder contains only cropped images of Hindi and English, with each image path listed in the CSV files.

Image subset used in (Vaidya et al., ICPR 2024) Preprint

Data Download:

The BSTD image split used for Hindi-to-English scene-text-to-scene-text translation can be downloaded from the link.

Images used for Hindi-to-English scene-text-to-scene-text translation can be downloaded directly from the link.

Data Visualisation of Detection Annotations:

To visualise detection annotations, run the following command:

python3 visualise.py <image_path> <path_to_BSTD_release_v1.json>

e.g.

python3 visualise.py D/image_141.jpg path_to_BSTD_release_v1.json
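visualise.py overlays the polygon annotations on the image. If you instead need axis-aligned boxes (for instance, for a detector that does not accept polygons), the conversion from a BSTD polygon is a few lines of pure Python. This is a hypothetical helper, not part of the repository:

```python
def polygon_to_bbox(coordinates):
    """Convert a polygon [[x1, y1], ..., [xn, yn]] from the annotation
    JSON into an axis-aligned box (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _y in coordinates]
    ys = [y for _x, y in coordinates]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: a slightly rotated quadrilateral annotation.
box = polygon_to_bbox([[10, 12], [80, 10], [82, 40], [11, 42]])
print(box)  # (10, 10, 82, 42)
```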

Some examples are below:

*(example visualisation images)*

Data Annotation

  • All images are collected from Wikimedia Commons (under the Creative Commons licence, CC BY-SA 4.0).
  • The detection and recognition annotations were created manually.

Related Indian Language Scene Text Recognition Toolkit

IndicPhotoOCR

Acknowledgement

This work was partly supported by MeitY, Government of India (Project Number: S/MeitY/AM/20210114) under NLTM-Bhashini.

Contact

For any queries, please contact us at:

Citation

@misc{BSTD,
  title        = {{B}harat {S}cene {T}ext {D}ataset},
  howpublished = {\url{https://github.com/Bhashini-IITJ/BharatSceneTextDataset}},
  year         = 2024,
}
