
Bharat Scene Text Dataset


Bharat Scene Text Dataset (BSTD) is a large, real-world Indian scene-text dataset covering 13 Indian languages and English. It consists of 6,582 scene-text images, with polygon bounding-box annotations for 120,560 words and ground-truth text annotations for 100,495 cropped words. This dataset is an effort towards scaling scene-text detection and recognition systems to Indian languages. The current version of the dataset can be used for studying scene-text detection and cropped scene-text word recognition.

Release updates:

  • [8/8/24] First Public Release.

Data Statistics:

Scene Text Detection

| Total images | #Total bounding boxes (bb) | #Train images | #Test images |
|---|---|---|---|
| 6,582 | 126,292 | 5,263 (#bb = 94,128) | 1,319 (#bb = 32,164) |

Cropped Word Recognition

| Language | #Total recognition annotations | #Train | #Test |
|---|---|---|---|
| assamese | 4132 | 2627 | 1505 |
| bengali | 6304 | 4936 | 1368 |
| gujarati | 2899 | 1884 | 1015 |
| english | 41696 | 29123 | 12573 |
| hindi | 19773 | 14927 | 4846 |
| kannada | 2928 | 2208 | 720 |
| malayalam | 2940 | 2393 | 547 |
| marathi | 5113 | 3917 | 1196 |
| odia | 4192 | 3148 | 1044 |
| punjabi | 11199 | 8319 | 2880 |
| tamil | 2542 | 2029 | 513 |
| telugu | 2760 | 2215 | 545 |
| others | 19814 | - | - |
| total | 126292 | 77726 | 28752 |

Task 1: Scene text detection

Data Download:

Download the detection.zip from the link (zip file ~17 GB).

Annotations are in BSTD_release_v1.json

File structure

Detection/
│
├── A/
│   ├── image_xx.jpg
│   ├── ...
│   └── image_xx.jpg
├── B/
├── C/
├── ...
├── M/
└── BSTD_release_v1.json

Annotation Format (BSTD_release_v1.json):

Words in the image are annotated as polygons. The annotation file is a JSON file with the following format:

"folderName_image_id": {
    "annotations": 
    {
        "polygon_0":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        },
        ...,
        "polygon_n":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        }
    },
    "url": "url of the image",
    "image_name": "path to the image",
    "split": "train/test split",
    "folderName": "folder of the image"
}
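The annotation file above can be traversed with a few lines of standard-library Python. The helper below is a hypothetical sketch (not part of the repository); the inline `sample` mirrors the documented structure, and with the real data you would load `Detection/BSTD_release_v1.json` instead:

```python
import json  # used to load the real BSTD_release_v1.json file

def iter_polygons(annotations, split=None):
    """Yield (image_name, coordinates, text, language) for every polygon,
    optionally filtered by split ("train" or "test")."""
    for entry in annotations.values():
        if split is not None and entry["split"] != split:
            continue
        for poly in entry["annotations"].values():
            yield (entry["image_name"], poly["coordinates"],
                   poly["text"], poly["script_language"])

# Tiny inline sample mirroring the documented format; with the real data do:
#   with open("Detection/BSTD_release_v1.json", encoding="utf-8") as f:
#       annotations = json.load(f)
sample = {
    "A_image_1": {
        "annotations": {
            "polygon_0": {
                "coordinates": [[10, 12], [80, 12], [80, 40], [10, 40]],
                "text": "STATION",
                "script_language": "english",
            }
        },
        "url": "https://example.org/img.jpg",
        "image_name": "A/image_1.jpg",
        "split": "train",
        "folderName": "A",
    }
}

for name, coords, text, lang in iter_polygons(sample, split="train"):
    print(name, text, lang, len(coords))
```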

Task 2: Cropped word recognition

Data Download:

Download the recognition.zip from the link (zip file ~774 MB).

File structure

Recognition/
│
├── train/
│   ├── assamese/
│   │   ├── X_image_name_xx_xx.jpg
│   │   ├── X_image_name_xx_xx.jpg
│   │   └── X_image_name_xx_xx.jpg
│   ├── bengali/
│   │   ├── ...
│   ├── ...
│   └── urdu/
├── test/
│   ├── assamese/
│   ├── bengali/
│   ├── ...
│   └── urdu/
├── train.csv
└── test.csv

Annotation Format (train.csv / test.csv):

Files: recognition/train.csv and recognition/test.csv

Each row contains comma-separated values in the following order:

path_to_the_cropped_word_image, recognition_annotation, script_language
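Reading these files needs nothing beyond the standard library. The snippet below is a minimal sketch, assuming the documented three-column format; the in-memory `csv_text` stands in for `Recognition/train.csv`, and the row values are hypothetical:

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the documented format; with the real data,
# replace io.StringIO(csv_text) with open("Recognition/train.csv").
csv_text = (
    "train/hindi/A_image_141_3.jpg,shabd,hindi\n"
    "train/english/B_image_12_0.jpg,STATION,english\n"
    "train/english/C_image_7_1.jpg,EXIT,english\n"
)

rows = list(csv.reader(io.StringIO(csv_text)))
# Count annotations per script_language, e.g. to check class balance.
per_language = Counter(lang for _path, _text, lang in rows)
print(per_language)
```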

Data Conversion:

To convert the recognition data into LMDB files, use utils/fetch_lmdb_format_data.py.

Usage
python fetch_lmdb_format_data.py --recognition_folder_path ~bstd/recognition/ --split train --language hindi --output_directory lmdb/hindi/train/real/hindi

For more details on the arguments:

python fetch_lmdb_format_data.py --help

Task 3: Script Identification

For the task of script identification, the dataset has been configured in two ways.

First, 3-way classification: a dataset comprising images from three languages (English, Hindi, and a specific regional language) has been created. This setup allows training and testing a model that classifies these three classes.

| Folder (Lang.) | Regional Language (Name) | #Regional images (Train \| Test) | #English images (Train \| Test) | #Hindi images (Train \| Test) |
|---|---|---|---|---|
| assamese_ | Assamese | 2623 \| 1343 | 2623 \| 1343 | 2623 \| 1343 |
| bengali_ | Bengali | 4968 \| 1161 | 4968 \| 1161 | 4968 \| 1161 |
| gujarati_ | Gujarati | 1956 \| 693 | 1956 \| 693 | 1956 \| 693 |
| kannada_ | Kannada | 2241 \| 693 | 2241 \| 693 | 2241 \| 693 |
| malayalam_ | Malayalam | 2408 \| 567 | 2408 \| 567 | 2408 \| 567 |
| marathi_ | Marathi | 3932 \| 1045 | 3932 \| 1045 | 3932 \| 1045 |
| odia_ | Odia | 3176 \| 1022 | 3176 \| 1022 | 3176 \| 1022 |
| tamil_ | Tamil | 2041 \| 507 | 2041 \| 507 | 2041 \| 507 |
| telugu_ | Telugu | 2227 \| 482 | 2227 \| 482 | 2227 \| 482 |
| hindi_ | Hindi | - \| - | 14855 \| 4034 | 14855 \| 4034 |

This dataset can be downloaded from this link. A script, utils/make_dataset_for_scriptIdentification.py, has also been added so that this dataset can be created directly from the recognition dataset made available in the section above.
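The core of that construction is just filtering the recognition CSV down to three classes. The sketch below is a hypothetical helper (the repository's own logic lives in utils/make_dataset_for_scriptIdentification.py), shown on made-up rows in the documented CSV format:

```python
import csv
import io

def select_for_script_id(rows, regional_language):
    """Keep only the rows needed for the 3-way setup: English, Hindi,
    and one regional language. Hypothetical helper, not the repo script."""
    wanted = {"english", "hindi", regional_language}
    return [(path, lang) for path, _text, lang in rows if lang in wanted]

# Hypothetical rows in the recognition train.csv format.
csv_text = (
    "train/bengali/A_image_1_0.jpg,w1,bengali\n"
    "train/tamil/B_image_2_0.jpg,w2,tamil\n"
    "train/hindi/C_image_3_0.jpg,w3,hindi\n"
    "train/english/D_image_4_0.jpg,w4,english\n"
)
rows = list(csv.reader(io.StringIO(csv_text)))
subset = select_for_script_id(rows, "bengali")
print(subset)  # the tamil row is filtered out
```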

Second, 12-way classification: a dataset has been curated to classify scripts into 12 classes altogether. Each language contains 1,800 images in the train set and 478 images in the test set, all taken from the recognition dataset. The images can be downloaded from here.

How to use

Each folder contains images from three language folders. For example, the folder bengali_ includes cropped word images of Hindi, English, and Bengali. For the test/bengali_ folder, all image paths are listed in test.csv, which includes the correct language tag for each image. Similarly, all images in the train folder under each language-specific folder are listed in train.csv with their respective language tags.

Note: The hindi_ folder contains only cropped images of Hindi and English, with each image path listed in the CSV files.

Image subset used in (Vaidya et al., ICPR 2024) Preprint

Data Download:

The BSTD image split used for Hindi-to-English scene-text-to-scene-text translation can be downloaded from the link.

Images used for Hindi-to-English scene-text-to-scene-text translation can be downloaded directly from the link.

Data Visualisation of Detection Annotations:

To visualise detection annotations, run the following command:

python3 visualise.py <image_path> <path_to_BSTD_release_v1.json>

e.g.

python3 visualise.py D/image_141.jpg path_to_BSTD_release_v1.json
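visualise.py overlays the polygon annotations on the image. If you instead need axis-aligned boxes (for instance, for a detector that does not accept polygons), the conversion from a BSTD polygon is a few lines of pure Python. This is a hypothetical helper, not part of the repository:

```python
def polygon_to_bbox(coordinates):
    """Convert a polygon [[x1, y1], ..., [xn, yn]] from the annotation
    JSON into an axis-aligned box (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _y in coordinates]
    ys = [y for _x, y in coordinates]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: a slightly rotated quadrilateral annotation.
box = polygon_to_bbox([[10, 12], [80, 10], [82, 40], [11, 42]])
print(box)  # (10, 10, 82, 42)
```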

Some examples are below:

*(example visualisation images)*

Data Annotation

  • All images are collected from Wikimedia Commons (under the Creative Commons licence, CC BY-SA 4.0).
  • The detection and recognition annotations were created manually.

Related Indian Language Scene Text Recognition Toolkit

IndicPhotoOCR

Acknowledgement

This work was partly supported by MeitY, Government of India (Project Number: S/MeitY/AM/20210114) under NLTM-Bhashini.

Contact

For any queries, please contact us at:

Citation

@misc{BSTD,
  title        = {{B}harat {S}cene {T}ext {D}ataset},
  howpublished = {\url{https://github.com/Bhashini-IITJ/BharatSceneTextDataset}},
  year         = 2024,
}
