This model leverages pre-trained ImageNet weights for the EfficientNet-based CNN component, while the Transformer component is trained from scratch to generate captions. The training data consists of 40K preprocessed images and captions from the Coco-Flickr Farsi dataset:
This project's dataset is a subset of the Coco-Flickr Farsi dataset, totaling 19 GB. To train the model effectively, the dataset has been filtered by keeping only captions between 10 and 25 tokens long (a minimal filtering sketch follows the plots below). The resulting filtered dataset comprises 40,000 images. Histograms and plots illustrating the distribution of caption lengths in the dataset are as follows:
Caption: Distribution of original caption lengths
Caption: Distribution of caption lengths in the filtered dataset
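The preprocessing code itself is not shown in this section; the sketch below illustrates the kind of length filter described above, assuming the captions are held in a plain Python dict mapping each image path to its list of Farsi caption strings (the function name and data layout are illustrative assumptions, not taken from the project code):

```python
# Illustrative sketch of the caption-length filter (assumed data layout:
# a dict mapping each image path to a list of Farsi caption strings).
MIN_LEN, MAX_LEN = 10, 25  # token bounds described above

def filter_by_caption_length(captions_map):
    """Keep only images whose captions contain between MIN_LEN and MAX_LEN tokens."""
    filtered = {}
    for image_path, captions in captions_map.items():
        kept = [c for c in captions if MIN_LEN <= len(c.split()) <= MAX_LEN]
        if kept:  # drop images left with no valid caption
            filtered[image_path] = kept
    return filtered
```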
The PIC model is designed with a three-part architecture: a Convolutional Neural Network (CNN) feature extractor, a Transformer encoder, and a Transformer decoder:
- CNN: The EfficientNetB0 model is employed as the initial layer to extract meaningful features from input images. The pre-trained ImageNet weights are used, and the feature extractor is frozen during training (see the feature-extractor sketch after this list).
- Encoder: The extracted image features are passed through a Transformer-based encoder. This encoder enhances the representation of the inputs, using self-attention mechanisms for better context understanding (see the encoder sketch after this list).
- Decoder: This model takes the encoder output and text data (sequences) as inputs. It is trained to generate captions using self-attention and cross-attention mechanisms. The decoder incorporates positional embeddings for sequence information and employs dropout layers for regularization (see the decoder sketch after this list).
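For reference, here is a minimal sketch of the frozen EfficientNetB0 feature extractor, assuming a TensorFlow/Keras implementation; the input resolution and names here are assumptions for illustration, not the project's actual values:

```python
from tensorflow import keras
from tensorflow.keras.applications import efficientnet

IMAGE_SIZE = (299, 299)  # assumed input resolution

def build_cnn_feature_extractor():
    """EfficientNetB0 with ImageNet weights, frozen and used purely as a feature extractor."""
    base = efficientnet.EfficientNetB0(
        input_shape=(*IMAGE_SIZE, 3),
        include_top=False,      # drop the ImageNet classification head
        weights="imagenet",     # pre-trained weights, as described above
    )
    base.trainable = False      # the extractor stays frozen during training
    # Flatten the spatial grid into a sequence of feature vectors for the encoder.
    outputs = keras.layers.Reshape((-1, base.output.shape[-1]))(base.output)
    return keras.Model(base.input, outputs, name="cnn_feature_extractor")
```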
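The encoder could look roughly like the block below (again a Keras-style sketch, not the project's exact code); dimensions and head counts are placeholders, and the block assumes the image features have already been projected to `embed_dim`:

```python
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoderBlock(layers.Layer):
    """Self-attention over the image features followed by a feed-forward projection."""

    def __init__(self, embed_dim, ff_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, training=False):
        # Self-attention lets every image-feature vector attend to all the others.
        attn_out = self.attention(query=inputs, value=inputs, key=inputs, training=training)
        x = self.layernorm_1(inputs + attn_out)   # residual connection + normalization
        ffn_out = self.ffn(x)
        return self.layernorm_2(x + ffn_out)      # residual connection + normalization
```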
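Finally, a sketch of the decoder side with learned positional embeddings, causal self-attention over the tokens, cross-attention to the encoder output, and dropout; it assumes TensorFlow 2.10+ (for `use_causal_mask`) and uses illustrative hyperparameter names rather than the project's actual configuration:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding for the caption sequence."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(vocab_size, embed_dim)
        self.position_embeddings = layers.Embedding(sequence_length, embed_dim)

    def call(self, inputs):
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        return self.token_embeddings(inputs) + self.position_embeddings(positions)


class TransformerDecoderBlock(layers.Layer):
    """Causal self-attention over caption tokens, cross-attention to the encoder
    output, a feed-forward projection, and dropout for regularization."""

    def __init__(self, embed_dim, ff_dim, num_heads, dropout_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.self_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.cross_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.dropout = layers.Dropout(dropout_rate)

    def call(self, inputs, encoder_outputs, training=False):
        # Causal self-attention: each token position attends only to earlier tokens.
        self_attn = self.self_attention(
            query=inputs, value=inputs, key=inputs, use_causal_mask=True, training=training
        )
        x = self.layernorm_1(inputs + self_attn)
        # Cross-attention: caption tokens attend to the encoded image features.
        cross_attn = self.cross_attention(
            query=x, value=encoder_outputs, key=encoder_outputs, training=training
        )
        x = self.layernorm_2(x + cross_attn)
        ffn_out = self.dropout(self.ffn(x), training=training)
        return self.layernorm_3(x + ffn_out)
```

In a full model, the decoder input would be the positional embedding of the (shifted) caption tokens, one or more such decoder blocks would be stacked, and a final softmax layer over the vocabulary would produce the next-token probabilities.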