Generating image content descriptions is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this project, we compare models that fit the encoder-decoder framework, experimenting with different classes of backbones: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. To keep the comparison justifiable and consistent, we trained and validated every framework and its backbone on the Flickr8k dataset. We additionally evaluated our models on a subset of COCO to assess generalization. Our results are as follows:
- The best model is DenseNet161 + Transformer, with a BLEU@1 test score of 65.98 on Flickr8k.
- Experiments showed that configurations such as CNN + LSTM are cheaper in terms of computation time, but they cannot reach the caption quality of Transformer-based architectures.
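All scores above are BLEU@n metrics. As a simplified, self-contained sketch of what BLEU@1 measures (unigram precision with count clipping; this is illustrative only and omits the brevity penalty and corpus-level aggregation used for the reported numbers):

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision, i.e. BLEU@1 without the brevity penalty.

    Each candidate word counts as correct at most as many times as it
    appears in the reference (count clipping).
    """
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return 100.0 * clipped / total

# 5 of the 6 candidate words match the reference -> ~83.3
print(bleu1("a dog runs in the park", "a dog runs through the park"))
```

Real BLEU evaluation (e.g. in NLTK or sacrebleu) extends this to higher-order n-grams, multiple references, and a brevity penalty.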
See report for more details! ⭐
Team members:
- Albert Saiapin 🌱
- Farid Davletshin 🌲
- Fakhriddin Tojiboev 🌴
- Olga Gorbunova 🌳
- Evgeniy Garsiya 🌿
- Hai Le 🍁
- Lina Bashaeva 🌼
- Dmitriy Gilyov 🌻
We use the conda package manager to install the required Python packages. To improve the speed and reliability of package version resolution, it is advised to use mamba-forge (installation), which works on top of conda. Once mamba is installed, run the following command from the root of the repository:
mamba env create -f environment.yml
This will create a new environment named img_caption with most of the required packages already installed. You can install additional packages by running:
mamba install <package name>
Then run the following commands to install the PyTorch libraries:
conda activate img_caption
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c pytorch torchtext
In order to read and run the Jupyter notebooks you may follow either of two options:
- [recommended] use the notebook-compatibility features of modern IDEs, e.g. via the python and jupyter extensions of VS Code.
- install Jupyter notebook packages: either with mamba install jupyterlab or with mamba install jupyter notebook
Note: If you prefer to use conda, just replace mamba commands with conda, e.g. instead of mamba install use conda install.
- Clone this repository:

$ git clone https://github.com/tojiboyevf/image_captioning.git

- Move to the project's directory and download the Flickr8k, COCO_2014 and GloVe datasets:

$ cd image_captioning
$ bash load_flickr8k.sh
$ bash load_glove.sh
$ bash load_coco.sh

If you want to re-train our models and/or inspect the evaluation results, you are welcome to the examples folder.
Open any notebook from there and follow the instructions inside.
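At inference time, every model in this comparison generates a caption token by token: the decoder scores candidate next words given the tokens produced so far, and the highest-scoring one is appended until an end-of-sequence token appears. The toy sketch below illustrates that greedy loop; `toy_scores` is a stand-in for a real decoder, not our actual code:

```python
EOS = "<eos>"

def greedy_decode(step_scores, max_len=20):
    """Repeatedly pick the highest-scoring next token until <eos>.

    `step_scores(prefix)` returns a dict mapping each candidate token
    to a score, given the tokens generated so far.
    """
    caption = ["<bos>"]
    for _ in range(max_len):
        scores = step_scores(caption)
        next_tok = max(scores, key=scores.get)  # greedy choice
        if next_tok == EOS:
            break
        caption.append(next_tok)
    return caption[1:]  # drop the <bos> marker

def toy_scores(prefix):
    """Hypothetical decoder that deterministically emits 'a dog runs'."""
    canned = ["a", "dog", "runs", EOS]
    nxt = canned[len(prefix) - 1]
    return {t: (1.0 if t == nxt else 0.0) for t in canned}

print(greedy_decode(toy_scores))  # ['a', 'dog', 'runs']
```

Beam search is a common alternative to this greedy choice, trading extra computation for better captions.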
Results on Flickr8k:

| model | split | bleu 1 | bleu 2 | bleu 3 | bleu 4 |
|---|---|---|---|---|---|
| vgg16 + lstm | train | 55.53 | 34.94 | 21.94 | 14.02 |
| | val | 55.14 | 34.42 | 21.36 | 13.47 |
| | test | 55.41 | 34.34 | 21.13 | 13.29 |
| vgg16 + transformer | train | 53.13 | 33.63 | 21.01 | 13.21 |
| | val | 52.79 | 33.07 | 20.13 | 12.31 |
| | test | 52.76 | 33.04 | 20.27 | 12.38 |
| densenet161 + lstm | train | 55.05 | 31.18 | 17.79 | 10.84 |
| | val | 55.18 | 31.23 | 17.75 | 10.78 |
| | test | 55.27 | 30.76 | 17.11 | 10.23 |
| densenet161 + transformer | train | 69.55 | 49.93 | 35.55 | 25.03 |
| | val | 65.71 | 44.46 | 29.94 | 20.13 |
| | test | 65.98 | 44.79 | 30.04 | 19.75 |
| DeiT + lstm | train | 56.06 | 34.40 | 20.97 | 13.24 |
| | val | 53.23 | 30.86 | 17.62 | 10.91 |
| | test | 53.48 | 31.06 | 17.61 | 10.61 |
| DeiT + transformer | train | 70.43 | 53.22 | 42.16 | 35.15 |
| | val | 62.71 | 43.71 | 34.58 | 29.32 |
| | test | 62.57 | 44.09 | 35.11 | 29.80 |
| inceptionV3 + transformer | train | 61.44 | 41.09 | 27.52 | 18.29 |
| | val | 60.37 | 39.84 | 26.26 | 17.25 |
| | test | 60.19 | 39.19 | 25.70 | 16.70 |
| resnet34 + transformer | train | 67.23 | 48.05 | 34.08 | 23.84 |
| | val | 63.33 | 42.58 | 28.69 | 19.22 |
| | test | 63.70 | 42.92 | 29.19 | 19.51 |
Results on the COCO subset:

| model | bleu 1 | bleu 2 | bleu 3 | bleu 4 |
|---|---|---|---|---|
| vgg16 + lstm | 46.71 | 23.75 | 12.25 | 8.39 |
| vgg16 + transformer | 50.24 | 27.14 | 16.10 | 8.80 |
| densenet161 + lstm | 49.33 | 23.25 | 11.70 | 9.46 |
| densenet161 + transformer | 55.38 | 30.71 | 17.09 | 9.79 |
| DeiT + lstm | 45.73 | 22.04 | 11.14 | 9.12 |
| DeiT + transformer | 53.09 | 29.76 | 16.92 | 9.95 |
| inceptionV3 + transformer | 49.14 | 26.49 | 14.21 | 8.11 |