Generating image content descriptions is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this project, we compare models that fit the encoder-decoder framework, experimenting with different classes of backbones: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. To keep the comparison justifiable and consistent, we trained and validated every framework and its backbone on the Flickr8k dataset. We additionally evaluated our models on a subset of COCO to assess generalization. Our results are as follows:
- The best model is DenseNet161 + Transformer, with a BLEU@1 test score of 65.98 on Flickr8k.
- Experiments showed that configurations such as CNN + LSTM are cheaper in terms of computation time, but they cannot reach the caption quality of Transformer-based architectures.
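All scores above are BLEU@n metrics. As a simplified, self-contained sketch of what BLEU@1 measures (unigram precision with count clipping; this is illustrative only and omits the brevity penalty and corpus-level aggregation used for the reported numbers):

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision, i.e. BLEU@1 without the brevity penalty.

    Each candidate word counts as correct at most as many times as it
    appears in the reference (count clipping).
    """
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return 100.0 * clipped / total

# 5 of the 6 candidate words match the reference -> ~83.3
print(bleu1("a dog runs in the park", "a dog runs through the park"))
```

Real BLEU evaluation (e.g. in NLTK or sacrebleu) extends this to higher-order n-grams, multiple references, and a brevity penalty.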
See report for more details! ⭐
Team members:
- Albert Saiapin 🌱
- Farid Davletshin 🌲
- Fakhriddin Tojiboev 🌴
- Olga Gorbunova 🌳
- Evgeniy Garsiya 🌿
- Hai Le 🍁
- Lina Bashaeva 🌼
- Dmitriy Gilyov 🌻
We use the conda package manager to install the required Python packages. To improve the speed and reliability of package version resolution, it is advised to use mamba-forge (installation), which works on top of conda. Once mamba is installed, run the following command from the root of the repository:
mamba env create -f environment.yml
This will create a new environment named img_caption with most of the required packages already installed. You can install additional packages by running:
mamba install <package name>
Then run the following commands to install the PyTorch libraries:
conda activate img_caption
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c pytorch torchtext
In order to read and run the Jupyter notebooks you may follow either of two options:
- [recommended] use the notebook-compatibility features of modern IDEs, e.g. via the python and jupyter extensions of VS Code.
- install Jupyter notebook packages: either with mamba install jupyterlab or with mamba install jupyter notebook
Note: If you prefer to use conda, just replace mamba commands with conda, e.g. instead of mamba install use conda install.
- Clone this repository:

$ git clone https://github.com/tojiboyevf/image_captioning.git

- Move to the project's directory and download the Flickr8k, COCO_2014 and GloVe datasets:

$ cd image_captioning
$ bash load_flickr8k.sh
$ bash load_glove.sh
$ bash load_coco.sh

If you want to re-train our models and/or inspect the evaluation results, you are welcome to the examples folder.
Open any notebook from there and follow the instructions inside.
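At inference time, every model in this comparison generates a caption token by token: the decoder scores candidate next words given the tokens produced so far, and the highest-scoring one is appended until an end-of-sequence token appears. The toy sketch below illustrates that greedy loop; `toy_scores` is a stand-in for a real decoder, not our actual code:

```python
EOS = "<eos>"

def greedy_decode(step_scores, max_len=20):
    """Repeatedly pick the highest-scoring next token until <eos>.

    `step_scores(prefix)` returns a dict mapping each candidate token
    to a score, given the tokens generated so far.
    """
    caption = ["<bos>"]
    for _ in range(max_len):
        scores = step_scores(caption)
        next_tok = max(scores, key=scores.get)  # greedy choice
        if next_tok == EOS:
            break
        caption.append(next_tok)
    return caption[1:]  # drop the <bos> marker

def toy_scores(prefix):
    """Hypothetical decoder that deterministically emits 'a dog runs'."""
    canned = ["a", "dog", "runs", EOS]
    nxt = canned[len(prefix) - 1]
    return {t: (1.0 if t == nxt else 0.0) for t in canned}

print(greedy_decode(toy_scores))  # ['a', 'dog', 'runs']
```

Beam search is a common alternative to this greedy choice, trading extra computation for better captions.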
Results on Flickr8k:

| model | split | bleu 1 | bleu 2 | bleu 3 | bleu 4 |
|---|---|---|---|---|---|
| vgg16 + lstm | train | 55.53 | 34.94 | 21.94 | 14.02 |
| | val | 55.14 | 34.42 | 21.36 | 13.47 |
| | test | 55.41 | 34.34 | 21.13 | 13.29 |
| vgg16 + transformer | train | 53.13 | 33.63 | 21.01 | 13.21 |
| | val | 52.79 | 33.07 | 20.13 | 12.31 |
| | test | 52.76 | 33.04 | 20.27 | 12.38 |
| densenet161 + lstm | train | 55.05 | 31.18 | 17.79 | 10.84 |
| | val | 55.18 | 31.23 | 17.75 | 10.78 |
| | test | 55.27 | 30.76 | 17.11 | 10.23 |
| densenet161 + transformer | train | 69.55 | 49.93 | 35.55 | 25.03 |
| | val | 65.71 | 44.46 | 29.94 | 20.13 |
| | test | 65.98 | 44.79 | 30.04 | 19.75 |
| DeiT + lstm | train | 56.06 | 34.40 | 20.97 | 13.24 |
| | val | 53.23 | 30.86 | 17.62 | 10.91 |
| | test | 53.48 | 31.06 | 17.61 | 10.61 |
| DeiT + transformer | train | 70.43 | 53.22 | 42.16 | 35.15 |
| | val | 62.71 | 43.71 | 34.58 | 29.32 |
| | test | 62.57 | 44.09 | 35.11 | 29.80 |
| inceptionV3 + transformer | train | 61.44 | 41.09 | 27.52 | 18.29 |
| | val | 60.37 | 39.84 | 26.26 | 17.25 |
| | test | 60.19 | 39.19 | 25.70 | 16.70 |
| resnet34 + transformer | train | 67.23 | 48.05 | 34.08 | 23.84 |
| | val | 63.33 | 42.58 | 28.69 | 19.22 |
| | test | 63.70 | 42.92 | 29.19 | 19.51 |
Results on the COCO subset:

| model | bleu 1 | bleu 2 | bleu 3 | bleu 4 |
|---|---|---|---|---|
| vgg16 + lstm | 46.71 | 23.75 | 12.25 | 8.39 |
| vgg16 + transformer | 50.24 | 27.14 | 16.10 | 8.80 |
| densenet161 + lstm | 49.33 | 23.25 | 11.70 | 9.46 |
| densenet161 + transformer | 55.38 | 30.71 | 17.09 | 9.79 |
| DeiT + lstm | 45.73 | 22.04 | 11.14 | 9.12 |
| DeiT + transformer | 53.09 | 29.76 | 16.92 | 9.95 |
| inceptionV3 + transformer | 49.14 | 26.49 | 14.21 | 8.11 |