- The goal of this project is to build a deep learning model that generates textual descriptions of given photographs, combining techniques from both Computer Vision and Natural Language Processing.
- ResNet50 weights pre-trained on the ImageNet dataset are used to extract features from the training images. A ResNet50 encoder and an LSTM decoder are then combined into a single network to train the caption generator.
- The dataset used for this project can be downloaded from Kaggle.
- To run the notebook `test_model_image_generators.ipynb` or the Python script `test_model_image_generators.py`, several preprocessing stages must be completed and the model trained in advance. The folder `/preprocess` contains Python scripts with the functions described below:
  - `extract_image_features.py`: extracts features from the training dataset with a ResNet50 fine-tuned from pre-trained ImageNet weights. All extracted features are saved as a pickle file, `/preprocess/features.pkl`, for later use (the file is not uploaded because it is too large).
  - `generate_tokenizer.py`: generates a tokenizer and saves it as a pickle file, `/preprocess/tokenizer.pkl`.
  - `preprocess_text_data.py`: preprocesses the descriptions of every training image and saves them as `/preprocess/descriptions.txt`.
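The feature-extraction and tokenizer steps above can be sketched as follows. This is a minimal Keras illustration, not the repository's exact code: the `image_dir` argument, the layer-slicing detail, and the helper names are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from tensorflow.keras.preprocessing.text import Tokenizer


def build_feature_extractor(weights="imagenet"):
    # ResNet50 with the final classification layer removed: the output is
    # the 2048-d global-average-pooled feature vector of the image.
    base = ResNet50(weights=weights)
    return Model(inputs=base.inputs, outputs=base.layers[-2].output)


def extract_features(image_dir, out_path="preprocess/features.pkl"):
    # Map image id -> feature vector and pickle the whole dict, matching
    # the /preprocess/features.pkl file described above.
    extractor = build_feature_extractor()
    features = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img = img_to_array(load_img(path, target_size=(224, 224)))
        img = preprocess_input(np.expand_dims(img, axis=0))
        features[path.stem] = extractor.predict(img, verbose=0)
    with open(out_path, "wb") as f:
        pickle.dump(features, f)


def generate_tokenizer(descriptions, out_path="preprocess/tokenizer.pkl"):
    # Fit a word-level tokenizer on all training captions and pickle it.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(descriptions)
    with open(out_path, "wb") as f:
        pickle.dump(tokenizer, f)
    return tokenizer
```

Note that `ResNet50(weights="imagenet")` downloads the pre-trained weights on first use.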
- After obtaining the three files `features.pkl`, `tokenizer.pkl`, and `descriptions.txt` in the folder `/preprocess`, run `train.py` to start training the model. Note that training consumes a lot of time and computational power (a GPU and more than 8 GB of memory are required). After training finishes, the entire model and its weights are saved as the file `image_captioning_model.h5` (this file is not uploaded because it is too large).
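For reference, a typical merge-style architecture for this kind of caption generator (ResNet50 features plus an LSTM over the partial caption) looks like the sketch below. Layer sizes, dropout rates, and the function name are illustrative assumptions, not necessarily what `train.py` uses.

```python
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from tensorflow.keras.models import Model


def build_caption_model(vocab_size, max_len, feature_dim=2048):
    # Image branch: project the pickled ResNet50 features into the decoder space.
    img_in = Input(shape=(feature_dim,))
    img = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption and encode it with an LSTM.
    txt_in = Input(shape=(max_len,))
    txt = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt = LSTM(256)(Dropout(0.5)(txt))

    # Merge both branches and predict the next word of the caption.
    merged = Dense(256, activation="relu")(add([img, txt]))
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

After training, `model.save("image_captioning_model.h5")` produces an HDF5 file like the one mentioned above.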