Visual Question Answering and Image Captioning using BLIP and OpenVINO

BLIP is a pre-training framework for unified vision-language understanding and generation, which achieves state-of-the-art results on a wide range of vision-language tasks. This tutorial considers ways to use BLIP for visual question answering and image captioning.
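As a reference point before any OpenVINO conversion, both tasks can be run directly in PyTorch. The following is a minimal sketch using the Hugging Face transformers BLIP classes and the Salesforce/blip-image-captioning-base and Salesforce/blip-vqa-base checkpoints; the image URL and the question are illustrative, and the notebook itself may instantiate the model differently:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering

# Load an example image (the URL is illustrative).
image = Image.open(requests.get("https://example.com/demo.jpg", stream=True).raw)

# Image captioning with the BLIP captioning checkpoint.
caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = caption_processor(images=image, return_tensors="pt")
caption_ids = caption_model.generate(**inputs)
print(caption_processor.decode(caption_ids[0], skip_special_tokens=True))

# Visual question answering with the BLIP VQA checkpoint.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
inputs = vqa_processor(images=image, text="What is in the picture?", return_tensors="pt")
answer_ids = vqa_model.generate(**inputs)
print(vqa_processor.decode(answer_ids[0], skip_special_tokens=True))
```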

The complete pipeline of this demo is shown below:

Image Captioning

The following image shows an example of the input image and generated caption:

Visual Question Answering

The following image shows an example of the input image, the question, and the answer generated by the model:

Notebook Contents

This folder contains two notebooks that show how to convert and optimize the BLIP model with OpenVINO:

  1. Convert the BLIP model using OpenVINO
  2. Optimize the OpenVINO BLIP model using NNCF

The first notebook consists of the following parts:

  1. Instantiate a BLIP model.
  2. Convert the BLIP model to OpenVINO IR (a rough sketch of this step follows the list).
  3. Run visual question answering and image captioning with OpenVINO.
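The conversion step can be summarized roughly as follows. This is a minimal sketch assuming the Hugging Face BLIP VQA checkpoint and the openvino Python API; the input shape and file name are illustrative, and the actual notebook converts the vision model, text encoder, and text decoder as separate IR models:

```python
import torch
import openvino as ov
from transformers import BlipForQuestionAnswering

# Load the PyTorch BLIP VQA checkpoint (assumption: Salesforce/blip-vqa-base).
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

# Convert only the vision model here; the text encoder and text decoder
# are converted separately in the same way.
dummy_pixels = torch.zeros((1, 3, 384, 384))  # BLIP uses 384x384 image inputs
vision_ir = ov.convert_model(model.vision_model, example_input=dummy_pixels)
ov.save_model(vision_ir, "blip_vision_model.xml")

# Compile and run the converted model with the OpenVINO runtime.
core = ov.Core()
compiled_vision = core.compile_model(vision_ir, "CPU")
image_embeds = compiled_vision(dummy_pixels.numpy())[0]
```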

The second notebook consists of the following parts:

  1. Download and preprocess a dataset for quantization.
  2. Quantize the converted vision and text encoder OpenVINO models from the first notebook with NNCF (see the sketch after this list).
  3. Compress the weights of the OpenVINO text decoder model from the first notebook with NNCF.
  4. Check the model results using the same input data as in the first notebook.
  5. Compare the size of the converted and optimized models.
  6. Compare the performance of the converted and optimized models.
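The optimization step can be sketched as follows, assuming the IR files produced by the first notebook and the nncf Python package. The file names are illustrative, and the single zero-filled calibration item is only a placeholder for the real preprocessed dataset:

```python
import numpy as np
import nncf
import openvino as ov

core = ov.Core()

# Read the FP32 IR produced by the first notebook (path is illustrative).
vision_model = core.read_model("blip_vision_model.xml")

# Placeholder calibration data; the notebook uses a real preprocessed dataset.
calibration_items = [{"pixel_values": np.zeros((1, 3, 384, 384), dtype=np.float32)}]

# transform_fn maps one dataset item to the model input.
def transform_fn(item):
    return item["pixel_values"]

calibration_data = nncf.Dataset(calibration_items, transform_fn)

# 8-bit post-training quantization of the vision encoder.
quantized_vision = nncf.quantize(vision_model, calibration_data)
ov.save_model(quantized_vision, "blip_vision_model_int8.xml")

# Weight-only compression of the text decoder (no calibration data needed).
text_decoder = core.read_model("blip_text_decoder.xml")
compressed_decoder = nncf.compress_weights(text_decoder)
ov.save_model(compressed_decoder, "blip_text_decoder_compressed.xml")
```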

NNCF performs optimization within the OpenVINO IR, so you must run the first notebook before running the second one.

Installation Instructions

This is a self-contained example that relies solely on its own code.
We recommend running the notebooks in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.