BLIP is a pre-training framework for unified vision-language understanding and generation that achieves state-of-the-art results on a wide range of vision-language tasks. This tutorial demonstrates how to use BLIP for visual question answering and image captioning.
The complete pipeline of this demo is shown below:
The following image shows an example of the input image and generated caption:
The following image shows an example of the input image, question, and answer generated by the model.

This folder contains two notebooks that show how to convert and optimize the model with OpenVINO:
The first notebook consists of the following parts:
- Instantiate a BLIP model.
- Convert the BLIP model to OpenVINO IR.
- Run visual question answering and image captioning with OpenVINO (a conversion and inference sketch follows this list).
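As a rough illustration of these steps, the sketch below loads a BLIP VQA checkpoint from Hugging Face, converts its vision encoder to OpenVINO IR, and runs it on a sample image. The checkpoint name, the image path, and the choice to convert only the vision encoder are assumptions made for brevity; the notebook itself converts the full set of BLIP submodels (vision model, text encoder, and text decoder).

```python
import openvino as ov
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the PyTorch BLIP model and its processor
# (checkpoint name is an assumption; the notebook may use a different one).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

# Prepare example inputs for tracing ("demo.jpg" is a placeholder image path).
image = Image.open("demo.jpg").convert("RGB")
inputs = processor(images=image, text="What is in the picture?", return_tensors="pt")

# Convert the vision encoder to OpenVINO IR and save it to disk.
vision_ir = ov.convert_model(model.vision_model, example_input=inputs["pixel_values"])
ov.save_model(vision_ir, "blip_vision_model.xml")

# Compile and run the converted vision encoder to obtain image embeddings.
core = ov.Core()
compiled_vision = core.compile_model(vision_ir, "CPU")
image_embeds = compiled_vision(inputs["pixel_values"].numpy())[0]
print(image_embeds.shape)
```

The text encoder and text decoder are converted in the same way; the image embeddings produced here are then fed to them to generate answers or captions.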
The second notebook consists of the following parts:
- Download and preprocess the dataset for quantization.
- Quantize the converted vision and text encoder OpenVINO models from the first notebook with NNCF (see the sketch after this list).
- Compress the weights of the OpenVINO text decoder model from the first notebook with NNCF.
- Check the optimized model results using the same input data as in the first notebook.
- Compare the file sizes of the converted and optimized models.
- Compare the performance of the converted and optimized models.
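The sketch below illustrates the NNCF workflow under simplified assumptions: it quantizes the converted vision encoder with a small random calibration set and applies weight-only compression to the text decoder. The IR file names are assumptions (they depend on what the first notebook saved), and a real run should calibrate on the preprocessed dataset rather than random tensors; the text encoder is quantized the same way using its own text inputs.

```python
import numpy as np
import nncf
import openvino as ov

core = ov.Core()

# Load FP32 IR files produced by the first notebook (file names are assumptions).
vision_model = core.read_model("blip_vision_model.xml")
text_decoder = core.read_model("blip_text_decoder_with_past.xml")

# Build a calibration dataset. Random tensors are used here only to keep the
# sketch self-contained; the notebook feeds preprocessed images from a real dataset.
# The 1x3x384x384 shape is an assumption and must match the converted model's input.
calibration_samples = [
    np.random.rand(1, 3, 384, 384).astype(np.float32) for _ in range(10)
]
calibration_dataset = nncf.Dataset(calibration_samples, lambda sample: sample)

# 8-bit post-training quantization of the vision encoder.
quantized_vision = nncf.quantize(vision_model, calibration_dataset)
ov.save_model(quantized_vision, "blip_vision_model_int8.xml")

# Weight-only compression of the text decoder.
compressed_decoder = nncf.compress_weights(text_decoder)
ov.save_model(compressed_decoder, "blip_text_decoder_compressed.xml")
```

Weight-only compression is used for the decoder because it needs no calibration data and keeps the generation quality close to the original, while full 8-bit quantization of the encoders gives the larger latency win.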
NNCF performs optimization within the OpenVINO IR, so you must run the first notebook before running the second one.
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the Installation Guide.