add llava notebook with optimum inference (#2461)
eaidova authored Oct 21, 2024
1 parent a5ce26b commit b9ade6a
Showing 10 changed files with 859 additions and 1,697 deletions.
2 changes: 1 addition & 1 deletion .ci/ignore_convert_execution.txt
@@ -37,7 +37,7 @@ notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
notebooks/softvc-voice-conversion/softvc-voice-conversion.ipynb
notebooks/latent-consistency-models-image-generation/latent-consistency-models-image-generation.ipynb
4 changes: 0 additions & 4 deletions .ci/ignore_pip_conflicts.txt
@@ -6,9 +6,6 @@ notebooks/yolov8-optimization/yolov8-object-detection.ipynb # ultralytics==8.0.
notebooks/yolov8-optimization/yolov8-obb.ipynb # ultralytics==8.1.24
notebooks/llm-chatbot/llm-chatbot.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
notebooks/llm-rag-langchain/llm-rag-langchain.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
-notebooks/bark-text-to-audio/bark-text-to-audio.ipynb # torch==1.13
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb # transformers<4.35
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # transformers<4.35
notebooks/paint-by-example/paint-by-example.ipynb # gradio==3.44.1
notebooks/mobilevlm-language-assistant/mobilevlm-language-assistant.ipynb # transformers<4.35
notebooks/depth-anything/depth-anything.ipynb # install requirements.txt after clone repo
@@ -22,6 +19,5 @@ notebooks/stable-diffusion-torchdynamo-backend/stable-diffusion-torchdynamo-back
notebooks/sketch-to-image-pix2pix-turbo/sketch-to-image-pix2pix-turbo.ipynb
notebooks/yolov10-optimization/yolov10-optimization.ipynb # nncf from git
notebooks/person-counting-webcam/person-counting.ipynb # numpy should be installed first
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # torchvision < 0.17.0
notebooks/parler-tts-text-to-speech/parler-tts-text-to-speech.ipynb # torch >= 2.2
notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb # diffusers from git
4 changes: 2 additions & 2 deletions .ci/ignore_treon_docker.txt
@@ -28,8 +28,8 @@ notebooks/tiny-sd-image-generation/tiny-sd-image-generation.ipynb
notebooks/zeroscope-text2video/zeroscope-text2video.ipynb
notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
notebooks/decidiffusion-image-generation/decidiffusion-image-generation.ipynb
notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
notebooks/fast-segment-anything/fast-segment-anything.ipynb
2 changes: 1 addition & 1 deletion .ci/skipped_notebooks.yml
@@ -218,7 +218,7 @@
- ubuntu-20.04
- ubuntu-22.04
- windows-2019
-- notebook: notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+- notebook: notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
skips:
- os:
- macos-12
2 changes: 0 additions & 2 deletions notebooks/README.md
@@ -64,7 +64,6 @@
- [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
- [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
- [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
- [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
- [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
- [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)
@@ -244,7 +243,6 @@
- [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
- [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
- [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
- [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
- [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
- [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)
32 changes: 8 additions & 24 deletions notebooks/llava-multimodal-chatbot/README.md
@@ -10,30 +10,14 @@ While LLaVA excels at image-based tasks, Video-LLaVA expands this fluency to the

In the field of artificial intelligence, the goal is to create a versatile assistant capable of understanding and executing tasks based on both visual and language inputs. Current approaches often rely on large vision models that solve tasks independently, with language only used to describe image content. While effective, these models have fixed interfaces with limited interactivity and adaptability to user instructions. On the other hand, large language models (LLMs) have shown promise as a universal interface for general-purpose assistants. By explicitly representing various task instructions in language, these models can be guided to switch and solve different tasks. To extend this capability to the multimodal domain, the [LLaVA paper](https://arxiv.org/abs/2304.08485) introduces `visual instruction-tuning`, a novel approach to building a general-purpose visual assistant.

-In this tutorial series we consider how to use LLaVA and Video-LLaVA model to build multimodal chatbot with OpenVINO help.
-
-## LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Prepare input processor and tokenizer
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
-
-## Video-LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
+In this tutorial series we consider how to use LLaVA model to build multimodal chatbot with OpenVINO help.
+
+## Visual-language assistant with LLaVA and OpenVINO Generative API
+This [notebook](./llava-multimodal-chatbot-genai.ipynb) demonstrate how to effectively build Visual-Language assistant using [OpenVINO Generative API](https://github.com/openvinotoolkit/openvino.genai).
+
+## Visual-language assistant with LLaVA and Optimum Intel OpenVINO integration
+This [notebook](./llava-multimodal-chatbot-optimum.ipynb) demonstrate how to effectively build Visual-Language assistant using [Optimum Intel](https://huggingface.co/docs/optimum/main/intel/index) OpenVINO integration.
+

## Installation instructions
This is a self-contained example that relies solely on its own code.</br>
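
The two notebook variants named in the README above differ only in the inference layer: the GenAI notebook drives the model through openvino.genai, while the Optimum notebook loads it through Optimum Intel and keeps the familiar transformers workflow. A minimal sketch of the Optimum Intel path follows; the `OVModelForVisualCausalLM` class and the `llava-hf/llava-1.5-7b-hf` checkpoint are assumptions for illustration, not taken from this commit.

```python
# Hedged sketch of the Optimum Intel route: export LLaVA to OpenVINO on load,
# then generate with the usual transformers-style API. The class name and
# model ID are assumptions, not part of this commit.
from optimum.intel import OVModelForVisualCausalLM  # assumed to exist in optimum-intel
from transformers import AutoProcessor
from PIL import Image

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=True)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style prompt with an image placeholder, mirroring the gradio helper below.
conversation = [{"role": "user", "content": [{"type": "text", "text": "What is on the flower?"}, {"type": "image"}]}]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=Image.open("bee.jpg"), return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```
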
156 changes: 91 additions & 65 deletions notebooks/llava-multimodal-chatbot/gradio_helper.py
@@ -1,10 +1,8 @@
from pathlib import Path
from typing import Callable
import gradio as gr
-

from PIL import Image
-from typing import Callable
import numpy as np
import requests
from threading import Event, Thread
@@ -132,70 +130,98 @@ def generate_and_signal_complete():
    return demo


-def make_demo_videollava(fn: Callable):
-    examples_dir = Path("Video-LLaVA/videollava/serve/examples")
-    gr.close_all()
-    demo = gr.Interface(
-        fn=fn,
-        inputs=[
-            gr.Image(label="Input Image", type="filepath"),
-            gr.Video(label="Input Video"),
-            gr.Textbox(label="Question"),
-        ],
-        outputs=gr.Textbox(lines=10),
+def make_demo_llava_optimum(model, processor):
+    from transformers import TextIteratorStreamer
+
+    has_additonal_buttons = "undo_button" in inspect.signature(gr.ChatInterface.__init__).parameters
+
+    def bot_streaming(message, history):
+        print(f"message is - {message}")
+        print(f"history is - {history}")
+        files = message["files"] if isinstance(message, dict) else message.files
+        message_text = message["text"] if isinstance(message, dict) else message.text
+        if files:
+            # message["files"][-1] is a Dict or just a string
+            if isinstance(files[-1], dict):
+                image = files[-1]["path"]
+            else:
+                if isinstance(files[-1], (str, Path)):
+                    image = files[-1]
+                else:
+                    image = files[-1] if isinstance(files[-1], (list, tuple)) else files[-1].path
+        else:
+            # if there's no image uploaded for this turn, look for images in the past turns
+            # kept inside tuples, take the last one
+            for hist in history:
+                if type(hist[0]) == tuple:
+                    image = hist[0][0]
+        try:
+            if image is None:
+                # Handle the case where image is None
+                raise gr.Error("You need to upload an image for Llama-3.2-Vision to work. Close the error and try again with an Image.")
+        except NameError:
+            # Handle the case where 'image' is not defined at all
+            raise gr.Error("You need to upload an image for Llama-3.2-Vision to work. Close the error and try again with an Image.")
+
+        conversation = []
+        flag = False
+        for user, assistant in history:
+            if assistant is None:
+                # pass
+                flag = True
+                conversation.extend([{"role": "user", "content": []}])
+                continue
+            if flag == True:
+                conversation[0]["content"] = [{"type": "text", "text": f"{user}"}]
+                conversation.append({"role": "assistant", "text": assistant})
+                flag = False
+                continue
+            conversation.extend([{"role": "user", "content": [{"type": "text", "text": user}]}, {"role": "assistant", "text": assistant}])
+
+        conversation.append({"role": "user", "content": [{"type": "text", "text": f"{message_text}"}, {"type": "image"}]})
+        prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
+        print(f"prompt is -\n{prompt}")
+        image = Image.open(image)
+        inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+        streamer = TextIteratorStreamer(
+            processor,
+            **{
+                "skip_special_tokens": True,
+                "skip_prompt": True,
+                "clean_up_tokenization_spaces": False,
+            },
+        )
+        generation_kwargs = dict(
+            inputs,
+            streamer=streamer,
+            max_new_tokens=1024,
+            do_sample=False,
+            temperature=0.0,
+            eos_token_id=processor.tokenizer.eos_token_id,
+        )
+
+        thread = Thread(target=model.generate, kwargs=generation_kwargs)
+        thread.start()
+
+        buffer = ""
+        for new_text in streamer:
+            buffer += new_text
+            yield buffer
+
+    additional_buttons = {}
+    if has_additonal_buttons:
+        additional_buttons = {"undo_button": None, "retry_button": None}
+
+    demo = gr.ChatInterface(
+        fn=bot_streaming,
+        title="LLaVA OpenVINO Chatbot",
        examples=[
-            [
-                f"{examples_dir}/extreme_ironing.jpg",
-                None,
-                "What is unusual about this image?",
-            ],
-            [
-                f"{examples_dir}/waterview.jpg",
-                None,
-                "What are the things I should be cautious about when I visit here?",
-            ],
-            [
-                f"{examples_dir}/desert.jpg",
-                None,
-                "If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_1.mp4",
-                "Why is this video funny?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_3.mp4",
-                "Can you identify any safety hazards in this video?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_9.mp4",
-                "Describe the video.",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Describe the activity in the video.",
-            ],
-            [
-                f"{examples_dir}/sample_img_22.png",
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Are the instruments in the pictures used in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_13.png",
-                f"{examples_dir}/sample_demo_13.mp4",
-                "Does the flag in the image appear in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_8.png",
-                f"{examples_dir}/sample_demo_8.mp4",
-                "Are the image and the video depicting the same place?",
-            ],
+            {"text": "What is on the flower?", "files": ["./bee.jpg"]},
+            {"text": "How to make this pastry?", "files": ["./baklava.png"]},
        ],
-        title="Video-LLaVA🚀",
-        allow_flagging="never",
+        stop_btn=None,
+        multimodal=True,
+        **additional_buttons,
    )
    return demo
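
The core of `bot_streaming` above is the standard transformers streaming hand-off: `generate()` blocks, so it runs on a worker thread while the caller iterates a `TextIteratorStreamer`. Stripped of the Gradio plumbing, the pattern is just this (a sketch; `model` and `processor` stand in for the objects the notebook passes to `make_demo_llava_optimum`):

```python
# Sketch of the thread-plus-streamer pattern used in bot_streaming above.
from threading import Thread
from transformers import TextIteratorStreamer

def stream_answer(model, processor, prompt, image):
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks until done, so it runs on a worker thread and
    # pushes decoded fragments into the streamer as tokens arrive.
    Thread(target=model.generate, kwargs=dict(inputs, streamer=streamer, max_new_tokens=1024)).start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        yield buffer  # growing partial answer, ready for a chat UI to render
```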