This project highlights the diverse search capabilities of Weaviate powered by the CLIP model. It demonstrates how to build AI applications with multilingual understanding and visual perception in just a few lines of code.
In particular, we will index a collection of random pictures featuring various foods from around the world.
Subsequently, we'll be able to search through them using three different inputs:
- User-provided text
- Selected image from the indexed collection
- Any uploaded image
In Weaviate terms, these scenarios correspond to the following operators:
- nearText for user-provided text
- nearObject for an image that is already present in the indexed collection
- nearImage for any uploaded image
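To make these operators concrete, here is a minimal sketch of how each query could look with the Weaviate Python client (v3 syntax). The Picture class name, its filename property, and the example inputs are assumptions for illustration, not necessarily the schema used in this project.

```python
import base64
import weaviate

# Connect to the local Weaviate instance started via docker compose
client = weaviate.Client("http://localhost:8080")

# nearText: search with user-provided text (any of the 50+ supported languages)
by_text = (
    client.query.get("Picture", ["filename"])
    .with_near_text({"concepts": ["spicy noodle soup"]})
    .with_limit(5)
    .do()
)

# nearObject: search with an image that is already indexed, referenced by its UUID
by_object = (
    client.query.get("Picture", ["filename"])
    .with_near_object({"id": "00000000-0000-0000-0000-000000000000"})
    .with_limit(5)
    .do()
)

# nearImage: search with any uploaded image, passed as a base64 string
with open("my_photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode()
by_image = (
    client.query.get("Picture", ["filename"])
    .with_near_image({"image": encoded}, encode=False)
    .with_limit(5)
    .do()
)
```

All three queries return the closest matches from the same vector space, which is what makes it possible to mix text and image inputs freely.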
CLIP, or Contrastive Language-Image Pre-training, is a multimodal deep learning model by OpenAI that is designed to understand and generate meaningful representations of images and text, allowing it to perform tasks that involve both modalities.
CLIP is trained to learn a joint embedding space where image and text representations are aligned. This means that similar concepts in images and text are close to each other in the embedding space. In this demo we will use a multilingual CLIP model.
- Python
- Weaviate
- Streamlit
- Docker
multi2vec-clip vectorizer
The multi2vec-clip module enables Weaviate to obtain vectors locally from text or images using a Sentence-BERT CLIP model. To use it, you need to enable it in the docker compose file.
sentence-transformers/clip-ViT-B-32-multilingual-v1
The particular model that we'll use is sentence-transformers/clip-ViT-B-32-multilingual-v1. It supports encoding of text in 50+ languages. The model is based on Multilingual Knowledge Distillation, which uses the original clip-ViT-B-32 model as the teacher and trains a multilingual DistilBERT model as the student. As mentioned above, the model maps text and images to a common vector space such that the distance between two embeddings represents their semantic similarity.
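To get an intuition for that shared vector space, here is a minimal sketch that uses the sentence-transformers library directly (outside of Weaviate). The image file name and the captions are made-up examples; the two-model split (multilingual text encoder plus the original CLIP image encoder) follows the model card for clip-ViT-B-32-multilingual-v1.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Text encoder: the multilingual student model (50+ languages)
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
# Image encoder: the original clip-ViT-B-32 teacher model
image_model = SentenceTransformer("clip-ViT-B-32")

image_emb = image_model.encode(Image.open("pizza.jpg"))  # hypothetical local image
text_embs = text_model.encode(["A slice of pizza", "Una porción de pizza", "ピザ"])

# Cosine similarity: the same dish described in different languages
# should all score high against the picture
print(util.cos_sim(image_emb, text_embs))
```

Inside the Weaviate setup, the multi2vec-clip module performs this kind of encoding for you, so the application code only deals with queries.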
- Python3 interpreter installed
- Ability to run Docker Compose (the most straightforward way on Windows/Mac is to install Docker Desktop)
- Clone this repository
- Download the dataset (you need to be logged in to Kaggle to be able to do it) from this link and unzip it to the project root
- Create a virtual environment and activate it
Note
This was tested using Python 3.10.
python3 -m venv venv
source venv/bin/activate
- Install all required dependencies
pip install -r requirements.txt
- Run a containerized instance of Weaviate. It also includes the vectorizer module that computes the embeddings.
Note
Make sure you don't have anything occupying port 8080.
If you do, you can either stop that process or change the port that Weaviate is using.
docker compose up
- Index the dataset in Weaviate. By default, 1000 pictures will be ingested (a rough sketch of how the indexing works under the hood is shown after these setup steps).
python add_data.py
If you want a bigger dataset, you can use the --image-number parameter to set the number of pictures to ingest:
python add_data.py --image-number 3000
-
Run the Streamlit demo
streamlit run app.py
Now you can open the app at http://localhost:8501/ and also play with changing app.py on the fly (a stripped-down sketch of the app is shown below).
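Under the hood, indexing with multi2vec-clip only requires sending each picture as a base64-encoded blob; Weaviate's vectorizer module computes the embeddings server-side. The snippet below is a rough sketch of that pattern using the Python client (v3 syntax) with a hypothetical Picture class and dataset layout, not a copy of add_data.py.

```python
import base64
import glob
import weaviate

client = weaviate.Client("http://localhost:8080")

# Batch import: the multi2vec-clip module vectorizes the "image" blob field
# automatically, so no embeddings are computed on the client side.
client.batch.configure(batch_size=100)
with client.batch as batch:
    for path in glob.glob("dataset/*.jpg")[:1000]:  # hypothetical dataset layout
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode()
        batch.add_data_object(
            data_object={"filename": path, "image": encoded},
            class_name="Picture",
        )
```

On the application side, a stripped-down text-search version of the Streamlit page might look roughly like this (again an assumed structure; the real app.py also supports the image-based searches described above):

```python
import streamlit as st
import weaviate

client = weaviate.Client("http://localhost:8080")

st.title("Multilingual food search")
query = st.text_input("Describe a dish in any language")

if query:
    result = (
        client.query.get("Picture", ["filename"])
        .with_near_text({"concepts": [query]})
        .with_limit(6)
        .do()
    )
    # Show the matching pictures; filename is assumed to hold a local path
    for item in result["data"]["Get"]["Picture"]:
        st.image(item["filename"], caption=item["filename"])
```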
- Both the Streamlit app and docker compose can be stopped with Ctrl+C in the corresponding terminal window.
- To remove the created Docker containers and volumes, use docker compose down -v.
The dataset used for this example is available on Kaggle: https://www.kaggle.com/datasets/abhijeetbhilare/world-cuisines/