Getting Started Improvements (#138)
* Update FastEmbed README.md

* Rewrite GettingStarted to use TextEmbedding instead of DefaultEmbedding

* Improve grammar

* Update Getting Started.ipynb with model information and document format
NirantK authored Mar 7, 2024
1 parent 5603fbe commit 9ffde58
Showing 2 changed files with 137 additions and 138 deletions.
35 changes: 19 additions & 16 deletions README.md
@@ -4,16 +4,13 @@ FastEmbed is a lightweight, fast, Python library built for embedding generation.

The default text embedding (`TextEmbedding`) model is Flag Embedding, the top model in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. It supports "query" and "passage" prefixes for the input text. Here is an example of [Retrieval Embedding Generation](https://qdrant.github.io/fastembed/examples/Retrieval_with_FastEmbed/) and how to use [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/).

1. Light & Fast
- Quantized model weights
- ONNX Runtime, no PyTorch dependency
- CPU-first design
- Data-parallelism for encoding of large datasets
## 📈 Why FastEmbed?

2. Accuracy/Recall
- Better than OpenAI Ada-002
- Default is Flag Embedding, which is top of the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard
- List of [supported models](https://qdrant.github.io/fastembed/examples/Supported_Models/) - including multilingual models
1. Light: FastEmbed is a lightweight library with few external dependencies. We don't require a GPU and don't download GBs of PyTorch dependencies; instead, we use the ONNX Runtime. This makes it a great candidate for serverless runtimes like AWS Lambda.

2. Fast: FastEmbed is designed for speed. We use the ONNX Runtime, which is faster than PyTorch. We also use data-parallelism for encoding large datasets.

3. Accurate: FastEmbed is better than OpenAI Ada-002. We also [support](https://qdrant.github.io/fastembed/examples/Supported_Models/) an ever-expanding set of models, including a few multilingual models.

## 🚀 Installation

@@ -26,18 +23,24 @@ pip install fastembed
## 📖 Quickstart

```python
import numpy as np
from fastembed import TextEmbedding
from typing import List
import numpy as np

# Example list of documents
documents: List[str] = [
"passage: Hello, World!",
"query: Hello, World!", # these are two different embedding
"passage: This is an example passage.",
"fastembed is supported by and maintained by Qdrant." # You can leave out the prefix but it's recommended
"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
"fastembed is supported by and maintained by Qdrant.",
]
embedding_model = TextEmbedding(model_name="BAAI/bge-base-en")
embeddings: List[np.ndarray] = list(embedding_model.embed(documents)) # Note the list() call - this is a generator

# This will trigger the model download and initialization
embedding_model = TextEmbedding()
print("The model BAAI/bge-small-en-v1.5 is ready to use.")

embeddings_generator = embedding_model.embed(documents) # reminder this is a generator
embeddings_list = list(embedding_model.embed(documents))
# you can also convert the generator to a list, and that to a numpy array
len(embeddings_list[0]) # Vector of 384 dimensions
```
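The rewritten quickstart drops the "query:"/"passage:" prefixes shown in the older snippet, but the default Flag Embedding model still understands them (see the paragraph at the top of the README). Here is a minimal sketch of retrieval-style usage; it assumes only the `TextEmbedding` class and the `embed()` generator shown above:

```python
from typing import List

import numpy as np
from fastembed import TextEmbedding

embedding_model = TextEmbedding()  # default model: BAAI/bge-small-en-v1.5

# Retrieval-style inputs: the default model treats "query:" and
# "passage:" prefixes differently, so the same text embeds differently.
documents: List[str] = [
    "query: Hello, World!",
    "passage: Hello, World!",
]

embeddings: List[np.ndarray] = list(embedding_model.embed(documents))  # embed() yields one vector per document
print(embeddings[0].shape)  # (384,) for the default model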

## Usage with Qdrant
240 changes: 118 additions & 122 deletions docs/Getting Started.ipynb
@@ -11,7 +11,9 @@
"\n",
"## Quick Start\n",
"\n",
"The fastembed package is designed to be easy to use. The main class is the `Embedding` class. It takes a list of strings as input and returns a list of vectors as output. The `Embedding` class is initialized with a model file."
"The fastembed package is designed to be easy to use. We'll be using `TextEmbedding` class. It takes a list of strings as input and returns an generator of vectors. If you're seeing generators for the first time, don't worry, you can convert it to a list using `list()`.\n",
"\n",
"> 💡 You can learn more about generators from [Python Wiki](https://wiki.python.org/moin/Generators)"
]
},
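Since the rewritten intro leans on generators, here is a tiny, self-contained illustration of the behaviour it describes (plain Python, not part of the notebook): values are produced lazily, `list()` drains them, and a generator can only be consumed once, which is why the quickstart calls `embed()` again when it wants a list.

```python
def squares(n):
    for i in range(n):
        yield i * i  # nothing is computed until a caller asks

gen = squares(3)
print(gen)        # <generator object squares at 0x...> - no values yet
print(list(gen))  # [0, 1, 4] - list() drains the generator
print(list(gen))  # [] - already exhausted; create a new generator to iterate again
```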
{
@@ -21,15 +23,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install fastembed --upgrade --quiet # Install fastembed "
]
},
{
"cell_type": "markdown",
"id": "ed81d725",
"metadata": {},
"source": [
"Make the necessary imports, initialize the `Embedding` class, and embed your data into vectors:"
"!pip install -Uqq fastembed # Install fastembed"
]
},
{
@@ -39,186 +33,188 @@
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 76.7M/76.7M [00:05<00:00, 15.0MiB/s]\n",
"100%|██████████| 3/3 [00:00<00:00, 455.37it/s]"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "890cc3b969354eec8d149d143e301a7a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
"The model BAAI/bge-small-en-v1.5 is ready to use.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
"data": {
"text/plain": [
"384"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding\n",
"from fastembed import TextEmbedding\n",
"from typing import List\n",
"\n",
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"Hello, World!\",\n",
" \"This is an example document.\",\n",
" \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]\n",
"# Initialize the DefaultEmbedding class\n",
"embedding_model = DefaultEmbedding()\n",
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))\n",
"print(embeddings[0].shape)"
"\n",
"# This will trigger the model download and initialization\n",
"embedding_model = TextEmbedding()\n",
"print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
"\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"embeddings_list = list(embedding_model.embed(documents))\n",
" # you can also convert the generator to a list, and that to a numpy array\n",
"len(embeddings_list[0]) # Vector of 384 dimensions"
]
},
{
"cell_type": "markdown",
"id": "8c49ae50",
"id": "d772190b",
"metadata": {},
"source": [
"## Let's think step by step"
]
},
{
"cell_type": "markdown",
"id": "92cf4b76",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Importing the required classes and modules:"
"> 💡 **Why do we use generators?**\n",
"> \n",
"> We use them to save memory mostly. Instead of loading all the vectors into memory, we can load them one by one. This is useful when you have a large dataset and you don't want to load all the vectors at once."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c0a6f634",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding"
]
},
{
"cell_type": "markdown",
"id": "3fd03a71",
"id": "8a225cb8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
"Document: fastembed is supported by and maintained by Qdrant.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
]
}
],
"source": [
"Notice that we are using the DefaultEmbedding -- which is a quantized, state of the Art Flag Embedding model which beats OpenAI's Embedding by a large margin. \n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"\n",
"### Prepare your Documents\n",
"You can define a list of documents that you'd like to embed. These can be sentences, paragraphs, or even entire documents. \n",
"\n",
"#### Format of the Document List\n",
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
"2. For Retrieval Tasks: If you're working with queries and passages, you can add special labels to them:\n",
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string"
"for doc, vector in zip(documents, embeddings_generator):\n",
" print(\"Document:\", doc)\n",
" print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "145a56ce",
"execution_count": 4,
"id": "769a1be9",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"(2, 384)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"passage: Hello, World!\",\n",
" \"query: Hello, World!\", # these are two different embedding\n",
" \"passage: This is an example passage.\",\n",
" # You can leave out the prefix but it's recommended\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]"
"embeddings_list = np.array(\n",
" list(embedding_model.embed(documents))\n",
") # you can also convert the generator to a list, and that to a numpy array\n",
"embeddings_list.shape"
]
},
{
"cell_type": "markdown",
"id": "1cb3cc87",
"id": "8c49ae50",
"metadata": {},
"source": [
"### Load the Embedding Model Weights\n",
"Next, initialize the Embedding class with the desired parameters. Here, \"BAAI/bge-small-en\" is the pre-trained model name, and max_length=512 is the maximum token length for each document.\n",
"We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) a state of the art Flag Embedding model. The model does better than OpenAI text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n",
"\n",
"This will download the model weights, decompress to directory `local_cache` and load them into the Embedding class.\n",
"#### Format of the Document List\n",
"\n",
"#### Initialize DefaultEmbedding\n",
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
"2. For Retrieval Tasks with our default: If you're working with queries and passages, you can add special labels to them:\n",
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string\n",
"\n",
"We will initialize Flag Embeddings with the model name and the maximum token length. That is the DefaultEmbedding class with the model name \"BAAI/bge-small-en\" and max_length=512."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "272c8915",
"metadata": {},
"outputs": [],
"source": [
"embedding_model = DefaultEmbedding()"
]
},
{
"cell_type": "markdown",
"id": "5549d501",
"metadata": {},
"source": [
"### Embed your Documents\n",
"## Beyond the default model\n",
"\n",
"Use the embed method of the embedding model to transform the documents into a List of np.array. The method returns a generator, so we cast it to a list to get the embeddings."
"The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`."
]
},
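As a companion to the cell above, here is a short sketch of browsing `TextEmbedding.list_supported_models()` before picking a model. Note that the "model" and "dim" keys are an assumption on my part, based on recent fastembed releases; the notebook itself only names the method.

```python
from fastembed import TextEmbedding

# Browse the catalogue before downloading anything.
# Assumes each entry is a dict with "model" and "dim" keys
# (true in recent fastembed releases - check yours if this differs).
for model_info in TextEmbedding.list_supported_models():
    print(f'{model_info["model"]}: {model_info["dim"]} dimensions')
```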
{
"cell_type": "code",
"execution_count": 5,
"id": "8013eee9",
"id": "2e9c8766",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 4/4 [00:00<00:00, 361.82it/s]\n"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9470ec542f3c4400a42452c2489a1abc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))"
]
},
{
"cell_type": "markdown",
"id": "e5b5a6ad",
"metadata": {},
"source": [
"You can print the shape of the embeddings to understand their dimensions. Typically, the shape will indicate the number of dimensions in the vector."
"multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0d8c8e08",
"id": "a9e70f0e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
]
"data": {
"text/plain": [
"(4, 1024)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(embeddings[0].shape) # (384,) or similar output"
"np.array(list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))).shape # Vector of 1024 dimensions"
]
},
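One caveat worth noting next to the multilingual example: E5-family models such as `intfloat/multilingual-e5-large` were trained with the same "query:"/"passage:" labels described in the document-format tips above, so prefixing retrieval inputs tends to help. A hedged sketch reusing the `multilingual_large_model` from the previous cell:

```python
# Hypothetical retrieval-style inputs; the prefix convention mirrors the
# document-format tips earlier in this notebook (my note, not the notebook's).
queries = ["query: What is FastEmbed?"]
passages = ["passage: FastEmbed is a lightweight Python library for embedding generation."]

query_vectors = list(multilingual_large_model.embed(queries))
passage_vectors = list(multilingual_large_model.embed(passages))
print(query_vectors[0].shape)  # (1024,)
```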
{
"cell_type": "markdown",
"id": "64fe20ed",
"metadata": {},
"source": [
"Next: Checkout how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
]
}
],
@@ -238,7 +234,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
"version": "3.10.13"
}
},
"nbformat": 4,
