feat: Added semantic chunking with chonkie tutorial (#133)
* Added tutorial

* Added tutorial

* Added tutorial

* Updated readme

* Updated readme

* Updated readme

* Updated readme

* Updated readme

* Updates

* Updates

* Updates

* Updates

* Updates

* Updates

* Updates
Pringled authored Nov 24, 2024
1 parent 55631e3 commit a01819e
Showing 3 changed files with 289 additions and 3 deletions.
26 changes: 25 additions & 1 deletion README.md
@@ -111,7 +111,7 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar
- **Lightweight Dependencies**: the base package's only major dependency is `numpy`.
- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home.
- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary.
- **Integrated into Sentence Transformers and txtai**: Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [txtai](https://github.com/neuml/txtai).
- **Integrated into Sentence Transformers, txtai, and Chonkie**: Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), [txtai](https://github.com/neuml/txtai), and [Chonkie](https://github.com/bhavnicksm/chonkie).
- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own.

## What is Model2Vec?
@@ -374,6 +374,30 @@ result = embeddings.search("Risotto", 1)

</details>

<details>
<summary> Chonkie </summary>
<br>

Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in a semantic chunker such as the `SDPMChunker` class. The following code snippet shows how to use Model2Vec in Chonkie:

```python
from chonkie import SDPMChunker

# Create some example text to chunk
text = "It's dangerous to go alone! Take this."

# Initialize the SDPMChunker with a potion model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3,
)

# Chunk the text
chunks = chunker.chunk(text)
```
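
Each returned chunk is an object whose `text` attribute holds the chunked text (the tutorial notebook below relies on the same attribute); a minimal sketch for inspecting the output:

```python
# Print the text of each resulting chunk
for chunk in chunks:
    print(chunk.text)
```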

</details>

<details>
<summary> Transformers.js </summary>

5 changes: 3 additions & 2 deletions tutorials/README.md
@@ -11,5 +11,6 @@ This is a list of all our tutorials. They are all self-contained ipython noteboo

| | what? | Link |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| **Recipe search** | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) |
| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
| **Recipe search** 🍝 | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) |
| **Semantic deduplication** 🧹 | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
| **Semantic chunking** 🧩 | Learn how to chunk your text into meaningful segments with [Chonkie](https://github.com/bhavnicksm/chonkie) at lightning speed. Efficiently query your chunks with [Vicinity](https://github.com/MinishLab/vicinity). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_chunking.ipynb) |
261 changes: 261 additions & 0 deletions tutorials/semantic_chunking.ipynb
@@ -0,0 +1,261 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Semantic Chunking with Chonkie and Model2Vec**\n",
"\n",
"Semantic chunking is a task of identifying the semantic boundaries of a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a library that provides a lightweight and fast solution to semantic chunking using pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.\n",
"\n",
"After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the necessary libraries\n",
"!pip install datasets model2vec numpy tqdm vicinity\n",
"\n",
"# Import the necessary libraries\n",
"import random \n",
"import re\n",
"import requests\n",
"from time import perf_counter\n",
"from chonkie import SDPMChunker\n",
"from model2vec import StaticModel\n",
"from vicinity import Vicinity\n",
"\n",
"random.seed(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Loading and pre-processing**\n",
"\n",
"First, we will download War and Peace and apply some basic pre-processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# URL for War and Peace on Project Gutenberg\n",
"url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n",
"\n",
"# Download the book\n",
"response = requests.get(url)\n",
"book_text = response.text\n",
"\n",
"def preprocess_text(text: str, min_length: int = 5):\n",
" \"\"\"Basic text preprocessing function.\"\"\"\n",
" text = text.replace(\"\\n\", \" \")\n",
" text = text.replace(\"\\r\", \" \")\n",
" sentences = re.findall(r'[^.!?]*[.!?]', text)\n",
" # Filter out sentences shorter than the specified minimum length\n",
" filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n",
" # Recombine the filtered sentences\n",
" return ' '.join(filtered_sentences)\n",
"\n",
"# Preprocess the text\n",
"book_text = preprocess_text(book_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chunking with Chonkie**\n",
"\n",
"Next, we will use Chonkie to chunk our text into semantic chunks."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of chunks: 6148\n",
"Time taken: 2.2917541670030914\n"
]
}
],
"source": [
"# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n",
"chunker = SDPMChunker(\n",
" embedding_model=\"minishlab/potion-base-8M\",\n",
" similarity_threshold=0.3,\n",
" skip_window=5,\n",
" chunk_size = 256\n",
")\n",
"\n",
"# Chunk the text\n",
"time = perf_counter()\n",
"chunks = chunker.chunk(book_text)\n",
"print(f\"Number of chunks: {len(chunks)}\")\n",
"print(f\"Time taken: {perf_counter() - time}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that's it, we chunked the entirety of War and Peace in ~2 seconds. Not bad! Let's look at some example chunks."
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Hard as it was for Princess Mary to emerge from the realm of secluded contemplation in which she had lived till then, and sorry and almost ashamed as she felt to leave Natásha alone, yet the cares of life demanded her attention and she involuntarily yielded to them. She went through the accounts with Alpátych, conferred with Dessalles about her nephew, and gave orders and made preparations for the journey to Moscow. Natásha remained alone and, from the time Princess Mary began making preparations for departure, held aloof from her too. Princess Mary asked the countess to let Natásha go with her to Moscow, and both parents gladly accepted this offer, for they saw their daughter losing strength every day and thought that a change of scene and the advice of Moscow doctors would be good for her. “I am not going anywhere,” Natásha replied when this was proposed to her. “Do please just leave me alone! ” And she ran out of the room, with difficulty refraining from tears of vexation and irritation rather than of sorrow. \n",
"\n",
" In all these words she saw only that the danger threatening her son would not soon be over. \n",
"\n",
" When later on in his memoirs Count Rostopchín explained his actions at this time, he repeatedly says that he was then actuated by two important considerations: to maintain tranquillity in Moscow and expedite the departure of the inhabitants. If one accepts this twofold aim all Rostopchín’s actions appear irreproachable. “Why were the holy relics, the arms, ammunition, gunpowder, and stores of corn not removed? Why were thousands of inhabitants deceived into believing that Moscow would not be given up—and thereby ruined? \n",
"\n"
]
}
],
"source": [
"# Print a few example chunks\n",
"for _ in range(3):\n",
" chunk = random.choice(chunks)\n",
" print(chunk.text, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n",
"\n",
"**Creating a vector search index**"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time taken: 1.5817922909918707\n"
]
}
],
"source": [
"# Initialize an embedding model and encode the chunk texts\n",
"time = perf_counter()\n",
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
"chunk_texts = [chunk.text for chunk in chunks]\n",
"chunk_embeddings = model.encode(chunk_texts)\n",
"\n",
"# Create a Vicinity instance\n",
"vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)\n",
"print(f\"Time taken: {perf_counter() - time}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Done! We embedded all our chunks and created an in index in ~1.5 seconds. Now that we have our index, let's query it with some queries.\n",
"\n",
"**Querying the index**"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query: Emperor Napoleon\n",
"--------------------------------------------------\n",
" In 1808 the Emperor Alexander went to Erfurt for a fresh interview with the Emperor Napoleon, and in the upper circles of Petersburg there was much talk of the grandeur of this important meeting. CHAPTER XXII In 1809 the intimacy between “the world’s two arbiters,” as Napoleon and Alexander were called, was such that when Napoleon declared war on Austria a Russian corps crossed the frontier to co-operate with our old enemy Bonaparte against our old ally the Emperor of Austria, and in court circles the possibility of marriage between Napoleon and one of Alexander’s sisters was spoken of. \n",
"\n",
" ) “It’s in the Emperor’s service. \n",
"\n",
" “The day before yesterday it was ‘Napoléon, France, bravoure’; yesterday, ‘Alexandre, Russie, grandeur. ’ One day our Emperor gives it and next day Napoleon. Tomorrow our Emperor will send a St. \n",
"\n",
"Query: The battle of Austerlitz\n",
"--------------------------------------------------\n",
" On the first arrival of the news of the battle of Austerlitz, Moscow had been bewildered. \n",
"\n",
" That city is taken; the Russian army suffers heavier losses than the opposing armies had suffered in the former war from Austerlitz to Wagram. \n",
"\n",
" Behave as you did at Austerlitz, Friedland, Vítebsk, and Smolénsk. \n",
"\n",
"Query: Paris\n",
"--------------------------------------------------\n",
" A man who doesn’t know Paris is a savage. You can tell a Parisian two leagues off. Paris is Talma, la Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that his conclusion was weaker than what had gone before, he added quickly: “There is only one Paris in the world. You have been to Paris and have remained Russian. \n",
"\n",
" Well, what is Paris saying? \n",
"\n",
" Paris would have been the capital of the world, and the French the envy of the nations! \n",
"\n"
]
}
],
"source": [
"queries = [\"Emperor Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n",
"for query in queries:\n",
" print(f\"Query: {query}\\n{'-' * 50}\")\n",
" query_embedding = model.encode(query)\n",
" results = vicinity.query(query_embedding, k=3)[0]\n",
"\n",
" for result in results:\n",
" print(result[0], \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in about 3.5 seconds using Chonkie, Vicinity, and Model2Vec. Lightweight and fast, just how we like it."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "3.10.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
