feat: Added semantic chunking with chonkie tutorial (#133)
* Added tutorial

* Added tutorial

* Added tutorial

* Updated readme

* Updated readme

* Updated readme

* Updated readme

* Updated readme

* Updates

* Updates

* Updates

* Updates

* Updates

* Updates

* Updates
Pringled authored Nov 24, 2024
1 parent 55631e3 commit a01819e
Showing 3 changed files with 289 additions and 3 deletions.
26 changes: 25 additions & 1 deletion README.md
@@ -111,7 +111,7 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar
- **Lightweight Dependencies**: the base package's only major dependency is `numpy`.
- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home.
- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary.
- **Integrated into Sentence Transformers and txtai**: Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [txtai](https://github.com/neuml/txtai).
- **Integrated into Sentence Transformers, txtai, and Chonkie**: Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), [txtai](https://github.com/neuml/txtai), and [Chonkie](https://github.com/bhavnicksm/chonkie).
- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own.

## What is Model2Vec?
@@ -374,6 +374,30 @@ result = embeddings.search("Risotto", 1)

</details>

<details>
<summary> Chonkie </summary>
<br>

Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in a semantic chunker such as the `SDPMChunker` class. The following code snippet shows how to use Model2Vec in Chonkie:

```python
from chonkie import SDPMChunker

# Create some example text to chunk
text = "It's dangerous to go alone! Take this."

# Initialize the SDPMChunker with a potion model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3,
)

# Chunk the text
chunks = chunker.chunk(text)
```
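
Each returned chunk is an object whose `text` attribute holds the chunked text (the tutorial notebook below relies on the same attribute); a minimal sketch for inspecting the output:

```python
# Print the text of each resulting chunk
for chunk in chunks:
    print(chunk.text)
```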

</details>

<details>
<summary> Transformers.js </summary>

5 changes: 3 additions & 2 deletions tutorials/README.md
@@ -11,5 +11,6 @@ This is a list of all our tutorials. They are all self-contained ipython noteboo

| | what? | Link |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| **Recipe search** | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) |
| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
| **Recipe search** 🍝 | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) |
| **Semantic deduplication** 🧹 | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
| **Semantic chunking** 🧩 | Learn how to chunk your text into meaningful segments with [Chonkie](https://github.com/bhavnicksm/chonkie) at lightning speed. Efficiently query your chunks with [Vicinity](https://github.com/MinishLab/vicinity). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_chunking.ipynb) |
261 changes: 261 additions & 0 deletions tutorials/semantic_chunking.ipynb
@@ -0,0 +1,261 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Semantic Chunking with Chonkie and Model2Vec**\n",
"\n",
"Semantic chunking is a task of identifying the semantic boundaries of a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a library that provides a lightweight and fast solution to semantic chunking using pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.\n",
"\n",
"After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the necessary libraries\n",
"!pip install datasets model2vec numpy tqdm vicinity\n",
"\n",
"# Import the necessary libraries\n",
"import random \n",
"import re\n",
"import requests\n",
"from time import perf_counter\n",
"from chonkie import SDPMChunker\n",
"from model2vec import StaticModel\n",
"from vicinity import Vicinity\n",
"\n",
"random.seed(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Loading and pre-processing**\n",
"\n",
"First, we will download War and Peace and apply some basic pre-processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# URL for War and Peace on Project Gutenberg\n",
"url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n",
"\n",
"# Download the book\n",
"response = requests.get(url)\n",
"book_text = response.text\n",
"\n",
"def preprocess_text(text: str, min_length: int = 5):\n",
" \"\"\"Basic text preprocessing function.\"\"\"\n",
" text = text.replace(\"\\n\", \" \")\n",
" text = text.replace(\"\\r\", \" \")\n",
" sentences = re.findall(r'[^.!?]*[.!?]', text)\n",
" # Filter out sentences shorter than the specified minimum length\n",
" filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n",
" # Recombine the filtered sentences\n",
" return ' '.join(filtered_sentences)\n",
"\n",
"# Preprocess the text\n",
"book_text = preprocess_text(book_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chunking with Chonkie**\n",
"\n",
"Next, we will use Chonkie to chunk our text into semantic chunks."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of chunks: 6148\n",
"Time taken: 2.2917541670030914\n"
]
}
],
"source": [
"# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n",
"chunker = SDPMChunker(\n",
" embedding_model=\"minishlab/potion-base-8M\",\n",
" similarity_threshold=0.3,\n",
" skip_window=5,\n",
" chunk_size = 256\n",
")\n",
"\n",
"# Chunk the text\n",
"time = perf_counter()\n",
"chunks = chunker.chunk(book_text)\n",
"print(f\"Number of chunks: {len(chunks)}\")\n",
"print(f\"Time taken: {perf_counter() - time}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that's it, we chunked the entirety of War and Peace in ~2 seconds. Not bad! Let's look at some example chunks."
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Hard as it was for Princess Mary to emerge from the realm of secluded contemplation in which she had lived till then, and sorry and almost ashamed as she felt to leave Natásha alone, yet the cares of life demanded her attention and she involuntarily yielded to them. She went through the accounts with Alpátych, conferred with Dessalles about her nephew, and gave orders and made preparations for the journey to Moscow. Natásha remained alone and, from the time Princess Mary began making preparations for departure, held aloof from her too. Princess Mary asked the countess to let Natásha go with her to Moscow, and both parents gladly accepted this offer, for they saw their daughter losing strength every day and thought that a change of scene and the advice of Moscow doctors would be good for her. “I am not going anywhere,” Natásha replied when this was proposed to her. “Do please just leave me alone! ” And she ran out of the room, with difficulty refraining from tears of vexation and irritation rather than of sorrow. \n",
"\n",
" In all these words she saw only that the danger threatening her son would not soon be over. \n",
"\n",
" When later on in his memoirs Count Rostopchín explained his actions at this time, he repeatedly says that he was then actuated by two important considerations: to maintain tranquillity in Moscow and expedite the departure of the inhabitants. If one accepts this twofold aim all Rostopchín’s actions appear irreproachable. “Why were the holy relics, the arms, ammunition, gunpowder, and stores of corn not removed? Why were thousands of inhabitants deceived into believing that Moscow would not be given up—and thereby ruined? \n",
"\n"
]
}
],
"source": [
"# Print a few example chunks\n",
"for _ in range(3):\n",
" chunk = random.choice(chunks)\n",
" print(chunk.text, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n",
"\n",
"**Creating a vector search index**"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time taken: 1.5817922909918707\n"
]
}
],
"source": [
"# Initialize an embedding model and encode the chunk texts\n",
"time = perf_counter()\n",
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
"chunk_texts = [chunk.text for chunk in chunks]\n",
"chunk_embeddings = model.encode(chunk_texts)\n",
"\n",
"# Create a Vicinity instance\n",
"vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)\n",
"print(f\"Time taken: {perf_counter() - time}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Done! We embedded all our chunks and created an in index in ~1.5 seconds. Now that we have our index, let's query it with some queries.\n",
"\n",
"**Querying the index**"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query: Emperor Napoleon\n",
"--------------------------------------------------\n",
" In 1808 the Emperor Alexander went to Erfurt for a fresh interview with the Emperor Napoleon, and in the upper circles of Petersburg there was much talk of the grandeur of this important meeting. CHAPTER XXII In 1809 the intimacy between “the world’s two arbiters,” as Napoleon and Alexander were called, was such that when Napoleon declared war on Austria a Russian corps crossed the frontier to co-operate with our old enemy Bonaparte against our old ally the Emperor of Austria, and in court circles the possibility of marriage between Napoleon and one of Alexander’s sisters was spoken of. \n",
"\n",
" ) “It’s in the Emperor’s service. \n",
"\n",
" “The day before yesterday it was ‘Napoléon, France, bravoure’; yesterday, ‘Alexandre, Russie, grandeur. ’ One day our Emperor gives it and next day Napoleon. Tomorrow our Emperor will send a St. \n",
"\n",
"Query: The battle of Austerlitz\n",
"--------------------------------------------------\n",
" On the first arrival of the news of the battle of Austerlitz, Moscow had been bewildered. \n",
"\n",
" That city is taken; the Russian army suffers heavier losses than the opposing armies had suffered in the former war from Austerlitz to Wagram. \n",
"\n",
" Behave as you did at Austerlitz, Friedland, Vítebsk, and Smolénsk. \n",
"\n",
"Query: Paris\n",
"--------------------------------------------------\n",
" A man who doesn’t know Paris is a savage. You can tell a Parisian two leagues off. Paris is Talma, la Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that his conclusion was weaker than what had gone before, he added quickly: “There is only one Paris in the world. You have been to Paris and have remained Russian. \n",
"\n",
" Well, what is Paris saying? \n",
"\n",
" Paris would have been the capital of the world, and the French the envy of the nations! \n",
"\n"
]
}
],
"source": [
"queries = [\"Emperor Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n",
"for query in queries:\n",
" print(f\"Query: {query}\\n{'-' * 50}\")\n",
" query_embedding = model.encode(query)\n",
" results = vicinity.query(query_embedding, k=3)[0]\n",
"\n",
" for result in results:\n",
" print(result[0], \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in about 3.5 seconds using Chonkie, Vicinity, and Model2Vec. Lightweight and fast, just how we like it."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "3.10.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
