feat: Added semantic chunking with chonkie tutorial (#133)
* Added tutorial * Added tutorial * Added tutorial * Updated readme * Updated readme * Updated readme * Updated readme * Updated readme * Updates * Updates * Updates * Updates * Updates * Updates * Updates
Showing 3 changed files with 289 additions and 3 deletions.
@@ -0,0 +1,261 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Semantic Chunking with Chonkie and Model2Vec**\n",
"\n",
"Semantic chunking is the task of identifying the semantic boundaries of a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a lightweight, fast library for semantic chunking with pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which is what we will be using in this tutorial.\n",
"\n",
"After chunking our text, we will use [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest-neighbors library, to create an index of our chunks and query them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the necessary libraries\n",
"!pip install chonkie datasets model2vec numpy requests tqdm vicinity\n",
"\n",
"# Import the necessary libraries\n",
"import random\n",
"import re\n",
"import requests\n",
"from time import perf_counter\n",
"from chonkie import SDPMChunker\n",
"from model2vec import StaticModel\n",
"from vicinity import Vicinity\n",
"\n",
"random.seed(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Loading and pre-processing**\n",
"\n",
"First, we will download War and Peace and apply some basic pre-processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# URL for War and Peace on Project Gutenberg\n",
"url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n",
"\n",
"# Download the book\n",
"response = requests.get(url)\n",
"book_text = response.text\n",
"\n",
"def preprocess_text(text: str, min_length: int = 5):\n",
"    \"\"\"Basic text preprocessing function.\"\"\"\n",
"    text = text.replace(\"\\n\", \" \")\n",
"    text = text.replace(\"\\r\", \" \")\n",
"    sentences = re.findall(r'[^.!?]*[.!?]', text)\n",
"    # Filter out sentences shorter than the specified minimum length\n",
"    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n",
"    # Recombine the filtered sentences\n",
"    return ' '.join(filtered_sentences)\n",
"\n",
"# Preprocess the text\n",
"book_text = preprocess_text(book_text)"
]
},
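{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the filtering concrete, here is a small optional check of `preprocess_text` on a made-up sample string (a minimal sketch; the sample sentences below are illustrative and not taken from the book)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: a quick check of preprocess_text on a made-up sample string\n",
"sample = \"Too short. This sentence is long enough to survive the filter! Is this one kept as well?\"\n",
"print(preprocess_text(sample))\n",
"# Sentences with fewer than five words (here \"Too short.\") are dropped"
]
},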
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chunking with Chonkie**\n",
"\n",
"Next, we will use Chonkie to chunk our text into semantic chunks."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of chunks: 6148\n",
"Time taken: 2.2917541670030914\n"
]
}
],
"source": [
"# Initialize an SDPMChunker (semantic double-pass merging chunker) from Chonkie with the potion-base-8M model\n",
"chunker = SDPMChunker(\n",
"    embedding_model=\"minishlab/potion-base-8M\",\n",
"    similarity_threshold=0.3,\n",
"    skip_window=5,\n",
"    chunk_size=256\n",
")\n",
"\n",
"# Chunk the text\n",
"start = perf_counter()\n",
"chunks = chunker.chunk(book_text)\n",
"print(f\"Number of chunks: {len(chunks)}\")\n",
"print(f\"Time taken: {perf_counter() - start}\")"
]
},
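{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can look at how large the chunks are. This is a minimal sketch that relies only on the `chunk.text` attribute, which we also use below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the chunk size distribution in words, using only chunk.text\n",
"word_counts = [len(chunk.text.split()) for chunk in chunks]\n",
"print(f\"Average chunk length: {sum(word_counts) / len(word_counts):.1f} words\")\n",
"print(f\"Min / max chunk length: {min(word_counts)} / {max(word_counts)} words\")"
]
},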
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that's it: we chunked the entirety of War and Peace in ~2 seconds. Not bad! Let's look at some example chunks."
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Hard as it was for Princess Mary to emerge from the realm of secluded contemplation in which she had lived till then, and sorry and almost ashamed as she felt to leave Natásha alone, yet the cares of life demanded her attention and she involuntarily yielded to them. She went through the accounts with Alpátych, conferred with Dessalles about her nephew, and gave orders and made preparations for the journey to Moscow. Natásha remained alone and, from the time Princess Mary began making preparations for departure, held aloof from her too. Princess Mary asked the countess to let Natásha go with her to Moscow, and both parents gladly accepted this offer, for they saw their daughter losing strength every day and thought that a change of scene and the advice of Moscow doctors would be good for her. “I am not going anywhere,” Natásha replied when this was proposed to her. “Do please just leave me alone! ” And she ran out of the room, with difficulty refraining from tears of vexation and irritation rather than of sorrow. \n",
"\n",
" In all these words she saw only that the danger threatening her son would not soon be over. \n",
"\n",
" When later on in his memoirs Count Rostopchín explained his actions at this time, he repeatedly says that he was then actuated by two important considerations: to maintain tranquillity in Moscow and expedite the departure of the inhabitants. If one accepts this twofold aim all Rostopchín’s actions appear irreproachable. “Why were the holy relics, the arms, ammunition, gunpowder, and stores of corn not removed? Why were thousands of inhabitants deceived into believing that Moscow would not be given up—and thereby ruined? \n",
"\n"
]
}
],
"source": [
"# Print a few example chunks\n",
"for _ in range(3):\n",
"    chunk = random.choice(chunks)\n",
"    print(chunk.text, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n",
"\n",
"**Creating a vector search index**"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time taken: 1.5817922909918707\n"
]
}
],
"source": [
"# Initialize an embedding model and encode the chunk texts\n",
"start = perf_counter()\n",
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
"chunk_texts = [chunk.text for chunk in chunks]\n",
"chunk_embeddings = model.encode(chunk_texts)\n",
"\n",
"# Create a Vicinity instance\n",
"vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)\n",
"print(f\"Time taken: {perf_counter() - start}\")"
]
},
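{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can sanity-check the embedding matrix before querying. This sketch assumes `model.encode` returns a NumPy array with one row per input text, which is how Model2Vec encodes a list of strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: the embedding matrix should have one row per chunk\n",
"print(f\"Embedding matrix shape: {chunk_embeddings.shape}\")\n",
"print(f\"Number of chunks: {len(chunk_texts)}\")"
]
},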
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Done! We embedded all our chunks and created an index in ~1.5 seconds. Now that we have our index, let's run a few queries against it.\n",
"\n",
"**Querying the index**"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query: Emperor Napoleon\n",
"--------------------------------------------------\n",
" In 1808 the Emperor Alexander went to Erfurt for a fresh interview with the Emperor Napoleon, and in the upper circles of Petersburg there was much talk of the grandeur of this important meeting. CHAPTER XXII In 1809 the intimacy between “the world’s two arbiters,” as Napoleon and Alexander were called, was such that when Napoleon declared war on Austria a Russian corps crossed the frontier to co-operate with our old enemy Bonaparte against our old ally the Emperor of Austria, and in court circles the possibility of marriage between Napoleon and one of Alexander’s sisters was spoken of. \n",
"\n",
" ) “It’s in the Emperor’s service. \n",
"\n",
" “The day before yesterday it was ‘Napoléon, France, bravoure’; yesterday, ‘Alexandre, Russie, grandeur. ’ One day our Emperor gives it and next day Napoleon. Tomorrow our Emperor will send a St. \n",
"\n",
"Query: The battle of Austerlitz\n",
"--------------------------------------------------\n",
" On the first arrival of the news of the battle of Austerlitz, Moscow had been bewildered. \n",
"\n",
" That city is taken; the Russian army suffers heavier losses than the opposing armies had suffered in the former war from Austerlitz to Wagram. \n",
"\n",
" Behave as you did at Austerlitz, Friedland, Vítebsk, and Smolénsk. \n",
"\n",
"Query: Paris\n",
"--------------------------------------------------\n",
" A man who doesn’t know Paris is a savage. You can tell a Parisian two leagues off. Paris is Talma, la Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that his conclusion was weaker than what had gone before, he added quickly: “There is only one Paris in the world. You have been to Paris and have remained Russian. \n",
"\n",
" Well, what is Paris saying? \n",
"\n",
" Paris would have been the capital of the world, and the French the envy of the nations! \n",
"\n"
]
}
],
"source": [
"queries = [\"Emperor Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n",
"for query in queries:\n",
"    print(f\"Query: {query}\\n{'-' * 50}\")\n",
"    query_embedding = model.encode(query)\n",
"    results = vicinity.query(query_embedding, k=3)[0]\n",
"\n",
"    for result in results:\n",
"        print(result[0], \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in about 3.5 seconds using Chonkie, Vicinity, and Model2Vec. Lightweight and fast, just how we like it."
]
},
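{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small wrap-up, here is a sketch of how the pieces above could be bundled into one helper. It uses only calls already shown in this notebook (`model.encode` and `vicinity.query`); the name `semantic_search` and the example query are illustrative, not part of the original tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative helper that wraps embedding and querying in one function\n",
"def semantic_search(query: str, k: int = 3) -> list[str]:\n",
"    \"\"\"Return the texts of the k chunks closest to the query.\"\"\"\n",
"    query_embedding = model.encode(query)\n",
"    results = vicinity.query(query_embedding, k=k)[0]\n",
"    # Each result is an (item, score) pair; keep only the chunk text\n",
"    return [text for text, _ in results]\n",
"\n",
"for text in semantic_search(\"the burning of Moscow\"):\n",
"    print(text, \"\\n\")"
]
}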
],
"metadata": {
"kernelspec": {
"display_name": "3.10.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |