Merge pull request #45 from philippgille/add-arxiv-example
Add arXiv semantic search example
philippgille authored Mar 10, 2024
2 parents 68ed3ab + 0a07c76 commit 0aad756
Showing 14 changed files with 311 additions and 26 deletions.
8 changes: 6 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ Because `chromem-go` is embeddable it enables you to add retrieval augmented gen

The focus is not scale or number of features, but simplicity.

Performance has not been a priority yet. Without optimizations (except some parallelization with goroutines) querying 5,000 documents takes ~500ms on a mid-range laptop CPU (11th Gen Intel i5-1135G7, like in the first generation Framework Laptop 13).

> ⚠️ The project is in beta, under heavy construction, and may introduce breaking changes in releases before `v1.0.0`. All changes are documented in the [`CHANGELOG`](./CHANGELOG.md).
## Contents
Expand Down Expand Up @@ -51,7 +53,7 @@ Fine-tuning an LLM can help a bit, but it's more meant to improve the LLM's reasoning
4. In the question to the LLM, you provide this content alongside your question.
5. The LLM can take this up-to-date precise content into account when answering.

Check out the [example code](example) to see it in action!
Check out the [example code](examples) to see it in action!

## Interface

Expand Down Expand Up @@ -176,7 +178,9 @@ See the Godoc for details: <https://pkg.go.dev/github.com/philippgille/chromem-g

## Usage

For a full, working example, using the vector database for retrieval augmented generation (RAG) and locally running embeddings model and LLM (in Ollama), see the [example code](example).
See the Godoc for a reference: <https://pkg.go.dev/github.com/philippgille/chromem-go>

For full, working examples that use the vector database for retrieval augmented generation (RAG) and semantic search, with either OpenAI or a locally running embeddings model and LLM (in Ollama), see the [example code](examples).

## Motivation

Expand Down
10 changes: 0 additions & 10 deletions example/go.mod

This file was deleted.

9 changes: 9 additions & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Examples

1. [RAG Wikipedia Ollama](rag-wikipedia-ollama)
   - This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question. More specifically, the app does *question answering*.
- The underlying data is 200 Wikipedia articles (or rather their lead section / introduction).
- We run the embeddings model and LLM in [Ollama](https://github.com/ollama/ollama), to showcase how a RAG application can run entirely offline, without relying on OpenAI or other third party APIs.
2. [Semantic search arXiv OpenAI](semantic-search-arxiv-openai)
- This example shows a semantic search application, using `chromem-go` as vector database for finding semantically relevant search results.
- We load and search across ~5,000 arXiv papers in the "Computer Science - Computation and Language" category, which is the relevant one for Natural Language Processing (NLP) related papers.
File renamed without changes.
25 changes: 13 additions & 12 deletions example/README.md → examples/rag-wikipedia-ollama/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Example
# RAG Wikipedia Ollama

This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question.
This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question. More specifically, the app does *question answering*. The underlying data is 200 Wikipedia articles (or rather their lead section / introduction).

We run the embeddings model and LLM in [Ollama](https://github.com/ollama/ollama), to showcase how a RAG application can run entirely offline, without relying on OpenAI or other third party APIs. It doesn't require a GPU, and a CPU like an 11th Gen Intel i5-1135G7 (like in the first generation Framework Laptop 13) is fast enough.

Expand Down Expand Up @@ -29,13 +29,14 @@ The output can differ slightly on each run, but it's along the lines of:
2024/03/02 20:02:34 Reading JSON lines...
2024/03/02 20:02:34 Adding documents to chromem-go, including creating their embeddings via Ollama API...
2024/03/02 20:03:11 Querying chromem-go...
2024/03/02 20:03:11 Search took 231.672667ms
2024/03/02 20:03:11 Document 1 (similarity: 0.723627): "Malleable Iron Range Company was a company that existed from 1896 to 1985 and primarily produced kitchen ranges made of malleable iron but also produced a variety of other related products. The company's primary trademark was 'Monarch' and was colloquially often referred to as the Monarch Company or just Monarch."
2024/03/02 20:03:11 Document 2 (similarity: 0.550584): "The American Motor Car Company was a short-lived company in the automotive industry founded in 1906 lasting until 1913. It was based in Indianapolis Indiana United States. The American Motor Car Company pioneered the underslung design."
2024/03/02 20:03:11 Asking LLM with augmented question...
2024/03/02 20:03:32 Reply after augmenting the question with knowledge: "The Monarch Company existed from 1896 to 1985."
```

The majority of the time here is spent during the embeddings creation as well as the LLM conversation, which are not part of `chromem-go`.
The majority of the time here is spent during the embeddings creation, where we are limited by the performance of the Ollama API, which depends on your CPU/GPU and the embeddings model.

## OpenAI

Expand All @@ -48,10 +49,10 @@ Then, if you want to create the embeddings via OpenAI, but still use Gemma 2B as
<details><summary>Apply this patch</summary>

```diff
diff --git a/example/main.go b/example/main.go
diff --git a/examples/rag-wikipedia-ollama/main.go b/examples/rag-wikipedia-ollama/main.go
index 55b3076..cee9561 100644
--- a/example/main.go
+++ b/example/main.go
--- a/examples/rag-wikipedia-ollama/main.go
+++ b/examples/rag-wikipedia-ollama/main.go
@@ -14,8 +14,6 @@ import (

const (
Expand Down Expand Up @@ -88,10 +89,10 @@ Or alternatively, if you want to use OpenAI for everything (embeddings creation
<details><summary>Apply this patch</summary>

```diff
diff --git a/example/llm.go b/example/llm.go
diff --git a/examples/rag-wikipedia-ollama/llm.go b/examples/rag-wikipedia-ollama/llm.go
index 1fde4ec..7cb81cc 100644
--- a/example/llm.go
+++ b/example/llm.go
--- a/examples/rag-wikipedia-ollama/llm.go
+++ b/examples/rag-wikipedia-ollama/llm.go
@@ -2,23 +2,13 @@ package main

import (
Expand Down Expand Up @@ -138,10 +139,10 @@ index 1fde4ec..7cb81cc 100644
Messages: messages,
})
if err != nil {
diff --git a/example/main.go b/example/main.go
diff --git a/examples/rag-wikipedia-ollama/main.go b/examples/rag-wikipedia-ollama/main.go
index 55b3076..044a246 100644
--- a/example/main.go
+++ b/example/main.go
--- a/examples/rag-wikipedia-ollama/main.go
+++ b/examples/rag-wikipedia-ollama/main.go
@@ -12,19 +12,11 @@ import (
"github.com/philippgille/chromem-go"
)
Expand Down
File renamed without changes.
10 changes: 10 additions & 0 deletions examples/rag-wikipedia-ollama/go.mod
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
module github.com/philippgille/chromem-go/examples/rag-wikipedia-ollama

go 1.21

require (
github.com/philippgille/chromem-go v0.0.0
github.com/sashabaranov/go-openai v1.17.9
)

replace github.com/philippgille/chromem-go => ./../..
File renamed without changes.
File renamed without changes.
10 changes: 8 additions & 2 deletions example/main.go → examples/rag-wikipedia-ollama/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import (
"os"
"runtime"
"strconv"
"time"

"github.com/philippgille/chromem-go"
)
Expand Down Expand Up @@ -62,6 +63,7 @@ func main() {
if err != nil {
panic(err)
}
defer f.Close()
d := json.NewDecoder(f)
log.Println("Reading JSON lines...")
for i := 1; ; i++ {
Expand Down Expand Up @@ -91,15 +93,18 @@ func main() {
log.Println("Not reading JSON lines because collection was loaded from persistent storage.")
}

// Search for documents similar to the one we added just by passing the original
// question.
// Search for documents that are semantically similar to the original question.
// We ask for the two most similar documents, but you can use more or less depending
// on your needs and the supported context size of the LLM you use.
// You can limit the search by filtering on content or metadata (like the article's
// category), but we don't do that in this example.
start := time.Now()
log.Println("Querying chromem-go...")
docRes, err := collection.Query(ctx, question, 2, nil, nil)
if err != nil {
panic(err)
}
log.Println("Search took", time.Since(start))
// Here you could filter out any documents whose similarity is below a certain threshold.
// if docRes[...].Similarity < 0.5 { ...

Expand All @@ -124,6 +129,7 @@ func main() {
2024/03/02 20:02:34 Reading JSON lines...
2024/03/02 20:02:34 Adding documents to chromem-go, including creating their embeddings via Ollama API...
2024/03/02 20:03:11 Querying chromem-go...
2024/03/02 20:03:11 Search took 231.672667ms
2024/03/02 20:03:11 Document 1 (similarity: 0.723627): "Malleable Iron Range Company was a company that existed from 1896 to 1985 and primarily produced kitchen ranges made of malleable iron but also produced a variety of other related products. The company's primary trademark was 'Monarch' and was colloquially often referred to as the Monarch Company or just Monarch."
2024/03/02 20:03:11 Document 2 (similarity: 0.550584): "The American Motor Car Company was a short-lived company in the automotive industry founded in 1906 lasting until 1913. It was based in Indianapolis Indiana United States. The American Motor Car Company pioneered the underslung design."
2024/03/02 20:03:11 Asking LLM with augmented question...
Expand Down
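The diff above adds a comment about filtering out documents below a similarity threshold. That post-processing step can be sketched as follows; note that `Result` here is a stand-in struct mirroring only the `Similarity` field used in the example, not the library's actual query-result type:

```go
package main

import "fmt"

// Result is a hypothetical stand-in for a query result, holding only the
// fields this sketch needs. chromem-go's real result type may differ.
type Result struct {
	Content    string
	Similarity float32
}

// filterBySimilarity keeps only results at or above the given threshold.
func filterBySimilarity(results []Result, threshold float32) []Result {
	filtered := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Similarity >= threshold {
			filtered = append(filtered, r)
		}
	}
	return filtered
}

func main() {
	// Similarities taken from the example output above.
	docRes := []Result{
		{Content: "Malleable Iron Range Company ...", Similarity: 0.723627},
		{Content: "The American Motor Car Company ...", Similarity: 0.550584},
	}
	kept := filterBySimilarity(docRes, 0.6)
	fmt.Println(len(kept)) // prints 1
}
```

A threshold like `0.6` is dataset- and model-dependent; inspect your own similarity scores before picking one.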
1 change: 1 addition & 0 deletions examples/semantic-search-arxiv-openai/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/db
84 changes: 84 additions & 0 deletions examples/semantic-search-arxiv-openai/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# Semantic search arXiv OpenAI

This example shows a semantic search application, using `chromem-go` as vector database for finding semantically relevant search results. We load and search across ~5,000 arXiv papers in the "Computer Science - Computation and Language" category, which is the relevant one for Natural Language Processing (NLP) related papers.

This is not a retrieval augmented generation (RAG) app, because after *retrieving* the semantically relevant results, we don't *augment* an LLM prompt with them. No LLM generates the final output.

## How to run

1. Prepare the dataset
1. Download `arxiv-metadata-oai-snapshot.json` from <https://www.kaggle.com/datasets/Cornell-University/arxiv>
   2. Filter by the "Computer Science - Computation and Language" category (see [taxonomy](https://arxiv.org/category_taxonomy)) and by update date in 2023
1. Ensure you have [ripgrep](https://github.com/BurntSushi/ripgrep) installed, or adapt the following commands to use grep
2. Run `rg '"categories":"cs.CL"' ~/Downloads/arxiv-metadata-oai-snapshot.json | rg '"update_date":"2023' > /tmp/arxiv_cs-cl_2023.jsonl` (adapt input file path if necessary)
3. Check the data
      1. `wc -l /tmp/arxiv_cs-cl_2023.jsonl` should show ~5,000 lines
      2. `du -h /tmp/arxiv_cs-cl_2023.jsonl` should show ~8.8 MB
2. Set the OpenAI API key in your env as `OPENAI_API_KEY`
3. Run the example: `go run .`
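The ripgrep filtering in step 1 can also be done in Go with just the standard library. A minimal sketch, matching the same two conditions as the `rg` pipeline above (the `categories` and `update_date` field names come from the arXiv metadata schema used in those commands):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// paper holds only the metadata fields we filter on.
type paper struct {
	Categories string `json:"categories"`
	UpdateDate string `json:"update_date"`
}

// keepLine reports whether a JSON line describes a paper whose sole
// category is cs.CL and whose update date is in 2023, mirroring the
// exact-substring matches of the rg commands.
func keepLine(line string) bool {
	var p paper
	if err := json.Unmarshal([]byte(line), &p); err != nil {
		return false
	}
	return p.Categories == "cs.CL" && strings.HasPrefix(p.UpdateDate, "2023")
}

func main() {
	// Three hand-written sample lines; only the first matches both filters.
	input := `{"categories":"cs.CL","update_date":"2023-05-01"}
{"categories":"cs.CV","update_date":"2023-05-01"}
{"categories":"cs.CL","update_date":"2022-01-01"}`

	sc := bufio.NewScanner(strings.NewReader(input))
	kept := 0
	for sc.Scan() {
		if keepLine(sc.Text()) {
			kept++
		}
	}
	fmt.Println(kept) // prints 1
}
```

To process the real snapshot, open `arxiv-metadata-oai-snapshot.json` with `os.Open` and scan it the same way, writing matching lines to the output file.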

## Output

The output can differ slightly on each run, but it's along the lines of:

```log
2024/03/10 18:23:55 Setting up chromem-go...
2024/03/10 18:23:55 Reading JSON lines...
2024/03/10 18:23:55 Read and parsed 5006 documents.
2024/03/10 18:23:55 Adding documents to chromem-go, including creating their embeddings via OpenAI API...
2024/03/10 18:28:12 Querying chromem-go...
2024/03/10 18:28:12 Search took 529.451163ms
2024/03/10 18:28:12 Search results:
1) Similarity 0.488895:
URL: https://arxiv.org/abs/2209.15469
Submitter: Christian Buck
Title: Zero-Shot Retrieval with Search Agents and Hybrid Environments
Abstract: Learning to search is the task of building artificial agents that learn to autonomously use a search...
2) Similarity 0.480713:
URL: https://arxiv.org/abs/2305.11516
Submitter: Ryo Nagata Dr.
Title: Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment
Abstract: In this paper, we propose methods for discovering semantic differences in words appearing in two cor...
3) Similarity 0.476079:
URL: https://arxiv.org/abs/2310.14025
Submitter: Maria Lymperaiou
Title: Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Abstract: Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an i...
4) Similarity 0.474883:
URL: https://arxiv.org/abs/2302.14785
Submitter: Teven Le Scao
Title: Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation
Abstract: A key feature of neural models is that they can produce semantic vector representations of objects (...
5) Similarity 0.470326:
URL: https://arxiv.org/abs/2309.02403
Submitter: Dallas Card
Title: Substitution-based Semantic Change Detection using Contextual Embeddings
Abstract: Measuring semantic change has thus far remained a task where methods using contextual embeddings hav...
6) Similarity 0.466851:
URL: https://arxiv.org/abs/2309.08187
Submitter: Vu Tran
Title: Encoded Summarization: Summarizing Documents into Continuous Vector Space for Legal Case Retrieval
Abstract: We present our method for tackling a legal case retrieval task by introducing our method of encoding...
7) Similarity 0.461783:
URL: https://arxiv.org/abs/2307.16638
Submitter: Maiia Bocharova Bocharova
Title: VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain
Abstract: The paper focuses on deep learning semantic search algorithms applied in the HR domain. The aim of t...
8) Similarity 0.460481:
URL: https://arxiv.org/abs/2106.07400
Submitter: Clara Meister
Title: Determinantal Beam Search
Abstract: Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be ...
9) Similarity 0.460001:
URL: https://arxiv.org/abs/2305.04049
Submitter: Yuxia Wu
Title: Actively Discovering New Slots for Task-oriented Conversation
Abstract: Existing task-oriented conversational search systems heavily rely on domain ontologies with pre-defi...
10) Similarity 0.458321:
URL: https://arxiv.org/abs/2305.08654
Submitter: Taichi Aida
Title: Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings
Abstract: Languages are dynamic entities, where the meanings associated with words constantly change with time...
```

The majority of the time here is spent during the embeddings creation, where we are limited by the performance of the OpenAI API.
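The similarity scores in the output above are cosine similarities between the query's embedding vector and each document's embedding vector. A minimal stdlib sketch of that computation, for illustration only (not chromem-go's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns dot(a, b) / (|a| * |b|), i.e. the cosine of the
// angle between the two vectors. It assumes both slices have equal length.
// A value near 1 means very similar, near 0 means unrelated.
func cosineSimilarity(a, b []float32) float32 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	return float32(dot / (math.Sqrt(normA) * math.Sqrt(normB)))
}

func main() {
	// Tiny made-up 3-dimensional vectors; real embeddings have hundreds
	// or thousands of dimensions.
	query := []float32{0.1, 0.9, 0.0}
	doc := []float32{0.2, 0.8, 0.1}
	fmt.Printf("%.4f\n", cosineSimilarity(query, doc))
}
```

Ranking the search results then amounts to computing this score for every stored document and sorting in descending order, which is why query time grows with collection size.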
7 changes: 7 additions & 0 deletions examples/semantic-search-arxiv-openai/go.mod
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
module github.com/philippgille/chromem-go/examples/semantic-search-arxiv-openai

go 1.21

require github.com/philippgille/chromem-go v0.0.0

replace github.com/philippgille/chromem-go => ./../..
