Merge pull request #45 from philippgille/add-arxiv-example
Add arXiv semantic search example
philippgille authored Mar 10, 2024
2 parents 68ed3ab + 0a07c76 commit 0aad756
Showing 14 changed files with 311 additions and 26 deletions.
8 changes: 6 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ Because `chromem-go` is embeddable it enables you to add retrieval augmented gen

The focus is not scale or number of features, but simplicity.

Performance has not been a priority yet. Without optimizations (except some parallelization with goroutines) querying 5,000 documents takes ~500ms on a mid-range laptop CPU (11th Gen Intel i5-1135G7, like in the first generation Framework Laptop 13).

> ⚠️ The project is in beta, under heavy construction, and may introduce breaking changes in releases before `v1.0.0`. All changes are documented in the [`CHANGELOG`](./CHANGELOG.md).
## Contents
Expand Down Expand Up @@ -51,7 +53,7 @@ Fine-tuning an LLM can help a bit, but it's more meant to improve the LLM's reasoning
4. In the question to the LLM, you provide this content alongside your question.
5. The LLM can take this up-to-date precise content into account when answering.

Check out the [example code](example) to see it in action!
Check out the [example code](examples) to see it in action!

## Interface

Expand Down Expand Up @@ -176,7 +178,9 @@ See the Godoc for details: <https://pkg.go.dev/github.com/philippgille/chromem-g

## Usage

For a full, working example, using the vector database for retrieval augmented generation (RAG) and locally running embeddings model and LLM (in Ollama), see the [example code](example).
See the Godoc for a reference: <https://pkg.go.dev/github.com/philippgille/chromem-go>

For full, working examples that use the vector database for retrieval augmented generation (RAG) and semantic search, with either OpenAI or a locally running embeddings model and LLM (in Ollama), see the [example code](examples).

## Motivation

Expand Down
10 changes: 0 additions & 10 deletions example/go.mod

This file was deleted.

9 changes: 9 additions & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Examples

1. [RAG Wikipedia Ollama](rag-wikipedia-ollama)
   - This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question. More specifically, the app does *question answering*.
- The underlying data is 200 Wikipedia articles (or rather their lead section / introduction).
- We run the embeddings model and LLM in [Ollama](https://github.com/ollama/ollama), to showcase how a RAG application can run entirely offline, without relying on OpenAI or other third party APIs.
2. [Semantic search arXiv OpenAI](semantic-search-arxiv-openai)
- This example shows a semantic search application, using `chromem-go` as vector database for finding semantically relevant search results.
- We load and search across ~5,000 arXiv papers in the "Computer Science - Computation and Language" category, which is the relevant one for Natural Language Processing (NLP) related papers.
File renamed without changes.
25 changes: 13 additions & 12 deletions example/README.md → examples/rag-wikipedia-ollama/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Example
# RAG Wikipedia Ollama

This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question.
This example shows a retrieval augmented generation (RAG) application, using `chromem-go` as knowledge base for finding relevant info for a question. More specifically, the app does *question answering*. The underlying data is 200 Wikipedia articles (or rather their lead section / introduction).

We run the embeddings model and LLM in [Ollama](https://github.com/ollama/ollama), to showcase how a RAG application can run entirely offline, without relying on OpenAI or other third party APIs. It doesn't require a GPU, and a CPU like an 11th Gen Intel i5-1135G7 (like in the first generation Framework Laptop 13) is fast enough.

Expand Down Expand Up @@ -29,13 +29,14 @@ The output can differ slightly on each run, but it's along the lines of:
2024/03/02 20:02:34 Reading JSON lines...
2024/03/02 20:02:34 Adding documents to chromem-go, including creating their embeddings via Ollama API...
2024/03/02 20:03:11 Querying chromem-go...
2024/03/02 20:03:11 Search took 231.672667ms
2024/03/02 20:03:11 Document 1 (similarity: 0.723627): "Malleable Iron Range Company was a company that existed from 1896 to 1985 and primarily produced kitchen ranges made of malleable iron but also produced a variety of other related products. The company's primary trademark was 'Monarch' and was colloquially often referred to as the Monarch Company or just Monarch."
2024/03/02 20:03:11 Document 2 (similarity: 0.550584): "The American Motor Car Company was a short-lived company in the automotive industry founded in 1906 lasting until 1913. It was based in Indianapolis Indiana United States. The American Motor Car Company pioneered the underslung design."
2024/03/02 20:03:11 Asking LLM with augmented question...
2024/03/02 20:03:32 Reply after augmenting the question with knowledge: "The Monarch Company existed from 1896 to 1985."
```

The majority of the time here is spent during the embeddings creation as well as the LLM conversation, which are not part of `chromem-go`.
The majority of the time here is spent during the embeddings creation, where we are limited by the performance of the Ollama API, which depends on your CPU/GPU and the embeddings model.

## OpenAI

Expand All @@ -48,10 +49,10 @@ Then, if you want to create the embeddings via OpenAI, but still use Gemma 2B as
<details><summary>Apply this patch</summary>

```diff
diff --git a/example/main.go b/example/main.go
diff --git a/examples/rag-wikipedia-ollama/main.go b/examples/rag-wikipedia-ollama/main.go
index 55b3076..cee9561 100644
--- a/example/main.go
+++ b/example/main.go
--- a/examples/rag-wikipedia-ollama/main.go
+++ b/examples/rag-wikipedia-ollama/main.go
@@ -14,8 +14,6 @@ import (

const (
Expand Down Expand Up @@ -88,10 +89,10 @@ Or alternatively, if you want to use OpenAI for everything (embeddings creation
<details><summary>Apply this patch</summary>

```diff
diff --git a/example/llm.go b/example/llm.go
diff --git a/examples/rag-wikipedia-ollama/llm.go b/examples/rag-wikipedia-ollama/llm.go
index 1fde4ec..7cb81cc 100644
--- a/example/llm.go
+++ b/example/llm.go
--- a/examples/rag-wikipedia-ollama/llm.go
+++ b/examples/rag-wikipedia-ollama/llm.go
@@ -2,23 +2,13 @@ package main

import (
Expand Down Expand Up @@ -138,10 +139,10 @@ index 1fde4ec..7cb81cc 100644
Messages: messages,
})
if err != nil {
diff --git a/example/main.go b/example/main.go
diff --git a/examples/rag-wikipedia-ollama/main.go b/examples/rag-wikipedia-ollama/main.go
index 55b3076..044a246 100644
--- a/example/main.go
+++ b/example/main.go
--- a/examples/rag-wikipedia-ollama/main.go
+++ b/examples/rag-wikipedia-ollama/main.go
@@ -12,19 +12,11 @@ import (
"github.com/philippgille/chromem-go"
)
Expand Down
File renamed without changes.
10 changes: 10 additions & 0 deletions examples/rag-wikipedia-ollama/go.mod
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
module github.com/philippgille/chromem-go/examples/rag-wikipedia-ollama

go 1.21

require (
github.com/philippgille/chromem-go v0.0.0
github.com/sashabaranov/go-openai v1.17.9
)

replace github.com/philippgille/chromem-go => ./../..
File renamed without changes.
File renamed without changes.
10 changes: 8 additions & 2 deletions example/main.go → examples/rag-wikipedia-ollama/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import (
"os"
"runtime"
"strconv"
"time"

"github.com/philippgille/chromem-go"
)
Expand Down Expand Up @@ -62,6 +63,7 @@ func main() {
if err != nil {
panic(err)
}
defer f.Close()
d := json.NewDecoder(f)
log.Println("Reading JSON lines...")
for i := 1; ; i++ {
Expand Down Expand Up @@ -91,15 +93,18 @@ func main() {
log.Println("Not reading JSON lines because collection was loaded from persistent storage.")
}

// Search for documents similar to the one we added just by passing the original
// question.
// Search for documents that are semantically similar to the original question.
// We ask for the two most similar documents, but you can use more or less depending
// on your needs and the supported context size of the LLM you use.
// You can limit the search by filtering on content or metadata (like the article's
// category), but we don't do that in this example.
start := time.Now()
log.Println("Querying chromem-go...")
docRes, err := collection.Query(ctx, question, 2, nil, nil)
if err != nil {
panic(err)
}
log.Println("Search took", time.Since(start))
// Here you could filter out any documents whose similarity is below a certain threshold.
// if docRes[...].Similarity < 0.5 { ...

Expand All @@ -124,6 +129,7 @@ func main() {
2024/03/02 20:02:34 Reading JSON lines...
2024/03/02 20:02:34 Adding documents to chromem-go, including creating their embeddings via Ollama API...
2024/03/02 20:03:11 Querying chromem-go...
2024/03/02 20:03:11 Search took 231.672667ms
2024/03/02 20:03:11 Document 1 (similarity: 0.723627): "Malleable Iron Range Company was a company that existed from 1896 to 1985 and primarily produced kitchen ranges made of malleable iron but also produced a variety of other related products. The company's primary trademark was 'Monarch' and was colloquially often referred to as the Monarch Company or just Monarch."
2024/03/02 20:03:11 Document 2 (similarity: 0.550584): "The American Motor Car Company was a short-lived company in the automotive industry founded in 1906 lasting until 1913. It was based in Indianapolis Indiana United States. The American Motor Car Company pioneered the underslung design."
2024/03/02 20:03:11 Asking LLM with augmented question...
Expand Down
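The diff above adds a comment about filtering out documents below a similarity threshold. That post-processing step can be sketched as follows; note that `Result` here is a stand-in struct mirroring only the `Similarity` field used in the example, not the library's actual query-result type:

```go
package main

import "fmt"

// Result is a hypothetical stand-in for a query result, holding only the
// fields this sketch needs. chromem-go's real result type may differ.
type Result struct {
	Content    string
	Similarity float32
}

// filterBySimilarity keeps only results at or above the given threshold.
func filterBySimilarity(results []Result, threshold float32) []Result {
	filtered := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Similarity >= threshold {
			filtered = append(filtered, r)
		}
	}
	return filtered
}

func main() {
	// Similarities taken from the example output above.
	docRes := []Result{
		{Content: "Malleable Iron Range Company ...", Similarity: 0.723627},
		{Content: "The American Motor Car Company ...", Similarity: 0.550584},
	}
	kept := filterBySimilarity(docRes, 0.6)
	fmt.Println(len(kept)) // prints 1
}
```

A threshold like `0.6` is dataset- and model-dependent; inspect your own similarity scores before picking one.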
1 change: 1 addition & 0 deletions examples/semantic-search-arxiv-openai/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/db
84 changes: 84 additions & 0 deletions examples/semantic-search-arxiv-openai/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# Semantic search arXiv OpenAI

This example shows a semantic search application, using `chromem-go` as vector database for finding semantically relevant search results. We load and search across ~5,000 arXiv papers in the "Computer Science - Computation and Language" category, which is the relevant one for Natural Language Processing (NLP) related papers.

This is not a retrieval augmented generation (RAG) app, because after *retrieving* the semantically relevant results, we don't *augment* an LLM prompt with them. No LLM generates the final output.

## How to run

1. Prepare the dataset
1. Download `arxiv-metadata-oai-snapshot.json` from <https://www.kaggle.com/datasets/Cornell-University/arxiv>
   2. Filter by the "Computer Science - Computation and Language" category (see [taxonomy](https://arxiv.org/category_taxonomy)) and by update date in 2023
1. Ensure you have [ripgrep](https://github.com/BurntSushi/ripgrep) installed, or adapt the following commands to use grep
2. Run `rg '"categories":"cs.CL"' ~/Downloads/arxiv-metadata-oai-snapshot.json | rg '"update_date":"2023' > /tmp/arxiv_cs-cl_2023.jsonl` (adapt input file path if necessary)
3. Check the data
      1. `wc -l /tmp/arxiv_cs-cl_2023.jsonl` should show ~5,000 lines
      2. `du -h /tmp/arxiv_cs-cl_2023.jsonl` should show ~8.8 MB
2. Set the OpenAI API key in your env as `OPENAI_API_KEY`
3. Run the example: `go run .`
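The ripgrep filtering in step 1 can also be done in Go with just the standard library. A minimal sketch, matching the same two conditions as the `rg` pipeline above (the `categories` and `update_date` field names come from the arXiv metadata schema used in those commands):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// paper holds only the metadata fields we filter on.
type paper struct {
	Categories string `json:"categories"`
	UpdateDate string `json:"update_date"`
}

// keepLine reports whether a JSON line describes a paper whose sole
// category is cs.CL and whose update date is in 2023, mirroring the
// exact-substring matches of the rg commands.
func keepLine(line string) bool {
	var p paper
	if err := json.Unmarshal([]byte(line), &p); err != nil {
		return false
	}
	return p.Categories == "cs.CL" && strings.HasPrefix(p.UpdateDate, "2023")
}

func main() {
	// Three hand-written sample lines; only the first matches both filters.
	input := `{"categories":"cs.CL","update_date":"2023-05-01"}
{"categories":"cs.CV","update_date":"2023-05-01"}
{"categories":"cs.CL","update_date":"2022-01-01"}`

	sc := bufio.NewScanner(strings.NewReader(input))
	kept := 0
	for sc.Scan() {
		if keepLine(sc.Text()) {
			kept++
		}
	}
	fmt.Println(kept) // prints 1
}
```

To process the real snapshot, open `arxiv-metadata-oai-snapshot.json` with `os.Open` and scan it the same way, writing matching lines to the output file.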

## Output

The output can differ slightly on each run, but it's along the lines of:

```log
2024/03/10 18:23:55 Setting up chromem-go...
2024/03/10 18:23:55 Reading JSON lines...
2024/03/10 18:23:55 Read and parsed 5006 documents.
2024/03/10 18:23:55 Adding documents to chromem-go, including creating their embeddings via OpenAI API...
2024/03/10 18:28:12 Querying chromem-go...
2024/03/10 18:28:12 Search took 529.451163ms
2024/03/10 18:28:12 Search results:
1) Similarity 0.488895:
URL: https://arxiv.org/abs/2209.15469
Submitter: Christian Buck
Title: Zero-Shot Retrieval with Search Agents and Hybrid Environments
Abstract: Learning to search is the task of building artificial agents that learn to autonomously use a search...
2) Similarity 0.480713:
URL: https://arxiv.org/abs/2305.11516
Submitter: Ryo Nagata Dr.
Title: Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment
Abstract: In this paper, we propose methods for discovering semantic differences in words appearing in two cor...
3) Similarity 0.476079:
URL: https://arxiv.org/abs/2310.14025
Submitter: Maria Lymperaiou
Title: Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Abstract: Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an i...
4) Similarity 0.474883:
URL: https://arxiv.org/abs/2302.14785
Submitter: Teven Le Scao
Title: Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation
Abstract: A key feature of neural models is that they can produce semantic vector representations of objects (...
5) Similarity 0.470326:
URL: https://arxiv.org/abs/2309.02403
Submitter: Dallas Card
Title: Substitution-based Semantic Change Detection using Contextual Embeddings
Abstract: Measuring semantic change has thus far remained a task where methods using contextual embeddings hav...
6) Similarity 0.466851:
URL: https://arxiv.org/abs/2309.08187
Submitter: Vu Tran
Title: Encoded Summarization: Summarizing Documents into Continuous Vector Space for Legal Case Retrieval
Abstract: We present our method for tackling a legal case retrieval task by introducing our method of encoding...
7) Similarity 0.461783:
URL: https://arxiv.org/abs/2307.16638
Submitter: Maiia Bocharova Bocharova
Title: VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain
Abstract: The paper focuses on deep learning semantic search algorithms applied in the HR domain. The aim of t...
8) Similarity 0.460481:
URL: https://arxiv.org/abs/2106.07400
Submitter: Clara Meister
Title: Determinantal Beam Search
Abstract: Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be ...
9) Similarity 0.460001:
URL: https://arxiv.org/abs/2305.04049
Submitter: Yuxia Wu
Title: Actively Discovering New Slots for Task-oriented Conversation
Abstract: Existing task-oriented conversational search systems heavily rely on domain ontologies with pre-defi...
10) Similarity 0.458321:
URL: https://arxiv.org/abs/2305.08654
Submitter: Taichi Aida
Title: Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings
Abstract: Languages are dynamic entities, where the meanings associated with words constantly change with time...
```

The majority of the time here is spent during the embeddings creation, where we are limited by the performance of the OpenAI API.
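The similarity scores in the output above are cosine similarities between the query's embedding vector and each document's embedding vector. A minimal stdlib sketch of that computation, for illustration only (not chromem-go's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns dot(a, b) / (|a| * |b|), i.e. the cosine of the
// angle between the two vectors. It assumes both slices have equal length.
// A value near 1 means very similar, near 0 means unrelated.
func cosineSimilarity(a, b []float32) float32 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	return float32(dot / (math.Sqrt(normA) * math.Sqrt(normB)))
}

func main() {
	// Tiny made-up 3-dimensional vectors; real embeddings have hundreds
	// or thousands of dimensions.
	query := []float32{0.1, 0.9, 0.0}
	doc := []float32{0.2, 0.8, 0.1}
	fmt.Printf("%.4f\n", cosineSimilarity(query, doc))
}
```

Ranking the search results then amounts to computing this score for every stored document and sorting in descending order, which is why query time grows with collection size.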
7 changes: 7 additions & 0 deletions examples/semantic-search-arxiv-openai/go.mod
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
module github.com/philippgille/chromem-go/examples/semantic-search-arxiv-openai

go 1.21

require github.com/philippgille/chromem-go v0.0.0

replace github.com/philippgille/chromem-go => ./../..
