Updated Notes on Anserini Build on Windows #2243

# Anserini: OpenAI-ada2 Embeddings for MS MARCO Passage

This guide explains how to reproduce experiments with OpenAI-ada2 embeddings on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
In these experiments, we are using pre-encoded queries (i.e., cached results of query embeddings).

## Corpus Download

Let's start off by downloading the corpus.
To be clear, the "corpus" here refers to the embedding vectors generated by OpenAI's ada2 embedding endpoint.

Download the tarball containing embedding vectors and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-openai-ada2.tar -P collections/
tar xvf collections/msmarco-passage-openai-ada2.tar -C collections/
```

The tarball is 109 GB and has an MD5 checksum of `a4d843d522ff3a3af7edbee789a63402`.
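Given the size of the download, it's worth verifying the checksum before unpacking. A minimal sketch, assuming GNU coreutils' `md5sum` is available (the helper name `verify_md5` is ours, not part of Anserini):

```shell
# Verify a file's MD5 against an expected value; returns 0 on a match.
verify_md5() {
  expected="$2"
  actual=$(md5sum "$1" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# For the tarball downloaded above:
# verify_md5 collections/msmarco-passage-openai-ada2.tar a4d843d522ff3a3af7edbee789a63402 \
#   && echo "checksum OK"
```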

## Indexing

Indexing is a bit tricky because the HNSW implementation in Lucene restricts vectors to 1024 dimensions, which is not sufficient for OpenAI's 1536-dimensional embeddings.
This issue is described [here](https://github.com/apache/lucene/issues/11507).
The resolution is to make vector dimensions configurable on a per `Codec` basis, as in [this patch](https://github.com/apache/lucene/pull/12436) in Lucene.
However, as of early August 2023, there is no public release of Lucene that has these features folded in.
Thus, no public release of Lucene can directly index OpenAI's ada2 embedding vectors.

However, we were able to hack around this limitation in [this pull request](https://github.com/castorini/anserini/pull/2161).
Our workaround is incredibly janky, which is why we're leaving it on a branch and _not_ merging it into trunk.
The sketch of the solution is as follows: we copy relevant source files from Lucene directly into our source tree, and when we build the fatjar, the class files of our "local versions" take precedence, and hence override the vector size limitations.
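For concreteness, a hedged sketch of fetching the PR branch and building the fatjar; the local branch name is our own choice, and the Maven invocation follows the standard Anserini build, so adjust as needed:

```shell
# Fetch the workaround branch from PR #2161 and build the fatjar.
# (The local branch name "ada2-workaround" is ours, not part of the repo.)
git clone https://github.com/castorini/anserini.git
cd anserini
git fetch origin pull/2161/head:ada2-workaround
git checkout ada2-workaround
mvn clean package -Dmaven.test.skip=true
```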

So, to get the indexing working, we'll need to pull the above branch, build, and index with the following command:

```bash
java -cp target/anserini-0.21.1-SNAPSHOT-fatjar.jar io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/msmarco-passage-openai-ada2 \
-index indexes/lucene-hnsw.msmarco-passage-openai-ada2/ \
-generator LuceneDenseVectorDocumentGenerator \
-threads 16 -M 16 -efC 100 \
>& logs/log.msmarco-passage-openai-ada2 &
```

Note that we're _not_ using `target/appassembler/bin/IndexHnswDenseVectors`.
Instead, we directly rely on the fatjar.

The indexing job takes around three hours on our `orca` server.
Upon completion, we should have an index with 8,841,823 documents.

## Retrieval

Other than the indexing trick, retrieval and evaluation are straightforward.

Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.

After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes, replacing `{SETTING}` with the desired setting out of [`msmarco-passage.dev-subset.openai-ada2`, `dl19-passage.openai-ada2`, `dl20-passage.openai-ada2`, `dl19-passage.openai-ada2-hyde`, `dl20-passage.openai-ada2-hyde`]:

```bash
target/appassembler/bin/SearchHnswDenseVectors \
-index indexes/lucene-hnsw.msmarco-passage-openai-ada2/ \
-topics tools/topics-and-qrels/topics.{SETTING}.jsonl.gz \
-topicreader JsonIntVector \
-output runs/run.{SETTING}.txt \
-querygenerator VectorQueryGenerator -topicfield vector -threads 16 -hits 1000 -efSearch 1000 &
```
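To cover all five settings in one pass, the command above can be wrapped in a loop; this is a straightforward expansion of the settings list and adds nothing beyond the command it repeats:

```shell
# Run retrieval once per setting; the settings are copied from the list above.
for SETTING in msmarco-passage.dev-subset.openai-ada2 \
               dl19-passage.openai-ada2 \
               dl20-passage.openai-ada2 \
               dl19-passage.openai-ada2-hyde \
               dl20-passage.openai-ada2-hyde; do
  target/appassembler/bin/SearchHnswDenseVectors \
    -index indexes/lucene-hnsw.msmarco-passage-openai-ada2/ \
    -topics tools/topics-and-qrels/topics.${SETTING}.jsonl.gz \
    -topicreader JsonIntVector \
    -output runs/run.${SETTING}.txt \
    -querygenerator VectorQueryGenerator -topicfield vector -threads 16 -hits 1000 -efSearch 1000
done
```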

## Evaluation

Evaluation can be performed using `trec_eval`.

For `msmarco-passage.dev-subset.openai-ada2`:
```bash
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.dev-subset.openai-ada2.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.dev-subset.openai-ada2.txt
```

Otherwise, set `{QRELS}` as `dl19-passage` or `dl20-passage` according to the `{SETTING}` and run:
```bash
tools/eval/trec_eval.9.0.4/trec_eval -c -l 2 -m map tools/topics-and-qrels/qrels.{QRELS}.txt runs/run.{SETTING}.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.{QRELS}.txt runs/run.{SETTING}.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -l 2 -m recall.1000 tools/topics-and-qrels/qrels.{QRELS}.txt runs/run.{SETTING}.txt
```
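The `{SETTING}` → `{QRELS}` mapping above is just the setting name with its model suffix stripped. A small helper sketch (the function name `qrels_for` is ours, not part of Anserini):

```shell
# Map a retrieval setting to its qrels file prefix by dropping the suffix.
qrels_for() {
  case "$1" in
    dl19-passage.*)               echo "dl19-passage" ;;
    dl20-passage.*)               echo "dl20-passage" ;;
    msmarco-passage.dev-subset.*) echo "msmarco-passage.dev-subset" ;;
    *) echo "unknown setting: $1" >&2; return 1 ;;
  esac
}

# Example: qrels_for dl19-passage.openai-ada2-hyde prints "dl19-passage"
```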

## Effectiveness

With the above commands, you should be able to reproduce the following results:

```
# msmarco-passage.dev-subset.openai-ada2
recip_rank all 0.3434
recall_1000 all 0.9841

# dl19-passage.openai-ada2
map all 0.4786
ndcg_cut_10 all 0.7035
recall_1000 all 0.8625

# dl20-passage.openai-ada2
map all 0.4771
ndcg_cut_10 all 0.6759
recall_1000 all 0.8705

# dl19-passage.openai-ada2-hyde
map all 0.5124
ndcg_cut_10 all 0.7163
recall_1000 all 0.8968

# dl20-passage.openai-ada2-hyde
map all 0.4938
ndcg_cut_10 all 0.6666
recall_1000 all 0.8919
```

Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
Nevertheless, scores are generally stable to the third digit after the decimal point.

## Reproduction Log[*](reproducibility.md)
