Add 'who cares' to onboarding doc (#2179)
lintool authored Sep 3, 2023
1 parent 0eb110b commit c17a5dc
Showing 3 changed files with 55 additions and 8 deletions.
22 changes: 16 additions & 6 deletions README.md
@@ -15,9 +15,13 @@ See [Yang et al. (SIGIR 2017)](https://dl.acm.org/authorize?N47337) and [Yang et

## 🎬 Getting Started

Many Anserini features are exposed in the [Pyserini](http://pyserini.io/) Python interface.
Most Anserini features are exposed in the [Pyserini](http://pyserini.io/) Python interface.
If you're more comfortable with Python, start there, although Anserini forms an important building block of Pyserini, so it remains worthwhile to learn about Anserini.

<!--
If you're looking for basic indexing and search capabilities, you might want to start there.
A low-effort way to try out Anserini is to look at our [online notebooks](https://github.com/castorini/anserini-notebooks), which will allow you to get started with just a few clicks.
-->

You'll need Java 11 and Maven 3.3+ to build Anserini.
Clone our repo with the `--recurse-submodules` option to make sure the `eval/` submodule also gets cloned (alternatively, use `git submodule update --init`).
@@ -27,10 +31,6 @@ Then, build using using Maven:
mvn clean package appassembler:assemble
```

Note that on Windows, tests may fail due to encoding issues, see [#1466](https://github.com/castorini/anserini/issues/1466).
A simple workaround is to skip tests by adding `-Dmaven.test.skip=true` to the above `mvn` command.
See [#1121](https://github.com/castorini/pyserini/discussions/1121) for additional discussions on debugging Windows build errors.

The `tools/` directory, which contains evaluation tools and other scripts, is actually [this repo](https://github.com/castorini/anserini-tools), integrated as a [Git submodule](https://git-scm.com/book/en/v2/Git-Tools-Submodules) (so that it can be shared across related projects).
Build as follows (you might get warnings, but okay to ignore):

@@ -39,7 +39,17 @@ cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make &
cd tools/eval/ndeval && make && cd ../../..
```

With that, you should be ready to go!
With that, you should be ready to go.
The onboarding path for Anserini starts [here](docs/start-here.md)!

<details>
<summary>Windows tips</summary>

Note that on Windows, tests may fail due to encoding issues, see [#1466](https://github.com/castorini/anserini/issues/1466).
A simple workaround is to skip tests by adding `-Dmaven.test.skip=true` to the above `mvn` command.
See [#1121](https://github.com/castorini/pyserini/discussions/1121) for additional discussions on debugging Windows build errors.

</details>

## ⚗️ Regression Experiments (+ Reproduction Guides)

Binary file added docs/Prometheus-Model.png
41 changes: 39 additions & 2 deletions docs/start-here.md
@@ -1,7 +1,7 @@
# Anserini: Start Here

This page provides the entry point for an introduction to information retrieval (i.e., search).
It also serves as an [onboarding path](https://github.com/lintool/guide/blob/master/ura.md) for University of Waterloo undergraduate (and graduate) students who wish to join my research group.
It also serves as an [onboarding path](https://github.com/castorini/onboarding) for University of Waterloo undergraduate and graduate students who wish to join my research group.

As a high-level tip for anyone going through these exercises: try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
By this, I mean, actually _read_ the surrounding explanations, understand the purpose of the commands, and use this guide as a springboard for additional explorations (for example, dig deeper into the code).
@@ -27,7 +27,44 @@ This problem has been given various names, e.g., the search problem, the informa
In most contexts, "ranking" and "retrieval" are used interchangeably.
Basically, this is what _search_ (i.e., information retrieval) is all about.

Let's try to unpack the definition a bit.
## Interlude: Who Cares?

At this point, it's worthwhile to pause and answer the question: Who cares?

LLMs are cool.
ChatGPT is cool.
Generative AI is cool.
But _search_?
That's so... last millennium!

Well, not quite.
You might have heard of this thing called "retrieval augmentation"?
That's just a fancy way of describing the technique of fetching pieces of content (e.g., paragraphs) from some external source (e.g., a collection of documents), and stuffing them into the prompt of an LLM to improve its generative capabilities.
How do we "fetch" those pieces of content?
Well, that's retrieval!
(You might have also heard about something called vector search? We'll cover exactly that later in this onboarding path.)
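
To make the "fetch and stuff" idea concrete, here is a minimal sketch in Python. It is not how Anserini, Pyserini, or Bing implement retrieval: the `retrieve` and `build_prompt` functions, the word-overlap scoring, the sample passages, and the prompt wording are all illustrative assumptions, and no LLM is actually called.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Return up to k passages ranked by a toy score: how often query terms occur.

    This word-overlap score is a stand-in for a real ranking function.
    """
    query_terms = set(tokenize(query))
    scored = []
    for passage in passages:
        counts = Counter(tokenize(passage))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, passage))
    # Stable sort: ties keep their original order in the collection.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop passages that share no terms with the query.
    return [passage for score, passage in scored[:k] if score > 0]


def build_prompt(query, retrieved):
    """Stuff the retrieved passages into a prompt (hypothetical template)."""
    context = "\n".join(f"- {passage}" for passage in retrieved)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# A tiny made-up "collection of documents".
passages = [
    "Anserini is a toolkit for reproducible information retrieval research.",
    "Maven is a build tool for Java projects.",
    "Pyserini exposes Anserini features through a Python interface.",
]

query = "What does the Pyserini Python interface expose?"
top = retrieve(query, passages, k=2)
prompt = build_prompt(query, top)
print(prompt)  # this string is what would be sent to an LLM
```

Everything upstream of `build_prompt` is the retrieval step. Swapping the toy scoring for a stronger ranking function changes which passages land in the prompt, and therefore what the LLM has to work with.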

In fact, retrieval augmentation is exactly how the new Bing search works.
You don't have to take my word: you can directly read the blog post on [building the new Bing](https://blogs.bing.com/search-quality-insights/february-2023/Building-the-New-Bing) and find the following diagram:

<img src="Prometheus-Model.png" width="500" />

Search comprises "internal queries" to fetch content ("Bing results") that is then fed into an LLM (i.e., stuffed into the prompt) to generate answers.
If you want more evidence, here's a [NeurIPS 2020 paper](https://arxiv.org/abs/2005.11401) that basically says the same thing.

Thus, retrieval forms the foundation of answer generation with LLMs.
In fact, it's critical to the quality of the output.
We all know the adage "[garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)", which highlights the importance of retrieval.
If the retrieval quality ain't good, the LLM output will be garbage.

How do we do retrieval effectively?
Well, that's why you should read on.
Later, we'll also see that transformers (the same neural network model that underlies LLMs) are a fundamental building block for converting content into representation vectors (called "embeddings"), which underlie vector search.

## Back to the Retrieval Problem

Hopefully, you're convinced that retrieval is important, or at least sufficiently so to read on.
Now, let's get back to the retrieval problem and try to unpack the definition a bit.

A **"query"** is a representation of an information need (i.e., the reason you're looking for information in the first place) that serves as the input to a retrieval system.
