Add monolingual regressions for CIRAL and whitespace analyzer bindings for ha and so
Mofetoluwa authored Nov 6, 2023
1 parent f053e81 commit d2fb8a5
Showing 16 changed files with 644 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -249,6 +249,7 @@ See individual pages for details!
+ Regressions for [CLEF 2006 Monolingual French](docs/regressions/regressions-clef06-fr.md)
+ Regressions for [TREC 2002 Monolingual Arabic](docs/regressions/regressions-trec02-ar.md)
+ Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions/regressions-fire12-hi.md), [Monolingual English](docs/regressions/regressions-fire12-en.md)
+ Regressions for CIRAL (v1.0) baselines: [Monolingual Hausa](docs/regressions/regressions-ciral-v1.0-ha.md), [Monolingual Somali](docs/regressions/regressions-ciral-v1.0-so.md), [Monolingual Swahili](docs/regressions/regressions-ciral-v1.0-sw.md), [Monolingual Yoruba](docs/regressions/regressions-ciral-v1.0-yo.md)

</details>
<details>
2 changes: 1 addition & 1 deletion docs/regressions.md
@@ -21,7 +21,7 @@ This means that anyone with the document collection should be able to reproduce
We hold this ideal in such high esteem and are so dedicated to reproducibility that if you discover a broken regression before we do, Jimmy Lin will buy you a beverage of choice (coffee, beer, etc.) at the next event you see him (e.g., SIGIR, TREC, etc.).

Here's how you can help:
- In the course of reproducing one of our results, please let us know you've been successful by sending a pull request with a simple note, like what appears at the bottom of [the regressions for Disks 4 &amp; 5 page](regressions-disk45.md).
+ In the course of reproducing one of our results, please let us know you've been successful by sending a pull request with a simple note, like what appears at the bottom of [the regressions for Disks 4 &amp; 5 page](regressions/regressions-disk45.md).
Since the regression documentation is auto-generated, pull requests should be sent against the [raw templates](../src/main/resources/docgen/templates).
In turn, you'll be recognized as a [contributor](https://github.com/castorini/anserini/graphs/contributors).

62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-ha.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Hausa

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Hausa](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-ha.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-ha.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-ha
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-hausa \
-index indexes/lucene-index.ciral-v1.0-ha/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language ha \
>& logs/log.ciral-hausa &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-ha/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-ha-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt \
-bm25 -hits 1000 -language ha &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2039 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.3153 |
| **R@100** | **BM25 (default)**|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2760 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-so.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Somali

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Somali](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-so.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-so.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-so
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-somali \
-index indexes/lucene-index.ciral-v1.0-so/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language so \
>& logs/log.ciral-somali &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-so/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-so-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt \
-bm25 -hits 1000 -language so &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1500 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.4000 |
| **R@100** | **BM25 (default)**|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1850 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-sw.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Swahili

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Swahili](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-sw.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-sw.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-sw
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-swahili \
-index indexes/lucene-index.ciral-v1.0-sw/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language sw \
>& logs/log.ciral-swahili &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-sw/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-sw-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt \
-bm25 -hits 1000 -language sw &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1812 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1681 |
| **R@100** | **BM25 (default)**|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.4742 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-yo.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Yoruba

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Yoruba](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-yo.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-yo.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-yo
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-yoruba \
-index indexes/lucene-index.ciral-v1.0-yo/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language yo \
>& logs/log.ciral-yoruba &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-yo/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-yo-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt \
-bm25 -hits 1000 -language yo &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2797 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.3833 |
| **R@100** | **BM25 (default)**|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.5114 |
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/index/IndexCollection.java
@@ -459,7 +459,7 @@ private Analyzer getAnalyzer() {
LOG.info("Using language-specific analyzer");
LOG.info("Language: " + args.language);
return AnalyzerMap.getLanguageSpecificAnalyzer(args.language);
- } else if (args.language.equals("sw") || args.language.equals("yo")) {
+ } else if (Arrays.asList("ha", "so", "sw", "yo").contains(args.language)) {
return new WhitespaceAnalyzer();
} else if (args.pretokenized) {
return new WhitespaceAnalyzer();
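For context on the change above: language codes that have no dedicated Lucene analyzer in Anserini's analyzer map fall through to plain whitespace tokenization, and this commit adds Hausa (`ha`) and Somali (`so`) to that list alongside Swahili (`sw`) and Yoruba (`yo`). The following is a minimal, self-contained sketch of the idea, not the actual Anserini code; the class and method names (`AnalyzerBindingSketch`, `pickAnalyzer`, `tokenize`) and the `StandardAnalyzer` default are illustrative assumptions, while `WhitespaceAnalyzer` and the token-stream calls are standard Lucene API.

```
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerBindingSketch {
  // Languages without a dedicated Lucene analyzer (Hausa, Somali, Swahili,
  // Yoruba) fall back to whitespace tokenization, as in the diff above.
  static Analyzer pickAnalyzer(String language) {
    if (Arrays.asList("ha", "so", "sw", "yo").contains(language)) {
      return new WhitespaceAnalyzer();
    }
    // Illustrative default only; Anserini's real code consults its AnalyzerMap
    // and other flags (e.g., -pretokenized) here.
    return new StandardAnalyzer();
  }

  // Runs a string through an analyzer and collects the emitted terms.
  static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    try (TokenStream stream = analyzer.tokenStream("contents", text)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        tokens.add(term.toString());
      }
      stream.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    // Whitespace analysis splits on whitespace only: no lowercasing, stemming,
    // or stopword removal, so surface forms are preserved as index terms.
    System.out.println(tokenize(pickAnalyzer("ha"), "Ina kwana? Lafiya lau."));
    // [Ina, kwana?, Lafiya, lau.]
  }
}
```

The practical effect is that documents and queries in these languages are matched on surface forms, with no language-specific normalization applied.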
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/SearchCollection.java
@@ -686,7 +686,7 @@ private Analyzer getAnalyzer() {
LOG.info("Using language-specific analyzer");
LOG.info("Language: " + args.language);
return AnalyzerMap.getLanguageSpecificAnalyzer(args.language);
- } else if (args.language.equals("sw") || args.language.equals("yo")) {
+ } else if (Arrays.asList("ha", "so", "sw", "yo").contains(args.language)) {
return new WhitespaceAnalyzer();
} else if (args.pretokenized) {
return new WhitespaceAnalyzer();
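The identical change in `SearchCollection` matters because BM25 only finds matches when query terms are produced the same way as index terms. As a hedged illustration (again a sketch rather than Anserini code; the class name and the Somali example query are assumptions), the same query tokenizes differently under the whitespace analyzer now used for `so` and under a generic analyzer that lowercases and strips punctuation:

```
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class QueryAnalyzerConsistencySketch {
  // Collects the terms an analyzer emits for a piece of text.
  static List<String> tokens(Analyzer analyzer, String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (TokenStream stream = analyzer.tokenStream("contents", text)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        out.add(term.toString());
      }
      stream.end();
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    String query = "Waa maxay caasimadda Soomaaliya?";
    // Matches how the 'so' index is built: split on whitespace only.
    System.out.println(tokens(new WhitespaceAnalyzer(), query));
    // [Waa, maxay, caasimadda, Soomaaliya?]
    // A mismatched query-time analyzer would lowercase and strip punctuation,
    // producing terms that do not line up with the whitespace-built index.
    System.out.println(tokens(new StandardAnalyzer(), query));
    // [waa, maxay, caasimadda, soomaaliya]
  }
}
```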
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/ciral-v1.0-ha.template
@@ -0,0 +1,43 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Hausa

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Hausa](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](${root_path}/docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/ciral-v1.0-so.template
@@ -0,0 +1,43 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Somali

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Somali](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](${root_path}/docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}