Add monolingual regressions for CIRAL and whitespace analyzer bindings for ha and so
Mofetoluwa authored Nov 6, 2023
1 parent f053e81 commit d2fb8a5
Showing 16 changed files with 644 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -249,6 +249,7 @@ See individual pages for details!
+ Regressions for [CLEF 2006 Monolingual French](docs/regressions/regressions-clef06-fr.md)
+ Regressions for [TREC 2002 Monolingual Arabic](docs/regressions/regressions-trec02-ar.md)
+ Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions/regressions-fire12-hi.md), [Monolingual English](docs/regressions/regressions-fire12-en.md)
+ Regressions for CIRAL (v1.0) baselines: [Monolingual Hausa](docs/regressions/regressions-ciral-v1.0-ha.md), [Monolingual Somali](docs/regressions/regressions-ciral-v1.0-so.md), [Monolingual Swahili](docs/regressions/regressions-ciral-v1.0-sw.md), [Monolingual Yoruba](docs/regressions/regressions-ciral-v1.0-yo.md)

</details>
<details>
2 changes: 1 addition & 1 deletion docs/regressions.md
@@ -21,7 +21,7 @@ This means that anyone with the document collection should be able to reproduce
We hold this ideal in such high esteem and are so dedicated to reproducibility that if you discover a broken regression before we do, Jimmy Lin will buy you a beverage of choice (coffee, beer, etc.) at the next event you see him (e.g., SIGIR, TREC, etc.).

Here's how you can help:
- In the course of reproducing one of our results, please let us know you've been successful by sending a pull request with a simple note, like what appears at the bottom of [the regressions for Disks 4 &amp; 5 page](regressions-disk45.md).
+ In the course of reproducing one of our results, please let us know you've been successful by sending a pull request with a simple note, like what appears at the bottom of [the regressions for Disks 4 &amp; 5 page](regressions/regressions-disk45.md).
Since the regression documentation is auto-generated, pull requests should be sent against the [raw templates](../src/main/resources/docgen/templates).
In turn, you'll be recognized as a [contributor](https://github.com/castorini/anserini/graphs/contributors).

62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-ha.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Hausa

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Hausa](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-ha.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-ha.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-ha
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-hausa \
-index indexes/lucene-index.ciral-v1.0-ha/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language ha \
>& logs/log.ciral-hausa &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-ha/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-ha-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt \
-bm25 -hits 1000 -language ha &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-ha-dev.tsv runs/run.ciral-hausa.bm25-default.topics.ciral-v1.0-ha-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2039 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.3153 |
| **R@100** | **BM25 (default)**|
| [CIRAL Hausa: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2760 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-so.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Somali

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Somali](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-so.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-so.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-so
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-somali \
-index indexes/lucene-index.ciral-v1.0-so/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language so \
>& logs/log.ciral-somali &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-so/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-so-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt \
-bm25 -hits 1000 -language so &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-so-dev.tsv runs/run.ciral-somali.bm25-default.topics.ciral-v1.0-so-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1500 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.4000 |
| **R@100** | **BM25 (default)**|
| [CIRAL Somali: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1850 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-sw.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Swahili

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Swahili](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-sw.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-sw.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-sw
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-swahili \
-index indexes/lucene-index.ciral-v1.0-sw/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language sw \
>& logs/log.ciral-swahili &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-sw/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-sw-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt \
-bm25 -hits 1000 -language sw &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-sw-dev.tsv runs/run.ciral-swahili.bm25-default.topics.ciral-v1.0-sw-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1812 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.1681 |
| **R@100** | **BM25 (default)**|
| [CIRAL Swahili: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.4742 |
62 changes: 62 additions & 0 deletions docs/regressions/regressions-ciral-v1.0-yo.md
@@ -0,0 +1,62 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Yoruba

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Yoruba](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/ciral-v1.0-yo.yaml).
Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/ciral-v1.0-yo.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ciral-v1.0-yo
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection MrTyDiCollection \
-input /path/to/ciral-yoruba \
-index indexes/lucene-index.ciral-v1.0-yo/ \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw -language yo \
>& logs/log.ciral-yoruba &
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.ciral-v1.0-yo/ \
-topics tools/topics-and-qrels/topics.ciral-v1.0-yo-dev-native.tsv \
-topicreader TsvInt \
-output runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt \
-bm25 -hits 1000 -language yo &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.20 tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.ciral-v1.0-yo-dev.tsv runs/run.ciral-yoruba.bm25-default.topics.ciral-v1.0-yo-dev-native.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| **nDCG@20** | **BM25 (default)**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.2797 |
| **MRR@10** | **BM25 (default)**|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.3833 |
| **R@100** | **BM25 (default)**|
| [CIRAL Yoruba: Dev](https://huggingface.co/datasets/CIRAL/ciral) | 0.5114 |
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/index/IndexCollection.java
@@ -459,7 +459,7 @@ private Analyzer getAnalyzer() {
LOG.info("Using language-specific analyzer");
LOG.info("Language: " + args.language);
return AnalyzerMap.getLanguageSpecificAnalyzer(args.language);
- } else if (args.language.equals("sw") || args.language.equals("yo")) {
+ } else if (Arrays.asList("ha", "so", "sw", "yo").contains(args.language)) {
return new WhitespaceAnalyzer();
} else if (args.pretokenized) {
return new WhitespaceAnalyzer();
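For context on the change above: language codes that have no dedicated Lucene analyzer in Anserini's analyzer map fall through to plain whitespace tokenization, and this commit adds Hausa (`ha`) and Somali (`so`) to that list alongside Swahili (`sw`) and Yoruba (`yo`). The following is a minimal, self-contained sketch of the idea, not the actual Anserini code; the class and method names (`AnalyzerBindingSketch`, `pickAnalyzer`, `tokenize`) and the `StandardAnalyzer` default are illustrative assumptions, while `WhitespaceAnalyzer` and the token-stream calls are standard Lucene API.

```
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerBindingSketch {
  // Languages without a dedicated Lucene analyzer (Hausa, Somali, Swahili,
  // Yoruba) fall back to whitespace tokenization, as in the diff above.
  static Analyzer pickAnalyzer(String language) {
    if (Arrays.asList("ha", "so", "sw", "yo").contains(language)) {
      return new WhitespaceAnalyzer();
    }
    // Illustrative default only; Anserini's real code consults its AnalyzerMap
    // and other flags (e.g., -pretokenized) here.
    return new StandardAnalyzer();
  }

  // Runs a string through an analyzer and collects the emitted terms.
  static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    try (TokenStream stream = analyzer.tokenStream("contents", text)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        tokens.add(term.toString());
      }
      stream.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    // Whitespace analysis splits on whitespace only: no lowercasing, stemming,
    // or stopword removal, so surface forms are preserved as index terms.
    System.out.println(tokenize(pickAnalyzer("ha"), "Ina kwana? Lafiya lau."));
    // [Ina, kwana?, Lafiya, lau.]
  }
}
```

The practical effect is that documents and queries in these languages are matched on surface forms, with no language-specific normalization applied.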
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/SearchCollection.java
@@ -686,7 +686,7 @@ private Analyzer getAnalyzer() {
LOG.info("Using language-specific analyzer");
LOG.info("Language: " + args.language);
return AnalyzerMap.getLanguageSpecificAnalyzer(args.language);
- } else if (args.language.equals("sw") || args.language.equals("yo")) {
+ } else if (Arrays.asList("ha", "so", "sw", "yo").contains(args.language)) {
return new WhitespaceAnalyzer();
} else if (args.pretokenized) {
return new WhitespaceAnalyzer();
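The identical change in `SearchCollection` matters because BM25 only finds matches when query terms are produced the same way as index terms. As a hedged illustration (again a sketch rather than Anserini code; the class name and the Somali example query are assumptions), the same query tokenizes differently under the whitespace analyzer now used for `so` and under a generic analyzer that lowercases and strips punctuation:

```
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class QueryAnalyzerConsistencySketch {
  // Collects the terms an analyzer emits for a piece of text.
  static List<String> tokens(Analyzer analyzer, String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (TokenStream stream = analyzer.tokenStream("contents", text)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        out.add(term.toString());
      }
      stream.end();
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    String query = "Waa maxay caasimadda Soomaaliya?";
    // Matches how the 'so' index is built: split on whitespace only.
    System.out.println(tokens(new WhitespaceAnalyzer(), query));
    // [Waa, maxay, caasimadda, Soomaaliya?]
    // A mismatched query-time analyzer would lowercase and strip punctuation,
    // producing terms that do not line up with the whitespace-built index.
    System.out.println(tokens(new StandardAnalyzer(), query));
    // [waa, maxay, caasimadda, soomaaliya]
  }
}
```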
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/ciral-v1.0-ha.template
@@ -0,0 +1,43 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Hausa

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Hausa](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](${root_path}/docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/ciral-v1.0-so.template
@@ -0,0 +1,43 @@
# Anserini Regressions: CIRAL (v1.0) &mdash; Somali

This page documents BM25 monolingual regression experiments for [CIRAL (v1.0) &mdash; Somali](https://github.com/ciralproject/ciral).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/ciralproject/ciral) for more details about the CIRAL corpus.
For additional details, see the explanation of [common indexing options](${root_path}/docs/common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}