Shorten the install name for filter script & update docs to match

Princeton-CDH · Oct 7, 2024 · 2af0805
1 parent e86be21
commit 2af0805
Showing 3 changed files with 9 additions and 9 deletions.
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@ pip install git+https://github.com/Princeton-CDH/[email protected]#egg=corppa
 
 ### Scripts
 
-Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter-corpus` and `corppa-ocr` respectively.
+Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter` and `corppa-ocr` respectively.
 
 #### Filtering PPA page-text corpus
 
@@ -30,7 +30,7 @@ The PPA page-level text corpus is shared as a json lines (`.jsonl`) file, which
 To create a subset corpus with _all pages_ for a set of specific volumes, create a text file with a list of **PPA work identifiers**, one id per line, and then run the filter script with the input file, desired output file, and path to id file.
 
 ```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
+corppa-filter ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
 ```
 
 > [!NOTE]
@@ -39,19 +39,19 @@ corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
 To create a subset of _specific pages_ from specific volumes, create a CSV file that includes fields `work_id` and `page_num`, and pass that to the filter script with the `--pg-file` option:
 
 ```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv
+corppa-filter ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv
 ```
 
 You can filter a page corpus to exclude or include pages based on exact-matches for attributes included in the jsonl data. For example, to get all pages with the original page number roman numeral 'i':
 
 ```sh
-corppa-filter-corpus ppa_pages.jsonl i_pages.jsonl --include label=i
+corppa-filter ppa_pages.jsonl i_pages.jsonl --include label=i
 ```
 
 Filters can also be combined; for example, to get the original page 10 for every volume from a list, you could specify a list of ids and the `--include` filter:
 
 ```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10
+corppa-filter ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10
 ```
 
 

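For readers who want to see what the combined `--idfile` plus `--include label=10` example in the README amounts to, here is a minimal Python sketch of the same filtering logic. It is not the `corppa.utils.filter` implementation: the function name `filter_pages` is hypothetical, and it assumes each JSON Lines record carries `work_id` and `label` keys, which the README implies but this diff does not show.

```python
import io
import json


def filter_pages(lines, work_ids=None, include=None):
    """Yield page records that match an optional id set and exact-match filters.

    Hypothetical helper for illustration only; field names are assumptions.
    """
    include = include or {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSON Lines input
        page = json.loads(line)
        # Keep only pages from the requested works, if an id set was given.
        if work_ids is not None and page.get("work_id") not in work_ids:
            continue
        # Every include filter must match exactly (values compared as strings).
        if any(key not in page or str(page[key]) != value
               for key, value in include.items()):
            continue
        yield page


# Small in-memory stand-in for a page-level corpus file.
sample = io.StringIO(
    '{"work_id": "A", "label": "10"}\n'
    '{"work_id": "A", "label": "i"}\n'
    '{"work_id": "B", "label": "10"}\n'
    '{"work_id": "C", "label": "10"}\n'
)

subset = list(filter_pages(sample, work_ids={"A", "B"}, include={"label": "10"}))
```

Streaming the input line by line keeps memory use flat, which matters for a page-level corpus; whether the real script does the same is not confirmed by this commit.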
diff --git a/pyproject.toml b/pyproject.toml
@@ -35,7 +35,7 @@ ocr = ["google-cloud-vision"]
 dev = ["pre-commit", "corppa[test]", "corppa[ocr]"]
 
 [project.scripts]
-corppa-filter-corpus = "corppa.utils.filter:main"
+corppa-filter = "corppa.utils.filter:main"
 corppa-ocr = "corppa.ocr.gvision_ocr:main"
 
 [tool.hatch.version]

diff --git a/src/corppa/utils/filter.py b/src/corppa/utils/filter.py
@@ -20,11 +20,11 @@
 
 Example command line usages:
 ```
-corppa-filter-corpus path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --idfile my_ids.txt
+corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --idfile my_ids.txt
 ```
 
 ```
-corppa-filter-corpus path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --pg-file pages.csv --include key=value
+corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --pg-file pages.csv --include key=value
 ```
 """
 
@@ -220,7 +220,7 @@ def __call__(self, parser, args, values, option_string=None):
 
 def main():
     """Command-line access to filtering the corpus. Available as
-    `corppa-filter-corpus` when this package is installed with pip."""
+    `corppa-filter` when this package is installed with pip."""
 
     parser = argparse.ArgumentParser(
         description="Filters PPA full-text corpus",