Commit
Shorten the install name for filter script & update docs to match
rlskoeser committed Oct 7, 2024
1 parent e86be21 commit 2af0805
Showing 3 changed files with 9 additions and 9 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -21,7 +21,7 @@ pip install git+https://github.com/Princeton-CDH/[email protected]#egg=corppa

### Scripts

-Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter-corpus` and `corppa-ocr` respectively.
+Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter` and `corppa-ocr` respectively.

#### Filtering PPA page-text corpus

@@ -30,7 +30,7 @@ The PPA page-level text corpus is shared as a json lines (`.jsonl`) file, which
To create a subset corpus with _all pages_ for a set of specific volumes, create a text file with a list of **PPA work identifiers**, one id per line, and then run the filter script with the input file, desired output file, and path to id file.

```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
+corppa-filter ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
```

> [!NOTE]
@@ -39,19 +39,19 @@ corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
To create a subset of _specific pages_ from specific volumes, create a CSV file that includes fields `work_id` and `page_num`, and pass that to the filter script with the `--pg-file` option:

```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv
+corppa-filter ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv
```

You can filter a page corpus to exclude or include pages based on exact-matches for attributes included in the jsonl data. For example, to get all pages with the original page number roman numeral 'i':

```sh
-corppa-filter-corpus ppa_pages.jsonl i_pages.jsonl --include label=i
+corppa-filter ppa_pages.jsonl i_pages.jsonl --include label=i
```

Filters can also be combined; for example, to get the original page 10 for every volume from a list, you could specify a list of ids and the `--include` filter:

```sh
-corppa-filter-corpus ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10
+corppa-filter ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10
```
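
For readers who want to see what these README filters amount to, here is a minimal Python sketch of the same idea. It is not the code in `corppa.utils.filter`: the function name `filter_pages` is hypothetical, and the sketch assumes each JSON line is an object carrying `work_id` and `label` keys, as the README examples suggest.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_pages(
    lines: Iterable[str],
    work_ids: Optional[Set[str]] = None,
    include: Optional[Dict[str, str]] = None,
) -> Iterator[dict]:
    """Yield page records that pass every filter that was supplied.

    Hypothetical helper for illustration only; assumes one JSON object
    per line with `work_id` and other attributes such as `label`.
    """
    include = include or {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        # Volume filter: keep only pages whose work id is in the set.
        if work_ids is not None and page.get("work_id") not in work_ids:
            continue
        # Exact-match filter: every key must be present and equal as a string.
        if any(key not in page or str(page[key]) != value
               for key, value in include.items()):
            continue
        yield page
```

Passing both `work_ids` and `include` mirrors the combined `--idfile` plus `--include label=10` example: a page is kept only when it satisfies both conditions.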


2 changes: 1 addition & 1 deletion pyproject.toml
@@ -35,7 +35,7 @@ ocr = ["google-cloud-vision"]
dev = ["pre-commit", "corppa[test]", "corppa[ocr]"]

[project.scripts]
-corppa-filter-corpus = "corppa.utils.filter:main"
+corppa-filter = "corppa.utils.filter:main"
corppa-ocr = "corppa.ocr.gvision_ocr:main"

[tool.hatch.version]
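
The `[project.scripts]` table maps a command name to a `module:attribute` reference, so this rename changes only the command name and leaves the target function untouched. As a rough illustration of how such a reference is resolved to a callable (a sketch only, not what pip or the build backend literally runs; `resolve_entry_point` is a hypothetical helper):

```python
from importlib import import_module


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:object.attr' reference to a Python object.

    Hypothetical helper for illustration; installers generate their own
    wrapper scripts that perform an equivalent import and lookup.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = import_module(module_name)
    # Walk dotted attributes after the colon, if any were given.
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj
```

With `corppa` installed, `resolve_entry_point("corppa.utils.filter:main")` would return the same `main` function regardless of whether the command is named `corppa-filter-corpus` or `corppa-filter`.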
6 changes: 3 additions & 3 deletions src/corppa/utils/filter.py
@@ -20,11 +20,11 @@
Example command line usages:
```
-corppa-filter-corpus path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --idfile my_ids.txt
+corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --idfile my_ids.txt
```
```
-corppa-filter-corpus path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --pg-file pages.csv --include key=value
+corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --pg-file pages.csv --include key=value
```
"""

@@ -220,7 +220,7 @@ def __call__(self, parser, args, values, option_string=None):

def main():
"""Command-line access to filtering the corpus. Available as
-`corppa-filter-corpus` when this package is installed with pip."""
+`corppa-filter` when this package is installed with pip."""

parser = argparse.ArgumentParser(
description="Filters PPA full-text corpus",
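
The hunk is cut off here, but the `__call__(self, parser, args, values, option_string=None)` signature in the hunk header suggests a custom `argparse.Action`. A minimal sketch of how a `key=value` option such as `--include` could be collected that way follows. Everything in it is an assumption for illustration: `KeyValueAction`, `build_parser`, and the argument definitions are hypothetical and are not taken from `filter.py`.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Collect repeated key=value arguments into a dict (illustrative only)."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Copy so the shared default dict is never mutated.
        pairs = dict(getattr(namespace, self.dest, None) or {})
        for item in values:
            key, sep, value = item.partition("=")
            if not sep or not key:
                parser.error(f"{option_string} expects key=value, got {item!r}")
            pairs[key] = value
        setattr(namespace, self.dest, pairs)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser shaped after the README examples.
    parser = argparse.ArgumentParser(
        prog="corppa-filter",
        description="Filters PPA full-text corpus",
    )
    parser.add_argument("input", help="path to the page-level .jsonl corpus")
    parser.add_argument("output", help="path for the filtered .jsonl output")
    parser.add_argument("--idfile", help="text file of work ids, one per line")
    parser.add_argument("--pg-file", help="CSV with work_id and page_num fields")
    parser.add_argument("--include", nargs="+", action=KeyValueAction, default={})
    return parser
```

Under a parser defined like this, `--pg-file` is accepted and `--pg_file` is rejected as an unrecognized argument, since argparse treats the two spellings as different option strings. That is worth checking against the real parser, because the README example in this commit spells the flag `--pg_file` while the surrounding prose and the module docstring use `--pg-file`.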