Shorten the install name for filter script & update docs to match
Showing 3 changed files with 9 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,7 +21,7 @@ pip install git+https://github.com/Princeton-CDH/[email protected]#egg=corppa | |
|
||
### Scripts | ||
|
||
-Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter-corpus` and `corppa-ocr` respectively. | ||
+Installing `corppa` currently provides access to two command line scripts, for filtering a PPA page-level corpus or for generating OCR text for images using Google Vision API. These can be run as `corppa-filter` and `corppa-ocr` respectively. | ||
|
||
#### Filtering PPA page-text corpus | ||
|
||
|
@@ -30,7 +30,7 @@ The PPA page-level text corpus is shared as a json lines (`.jsonl`) file, which | |
To create a subset corpus with _all pages_ for a set of specific volumes, create a text file with a list of **PPA work identifiers**, one id per line, and then run the filter script with the input file, desired output file, and path to id file. | ||
|
||
```sh | ||
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt | ||
+corppa-filter ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt | ||
``` | ||
|
||
> [!NOTE] | ||
|
@@ -39,19 +39,19 @@ corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt | |
To create a subset of _specific pages_ from specific volumes, create a CSV file that includes fields `work_id` and `page_num`, and pass that to the filter script with the `--pg-file` option: | ||
|
||
```sh | ||
-corppa-filter-corpus ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv | ||
+corppa-filter ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv | ||
``` | ||
|
||
You can filter a page corpus to exclude or include pages based on exact-matches for attributes included in the jsonl data. For example, to get all pages with the original page number roman numeral 'i': | ||
|
||
```sh | ||
-corppa-filter-corpus ppa_pages.jsonl i_pages.jsonl --include label=i | ||
+corppa-filter ppa_pages.jsonl i_pages.jsonl --include label=i | ||
``` | ||
|
||
Filters can also be combined; for example, to get the original page 10 for every volume from a list, you could specify a list of ids and the `--include` filter: | ||
|
||
```sh | ||
-corppa-filter-corpus ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10 | ||
+corppa-filter ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10 | ||
``` | ||
|
||
|
||
|
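
For readers checking the renamed commands against the README text in this diff, the filtering it describes (an id list, a work/page CSV, and exact-match `--include` attributes, all combinable) can be sketched roughly as below. This is an illustrative sketch only, not the `corppa` implementation: the function names are hypothetical, and it assumes each JSONL record carries `work_id`, `page_num`, and `label` keys, which the diff implies but does not confirm.

```python
import csv
import io
import json


def load_ids(idfile_text):
    """Collect work ids from id-file text, one id per non-empty line."""
    return {line.strip() for line in idfile_text.splitlines() if line.strip()}


def load_page_pairs(csv_text):
    """Collect (work_id, page_num) pairs from CSV text with those two fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["work_id"], row["page_num"]) for row in reader}


def filter_pages(lines, work_ids=None, page_pairs=None, include=None):
    """Yield page records (dicts) that pass every filter that was supplied.

    Hypothetical helper: the record keys used here are assumptions.
    """
    include = include or {}
    for line in lines:
        if not line.strip():
            continue
        page = json.loads(line)
        # Keep only pages from the listed works, if an id list was given.
        if work_ids is not None and page.get("work_id") not in work_ids:
            continue
        # Keep only listed (work, page) pairs; CSV values are strings.
        if page_pairs is not None:
            key = (page.get("work_id"), str(page.get("page_num")))
            if key not in page_pairs:
                continue
        # Exact-match attribute filters, compared as strings.
        if any(k not in page or str(page[k]) != v for k, v in include.items()):
            continue
        yield page


# Small made-up sample standing in for a page-level JSONL file.
SAMPLE_LINES = [
    '{"work_id": "A", "page_num": 1, "label": "i"}',
    '{"work_id": "A", "page_num": 12, "label": "10"}',
    '{"work_id": "B", "page_num": 10, "label": "10"}',
    '{"work_id": "C", "page_num": 3, "label": "i"}',
]

if __name__ == "__main__":
    ids = load_ids("A\nB\n")
    for page in filter_pages(SAMPLE_LINES, work_ids=ids, include={"label": "10"}):
        print(json.dumps(page))
```

Supplying both `work_ids` and `include` mirrors the combined `--idfile ... --include label=10` example in the diff: a page is kept only when it passes every filter that was given.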