Merge branch 'release/0.2'
laurejt committed Oct 7, 2024
2 parents 6b498c5 + 2c150b1 commit 9279ecd
Showing 25 changed files with 2,207 additions and 139 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -4,5 +4,5 @@ repos:
rev: v0.3.4
hooks:
- id: ruff
-args: [ --fix, --exit-non-zero-on-fix ]
+args: [ --select, I, --fix, --exit-non-zero-on-fix ]
- id: ruff-format
2 changes: 1 addition & 1 deletion .python-version
@@ -1 +1 @@
-3.10
+3.12
20 changes: 19 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,26 @@
# CHANGELOG

## 0.2.0
- Now requires Python 3.12
### Corppa Utilities
- Basic readme documentation for the filter script
- New script for OCR with Google Vision
- Updated filter script:
    - Uses PPA work ids instead of source ids
    - Additional filtering by volume and page
    - Additional filtering by including or excluding key-value pairs
- New utility function for working with PPA corpus file paths
- New script for generating a PPA page subset to be used in conjunction with the filter script
- New script for adding image relative paths to a PPA text corpus
### Poetry Detection
- New Prodigy recipes and custom CSS for image and text annotation
- Script to add PPA work-level metadata for display in Prodigy
### Misc
- Ruff pre-commit hook now configured to autofix import order


## 0.1.0
- Utility to filter the full text corpus by source ID
- Experimental Scripts
    - OCR evaluation
    - Character-level statistics

65 changes: 61 additions & 4 deletions README.md
@@ -1,25 +1,82 @@
-# `corppa` PPA full-text corpus utilities
+# corppa

This repository contains research software developed as part of the [Ends of Prosody](https://cdh.princeton.edu/projects/the-ends-of-prosody/) project, which is associated with the [Princeton Prosody Archive](https://prosody.princeton.edu/) (PPA). This software is particularly focused on research and work related to PPA full-text and page image corpora.

> [!WARNING]
> This code is primarily for internal team use. Some portions of it may eventually be useful for participants of the [Ends of Prosody conference](https://cdh.princeton.edu/events/the-ends-of-prosody/) or be adapted or used elsewhere.
## Basic Usage

### Installation

Use pip to install `corppa` as a python package directly from GitHub. Add a branch or tag name, e.g. `@develop` or `@0.1`, if you need to install a specific version.

```sh
pip install git+https://github.com/Princeton-CDH/ppa-nlp.git#egg=corppa
```
or
```sh
pip install git+https://github.com/Princeton-CDH/[email protected]#egg=corppa
```

### Scripts

Installing `corppa` currently provides access to two command-line scripts: one for filtering a PPA page-level corpus and one for generating OCR text for page images using the Google Vision API. These are run as `corppa-filter` and `corppa-ocr`, respectively.
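
Both are standard command-line tools; assuming the usual argparse behavior (not shown in this diff), each lists its options with `--help`:

```sh
corppa-filter --help
corppa-ocr --help
```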

#### Filtering PPA page-text corpus

The PPA page-level text corpus is shared as a JSON Lines (`.jsonl`) file, which may or may not be compressed (e.g., `.jsonl.gz`). It's often useful to filter the full corpus to a subset of pages for a specific task, e.g., to analyze content from specific volumes or to select particular pages for annotation.
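
Each line of the corpus is a single page record. As a rough sketch of the shape (illustrative values only; `work_id` and `label` appear in the examples below, and real records contain additional fields):

```
{"work_id": "mdp.39015003633594", "label": "i", "text": "..."}
```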

To create a subset corpus with _all pages_ for a set of specific volumes, create a text file with a list of **PPA work identifiers**, one id per line, and then run the filter script with the input file, desired output file, and path to id file.

```sh
corppa-filter ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt
```

> [!NOTE]
> **PPA work identifiers** are based on source identifiers, i.e., the identifier from the original source (HathiTrust, Gale/ECCO, EEBO-TCP). In most cases the work identifier and the source identifier are the same, but _if you are working with any excerpted content the work id is NOT the same as the source identifier_. Excerpt ids are based on the combination of source identifier and the first original page included in the excerpt. In some cases PPA contains multiple excerpts from the same source, so this provides guaranteed unique work ids.
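
For example, `my_ids.txt` might contain (illustrative ids borrowed from identifier examples elsewhere in this repository: one HathiTrust, one Gale):

```
mdp.39015003633594
CB0127060085
```
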
To create a subset of _specific pages_ from specific volumes, create a CSV file that includes the fields `work_id` and `page_num`, and pass that file to the filter script with the `--pg_file` option:

```sh
corppa-filter ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv
```
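
For reference, a minimal sketch of such a pages file (hypothetical values):

```
work_id,page_num
mdp.39015003633594,10
CB0127060085,23
```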

You can filter a page corpus to exclude or include pages based on exact matches for attributes included in the jsonl data. For example, to get all pages whose original page number is the roman numeral 'i':

```sh
corppa-filter ppa_pages.jsonl i_pages.jsonl --include label=i
```
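
Exclusion should work the same way; this sketch assumes an `--exclude` option that mirrors `--include`, per the changelog's include/exclude filtering:

```sh
corppa-filter ppa_pages.jsonl non_i_pages.jsonl --exclude label=i
```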

Filters can also be combined; for example, to get original page 10 from every volume in a list, specify both an id file and an `--include` filter:

```sh
corppa-filter ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10
```

-This repository provides code and other resources associated with the [Princeton Prosody Archive](https://prosody.princeton.edu/) (PPA), with a particular focus on working with the PPA full-text corpus.

## Development instructions

This repo uses [git-flow](https://github.com/nvie/gitflow) branching conventions; **main** contains the most recent release, and work in progress will be on the **develop** branch. Pull requests for new features should be made against develop.

### Developer setup and installation

-- **Recommended:** create a python virtual environment with your tool of choice (virtualenv, conda, etc); use python 3.10 or higher
+- **Recommended:** create a python virtual environment with your tool of choice (virtualenv, conda, etc); use python 3.12 or higher

- Install the local checked out version of this package in editable mode (`-e`), including all python dependencies and optional dependencies for development and testing:
```sh
pip install -e ".[dev]"
```

- This repository uses [pre-commit](https://pre-commit.com/) for python code linting and consistent formatting. Run this command to initialize and install pre-commit hooks:

```sh
pre-commit install
```

## Experimental Scripts

Experimental scripts associated with `corppa` are located within the `scripts` directory.
See this directory's README for more detail.
28 changes: 16 additions & 12 deletions pyproject.toml
@@ -5,15 +5,13 @@ build-backend = "hatchling.build"
[project]
name = "corppa"
description = "Utilities for working with Princeton Prosody Archive full-text corpus"
-requires-python = ">=3.10"
+requires-python = ">=3.12"
readme = "README.md"
# license TBD
#license.file = "LICENSE"
#license = {text = "Apache-2"}
classifiers = [
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Development Status :: 3 - Alpha",
"Intended Audience :: Science/Research",
@@ -22,21 +20,27 @@ classifiers = [
"Topic :: Text Processing",
"Topic :: Utilities",
]
-dynamic = ["version"]
dependencies = [
"orjsonl",
"tqdm"
"tqdm",
]
dynamic = ["version"]

-[project.scripts]
-corppa-filter-corpus = "corppa.utils.filter:main"

-[tool.hatch.version]
-path = "src/corppa/__init__.py"

[project.optional-dependencies]
test = [
"pytest",
"pytest-cov"
]
-dev = ["pre-commit", "corppa[test]"]
+ocr = ["google-cloud-vision"]
+dev = ["pre-commit", "corppa[test]", "corppa[ocr]"]

+[project.scripts]
+corppa-filter = "corppa.utils.filter:main"
+corppa-ocr = "corppa.ocr.gvision_ocr:main"

+[tool.hatch.version]
+path = "src/corppa/__init__.py"

[tool.ruff]
# configure src path so ruff import fixes can identify local imports
src = ["src"]
39 changes: 39 additions & 0 deletions requirements.lock
@@ -0,0 +1,39 @@
cachetools==5.5.0
certifi==2024.8.30
cfgv==3.4.0
charset-normalizer==3.3.2
-e git+ssh://[email protected]/Princeton-CDH/ppa-nlp.git@30734c57bdf3e9ae63d04bc2e2585aede4b6d751#egg=corppa
coverage==7.6.1
distlib==0.3.8
filelock==3.15.4
google-api-core==2.19.2
google-auth==2.34.0
google-cloud-vision==3.7.4
googleapis-common-protos==1.65.0
grpcio==1.66.1
grpcio-status==1.66.1
identify==2.6.0
idna==3.8
iniconfig==2.0.0
nodeenv==1.9.1
orjson==3.10.7
orjsonl==1.0.0
packaging==24.1
platformdirs==4.2.2
pluggy==1.5.0
pre-commit==3.8.0
proto-plus==1.24.0
protobuf==5.28.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pytest==8.3.2
pytest-cov==5.0.0
PyYAML==6.0.2
requests==2.32.3
rsa==4.9
setuptools==72.1.0
tqdm==4.66.5
urllib3==2.2.2
virtualenv==20.26.3
wheel==0.43.0
xopen==2.0.2
3 changes: 3 additions & 0 deletions scripts/README.md
@@ -34,3 +34,6 @@ This module contains general-purpose auxiliary methods.

#### `ocr_helper.py`
This module contains OCR-related auxiliary methods.

### `transform-images.sh`
This bash script copies and transforms the images referenced by a PPA (sub)corpus (jsonl).
10 changes: 5 additions & 5 deletions scripts/evaluate_ocr.py
@@ -1,13 +1,13 @@
+import csv
import os
import sys
-import spacy
-import csv
-import orjsonl

-from xopen import xopen
-from tqdm import tqdm
+import orjsonl
+import spacy
from lingua import LanguageDetectorBuilder
from ocr_helper import clean_chars
+from tqdm import tqdm
+from xopen import xopen


class OCREvaluator:
11 changes: 5 additions & 6 deletions scripts/get_character_stats.py
@@ -4,17 +4,16 @@
env: ppa-ocr
"""

-import sys
-import os.path
import csv
+import os.path
+import sys
import unicodedata
+from collections import Counter

import orjsonl
-from collections import Counter
-from xopen import xopen
-from tqdm import tqdm
from ocr_helper import clean_chars

+from tqdm import tqdm
+from xopen import xopen

__cc_names = {
"\n": "Cc: LINE FEED",
20 changes: 20 additions & 0 deletions scripts/helper.py
@@ -8,6 +8,26 @@
_htid_decode_table = str.maketrans(_htid_decode_map)


def get_stub_dir(source, vol_id):
"""
Returns the stub directory for the specified volume (vol_id) and
source type (source)
For Gale, every third number (excluding the leading 0) of the volume
identifier is used.
Ex. CB0127060085 --> 100
For HathiTrust, the library portion of the volume identifier is used.
Ex. mdp.39015003633594 --> mdp
"""
if source == "Gale":
return vol_id[::3][1:]
elif source == "HathiTrust":
return vol_id.split(".", maxsplit=1)[0]
else:
raise ValueError(f"Unknown source '{source}'")
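
# Usage sketch (expected values taken from the docstring above):
#   get_stub_dir("Gale", "CB0127060085") -> "100"
#   get_stub_dir("HathiTrust", "mdp.39015003633594") -> "mdp"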


def encode_htid(htid):
"""
Returns the "clean" version of a HathiTrust volume identifier with the form:
1 change: 0 additions & 1 deletion scripts/ocr_helper.py
@@ -4,7 +4,6 @@

import ftfy


_char_conversion_map = {"ſ": "s"}
_char_translation_table = str.maketrans(_char_conversion_map)

64 changes: 64 additions & 0 deletions scripts/transform-images.sh
@@ -0,0 +1,64 @@
#! /bin/sh

# For the images specified in the input jsonl, copy and transform images
# from the input directory to the output directory, according to the
# mode specified.
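#
# Usage sketch (hypothetical paths):
#   sh scripts/transform-images.sh copy my_subset.jsonl /path/to/images /path/to/output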

mode=$1
in_jsonl=$2
in_dir=$3
out_dir=$4


# Arg validation
if [ $# -ne 4 ]; then
echo "Usage: [mode] [jsonl] [in dir] [out dir]"
exit 1
fi
# Check that jsonl file exists
if [ ! -f "$in_jsonl" ]; then
echo "ERROR: File $in_jsonl does not exist!"
exit 1
fi
# Check input dir exists
if [ ! -d "$in_dir" ]; then
echo "ERROR: Directory $in_dir does not exist!"
exit 1
fi
# Check output dir exists
if [ ! -d "$out_dir" ]; then
echo "ERROR: Directory $out_dir does not exist!"
exit 1
fi
# Check that the mode is valid
if [ "$mode" != "copy" ]; then
echo "ERROR: Invalid mode '$mode'"
exit 1
fi

for path_str in `jq ".image_path" "$in_jsonl"`; do
# strip double quotes
img_path=${path_str#'"'}
img_path=${img_path%'"'}
echo "$img_path"

# Check image exists
in_path="$in_dir/$img_path"
if [ ! -f "$in_path" ]; then
"WARNING: Image $in_path does not exist!"
fi

out_path="$out_dir/$img_path"
out_subdir=`dirname "$out_path"`
if [ ! -d "$out_subdir" ]; then
mkdir -p "$out_subdir"
fi

if [ "$mode" = "copy" ]; then
# For now just make copies
cp "$in_path" "$out_path"
else
echo "ERROR: Unkown mode '$mode'"
exit 1
fi
done
2 changes: 1 addition & 1 deletion src/corppa/__init__.py
@@ -1 +1 @@
__version__ = "0.1.0"
__version__ = "0.2.0"