Don't set court='scotus' for South Carolina citations #84 #105

Open · wants to merge 9 commits into main
13 changes: 13 additions & 0 deletions .git-blame-ignore-revs
@@ -0,0 +1,13 @@
# This file lists commits that changed large sections of the code and are best
# ignored by git blame (which tools like PyCharm use for their "annotate"
# feature).
#
# To use this file, go to the root of this project, and run:
#
# git config blame.ignoreRevsFile .git-blame-ignore-revs
#
# That'll tell git to use this file. For this to work, you need Git 2.23.0
# (released late 2019) or later.

# Run pre-commit
1fed0e1afb9f92b3704b49f1bc46d54a03d0e68d
6 changes: 6 additions & 0 deletions .github/workflows/tests.yml
@@ -44,3 +44,9 @@ jobs:
 
       - name: Run tests
         run: python -m unittest discover -s tests -p 'test_*.py'
+
+# Cancel the current workflow (tests) for pull requests (head_ref) only. See:
+# https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-a-fallback-value
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
45 changes: 45 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,45 @@
repos:
  - repo: https://github.com/asottile/pyupgrade
    rev: v2.29.1
    hooks:
      - id: pyupgrade
        args: [--py37-plus]
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: check-added-large-files
      - id: check-ast
      - id: check-json
      - id: check-merge-conflict
      - id: check-toml
      - id: check-yaml
      - id: debug-statements
      - id: detect-private-key
      - id: fix-byte-order-marker
      - id: fix-encoding-pragma
        args: [--remove]
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
        exclude: ^tests/examples/pacer/nef/s3/.*\.txt$

  - repo: https://github.com/ikamensh/flynt/
    rev: '0.69'
    hooks:
      - id: flynt
        args: [--line-length=79, --transform-concats]

  - repo: https://github.com/psf/black
    rev: 21.12b0
    hooks:
      - id: black

  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
        name: isort (python)

  - repo: https://github.com/pycqa/flake8
    rev: 3.9.0
    hooks:
      - id: flake8
24 changes: 12 additions & 12 deletions CHANGES.md
@@ -11,7 +11,7 @@ Changes:
- None yet

Fixes:
- Initial support for finding short cites with non-standard regexes, including fixing short cite extraction for `Mich.`, `N.Y.2d` and `Pa.`.

## Current

@@ -22,7 +22,7 @@ Features:
- Autogenerated documentation

Changes:
- This version lands one more iteration of the APIs to make them more consistent. Sorry. Hopefully this will be the last of its kind for a while. The need for these changes became obvious when we began generating documentation. The changes are all in name only, not in functionality. So: 1) the `annotate` function is renamed as `annotate_citations`; 2) The `find_citations` module has been renamed `find` (so, do `from eyecite.find import get_citations` instead of `from eyecite.find_citations import get_citations`); 3) The `cleaners` module is now named `clean`; and 4) The `clean_text` function has been moved from `utils` to `clean` (so, do `from eyecite.clean import clean_text` instead of `from eyecite.utils import clean_text`).


**2.2.0 - 2021-06-04**
@@ -35,15 +35,15 @@ Features:
- We now use page-based heuristics while looking up the citation that a pin cite refers to. For example, if an opinion says:

> 1 U.S. 200. blah blah. 2 We Missed This 20. blah blah. Id. at 22.

We might miss the second citation for whatever reason. The pin cite refers to the second citation, not the first, and you can be sure of that because the first citation begins on page 200 and the pin cite references page 22. When resolving the pin cite, we will no longer link it up to the first citation.

Similarly, an analysis of the Caselaw Access Project's dataset indicates that all but the longest ~300 cases are shorter than 150 pages, so we also now ignore pin cites that don't make sense according to that heuristic. For example, this (made up) pin cite is also likely wrong because it's overwhelmingly unlikely that `1 U.S. 200` is 632 pages long:

> 1 U.S. 200 blah blah 1 U.S. 832

The longest case in the Caselaw Access Project collection is [United States v. Philip Morris USA, Inc](https://cite.case.law/f-supp-2d/449/1/), at 986 pages, in case you were wondering. Figures.

[Issue #74][74], [PR #79][79].
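A minimal sketch of these page heuristics (the helper name and exact bounds check are illustrative, not eyecite's actual code):

```python
MAX_OPINION_PAGES = 150  # bound suggested by the CAP analysis above

def pin_cite_is_plausible(cite_start_page: int, pin_cite_page: int) -> bool:
    """Could a pin cite to pin_cite_page belong to a citation that
    begins on cite_start_page?"""
    return (
        cite_start_page
        <= pin_cite_page
        <= cite_start_page + MAX_OPINION_PAGES
    )

# "1 U.S. 200 ... Id. at 22": page 22 comes before page 200, so the
# pin cite cannot refer to that citation.
assert not pin_cite_is_plausible(200, 22)
# "1 U.S. 200 ... 1 U.S. 832": a 632-page opinion is overwhelmingly unlikely.
assert not pin_cite_is_plausible(200, 832)
assert pin_cite_is_plausible(200, 210)
```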

Changes:
@@ -84,15 +84,15 @@ Changes:
Fixes:
- Fixes crashing errors on some partial supra, id, and short form citations.
- Fixes unbalanced tags created by annotation.
- Fixes year parsing to move away from `isdigit`, which can capture
  unicode superscript numbers like "123 U.S. 456 (196⁴)" (see the sketch
  after this list)
- Allow years all the way back to 1600 instead of 1754. Anybody got a citation
  from before then?
- Page number matching is tightened to be much more strict about how it
  matches Roman numerals. This change will prevent some citations from being
  matched if they have extremely common Roman numerals. See #56 for a full
  discussion.
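A quick illustration of the `isdigit` pitfall, using only standard Python:

```python
# str.isdigit() accepts unicode superscripts, so "196⁴" looks like a year...
assert "196⁴".isdigit()
# ...but int() rejects it, because "⁴" is not a decimal digit:
try:
    int("196⁴")
except ValueError:
    print("not a parseable year")
# str.isdecimal() is the stricter test and rejects the superscript:
assert not "196⁴".isdecimal()
```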

**2.0.2** - Adds missing dependency to toml file, nukes setup.py and
requirements.txt. We're now fully in the poetry world.

4 changes: 2 additions & 2 deletions eyecite/find.py
@@ -121,10 +121,10 @@ def _extract_full_citation(
     # journals). Get the set of all sources that matched, preferring exact
     # matches to variations:
     token = cast(CitationToken, words[index])
-    cite_sources = set(
+    cite_sources = {
         e.reporter.source
         for e in (token.exact_editions or token.variation_editions)
-    )
+    }
 
     # get citation_class based on cite_sources
     citation_class: Type[ResourceCitation]
15 changes: 8 additions & 7 deletions eyecite/helpers.py
@@ -23,7 +23,6 @@
     POST_SHORT_CITATION_REGEX,
     YEAR_REGEX,
 )
-from eyecite.utils import strip_punct
 
 BACKWARD_SEEK = 28  # Median case name length in the CL db is 28 (2016-02-26)

@@ -40,18 +39,20 @@ def get_court_by_paren(paren_string: str) -> Optional[str]:
     Does not work on SCOTUS, since that court lacks parentheticals, and
     needs to be handled after disambiguation has been completed.
     """
-    court_str = strip_punct(paren_string)
-
+    # remove punctuation and convert to upper case
+    court_str = re.sub(r"[^\w\s]", "", paren_string).upper()
     court_code = None
     if court_str:
         # Map the string to a court, if possible.
         for court in courts:
-            # Use startswith because citations are often missing final period,
-            # e.g. "2d Cir"
-            if court["citation_string"].startswith(court_str):
+            # remove punctuation and convert to upper case because punctuation
+            # is often unreliable
+            if (
+                re.sub(r"[^\w\s]", "", court["citation_string"]).upper()
+                == court_str
+            ):
                 court_code = court["id"]
                 break
 
     return court_code


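A small demonstration of the new matching rule (the court entries below are made up; eyecite reads the real list from the courts-db project):

```python
import re

def normalize(s: str) -> str:
    # Same normalization as the patched get_court_by_paren: drop
    # punctuation, then upper-case.
    return re.sub(r"[^\w\s]", "", s).upper()

# Invented courts-db-style records:
courts = [
    {"id": "scotus", "citation_string": "U.S."},
    {"id": "sc", "citation_string": "S.C."},
    {"id": "ca2", "citation_string": "2d Cir."},
]

def court_for_paren(paren_string: str):
    court_str = normalize(paren_string)
    for court in courts:
        if normalize(court["citation_string"]) == court_str:
            return court["id"]
    return None

assert court_for_paren("2d Cir") == "ca2"  # missing final period still matches
assert court_for_paren("S.C.") == "sc"     # no confusion with "U.S."
```

Dropping punctuation before an exact comparison keeps the old tolerance for missing final periods while avoiding the accidental matches `startswith` allowed when a parenthetical was a prefix of more than one court's citation string.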
2 changes: 1 addition & 1 deletion eyecite/resolve.py
@@ -141,7 +141,7 @@ def _resolve_shortcase_citation(
             candidates.append((full_citation, resource))
 
     # Remove duplicates and only accept if one candidate remains
-    if len(set(resource for full_citation, resource in candidates)) == 1:
+    if len({resource for full_citation, resource in candidates}) == 1:
         return candidates[0][1]
 
     # Otherwise, if there is an antecedent guess, try to refine further
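The deduplication step above in miniature (the candidate data is invented for illustration):

```python
# Each candidate pairs a matched full citation with the resource it
# resolves to; different citations may point at the same resource.
candidates = [
    ("full_cite_1", "resource_a"),
    ("full_cite_2", "resource_a"),
]

# The set comprehension collapses duplicate resources, so the short cite
# is accepted when exactly one distinct resource remains.
if len({resource for _full_citation, resource in candidates}) == 1:
    resolved = candidates[0][1]
    assert resolved == "resource_a"
```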
6 changes: 2 additions & 4 deletions eyecite/tokenizers.py
@@ -362,9 +362,7 @@ class AhocorasickTokenizer(Tokenizer):
     def __post_init__(self):
         """Set up helpers to narrow down possible extractors."""
         # Build a set of all extractors that don't list required strings
-        self.unfiltered_extractors = set(
-            e for e in EXTRACTORS if not e.strings
-        )
+        self.unfiltered_extractors = {e for e in EXTRACTORS if not e.strings}
         # Build a pyahocorasick filter for all case-sensitive extractors
         self.case_sensitive_filter = self.make_ahocorasick_filter(
             (s, e)
@@ -445,7 +443,7 @@ def on_match(index, start, end, flags, context):
         byte_to_str_offset = {}
         last_byte_offset = 0
         str_offset = 0
-        byte_offsets = sorted(set(i for m in matches for i in m[1]))
+        byte_offsets = sorted({i for m in matches for i in m[1]})
         for byte_offset in byte_offsets:
             try:
                 str_offset += len(
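A rough sketch of the string-filtering idea behind `AhocorasickTokenizer` (the extractor names and required strings are invented; the real automaton is built by `make_ahocorasick_filter`):

```python
import ahocorasick  # the pyahocorasick package

# Map each required string to the extractors that need it:
strings_to_extractors = {
    "U.S.": ["us_reporter_extractor"],
    "F. Supp.": ["f_supp_extractor"],
}

automaton = ahocorasick.Automaton()
for string, extractors in strings_to_extractors.items():
    automaton.add_word(string, extractors)
automaton.make_automaton()

# Only extractors whose required strings actually occur in the text need
# to run, plus the "unfiltered" extractors that list no required strings.
text = "See 1 U.S. 200 (1800)."
candidates = set()
for _end_index, extractors in automaton.iter(text):
    candidates.update(extractors)

assert candidates == {"us_reporter_extractor"}
```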