Command to update case names #4647
cl/corpus_importer/management/commands/update_resource_casenames.py
update date string formats
…_resource_casenames
cl/corpus_importer/management/commands/update_casenames_wl_dataset.py
add_citations_to_cluster(
    [
        f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
        for cite in valid_citations
    ],
    matches[0].cluster_id,
)
Should this be reporter or corrected_reporter?
reporter, because parse_citations() generates a list of dicts with volume, reporter (which we get from corrected_reporter()), and page.
cl/corpus_importer/management/commands/update_casenames_wl_dataset.py
case_name_match = check_case_names_match(
    west_case_name, cl_case_name
)
We still aren't properly preparing case names.
def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)
I think something like this should work; it combines the initials.
A number of the rows I checked would fail otherwise.
"In re K.K.","United States Court of Appeals, Ninth Circuit.","July 01, 2014","756 F.3d 1169","2014 WL 2937488","14-71875","756"
for example, would reduce to [].
The regex needs some tweaking. It almost works, but I found some cases where there are spaces between the initials:
"In re P. I.","Court of Appeal, First District, Division 3, California.","January 20, 1989","207 Cal.App.3d 316","254 Cal.Rptr. 774","A041221","254"
https://www.courtlistener.com/opinion/2168480/in-re-pi/
"In re A. M.","Court of Appeal, First District, Division 2, California.","December 01, 1989","216 Cal.App.3d 319","264 Cal.Rptr. 666","A042237","264"
https://www.courtlistener.com/opinion/2175113/in-re-am/
What do you think about this:
def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."
    return re.sub(
        pattern,
        lambda match: match.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )
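For reference, a minimal sketch that runs both proposed patterns against the three captions quoted in this thread. The names combine_initials_v1 and combine_initials_v2 are labels invented here only to tell the two proposals apart; the patterns themselves are copied unchanged from the comments above.

```python
import re


def combine_initials_v1(case_name: str) -> str:
    """First proposal: strips periods from a run of initials, keeps spaces."""
    pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)


def combine_initials_v2(case_name: str) -> str:
    """Second proposal: strips both periods and spaces from the run."""
    pattern = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."
    return re.sub(
        pattern,
        lambda match: match.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )


# Captions taken from the CSV rows quoted in this thread.
captions = ["In re K.K.", "In re P. I.", "In re A. M."]

for caption in captions:
    print(repr(caption), "->", repr(combine_initials_v1(caption)),
          "|", repr(combine_initials_v2(caption)))
```

On these inputs the first pattern turns "In re P. I." into "In re P I" (the space survives), while the second yields "In re PI", which lines up with the in-re-pi slug in the CourtListener URL above. Note that the second pattern requires a trailing period, so initials written without periods would be left untouched.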
Getting closer, but a few things jump out. Thanks @quevon24
remove unused words in FALSE_POSITIVES; update tokenize_case_name() function; improve code readability
…_resource_casenames
…_resource_casenames
    pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"

    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)
This pattern doesn't appear to remove initials.
Perhaps this would work:
initials_pattern = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")
match = initials_pattern.search(case_name)  # Search for the initials pattern
if match:
    initials = match.group()
    # Drop spaces and periods, but keep the trailing space before the next word
    compressed_initials = re.sub(r"(?!\s$)[\s\.]", "", initials)
    case_name = case_name.replace(initials, compressed_initials)
I tested it on
case_names = [
    "M. X. Smith",
    "M.X. Smith",
    "M X Smith",
    "M. X. J. Smith",
    "M X J Smith",
]
and it worked as expected.
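The snippet above can be wrapped in a function so it is easy to re-run against the listed names. This is only a sketch: compress_initials and INITIALS_PATTERN are hypothetical names, and the body is the suggested code unchanged.

```python
import re

# Pattern copied from the suggestion above: two or more initials, each
# optionally followed by a period and a space, ending at whitespace or
# at the end of the string.
INITIALS_PATTERN = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")


def compress_initials(case_name: str) -> str:
    """Collapse the first run of initials found in a caption."""
    match = INITIALS_PATTERN.search(case_name)
    if match:
        initials = match.group()
        # Remove spaces and periods, except a single trailing space.
        compressed = re.sub(r"(?!\s$)[\s\.]", "", initials)
        case_name = case_name.replace(initials, compressed)
    return case_name


case_names = [
    "M. X. Smith",
    "M.X. Smith",
    "M X Smith",
    "M. X. J. Smith",
    "M X J Smith",
]

for name in case_names:
    print(repr(name), "->", repr(compress_initials(name)))
```

Because it uses search() rather than iterating over every match, only the first run of initials in a caption is compressed, which may matter for captions that have initials on both sides.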
add_citations_to_cluster(
    [
        f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
        for cite in valid_citations
    ],
    matches[0].cluster_id,
)
I'd love to reuse this, but I don't think we should: it's not as robust, and it forces us to re-create the citation string, parse it again, and do a bunch of eyecite work. I think it would be smarter to identify the type when we validate the citations and just create the citation here.
Something like ...
matched_cluster = matches[0].cluster
if dry_run:
    # Dry run, don't save anything
    continue
# Update case names
with transaction.atomic():
    cluster_updated, docket_updated = update_matched_case_name(
        matched_cluster, west_case_name
    )
    for citation in valid_citations:
        # TODO: this should only fail if the existing citation is not the same one
        if Citation.objects.filter(
            reporter=citation.get("reporter"), cluster=matched_cluster.id
        ).exists():
            logger.warning("Can not save mismatched citation")
            raise ValueError("Issue with reporter already here")
        citation["cluster_id"] = matched_cluster.id
        Citation.objects.get_or_create(**citation)
I think.
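To make the "already here... if its not the same one" condition concrete, here is a rough, ORM-free sketch of one reading of that rule: an identical citation is a no-op, a different citation for the same reporter is a conflict, and anything else is added. Everything in it (Cite, MismatchedCitation, merge_citations) is hypothetical and only stands in for the Citation model and the get_or_create call; returning a new set only on success loosely mirrors the all-or-nothing behavior of transaction.atomic().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cite:
    """Hypothetical stand-in for a stored citation row."""
    volume: str
    reporter: str
    page: str


class MismatchedCitation(Exception):
    """The cluster already has a different citation for this reporter."""


def merge_citations(existing: set[Cite], incoming: list[Cite]) -> set[Cite]:
    """Return existing plus incoming, or raise without changing anything."""
    merged = set(existing)  # work on a copy so a failure leaves `existing` intact
    for cite in incoming:
        if cite in merged:
            continue  # same citation is already stored: nothing to do
        if any(other.reporter == cite.reporter for other in merged):
            # Same reporter but different volume or page: refuse to save.
            raise MismatchedCitation(f"{cite.volume} {cite.reporter} {cite.page}")
        merged.add(cite)
    return merged


stored = {Cite("756", "F.3d", "1169")}
print(merge_citations(stored, [Cite("2014", "WL", "2937488")]))
```

Whether a second citation from the same reporter should always be treated as a conflict is an open question in this thread; the sketch just picks the stricter reading.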
This means parse_citations() would also need to return the type, so it would look something like this:
def parse_citations(citation_strings: list[str]) -> list[dict]:
    """Validate citations with Eyecite.

    :param citation_strings: List of citation strings to validate.
    :return: List of validated citation dictionaries with volume, reporter,
        page, and type.
    """
    validated_citations = []
    for cite_str in citation_strings:
        # Get citations from the string
        found_cites = get_citations(cite_str, tokenizer=HYPERSCAN_TOKENIZER)
        if len(found_cites) != 1:
            continue
        citation = found_cites[0]
        if isinstance(citation, FullCaseCitation):
            volume = citation.groups.get("volume")
            # Validate the volume
            if not volume or not volume.isdigit():
                continue
            # `citation` is already a single citation here, so no [0] indexing
            if not citation.corrected_reporter():
                reporter_type = Citation.STATE
            else:
                cite_type_str = citation.all_editions[0].reporter.cite_type
                reporter_type = map_reporter_db_cite_type(cite_type_str)
            validated_citations.append(
                {
                    "volume": citation.groups["volume"],
                    "reporter": citation.corrected_reporter(),
                    "page": citation.groups["page"],
                    "type": reporter_type,
                }
            )
    return validated_citations
I took this from the other function, but I'm not sure why we need the odd fallback: if the citation has no corrected reporter, use Citation.STATE ... ?
…_resource_casenames
A command to update the case names using the metadata from datasets. This will update all possible names, not just those from Resource or a source combined with Resource.
You can specify a delay between updates to avoid issues with Redis (updating the case names triggers indexing):
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --delay 0.1
You can perform a dry run to verify that everything is fine:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --dry-run
You can control the chunk size when reading the csv to avoid memory issues:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --chunk-size 100000
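As a rough illustration of how the four flags above could fit together, here is a standard-library sketch. The flag names come from the commands above; everything else (build_parser, iter_chunks, process, the default values, and the use of the csv module) is an assumption made for the example and is not taken from the actual command.

```python
import argparse
import csv
import io
import time
from itertools import islice
from typing import Iterator, TextIO


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="update_resource_casenames")
    parser.add_argument("--filepath", required=True, help="Path to the CSV file")
    # Default values below are assumptions for this sketch.
    parser.add_argument("--delay", type=float, default=0.1)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--chunk-size", type=int, default=100000)
    return parser


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[list[str]]]:
    """Yield lists of at most chunk_size rows so the file is never fully in memory."""
    reader = csv.reader(handle)
    while chunk := list(islice(reader, chunk_size)):
        yield chunk


def process(handle: TextIO, chunk_size: int, delay: float, dry_run: bool) -> int:
    """Count the rows that would be updated, pausing after each real update."""
    updated = 0
    for chunk in iter_chunks(handle, chunk_size):
        for _row in chunk:
            if dry_run:
                continue  # dry run: match only, save nothing
            updated += 1  # placeholder for the real case name update
            time.sleep(delay)  # give the indexing triggered by the update time to settle
    return updated


args = build_parser().parse_args(["--filepath", "federal_3d.csv", "--dry-run"])
sample = io.StringIO('"In re K.K.","756 F.3d 1169"\n"In re P. I.","207 Cal.App.3d 316"\n')
print(process(sample, args.chunk_size, args.delay, args.dry_run))
```

Reading in chunks keeps memory bounded for large datasets, and the per-update delay spaces out the indexing work that each saved case name triggers.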