Command to update case names #4647

Open

quevon24 wants to merge 37 commits into main from new_resource_casenames

Member

quevon24 commented Nov 4, 2024

A command to update the case names using the metadata from datasets. This will update all possible names, not just those from Resource or a source combined with Resource.

You can specify the delay between updates to avoid issues with Redis (updating the case names triggers indexing):
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --delay 0.1

You can perform a dry run to verify that everything is fine:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --dry-run

You can control the chunk size when reading the CSV to avoid memory issues:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --chunk-size 100000
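
To make the three options above concrete, here is a minimal standalone sketch of how `--delay`, `--dry-run`, and `--chunk-size` could interact while reading the CSV. It is not the PR's code: the real command is a Django management command, and `build_parser`, `iter_chunks`, `process`, the `update_row` callback, and the default values are all hypothetical names and numbers chosen for illustration.

```python
import argparse
import csv
import io
import time
from itertools import islice


def build_parser() -> argparse.ArgumentParser:
    """Mirror the flags from the PR description (defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="update_resource_casenames")
    parser.add_argument("--filepath", required=True)
    parser.add_argument("--delay", type=float, default=0.1)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--chunk-size", type=int, default=100000)
    return parser


def iter_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows so the file is never fully loaded."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def process(handle, chunk_size, delay, dry_run, update_row):
    """Apply update_row to each row; skip writes entirely on a dry run."""
    updated = 0
    for chunk in iter_chunks(handle, chunk_size):
        for row in chunk:
            if dry_run:
                continue
            update_row(row)
            updated += 1
            # Pause between updates so downstream indexing is not flooded.
            time.sleep(delay)
    return updated


if __name__ == "__main__":
    sample = io.StringIO("case_name,citation\nIn re K.K.,756 F.3d 1169\n")
    print(process(sample, chunk_size=1, delay=0, dry_run=True, update_row=print))
```

Sleeping only after a real update keeps a dry run fast, since a dry run triggers no indexing.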

quevon24 added 3 commits

November 2, 2024 09:49


          feat(resource): new command to update resource casenames

b693643


          feat(resource): New command to update resource casenames

158bb5c


          Merge branch 'main' into new_resource_casenames

semgrep-app bot reviewed

cl/corpus_importer/management/commands/update_resource_casenames.py Outdated, resolved

quevon24 marked this pull request as ready for review

November 4, 2024 23:11

quevon24 requested a review from flooie

November 4, 2024 23:11

quevon24 added 4 commits

November 4, 2024 17:17


          Merge branch 'main' into new_resource_casenames

826584a


          feat(resource): clean docket numbers before matching them, improve wi…

95c7d01

…nnow_case_name function


          feat(resource): clean docket numbers

f5e6637


          feat(resource): update regex in winnow_case_name

202c649

quevon24 marked this pull request as draft

November 6, 2024 15:25

quevon24 added 8 commits

November 6, 2024 09:25


          Merge branch 'main' into new_resource_casenames

58ec6b0


          feat(resource): handle no docket number in matched cluster docket

update date string formats


          Merge remote-tracking branch 'origin/new_resource_casenames' into new…

b1d42b9

…_resource_casenames


          Merge branch 'main' into new_resource_casenames

32b8ac9


          Merge branch 'main' into new_resource_casenames

96df44a


          feat(resource): command to update case names using wl dataset

23e3ef7


          feat(resource): command to update case names using wl dataset

54f5f61


          feat(casenames): refactor code

bdf5adc

semgrep-app bot reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Resolved

flooie reviewed

cl/corpus_importer/utils.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

quevon24 added 3 commits

November 13, 2024 15:42


          feat(casenames): refactor code

18164ae


          feat(casenames): refactor code

d262178


          feat(casenames): log if we have both citations, but still try to impr…

6ca4f3f

…ove case name

quevon24 marked this pull request as ready for review

November 14, 2024 15:54


          Merge branch 'main' into new_resource_casenames

bf59085

mlissner assigned quevon24

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py

Comment on lines +305 to +311

+                          add_citations_to_cluster(
+                              [
+                                  f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
+                                  for cite in valid_citations
+                              ],
+                              matches[0].cluster_id,
+                          )

Contributor

flooie Nov 19, 2024

reporter or corrected_reporter ?

Member Author

quevon24 Nov 19, 2024

reporter, because parse_citations() generates a list of dicts with volume, reporter (obtained via corrected_reporter()), and page.

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated

Comment on lines 269 to 271

+                              case_name_match = check_case_names_match(
+                                  west_case_name, cl_case_name
+                              )

Contributor

flooie Nov 19, 2024

We still aren't properly preparing case names.

def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)

I think something like this should work; it combines initials.

We have a number of rows I checked that would fail otherwise.

"In re K.K.","United States Court of Appeals, Ninth Circuit.","July 01, 2014","756 F.3d 1169","2014 WL 2937488","14-71875","756"

for example, would reduce to []

Member Author

quevon24 Nov 19, 2024

The regex needs some tweaking. It almost works, but I found some cases where there are spaces between the initials:

"In re P. I.","Court of Appeal, First District, Division 3, California.","January 20, 1989","207 Cal.App.3d 316","254 Cal.Rptr. 774","A041221","254" https://www.courtlistener.com/opinion/2168480/in-re-pi/

"In re A. M.","Court of Appeal, First District, Division 2, California.","December 01, 1989","216 Cal.App.3d 319","264 Cal.Rptr. 666","A042237","264" https://www.courtlistener.com/opinion/2175113/in-re-am/

Member Author

quevon24 Nov 19, 2024

What do you think about this:

def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."

    return re.sub(
        pattern,
        lambda match: match.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )
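
As a quick check of the two proposals in this thread, the sketch below runs both patterns against the captions quoted above. The patterns are copied verbatim from the comments; the function names `combine_initials_v1` and `combine_initials_v2` are made up here to tell them apart.

```python
import re

# Pattern from the first proposal: strips periods but keeps spaces.
FIRST_PATTERN = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
# Pattern from the second proposal: strips both periods and spaces.
SECOND_PATTERN = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."


def combine_initials_v1(case_name: str) -> str:
    return re.sub(
        FIRST_PATTERN, lambda m: m.group(0).replace(".", ""), case_name
    )


def combine_initials_v2(case_name: str) -> str:
    return re.sub(
        SECOND_PATTERN,
        lambda m: m.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )


for caption in ["In re K.K.", "In re P. I.", "In re A. M."]:
    print(caption, "|", combine_initials_v1(caption), "|", combine_initials_v2(caption))
```

Both versions turn "In re K.K." into "In re KK". For "In re P. I." the first leaves "In re P I" while the second gives "In re PI". The second pattern requires a trailing period, so initials written without periods (for example "M X Smith") pass through unchanged.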

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

flooie requested changes

Contributor

flooie left a comment

Getting closer, but a few things jump out. Thanks @quevon24

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated, resolved

quevon24 added 7 commits

November 19, 2024 17:11


          feat(casenames): chunksize argument removed

a5c106e

remove unused words in FALSE_POSITIVES
update tokenize_case_name() function
improve code readability


          Merge remote-tracking branch 'origin/new_resource_casenames' into new…

fc2b9ed

…_resource_casenames


          feat(casenames): improve code readability

12d899f


          Merge branch 'main' into new_resource_casenames

417dcbd


          feat(casenames): Join abbreviations/acronyms

76ecb76


          Merge remote-tracking branch 'origin/new_resource_casenames' into new…

a5ab4ef

…_resource_casenames


          Merge branch 'main' into new_resource_casenames

1ad2c97

quevon24 assigned quevon24 and unassigned quevon24

quevon24 requested a review from flooie

November 20, 2024 15:29

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py

Comment on lines +218 to +220

		pattern = r"((?:[A-Z]\.?\s?){2,})(\s\|$)"

		return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)

Contributor

flooie Nov 21, 2024

This pattern doesn't appear to remove initials.

Contributor

flooie Nov 21, 2024

perhaps

initials_pattern = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")

match = initials_pattern.search(case_name)  # Search for the initials pattern
if match:
    initials = match.group()
    compressed_initials = re.sub(r"(?!\s$)[\s\.]", "", initials)
    case_name = case_name.replace(initials, compressed_initials)

Would this work? I tested it on:

case_names = [
"M. X. Smith",
"M.X. Smith",
"M X Smith",
"M. X. J. Smith",
"M X J Smith",
]
and it worked as expected.
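
For anyone who wants to reproduce that check, here is the snippet above wrapped in a function and run on the same five names. The wrapper name `compress_initials` is hypothetical; the two regexes are unchanged from the comment.

```python
import re

INITIALS_PATTERN = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")


def compress_initials(case_name: str) -> str:
    """Join the first run of initials, dropping inner periods and spaces."""
    match = INITIALS_PATTERN.search(case_name)
    if not match:
        return case_name
    initials = match.group()
    # Remove periods and spaces, but keep a single trailing space if present
    # so the initials stay separated from the following word.
    compressed = re.sub(r"(?!\s$)[\s\.]", "", initials)
    return case_name.replace(initials, compressed)


case_names = [
    "M. X. Smith",
    "M.X. Smith",
    "M X Smith",
    "M. X. J. Smith",
    "M X J Smith",
]
for name in case_names:
    print(name, "->", compress_initials(name))
```

Note that `search` only locates the first run of initials in a caption, so a name with two separate runs would need `finditer` or `re.sub` with a callback instead.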

flooie reviewed

cl/corpus_importer/management/commands/update_casenames_wl_dataset.py Outdated

Comment on lines 320 to 326

+                      add_citations_to_cluster(
+                          [
+                              f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
+                              for cite in valid_citations
+                          ],
+                          matches[0].cluster_id,
+                      )

Contributor

flooie Nov 21, 2024

I'd love to reuse this, but I don't think we should. It's not as robust, and it causes us to re-create the citation string only to parse it again and do a bunch of eyecite work. I think it would be smarter to identify the type when we validate the citations and just create the citation here.

something like ...


            matched_cluster = matches[0].cluster

            if dry_run:
                # Dry run, don't save anything
                continue

            # Update case names

            with transaction.atomic():
                cluster_updated, docket_updated = update_matched_case_name(
                    matched_cluster, west_case_name
                )

                for citation in valid_citations:
                    if Citation.objects.filter(reporter=citation.get("reporter"), cluster=matched_cluster.id).exists():
                        logger.warning("Can not save mismatched citation")
                        # Raising inside the atomic block rolls back the case name update too
                        raise ValueError("issue with reporter already here... if its not the same one")
                    citation['cluster_id'] = matched_cluster.id
                    Citation.objects.get_or_create(**citation)

I think.
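
The "if its not the same one" condition in the snippet above can be sketched without the ORM. This is only an illustration of the intended rule, using in-memory data: `Cite` and `citations_to_create` are hypothetical names, and the real code would query `Citation` inside the transaction instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cite:
    volume: str
    reporter: str
    page: str


def citations_to_create(existing: list[Cite], incoming: list[Cite]) -> list[Cite]:
    """Return the incoming citations that are safe to add to a cluster.

    An identical citation is skipped. A different citation for a reporter the
    cluster already has is treated as a mismatch and aborts the whole batch.
    """
    to_create = []
    for cite in incoming:
        same_reporter = [c for c in existing if c.reporter == cite.reporter]
        if cite in same_reporter:
            continue  # already stored, nothing to do
        if same_reporter:
            raise ValueError(f"Mismatched citation for reporter {cite.reporter}")
        to_create.append(cite)
    return to_create


existing = [Cite("756", "F.3d", "1169")]
incoming = [Cite("756", "F.3d", "1169"), Cite("2014", "WL", "2937488")]
print(citations_to_create(existing, incoming))
```

Checking every incoming citation before creating any of them has the same effect as raising inside `transaction.atomic()`: either all additions go through or none do.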

This means parse_citations() would look something like this:


def parse_citations(citation_strings: list[str]) -> list[dict]:
    """Validate citations with Eyecite.

    :param citation_strings: List of citation strings to validate.
    :return: List of validated citation dictionaries with volume, reporter, and page.
    """
    validated_citations = []

    for cite_str in citation_strings:
        # Get citations from the string
        found_cites = get_citations(cite_str, tokenizer=HYPERSCAN_TOKENIZER)
        if len(found_cites) != 1:
            continue

        citation = found_cites[0]

        if isinstance(citation, FullCaseCitation):
            volume = citation.groups.get("volume")

            # Validate the volume
            if not volume or not volume.isdigit():
                continue

            # citation is already a single FullCaseCitation, so it is not indexed
            if not citation.corrected_reporter():
                reporter_type = Citation.STATE
            else:
                cite_type_str = citation.all_editions[0].reporter.cite_type
                reporter_type = map_reporter_db_cite_type(cite_type_str)

            validated_citations.append(
                {
                    "volume": citation.groups["volume"],
                    "reporter": citation.corrected_reporter(),
                    "page": citation.groups["page"],
                    "type": reporter_type
                }
            )

    return validated_citations

I took this from the other function, but I'm not sure why we need the odd check that falls back to Citation.STATE when there is no corrected reporter.

quevon24 and others added 6 commits

November 21, 2024 12:59


          Merge branch 'main' into new_resource_casenames

8fd8acd


          Merge branch 'main' into new_resource_casenames

4f819a9


          feat(casenames): refactor code to parse and add citations

c05c40a


          Merge branch 'main' into new_resource_casenames

841e072


          feat(casenames): add new date format found in dataset

9bef8df


          Merge remote-tracking branch 'origin/new_resource_casenames' into new…

856c22b

…_resource_casenames
