Command to update case names #4647

Open
wants to merge 37 commits into
base: main

@quevon24 quevon24 commented Nov 4, 2024

A command to update the case names using the metadata from datasets. This will update all possible names, not just those from Resource or a source combined with Resource.

You can specify the delay between updates to avoid issues with Redis (updating the case names triggers indexing):
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --delay 0.1

You can perform a dry run to verify that everything is fine:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --dry-run

You can control the chunk size when reading the csv to avoid memory issues:
docker exec -it cl-django python /opt/courtlistener/manage.py update_resource_casenames --filepath /opt/courtlistener/cl/assets/media/federal_3d.csv --chunk-size 100000
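
The command's implementation is not shown in this conversation, so the following is only a minimal standard-library sketch of how the three options could interact. The names `read_in_chunks` and `process_csv`, the `update` callback, and the CSV column names are hypothetical and are not taken from the PR.

```python
import csv
import time
from itertools import islice
from typing import Callable, Iterable, Iterator


def read_in_chunks(rows: Iterable, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size rows so the whole file is never held in memory."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def process_csv(
    handle,
    chunk_size: int = 100_000,
    delay: float = 0.0,
    dry_run: bool = False,
    update: Callable[[dict], None] = print,
) -> int:
    """Return the number of rows that were (or, in a dry run, would be) processed."""
    processed = 0
    for chunk in read_in_chunks(csv.DictReader(handle), chunk_size):
        for row in chunk:
            processed += 1
            if dry_run:
                continue  # Dry run: count the row but change nothing
            update(row)  # Hypothetical callback standing in for the real update
            time.sleep(delay)  # Throttle so downstream indexing work does not pile up
    return processed
```

In this sketch a dry run still walks every row, so it reports the same count as a real run without ever calling `update`.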

@quevon24 quevon24 marked this pull request as ready for review November 4, 2024 23:11
@quevon24 quevon24 requested a review from flooie November 4, 2024 23:11
@quevon24 quevon24 marked this pull request as draft November 6, 2024 15:25
cl/corpus_importer/utils.py Outdated
@quevon24 quevon24 marked this pull request as ready for review November 14, 2024 15:54
Comment on lines +305 to +311
add_citations_to_cluster(
[
f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
for cite in valid_citations
],
matches[0].cluster_id,
)
Contributor

reporter or corrected_reporter?

Member Author

reporter, because parse_citations() generates a list of dicts with volume, reporter (which we get from corrected_reporter()), and page.

Comment on lines 269 to 271
case_name_match = check_case_names_match(
west_case_name, cl_case_name
)
Contributor

We still aren't properly preparing case names.

def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
    return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)

I think something like this should work. It combines initials.

I checked a number of rows that would fail otherwise. For example, this row would reduce to []:

"In re K.K.","United States Court of Appeals, Ninth Circuit.","July 01, 2014","756 F.3d 1169","2014 WL 2937488","14-71875","756"

Member Author

The regex needs some tweaking. It almost works, but I found some cases where there are spaces between the initials:

"In re P. I.","Court of Appeal, First District, Division 3, California.","January 20, 1989","207 Cal.App.3d 316","254 Cal.Rptr. 774","A041221","254" https://www.courtlistener.com/opinion/2168480/in-re-pi/

"In re A. M.","Court of Appeal, First District, Division 2, California.","December 01, 1989","216 Cal.App.3d 319","264 Cal.Rptr. 666","A042237","264" https://www.courtlistener.com/opinion/2175113/in-re-am/

Member Author

What do you think about this?

def combine_initials(case_name: str) -> str:
    """Combine initials in case captions

    :param case_name: the case caption
    :return: the cleaned case caption
    """
    pattern = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."

    return re.sub(
        pattern,
        lambda match: match.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )
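
To make the difference between the two proposals in this thread concrete, here is a small self-contained comparison on the sample rows quoted above. The helper names `combine_first` and `combine_second` are made up for this illustration; the patterns are copied unchanged from the two suggestions.

```python
import re

# First suggestion in this thread: only the periods are removed.
FIRST_PATTERN = r"((?:[A-Z]\.?\s?){2,})(\s|$)"
# Follow-up suggestion: periods and the spaces between initials are removed.
SECOND_PATTERN = r"\b[A-Z](?:[A-Z\.]|\s)*[A-Z]\b\."


def combine_first(case_name: str) -> str:
    return re.sub(FIRST_PATTERN, lambda m: m.group(0).replace(".", ""), case_name)


def combine_second(case_name: str) -> str:
    return re.sub(
        SECOND_PATTERN,
        lambda m: m.group(0).replace(" ", "").replace(".", ""),
        case_name,
    )


for name in ("In re K.K.", "In re P. I.", "In re A. M."):
    print(f"{name!r}: first={combine_first(name)!r} second={combine_second(name)!r}")
```

With these inputs the first pattern turns "In re P. I." into "In re P I" (the space survives), while the second yields "In re PI". The second pattern requires a trailing period, so a caption such as "M X Smith" is left unchanged by it.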

Contributor

@flooie flooie left a comment

Getting closer, but a few things jump out. Thanks @quevon24

@quevon24 quevon24 assigned quevon24 and unassigned quevon24 Nov 19, 2024
@quevon24 quevon24 requested a review from flooie November 20, 2024 15:29
Comment on lines +218 to +220
pattern = r"((?:[A-Z]\.?\s?){2,})(\s|$)"

return re.sub(pattern, lambda m: m.group(0).replace(".", ""), case_name)
Contributor

This pattern doesn't appear to remove initials.

Contributor

Perhaps this would work?

initials_pattern = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")

match = initials_pattern.search(case_name)  # Search for the initials pattern
if match:
    initials = match.group()
    # Drop periods and inner whitespace, but keep a trailing space
    compressed_initials = re.sub(r"(?!\s$)[\s\.]", "", initials)
    case_name = case_name.replace(initials, compressed_initials)

I tested it on:

case_names = [
"M. X. Smith",
"M.X. Smith",
"M X Smith",
"M. X. J. Smith",
"M X J Smith",
]
and it worked as expected.
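
For reference, the snippet above can be wrapped in a function and run against the listed names. This is only a sketch of the suggestion as written, not the code that ended up in the PR.

```python
import re

# Two or more single capital letters, each optionally followed by a period
# and/or a space, ending at whitespace or the end of the string.
INITIALS_PATTERN = re.compile(r"([A-Z]{1}\.?\s?){2,}(\s|$)")


def combine_initials(case_name: str) -> str:
    """Compress the first run of initials found in a case caption."""
    match = INITIALS_PATTERN.search(case_name)
    if not match:
        return case_name
    initials = match.group()
    # Remove periods and inner whitespace, keeping a trailing space if present
    compressed = re.sub(r"(?!\s$)[\s\.]", "", initials)
    return case_name.replace(initials, compressed)


case_names = [
    "M. X. Smith",
    "M.X. Smith",
    "M X Smith",
    "M. X. J. Smith",
    "M X J Smith",
]
for name in case_names:
    print(f"{name!r} -> {combine_initials(name)!r}")
```

Because the sketch uses `search`, only the first run of initials is compressed: "A. B. v. C. D." becomes "AB v. C. D.". Whether later runs also need handling depends on the captions in the dataset.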

Comment on lines 320 to 326
add_citations_to_cluster(
[
f"{cite.get('volume')} {cite.get('reporter')} {cite.get('page')}"
for cite in valid_citations
],
matches[0].cluster_id,
)
Contributor

I'd love to reuse this, but I don't think we should. It's not as robust, and it forces us to re-create the citation string only to parse it again and run it through eyecite a second time. I think it would be smarter to identify the type when we validate the citations and just create the citation here.

Something like:


            matched_cluster = matches[0].cluster

            if dry_run:
                # Dry run, don't save anything
                continue

            # Update case names

            with transaction.atomic():
                cluster_updated, docket_updated = update_matched_case_name(
                    matched_cluster, west_case_name
                )

                for citation in valid_citations:
                    if Citation.objects.filter(
                        reporter=citation.get("reporter"),
                        cluster_id=matched_cluster.id,
                    ).exists():
                        # A citation with this reporter is already stored; this
                        # is only a problem if it is not the same citation
                        logger.warning("Can not save mismatched citation")
                        raise ValueError(
                            "Cluster already has a citation with this reporter"
                        )
                    citation["cluster_id"] = matched_cluster.id
                    Citation.objects.get_or_create(**citation)

I think.
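
The "if it's not the same one" part of the check above can be illustrated without Django. The sketch below uses a hypothetical `Cite` dataclass and a hypothetical `plan_citation_updates` helper, and it assumes a cluster holds at most one citation per reporter. It separates citations that can be created from those that conflict with a stored citation for the same reporter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cite:
    """Hypothetical stand-in for a stored citation row."""

    volume: str
    reporter: str
    page: str


def plan_citation_updates(
    existing: list[Cite], candidates: list[Cite]
) -> tuple[list[Cite], list[Cite]]:
    """Split candidates into (to_create, conflicts)."""
    # Assumption: at most one stored citation per reporter in a cluster
    by_reporter = {cite.reporter: cite for cite in existing}
    to_create, conflicts = [], []
    for cite in candidates:
        stored = by_reporter.get(cite.reporter)
        if stored is None:
            to_create.append(cite)  # Reporter not present yet
        elif stored != cite:
            conflicts.append(cite)  # Same reporter, different volume or page
        # An identical citation is already stored: nothing to do
    return to_create, conflicts
```

In this sketch an identical citation is skipped silently, and only a same-reporter citation with a different volume or page is reported as a conflict.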

Creating the citations directly means parse_citations would look something like this:


def parse_citations(citation_strings: list[str]) -> list[dict]:
    """Validate citations with Eyecite.

    :param citation_strings: List of citation strings to validate.
    :return: List of validated citation dictionaries with volume, reporter, and page.
    """
    validated_citations = []

    for cite_str in citation_strings:
        # Get citations from the string
        found_cites = get_citations(cite_str, tokenizer=HYPERSCAN_TOKENIZER)
        if len(found_cites) != 1:
            continue

        citation = found_cites[0]

        if isinstance(citation, FullCaseCitation):
            volume = citation.groups.get("volume")

            # Validate the volume
            if not volume or not volume.isdigit():
                continue

            # `citation` is already a single FullCaseCitation, so it is not indexed
            if not citation.corrected_reporter():
                reporter_type = Citation.STATE
            else:
                cite_type_str = citation.all_editions[0].reporter.cite_type
                reporter_type = map_reporter_db_cite_type(cite_type_str)

            validated_citations.append(
                {
                    "volume": citation.groups["volume"],
                    "reporter": citation.corrected_reporter(),
                    "page": citation.groups["page"],
                    "type": reporter_type
                }
            )

    return validated_citations

I took this from the other function, but I'm not sure why we need the odd fallback to Citation.STATE when the citation has no corrected reporter.
