
feat(recap.mergers): Update PACER attachment processing #4665

Open · wants to merge 3 commits into main
Conversation

@flooie (Contributor) commented Nov 6, 2024

@ERosendo

This change attempts to address the black hole that is doppelgänger criminal attachments.

In cases where we have "doppelgänger" dockets, it's possible that the pacer_case_id is filtering out the docket that is available for the attachment.

@flooie flooie requested a review from ERosendo November 6, 2024 21:58
@flooie (Contributor, Author) commented Nov 6, 2024

Should fix issue #4664

cl/recap/mergers.py (outdated review thread, resolved)
@flooie flooie requested a review from grossir November 7, 2024 12:44
except RECAPDocument.DoesNotExist as exc:
# In cases where we have "doppelgänger" dockets drop pacer
# case id and check if the docket exists once more.
if params.get("pacer_case_id"):
@grossir (Contributor) commented Nov 7, 2024

I think the params dict key you want to look for is actually "docket_entry__docket__pacer_case_id", as on line 1641 of this same function. search_recapdocument does not have a pacer_case_id field, so this key would never be present in the params dict.

                                             Table "public.search_recapdocument"
         Column          |           Type           | Collation | Nullable |                     Default                      
-------------------------+--------------------------+-----------+----------+--------------------------------------------------
 id                      | integer                  |           | not null | nextval('search_recapdocument_id_seq'::regclass)
 date_created            | timestamp with time zone |           | not null | 
 date_modified           | timestamp with time zone |           | not null | 
 date_upload             | timestamp with time zone |           |          | 
 document_type           | integer                  |           | not null | 
 document_number         | character varying(32)    |           | not null | 
 attachment_number       | smallint                 |           |          | 
 pacer_doc_id            | character varying(64)    |           | not null | 
 is_available            | boolean                  |           |          | 
 sha1                    | character varying(40)    |           | not null | 
 filepath_local          | character varying(1000)  |           | not null | 
 filepath_ia             | character varying(1000)  |           | not null | 
 docket_entry_id         | integer                  |           | not null | 
 description             | text                     |           | not null | 
 ocr_status              | smallint                 |           |          | 
 plain_text              | text                     |           | not null | 
 page_count              | integer                  |           |          | 
 is_free_on_pacer        | boolean                  |           |          | 
 ia_upload_failure_count | smallint                 |           |          | 
 file_size               | integer                  |           |          | 
 thumbnail               | character varying(100)   |           |          | 
 thumbnail_status        | smallint                 |           | not null | 
 is_sealed               | boolean                  |           |          | 
 acms_document_guid      | character varying(64)    |           | not null | 

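The fallback grossir describes can be sketched in isolation. This is a hypothetical, dict-based illustration of the key fix, not the real mergers.py code: the filter key must traverse the docket-entry relation (matching the schema above, where pacer_case_id lives on the docket, not on search_recapdocument), and the fallback simply drops that key before retrying the lookup.

```python
from typing import Optional


def build_fallback_params(params: dict) -> Optional[dict]:
    """Return a copy of ``params`` without the pacer_case_id filter,
    or None if the filter was never present (no fallback possible)."""
    # RECAPDocument has no ``pacer_case_id`` column, so the filter has
    # to span the relation to the docket; a bare "pacer_case_id" key
    # would never match anything in the params dict.
    key = "docket_entry__docket__pacer_case_id"
    if key not in params:
        return None
    fallback = dict(params)  # leave the original params untouched
    del fallback[key]
    return fallback


params = {
    "pacer_doc_id": "14703438821",
    "docket_entry__docket__pacer_case_id": "27767",
}
print(build_fallback_params(params))  # only the pacer_doc_id filter remains
```

In the real code the returned dict would feed a second `RECAPDocument.objects.get(**fallback)` attempt inside the `except RECAPDocument.DoesNotExist` branch shown in the diff.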
@albertisfu (Contributor) left a comment

I did some tests using the doppelgänger oknd case with pacer_case_id 27765 and 27767, for document 811. Here are my findings:

  • The approach of removing the pacer_case_id from the RD lookup params will help only when the main RD related to the attachment page being uploaded is not present in the docket related to the pacer_case_id used for the attachment page upload.

For instance, if the attachment page upload is sent with the following params:

upload_type=2
court=oknd
pacer_case_id=27767

And considering the attachment page has the following metadata:

document_number=811
pacer_case_id=27767
pacer_doc_id=14703438821
attachments:
  attachment_number=1
  pacer_doc_id=14713438822

If the docket with pacer_case_id 27767 doesn't have a main RD with pacer_doc_id=14703438821, but the docket with pacer_case_id 27765 does have it, then removing the pacer_case_id from the lookup will match main RECAPDocument 14703438821 from the 27765 docket and merge it there.

However, I see an issue here:

  • If there are more than 2 dockets for the doppelgänger case and if the main RECAPDocument for the provided pacer_doc_id exists in more than one of them, this lookup will return a RECAPDocument.MultipleObjectsReturned exception.

So it seems that the main issue is that attachment page uploads for doppelgänger case documents (at least for this court) always belong to the same pacer_case_id, even if you clicked the document from a different pacer_case_id. You can confirm this by:

  1. Go to case 27765:
     https://ecf.oknd.uscourts.gov/cgi-bin/DktRpt.pl?27765
  2. Request documents from 811 to 811 and run the report.
  3. Click document 811.
  4. On the attachment page, inspect Document Number 811. You will see:

<a href="https://ecf.oknd.uscourts.gov/doc1/14713438821" onclick="goDLS('/doc1/14713438821','27767','4230','','1','1','','','');return(false);">811</a>

The link for this document contains the pacer_case_id 27767, which I believe the extension uses to upload the attachment page. (@ERosendo could help to confirm this).

So it seems, at least in this court, we won't get attachment page uploads for any pacer_case_id other than the latest one.
We can confirm this by looking for attachment page uploads for 27765:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27765&upload_type=2
None can be found.

While there are attachment uploads for 27767:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27767&upload_type=2

Unless this issue can be solved from the extension, I think we'll need to apply a different approach. For instance, I believe we can lookup RECAPDocuments with the provided pacer_doc_id and court, and if more than one is found, merge the attachment page in all of them. If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there. Otherwise, we'd need to apply it every time, assuming multiple documents with the same pacer_doc_id and court are only possible on doppelgänger dockets.

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.
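The alternative albertisfu proposes, looking up by pacer_doc_id and court only, then merging into every match, can be sketched with plain dicts standing in for RECAPDocument rows. This is an illustration of the idea under those assumptions, not the actual merge code:

```python
def merge_attachment_everywhere(documents, court_id, pacer_doc_id, attachment_data):
    """Attach ``attachment_data`` to every main document matching
    (court_id, pacer_doc_id), ignoring pacer_case_id entirely."""
    matches = [
        doc for doc in documents
        if doc["court_id"] == court_id and doc["pacer_doc_id"] == pacer_doc_id
    ]
    for doc in matches:  # zero, one, or many doppelgänger copies
        doc["attachments"] = attachment_data
    return matches


# Two doppelgänger dockets in oknd sharing the same pacer_doc_id:
docs = [
    {"court_id": "oknd", "pacer_case_id": "27765",
     "pacer_doc_id": "14703438821", "attachments": None},
    {"court_id": "oknd", "pacer_case_id": "27767",
     "pacer_doc_id": "14703438821", "attachments": None},
]
merged = merge_attachment_everywhere(
    docs, "oknd", "14703438821", [{"attachment_number": 1}]
)
print(len(merged))  # 2: both doppelgänger copies get the attachment page
```

The upload sent with pacer_case_id 27767 would then reach the 27765 docket as well, which is the behavior the comment argues for.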

@mlissner (Member) commented:

If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there.

Not that I know of, unfortunately.

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

@albertisfu (Contributor) commented:

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

Sure, just regarding this comment so we can get the right size:

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

@mlissner (Member) commented:

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

I think PRs always take priority, except over P0s, which I see as "something is burning".

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.

@mlissner (Member) commented Nov 13, 2024

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

Would you have space for both solutions in this sprint?

@albertisfu (Contributor) commented:

I think PRs always take priority, except over P0s, which I see as "something is burning".

Correct, it's just that in this case the solution required would be completely different from the one in the PR.

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id, and just make the copy? Seems easy enough, I suppose and I think it's a good step for doppelgänger.

Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that as a first step, we always ignore the pacer_case_id if we encounter a RECAPDocument.MultipleObjectsReturned. Then, we check if the RDs belong to different dockets and merge the attachment pages and PDFs across all of them.

If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates.

This approach would essentially make the pacer_case_id parameter useless. That’s why I was concerned about finding a way to detect potential doppelgängers and only apply the logic in those cases. So now my question is: Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.
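The two-branch handling outlined above, merge across copies on different dockets, but deduplicate copies within the same docket, can be sketched as follows. The field names and the "best RD" selection rule here are placeholders; the real cleanup logic in mergers.py is more involved:

```python
from collections import defaultdict


def split_matches(matches):
    """Given RDs sharing one (court, pacer_doc_id), return
    (one RD per docket to merge into, same-docket duplicates to clean up)."""
    by_docket = defaultdict(list)
    for rd in matches:
        by_docket[rd["docket_id"]].append(rd)

    keep, duplicates = [], []
    for rds in by_docket.values():
        # Placeholder "best RD" rule: prefer one with a PDF available.
        rds.sort(key=lambda rd: rd["is_available"], reverse=True)
        keep.append(rds[0])          # merge the upload into this one
        duplicates.extend(rds[1:])   # true duplicates within one docket
    return keep, duplicates


matches = [
    {"docket_id": 1, "is_available": False},
    {"docket_id": 1, "is_available": True},   # duplicate within docket 1
    {"docket_id": 2, "is_available": False},  # doppelgänger copy
]
keep, dupes = split_matches(matches)
print(len(keep), len(dupes))  # 2 1
```

On `RECAPDocument.MultipleObjectsReturned`, the merge would then be applied to each RD in `keep`, while `dupes` would go through the existing duplicate-cleanup path.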

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases.

The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases.

So I think it’s best to have a separate issue and PR for the PDFs.

I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachments and PDF uploads.

Would you have space for both solutions in this sprint?

I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.

@mlissner (Member) commented:

Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

I wondered the same thing. I think it's OK; we haven't seen anything like that, as far as I know.

Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too!

Thank you.
