
feat(recap.mergers): Update PACER attachment processing #4665

Open · wants to merge 3 commits into main
Conversation

@flooie (Contributor) commented Nov 6, 2024

@ERosendo

This change attempts to address the black hole that is doppelgänger criminal attachments.

In cases where we have "doppelgänger" dockets, it's possible that the pacer_case_id is filtering out the docket that is available for the attachment.

@flooie flooie requested a review from ERosendo November 6, 2024 21:58
@flooie (Contributor, Author) commented Nov 6, 2024

Should fix issue #4664

cl/recap/mergers.py (outdated review thread, resolved)
@flooie flooie requested a review from grossir November 7, 2024 12:44
except RECAPDocument.DoesNotExist as exc:
# In cases where we have "doppelgänger" dockets drop pacer
# case id and check if the docket exists once more.
if params.get("pacer_case_id"):
@grossir (Contributor) commented Nov 7, 2024

I think the params dict key you want to look for is actually "docket_entry__docket__pacer_case_id", as on line 1641 of this same function. search_recapdocument does not have a pacer_case_id field, so this key would never be present in the params dict.

                                             Table "public.search_recapdocument"
         Column          |           Type           | Collation | Nullable |                     Default                      
-------------------------+--------------------------+-----------+----------+--------------------------------------------------
 id                      | integer                  |           | not null | nextval('search_recapdocument_id_seq'::regclass)
 date_created            | timestamp with time zone |           | not null | 
 date_modified           | timestamp with time zone |           | not null | 
 date_upload             | timestamp with time zone |           |          | 
 document_type           | integer                  |           | not null | 
 document_number         | character varying(32)    |           | not null | 
 attachment_number       | smallint                 |           |          | 
 pacer_doc_id            | character varying(64)    |           | not null | 
 is_available            | boolean                  |           |          | 
 sha1                    | character varying(40)    |           | not null | 
 filepath_local          | character varying(1000)  |           | not null | 
 filepath_ia             | character varying(1000)  |           | not null | 
 docket_entry_id         | integer                  |           | not null | 
 description             | text                     |           | not null | 
 ocr_status              | smallint                 |           |          | 
 plain_text              | text                     |           | not null | 
 page_count              | integer                  |           |          | 
 is_free_on_pacer        | boolean                  |           |          | 
 ia_upload_failure_count | smallint                 |           |          | 
 file_size               | integer                  |           |          | 
 thumbnail               | character varying(100)   |           |          | 
 thumbnail_status        | smallint                 |           | not null | 
 is_sealed               | boolean                  |           |          | 
 acms_document_guid      | character varying(64)    |           | not null | 

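The fallback grossir describes can be sketched in isolation. This is a hypothetical, dict-based illustration of the key fix, not the real mergers.py code: the filter key must traverse the docket-entry relation (matching the schema above, where pacer_case_id lives on the docket, not on search_recapdocument), and the fallback simply drops that key before retrying the lookup.

```python
from typing import Optional


def build_fallback_params(params: dict) -> Optional[dict]:
    """Return a copy of ``params`` without the pacer_case_id filter,
    or None if the filter was never present (no fallback possible)."""
    # RECAPDocument has no ``pacer_case_id`` column, so the filter has
    # to span the relation to the docket; a bare "pacer_case_id" key
    # would never match anything in the params dict.
    key = "docket_entry__docket__pacer_case_id"
    if key not in params:
        return None
    fallback = dict(params)  # leave the original params untouched
    del fallback[key]
    return fallback


params = {
    "pacer_doc_id": "14703438821",
    "docket_entry__docket__pacer_case_id": "27767",
}
print(build_fallback_params(params))  # only the pacer_doc_id filter remains
```

In the real code the returned dict would feed a second `RECAPDocument.objects.get(**fallback)` attempt inside the `except RECAPDocument.DoesNotExist` branch shown in the diff.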
@albertisfu (Contributor) left a comment

I did some tests using the doppelgänger oknd case with pacer_case_id 27765 and 27767, for document 811. Here are my findings:

  • The approach of removing the pacer_case_id from the RD lookup params will help only when the main RD related to the attachment page being uploaded is not present in the docket related to the pacer_case_id used for the attachment page upload.

For instance, if the attachment page upload is sent with the following params:

upload_type=2
court=oknd
pacer_case_id=27767

And considering the attachment page has the following metadata:

document_number=811
pacer_case_id=27767
pacer_doc_id=14703438821
attachments:
  attachment_number=1
  pacer_doc_id=14713438822

If the docket with pacer_case_id 27767 doesn't have a main RD with pacer_doc_id=14703438821, but the docket with pacer_case_id 27765 does have it, then removing the pacer_case_id from the lookup will match main RECAPDocument 14703438821 from the 27765 docket and merge it there.

However, I see an issue here:

  • If there are more than 2 dockets for the doppelgänger case and if the main RECAPDocument for the provided pacer_doc_id exists in more than one of them, this lookup will return a RECAPDocument.MultipleObjectsReturned exception.

So it seems that the main issue is that attachment page uploads for doppelgänger case documents (at least for this court) always belong to the same pacer_case_id, even if you clicked the document from a different pacer_case_id. You can confirm this by:

  1. Go to case 27765:
     https://ecf.oknd.uscourts.gov/cgi-bin/DktRpt.pl?27765
  2. Request documents from 811 to 811 and run the report.
  3. Click document 811.
  4. On the attachment page, inspect Document Number 811. You will see:

<a href="https://ecf.oknd.uscourts.gov/doc1/14713438821" onclick="goDLS('/doc1/14713438821','27767','4230','','1','1','','','');return(false);">811</a>

The link for this document contains the pacer_case_id 27767, which I believe the extension uses to upload the attachment page. (@ERosendo could help to confirm this).

So it seems, at least in this court, we won't get attachment page uploads for any pacer_case_id other than the latest one.
We can confirm this by looking for attachment page uploads for 27765:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27765&upload_type=2
None can be found.

While there are attachment uploads for 27767:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27767&upload_type=2

Unless this issue can be solved from the extension, I think we'll need to apply a different approach. For instance, I believe we can lookup RECAPDocuments with the provided pacer_doc_id and court, and if more than one is found, merge the attachment page in all of them. If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there. Otherwise, we'd need to apply it every time, assuming multiple documents with the same pacer_doc_id and court are only possible on doppelgänger dockets.

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.
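The alternative albertisfu proposes, looking up by pacer_doc_id and court only, then merging into every match, can be sketched with plain dicts standing in for RECAPDocument rows. This is an illustration of the idea under those assumptions, not the actual merge code:

```python
def merge_attachment_everywhere(documents, court_id, pacer_doc_id, attachment_data):
    """Attach ``attachment_data`` to every main document matching
    (court_id, pacer_doc_id), ignoring pacer_case_id entirely."""
    matches = [
        doc for doc in documents
        if doc["court_id"] == court_id and doc["pacer_doc_id"] == pacer_doc_id
    ]
    for doc in matches:  # zero, one, or many doppelgänger copies
        doc["attachments"] = attachment_data
    return matches


# Two doppelgänger dockets in oknd sharing the same pacer_doc_id:
docs = [
    {"court_id": "oknd", "pacer_case_id": "27765",
     "pacer_doc_id": "14703438821", "attachments": None},
    {"court_id": "oknd", "pacer_case_id": "27767",
     "pacer_doc_id": "14703438821", "attachments": None},
]
merged = merge_attachment_everywhere(
    docs, "oknd", "14703438821", [{"attachment_number": 1}]
)
print(len(merged))  # 2: both doppelgänger copies get the attachment page
```

The upload sent with pacer_case_id 27767 would then reach the 27765 docket as well, which is the behavior the comment argues for.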

@mlissner (Member) commented:

If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there.

Not that I know of, unfortunately.

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

@albertisfu (Contributor) commented:

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

Sure, just regarding this comment so we can get the right size:

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

@mlissner (Member) commented:

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

I think PRs always take priority, except over P0s, which I see as "something is burning".

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.

@mlissner (Member) commented Nov 13, 2024

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

Would you have space for both solutions in this sprint?

@albertisfu (Contributor) commented:

I think PRs always take priority, except over P0s, which I see as "something is burning".

Correct, it's just that in this case the solution required would be completely different from the one in the PR.

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id, and just make the copy? Seems easy enough, I suppose and I think it's a good step for doppelgänger.

Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that as a first step, we always ignore the pacer_case_id if we encounter a RECAPDocument.MultipleObjectsReturned. Then, we check if the RDs belong to different dockets and merge the attachment pages and PDFs across all of them.

If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates.

This approach would essentially make the pacer_case_id parameter useless. That’s why I was concerned about finding a way to detect potential doppelgängers and only apply the logic in those cases. So now my question is: Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.
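The two-branch handling outlined above, merge across copies on different dockets, but deduplicate copies within the same docket, can be sketched as follows. The field names and the "best RD" selection rule here are placeholders; the real cleanup logic in mergers.py is more involved:

```python
from collections import defaultdict


def split_matches(matches):
    """Given RDs sharing one (court, pacer_doc_id), return
    (one RD per docket to merge into, same-docket duplicates to clean up)."""
    by_docket = defaultdict(list)
    for rd in matches:
        by_docket[rd["docket_id"]].append(rd)

    keep, duplicates = [], []
    for rds in by_docket.values():
        # Placeholder "best RD" rule: prefer one with a PDF available.
        rds.sort(key=lambda rd: rd["is_available"], reverse=True)
        keep.append(rds[0])          # merge the upload into this one
        duplicates.extend(rds[1:])   # true duplicates within one docket
    return keep, duplicates


matches = [
    {"docket_id": 1, "is_available": False},
    {"docket_id": 1, "is_available": True},   # duplicate within docket 1
    {"docket_id": 2, "is_available": False},  # doppelgänger copy
]
keep, dupes = split_matches(matches)
print(len(keep), len(dupes))  # 2 1
```

On `RECAPDocument.MultipleObjectsReturned`, the merge would then be applied to each RD in `keep`, while `dupes` would go through the existing duplicate-cleanup path.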

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases.

The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases.

So I think it’s best to have a separate issue and PR for the PDFs.

I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachments and PDF uploads.

Would you have space for both solutions in this sprint?

I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.

@mlissner (Member) commented:

Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

I wondered the same thing. I think it's OK; we haven't seen anything like that, as far as I know.

Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too!

Thank you.
