feat(pacer): Refine multi-document page handling logic #402

ERosendo · 2024-09-30T14:52:05Z

Key changes:

- Refines the `handleCombinedPdfPageView` (appellate) and `handleCombinedPDFView` (district) methods to accurately identify multi-document pages containing only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within center divs, and the number of these divs corresponds to the number of files in the combined PDF. Both methods now count the center nodes to determine whether a warning should be displayed.
  - In appellate pages, an additional filter was implemented to ensure accurate counting, as center divs may also be used to wrap the page's main content.
- In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this challenge, this PR introduces two helper methods that use the URL of the PACER page and the existing `DocToCases` mapping stored in our local storage:
  - District court URLs frequently contain a query parameter named `exclude_attachments`. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the `DocToCases` mapping, we can identify the missing document ID.

    This PR introduces the `getPacerDocIdFromExcludeList` helper function. It takes a list of excluded document IDs as input and returns the corresponding document ID based on the `DocToCases` mapping.
  - Appellate court URLs often include a query parameter named `dls`. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the `DocToCases` mapping based on this list, we can determine the document ID.

    The `getPacerDocIdFromPartialId` method implements this filtering process, taking the partial ID as input and returning the extracted document ID.
- Introduces a new utility function, `parseDataFromReceiptTable`, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, it lacks the information needed to identify the document in multi-document pages. To address this limitation, this function extracts data directly from the receipt table, which is more reliable and more complete.
- Integrates all helper functions into the `handleCombinedPdfPageView` (appellate) and `handleCombinedPDFView` (district) methods. This will enable us to insert banners for available documents and upload the PDFs to the recap archive.
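The two URL-based lookups described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the example URLs, and the sample IDs are hypothetical, and it assumes that (a) the mapping passed in only holds the full document IDs for the current attachment page and (b) a shortened ID matches the end of the full document ID.

```javascript
// Hypothetical sketch; names and matching rules are assumptions, not the PR's code.

// Split a comma-separated query parameter into trimmed, non-empty IDs.
function readIdList(pageUrl, paramName) {
  const raw = new URL(pageUrl).searchParams.get(paramName) || '';
  return raw
    .split(',')
    .map((id) => id.trim())
    .filter(Boolean);
}

// Assumption: a shortened ID is a suffix of the full document ID.
function matchesShortId(fullId, shortId) {
  return fullId.endsWith(shortId);
}

// District: the wanted document is the only known ID that is NOT excluded.
function docIdFromExcludeList(pageUrl, docToCases) {
  const excluded = readIdList(pageUrl, 'exclude_attachments');
  const remaining = Object.keys(docToCases).filter(
    (fullId) => !excluded.some((shortId) => matchesShortId(fullId, shortId))
  );
  // Anything other than exactly one candidate is treated as "unknown".
  return remaining.length === 1 ? remaining[0] : null;
}

// Appellate: the wanted document is the only known ID that IS listed in dls.
function docIdFromIncludeList(pageUrl, docToCases) {
  const included = readIdList(pageUrl, 'dls');
  const matches = Object.keys(docToCases).filter((fullId) =>
    included.some((shortId) => matchesShortId(fullId, shortId))
  );
  return matches.length === 1 ? matches[0] : null;
}

// Made-up mapping of full document ID -> case ID, for illustration only.
const exampleMapping = {
  '035001001': '9',
  '035001002': '9',
  '035001003': '9',
};

console.log(
  docIdFromExcludeList(
    'https://ecf.example.gov/multidoc?exclude_attachments=1002,1003',
    exampleMapping
  )
); // prints "035001001"

console.log(
  docIdFromIncludeList('https://ecf.example.gov/multidoc?dls=1002', exampleMapping)
); // prints "035001002"
```

Returning `null` whenever the result is not exactly one ID keeps the caller from inserting a banner or uploading a PDF under a guessed document ID.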

Here are GIFs showing how our extension works in appellate and district courts:

District Court:

Appellate Court:

This commit introduces a new utility function to efficiently extract data from receipt tables, addressing the limitation of multi-document pages. This enhancement improves the extension's ability to accurately process documents.

ERosendo · 2024-10-01T23:20:10Z

@mlissner in my last commit, I implemented MIME type validation to prevent the upload of invalid file formats. During testing, I encountered an issue with certain district courts, such as case 2:24-mj-00100, where downloading a single document from a multi-document page seemed to be restricted. Despite attempts in both Chrome and Firefox, with and without extensions, I consistently received the error message: `Cannot redisplay /tmp/1727589-2--109361.pdf, it has already been shown once`. Although some courts' tips-and-tricks pages suggest this might be a Chrome-related issue, my testing indicated that the error was not browser-specific.

Upon further investigation, I discovered that the extension was sending the HTML page containing the error message to the CL API (not great). With the validation in place, we can prevent the upload of this invalid HTML content.
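As a rough illustration of the kind of check described here (not the code in the commit), the sketch below rejects anything whose reported MIME type is not `application/pdf`. The function name is hypothetical, and it assumes the downloaded response is available as a `Blob` or any object exposing a `type` string.

```javascript
// Hypothetical helper; the name and the exact rule are assumptions.
// Accepts a Blob, or any object with a `type` string such as a Blob's MIME type.
function isPdfResponse(blobLike) {
  const rawType = (blobLike && blobLike.type) || '';
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mimeType = rawType.split(';')[0].trim().toLowerCase();
  return mimeType === 'application/pdf';
}

// An HTML error page (like the "Cannot redisplay" response) is rejected.
console.log(isPdfResponse({ type: 'text/html; charset=utf-8' })); // prints false
console.log(isPdfResponse({ type: 'application/pdf' })); // prints true
```

A caller would skip the upload when this returns `false`, so the HTML error page never reaches the API.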

Here are GIFs showing the error message in different browsers:

Chrome:

Firefox:

Safari:

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 7 times, most recently from c562279 to 6ba914a on October 1, 2024 11:58

ERosendo added 2 commits October 1, 2024 08:30

feat(appellate): Refine multi-document page handling logic

2277d9f

feat(district): Refine multi-document page handling logic

ea84011

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 5 times, most recently from 8e2c8ed to e679a22 on October 1, 2024 19:54

ERosendo added 11 commits October 1, 2024 18:54

feat(docs): Changelog Update

4d18676

feat(utils): Adds helper method to get pacer doc ids using exclude lists

c241760

feat(district): Adds helper function to check document availability

4a8db9b

This commit introduces a helper function that encapsulates the logic to check if a specific document within a combined PDF page is available in the recap archive.

feat(appellate): Tweaks the findDocLinksFromAnchors method

3ba7123

Ensures that the `docsToCases` mapping is correctly populated when processing attachment pages.

feat(utils): Introduces getDocToCasesMapping helper function

16d580c

Adds a new utility function to retrieve the `DocToCases` mapping from storage

feat(utils): Adds helper method to get docId using shortened version

affc549

feat(appellate): Adds helper function to check document availability

6ed14c5

Introduces a new function to determine if a particular document within a multi-doc page is available in the recap archive.

feat(district): Adds logic to upload file from multi-doc page

1a1656a

feat(appellate): Adds logic to upload file from multi-doc page

43b6a8a

feat(pdf_upload): Add MIME type validation to ensure data integrity

dbb4b31

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch from e679a22 to dbb4b31 on October 1, 2024 22:54

ERosendo marked this pull request as ready for review October 1, 2024 22:54

ERosendo requested a review from mlissner October 1, 2024 22:55

mlissner assigned elisa-a-v Nov 12, 2024

mlissner requested a review from elisa-a-v November 12, 2024 22:25

ERosendo mentioned this pull request Nov 13, 2024

Incorrectly identified split pages freelawproject/recap#349

Open
