feat(pacer): Refine multi-document page handling logic #402

Open
wants to merge 13 commits into
base: main
from

Conversation

ERosendo
Contributor

@ERosendo ERosendo commented Sep 30, 2024

Key changes:

  • Refines the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods to accurately identify multi-document pages containing only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within center divs, and the number of these divs corresponds to the number of files in the combined PDF. Both methods now check for the presence of center nodes to determine if a warning should be displayed.

    In appellate pages, an additional filter was implemented to ensure accurate counting, as center divs may also be used to wrap the page's main content.

  • In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this challenge, this PR introduces two helper methods that use the URL of the PACER page and the existing DocToCases mapping stored in our local storage:

    • District court URLs frequently contain a query parameter named exclude_attachments. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the DocToCases mapping, we can identify the missing document ID.

      This PR introduces the getPacerDocIdFromExcludeList helper function. It takes a list of excluded document IDs as input and returns the corresponding document ID based on the DocToCases mapping.

    • Appellate court URLs often include a query parameter named dls. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the DocToCases mapping based on this list, we can determine the document ID.

      The getPacerDocIdFromPartialId method implements this filtering process, taking the partial ID as input and returning the extracted document ID.

  • Introduces a new utility function, parseDataFromReceiptTable, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, it lacks the necessary information to identify the document in multi-document pages. To address this limitation, this function extracts data directly from the receipt table, providing a more reliable and comprehensive approach.

  • Integrates all helper functions into the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods. This will enable us to insert banners for available documents and upload the PDFs to the recap archive.
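As a rough illustration of the two URL-based lookups described above, here is a minimal sketch. It is not the code from this PR: the function names `findIncludedDocId` and `findDocIdByPartialId`, the sample URL, the sample IDs, and the shape of the mapping (full document ID to case ID) are all hypothetical, and it assumes a "shortened" ID is a trailing fragment of the full document ID.

```javascript
// Hypothetical sample of a DocToCases-style mapping: full doc ID -> case ID.
// The real shape of the stored mapping is an assumption here.
const docToCases = {
  '035021404143': '531316',
  '035021404144': '531316',
  '035021404145': '531316',
  '035099999999': '777777',
};

// Assumption: a shortened ID is a trailing fragment of the full document ID.
function matchesShortId(fullId, shortId) {
  return shortId.length > 0 && fullId.endsWith(shortId);
}

// Split a comma-separated query parameter value into trimmed, non-empty IDs.
function parseIdList(value) {
  return (value || '')
    .split(',')
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

// District sketch: exclude_attachments lists the documents left OUT of the
// combined PDF, so the included document is the one known ID for the same
// case that is not excluded. Returns null when the answer is ambiguous.
function findIncludedDocId(url, mapping) {
  const excluded = parseIdList(
    new URL(url).searchParams.get('exclude_attachments')
  );
  if (excluded.length === 0) return null;
  const fullIds = Object.keys(mapping);
  const isExcluded = (full) => excluded.some((s) => matchesShortId(full, s));
  // Use any excluded ID to work out which case this page belongs to.
  const known = fullIds.find(isExcluded);
  if (!known) return null;
  const caseId = mapping[known];
  const remaining = fullIds.filter(
    (full) => mapping[full] === caseId && !isExcluded(full)
  );
  return remaining.length === 1 ? remaining[0] : null;
}

// Appellate sketch: a dls entry names a document that IS in the combined PDF,
// so the mapping is filtered down to the single full ID that matches it.
function findDocIdByPartialId(partialId, mapping) {
  const matches = Object.keys(mapping).filter((full) =>
    matchesShortId(full, partialId)
  );
  return matches.length === 1 ? matches[0] : null;
}

// Hypothetical district URL; only the exclude_attachments parameter matters.
const districtUrl =
  'https://ecf.example.uscourts.gov/cgi-bin/show_multidocs.pl' +
  '?caseid=531316&exclude_attachments=21404144,21404145';

const includedDocId = findIncludedDocId(districtUrl, docToCases);
const appellateDocId = findDocIdByPartialId('21404145', docToCases);
```

Both sketches return null rather than guessing when zero or several IDs match, since a wrong document ID would attach a banner or upload to the wrong docket entry.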

Here are GIFs showing how our extension works in appellate and district courts:

  • District Court:

Screen Recording 2024-10-01 at 4 48 23 PM

  • Appellate Court:

Screen Recording 2024-10-01 at 3 52 34 PM

Fixes freelawproject/recap#349

@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 7 times, most recently from c562279 to 6ba914a Compare October 1, 2024 11:58
@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 5 times, most recently from 8e2c8ed to e679a22 Compare October 1, 2024 19:54
This commit introduces a helper function that encapsulates logic to check if a specific document within a combined PDF page is available in the recap archive.
Ensures that the `docsToCases` mapping is correctly populated when processing attachment pages.
Adds a new utility function to retrieve the `DocToCases` mapping from storage.
Introduces a new function to determine if a particular document within a multi-doc page is available in the recap archive.
This commit introduces a new utility function to efficiently extract data from receipt tables, addressing the limitation of multi-document pages. This enhancement improves the extension's ability to accurately process documents.
@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch from e679a22 to dbb4b31 Compare October 1, 2024 22:54
@ERosendo ERosendo marked this pull request as ready for review October 1, 2024 22:54
@ERosendo ERosendo requested a review from mlissner October 1, 2024 22:55
@ERosendo
Contributor Author

ERosendo commented Oct 1, 2024

@mlissner in my last commit, I implemented MIME type validation to prevent the upload of invalid file formats. During testing, I encountered an issue with certain district courts, such as in case 2:24-mj-00100, where downloading a single document from a multi-document page seemed to be restricted. In both Chrome and Firefox, with and without extensions, I consistently received the error message: Cannot redisplay /tmp/1727589-2--109361.pdf, it has already been shown once. While some courts' tips and tricks pages suggest it might be a Chrome-related issue, my testing indicated that the error was not browser-specific.

Upon further investigation, I discovered that the extension was sending the HTML page containing the error message to the CL API (not great). By implementing the validation, we can prevent the upload of the invalid HTML content.
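For illustration only, the kind of check described above could look roughly like the sketch below. This is not the code from the commit: `isPdfMimeType` and `shouldUpload` are hypothetical names, and the sketch assumes the downloaded content is available as a Blob-like object whose `type` property carries the MIME type.

```javascript
// Return true only when the MIME type's essence is application/pdf.
// Parameters such as "; charset=UTF-8" are ignored.
function isPdfMimeType(mimeType) {
  if (typeof mimeType !== 'string') return false;
  const essence = mimeType.split(';')[0].trim().toLowerCase();
  return essence === 'application/pdf';
}

// Hypothetical guard run before sending the downloaded content to the API.
// `blob` is assumed to be a Blob-like object exposing a `type` string.
function shouldUpload(blob) {
  return Boolean(blob) && isPdfMimeType(blob.type);
}

// An HTML error page (such as the "Cannot redisplay" page) is rejected,
// while a real PDF response passes.
const htmlErrorPage = { type: 'text/html; charset=UTF-8' };
const pdfDocument = { type: 'application/pdf' };
const uploadDecisions = [shouldUpload(htmlErrorPage), shouldUpload(pdfDocument)];
```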

Here are GIFs showing the error message in different browsers:

  • Chrome:

Screen Recording 2024-10-01 at 7 10 16 PM

  • Firefox:

Screen Recording 2024-10-01 at 7 11 44 PM

  • Safari:

Screen Recording 2024-10-01 at 7 13 51 PM

Labels
None yet
Projects
Status: 🔎In Review
Development

Successfully merging this pull request may close these issues.

Incorrectly identified split pages
1 participant