feat(pacer): Refine multi-document page handling logic #402
+359
−37
Key changes:
- Refines the `handleCombinedPdfPageView` (appellate) and `handleCombinedPDFView` (district) methods to accurately identify multi-document pages that contain only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within `center` elements, and the number of these elements corresponds to the number of files in the combined PDF. Both methods now count the `center` nodes to determine whether a warning should be displayed. In appellate pages, an additional filter ensures accurate counting, because `center` elements may also wrap the page's main content.
- In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this, this PR introduces two helper methods that use the URL of the PACER page and the existing `DocToCases` mapping stored in our local storage:
  - District court URLs frequently contain a query parameter named `exclude_attachments`. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the `DocToCases` mapping, we can identify the missing document ID. This PR introduces the `getPacerDocIdFromExcludeList` helper function, which takes a list of excluded document IDs as input and returns the corresponding document ID based on the `DocToCases` mapping.
  - Appellate court URLs often include a query parameter named `dls`. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the `DocToCases` mapping based on this list, we can determine the document ID. The `getPacerDocIdFromPartialId` method implements this filtering, taking the partial document ID as input and returning the matching document ID.
- Introduces a new utility function, `parseDataFromReceiptTable`, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, the title lacks the information needed to identify the document in multi-document pages. To address this limitation, the function extracts data directly from the receipt table, which is a more reliable and comprehensive approach.
- Integrates all helper functions into the `handleCombinedPdfPageView` (appellate) and `handleCombinedPDFView` (district) methods. This enables us to insert banners for available documents and upload the PDFs to the RECAP Archive.

[GIFs showing how the extension works in appellate and district courts]
Fixes freelawproject/recap#349
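
For reviewers, the two URL-based lookups described under "Key changes" could be sketched roughly as below. This is an illustration only, not the code in this PR: the function names (`parseIdList`, `findDocIdByPartial`, `findDocIdNotExcluded`), the example URLs, the sample IDs, and the rule that a shortened ID is a substring of the full document ID are all assumptions made for the sketch. The mapping is passed in as a plain object whose keys are full document IDs, standing in for the `DocToCases` data.

```javascript
// Hypothetical sketch: names and the substring-matching rule are assumptions,
// not the helpers introduced by this PR.

// Read a comma-separated query parameter and return its non-empty entries.
function parseIdList(url, paramName) {
  const raw = new URL(url).searchParams.get(paramName) || '';
  return raw.split(',').map((id) => id.trim()).filter(Boolean);
}

// Appellate-style lookup (`dls`): return the first known full ID that
// contains the given partial ID, or undefined if none matches.
function findDocIdByPartial(docsToCases, partialId) {
  return Object.keys(docsToCases).find((id) => id.includes(partialId));
}

// District-style lookup (`exclude_attachments`): return the single known ID
// that no excluded partial ID matches, or undefined if the result is ambiguous.
function findDocIdNotExcluded(docsToCases, excluded) {
  const remaining = Object.keys(docsToCases).filter(
    (id) => !excluded.some((partial) => id.includes(partial))
  );
  return remaining.length === 1 ? remaining[0] : undefined;
}

// Made-up sample data: full document IDs mapped to a case ID.
const sampleMapping = { '03501111': '100', '03502222': '100', '03503333': '100' };

const excludedIds = parseIdList(
  'https://example.com/doc?exclude_attachments=2222,3333',
  'exclude_attachments'
);
console.log(findDocIdNotExcluded(sampleMapping, excludedIds)); // 03501111

const includedIds = parseIdList('https://example.com/doc?dls=1111', 'dls');
console.log(findDocIdByPartial(sampleMapping, includedIds[0])); // 03501111
```

Returning `undefined` when more than one candidate remains is a deliberate choice in this sketch: a wrong document ID would attach the upload to the wrong docket entry, so skipping the lookup seems safer than guessing.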