612 Introduced Percolator Recap Search Alerts #4200

Open
wants to merge 30 commits into
base: 612-introduced-recap-search-alerts

@albertisfu albertisfu commented Jul 13, 2024

This PR introduces the Percolator approach for RECAP Search Alerts as planned in #612

It works as follows:

  • Dockets and RECAPDocuments are percolated into the RECAPPercolator index to match alerts. Documents are percolated via their indexing signals:
    • Docket:
      • On creation/update
      • When related BankruptcyInformation is added/updated
      • When related parties are added/updated. In this case, it is not via signals since parties use their own method to update the DocketDocument in ES, index_docket_parties_in_es, so percolation is included within this method.
      • As discussed, for now, when related BankruptcyInformation or Parties are percolated, there will be a previous percolation for the same docket triggered when the Docket is saved. A future performance improvement would be to avoid percolating the Docket if we know we will add BankruptcyInformation or Parties, so it gets percolated only once this additional data is added.
    • RECAPDocument creation/update:
      • Considering a RECAPDocument is always created/updated after its related DocketEntry is added/updated, we don't percolate on DocketEntry changes and wait for the RECAPDocument to be added/updated.
  • Docket documents use the same approach as OA percolation, referring to the document by its ID from the main index where it was indexed.
  • For RECAPDocuments, the approach is different. We need to percolate a plain version of the RECAPDocument that contains Parties and other docket fields required to match and render alerts. To build this plain document, the new ESRECAPDocumentPlain mapping is used. The resultant dict is percolated into RECAPPercolator instead of referring to a document.
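The two percolation styles described above (by reference for Dockets, inline document for RECAPDocuments) can be sketched with the Elasticsearch percolate query DSL. This is only an illustration: the field name `percolator_query` and the index name used in the usage note are assumptions, not names confirmed by this PR.

```python
from typing import Any


def percolate_by_reference(
    index: str, doc_id: str, field: str = "percolator_query"
) -> dict[str, Any]:
    """Percolate a document already stored in `index`, referring to it by ID.

    `field` is the percolator field holding the stored alert queries
    (hypothetical name).
    """
    return {"query": {"percolate": {"field": field, "index": index, "id": doc_id}}}


def percolate_inline(
    document: dict[str, Any], field: str = "percolator_query"
) -> dict[str, Any]:
    """Percolate a plain dict that is sent along with the request."""
    return {"query": {"percolate": {"field": field, "document": document}}}
```

For example, `percolate_by_reference("docket_index", "663")` would refer to an indexed Docket, while `percolate_inline({"docket_id": 663, "description": "Motion"})` sends a flattened RECAPDocument directly.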

Other Auxiliary Indices:

To avoid triggering alerts when they shouldn't be triggered, such as when a RECAPDocument is ingested and it matches Docket-only query alerts, we use two auxiliary percolator indices (similar to the sweep approach):

  • RECAPDocumentPercolator: Only contains RECAPDocument fields
  • DocketDocumentPercolator: Only contains Docket fields

It works as follows:

When an alert is matched by the main percolator RECAPPercolator, it uses the auxiliary indices to avoid including a RECAPDocument hit in a Docket-only query alert according to the following boolean table:

| Alert matched in DocketDocumentPercolator | Alert matched in RECAPDocumentPercolator | Trigger Alert | Description |
| False | False | True | AND cross-object queries |
| False | True | True | RD-only queries |
| True | False | False | Docket-only queries |
| True | True | True | OR cross-object queries |
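The boolean table reduces to a single rule: a RECAPDocument hit is suppressed only when the alert matches the Docket-only auxiliary index and does not match the RECAPDocument-only one. A minimal sketch, with a hypothetical function name:

```python
def should_trigger_alert(matched_docket_only: bool, matched_rd_only: bool) -> bool:
    """Decide whether a RECAPDocument hit should trigger the alert.

    Only the (True, False) combination, a Docket-only query, is rejected.
    """
    return not (matched_docket_only and not matched_rd_only)
```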

Whenever a new RECAP Search Alert is created, it's indexed into the main RECAPPercolator index and, where applicable, into RECAPDocumentPercolator and DocketDocumentPercolator. Not all alerts are indexed into the auxiliary indices: if an alert contains a Docket field as a filter, it is not indexed into the RECAPDocumentPercolator because this index doesn't contain docket fields, and vice versa for RECAPDocument field filters and DocketDocumentPercolator. This is OK because filters already discard documents that don't contain the field, so the auxiliary indices are only needed to filter on the text query.
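A rough sketch of that routing rule follows. The two field sets are illustrative subsets chosen for the example (an assumption), and the function name is hypothetical; the real list of filter fields lives in the codebase.

```python
# Hypothetical subsets of filter fields, for illustration only.
DOCKET_FILTER_FIELDS = frozenset({"docket_number", "case_name"})
RD_FILTER_FIELDS = frozenset({"description", "document_number"})


def auxiliary_indices_for(filter_fields: set[str]) -> list[str]:
    """Return the auxiliary percolator indices an alert should be stored in."""
    targets = []
    # A Docket filter cannot be evaluated in an index without Docket fields.
    if not filter_fields & DOCKET_FILTER_FIELDS:
        targets.append("RECAPDocumentPercolator")
    # Likewise for RECAPDocument filters and the Docket-only index.
    if not filter_fields & RD_FILTER_FIELDS:
        targets.append("DocketDocumentPercolator")
    return targets
```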

Since the Percolator doesn't support parent-child queries, queries are stored in the percolator as plain. A new method, build_plain_percolator_query, was created to transform parent-child queries to plain.
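To illustrate the idea of turning a parent-child query into a plain one, here is a simplified sketch that unwraps Elasticsearch `has_child` clauses recursively. It is not the actual build_plain_percolator_query implementation; the function name and the child type in the usage note are assumptions.

```python
from typing import Any


def flatten_has_child(query: Any) -> Any:
    """Replace every {"has_child": {..., "query": q}} clause with q itself."""
    if isinstance(query, list):
        return [flatten_has_child(item) for item in query]
    if isinstance(query, dict):
        if set(query) == {"has_child"}:
            # Keep only the inner query so it runs against a flat document.
            return flatten_has_child(query["has_child"]["query"])
        return {key: flatten_has_child(value) for key, value in query.items()}
    return query
```

For instance, a `bool` query whose `should` list holds a Docket `match` clause and a `has_child` clause of type `recap_document` becomes a `bool` query with two sibling `match` clauses.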

To prevent a document from triggering the same alert more than once, similar to the sweep index approach, we use two Redis sets per alert:

  • alert_hits:id.d stores Dockets that have triggered an alert.
  • alert_hits:id.r stores RECAPDocuments that have triggered the alert.

These sets are checked before a document triggers an alert and updated after it does.
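A minimal sketch of this check-then-update pattern, assuming `id` in the key names is the alert ID. The helper names are hypothetical, and `InMemorySets` is a stand-in that mimics only the `sismember` and `sadd` commands of a Redis client so the example runs without a server.

```python
class InMemorySets:
    """Tiny stand-in for a Redis client (only sismember and sadd)."""

    def __init__(self) -> None:
        self._sets: dict[str, set] = {}

    def sismember(self, key: str, member: int) -> bool:
        return member in self._sets.get(key, set())

    def sadd(self, key: str, member: int) -> int:
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.add(member)
        return len(members) - before  # number of newly added members


def hits_key(alert_id: int, is_docket: bool) -> str:
    """Build the set key: suffix .d for Dockets, .r for RECAPDocuments."""
    return f"alert_hits:{alert_id}.{'d' if is_docket else 'r'}"


def register_hit(client, alert_id: int, object_id: int, is_docket: bool) -> bool:
    """Return True if the object had not triggered this alert before."""
    key = hits_key(alert_id, is_docket)
    if client.sismember(key, object_id):
        return False
    client.sadd(key, object_id)
    return True
```

Since SADD reports how many members were newly added, the check and the update could also be collapsed into a single `sadd` call when the hit is recorded at match time.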

Grouping Alert Emails:

To avoid triggering one email for each alert matched by a document ingestion, as currently happens in OA, RECAP alert hits for all rates (including RT) are stored using the ScheduledAlertHit model. RT alerts are sent every 5 minutes by the new daemon cl_send_rt_percolator_alerts. Before sending, hits are grouped per user and matched alert, so if multiple Dockets or RECAPDocuments match an alert, they're grouped and nested within the same alert and Docket (in the case of RECAPDocuments). Other alert rates are sent according to their rate via the cl_send_scheduled_alerts command.
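The grouping step can be pictured as a nested mapping of user, then alert, then Docket, then RECAPDocument IDs. The `Hit` dataclass and `group_hits` function below are hypothetical stand-ins for illustration, not the ScheduledAlertHit model or the daemon's code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    user_id: int
    alert_id: int
    docket_id: int
    rd_id: Optional[int] = None  # None means the Docket itself matched


def group_hits(hits: list[Hit]) -> dict[int, dict[int, dict[int, list[int]]]]:
    """Group hits as {user_id: {alert_id: {docket_id: [rd_id, ...]}}}."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for hit in hits:
        nested_rds = grouped[hit.user_id][hit.alert_id][hit.docket_id]
        if hit.rd_id is not None:
            nested_rds.append(hit.rd_id)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {
        user: {alert: dict(dockets) for alert, dockets in alerts.items()}
        for user, alerts in grouped.items()
    }
```

Each top-level key would then correspond to one email, with one section per alert and the matching RECAPDocuments nested under their Docket.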

To limit the number of ScheduledAlertHit that can be stored, I added a content_object field to the ScheduledAlertHit model. This allows easy querying and counting of scheduled alert hits by its model, limiting the number of hits an alert can have (20) and the number of nested RECAPDocuments a hit can have (5).

For webhooks, there is no limit; all matched hits will be sent. I'll open a different issue to add a rate-limit or throttling to webhooks as we discussed.

Webhooks:

Webhooks for all rates are always triggered in real time as alerts are matched by the percolator. The serializer used in RECAP Search Alerts webhooks is RECAPESResultSerializer, the same one used in the V4 RECAP Search API, which supports RECAPDocuments nested within the Docket.

Highlights:

RECAP Search Alerts, both in emails and webhooks, highlight the same fields as the front end, for both the Docket and nested RECAPDocuments.

Screenshots and Examples:

Email with multiple alerts and grouping applied:
[Two screenshots, dated 2024-07-25; images not reproduced here]

Webhook with a nested document and highlighting (HL):

{
   "payload":{
      "alert":{
         "id":866,
         "name":"Test Alert Cross-object",
         "rate":"rt",
         "user":344,
         "query":"q=\"File Amicus Curiae\" AND \"Motion to File 1\" AND \"plain text lorem\" AND \"410 Civil\" AND id:531&docket_number=1:21-bk-123&case_name=\"SUBPOENAS SERVED CASE\"&type=r",
         "alert_type":"r",
         "secret_key":"sCN6rjYMVD6HChMhxvzL3lCIiuMFaWlUJm5x2Wth",
         "date_created":"2024-07-25T14:15:08.083431-07:00",
         "date_last_hit":"None",
         "date_modified":"2024-07-25T14:15:08.083442-07:00"
      },
      "results":[
         {
            "firm":[
               
            ],
            "meta":{
               "timestamp":"2024-07-25T21:15:08.274909Z",
               "date_created":"2024-07-25T21:15:07.839735Z"
            },
            "cause":"<strong>410 Civil</strong>",
            "court":"Superior court for the dragons",
            "party":[
               
            ],
            "chapter":"None",
            "firm_id":[
               
            ],
            "attorney":[
               
            ],
            "caseName":"<strong>SUBPOENAS SERVED CASE</strong>",
            "court_id":"canb",
            "party_id":[
               
            ],
            "dateFiled":"None",
            "docket_id":663,
            "assignedTo":"None",
            "dateArgued":"1972-05-21",
            "juryDemand":"",
            "referredTo":"None",
            "suitNature":"",
            "attorney_id":[
               
            ],
            "trustee_str":"None",
            "docketNumber":"<strong>1:21-bk-123</strong>",
            "pacer_case_id":"242568",
            "assigned_to_id":"None",
            "case_name_full":"Stephenson and Sons, Stephens, Lowery and Beck, Duke Ltd, Adkins, Price and Stevens, and Williams and Sons v. Mark Kelly, Michael Guzman, Anthony Hansen, Gerald Tate, and Maria Vazquez",
            "dateTerminated":"None",
            "referred_to_id":"None",
            "recap_documents":[
               {
                  "id":531,
                  "meta":{
                     "timestamp":"2024-07-25T21:15:08.274909Z",
                     "date_created":"2024-07-25T21:15:07.839735Z"
                  },
                  "cites":[
                     
                  ],
                  "snippet":"<strong>plain text lorem</strong>",
                  "page_count":"None",
                  "description":"MOTION for Leave to <strong>File Amicus Curiae</strong> Lorem Served",
                  "absolute_url":"/docket/663/1/subpoenas-served-case/",
                  "entry_number":1,
                  "is_available":false,
                  "pacer_doc_id":"01803665981",
                  "document_type":"PACER Document",
                  "filepath_local":"None",
                  "docket_entry_id":269,
                  "document_number":1,
                  "entry_date_filed":"2024-08-19",
                  "attachment_number":"None",
                  "short_description":"<strong>Motion to File 1</strong>"
               }
            ],
            "jurisdictionType":"",
            "docket_absolute_url":"/docket/663/subpoenas-served-case/",
            "court_citation_string":"SCOTUS"
         }
      ]
   },
   "webhook":{
      "version":1,
      "event_type":2,
      "date_created":"2024-07-25T21:15:06.968890+00:00",
      "deprecation_date":"None"
   }
}
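On the receiving side, a consumer might walk a payload of this shape as sketched below. The function names are hypothetical; only keys that appear in the example above (`payload`, `results`, `caseName`, `recap_documents`, `id`) are used.

```python
import json
import re

STRONG_TAG = re.compile(r"</?strong>")


def strip_highlight(value: str) -> str:
    """Remove the <strong> highlight markers from a field value."""
    return STRONG_TAG.sub("", value)


def summarize_webhook(body: str) -> list[tuple[str, list[int]]]:
    """Return (case name, nested RECAPDocument IDs) for each result."""
    payload = json.loads(body)["payload"]
    return [
        (
            strip_highlight(result["caseName"]),
            [rd["id"] for rd in result["recap_documents"]],
        )
        for result in payload["results"]
    ]
```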

Oral Arguments Alerts:

They'll continue as usual after this PR is merged. I'll open a different issue to apply grouping to OA Search Alerts.

Old alert tasks

I duplicated the Celery tasks and related methods used by OA so that already-scheduled alerts don't fail once this is deployed. A couple of days after deployment, we could remove those tasks.

Finally, I added a new setting, PERCOLATOR_SEARCH_ALERTS_ENABLED, which is useful to avoid conflicts in tests that create RECAP-related documents when the Percolator indices do not exist. This setting is only enabled in Alert tests. Once we are ready to start percolating RECAP documents, we should set it to True.

Additionally, we'll need to create the following indices manually:

  • RECAPPercolator.init()
  • RECAPDocumentPercolator.init()
  • DocketDocumentPercolator.init()

Let me know what you think.

@albertisfu albertisfu changed the base branch from main to 612-introduced-recap-search-alerts July 13, 2024 03:31

semgrep-app bot commented Jul 13, 2024

Semgrep found 12 baseclass-attribute-override findings:

Class RECAPPercolator inherits from both DocketDocument and ESRECAPDocument which both have a method named prepare_trustee_str; one of these methods will be overwritten.

Ignore this finding from baseclass-attribute-override.

Semgrep found 5 template-unescaped-with-safe findings:

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

Semgrep found 6 avoid-query-set-extra findings:

QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 6570ca5 to 32f8ca9 Compare July 17, 2024 02:04

CLAassistant commented Jul 19, 2024

CLA assistant check
All committers have signed the CLA.

@albertisfu albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 13a604d to 8373409 Compare July 19, 2024 15:43
@albertisfu albertisfu marked this pull request as ready for review July 25, 2024 21:36
@albertisfu albertisfu requested a review from mlissner July 25, 2024 21:36
@albertisfu
Contributor Author

When adding grouping for OA Search Alerts, I noticed that I had forgotten to skip sending RT email alerts to non-members for RECAP Alerts.

I've fixed this issue and tweaked the tests.

Note that webhooks in RECAP, as in OA, are always triggered regardless of the user's membership status.

Additionally, I noticed an error in ES (document_parsing_exception) when indexing queries that use advanced syntax with fields that are not contained in the auxiliary indices. This error is not being caught by tests, so I will analyze whether it is an issue for filtering queries and, if so, look for an alternative approach.

@albertisfu
Contributor Author

The issue related to the document_parsing_exception error is now solved.

Tests did not catch it because the test query that triggered the error:
q="405 Civil" AND pacer_doc_id:018036652436&type=r

fell into the following case in the auxiliary percolator queries boolean table:

| Alert matched in DocketDocumentPercolator | Alert matched in RECAPDocumentPercolator | Trigger Alert | Description |
| False | False | True | AND cross-object queries |

As a result, neither of the auxiliary queries needed to return results.

The test that could have failed was:
q="plain text for 018036652436" OR cause:"405 Civil"

But it didn't fail because the alert indexing issue only affected advanced queries that use a field with an unquoted value, such as pacer_doc_id:018036652436. If the field value was quoted, as in cause:"405 Civil", the ES parser accepted it.

So I updated that test to q="018036652436" OR cause:405&type=r.

With that change, I could confirm that this indexing issue for some edge-case alerts could affect alert filtering.

The solution:

After analyzing the problem, I determined that the best approach to solve this issue was to change the strategy for the auxiliary percolator queries. Instead of creating two additional Percolator indices to save RECAPDocument-only queries and Docket-only queries and percolating the full document, it's better to use a single Percolator index.
This is the same index where the full alerts are stored. For the auxiliary percolator queries, instead of percolating the full document, we can percolate one document with only child fields and another with only parent fields.
This way, the auxiliary queries can tell us whether a matched alert is a Docket-only query, allowing us to filter out RECAPDocuments.

This approach also simplifies the code, as we no longer need the additional RECAPDocumentPercolator and DocketDocumentPercolator indices, so they have been removed.
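As a sketch of the single-index strategy, the three percolate requests could be derived from one plain document like this. The field sets are illustrative subsets taken from the webhook example, and the function name and `percolator_query` field are assumptions rather than the PR's actual code.

```python
from typing import Any

# Illustrative subsets of parent (Docket) and child (RECAPDocument) fields.
DOCKET_FIELDS = frozenset({"caseName", "docketNumber", "cause", "court_id"})
RD_FIELDS = frozenset(
    {"description", "short_description", "document_number", "pacer_doc_id"}
)


def build_percolate_bodies(
    plain_document: dict[str, Any], field: str = "percolator_query"
) -> dict[str, dict[str, Any]]:
    """Build the main and the two auxiliary percolate queries for one index."""

    def body(document: dict[str, Any]) -> dict[str, Any]:
        return {"query": {"percolate": {"field": field, "document": document}}}

    docket_only = {k: v for k, v in plain_document.items() if k in DOCKET_FIELDS}
    rd_only = {k: v for k, v in plain_document.items() if k in RD_FIELDS}
    return {
        "main": body(plain_document),
        "docket_only": body(docket_only),
        "rd_only": body(rd_only),
    }
```

All three bodies target the same percolator index; an alert that matches `docket_only` but not `rd_only` would be treated as a Docket-only query and skipped for RECAPDocument hits.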

Labels
None yet
Projects
Status: 👀 In review
Status: RECAP Alerts
2 participants