-
-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
612 Introduced Percolator Recap Search Alerts #4200
base: 612-introduced-recap-search-alerts
Are you sure you want to change the base?
612 Introduced Percolator Recap Search Alerts #4200
Conversation
Semgrep found 12
Class RECAPPercolator inherits from both Semgrep found 5
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability. Ignore this finding from template-unescaped-with-safe.Semgrep found 6
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as |
- Group and send RT RECAP Search Alerts.
6570ca5
to
32f8ca9
Compare
- Fixed parent highlighting for cross-object hits
…proach. - Disabled percolation for all tests except the specific ones for alerts
13a604d
to
8373409
Compare
- Avoid indexing RECAP alerts in this test where the Percolator index doesn't exist.
- Including an integration sweep-percolator test
…olator indices. - Improved merge_alert_child_documents logic - Added sql migration file
…avoid issues with OA scheduled tasks - These methods and tasks can be removed a few days after rolling them out.
…ch-alerts-percolator
When adding grouping for OA Search Alerts, I noticed that I missed omitting the sending of RT email alerts for non-members for RECAP Alerts. I've fixed this issue and tweaked the tests. Note that webhooks in RECAP as in OA are always triggered regardless of the user's membership status. Additionally, I noticed an error in ES ( |
The issue related to the The reason it was not being caught by tests was because the test query that was triggering the error: Fell into the following case in the auxiliary percolator queries boolean table:
As a result, none of the auxiliary queries required returning results. The test that could have failed was: But it didn't fail because the alert indexing issue was only related to advanced queries using a field and its value without quotes So I updated that test to And I could confirm that this issue indexing some edge cases alerts could impact Alerts filtering. The solution:After analyzing the problem, I determined that the best approach to solve this issue was to change the strategy regarding auxiliary percolator queries. Instead of creating two additional Percolator indices to save RECAPDocument-only queries and Docket-only queries and percolating the full document, it's better to use a single Percolator index. This approach also simplifies the code, as we no longer need the additional |
This PR introduces the Percolator approach for RECAP Search Alerts as planed in #612
It works as follows:
index_docket_parties_in_es
, so percolation is included within this method.DocketEntry
is added/updated, we don't percolate on DocketEntry changes and wait for theRECAPDocument
to be added/updated.ESRECAPDocumentPlain
mapping is used. The resultant dict is percolated intoRECAPPercolator
instead of referring to a document.Other Auxiliary Indices:
To avoid triggering alerts when they shouldn't be triggered, such as when a RECAPDocument is ingested and it matches Docket-only query alerts, we use two auxiliary percolator indices (similar to the sweep approach):
RECAPDocumentPercolator
: Only contains RECAPDocument fieldsDocketDocumentPercolator
: Only contains Docket fieldsIt works as follows:
When an alert is matched by the main percolator RECAPPercolator, it uses the auxiliary indices to avoid including a RECAPDocument hit in a Docket-only query alert according to the following boolean table:
Whenever a new RECAP search Alert is created, it's indexed into the main
RECAPPercolator
index and also intoRECAPDocumentPercolator
andDocketDocumentPercolator
. However, not all alerts are indexed into the auxiliary indices. If an alert contains a Docket field as a filter, it is not indexed into the RECAPDocumentPercolator because this index doesn't contain docket fields, and vice versa for RECAPDocument field filters andDocketDocumentPercolator
. This is ok because filters already discard documents that don't contain the field, so auxiliary indices are required for filtering out the text query.Since the Percolator doesn't support parent-child queries, queries are stored in the percolator as plain. A new method,
build_plain_percolator_query
, was created to transform parent-child queries to plain.To prevent a document from triggering the same alert more than once, similar to the sweep index approach, we use a Redis set:
alert_hits:id.d
stores Dockets that have triggered an alert.alert_hits:id.r
stores RECAPDocuments that have triggered the alert.These sets are checked and updated before and after an alert is matched by the document.
Grouping Alert Emails:
To avoid triggering one email per each alert matched by a document ingestion, as it currently works in OA, RECAP alert hits for all rates (including RT) are stored using the
ScheduledAlertHit
. RT alerts are sent every 5 minutes using the new daemoncl_send_rt_percolator_alerts
. Before sending them, hits are grouped per user and alert matched, so if multiple dockets or RECAPDocuments match an alert, they're grouped and nested within the same alert and Docket (in the case of RECAPDocuments). Other alert rates are sent according to their rate via thecl_send_scheduled_alerts
command.To limit the number of
ScheduledAlertHit
that can be stored, I added acontent_object
field to theScheduledAlertHit
model. This allows easy querying and counting of scheduled alert hits by its model, limiting the number of hits an alert can have (20) and the number of nested RECAPDocuments a hit can have (5).For webhooks, there is no limit; all matched hits will be sent. I'll open a different issue to add a rate-limit or throttling to webhooks as we discussed.
Webhooks:
Webhooks for all rates are always triggered in real-time as alerts are matched by the percolator. The serializer used in RECAP Search Alerts webhooks is
RECAPESResultSerializer
, the same used in V4 RECAP Search API, supporting nested RECAPDocuments into the Docket.Highlights:
RECAP Search alerts, both in emails and webhooks, support the same fields highlighted in the front end, either for the Docket or nested RECAPDocuments.
Screenshots and Examples:
Email with multiple alerts and grouping applied:
Webhook with nested document and HL.
Oral Arguments Alerts:
They'll continue as usual after this PR is merged. I'll open a different issue to apply grouping to OA Search Alerts.
Old alert tasks
I duplicated celery tasks and related methods used by OA. So we can prevent that alerts scheduled fail once this is deployed. After a couple of days of this is deployed, we could remove those tasks.
Finally, I added a new setting:
PERCOLATOR_SEARCH_ALERTS_ENABLED
, which is useful to avoid conflicts in tests that create RECAP-related documents but the Percolator indices do not exist. This setting is only enabled in Alert tests. Once we are ready to start percolating RECAP documents, we should set this setting toTrue
.Additionally, we'll need to create the following indices manually:
RECAPPercolator.init()
RECAPDocumentPercolator.init()
DocketDocumentPercolator.init()
Let me know what do you think.