-
-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
612 Introduced RECAP Search Alerts sweep index #4127
base: main
Are you sure you want to change the base?
Conversation
- Test filter out queries and hits based on fields that matched.
for more information, see https://pre-commit.ci
- Added tests to assert nested child documents in case alerts.
{% endif %} | ||
{% if doc.plain_text %} | ||
{% contains_highlights doc.plain_text.0 True as highlighted %} | ||
<span style="display: block; margin-top: 5px;">{% if highlighted %}… {% endif %}{{ doc.plain_text|render_string_or_list|safe|underscore_to_space }} …</span> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.<a href="https://www.courtlistener.com{% if doc.absolute_url %}{{ doc.absolute_url }}{% else %}{{ result.docket_absolute_url }}#minute-entry-{{ doc.docket_entry_id }}{% endif %}" class="visitable">{% if doc.short_description %}{{ doc.short_description|render_string_or_list|safe }}<span class="gray"> — </span>{% endif %}Document #{% if doc.document_number %}{{ doc.document_number }}{% endif %}{% if doc.attachment_number %}, Attachment #{{ doc.attachment_number }}{% endif %} | ||
</a> | ||
{% if doc.description %} | ||
<span style="display: block; margin-top: 5px;">Description: {{ doc.description|render_string_or_list|safe }}</span> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.{% for doc in result.child_docs %} | ||
{% with doc=doc|get_attrdict:"_source" %} | ||
<li> | ||
<a href="https://www.courtlistener.com{% if doc.absolute_url %}{{ doc.absolute_url }}{% else %}{{ result.docket_absolute_url }}#minute-entry-{{ doc.docket_entry_id }}{% endif %}" class="visitable">{% if doc.short_description %}{{ doc.short_description|render_string_or_list|safe }}<span class="gray"> — </span>{% endif %}Document #{% if doc.document_number %}{{ doc.document_number }}{% endif %}{% if doc.attachment_number %}, Attachment #{{ doc.attachment_number }}{% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.{{ forloop.counter }}. {{ result|get_highlight:"caseName"|safe }} | ||
({% if result.court_id != 'scotus' %}{{ result|get_highlight:"court_citation_string"|nbsp|safe }} {% endif %}{% if type == 'o' %}{{ result.dateFiled|date:"Y" }}{% elif type == 'oa' %}{{ result.dateArgued|date:"Y" }}{% endif %}) | ||
({% if result.court_id != 'scotus' %}{{ result|get_highlight:"court_citation_string"|nbsp|safe }} {% endif %}{% if type == 'o' %}{{ result.dateFiled|date:"Y" }}{% elif type == 'oa' %}{{ result.dateArgued|date:"Y" }}{% elif type == 'r' %}{{ result.dateFiled|date:"Y" }}{% endif %}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.
Semgrep found 1 Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability. Ignore this finding from template-unescaped-with-safe. |
…ing the Re Index API
- Enabled RECAP Search alerts UI behind a waffle. - Added alert frequency estimation for RECAP
Semgrep found 6
Class RECAPSweepDocument inherits from both |
…ets + RD hits - Fixed RECAP MLY and WLY scheduled alerts content.
…dex. - Ensure document timestamps get updated on partial updates.
- Fixed email templates - Refactored retrieve_task_info
…g extra Docket-only and RD-only queries.
child_search = child_search.extra( | ||
from_=0, | ||
size=settings.SCHEDULED_ALERT_HITS_LIMIT | ||
* settings.RECAP_CHILD_HITS_PER_RESULT, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
parent_search = parent_search.extra( | ||
from_=0, size=settings.SCHEDULED_ALERT_HITS_LIMIT | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
- Fixed issues and improved command resumability
I’ve added the solution to the issue I found while working on the Percolator approach. The problem was as follows: The previous approach to determine whether an alert was a Docket-only query or a RECAPDocument-only/Cross-object alert query relied on two checks:
If no RD fields were found in the query, it was classified as a possible "Docket-only" query.
The theory behind this was that if an alert contained an RD field as a filter or a fielded text query, like:
or:
It would always match a Docket with at least one RECAPDocument. Therefore, these types of queries couldn’t be considered Docket-only queries, which is true for filters since they're always combined internally with an However, I realized that queries like the following:
can match either only empty Dockets or Dockets + RDs. For instance: So, these kinds of queries can't be considered as RECAPDocument-only alerts or cross-object alerts since they partially behave as Docket-only queries. This issue can only occur with simple text queries that could match either a Docket or a RECAPDocument field, for instance:
https://www.courtlistener.com/?q=%22United+States%22&type=r&order_by=score+desc& where it can match either Dockets with no documents or Dockets with RECAPDocuments due to containing the search terms within one of their searchable fields. So, the method of using The problems with not correctly classifying Docket-only hits versus RECAPDocument-only or Cross-object hits were two:
The solution.Solution applied was to perform two additional queries alongside the main query: one that can only match Dockets (Docket-only query) and one that can only match RECAPDocuments (RECAPDocument-only query). The additional Docket-only query is performed against the main sweep index but only targets Docket documents, and the results in this query only retrieve the The second additional query, the RECAP-only one, is performed against a different new index called The re-index process also changed. In addition to the main re-index, an additional re-index is performed, which copies only RECAPDocuments indexed or modified during the day to the During the alert query process, three queries are sent to ES in a single request: one for the main query, one for the Docket-only query, and one for the RECAPDocument-only query. The latter two are faster since they're not join queries and only retrieve the The results of these two additional queries are used as follows: For each docket hit returned by the main query, it checks if the docket hit ID is also contained within the docket-only query results. If so, it's a possible docket-only hit. Then, as a second step, RECAPDocuments matched are filtered using the results returned by the RECAPDocument-only query. Each RECAPDocument hit ID is checked if it’s included within the RECAPDocument-only query results; if so, it should be included within the results. If, after performing the previous filters, no RDs remain in the docket hit, it's considered a Docket-only hit, and no RECAPDocuments are nested within the hit. Also, if a docket returned by the main query is not found in the docket-only query results, it's directly classified as a RECAPDocument-only hit or a cross-object hit, and the additional corresponding filters are applied. So following this new approach solved the issue, and now the sweep index approach is more reliable and will trigger alerts similarly to the Percolator approach. |
@@ -1826,3 +1831,24 @@ def prepare_non_participating_judge_ids(self, instance): | |||
|
|||
def prepare_cluster_child(self, instance): | |||
return "opinion_cluster" | |||
|
|||
|
|||
class RECAPSweepDocument(DocketDocument, ESRECAPDocument): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class RECAPSweepDocument inherits from both DocketDocument
and ESRECAPDocument
which both have a method named $F
; one of these methods will be overwritten.
…due to scheduled task - This can be removed after tasks in the queue have been processed.
fc5720e
to
d102664
Compare
…CAP_CHILD_HITS_PER_RESULT value
child_total_query = child_docs_count_query.extra( | ||
size=0, track_total_hits=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
main_doc_count_query = main_doc_count_query.extra( | ||
size=0, track_total_hits=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
…ies returning values change
…n HL to filter out RD hits
Just confirming that the RECAP Search Alerts UI will be controlled by |
request = context["request"] | ||
return ( | ||
search_type == SEARCH_TYPES.OPINION | ||
or search_type == SEARCH_TYPES.ORAL_ARGUMENT | ||
or ( | ||
search_type == SEARCH_TYPES.RECAP | ||
and waffle.flag_is_active(request, "recap-alerts-active") | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we can refactor this code to make it more readable.
request = context["request"] | |
return ( | |
search_type == SEARCH_TYPES.OPINION | |
or search_type == SEARCH_TYPES.ORAL_ARGUMENT | |
or ( | |
search_type == SEARCH_TYPES.RECAP | |
and waffle.flag_is_active(request, "recap-alerts-active") | |
) | |
) | |
request = context["request"] | |
if search_type == SEARCH_TYPES.RECAP: | |
return waffle.flag_is_active(request, "recap-alerts-active") | |
return search_type in (SEARCH_TYPES.OPINION, SEARCH_TYPES.ORAL_ARGUMENT) |
from cl.lib.types import CleanData | ||
from cl.search.constants import ( | ||
ALERTS_HL_TAG, | ||
SEARCH_RECAP_CHILD_HL_FIELDS, | ||
recap_document_filters, | ||
recap_document_indexed_fields, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using any of these elements. Let's remove them to clean up the imports.
recap_document_indexed_fields = [ | ||
"id", | ||
"docket_entry_id", | ||
"description", | ||
"entry_number", | ||
"entry_date_filed", | ||
"short_description", | ||
"document_type", | ||
"document_number", | ||
"pacer_doc_id", | ||
"plain_text", | ||
"attachment_number", | ||
"is_available", | ||
"page_count", | ||
"cites", | ||
] | ||
|
||
recap_document_filters = [ | ||
"available_only", | ||
"description", | ||
"document_number", | ||
"attachment_number", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These arrays are unused. They're likely remnants of an older approach. We can remove them.
@@ -1073,22 +1076,25 @@ def build_es_base_query( | |||
cd: CleanData, | |||
child_highlighting: bool = True, | |||
api_version: Literal["v3", "v4"] | None = None, | |||
) -> tuple[Search, QueryString | None]: | |||
alerts: bool = False, | |||
) -> tuple[Search, QueryString | None, QueryString | None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use a dataclass instead of a tuple as the return type for this helper method. During my initial review of the PR, I consistently assumed the parent query was the second element in the tuple. I only realized the correct order (child query, parent query) after referring to the docstring 😅
Also, the addition of a new element to the tuple has increased the number of _
variables used to discard unwanted values.
This PR introduces the Sweep index approach (discussed in #612) to send RECAP Search Alerts that might have been missed by the percolator approach during the day.
The
cl_send_recap_alerts
command will perform the following tasks:Remove the
RECAPSweepDocument
index from the previous day and recreate it to get a clean index (I'll create a follow-up issue to apply the logic to store the last 7-14 days indices for debugging purposes).Index all documents added/changed during the day from the main RECAP index to the
RECAPSweepDocument
. The indexing process uses the ES re-index API with a custom query for efficiency. The documents included in the re-index are:This ensures that every document that should be included in the day's alerts is indexed.
Considering that the re-index process can take considerable time depending on the number of documents, it is scheduled as an ES task. This task retrieves an ID to monitor its progress. Initially, the process will wait for one minute after the task is scheduled. Depending on the task's progress, the estimated waiting time will dynamically change before checking the task status again, repeating the process until the task is completed. If after 10 failed tries of getting the task status (possible due to a ES cluster overhead), the process is aborted and a error is logged so we can take an action manually.
Some variables are stored in Redis to make the command resumable in case of a failure or if the Pod dies:
alert_sweep:re_index_completed
: If the command fails after the re-index process is completed, this step can be skipped when the command is resumed.alert_sweep:query_date
: This stores the date from which we are sending alerts. The idea is that the command can be started during the day we are sending alerts, and in case it is close to 00:00 of the next day, it will still know that the alerts belong to the previous day in the event of a failure occurring after midnight.alert_sweep:task_id
: In case of failure during the ES re-index process, monitoring can continue from the previously scheduled task instead of starting a new one.Send RT and Daily alerts. After the re-index process is completed, Real-time and Daily alerts are sent. The process is the same for both rates:
Get alerts for each user.
Filter out RT alerts for non-members.
Search each query alert against the
RECAPSweepDocument
and retrieve the hits.Each query is limited to retrieving up to
SCHEDULED_ALERT_HITS_LIMIT
hits, defaulting to 20 (the current max number of hits in ES OA Search Alerts).Hits are processed to ensure only the correct hits are included and no hits/alerts are duplicated:
description
orplain_text
.case_name
anddocument_number
filter, or a text query matchingcase_name
andplain_text
simultaneously.The reason to differentiate alert types is to avoid sending alerts incorrectly based on matched content.
In practice, we only need to differentiate Docket-only alerts from RECAP-only or Cross-object alerts. If a hit in an alert doesn't include RD fields in the query or filters and the hit doesn't match RD highlights (Docket-only), we need to ensure the matched Docket was added/updated during the day. This ensures the alert should be triggered and no child hits are included. To confirm that, the Docket
date_modified
must belong to the same day, indicating the Docket was added or modified that day. We want to avoid cases where a Docket is indexed due to one of its RECAPDocuments being updated independently.For RECAP-only and Cross-object alerts, RECAPDocuments are matched as inner hits. The filtering process confirms that the query contains a child field as a filter or within the text query using advanced syntax, or if a child field is highlighted. If true, the child hit is included in the alert.
An additional filter checks if the Docket hit or the RECAPDocument hit has already triggered the same alert. We keep two sets per alert:
alert_hits:id.d
stores Dockets that have triggered an alert.alert_hits:id.r
stores RECAPDocuments that have triggered the alert.For Docket-only alerts, we check if the Docket hit ID is already within
alert_hits:id.d
. If so, the hit is excluded from the alert. For RECAP-only or Cross-object alerts, we check if the RD hits are withinalert_hits:id.r
. Only RDs not in the set are included in the alert. If all RD hits have previously triggered the alert, the hit is omitted from the alert.Finally, after filtering hits and child hits, alert emails are sent, along with their related webhooks.
WLY and MLY rates:
ScheduledAlertHit
to be sent according to their rate by thecl_send_scheduled_alerts
command.The Alerts UI is enabled for RECAP Search behind a waffle flag:
Here are some examples of alert emails:
Docket-only alert: Only the Docket is included in the alert with no child hits.
RECAPDocument-only alert: The docket fields are shown with the RECAPDocument nested below the docket fields.
Cross-object alert: Here we can see how multiple cases are includes in the alert. In this case, the cross-object alert matched a hit by its
case_name
and also matched a RD belonging to the case that included the keywords within the document description. The second hit only matched the case by itscase_name
with no RDs matched.Also you can notice the
View Additional Results
for this case is shown in the first case.This is because the original search matched more
RECAPDocuments
due to thecase_name
being indexed into each RD, which is the behavior in the frontend.In the alert, we filter out the RDs that actually matched the alert.
One question here: should we keep the
View Additional Results
button as in the frontend, or only show it if there are still 5 RDs matched after filtering? It's important to note that when clicking that button in the frontend, more results can be shown since it includes RDs filtered from the alert.Notes and additional questions: