Store error message for streaming job execution in Flint metadata log #433

@dai-chen dai-chen commented Jul 16, 2024

Description

This PR enhances the Flint Spark Index monitor by storing the error messages in the Flint index metadata log. Previously, error messages were not captured in the metadata log, which made it difficult to diagnose issues when they occurred. Specifically, this PR converts the ErrorMessages utility class to Java code and introduces a new method to extract, redact, and truncate error messages.
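The extract, redact, and truncate steps described above could look roughly like the sketch below. This is an illustration only, not the code in this PR: the class name `ErrorMessageSketch`, the `MAX_LENGTH` value, and the 12-digit redaction pattern are all assumptions made for the example.

```java
import java.util.regex.Pattern;

public final class ErrorMessageSketch {
    // Hypothetical limit; the actual maximum length is not stated in this PR description.
    public static final int MAX_LENGTH = 1000;

    // Hypothetical redaction rule: mask anything that looks like a 12-digit account ID.
    private static final Pattern ACCOUNT_ID = Pattern.compile("\\b\\d{12}\\b");

    private ErrorMessageSketch() {}

    /** Walks the cause chain and returns the innermost throwable. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Replaces sensitive-looking substrings with a fixed mask. */
    public static String redact(String message) {
        return ACCOUNT_ID.matcher(message).replaceAll("******");
    }

    /** Cuts the message at maxLength characters and marks the cut with "...". */
    public static String truncate(String message, int maxLength) {
        if (message.length() <= maxLength) {
            return message;
        }
        return message.substring(0, maxLength) + "...";
    }

    /** Extracts the root cause message, then redacts and truncates it. */
    public static String extractErrorMessage(Throwable t) {
        Throwable root = rootCause(t);
        // Fall back to the class name when the root cause carries no message.
        String message = root.getMessage() != null
            ? root.getMessage()
            : root.getClass().getName();
        return truncate(redact(message), MAX_LENGTH);
    }
}
```

Redacting before truncating means a sensitive value cannot survive in partial form at the cut point. A trailing `...` on a shortened message would match the shape of the stored error shown in the example below.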

TODO

  1. Consider merging with the logic below (this may happen after FlintSparkIndexMonitor is moved out of Flint core, or the logic below is moved to core):
  2. Currently, there is a limitation where no error message is stored if the streaming job exits early. In Spark 4.0, StreamingQueryListener provides exceptions, which can help remove this limitation.

Example

Create a Flint index:

CREATE INDEX streaming_error ON glue.default.http_logs (
  clientip
)
WITH (
  auto_refresh = true,
  checkpoint_location = 's3://checkpoint/streaming-error'
);

Change the index setting to block writes to the index:

PUT flint_glue_default_http_logs_streaming_error_index/_block/write

Check the error message in the metadata log (empty before this change):

  ...
        "_source": {
          "version": "1.0",
          "latestId": "ZmxpbnRfZ2x1ZV9kZWZhdWx0X2h0dHBfbG9nc19zdHJlYW1pbmdfZXJyb3JfM19pbmRleA==",
          "type": "flintindexstate",
          "state": "failed",
          "applicationId": "XXXXXX",
          "jobId": "XXXXXX",
          "dataSourceName": "glue",
          "jobStartTime": 1721150179877,
          "lastUpdateTime": 1721150641171,
          "error": """failure in bulk execution:
[0]: index [flint_glue_default_http_logs_streaming_error_index], id [kqSSvJAB69S8YQePXjzu], message [OpenSearchException[OpenSearch exception [type=cluster_block_exception, reason=index [flint_glue_default_http_logs_streaming_error_index] blocked by: [FORBIDDEN/8/index write (api)];]]]
[1]: index [flint_glue_default_http_logs_streaming_error_index], id [k6SSvJAB69S8YQePXjzu], message [OpenSearchException[OpenSearch exception [type=cluster_block_exception, reason=index [flint_glue_default_http_logs_streaming_error_index] blocked by: [FORBIDDEN/8/index write (api)];]]]
[2]: index [flint_glue_default_http_logs_streaming_error_index], id [lKSSvJAB69S8YQePXjzu], message [OpenSearchException[OpenSearch exception [type=cluster_block_exception, reason=index [flint_glue_default_http_logs_streaming_error_index] blocked by: [FORBIDDEN/8/index write (api)];]]]
[3]: index [flint_glue_default_http_logs_streaming_error_index], id [laSSvJAB69S8YQePXjzu], messag..."""
        }
  ...

Issues Resolved

#405

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the enhancement and 0.5 labels Jul 16, 2024
@dai-chen dai-chen self-assigned this Jul 16, 2024
@dai-chen dai-chen marked this pull request as ready for review July 16, 2024 22:55
Comment on lines +54 to +55
return String.format("%s: serviceName=[%s], statusCode=[%d]",
S3ErrorPrefix, e.getServiceName(), e.getStatusCode());

Will have a look.

I don't see how this is directly relevant, and it seems tied to the Log4j API. We may reconsider when we refactor the entire extractRootCause and processQueryException after #435. Thanks!

@dai-chen dai-chen merged commit 43b14f4 into opensearch-project:main Jul 18, 2024
4 checks passed
@dai-chen dai-chen deleted the store-error-message-for-streaming-job-rebased branch July 18, 2024 18:50