Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

Open
JannikBrand opened this issue Apr 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@JannikBrand
Copy link

What is the bug?
After setting up an Alert Monitor which is sending alerts to a custom webhook and verifying that in general the alerts are sent to the webhook successfully, alerts are very often not being sent due to a connection reset issue after some time has passed.

Example: When editing a Monitor there is the possibility to "Send test message". When creating the monitor sending the test message works. After waiting ~30-60 mins it fails with status code 500 and "Failed to send webhook message Connection reset" (which you can see in the browser developer tools and the OpenSearch Dashboard notification popup in that case).
But also if you just wait for the actual alerts being sent by the monitor the often result in this failed state.

When getting the "connection reset" error using the "Send test message" button, the first retry always succeeded.
First idea: Improve / Introduce retry logic in the OpenSearch client when this happens.

It could have something todo with the monitor setup, not sure, I will share the monitor information below.
With this configuration I could reproduce it for OpenSearch 1.3.15 and 2.12.0. It occurred for us when sending alerts to both, MS Teams and Slack as well as for the destination types slack and custom webhook.

Assumption: The issue seems to be on the OpenSearch side since otherwise there would also be the error message "Connection reset by peer".

Respective log of the OpenSearch data node:

[2024-04-12T06:52:05,486][ERROR][o.o.a.d.f.CustomWebhookDestinationFactory] [opensearch-data-1] Exception publishing Message: DestinationType: CUSTOMWEBHOOK, DestinationName:my-teams-channel, Url: https://<I_redacted_it>, scheme: HTTPS, Host: null, Port: -1, Path: null, Method: POST, Message: {
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "themeColor": "0076D7",
    "summary": "My-Summary",
    "sections": [{
        "activityTitle": "OpenSearch Alert",
        "activitySubtitle": "Trigger: test-trigger",
        "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
        "facts": [{
            "name": "Period start",
            "value": "2024-04-12T04:52:05.475Z"
        }, {
            "name": "Period end",
            "value": "2024-04-12T06:52:05.475Z"
        }, {
            "name": "Error",
            "value": ""
        }, {
            "name": "Raw",
            "value": "{monitor={_id=Ow03zY4BedGE4-daHuXs, _version=5, name=Test alerts, enabled=true}, results=[{_shards={total=1, failed=0, successful=1, skipped=0}, hits={hits=[], total={value=2, relation=eq}, max_score=null}, took=3, timed_out=false}], periodStart=2024-04-12T04:52:05.475Z, periodEnd=2024-04-12T06:52:05.475Z, error=null, trigger={id=OQ03zY4BedGE4-daHuXj, name=test-trigger, severity=1, actions=[{name=My Action}]}, alert={acknowledged_time=null, id=Qw04zY4BedGE4-daDuUP, version=-1, end_time=null, error_message=null, last_notification_time=1712897527013, severity=1, start_time=1712839986700, state=ACTIVE, bucket_keys=null, parent_bucket_path=null}}"
        }],
        "markdown": true
    }]
}
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) [alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) [?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
[2024-04-12T06:52:05,490][INFO ][o.o.a.m.MonitorRunResult ] [opensearch-data-1] Internal error: java.net.SocketException: Connection reset. See the opensearch.log for details
java.lang.IllegalStateException: java.net.SocketException: Connection reset
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:36) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) ~[alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) ~[alerting-notification-1.3.15.0.jar:?]
    ... 11 more 

How can one reproduce the bug?
Steps to reproduce the behavior:

  • Prerequisite: Setup a custom webhook you can send alert notifications to. I created a MS Teams channel & webhook as described here. There might be better online tools for testing webhooks.
  1. Create a sample index with 1 document like this:

    PUT sample-index/_doc/<id_placeholder>
    {
      "numberField": 5,
      "my-time": "2024/04/22 12:38:00" //current date
    }
    
  2. Create a destination (in newer versions called "notification channel") with the webhook URL and in my case method POST and Content-Type: application/json

  3. Create a Monitor:

    • per query monitor; running every 2 hours; using sample-index as data source and my-time as time field;
    • For the query use the visual editor; configure to check the document count for the last 24days
    • For the trigger: trigger when doc count is above 0
    • For the action: send a message to the created destination, e.g. for teams (reference):
    {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": "0076D7",
        "summary": "My-Summary",
        "sections": [{
            "activityTitle": "OpenSearch Alert",
            "activitySubtitle": "Trigger: {{ctx.trigger.name}}",
            "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
            "facts": [{
                "name": "Period start",
                "value": "{{ctx.periodStart}}"
            }, {
                "name": "Period end",
                "value": "{{ctx.periodEnd}}"
            }, {
                "name": "Error",
                "value": "{{ctx.error}}"
            }, {
                "name": "Raw",
                "value": "{{ctx}}"
            }],
            "markdown": true
        }]
    } 
    
  4. Verify that sending the test message works.

  5. Create the monitor.

  6. Wait for 1 hour & send the test message again to see the connection reset issue
    Or: Just wait e.g. 1 day until the alerts were triggered (every 2 hours). Some of them will be sent successfully and some of them are failed. For that: Click on the alert under "Alerts" to see its state.

What is the expected behavior?
If the connection gets reset, I expect that OpenSearch will take care of re-establishing the connection and/or retry the requests.

What is your host/environment?

  • Versions: 1.3.15 & 2.12.0
  • Plugins: OpenSearch alerting

Do you have any screenshots?
Screenshot 2024-04-22 at 18 25 43

Do you have any additional context?

@JannikBrand JannikBrand added bug Something isn't working untriaged labels Apr 22, 2024
@JannikBrand
Copy link
Author

JannikBrand commented Apr 23, 2024

When querying the alerts via GET _plugins/_alerting/monitors/alerts?monitorId=<id_placeholder>, this is a sample output (with OS 2.12.):
The monitor was running 1 day, many alerts are send successfully while other alerts are not send due to the connection reset error.
I also tried to run the monitor in 20 min intervals instead of 2hours. Almost every second alert is not sent.

{
  "alerts": [
    {
      "id": "sJSsBo8BomAnsTMDF0PN",
      "version": 11,
      "monitor_id": "kHM-Bo8Bgaj1Kgd4LjPY",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 5,
      "monitor_version": 1,
      "monitor_name": "My Test Monitor",
      "execution_id": "kHM-Bo8Bgaj1Kgd4LjPY_2024-04-22T16:38:09.502091053_6c5c58a6-3eaf-4283-ae57-88801bb703f4",
      "trigger_id": "jnM-Bo8Bgaj1Kgd4LTP6",
      "trigger_name": "My-Test-Trigger",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [
        {
          "timestamp": 1713868689538,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713854289430,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713839889444,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713825489518,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713811089532,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        }
      ],
      "severity": "1",
      "action_execution_results": [
        {
          "action_id": "notification167289",
          "last_execution_time": 1713875890185,
          "throttled_count": 0
        }
      ],
      "start_time": 1713803892665,
      "last_notification_time": 1713875890370,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 1
}

@sbcd90
Copy link
Collaborator

sbcd90 commented Apr 29, 2024

added to backlog.

@sbcd90 sbcd90 removed the untriaged label Apr 29, 2024
@ultratoto14
Copy link

Any news on this one ? I encounter the same issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants