[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

JannikBrand · 2024-04-22T16:32:00Z

What is the bug?
After setting up an Alert Monitor which is sending alerts to a custom webhook and verifying that in general the alerts are sent to the webhook successfully, alerts are very often not being sent due to a connection reset issue after some time has passed.

Example: When editing a Monitor there is the possibility to "Send test message". When creating the monitor sending the test message works. After waiting ~30-60 mins it fails with status code 500 and "Failed to send webhook message Connection reset" (which you can see in the browser developer tools and the OpenSearch Dashboard notification popup in that case).
But also if you just wait for the actual alerts being sent by the monitor the often result in this failed state.

When getting the "connection reset" error using the "Send test message" button, the first retry always succeeded.
First idea: Improve / Introduce retry logic in the OpenSearch client when this happens.

It could have something todo with the monitor setup, not sure, I will share the monitor information below.
With this configuration I could reproduce it for OpenSearch 1.3.15 and 2.12.0. It occurred for us when sending alerts to both, MS Teams and Slack as well as for the destination types slack and custom webhook.

Assumption: The issue seems to be on the OpenSearch side since otherwise there would also be the error message "Connection reset by peer".

Respective log of the OpenSearch data node:

[2024-04-12T06:52:05,486][ERROR][o.o.a.d.f.CustomWebhookDestinationFactory] [opensearch-data-1] Exception publishing Message: DestinationType: CUSTOMWEBHOOK, DestinationName:my-teams-channel, Url: https://<I_redacted_it>, scheme: HTTPS, Host: null, Port: -1, Path: null, Method: POST, Message: {
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "themeColor": "0076D7",
    "summary": "My-Summary",
    "sections": [{
        "activityTitle": "OpenSearch Alert",
        "activitySubtitle": "Trigger: test-trigger",
        "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
        "facts": [{
            "name": "Period start",
            "value": "2024-04-12T04:52:05.475Z"
        }, {
            "name": "Period end",
            "value": "2024-04-12T06:52:05.475Z"
        }, {
            "name": "Error",
            "value": ""
        }, {
            "name": "Raw",
            "value": "{monitor={_id=Ow03zY4BedGE4-daHuXs, _version=5, name=Test alerts, enabled=true}, results=[{_shards={total=1, failed=0, successful=1, skipped=0}, hits={hits=[], total={value=2, relation=eq}, max_score=null}, took=3, timed_out=false}], periodStart=2024-04-12T04:52:05.475Z, periodEnd=2024-04-12T06:52:05.475Z, error=null, trigger={id=OQ03zY4BedGE4-daHuXj, name=test-trigger, severity=1, actions=[{name=My Action}]}, alert={acknowledged_time=null, id=Qw04zY4BedGE4-daDuUP, version=-1, end_time=null, error_message=null, last_notification_time=1712897527013, severity=1, start_time=1712839986700, state=ACTIVE, bucket_keys=null, parent_bucket_path=null}}"
        }],
        "markdown": true
    }]
}
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) [alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) [?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) [alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) [opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
[2024-04-12T06:52:05,490][INFO ][o.o.a.m.MonitorRunResult ] [opensearch-data-1] Internal error: java.net.SocketException: Connection reset. See the opensearch.log for details
java.lang.IllegalStateException: java.net.SocketException: Connection reset
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:36) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:19) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.Notification.lambda$publish$0(Notification.java:35) ~[alerting-notification-1.3.15.0.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at org.opensearch.alerting.destination.Notification.publish(Notification.java:33) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.model.destination.Destination.publish(Destination.kt:298) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at org.opensearch.alerting.MonitorRunner$runAction$2.invokeSuspend(MonitorRunner.kt:659) ~[opensearch-alerting-1.3.15.0.jar:1.3.15.0]
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.6.10.jar:1.6.10-release-923(1.6.10)]
    at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) ~[?:?]
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1459) ~[?:?]
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070) ~[?:?]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.12.jar:4.4.12]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar:4.5.13]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.getHttpResponse(DestinationHttpClient.java:127) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.client.DestinationHttpClient.execute(DestinationHttpClient.java:90) ~[alerting-notification-1.3.15.0.jar:?]
    at org.opensearch.alerting.destination.factory.CustomWebhookDestinationFactory.publish(CustomWebhookDestinationFactory.java:32) ~[alerting-notification-1.3.15.0.jar:?]
    ... 11 more

How can one reproduce the bug?
Steps to reproduce the behavior:

Prerequisite: Setup a custom webhook you can send alert notifications to. I created a MS Teams channel & webhook as described here. There might be better online tools for testing webhooks.

Create a sample index with 1 document like this:

PUT sample-index/_doc/<id_placeholder>
{
  "numberField": 5,
  "my-time": "2024/04/22 12:38:00" //current date
}

Create a destination (in newer versions called "notification channel") with the webhook URL and in my case method POST and Content-Type: application/json

Create a Monitor:

per query monitor; running every 2 hours; using sample-index as data source and my-time as time field;
For the query use the visual editor; configure to check the document count for the last 24days
For the trigger: trigger when doc count is above 0
For the action: send a message to the created destination, e.g. for teams (reference):

{
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "themeColor": "0076D7",
    "summary": "My-Summary",
    "sections": [{
        "activityTitle": "OpenSearch Alert",
        "activitySubtitle": "Trigger: {{ctx.trigger.name}}",
        "activityImage": "https://teamsnodesample.azurewebsites.net/static/img/image5.png",
        "facts": [{
            "name": "Period start",
            "value": "{{ctx.periodStart}}"
        }, {
            "name": "Period end",
            "value": "{{ctx.periodEnd}}"
        }, {
            "name": "Error",
            "value": "{{ctx.error}}"
        }, {
            "name": "Raw",
            "value": "{{ctx}}"
        }],
        "markdown": true
    }]
}

Verify that sending the test message works.
Create the monitor.
Wait for 1 hour & send the test message again to see the connection reset issue
Or: Just wait e.g. 1 day until the alerts were triggered (every 2 hours). Some of them will be sent successfully and some of them are failed. For that: Click on the alert under "Alerts" to see its state.

What is the expected behavior?
If the connection gets reset, I expect that OpenSearch will take care of re-establishing the connection and/or retry the requests.

What is your host/environment?

Versions: 1.3.15 & 2.12.0
Plugins: OpenSearch alerting

Do you have any screenshots?

Do you have any additional context?

The text was updated successfully, but these errors were encountered:

JannikBrand · 2024-04-23T14:02:59Z

When querying the alerts via GET _plugins/_alerting/monitors/alerts?monitorId=<id_placeholder>, this is a sample output (with OS 2.12.):
The monitor was running 1 day, many alerts are send successfully while other alerts are not send due to the connection reset error.
I also tried to run the monitor in 20 min intervals instead of 2hours. Almost every second alert is not sent.

{
  "alerts": [
    {
      "id": "sJSsBo8BomAnsTMDF0PN",
      "version": 11,
      "monitor_id": "kHM-Bo8Bgaj1Kgd4LjPY",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 5,
      "monitor_version": 1,
      "monitor_name": "My Test Monitor",
      "execution_id": "kHM-Bo8Bgaj1Kgd4LjPY_2024-04-22T16:38:09.502091053_6c5c58a6-3eaf-4283-ae57-88801bb703f4",
      "trigger_id": "jnM-Bo8Bgaj1Kgd4LTP6",
      "trigger_name": "My-Test-Trigger",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [
        {
          "timestamp": 1713868689538,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713854289430,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713839889444,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713825489518,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        },
        {
          "timestamp": 1713811089532,
          "message": """Failed running action:
OpenSearchStatusException[{"event_status_list": [{"config_id":"iXM7Bo8Bgaj1Kgd4bDNq","config_type":"webhook","config_name":"My Channel","email_recipient_status":[],"delivery_status":{"status_code":"500","status_text":"Failed to send webhook message Connection reset"}}]}]"""
        }
      ],
      "severity": "1",
      "action_execution_results": [
        {
          "action_id": "notification167289",
          "last_execution_time": 1713875890185,
          "throttled_count": 0
        }
      ],
      "start_time": 1713803892665,
      "last_notification_time": 1713875890370,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 1
}

sbcd90 · 2024-04-29T18:43:10Z

added to backlog.

ultratoto14 · 2024-12-02T08:25:24Z

Any news on this one ? I encounter the same issue.

JannikBrand added bug Something isn't working untriaged labels Apr 22, 2024

sbcd90 removed the untriaged label Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

JannikBrand commented Apr 22, 2024

JannikBrand commented Apr 23, 2024 •

edited

Loading

sbcd90 commented Apr 29, 2024

ultratoto14 commented Dec 2, 2024

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525

Comments

JannikBrand commented Apr 22, 2024

JannikBrand commented Apr 23, 2024 • edited Loading

sbcd90 commented Apr 29, 2024

ultratoto14 commented Dec 2, 2024

JannikBrand commented Apr 23, 2024 •

edited

Loading