[BUG] Alerts cannot be sent to webhook due to "java.net.SocketException: Connection reset" #1525
Comments
When querying the alerts via
Added to backlog.
Any news on this one? I am encountering the same issue.
What is the bug?
After setting up an Alert Monitor that sends alerts to a custom webhook, and verifying that alerts generally reach the webhook, alerts very often fail to send after some time has passed because of a connection reset.
Example: When editing a monitor, there is a "Send test message" option. When creating the monitor, sending the test message works. After waiting ~30-60 minutes, it fails with status code 500 and "Failed to send webhook message Connection reset" (visible in the browser developer tools and in the OpenSearch Dashboards notification popup).
The actual alerts sent by the monitor also often end up in this failed state.
When getting the "connection reset" error using the "Send test message" button, the first retry always succeeded.
First idea: Improve / Introduce retry logic in the OpenSearch client when this happens.
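A minimal sketch of what such retry logic could look like, written in Python for brevity rather than in the plugin's own language. The names `send_with_retry`, `max_attempts`, and `backoff_seconds` are hypothetical and not part of any OpenSearch API; `send` stands in for whatever call delivers the webhook message.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(
    send: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() and retry only when the connection was reset.

    Hypothetical helper: other errors (HTTP 4xx/5xx, timeouts) are not
    retried here and propagate to the caller unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionResetError:
            # Give up after the last attempt and surface the original error.
            if attempt == max_attempts:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
            sleep(backoff_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Since the first manual retry always succeeded in the observations above, even a single extra attempt might be enough. One caveat: a webhook POST is not guaranteed to be idempotent, so a retry after a reset that happened mid-response could deliver the same message twice.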
It could have something to do with the monitor setup; I am not sure, so I will share the monitor information below.
With this configuration I could reproduce it on OpenSearch 1.3.15 and 2.12.0. It occurred for us when sending alerts to both MS Teams and Slack, and with both the slack and custom webhook destination types.
Assumption: The issue seems to be on the OpenSearch side, since otherwise I would expect the error message to read "Connection reset by peer".
Respective log of the OpenSearch data node:
How can one reproduce the bug?
Steps to reproduce the behavior:
Create a sample index with 1 document like this:
Create a destination (called a "notification channel" in newer versions) with the webhook URL and, in my case, method POST and Content-Type: application/json
Create a Monitor:
Verify that sending the test message works.
Create the monitor.
Wait for 1 hour and send the test message again to see the connection reset issue.
Or: Just wait, e.g. for 1 day, while the alerts are triggered (every 2 hours). Some of them will be sent successfully and some will fail. To check, click on an alert under "Alerts" to see its state.
What is the expected behavior?
If the connection gets reset, I expect OpenSearch to re-establish the connection and/or retry the request.
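One way to picture the "re-establish the connection" part is a client that refuses to reuse a connection once it has sat idle for too long. The sketch below assumes the resets come from reusing a kept-alive connection that something on the path has already dropped; this report does not confirm that cause. `IdleAwareClient` and `max_idle_seconds` are hypothetical names, and the code is Python for illustration only, not the plugin's implementation.

```python
import http.client
import time


class IdleAwareClient:
    """Reuses one HTTP connection, but reopens it after a long idle period."""

    def __init__(self, host, port, max_idle_seconds=30.0, clock=time.monotonic):
        self._host = host
        self._port = port
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._conn = None
        self._last_used = 0.0
        self.connections_opened = 0  # exposed for inspection only

    def _connection(self):
        # Discard the cached connection if it has been idle for too long.
        idle_for = self._clock() - self._last_used
        if self._conn is not None and idle_for > self._max_idle:
            self._conn.close()
            self._conn = None
        if self._conn is None:
            # The socket is opened lazily on the first request.
            self._conn = http.client.HTTPConnection(
                self._host, self._port, timeout=10
            )
            self.connections_opened += 1
        return self._conn

    def post_json(self, path, body):
        conn = self._connection()
        conn.request(
            "POST", path, body=body,
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        response.read()  # drain the body so the connection can be reused
        self._last_used = self._clock()
        return response.status
```

With a monitor that fires every 2 hours, any `max_idle_seconds` well below that interval would make each alert open a fresh connection, trading a little latency for not depending on a long-idle socket.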
What is your host/environment?
Do you have any screenshots?
Do you have any additional context?