[BUG] #1526

Open
ggt opened this issue Apr 23, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@ggt

ggt commented Apr 23, 2024

Environment
Docker

What is the bug?
Triggers and alerts created under schema version 0 produce java.lang.NullPointerException: null.

How can one reproduce the bug?

Steps to reproduce the behavior:

  • Create an alert and trigger on an older OpenSearch version
  • Upgrade to 2.13.0

What is the expected behavior?
sudo docker-compose logs opensearch -f --tail 100

Do you have any screenshots?
opensearch-node1 | [2024-04-23T10:58:21,094][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [P00APLOG-D01] uncaught exception in thread [DefaultDispatcher-worker-4]
opensearch-node1 | java.lang.NullPointerException: null
opensearch-node1 | at org.opensearch.alerting.MonitorRunnerService$runJob$2.invokeSuspend(MonitorRunnerService.kt:335) ~[opensearch-alerting-2.13.0.0.jar:2.13.0.0]
opensearch-node1 | at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.8.21.jar:1.8.21-release-380(1.8.21)]
opensearch-node1 | at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
opensearch-node1 | at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
opensearch-node1 | at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
opensearch-node1 | at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]
opensearch-node1 | uncaught exception in thread [DefaultDispatcher-worker-4]

Differences

<     "schema_version": 0,
>     "schema_version": 8,

<       "source": "Alerting Notification action",
>       "source": "",      //   Suspecting that to be the issue!
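A comparison like the one above can be produced mechanically from two exported monitor documents. The following is a minimal sketch, not part of the original report: the function names and the nesting of "source" under "message_template" in the example are assumptions made for illustration, since only the "schema_version" and "source" values themselves appear above.

```python
ABSENT = "<absent>"  # placeholder for a path that exists on one side only


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {path: leaf_value}."""
    if isinstance(value, dict) and value:
        pairs = [(f"{prefix}.{key}" if prefix else str(key), child)
                 for key, child in value.items()]
    elif isinstance(value, list) and value:
        pairs = [(f"{prefix}[{index}]", child)
                 for index, child in enumerate(value)]
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: value}
    flat = {}
    for path, child in pairs:
        flat.update(flatten(child, path))
    return flat


def diff_monitors(old, new):
    """Return {path: (old_value, new_value)} for every path that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    changes = {}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        before = old_flat.get(path, ABSENT)
        after = new_flat.get(path, ABSENT)
        if before != after:
            changes[path] = (before, after)
    return changes


if __name__ == "__main__":
    # Hypothetical documents; only the two differing values come from the report.
    old = {"schema_version": 0,
           "message_template": {"source": "Alerting Notification action"}}
    new = {"schema_version": 8, "message_template": {"source": ""}}
    for path, (before, after) in diff_monitors(old, new).items():
        print(f"{path}: {before!r} -> {after!r}")
```

Flattening to paths keeps the output readable when the two documents differ deep inside triggers or actions.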

Temporary solution
Copy the information from the old alert and trigger, then create a new alert and trigger from it.
new.txt
old.txt

TODO
Check in the code whether a null "source" produces that error (debug logging gives no further information).

Thanks!

@ggt ggt added bug Something isn't working untriaged labels Apr 23, 2024
@sbcd90
Collaborator

sbcd90 commented Apr 29, 2024

looking into it. added to backlog.

@sbcd90 sbcd90 removed the untriaged label Apr 29, 2024
@zakisaad

Can confirm this occurred to 5 of our monitors on AWS hosted OS - the frustrating part about this is the complete silence/"green status" on the OpenSearch dashboards, which makes it look like everything is firing per usual. Only tell is the NPE log thrown every time the monitor was supposed to run.
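Since the log line is the only visible symptom, one way to notice the failure is to scan the OpenSearch log for it. The sketch below is an illustration under assumptions, not an official tool: it assumes the log layout shown in the traces in this thread (a bracketed timestamp, level, and logger, followed by stack lines), and the function name is hypothetical.

```python
import re

# Start of a log entry: [timestamp][LEVEL][logger]
ENTRY_START = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z ]+)\]\[(?P<logger>[^\]]+)\]"
)


def find_monitor_npes(lines):
    """Return timestamps of uncaught NPEs raised from MonitorRunnerService."""
    hits = []
    current_ts = None  # timestamp of the uncaught-exception entry being read
    saw_npe = False
    for line in lines:
        entry = ENTRY_START.search(line)
        if entry:
            is_uncaught = (
                entry.group("level").strip() == "ERROR"
                and entry.group("logger").endswith("UncaughtExceptionHandler")
            )
            current_ts = entry.group("ts") if is_uncaught else None
            saw_npe = False
            continue
        if current_ts is None:
            continue
        if "java.lang.NullPointerException" in line:
            saw_npe = True
        elif saw_npe and "MonitorRunnerService" in line:
            hits.append(current_ts)
            current_ts = None  # count each entry once
    return hits


if __name__ == "__main__":
    import sys
    for timestamp in find_monitor_npes(sys.stdin):
        print(timestamp)
```

Piping the node log into the script prints one timestamp per failed monitor run, which can then be compared with the monitor schedule.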

@diegargon

diegargon commented Jun 18, 2024

I have a similar problem.

[2024-06-18T08:45:45,552][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch] uncaught exception in thread [DefaultDispatcher-worker-8]
java.lang.NullPointerException: null
at org.opensearch.alerting.MonitorRunnerService$runJob$1.invokeSuspend(MonitorRunnerService.kt:345) ~[opensearch-alerting-2.14.0.0.jar:2.14.0.0]
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.8.21.jar:1.8.21-release-380(1.8.21)]
at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:233) [kotlinx-coroutines-core-1.1.1.jar:?]
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594) [kotlinx-coroutines-core-1.1.1.jar:?]
at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60) [kotlinx-coroutines-core-1.1.1.jar:?]
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:742) [kotlinx-coroutines-core-1.1.1.jar:?]

I had this problem a month ago and fixed it (I tried a few things but do not remember which one worked). After a reboot yesterday, the error appeared again.

I am not using Docker, just Logstash -> OpenSearch on a Debian box.

@zakisaad

We had to export -> import all of our monitors using the JSON Export feature, as we had used the UI to define the monitors directly. The 2.13 upgrade has been painful for us.

I suggest you export each of the monitors in JSON form, disable them, go to the Dev Tools console in the sidebar, and re-import each of them using a POST request (you might want to strip the id fields from the exported JSON).

If you define monitors using IaC (Terraform or some other in-house tooling), deleting the monitors and re-creating them via your pipelines should also work as a simple solution.
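The export and re-import workaround described above could be scripted roughly as follows. This is a sketch under assumptions: the comment does not name the endpoint, so `POST _plugins/_alerting/monitors` is taken from the Alerting plugin's create-monitor API as I understand it, and the function names are hypothetical. It removes every key named "id" at any depth, which may be too aggressive if a query inside the monitor legitimately uses an "id" key, so review the output before sending it.

```python
import json

CREATE_MONITOR_PATH = "_plugins/_alerting/monitors"  # assumed create endpoint


def strip_ids(node):
    """Return a copy of node with every key named 'id' removed, at any depth."""
    if isinstance(node, dict):
        return {key: strip_ids(value) for key, value in node.items() if key != "id"}
    if isinstance(node, list):
        return [strip_ids(item) for item in node]
    return node


def build_reimport_request(exported_json):
    """Turn exported monitor JSON into text to paste into the Dev Tools console."""
    monitor = strip_ids(json.loads(exported_json))
    return f"POST {CREATE_MONITOR_PATH}\n{json.dumps(monitor, indent=2)}"


if __name__ == "__main__":
    import sys
    # Usage: python reimport.py < exported_monitor.json
    print(build_reimport_request(sys.stdin.read()))
```

Keys such as "destination_id" are left untouched because only the exact key "id" is dropped.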

@diegargon

diegargon commented Jun 20, 2024

Yes, I think that is what I did the first time it happened, although manually. This time it didn't work.

Edit: some detectors work, but after I restart OpenSearch everything begins to fail again.

@ggt
Author

ggt commented Jun 20, 2024 via email

@jowg-amazon
Collaborator

Hi, the null pointer exception is a bug in 2.13 coming from a log statement here:

logger.debug("lock ${lock!!.lockId} released")

It has been fixed in this PR: https://github.com/opensearch-project/alerting/pull/1630/files. Until the code fix is released, here are the steps you can perform as a temporary workaround.

// Check if there are any stuck locks
POST .opensearch-alerting-config-lock/_search?pretty
{
  "query": {
    "match": {
      "released": "false"
    }
  }
}

// Delete all stuck locks
POST .opensearch-alerting-config-lock/_delete_by_query?pretty
{
  "query": {
    "match": {
      "released": "false"
    }
  }
}
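For clusters where the Dev Tools console is not convenient, the same two requests can be sent over HTTP. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:9200` and the function names are assumptions for illustration, and the sketch does not handle authentication or TLS, which a cluster with the security plugin enabled will require.

```python
import json
import urllib.request

LOCK_INDEX = ".opensearch-alerting-config-lock"
# Same query body as the console requests above.
STUCK_LOCKS_QUERY = {"query": {"match": {"released": "false"}}}
ALLOWED_ACTIONS = ("_search", "_delete_by_query")


def build_request(base_url, action):
    """Build the POST request for searching or deleting stuck locks."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{LOCK_INDEX}/{action}",
        data=json.dumps(STUCK_LOCKS_QUERY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(base_url, action):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(base_url, action)) as response:
        return json.load(response)


if __name__ == "__main__":
    base = "http://localhost:9200"  # hypothetical address; adjust to your cluster
    # Inspect the stuck locks first; run the delete only after reviewing them.
    print(json.dumps(send(base, "_search"), indent=2))
```

Running the search first and the delete as a separate, deliberate step mirrors the order of the console requests.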

@DashamoolmDamu

@jowg-amazon I have the same setup and nothing seems to fix the issue. Deleting and re-adding the monitors got the alerts to trigger again, but the NPE is still in the logs. I tried deleting the alerting plugin and also the .opensearch-alerting-* and .opendistro-alerting* indices, but I still see the same error despite removing the plugin. Did I forget to delete anything else related to the alerting plugin? I would appreciate any help fixing this NPE issue.
