Workaround for JVM bug causing crashes on exception access #3486
Comments
Hi @jackshirazi, we are currently also struggling with JVM crashes related to JDK 17 and Elastic APM. We have now switched to Eclipse Temurin JDK 17.0.10+7 and are using the Elastic APM Java Agent 1.45. The applications run in Docker containers orchestrated with Docker Compose. In #3257 you already recommended switching to the latest JDK 17. When we rolled out a version with the new JDK today, one instance had a JVM crash. We noticed that this problem mostly occurs in a scheduled task. Here is an excerpt from today's crash report:
If you need the full report or additional excerpts, please let me know.
Thanks, that's a different issue; you need to stop using the inferred spans option.
Oracle have considered this hypothesis and can't see how it would occur, so I'm closing this specific issue now. If anyone has crashes NOT related to the asyncprofiler, please open a discussion thread at our discussion forum. Any crashes related to asyncprofiler should be resolved by going back to the default
@jackshirazi thanks for the quick reply! Sorry that I commented on the wrong issue.
Describe the bug
We have a number of different reports where the agent accessing exceptions has caused JVM crashes on JVM 17+ (but only after many hours of JVM load). This has happened in scenarios where the agent is not loading any native code (i.e. inferred spans is left disabled, which is the default), including simple situations where the application raises the exception, the advice synchronously tries to read the exception (the first touch by the agent), and the JVM crashes with SIGSEGV because the native side of the exception has been nulled.
Given that the agent doesn't have any native code loaded, and the bytecode transformations are standard ones that many agents do using Byte Buddy, we've looked for what our agent does differently from other agents (since we haven't heard of similar crashes from other agents). There is one significant difference: while most agents inline their advice code (effectively transforming a method to include the advice code), the Elastic agent uses Byte Buddy's non-inlined, invokedynamic-based advice, which inserts a bytecode that makes a dynamic dispatch call out to the advice code.
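To make the two dispatch styles concrete, here is a rough sketch using only `java.lang.invoke`, without Byte Buddy. It is an illustration under assumptions, not the agent's code: the class and method names (`AdviceDispatchSketch`, `onThrow`, `inlined`, `dispatched`) are hypothetical, and a `ConstantCallSite` held in a static field stands in for the call site that a real `invokedynamic` instruction would be linked to (no actual `invokedynamic` bytecode is emitted here).

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class AdviceDispatchSketch {

    // Hypothetical advice body: reads the exception the way an error reporter might.
    public static String onThrow(Throwable t) {
        return t.getClass().getSimpleName() + ": " + t.getMessage();
    }

    // Stand-in for the call site an invokedynamic instruction would be bound to.
    static final CallSite ADVICE_SITE = bootstrap();

    // Resolves the advice method once and wraps it in a constant call site.
    static CallSite bootstrap() {
        try {
            MethodHandle target = MethodHandles.lookup().findStatic(
                    AdviceDispatchSketch.class, "onThrow",
                    MethodType.methodType(String.class, Throwable.class));
            return new ConstantCallSite(target);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Inlined style: the exception is read directly inside the instrumented method.
    public static String inlined() {
        try {
            throw new IllegalStateException("boom");
        } catch (IllegalStateException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    // Non-inlined style: the instrumented method only hands the exception
    // to the call site, and the advice code runs outside the method body.
    public static String dispatched() {
        try {
            throw new IllegalStateException("boom");
        } catch (IllegalStateException e) {
            try {
                return (String) ADVICE_SITE.dynamicInvoker().invoke(e);
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(inlined());
        System.out.println(dispatched());
    }
}
```

Both methods return the same string; the only difference is where the code that touches the exception lives, which is the difference discussed above.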
We hypothesize that between JVM 11 and 17, G1 processing changed to be more aggressive about nulling the native-side data of exception objects, presumably based on escape analysis (or similar) determining that the exception has gone out of scope of the application. In the case of inlined code that accesses the exception, the escape analysis would determine that the exception is still in scope. We hypothesize, however, that when the bytecode has been retransformed to add an invokedynamic bytecode that does a call site lookup, the escape analysis fails to identify that the exception object now has a longer lifetime, and continues to inform the GC that the exception can be nulled. In that scenario there is a race condition between the GC and the agent. In most cases the agent quickly accesses the exception to get the information for error reporting and adds that information to traces, and then the exception is indeed out of scope of the application (and agent). But every once in a while a GC is triggered just before the agent accesses the exception, the GC erroneously thinks the exception is out of scope and nulls it, then the agent accesses the still-live exception and the JVM crashes with SIGSEGV.
If this hypothesis is correct, we could work around the JVM bug by inlining the exception processing.
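One possible shape for such a workaround, sketched here purely as an assumption and not as the agent's actual design: the code inlined into the instrumented method copies what it needs from the exception into a plain data object, and only that object is passed on to the non-inlined advice. All names below (`InlineSnapshotSketch`, `ErrorSnapshot`, `captureInline`, `report`) are hypothetical.

```java
public class InlineSnapshotSketch {

    // Plain data holder: once filled, downstream code no longer touches the Throwable.
    public static final class ErrorSnapshot {
        public final String type;
        public final String message;
        public final int frameCount;

        ErrorSnapshot(String type, String message, int frameCount) {
            this.type = type;
            this.message = message;
            this.frameCount = frameCount;
        }
    }

    // Stands in for code that would be inlined into the instrumented method:
    // every read of the exception happens here, in the application's own frame.
    public static ErrorSnapshot captureInline(Throwable t) {
        return new ErrorSnapshot(
                t.getClass().getName(),
                t.getMessage(),
                t.getStackTrace().length);
    }

    // Stands in for the non-inlined advice: it receives only the snapshot.
    public static String report(ErrorSnapshot snapshot) {
        return snapshot.type + ": " + snapshot.message;
    }

    // Hypothetical instrumented method combining the two steps.
    public static String instrumented() {
        try {
            throw new IllegalArgumentException("bad input");
        } catch (IllegalArgumentException e) {
            ErrorSnapshot snapshot = captureInline(e); // inline read of the exception
            return report(snapshot);                   // advice sees plain data only
        }
    }

    public static void main(String[] args) {
        System.out.println(instrumented());
    }
}
```

Whether this would actually avoid the crash depends on the hypothesis above being correct; the sketch only shows where the exception reads would move.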
Steps to reproduce
Not reproducible in test scenarios; all crash reports have come after multiple hours (often days) of load in production systems.
Expected behavior
JVM doesn't crash