False positive error 500 by java agent #3454

metabsd · 2023-12-04T15:53:12Z

Describe the bug

We are experiencing a 100% fail rate on Java microservices, but the service is working as expected. We believe that the APM agent is misinterpreting the termination of a thread during execution. Maybe we need to adjust things on our end, but everything seems to be functioning well.

I have noticed that the agent attaches to the thread, and I believe that at the end of processing, the thread is killed. Therefore, the agent misinterprets this and thinks there is a problem.

agent.activation_method == programmatic-self-attach

Exception message:

java.io.IOException: The current thread was interrupted

Full stack trace:

org.apache.catalina.connector.ClientAbortException: java.io.IOException: The current thread was interrupted
at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:305)
at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:267)
at org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:118)
at com.fasterxml.jackson.core.json.UTF8JsonGenerator.flush(UTF8JsonGenerator.java:1200)
at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1063)
at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.writeInternal(AbstractJackson2HttpMessageConverter.java:483)
at org.springframework.http.converter.AbstractGenericHttpMessageConverter.write(AbstractGenericHttpMessageConverter.java:103)
at org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:297)
at org.springframework.web.servlet.mvc.method.annotation.HttpEntityMethodProcessor.handleReturnValue(HttpEntityMethodProcessor.java:245)
at org.springframework.web.method.support.HandlerMethodReturnValueHandlerComposite.handleReturnValue(HandlerMethodReturnValueHandlerComposite.java:78)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:136)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:884)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:797)
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1081)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:974)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1011)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:903)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:537)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:631)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.springframework.web.filter.ServerHttpObservationFilter.doFilterInternal(ServerHttpObservationFilter.java:109)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:166)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:738)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:341)

Steps to reproduce

We are making a call to our web service, and the call is normal, and we receive our response as expected. However, the result of the call in APM seems to be problematic, but we do not see any issue with this call, only from the APM perspective.

Expected behavior

That APM does not return an HTTP 500 error for the requests in question.

The text was updated successfully, but these errors were encountered:

SylvainJuge · 2023-12-04T16:38:33Z

Thanks for reporting this @metabsd.

This ClientAbortException is often triggered when HTTP client does not completely wait for the end of the response and from the server perspective it's often associated with broken pipe error message, but here I am not sure if it's related or not.

Here we need to have a bit of clarification:

does the exception you reported here is thrown with/without the agent ?
does the application behavior is altered by the agent or is just the APM agent capturing what looks like a 500 error while from the HTTP client everything works as expected (thus no 500) ?

What is likely happening here is that a transaction is captured by the agent and the HTTP client might be closing the connection earlier than expected, then on the server side, a ClientAbortException is thrown to abruptly end the HTTP response sending (as there isn't any client listening).

If that's what I think, the challenge here would be to make the agent aware of those "expected exceptions" if we can, and another challenge might be that this kind of exception could also be the symptom of a misbehaving HTTP client, thus it's quite hard to always mark those as "not really errors". Currently the agent assumes that when there is an exception the HTTP response code is a 500, but here it has already been sent and it might actually be fine even from the client-side, but that's definitely an implementation detail of the Tomcat server (thus might need a dedicated instrumentation if we need to handle that automatically).

We already had a similar report #979.

metabsd · 2023-12-05T12:59:02Z

Hi @SylvainJuge,

I'm reaching out to seek your insights on an issue we're encountering with the interpretation of HTTP 500 errors by the agent. We have observed something curious: when using Postman for our requests, we receive a HTTP 200 response, which indicates that everything is functioning correctly. This observation is corroborated by our JMeter load tests, where 100% of the requests returned without any errors.

Given this context, could the agent be misinterpreting the normal termination of threads in a multi-threaded environment as errors? We suspect that the agent might be attaching itself to a thread and then mistakenly interpreting its natural termination as an error. Is it possible that this is causing the discrepancy we are observing?

Also, I'm curious about the Java agent's design in relation to multi-threaded environments. Is the agent typically configured to handle the nuances of thread lifecycle management in such environments, particularly without misidentifying routine thread terminations as errors?

Could you shed some light on how the agent might be handling thread lifecycles and error detection, and whether there might be a need for further investigation into this aspect to resolve the apparent discrepancy between the agent's error logs and the results from testing tools like Postman and JMeter?

Thank you in advance for your guidance on this matter.

SylvainJuge · 2023-12-05T13:18:11Z

Hi,

From your description it seems that the application behavior is not altered, and only what the agent reports is not what you'd expect.

You mentioned JMeter, which means you are probably load-testing the application, that might also explain those unexpected ClientAbortException. When using the application directly through a browser do you also see those exceptions and errors ?

The agent is not attached to the thread lifecycle, in the sense that the fact a thread is terminated or started does not automatically create or stops an existing span/transaction.

The agent usually instruments one or more methods and wrap their content in a big try/catch/finally block, that allows to define the scope of transactions and spans as the time it takes to execute the method body of the instrumented methods. When an exception occurs, like when the thread is interrupted (but it could also be any kind of exception), then the catch/finally part of the instrumentation is executed, and this is where the agent tries to guess the status of the HTTP response as it's not already explicitly set.

So here I would first suggest to validate if you can reproduce when using the application like a human would (not JMeter), then try the ignore_exceptions configuration option https://www.elastic.co/guide/en/apm/agent/java/current/config-core.html#config-ignore-exceptions, while this might not help with the HTTP status code it should at least help with the transaction outcome field.

metabsd · 2023-12-05T13:38:27Z

Hi,

Yes, we are currently conducting load tests. I have also reviewed our APM logs since November 21, and this type of error was already present. We also encounter HTTP 500 errors when making calls with Postman without JMeter in action. Thank you for the recommendation regarding the ignore-exceptions parameter, I will try that out.

I have come to the conclusion that the exception might be misinterpreted by the Java agent because, in the transaction details, I noticed that the activation_methods of the agent is programmatic-self-attach. This leads me to believe that the agent attaches itself to a thread. Then, in the exception mentioned by the agent, I noticed java.io.IOException: The current thread was interrupted. I am going to ask the Devs to evaluate this in depth to make sure we are not overlooking a real problem. Once again, thank you!!

SylvainJuge · 2023-12-05T14:27:32Z

The "activation method" refers to how the agent is setup in the JVM, in the case of programmatic-self-attach it means it has been setup with the attacher artifact in your application which has been modified to initialize the agent when starting: doc for reference.

If this exception also can be triggered through postman, then the question is to see if this issue is visible in the usual tomcat logs:

if the error is present in the logs, then the agent captures it as expected.
if it's not present, double check the log level and try with defaults as it might have been explicitly muted.
if it's still not present in the logs even with verbose logging, then it's likely that the agent is reporting something as an error when it's just an implementation detail of tomcat.

metabsd · 2023-12-05T16:41:42Z

We have found references in the code to Thread.currentThread().interrupt() which would explain our problem. Do you know if we need to inform the Java Agent of this type of behavior? If we need to prepare it for a thread interruption. Thank you!

SylvainJuge · 2023-12-05T17:32:10Z

What do you mean here, in the code of the application or in the code of the agent ? Also do you know if those references are the ones that trigger this behavior or not ?

To me the relevant question that needs to be answered is more "why is the thread being interrupted while processing and HTTP request/response".

jackshirazi · 2024-01-04T14:12:55Z

Leaving this open for a little longer to allow an update, but without any further updates here, will likely close this soon

metabsd · 2024-01-04T15:07:35Z

Source of the Problem: The issue we're discussing is indeed in our application's code, not in the Java agent code. This clarification is important as it narrows down our focus to the internal workings of our own application for troubleshooting and fixing the issue.

jackshirazi · 2024-01-05T12:53:55Z

So are you okay to close this issue?

github-actions bot added agent-java community Issues and PRs created by the community triage labels Dec 4, 2023

metabsd closed this as completed Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

False positive error 500 by java agent #3454

False positive error 500 by java agent #3454

metabsd commented Dec 4, 2023

SylvainJuge commented Dec 4, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

jackshirazi commented Jan 4, 2024

metabsd commented Jan 4, 2024

jackshirazi commented Jan 5, 2024

False positive error 500 by java agent #3454

False positive error 500 by java agent #3454

Comments

metabsd commented Dec 4, 2023

Describe the bug

Steps to reproduce

Expected behavior

SylvainJuge commented Dec 4, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

metabsd commented Dec 5, 2023

SylvainJuge commented Dec 5, 2023

jackshirazi commented Jan 4, 2024

metabsd commented Jan 4, 2024

jackshirazi commented Jan 5, 2024