During the March 8th, 2023 incident, our Lambdas' average execution duration increased significantly (roughly 2x-3x), causing an increase in concurrency and additional load across the board. These Lambdas are executed thousands of times per minute and take ~90ms on average to complete. We ended up disabling the Lambda extension to restore our normal duration.
Looking at https://github.com/DataDog/datadog-lambda-extension#overhead, it seems we shouldn't have seen this. My interpretation is that roughly one invocation per minute would be expected to have a larger-than-normal duration while it flushes the buffered metrics/spans, but most invocations should have kept running at the same speed.
I wanted to check on that interpretation, see whether there are any other reasons that could have caused such an increase in average duration, and ask whether anything could be done to prevent this in the future, in light of today's incident (I've also included a rough sketch of one idea below, after our config).
This is our current configuration for the extension:
enableDDTracing: false
# logs are forwarded from CW to DD
enableDDLogs: false
subscribeToAccessLogs: false
# as tracing is disabled, do not add DD context to logs
injectLogContext: false
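If it turns out a periodic flush strategy is the recommended mitigation, this is roughly what we were considering adding to our serverless.yml. This is only a sketch, assuming DD_SERVERLESS_FLUSH_STRATEGY is still the supported knob and that the "periodically,<interval in ms>" value format applies to our extension version, which we haven't verified:

provider:
  environment:
    # assumption: ask the extension to flush buffered data on a fixed interval
    # (here every 60s) instead of blocking at the end of every invocation
    DD_SERVERLESS_FLUSH_STRATEGY: "periodically,60000"

Happy to be corrected if there's a better-suited setting for high-throughput, short-duration functions like ours.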
Thank you, and #hugops as I bet this one was a hard one!
@santiagoaguiar Thanks for reporting what you experienced during the incident! That's extremely valuable for us, as we are still actively investigating the exact impact on our serverless customers during the incident. Do you mind following up in another week or so? I believe we will have something concrete to share by then.
@santiagoaguiar We were able to identify a few places in the Datadog Agent where the existing retry and buffering logic was not optimized for serverless. We are looking into potential improvements in Q2.