Sporadic lambda timeouts after migration to "datadog-lambda-extension" #191
Comments
Thanks for reporting this! |
We are facing a similar issue: the Lambda times out without ever running the actual code. As the log line below shows, both the Runtime Duration and the Post Runtime Duration are 0 ms, so the actual code never executed.
We tried upgrading the Datadog AWS Lambda extension to the latest version (57) but are still seeing those Lambda timeouts. When we remove the extension from the Lambda code and re-run, we no longer see those timeouts. Any update on this issue? |
I'm encountering the same issue, and it's quite hard to debug. All permissions seem correct (role, secrets, etc.), yet the forwarder still times out. Does anyone have more information about this? |
Fixed on my side by using the CloudFormation template instead of the manual setup. By the way, the issue came from access to the Secrets Manager ARN, which the stack provided by the CloudFormation template also takes care of. |
We're seeing a similar issue with the Lambda extension. Post runtime duration can run up to the full 30 second timeout we have configured, versus ~250ms to send a response. This is the only extension we have on our Lambdas. It's not super common (p99 for post runtime seems to be under a second), but it's happening often enough to be prominent in our logs. |
Can everyone try the newest version of the extension and if they're still encountering the same issue, then please open a support request using https://help.datadoghq.com/hc/en-us/requests/new and someone from the engineering team will take a deeper look into it. |
We are also having this issue. We had it initially with Lambdas that had 128MB memory, and were recommended to increase this to 256MB. This solved the issue for a while, but then we saw the same behavior again and had to increase the memory to 512MB. This fixed it for a few days, but again we are seeing the timeouts, so if this is memory related, it's a memory leak and not just overhead. It is hard to determine what the cause is because, as people here have mentioned already, there are no logs indicating any issue. I have opened an issue via the help.datadoghq.com link shared above, but still wanted to mention here that this seems to be an ongoing issue. |
After speaking with support, we turned on debug logs via the
As you can see, the datadog layer took ~4.6s to initialize, leaving less than 500ms for the actual lambda execution before timeout was reached. This also has a large cost impact on us, since that 4.6s is included in the billed duration. Before adding the datadog layer we were able to run this lambda at 128MB memory, but now it requires 512MB memory to avoid timing out so often that it degrades our service (currently timeouts are < 0.1% of requests with 512MB, but still an issue for us). |
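To make the time budget described above concrete, here is a minimal Python sketch. It assumes a 5-second function timeout, which is not stated in the thread and is only inferred from "4.6s to initialize, leaving less than 500ms". The names `remaining_budget_ms`, `FakeContext`, and `min_budget_ms` are hypothetical and for illustration only; `get_remaining_time_in_millis()` is the method the Python Lambda runtime exposes on the real context object.

```python
def remaining_budget_ms(timeout_ms: float, init_ms: float) -> float:
    """Time left for the handler after initialization has used init_ms."""
    return max(timeout_ms - init_ms, 0.0)


class FakeContext:
    """Local stand-in for the Lambda context object (hypothetical, for testing)."""

    def __init__(self, remaining_ms: int):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms


def handler(event, context, min_budget_ms: int = 1000):
    """Log a warning when the invocation starts with little time left.

    min_budget_ms is an arbitrary threshold chosen for this sketch.
    """
    remaining = context.get_remaining_time_in_millis()
    low_budget = remaining < min_budget_ms
    if low_budget:
        print(f"WARN low time budget at handler start: {remaining} ms left")
    return {"remaining_ms": remaining, "low_budget": low_budget}


# Assumed 5000 ms timeout (not from the thread) minus the reported ~4600 ms init.
budget = remaining_budget_ms(5000, 4600)
result = handler({}, FakeContext(int(budget)))
```

Logging the remaining time at the start of the handler is one way to tell, from the function's own logs, whether a timeout was preceded by a slow initialization rather than by slow handler code.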
Here is a CDK stack that can be used to reproduce the issue: https://github.com/Genie-Garage/datadog-timeout.git This will have to be run for ~1 hour before seeing timeouts, but we have reproduced this reliably several times now. Is there anyone from the Datadog side that could look into reproducing this? NOTE: We have reproduced this issue with the "next" beta version as well as the most recent stable version of the extension. |
Hey @swcloudgenie, thanks for sending us this example! I'll take a look at it as soon as possible! Really appreciate the effort here, will update as soon as I can. |
Hello! We were using the lambda forwarder for a long time and recently migrated to using the extension for sending logs to Datadog. I'm not 100% sure that this is due to the extension but we've been seeing sporadic timeouts for some of our lambda functions. I'll attach a few examples and try to explain what's wrong with them.
Starting simple, here are the logs for 1 particular execution. Note the following things
All in all, as said before, I'm not 100% sure that it's all caused by the extension, but there's definitely something wrong going on with the logs. The examples I provided are from different functions and different times. It feels like there's some issue in sending logs to Datadog which causes our functions to time out and drop logs. Do you think that's possible?