Sporadic lambda timeouts after migration to "datadog-lambda-extension" #191

Open
hrist0stoichev opened this issue Jan 25, 2024 · 10 comments


hrist0stoichev commented Jan 25, 2024

Hello! We used the Lambda forwarder for a long time and recently migrated to the extension for sending logs to Datadog. I'm not 100% sure this is due to the extension, but we've been seeing sporadic timeouts for some of our Lambda functions. I'll attach a few examples and try to explain what's wrong with them.

[screenshot] Starting simple, here are the logs for one particular execution. Note the following:
  1. There are no logs whatsoever, even though we expect to see some.
  2. The runtime duration is negative.
  3. The post runtime duration is huge; this is a function with a 6 s timeout.
  4. The timestamps indicate that only ~300 ms passed.
[screenshot]
  1. The runtime duration is negative.
  2. The post runtime duration is huge.
  3. There is a lot of "double logging".
  4. The logs prior to the START log indicate that this is a cold start, but there is no Init Duration in the REPORT log.
  5. There are "Invalid trace id" logs.
[screenshot] Here the timeout is real and is indeed caused by the function being slow. The runtime duration and post runtime duration look fine, but note:
  1. The function actually timed out all 3 times (once from the trigger + twice from the automatic async retries), yet there is only one timeout log. It looks like the other two executions succeeded, but they didn't (the third execution is actually what made me think "wait, why didn't we stop after the second one?", and then I realized that all three had timed out).
  2. A very strange sequence of START -> REPORT -> START -> END -> REPORT -> START -> END.
  3. The timestamps are wrong. The first execution correctly shows ~15 minutes, but the next two show 2-3 minutes of execution when in fact the functions were timing out after 15 minutes (I'm fairly sure of that, since they were stuck in an infinite loop).

All in all, as said before, I'm not 100% sure this is all caused by the extension, but something is definitely going wrong with the logs. The examples above are from different functions at different times. It feels like there's some issue in sending logs to Datadog that causes our functions to time out and drop logs. Do you think that's possible?
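(An aside for anyone trying to line up the triple invocations above: the two extra attempts come from Lambda's default retry policy for asynchronous invocations, which is two retries. While debugging, it can help to disable those retries so each timeout maps to exactly one set of START/END/REPORT lines. A minimal sketch with aws-cdk-lib v2; the `fn` binding is a placeholder for the function under investigation:)

```ts
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const fn: lambda.Function; // placeholder: the function under investigation

// With retryAttempts: 0, a timed-out async invocation is not retried, so each
// timeout in the logs corresponds to exactly one attempt.
fn.configureAsyncInvoke({
  retryAttempts: 0,
});
```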

duncanista added the bug label on Feb 5, 2024
@alexgallotta (Contributor)

Thanks for reporting this!
It seems this issue might require deeper analysis. Could you please file a ticket at https://help.datadoghq.com/hc/en-us/requests/new so we can follow the standard process?

@joel-eroad

We are facing a similar kind of issue: the Lambda times out without ever running the actual code. As the log line below shows, Runtime Duration and Post Runtime Duration are both 0 ms, so the handler never executed.

REPORT RequestId: dfcebe0d-66cc-4f40-a110-01c339fd97b6 Duration: 901730.40 ms Runtime Duration: 0.00 ms Post Runtime Duration: 0.00 ms Billed Duration: 900000 ms Memory Size: 2048 MB Max Memory Used: 246 MB

We tried upgrading the Datadog AWS Lambda extension to the latest version (57) but are still seeing these Lambda timeouts. When we remove the extension from the Lambda and re-run, the timeouts disappear.

Any update on this issue?
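(For reference when testing specific versions: the extension version is selected by the layer ARN attached to the function, which follows Datadog's published format `arn:aws:lambda:<region>:464622532012:layer:Datadog-Extension:<version>`. A minimal sketch of pinning it explicitly in aws-cdk-lib v2; the region and version below are placeholders, and `fn` is assumed:)

```ts
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const fn: lambda.Function; // placeholder: the instrumented function

// Pin the extension to an explicit version so upgrades are deliberate and a
// regression can be bisected layer version by layer version.
fn.addLayers(
  lambda.LayerVersion.fromLayerVersionArn(
    fn,
    "DatadogExtension",
    "arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension:57"
  )
);
```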

@steve-lebleu

I'm encountering the same issue, and it's quite hard to debug. All permissions (role, secrets, etc.) seem correct; the forwarder just times out. Does anyone have more information?

@steve-lebleu

Fixed on my side by using the CloudFormation template instead of the manual setup. It turned out the issue was access to the Secrets Manager ARN, which the stack provided by the CloudFormation template handles.
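(For anyone hitting the same wall with a manual setup: the debug logs later in this thread show the extension resolving DD_API_KEY_SECRET_ARN from Secrets Manager at startup, so the execution role needs read access to that secret. A minimal sketch in aws-cdk-lib v2; `fn` and `apiKeySecret` are placeholders:)

```ts
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";

declare const fn: lambda.Function;                   // placeholder
declare const apiKeySecret: secretsmanager.ISecret;  // placeholder: the DD API key secret

// Grants secretsmanager:GetSecretValue (and DescribeSecret) on the secret to
// the function's execution role, which the extension needs at startup.
apiKeySecret.grantRead(fn);
```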


mlh758 commented Sep 4, 2024

We're seeing a similar issue with the Lambda extension. Post runtime duration can run up to our full configured 30-second timeout, versus the ~250 ms it normally takes to send a response. This is the only extension we have on our Lambdas. It's not very common (p99 for post runtime seems to be under a second), but it happens often enough to be prominent in our logs.

[screenshot]


hghotra commented Sep 10, 2024

Could everyone try the newest version of the extension? If you're still encountering the same issue, please open a support request at https://help.datadoghq.com/hc/en-us/requests/new and someone from the engineering team will take a deeper look into it.

@mlh758 @joel-eroad @hrist0stoichev

@ajohnston1219

We are also having this issue. We had it initially with Lambdas that had 128 MB of memory and were advised to increase this to 256 MB. That solved the issue for a while, but then the same behavior returned and we had to increase the memory to 512 MB. That fixed it for a few days, but we are again seeing the timeouts, so if this is memory related, it's a memory leak and not just fixed overhead. It is hard to determine the cause because, as others here have mentioned, there are no logs indicating any problem. I have opened an issue via the help.datadoghq.com link shared above, but still wanted to mention here that this seems to be an ongoing issue.
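(One way to test the leak hypothesis, which is an assumption here and not something anyone has confirmed: plot max memory used over time from the REPORT lines and see whether it climbs across a container's lifetime. A sketch of a CloudWatch Logs Insights query using the documented `@maxMemoryUsed` field, which is reported in bytes, run against the function's log group:)

```ts
// Paste the query string into the CloudWatch Logs Insights console, or run it
// via StartQueryCommand from @aws-sdk/client-cloudwatch-logs.
const maxMemoryTrendQuery = `
  filter @type = "REPORT"
  | stats max(@maxMemoryUsed / 1000 / 1000) as maxMemoryUsedMB by bin(30m)
`;
```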

@ajohnston1219

After speaking with support, we turned on debug logs via the DD_LOG_LEVEL environment variable.
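(For anyone following along, enabling this is a single environment variable on the function; a minimal sketch in aws-cdk-lib v2, with the `fn` binding assumed:)

```ts
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const fn: lambda.Function; // placeholder: the instrumented function

// DD_LOG_LEVEL=debug turns on DEBUG output from the extension (the
// DD_EXTENSION lines below) as well as the Datadog Lambda library.
fn.addEnvironment("DD_LOG_LEVEL", "debug");
```

Here are the logs we received for one of our timeouts: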

timestamp,message
1727361705215,"[AWS Parameters and Secrets Lambda Extension] 2024/09/26 14:41:45 PARAMETERS_SECRETS_EXTENSION_LOG_LEVEL is not present. Log level set to info.
"
1727361705216,"[AWS Parameters and Secrets Lambda Extension] 2024/09/26 14:41:45 INFO Systems Manager Parameter Store and Secrets Manager Lambda Extension 1.0.103
"
1727361705216,"[AWS Parameters and Secrets Lambda Extension] 2024/09/26 14:41:45 INFO Serving on port 2773
"
1727361705859,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | Datadog extension version : 57|Datadog environment variables: DD_API_KEY_SECRET_ARN=***|DD_CAPTURE_LAMBDA_PAYLOAD=false|DD_COLD_START_TRACING=false|DD_ENV=prod|DD_FLUSH_TO_LOG=false|DD_LAMBDA_HANDLER=index.handler|DD_LOGS_INJECTION=false|DD_LOG_LEVEL=debug|DD_MERGE_XRAY_TRACES=false|DD_PROFILING_ENABLED=false|DD_REMOTE_CONFIGURATION_ENABLED=false|DD_SERVERLESS_APPSEC_ENABLED=false|DD_SERVERLESS_LOGS_ENABLED=true|DD_SITE=datadoghq.com|DD_TAGS=git.commit.sha:""***********************************"",git.repository_url:********|DD_TRACE_ENABLED=false|
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | Retrieving ALADDIN_FIREBASE_SECRET_ARN=************* from secrets manager
"
1727361705875,"2024-09-26 14:41:45 UTC | DD_EXTENSION | DEBUG | Found ************* value, trying to use it.
"
1727361706144,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Retrieving DD_API_KEY_SECRET_ARN=************ from secrets manager
"
1727361706144,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Found *********** value, trying to use it.
"
1727361706183,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Retrieving OHD_FIREBASE_SECRET_ARN=************ from secrets manager
"
1727361706183,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Found *************** value, trying to use it.
"
1727361706221,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Starting daemon to receive messages from runtime...
"
1727361706239,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Using a SyncForwarder with a 5s timeout
"
1727361706279,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Retry queue storage on disk is disabled
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Creating TimeSampler #0
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.aggregator_channel_latency_buckets' is empty, falling back to default values
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_latency_buckets' is empty, falling back to default values
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_channel_latency_buckets' is empty, falling back to default values
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | dogstatsd-udp: 127.0.0.1:8125 successfully initialized
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd workers and pipelines count:  2  workers,  1  pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd configured to run with 2 workers and 1 pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | DogStatsD will run 2 workers
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd workers and pipelines count:  2  workers,  1  pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd configured to run with 2 workers and 1 pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Unable to initialize cgroup provider (cgroups not mounted?), err: unable to detect cgroup version from detected mount points: map[]
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Metrics collector: system went into PermaFail, removed from candidates
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd workers and pipelines count:  2  workers,  1  pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Dogstatsd configured to run with 2 workers and 1 pipelines
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | appsec: security monitoring is not enabled: DD_SERVERLESS_APPSEC_ENABLED is not set to true
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Forwarder started
"
1727361706335,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Demultiplexer started
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | dogstatsd-udp: starting to listen on 127.0.0.1:8125
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Loaded configuration: /var/task/datadog.yaml
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | otlp endpoint disabled
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Enabling telemetry collection HTTP route
"
1727361706336,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Subscribing to Telemetry for types: [platform function extension]
"
1727361706375,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Setting DefaultEnv to ""prod"" (from 'env' config option)
"
1727361706436,"TELEMETRY	Name: datadog-agent	State: Already subscribed	Types: [Platform, Function, Extension]
"
1727361706457,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | Starting logs-agent...
"
1727361706458,"2024-09-26 14:41:46 UTC | DD_EXTENSION | INFO | logs-agent started
"
1727361706458,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Adding AWS Logs Log Source
"
1727361706458,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Starting ColdStartSpanCreator
"
1727361706458,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | Setting ColdStartSpanCreator on Daemon
"
1727361706458,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | could not list environment variable for proc id %d 1
"
1727361706459,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | finding the lambda runtime took 527.91µs. found runtime: nodejs20.x
"
1727361706495,"2024-09-26 14:41:46 UTC | DD_EXTENSION | DEBUG | serverless agent ready in 601.149168ms
"
1727361708295,"2024-09-26 14:41:48 UTC | DD_EXTENSION | INFO | Container metrics provider discovery process finished
"
1727361709520,"EXTENSION	Name: datadog-agent	State: Ready	Events: [INVOKE, SHUTDOWN]
"
1727361709520,"EXTENSION	Name: AWSParametersAndSecretsLambdaExtension	State: Ready	Events: [SHUTDOWN, INVOKE]
"
1727361709520,"START RequestId: c398b1c8-ce7f-4a67-8155-8092b2f3e125 Version: $LATEST
"
1727361709535,"[AWS Parameters and Secrets Lambda Extension] 2024/09/26 14:41:49 INFO ready to serve traffic
"
1727361709535,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Received invocation event...
"
1727361709535,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Starting Log Collection with ARN: ********************** and RequestId:
"
1727361709535,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | The flush strategy end has decided to not flush at moment: starting
"
1727361709537,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Enhanced metrics: {durationMs:5000 billedDurationMs:5000 memorySizeMB:512 maxMemoryUsedMB:247 initDurationMs:0 initDurationTelemetry:0 initStartTime:{wall:0 ext:0 loc:<nil>}}
"
1727361709538,"2024-09-26T14:41:49.538Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Didn't patch console output with trace context""}
"
1727361709538,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | InitReport done metrics: {durationMs:0 billedDurationMs:0 memorySizeMB:0 maxMemoryUsedMB:0 initDurationMs:0 initDurationTelemetry:4739.407 initStartTime:{wall:0 ext:0 loc:<nil>}}
"
1727361709538,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | [ColdStartCreator] No init duration, passing
"
1727361709538,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | [ColdStartCreator] No lambda span, passing
"
1727361709556,"2024-09-26T14:41:49.556Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Patching HTTP libraries""}
"
1727361709557,"2024-09-26T14:41:49.557Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:event.Execution is not an object.""}
"
1727361709558,"2024-09-26T14:41:49.558Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Reading Xray trace context from env var Root=1-66f572a8-78522a671aafb6075eaae733;Parent=3c8322a1922bc13f;Sampled=0;Lineage=1:c0d1eef0:0""}
"
1727361709559,"2024-09-26T14:41:49.559Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Extracted trace context from xray context"",""traceContext"":{""traceId"":""1922955708679776051"",""parentId"":""4360366941562192191"",""sampleMode"":-1,""source"":""xray""}}
"
1727361709559,"2024-09-26T14:41:49.559Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Didn't attempt to find parent for aws.lambda span"",""mergeDatadogXrayTraces"":false,""traceSource"":""xray""}
"
1727361709561,"2024-09-26T14:41:49.561Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Extension present: true""}
"
1727361709561,"2024-09-26T14:41:49.561Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Using StatsD client""}
"
1727361709575,"2024-09-26T14:41:49.575Z	c398b1c8-ce7f-4a67-8155-8092b2f3e125	DEBUG	{""status"":""debug"",""message"":""datadog:Creating the aws.lambda span""}
"
1727361709578,*****REDACTED******
"
1727361709762,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Timeout detected, finishing the current invocation now to allow receiving the SHUTDOWN event
"
1727361709795,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | LogMessage.UnmarshalJSON: no spans object received
"
1727361709795,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Runtime done metrics: {responseLatency:0 responseDuration:0 producedBytes:0}
"
1727361709796,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Received a runtimeDone log message for the current invocation c398b1c8-ce7f-4a67-8155-8092b2f3e125
"
1727361709796,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | The flush strategy end has decided to flush at moment: stopping
"
1727361709800,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Beginning metrics flush at time 1727361709
"
1727361709835,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Received a Flush trigger
"
1727361709837,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Beginning traces flush at time 1727361709
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Finished traces flush that was started at time 1727361709
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Beginning logs flush at time 1727361709
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | INFO | Triggering a flush in the logs-agent
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Flush in the logs-agent done.
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Finished logs flush that was started at time 1727361709
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Demultiplexer: sendIterableSeries: start sending iterable series to the serializer
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | The payload was not too big, returning the full payload
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | The payload was not too big, returning the full payload
"
1727361709855,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Received shutdown event. Reason: timeout
"
1727361709876,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Send messages for pipeline logs (msg_count:124, content_size=129471, avg_msg_size=1044.12)
"
1727361709949,"2024-09-26 14:41:49 UTC | DD_EXTENSION | INFO | Successfully posted payload to ""https://7-53-0-app.agent.datadoghq.com/api/beta/sketches"" (202 Accepted), the agent will only log transaction success every 500 transactions
"
1727361709949,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | SyncForwarder has flushed 1 transactions
"
1727361709958,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | SyncForwarder has flushed 1 transactions
"
1727361709958,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Demultiplexer: sendIterableSeries: stop routine
"
1727361709959,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Finished metrics flush that was started at time 1727361709
"
1727361709959,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Finished flushing
"
1727361709959,"2024-09-26 14:41:49 UTC | DD_EXTENSION | DEBUG | Waiting to shut down HTTP server
"
1727361710159,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Shutting down HTTP server
"
1727361710159,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Beginning metrics flush at time 1727361710
"
1727361710159,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Received a Flush trigger
"
1727361710160,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Beginning traces flush at time 1727361710
"
1727361710160,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Finished traces flush that was started at time 1727361710
"
1727361710160,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Sending sketches payload : *******REDACTED*******
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | The payload was not too big, returning the full payload
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Demultiplexer: sendIterableSeries: start sending iterable series to the serializer
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | The payload was not too big, returning the full payload
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Beginning logs flush at time 1727361710
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | INFO | Triggering a flush in the logs-agent
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Flush in the logs-agent done.
"
1727361710161,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Finished logs flush that was started at time 1727361710
"
1727361710162,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Send messages for pipeline logs (msg_count:37, content_size=44182, avg_msg_size=1194.11)
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | SyncForwarder has flushed 1 transactions
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Demultiplexer: sendIterableSeries: stop routine
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | SyncForwarder has flushed 1 transactions
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Finished metrics flush that was started at time 1727361710
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Finished flushing
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Shutting down agents
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | [ColdStartCreator] - sending shutdown msg
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | INFO | Stopping logs-agent
"
1727361710179,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | [ColdStartCreator] - shutting down
"
1727361710227,"2024-09-26 14:41:50 UTC | DD_EXTENSION | INFO | logs-agent stopped
"
1727361710227,"2024-09-26 14:41:50 UTC | DD_EXTENSION | DEBUG | Serverless agent shutdown complete
"
1727361710245,"END RequestId: c398b1c8-ce7f-4a67-8155-8092b2f3e125
"
1727361710245,"REPORT RequestId: c398b1c8-ce7f-4a67-8155-8092b2f3e125	Duration: 5000.00 ms	Billed Duration: 5000 ms	Memory Size: 512 MB	Max Memory Used: 235 MB	Status: timeout
"

As you can see, the Datadog layer took ~4.6 s to initialize, leaving less than 500 ms for the actual Lambda execution before the timeout was reached. This also has a large cost impact on us, since that 4.6 s is included in the billed duration. Before adding the Datadog layer we were able to run this Lambda with 128 MB of memory, but now it requires 512 MB to avoid timing out so often that it degrades our service (timeouts are currently < 0.1% of requests with 512 MB, but still an issue for us).
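(To quantify how widespread this is, a sketch of my own tooling, not Datadog's: it runs a CloudWatch Logs Insights query over the REPORT lines to get the p99 init duration and the hourly timeout count. The log group name is a placeholder, `@initDuration` and `@duration` are documented Insights fields for Lambda REPORT lines, and the "Status: timeout" match relies on the REPORT format shown above:)

```ts
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const client = new CloudWatchLogsClient({});

// Summarizes init-duration overhead and timeout counts per hour over the
// last 24 hours for the given log group.
async function initOverheadReport(logGroupName: string) {
  const now = Math.floor(Date.now() / 1000);
  const { queryId } = await client.send(
    new StartQueryCommand({
      logGroupName, // placeholder, e.g. "/aws/lambda/my-function"
      startTime: now - 24 * 3600,
      endTime: now,
      queryString: `
        filter @type = "REPORT"
        | stats count(*) as invocations,
                pct(@initDuration, 99) as p99InitMs,
                sum(strcontains(@message, "Status: timeout")) as timeouts
          by bin(1h)`,
    })
  );

  // Insights queries are asynchronous: poll until the query completes.
  for (;;) {
    const res = await client.send(new GetQueryResultsCommand({ queryId }));
    if (res.status === "Complete") return res.results;
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}
```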

@swcloudgenie

Here is a CDK stack that can be used to reproduce the issue: https://github.com/Genie-Garage/datadog-timeout.git

It has to run for ~1 hour before the timeouts appear, but we have reproduced them reliably several times now. Could someone from the Datadog side look into reproducing this?

NOTE: We have reproduced this issue with the "next" beta version as well as the most recent stable version of the extension.
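(Not the linked repo's code, just a sketch of the general shape such a reproduction stack takes, assuming the `datadog-cdk-constructs-v2` library; the layer versions and secret ARN are placeholders:)

```ts
import * as cdk from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";
import { Datadog } from "datadog-cdk-constructs-v2";

export class ReproStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // A small Node.js function with a short timeout, matching the setups
    // described in this thread.
    const fn = new lambda.Function(this, "Fn", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda"),
      timeout: cdk.Duration.seconds(5),
      memorySize: 512,
    });

    // Attaches the extension layer and instruments the function.
    const datadog = new Datadog(this, "Datadog", {
      extensionLayerVersion: 57, // the version under test
      nodeLayerVersion: 115,     // placeholder
      apiKeySecretArn:
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:dd-api-key", // placeholder
      site: "datadoghq.com",
    });
    datadog.addLambdaFunctions([fn]);
  }
}
```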

@duncanista (Contributor)

Hey @swcloudgenie, thanks for sending us this example!

I'll take a look at it as soon as possible!

Really appreciate the effort here, will update as soon as I can.

duncanista self-assigned this on Nov 27, 2024