Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

awslambdaric prevents graceful lambda shutdown #105

Open
aclemons opened this issue Jul 18, 2023 · 3 comments
Open

awslambdaric prevents graceful lambda shutdown #105

aclemons opened this issue Jul 18, 2023 · 3 comments

Comments

@aclemons
Copy link

aclemons commented Jul 18, 2023

Overview

I've followed the guide described in graceful-shutdown-with-aws-lambda. Unfortunately, my signal handler is never invoked.

I spent time investigating and it looks like the native code in awslambdaric is not bubbling up the signal preventing the signal handler registered from ever running. I don't have experience with writing native code python extensions, so I don't know what needs to change in runtime_client.cpp so that this works.

The graceful shutdown guide actually has a note at the end of the readme which says:

Please be aware that this feature does not work on Python 3.8 and 3.9 runtimes.

I think this is exactly because of the issue I am describing here (and mentioned here aws-samples/graceful-shutdown-with-aws-lambda#2).

Steps to reproduce.

I took a simple lambda function handler.py:

import signal
from types import FrameType

def shutdown(_signum: int, _frame: FrameType | None) -> None:
    print("Shutdown hook invoked")


signal.signal(signal.SIGTERM, shutdown)


def lambda_handler(event, context):
    print(f"Handled event {event}")

    return "SUCCESS"

and created a lambda in eu-central-1 with this as a zip, adding the layer arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38 and invoked it once. I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

I also deployed it as a docker image with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]

I pushed this into ECR and then created a container-based lambda with this image again in eu-central-1 and invoked it once. Again, I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

To prove that SIGTERM was being sent to the runtime, I altered the previous image to just be a simple shell script and updated the container-based lambda from the previous step:

startup.sh

#!/usr/bin/env bash

set -e
set -o pipefail

printf 'Starting lambda handler. Running as pid %s\n' "$BASHPID"

shutdown() {
  printf 'Shutting down gracefully\n'
  exit
}
trap 'shutdown' SIGTERM

while true ; do
  HEADERS="$(mktemp)"
  BODY="$(mktemp)"
  curl -f -sS -LD "$HEADERS" -X GET "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next" -o "$BODY"
  REQUEST_ID="$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)"
  rm -rf "$HEADERS"

  printf 'Processing request %s\n' "$REQUEST_ID"

  curl -f -sS -X POST "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/$REQUEST_ID/response" -d '"SUCCESS"'
  rm -rf "$BODY"
done

with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY startup.sh "$LAMBDA_TASK_ROOT/startup.sh"

ENTRYPOINT ["/var/runtime/startup.sh"]

I invoked the lambda and got the "SUCCESS" response. I waited for it to spin down and checked the cloudwatch logs again:

LOGS	Name: cloudwatch_lambda_agent	State: Subscribed	Types: [Platform]
Starting lambda handler. Running as pid 12
Name: cloudwatch_lambda_agent	State: Ready	Events: [INVOKE, SHUTDOWN]
START RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7 Version: $LATEST
Processing request 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
{""status"":""OK""}
END RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
REPORT RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7	Duration: 304.42 ms	Billed Duration: 590 ms	Memory Size: 128 MB	Max Memory Used: 35 MB	Init Duration: 285.31 ms	
Terminated
Shutting down gracefully

Success, so I knew that SIGTERM was being sent. After this, I suspected that it had to be somewhere in the runtime handling for python. While reading the source code for awslambdaric I realised part of it was written in cpp and I wondered if this might be the issue.

To test whether this was the problem, I replaced the runtime_client written in cpp, with one written in python.

runtime_client_py.py

# Reimplementation of runtime_client.cpp in python

import http
import http.client
import os

lambda_runtime_address = os.environ["AWS_LAMBDA_RUNTIME_API"]
user_agent = "awslambdaric"


class RuntimeClient:
    @staticmethod
    def initialize_client(user_agent: str) -> None:
        user_agent = user_agent

    @staticmethod
    def next() -> tuple[bytes, dict]:
        endpoint = "/2018-06-01/runtime/invocation/next"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request("GET", endpoint, headers={"User-Agent": user_agent})
        response = runtime_connection.getresponse()
        response_body = response.read()
        raw_headers = response.headers

        headers = {
            "Lambda-Runtime-Aws-Request-Id": raw_headers.get(
                "Lambda-Runtime-Aws-Request-Id"
            ),
            "Lambda-Runtime-Trace-Id": raw_headers.get("Lambda-Runtime-Trace-Id")
            or None,
            "Lambda-Runtime-Invoked-Function-Arn": raw_headers.get(
                "Lambda-Runtime-Invoked-Function-Arn"
            ),
            "Lambda-Runtime-Deadline-Ms": int(
                raw_headers.get("Lambda-Runtime-Deadline-Ms") or "0"
            ),
            "Lambda-Runtime-Client-Context": raw_headers.get(
                "Lambda-Runtime-Client-Context"
            )
            or None,
            "Lambda-Runtime-Cognito-Identity": raw_headers.get(
                "Lambda-Runtime-Cognito-Identity"
            )
            or None,
            "Content-Type": raw_headers.get("Content-Type") or None,
        }

        return response_body, headers

    @staticmethod
    def post_invocation_result(
        invoke_id: str, result_data: bytes, content_type: str
    ) -> None:
        from awslambdaric.lambda_runtime_client import LambdaRuntimeClientError

        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/response"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=result_data,
            headers={"User-Agent": user_agent, "Content-Type": content_type},
        )

        response = runtime_connection.getresponse()

        if response.status != http.HTTPStatus.ACCEPTED:
            raise LambdaRuntimeClientError(endpoint, response.status, None)

    @staticmethod
    def post_error(invoke_id: str, error_response_data: bytes, xray_fault: str) -> None:
        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/error"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=error_response_data,
            headers={
                "User-Agent": user_agent,
                "Content-Type": "application/json",
                "Lambda-Runtime-Function-Error-Type": "UnhandledContent",
                "Lambda-Runtime-Function-xray-error-cause": xray_fault,
            },
        )


runtime_client = RuntimeClient()

and I put this in an image with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

RUN rm "$LAMBDA_RUNTIME_DIR/runtime_client.cpython"*.so && sed -i 's/import runtime_client/from awslambdaric.runtime_client import runtime_client/' "$LAMBDA_RUNTIME_DIR/awslambdaric/lambda_runtime_client.py"
COPY runtime_client_py.py "$LAMBDA_RUNTIME_DIR/awslambdaric/runtime_client.py"
COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]

I updated the image used by the lambda again and invoked it and waited for it to spin back down. I checked the cloudwatch logs again:

LOGS	Name: cloudwatch_lambda_agent	State: Subscribed	Types: [Platform]
EXTENSION	Name: cloudwatch_lambda_agent	State: Ready	Events: [INVOKE, SHUTDOWN]
START RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9 Version: $LATEST
Handled event {'test': 'value'}
END RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9
REPORT RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9	Duration: 48.67 ms	Billed Duration: 383 ms	Memory Size: 128 MB	Max Memory Used: 45 MB	Init Duration: 333.96 ms	
Shutdown hook invoked

Success! So indeed it looks like the native code needs some adjustment to handle the signal.

Unfortunately, I can't offer a pull request to fix this since I don't know enough about how this works. Some searching points out that perhaps the code needs to be using PyErr_CheckSignals. awslambdaric is also using aws-lambda-runtime internally, so there might be some interaction there too that needs even more handling.

Anyway, I hope these reproduction steps can help someone with the skills find the correct solution.

Thanks.

@tgsong
Copy link
Contributor

tgsong commented Sep 13, 2023

I just ran into the same problem and found this GitHub issue.

Thanks for all the details @aclemons provide, it helped me to locate the problem. The issue is that, awslambdaric uses aws-lambda-cpp under the hood, which uses libcurl to poll for lambda event, and this is a blocking call.

In more details, what happens is that this line would call into the C extension,

response_body, headers = runtime_client.next()

which then calls this line and it blocks the main thread,

According to the Python signal document, even if SIGTERM is received, the registered signal handler will not be called until the current instruction finishes. In our case, the Python interpreter is waiting for runtime_client.next() to finish. This also explains why the pure Python reimplementation of runtime_client will work.

Fortunately, the solution is simple, we can just move the blocking call into a thread to unblock the main thread. Since the C extension will release the GIL (#33) before polling the lambda API, main thread will be able to call the signal handler now.

with ThreadPoolExecutor() as e:
    fut = e.submit(runtime_client.next)
response_body, headers = fut.result()

@smirnoal
Copy link
Contributor

@tgsong do you mind creating a PR with that change?

tgsong added a commit to tgsong/aws-lambda-python-runtime-interface-client that referenced this issue Sep 14, 2023
runtime_client.next is calling into the C extension which blocks the
main thread. Moving it to a separate thread enables the main thread
to process signal, see this issue for more details: aws#105
@tgsong
Copy link
Contributor

tgsong commented Sep 14, 2023

@smirnoal done, #115

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants