(Nucleus): component is not gracefully terminated when a new version is deployed #1667

Open
timvlaer opened this issue Nov 14, 2024 · 6 comments
Labels
bug Something isn't working needs-triage Needs eyeballs

timvlaer commented Nov 14, 2024

Describe the bug
When I deploy a new version of my component, the currently running version is not shut down gracefully but killed immediately.

To Reproduce

  1. Make a component with a signal handler (make sure the component doesn't spawn any child processes)
  2. Deploy the component
  3. Update the component version and observe which signals arrive. Check the time between SIGTERM and SIGKILL.
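
For step 1, a component along these lines would be enough to observe the signals. This is a minimal sketch, not the reporter's actual component; the names `handle_sigterm`, `stop_requested`, and `main` are hypothetical. It does all its work in a single process and prints a timestamp when SIGTERM arrives.

```python
import signal
import threading
import time

# Set by the signal handler so the main loop can exit cleanly.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Log the signal with a timestamp and ask the main loop to stop."""
    print(f"{time.time():.3f} received {signal.Signals(signum).name}", flush=True)
    stop_requested.set()


def main():
    """Run until SIGTERM is received, then return exit code 0."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    print(f"{time.time():.3f} component started", flush=True)
    # All work happens in this process; no child processes are spawned.
    while not stop_requested.wait(timeout=1.0):
        pass
    print(f"{time.time():.3f} shutting down cleanly", flush=True)
    return 0
```

Ending the file with `raise SystemExit(main())` would turn it into a run script. If the "received SIGTERM" line never shows up in the component log before the process disappears, the signal was either not delivered or was followed too quickly by SIGKILL.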

Expected behavior
I expected to get a SIGTERM signal first and then after a while a hard SIGKILL.
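
The expected sequence can be sketched as follows. This is an illustration of the behavior described above, not the Nucleus implementation; `stop_gracefully` is a hypothetical helper, and the 5 second default is only taken from the grace period mentioned later in this report.

```python
import subprocess


def stop_gracefully(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the child's return code. On POSIX this is the negative
    signal number if the child was terminated by a signal.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

A process that handles SIGTERM gets the full grace period to exit on its own; only a process that is still alive afterwards is killed.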

Actual behavior
The application is killed immediately. The logs say a SIGTERM is sent (force=false), followed by a SIGKILL (force=true), but I never see the SIGTERM. Either I don't receive the SIGTERM at all, or I don't get enough time to react to it.

Environment

  • OS: Poky (Yocto Project Reference Distro) 3.1.33 (dunfell), Linux 5.5.3
  • JDK version:
openjdk 11.0.23 2024-04-16 LTS
OpenJDK Runtime Environment Corretto-11.0.23.9.1 (build 11.0.23+9-LTS)
OpenJDK Server VM Corretto-11.0.23.9.1 (build 11.0.23+9-LTS, mixed mode)
  • Nucleus version: 2.11.3

Additional context
In the debug logs, a couple of things seem odd to me:

  • The time between "force is false" and "force is true" is well under a second (about 400 ms in the logs below). The code suggests it should be 5 seconds.
  • The logged process shows two commands (bluetoothctl power on\npython3 -u), but ps -ef on the machine shows only a python3... process. I guess the logged string is taken from the config rather than from the actual running process.
  • The logs say "Found children of 4479. []", so the list of children is empty. (This might be the reason there's no SIGTERM. If there are no children, I would expect the parent process itself to get a SIGTERM.)
2024-11-14T11:31:28.998Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-config-change. Requesting restart for component. {configNode=services.com.bf.ddd.Ble.lifecycle.Run.Setenv.PYTHONPATH, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:28.999Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.001Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutting down process ["bluetoothctl power on\npython3 -u  /data/greengrass/v2/packages/artifacts-unarc..."]. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.012Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.014Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is false. {}
2024-11-14T11:31:29.017Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.126Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-config-change. Requesting reinstallation for component. {configNode=services.com.bf.ddd.Ble.version, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.143Z [DEBUG] (pool-2-thread-24) com.aws.greengrass.deployment.activator.DeploymentActivator: merge-config. Applied new service config. Waiting for services to complete update. {serviceToTrack=[services.aws.greengrass.ShadowManager, services.aws.greengrass.DiskSpooler, services.aws.greengrass.Nucleus:FINISHED, services.aws.greengrass.LogManager, services.com.bf.ddd.Ble:STOPPING], mergeTime=1731583888628}
2024-11-14T11:31:29.410Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Found children of 4479. []. {}
2024-11-14T11:31:29.413Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.414Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is true. {}
2024-11-14T11:31:29.416Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.766Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Found children of 4479. []. {}
2024-11-14T11:31:29.773Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing pid 4479 with signal 9 using kill -9 4479. {}
2024-11-14T11:31:29.782Z [DEBUG] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler: aws.greengrass#SubscribeToIoTCore stream continuation closed.. {}
2024-11-14T11:31:29.789Z [DEBUG] (AwsEventLoop 1) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Unsubscribing from topic. {clientId=Dock-d49cdd487732, topic=bf/Dock-d49cdd487732/measurements/openmhealth/+/accepted}
2024-11-14T11:31:29.793Z [DEBUG] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler: aws.greengrass#SubscribeToTopic stream continuation closed.. {}
2024-11-14T11:31:29.795Z [DEBUG] (AwsEventLoop 1) com.aws.greengrass.builtin.services.pubsub.PubSubIPCEventStreamAgent: Unsubscribed from topic $aws/things/Dock-d49cdd487732/shadow/name/linked-devices/update/accepted. {componentName=com.bf.ddd.Ble}
2024-11-14T11:31:29.798Z [INFO] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.RpcServer: Server connection closed code [socket is closed.]: [Id 37, Class ServerConnection, Refs 1](2024-11-14T11:25:49.483789Z) - <null>. {}
2024-11-14T11:31:29.842Z [WARN] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: kill exited non-zero (process not found or other error). {stdout=, pid=4479, exit-code=1, stderr=kill: (4479): No such process}
2024-11-14T11:31:29.847Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown completed for process ["bluetoothctl power on\npython3 -u  /data/greengrass/v2/packages/artifacts-unarc..."]. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.849Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.853Z [INFO] (Copier) com.aws.greengrass.lifecyclemanager.GenericExternalService: Run script exited. {exitCode=137, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
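
The gap between the two kill attempts can be read off the timestamps in the excerpt above. A minimal sketch, assuming the log line format shown here; `parse_timestamp` and `seconds_between_kill_attempts` are hypothetical helper names, and the two sample lines are copied from the log.

```python
from datetime import datetime

SAMPLE = [
    "2024-11-14T11:31:29.014Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is false. {}",
    "2024-11-14T11:31:29.414Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is true. {}",
]


def parse_timestamp(line):
    """Parse the leading UTC timestamp of a log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_between_kill_attempts(lines):
    """Seconds from the first 'force is false' line to the next 'force is true' line.

    Returns None if either line is missing.
    """
    soft = hard = None
    for line in lines:
        if soft is None and "force is false" in line:
            soft = parse_timestamp(line)
        elif soft is not None and hard is None and "force is true" in line:
            hard = parse_timestamp(line)
    if soft is None or hard is None:
        return None
    return (hard - soft).total_seconds()


print(seconds_between_kill_attempts(SAMPLE))
```

For the two sample lines this gives 0.4 seconds, far short of the 5 seconds the reporter expected from the code.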
@timvlaer timvlaer added bug Something isn't working needs-triage Needs eyeballs labels Nov 14, 2024
@timvlaer
Author

When I restart greengrass via systemctl (systemctl restart greengrass), the system works as expected. My component gets a SIGTERM and properly terminates (exit code 0).

(I get exit code 137 when the component is killed.)
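
Exit code 137 is consistent with the SIGKILL in the logs: by shell convention, an exit code above 128 means the process was terminated by signal number (code - 128), and 137 - 128 = 9. A small sketch of that decoding, assuming a Linux host; `describe_exit_code` is a hypothetical helper.

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit code (above 128 means 128 + signal number)."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (no matching signal)"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(0))
```

By the same convention, a process that dies from an unhandled SIGTERM would be reported as 143.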

@timvlaer
Author

Right now, I cannot quickly bump the aws.greengrass.Nucleus component to the latest 2.13.0 because it's incompatible with aws.greengrass.ShadowManager 2.3.6 which I also use.
I'd like to avoid bumping all my dependencies without good reason, so let me know if you think it's worth the effort.

@timvlaer timvlaer changed the title (Nucleus): service is not properly terminated when a new version is deployed (Nucleus): component is not gracefully terminated when a new version is deployed Nov 14, 2024
@timvlaer
Author

timvlaer commented Nov 15, 2024

As a workaround, I added a shutdown lifecycle script to the component's recipe.

Manifests:
  - Platform:
      os: all
    Artifacts:  
      [...]
    Lifecycle:
      Run:
        [...]
      Shutdown:
        # I added an explicit shutdown script because Greengrass' Nucleus doesn't 
        #   gracefully exit a component when a new version is deployed. 
        # This script acts as a workaround.
        # The script looks for the process id of the python application and 
        #   sends a SIGTERM to it if it's found.
        RequiresPrivilege: true
        Script: |
          pkill --signal TERM --full 'ble/main.py' || true

Heads up: you will probably have to adapt the following in the script:

  • The search string after --full must match the component's run script. Run ps aux on your Greengrass installation to validate it.
  • I added || true to suppress exit code 1 when the process cannot be found. I don't want Greengrass to behave differently or log errors if the component has already been killed.

@aws-kevinrickard
Member

Thank you for the detailed report. We will look into this and get back to you about the fix.

About updating versions: our plugin-type components (such as ShadowManager) are meant to be used with the minor version of Nucleus they have been tested with. So to bump Nucleus to 2.13.x, you should also bump all your plugin components at the same time. For ShadowManager, this would be 2.3.9.

@timvlaer
Author

@aws-kevinrickard thanks for explaining the update strategy. Let me know if you think updating the version would solve the issue or give you more insight. I didn't see any changes in the relevant code across releases (see #1668), but I might be wrong.

My setup involves a couple of different devices and testing is a little cumbersome, so I'd prefer to bump the version only when it adds value. I hope that makes sense.

@aws-kevinrickard
Member

@timvlaer We don't have reason to believe that upgrading the version to 2.13 would solve it.

Thanks also for the draft PR.
