Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS IoT Device Client killing job-enabled processes when STDOUT limit is exceeded #461

Open
cah-rita-kassar opened this issue Jun 25, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@cah-rita-kassar
Copy link

cah-rita-kassar commented Jun 25, 2024

Summary
We have built the AWS IoT device client for armv7 architecture and deployed it on our device. We were also able to connect the device to AWS IoT Core and successfully execute jobs. Our job executions successfully reported both standard output and standard error to AWS IoT Core.

Expectation
Running a job with an output that exceeds the limit of 1024bytes, we were expecting the jobs feature within AWS IoT Device Client to publish the last 1024 characters of standard output as part of the job execution update, as mentioned in AWS documentation.

Actual Result
Instead, once the standard output reached its limit, the processes initiated by AWS jobs were unexpectedly terminated and killed abruptly. Afterwards, the job execution was considered successful and the first 1024 characters logged before termination were reported as standard output

Steps To Reproduce
The AWS IoT Device Client config file looks as follows:

{
  "endpoint": "<endpoint>",
  "cert": "<cert_file_path>/cert.pem.crt",
  "key": "<key_file_path>/private.pem.key",
  "root-ca": "",
  "thing-name": "<thing_name>",
  "logging": {
    "level": "DEBUG",
    "destination": "stdout"
  },
  "jobs": {
    "enabled": true,
    "handler-directory": ""
  },
  "tunneling": {
    "enabled": false
  },
  "device-defender": {
    "enabled": false,
    "interval": 300
  },
  "fleet-provisioning": {
    "enabled": false,
    "template-parameters": "",
    "template-name": "",
    "csr-file": "",
    "device-key": ""
  },
  "samples": {
    "pub-sub": {
      "enabled": false,
      "publish-topic": "",
      "publish-file": "",
      "subscribe-topic": "",
      "subscribe-file": ""
    }
  },
  "config-shadow": {
    "enabled": false
  },
  "sample-shadow": {
    "enabled": false,
    "shadow-name": "",
    "shadow-input-file": "",
    "shadow-output-file": ""
  },
  "secure-element": {
    "enabled": false,
    "pkcs11-lib": "",
    "secure-element-pin": "",
    "secure-element-key-label": "",
    "secure-element-slot-id": "",
    "secure-element-token-label": ""
  }
}

The job document is configured to run a script "run.sh" as such:

bash run.sh --script echo_stdout_limit.sh

And our "echo_stdout_limit.sh" contains:

for i in {0..33000}
do
    echo "111111111111111111111111111111111"
done

echo "After exceeding stdout limit"

The AWS IoT Device Client logs came as follows:
image

And the job execution results on the AWS IoT Console were as follows:
image

@cah-rita-kassar cah-rita-kassar added the bug Something isn't working label Jun 25, 2024
@ig15
Copy link
Contributor

ig15 commented Sep 19, 2024

Hi @cah-rita-kassar. Thanks for reaching out to us. I tried to reproduce the issue on my machine and got the expected output which displays the last 1024 characters of the standard output and execution failed.
I just updated the echo_stdout_limit.sh to print the value of the counter $i instead of repeated 1s for easier debugging.

The AWS IoT Device Client logs came as follows:
image

And the job execution results on the AWS IoT Console were as follows:

image

Can you share any more details about the issue like the job document schema or file you are using and the system information if the issue still persists

@cah-rita-kassar
Copy link
Author

Hello @ig15, thank you for your reply

Our main concern was actually the job being killed before executing entirely. In this case, it is being killed after the 995th iteration (since it reached the stdout limit) which indicates that echo_stdout_limit.sh script did not finish executing.

Our expectations
When a job's execution reaches the stdout limit, we were expecting the job to execute successfully, but the AWS IoT Device Client to only publish the last 1024 characters of the stdout to the AWS Console

Actual Behavior
Instead, the job will actually be killed whenever the stdout limit is reached, and the AWS IoT Device Client publishes the last 1024 characters of the stdout which contains the output of the script before being killed

So if you look back at our echo_stdout_limit.sh script, we have an echo line after the loop to validate whether the script executes entirely or partially. When the script executes entirely, or iterates through the entire loop, we expect to get in the stdout the following echo line "After exceeding stdout limit". Since this line is not being echoed, we can clearly understand that the script is not executing entirely; instead, whenever the stdout limit gets exceeded, the running job will actually be killed instantaneously.

@HarshGandhi-AWS
Copy link
Contributor

Our main concern was actually the job being killed before executing entirely.

It is expected to job to be killed. If you check the code over here, we limit any job from printing more than a specific number of log lines to avoid starvation of other pending jobs.

When a job's execution reaches the stdout limit, we were expecting the job to execute successfully

If I see your initial screenshot of logs, I only see DEBUG messages and no ERROR messages were printed. Also it says Job excited with status: 0 and Job executed successfully which means device client might have stopped printing the echo messages after SDTOUT limit was reached but it still marked the job as to be completed successfully and sent back the job execution status of 1024 char as expected.

I think the job execution was completed as expected by device client. I do not understand what is the issue over here. If you wish the device client to not stop printing log lines after reaching the limit then you will have to update the code to either not have any limit or increase the limit as per your use case. You will have to make this change on your side since we won't be able to remove this limit for all of our customers.

Also, can you tell us more about your use-case and why you wish to print these many log messages?

@cah-rita-kassar
Copy link
Author

Our concern is that the job indeed exited with status 0 and was "considered" successful, but it did not actually execute completely.
Here's a better example to explain our issue:
we changed our script to do the following:

for i in {0..35000}
do
  echo "$i"
done

echo "After exceeding stdout limit"
touch /tmp/file_after_echo.txt

Expected behavior

So we are now expecting the script to touch a file "file_after_echo.txt" at the end of its execution

Actual result

AWS IoT Device Client logs:
image

Job execution results on the AWS IoT Console were as follows:
image
image

Also, the file that is supposed to be created and the end of the script was not under /tmp directory:
Running the following command
ls -l /tmp | grep "file_after_echo.txt"
returned nothing, which means that the script was considered successful and exited with status 0 but did not finish its entire execution

@HarshGandhi-AWS
Copy link
Contributor

It is not printing because we have a limit of 1000 log lines printed by Device Client per child process.

/**
* \brief The maximum number of lines that we'll read from STDOUT or STDERR of the child process
* before stopping. This prevents against log corruption in the event that the specified
* job generates a large volume of output
*/
static constexpr size_t MAX_LOG_LINES = 1000;

@HarshGandhi-AWS
Copy link
Contributor

Try updating your script to only loop for 800 times. It should work as expected. Again, can you tell us more about your use-case and why you wish to print these many log messages?

@cah-rita-kassar
Copy link
Author

Yes, I understand that the Client stdout is limited to a 1000 log lines
We don't actually want to print that many log messages, but we just want to validate that our jobs would execute completely, even though, for whatever reason, the job output exceeded the stdout limit.

In the previous example mentioned above, we have no problem regarding the log lines being limited, but we do have a concern regarding the last line of our job (touching the file under /tmp directory) not being executed, due to the fact that the stdout limit was reached

image

We were expecting the file to be touched, eventhough the stdout limit got exceeded. In other words, we were expecting the job to execute completely, even though it will only return the last 1024 chars due to the fact that the stdout limit was exceeded.
The actual behavior turned out to be that the job was killed after reaching the stdout limit. The last 1024 characters were indeed reported to the console but the job was considered successful whereas it did not finish its execution.

@HarshGandhi-AWS
Copy link
Contributor

We don't actually want to print that many log messages, but we just want to validate that our jobs would execute completely

To do so, reduce the number of line you input and the job will succeed.

even though, for whatever reason, the job output exceeded the stdout limit

Device client is coded to stop the log line print task once the limit is exceeded. To avoid this from happening you will either have to
(A) Create a script which does not print more than 1000 line of code
OR
(B) Update the code over here to not stop even the limit of 1000 line is breached.

We won't be able to make this change since for the rest of the customer's we wish to have the same functionality of stopping the Job Execution successfully after the log line limit is breached.

We were expecting the file to be touched, eventhough the stdout limit got exceeded. In other words, we were expecting the job to execute completely, even though it will only return the last 1024 chars due to the fact that the stdout limit was exceeded.

I am not sure if this is possible. The job execution is being executed on a child process. We cannot stop the child process from printing more logs without terminating the child process. Please let me know if you think it is possible and I would encourage you to push a PR for the same.

@cah-rita-kassar
Copy link
Author

Thank you @HarshGandhi-AWS for your time
We just wanted to validate that the current behavior does kill the job when the stdout limit is exceeded. We wanted to make sure it was not a problem from our side, since we were expecting the job to finish its execution but only report the last 1024 characters to the Console.

@cah-michel-bitar
Copy link

@HarshGandhi-AWS it's just that based on the documentation, 'The Jobs feature within the AWS IoT Device Client automatically publishes the last 1024 characters', we expected the client to publish the last 1024 characters but let the job finish executing it.

From a technical perspective I thought all stdout would be stored in memory, and finally only the last 1024 lines would be logged and published as the job result.

That being said, I understand now that this is meant to work that way and that there are no plans on changing this right?
Also, any specific reason for the specific number 1024? Is it just to prevent cost on IOT Core?

@HarshGandhi-AWS
Copy link
Contributor

@cah-michel-bitar

it's just that based on the documentation, 'The Jobs feature within the AWS IoT Device Client automatically publishes the last 1024 characters', we expected the client to publish the last 1024 characters but let the job finish executing it.

Let me check the documentation and check with my team if we want to change the behavior. Give us some time to get back to you on this.

From a technical perspective I thought all stdout would be stored in memory, and finally only the last 1024 lines would be logged and published as the job result.

It would not be wise to store it in memory since many IoT devices have memory constraints. It might cause issues for smaller devices and might also affect the performance.

That being said, I understand now that this is meant to work that way and that there are no plans on changing this right?

I will have to check with team if we want to change this behavior or not. Let me get back to you on that.

Also, any specific reason for the specific number 1024? Is it just to prevent cost on IOT Core?

This limit is from IoT Jobs service side. We cannot update that via Device Client. I suspect the service side limit is because of the MQTT message size limits and the update job execution API call request should be sent in 1 MQTT message. This is my assumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants