[SPARK-49804][K8S] Fix to use the exit code of executor container always
### What changes were proposed in this pull request?

When deploying Spark pods on Kubernetes with sidecars, the reported executor exit code may be incorrect. For example, the reported exit code is 0 (success), but the actual exit code is 52 (OOM).

```
2024-09-25 02:35:29,383 ERROR TaskSchedulerImpl.logExecutorLoss - Lost executor 1 on XXXXX: The executor with id 1 exited with exit code 0(success).

The API gave the following container statuses:

	 container name: fluentd
	 container image: docker-images-release.XXXXX.com/XXXXX/fluentd:XXXXX
	 container state: terminated
	 container started at: 2024-09-25T02:32:17Z
	 container finished at: 2024-09-25T02:34:52Z
	 exit code: 0
	 termination reason: Completed

	 container name: istio-proxy
	 container image: docker-images-release.XXXXX.com/XXXXX-istio/proxyv2:XXXXX
	 container state: running
	 container started at: 2024-09-25T02:32:16Z

	 container name: spark-kubernetes-executor
	 container image: docker-dev-artifactory.XXXXX.com/XXXXX/spark-XXXXX:XXXXX
	 container state: terminated
	 container started at: 2024-09-25T02:32:17Z
	 container finished at: 2024-09-25T02:35:28Z
	 exit code: 52
	 termination reason: Error
```

`ExecutorPodsLifecycleManager.findExitCode()` looks for any terminated container and may choose the sidecar instead of the main executor container. I'm changing it to always look for the executor container.

Note that the pod may fail because a sidecar container fails while the executor container is still running. With my changes, the reported exit code in that case will be -1 (`UNKNOWN_EXIT_CODE`).

### Why are the changes needed?

To correctly report the executor failure reason in the UI, in the logs, and to event listeners via `SparkListener.onExecutorRemoved()`.

### Does this PR introduce _any_ user-facing change?

Yes, the executor's exit code is taken from the main container instead of the sidecar.

### How was this patch tested?

Added a unit test and tested manually on a Kubernetes cluster by simulating different types of executor failure (JVM OOM and container eviction due to disk pressure on the node).

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48275 from fe2s/SPARK-49804-fix-exit-code.

Lead-authored-by: oleksii.diagiliev <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
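
### Illustrative sketch of the exit-code selection

The difference between "any terminated container" and "the executor container only" can be sketched roughly as below. This is not the code from the patch: `ContainerStatus` here is a hypothetical, simplified stand-in for the Kubernetes container status model, `ExitCodeSketch`, `findExitCodeAnyContainer`, and `logExample` are names invented for illustration, and the executor container name is passed in explicitly as an assumption. Only the value -1 for the unknown exit code and the container names come from the description above.

```scala
object ExitCodeSketch {
  // Hypothetical, simplified container status: exitCode is None while the
  // container is still running, Some(code) once it has terminated.
  final case class ContainerStatus(name: String, exitCode: Option[Int])

  // -1, matching UNKNOWN_EXIT_CODE as stated in the commit message.
  val UnknownExitCode: Int = -1

  // Behaviour described as the bug: take the first terminated container,
  // whatever its name, so a cleanly exited sidecar can win.
  def findExitCodeAnyContainer(statuses: Seq[ContainerStatus]): Int =
    statuses
      .collectFirst { case ContainerStatus(_, Some(code)) => code }
      .getOrElse(UnknownExitCode)

  // Behaviour described as the fix: only consider the executor container,
  // and fall back to the unknown exit code if it has not terminated.
  def findExitCode(statuses: Seq[ContainerStatus], executorContainerName: String): Int =
    statuses
      .find(_.name == executorContainerName)
      .flatMap(_.exitCode)
      .getOrElse(UnknownExitCode)

  // Container statuses mirroring the log excerpt in the description.
  val logExample: Seq[ContainerStatus] = Seq(
    ContainerStatus("fluentd", Some(0)),
    ContainerStatus("istio-proxy", None),
    ContainerStatus("spark-kubernetes-executor", Some(52))
  )
}
```

With `logExample`, the first function picks up the `fluentd` sidecar's exit code, while the second returns the executor's. If only a sidecar has terminated and the executor container is still running, the second function falls back to `UnknownExitCode`, which corresponds to the -1 case noted in the description.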