pod-network-loss: cleanup fails because target pod has been restarted #591
Comments
I investigated the situation a bit further and think other experiments, such as pod-network-latency, are affected by the same problem:
A possible solution would be to replace the
With this approach, a container restart doesn't affect the cleanup. Additionally, before the cleanup there could also be a check to verify whether the network namespace still exists:
If not, it means the pod has been recreated in the meantime and no cleanup is necessary anymore.
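A minimal sketch of such a pre-cleanup check, assuming the helper records the inode of the target's network namespace (via /proc/&lt;pid&gt;/ns/net) at injection time; the function names are illustrative and not the actual Litmus code:

```go
package revertcheck

import (
	"fmt"
	"os"
	"syscall"
)

// netnsInode returns the inode of the network namespace the given PID is
// attached to; each namespace is uniquely identified by this inode.
func netnsInode(pid int) (uint64, error) {
	fi, err := os.Stat(fmt.Sprintf("/proc/%d/ns/net", pid))
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("unexpected Stat result for /proc/%d/ns/net", pid)
	}
	return st.Ino, nil
}

// namespaceStillExists reports whether the namespace observed at injection
// time is still the one behind the (re-resolved) PID. If not, the pod has
// been recreated in the meantime, the tc rule is already gone, and the
// cleanup can be skipped instead of failing the helper pod.
func namespaceStillExists(pid int, inodeAtInjection uint64) bool {
	ino, err := netnsInode(pid)
	return err == nil && ino == inodeAtInjection
}
```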
I did some prototyping and found that it is a bit more complicated than I originally thought. The
When running in a privileged container, however, it cannot resolve the "name" of the namespace:
While it would be possible to retrieve the value via the container runtime, similar to the container PID...
obviously the
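A sketch of entering the namespace by path instead of by name (an illustration of the idea, not the existing helper code): nsenter's --net option accepts a file, so /proc/&lt;pid&gt;/ns/net can be used directly and no named namespace under /var/run/netns is needed. The sketch assumes the PID of the pod's sandbox (pause) container can be obtained from the container runtime.

```go
package netns

import (
	"fmt"
	"os/exec"
)

// runInTargetNetNS runs a tc command inside the network namespace referenced
// by /proc/<sandboxPID>/ns/net. Passing the namespace as a file to nsenter's
// --net option avoids the named-namespace lookup that fails here. Using the
// sandbox (pause) container's PID is an assumption: its network namespace is
// shared by the whole pod and is not destroyed by an app-container restart.
func runInTargetNetNS(sandboxPID int, tcArgs ...string) error {
	args := append([]string{
		fmt.Sprintf("--net=/proc/%d/ns/net", sandboxPID), "--", "tc",
	}, tcArgs...)
	out, err := exec.Command("nsenter", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("nsenter tc %v failed: %v: %s", tcArgs, err, out)
	}
	return nil
}
```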
Thanks for raising this issue! We will add this feature in the subsequent releases. Thanks again for being so patient!
Hello, the root cause seems to be the same for us: the pod gets restarted due to a liveness probe, and the helper doesn't revert the chaos because the container ID / process changed, so the chaos never gets reverted. Since the container/process changed, would re-fetching the container ID and the process as part of the revert flow, right before the actual cleanup, be a viable solution?
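A rough sketch of that idea, with resolveContainerPID standing in as a hypothetical placeholder for whatever container-runtime lookup (docker/containerd/CRI inspect) the helper already performs at injection time; the interface name and tc arguments are assumptions for illustration:

```go
package revert

import (
	"fmt"
	"os/exec"
)

// resolveContainerPID is a hypothetical stand-in for the container-runtime
// lookup the helper already does at injection time to find the target
// container's PID.
func resolveContainerPID(podName, containerName string) (int, error) {
	return 0, fmt.Errorf("runtime lookup not implemented in this sketch")
}

// revertChaos re-fetches the container PID immediately before the cleanup
// instead of reusing the value captured at injection time, then deletes the
// netem qdisc inside that process's network namespace.
func revertChaos(podName, containerName string) error {
	pid, err := resolveContainerPID(podName, containerName)
	if err != nil {
		return fmt.Errorf("re-resolving target container: %w", err)
	}
	cmd := exec.Command("nsenter", "-t", fmt.Sprint(pid), "-n",
		"tc", "qdisc", "del", "dev", "eth0", "root")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("reverting tc rule: %v: %s", err, out)
	}
	return nil
}
```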
I think the solution I proposed still has a gap, since the pod could still be restarted between the time we fetch the container ID/process and the actual cleanup.
BUG REPORT
What happened:
I ran the chaos experiment 'pod-network-loss' against a pod; the experiment successfully injected the Linux traffic control rule to block the traffic. This caused the pod to fail its network-based liveness probe, so Kubernetes killed and restarted the pod. Once the experiment ended, the attempt to revert the traffic control rule failed because the original process in the pod was no longer running. This left the helper pod in a failed state and the application pod stuck in CrashLoopBackOff, since the network-based readiness probe could not succeed while the traffic was still blocked.
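For illustration, the injected rule and its revert look roughly like the sketch below (interface name, loss percentage, and the exact tc/nsenter invocation are assumptions, not the experiment's actual code); the revert is the part that breaks once the original process is gone:

```go
package networkloss

import (
	"fmt"
	"os/exec"
)

// nsenterTC runs a tc command inside the network namespace of the process
// identified by containerPID, i.e. the general pattern of entering the
// target container via nsenter by PID.
func nsenterTC(containerPID int, tcArgs ...string) error {
	args := append([]string{"-t", fmt.Sprint(containerPID), "-n", "tc"}, tcArgs...)
	out, err := exec.Command("nsenter", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("tc %v failed: %v: %s", tcArgs, err, out)
	}
	return nil
}

// injectLoss adds a netem qdisc that drops all packets on eth0.
func injectLoss(pid int) error {
	return nsenterTC(pid, "qdisc", "add", "dev", "eth0", "root", "netem", "loss", "100%")
}

// revertLoss removes the qdisc again. This is the step that fails in the
// report above: after the restart the original PID no longer exists, so
// nsenter cannot attach to its network namespace.
func revertLoss(pid int) error {
	return nsenterTC(pid, "qdisc", "del", "dev", "eth0", "root")
}
```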
What you expected to happen:
Once the experiment completes, the traffic control rule is removed so that the application pod can function properly again.
How to reproduce it (as minimally and precisely as possible):
Run the experiment against a pod with a network-based liveness probe, using a TOTAL_CHAOS_DURATION long enough that the liveness probe reaches the failure threshold and the pod is killed.
Anything else we need to know?: