@smarterclayton
March 2016
Within a pod there is a need to initialize local data or adapt to the current cluster environment that is not easily achieved in the current container model. Containers start in parallel after volumes are mounted, leaving no opportunity for coordination between containers without specialization of the image. If two containers need to share common initialization data, both images must be altered to cooperate using filesystem or network semantics, which introduces coupling between images. Likewise, if an image requires configuration in order to start and that configuration is environment dependent, the image must be altered to add the necessary templating or retrieval.
This proposal introduces the concept of an init container, one or more containers started in sequence before the pod's normal containers are started. These init containers may share volumes, perform network operations, and perform computation prior to the start of the remaining containers. They may also, by virtue of their sequencing, block or delay the startup of application containers until some precondition is met. In this document we refer to the existing pod containers as app containers.
This proposal also provides a high level design of volume containers, which initialize a particular volume, as a feature that specializes some of the tasks defined for init containers. The init container design anticipates the existence of volume containers and highlights where they will take future work
- Init containers should be able to:
- Perform initialization of shared volumes
- Download binaries that will be used in app containers as execution targets
- Inject configuration or extension capability to generic images at startup
- Perform complex templating of information available in the local environment
- Initialize a database by starting a temporary execution process and applying schema info.
- Delay the startup of application containers until preconditions are met
- Register the pod with other components of the system
- Perform initialization of shared volumes
- Reduce coupling:
- Between application images, eliminating the need to customize those images for Kubernetes generally or specific roles
- Inside of images, by specializing which containers perform which tasks (install git into init container, use filesystem contents in web container)
- Between initialization steps, by supporting multiple sequential init containers
- Init containers allow simple start preconditions to be implemented that are
decoupled from application code
- The order init containers start should be predictable and allow users to easily reason about the startup of a container
- Complex ordering and failure will not be supported - all complex workflows can if necessary be implemented inside of a single init container, and this proposal aims to enable that ordering without adding undue complexity to the system. Pods in general are not intended to support DAG workflows.
- Both run-once and run-forever pods should be able to use init containers
- As much as possible, an init container should behave like an app container to reduce complexity for end users, for clients, and for divergent use cases. An init container is a container with the minimum alterations to accomplish its goal.
- Volume containers should be able to:
- Perform initialization of a single volume
- Start in parallel
- Perform computation to initialize a volume, and delay start until that volume is initialized successfully.
- Using a volume container that does not populate a volume to delay pod start (in the absence of init containers) would be an abuse of the goal of volume containers.
- Container pre-start hooks are not sufficient for all initialization cases:
- They cannot easily coordinate complex conditions across containers
- They can only function with code in the image or code in a shared volume, which would have to be statically linked (not a common pattern in wide use)
- They cannot be implemented with the current Docker implementation - see #140
- Any mechanism that runs user code on a node before regular pod containers should itself be a container and modeled as such - we explicitly reject creating new mechanisms for running user processes.
- The container pre-start hook (not yet implemented) requires execution within the container's image and so cannot adapt existing images. It also cannot block startup of containers
- Running a "pre-pod" would defeat the purpose of the pod being an atomic unit of scheduling.
Each pod may have 0..N init containers defined along with the existing 1..M app containers.
On startup of the pod, after the network and volumes are initialized, the init containers are started in order. Each container must exit successfully before the next is invoked. If a container fails to start (due to the runtime) or exits with failure, it is retried according to the pod RestartPolicy. RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways pods will retry the failing init container with increasing backoff until it succeeds. To align with the design of application containers, init containers will only support "infinite retries" (RestartPolicyAlways) or "no retries" (RestartPolicyNever).
A pod cannot be ready until all init containers have succeeded. The ports
on an init container are not aggregated under a service. A pod that is
being initialized is in the Pending
phase but should have a distinct
condition. Each app container and all future init containers should have
the reason PodInitializing
. The pod should have a condition Initializing
set to false
until all init containers have succeeded, and true
thereafter.
If the pod is restarted, the Initializing
condition should be set to `false.
If the pod is "restarted" all containers stopped and started due to a node restart, change to the pod definition, or admin interaction, all init containers must execute again. Restartable conditions are defined as:
- An init container image is changed
- The pod infrastructure container is restarted (shared namespaces are lost)
- The Kubelet detects that all containers in a pod are terminated AND no record of init container completion is available on disk (due to GC)
Changes to the init container spec are limited to the container image field. Altering the container image field is equivalent to restarting the pod.
Because init containers can be restarted, retried, or reexecuted, container authors should make their init behavior idempotent by handling volumes that are already populated or the possibility that this instance of the pod has already contacted a remote system.
Each init container has all of the fields of an app container. The following fields are prohibited from being used on init containers by validation:
readinessProbe
- init containers must exit for pod startup to continue, are not included in rotation, and so cannot define readiness distinct from completion.
Init container authors may use activeDeadlineSeconds
on the pod and
livenessProbe
on the container to prevent init containers from failing
forever. The active deadline includes init containers.
Because init containers are semantically different in lifecycle from app containers (they are run serially, rather than in parallel), for backwards compatibility and design clarity they will be identified as distinct fields in the API:
pod:
spec:
containers: ...
initContainers:
- name: init-container1
image: ...
...
- name: init-container2
...
status:
containerStatuses: ...
initContainerStatuses:
- name: init-container1
...
- name: init-container2
...
This separation also serves to make the order of container initialization clear - init containers are executed in the order that they appear, then all app containers are started at once.
The name of each app and init container in a pod must be unique - it is a validation error for any container to share a name.
While pod containers are in alpha state, they will be serialized as an annotation
on the pod with the name pod.alpha.kubernetes.io/init-containers
and the status
of the containers will be stored as pod.alpha.kubernetes.io/init-container-statuses
.
Mutation of these annotations is prohibited on existing pods.
Given the ordering and execution for init containers, the following rules for resource usage apply:
- The highest of any particular resource request or limit defined on all init containers is the effective init request/limit
- The pod's effective request/limit for a resource is the higher of:
- sum of all app containers request/limit for a resource
- effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the pod.
- The lowest QoS tier of init containers per resource is the effective init QoS tier, and the highest QoS tier of both init containers and regular containers is the effective pod QoS tier.
So the following pod:
pod:
spec:
initContainers:
- limits:
cpu: 100m
memory: 1GiB
- limits:
cpu: 50m
memory: 2GiB
containers:
- limits:
cpu: 10m
memory: 1100MiB
- limits:
cpu: 10m
memory: 1100MiB
has an effective pod limit of cpu: 100m
, memory: 2200MiB
(highest init
container cpu is larger than sum of all app containers, sum of container
memory is larger than the max of all init containers). The scheduler, node,
and quota must respect the effective pod request/limit.
In the absence of a defined request or limit on a container, the effective request/limit will be applied. For example, the following pod:
pod:
spec:
initContainers:
- limits:
cpu: 100m
memory: 1GiB
containers:
- request:
cpu: 10m
memory: 1100MiB
will have an effective request of 10m / 1100MiB
, and an effective limit
of 100m / 1GiB
, i.e.:
pod:
spec:
initContainers:
- request:
cpu: 10m
memory: 1GiB
- limits:
cpu: 100m
memory: 1100MiB
containers:
- request:
cpu: 10m
memory: 1GiB
- limits:
cpu: 100m
memory: 1100MiB
and thus have the QoS tier Burstable (because request is not equal to limit).
Quota and limits will be applied based on the effective pod request and limit.
Pod level cGroups will be based on the effective pod request and limit, the same as the scheduler.
Container runtimes should treat the set of init and app containers as one large pool. An individual init container execution should be identical to an app container, including all standard container environment setup (network, namespaces, hostnames, DNS, etc).
All app container operations are permitted on init containers. The logs for an init container should be available for the duration of the pod lifetime or until the pod is restarted.
During initialization, app container status should be shown with the reason
PodInitializing if any init containers are present. Each init container
should show appropriate container status, and all init containers that are
waiting for earlier init containers to finish should have the reason
PendingInitialization.
The container runtime should aggressively prune failed init containers. The container runtime should record whether all init containers have succeeded internally, and only invoke new init containers if a pod restart is needed (for Docker, if all containers terminate or if the pod infra container terminates). Init containers should follow backoff rules as necessary. The Kubelet must preserve at least the most recent instance of an init container to serve logs and data for end users and to track failure states. The Kubelet should prefer to garbage collect completed init containers over app containers, as long as the Kubelet is able to track that initialization has been completed. In the future, container state checkpointing in the Kubelet may remove or reduce the need to preserve old init containers.
For the initial implementation, the Kubelet will use the last termination container state of the highest indexed init container to determine whether the pod has completed initialization. During a pod restart, initialization will be restarted from the beginning (all initializers will be rerun).
All APIs that access containers by name should operate on both init and app containers. Because names are unique the addition of the init container should be transparent to use cases.
A client with no knowledge of init containers should see appropriate
container status reason
and message
fields while the pod is in the
Pending
phase, and so be able to communicate that to end users.
-
Wait for a service to be created
pod: spec: initContainers: - name: wait image: centos:centos7 command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; exit 1"] containers: - name: run image: application-image command: ["/my_application_that_depends_on_myservice"]
-
Register this pod with a remote server
pod: spec: initContainers: - name: register image: centos:centos7 command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"] env: - name: POD_NAME valueFrom: field: metadata.name - name: POD_IP valueFrom: field: status.podIP containers: - name: run image: application-image command: ["/my_application_that_depends_on_myservice"]
-
Wait for an arbitrary period of time
pod: spec: initContainers: - name: wait image: centos:centos7 command: ["/bin/sh", "-c", "sleep 60"] containers: - name: run image: application-image command: ["/static_binary_without_sleep"]
-
Clone a git repository into a volume (can be implemented by volume containers in the future):
pod: spec: initContainers: - name: download image: image-with-git command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"] volumeMounts: - mountPath: /var/lib/data volumeName: git containers: - name: run image: centos:centos7 command: ["/var/lib/data/binary"] volumeMounts: - mountPath: /var/lib/data volumeName: git volumes: - emptyDir: {} name: git
-
Execute a template transformation based on environment (can be implemented by volume containers in the future):
pod: spec: initContainers: - name: copy image: application-image command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"] volumeMounts: - mountPath: /var/lib/data volumeName: data - name: transform image: image-with-jinja command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"] volumeMounts: - mountPath: /var/lib/data volumeName: data containers: - name: run image: application-image command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"] volumeMounts: - mountPath: /var/lib/data volumeName: data volumes: - emptyDir: {} name: data
-
Perform a container build
pod: spec: initContainers: - name: copy image: base-image workingDir: /home/user/source-tree command: ["make"] containers: - name: commit image: image-with-docker command: - /bin/sh - -c - docker commit $(complex_bash_to_get_container_id_of_copy) \ docker push $(commit_id) myrepo:latest volumesMounts: - mountPath: /var/run/docker.sock volumeName: dockersocket
Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not be able to rely on Kubelets implementing init containers. The management of feature skew between master and Kubelet is tracked in issue #4855.
- Unify pod QoS class with init containers
- Implement container / image volumes to make composition of runtime from images efficient