There's a known issue where a mismatch between the reported and actual attachment capacity on nodes can result in scheduling errors and stuck workloads. This commonly occurs when volume slots are consumed after the driver starts up, which results in `kube-scheduler` assigning stateful pods to nodes lacking the capacity to support their attachments. As a consequence, volumes can become stuck in the attaching state until a slot is freed, leading to prolonged delays in pod startup.
Today, CSI plugins report node attachment capacity only once, at startup, via the `NodeGetInfo` RPC. This static reporting fails to reflect any subsequent changes in capacity (which may occur when dynamically allocated ENIs or non-CSI devices consume attachment slots).
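You can inspect the capacity a node last reported through its `CSINode` object. A minimal sketch (assumes the default driver name `ebs.csi.aws.com`; substitute your node name):

```bash
kubectl get csinode "$NODE_NAME" \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
```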
While a long-term fix is being worked on (see kubernetes/enhancements#4875), you can adopt one or more of the following solutions to mitigate this issue:
- Use Dedicated EBS Instance Types: Gen7 and later EC2 instance types have dedicated EBS volume limits and are not affected by dynamic ENI attachments taking up volume slots.
- Enable VPC CNI's Prefix Delegation Feature: This can reduce the number of ENIs needed in your cluster. See the aws-eks-best-practices/networking docs for recommendations and further instructions.
- Use the `--volume-attach-limit` CLI Option: Configure the driver with this option to explicitly specify the limit of volumes to be reported to Kubernetes. This is useful when you have a known safe limit (see the Helm values sketch after this list).
- Use the `--reserved-volume-attachments` CLI Option: Configure the driver with this option to reserve a number of slots for non-CSI volumes. These reserved slots will be subtracted from the total slots reported to Kubernetes.
- Use Multiple DaemonSets: For clusters that need a mix of the above solutions across different groups of nodes, the Helm chart can construct multiple `DaemonSets` via the `additionalDaemonSets` parameter. See Additional DaemonSets for more information.
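As a sketch, the two CLI options above can be set through the Helm chart's `node` values (the `node.volumeAttachLimit` and `node.reservedVolumeAttachments` parameter names are assumptions based on the chart's values; verify them against your chart version). Set only one of the two, as they are alternative strategies for the same problem:

```yaml
node:
  # Option A: report a fixed, known-safe attachment limit
  volumeAttachLimit: 25
  # Option B (use instead of Option A): reserve slots for non-CSI attachments
  # reservedVolumeAttachments: 2
```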
After a node is preempted, pods using persistent volumes may experience a 6+ minute restart delay. This occurs when a volume is not cleanly unmounted, as indicated by `Node.Status.VolumesInUse`. The Attach/Detach controller in `kube-controller-manager` waits 6 minutes before issuing a force detach. This delay exists to prevent potential data corruption from unclean mounts.
- The EBS CSI Driver node pod on the node may get terminated before the workload's volumes are unmounted, leading to unmount failures.
- The operating system might shut down before volume unmount is completed, even with remaining time in the `shutdownGracePeriod`.
Pods affected by this delay typically show an event like:

```
Warning  FailedAttachVolume  6m51s  attachdetach-controller  Multi-Attach error for volume "pvc-4a86c32c-fbce-11ea-b7d2-0ad358e4e824" Volume is already exclusively attached to one node and can't be attached to another
```
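To check whether a node still reports a volume as in use (i.e. not cleanly unmounted), you can read `Node.Status.VolumesInUse` directly. A minimal sketch:

```bash
kubectl get node "$NODE_NAME" -o jsonpath='{.status.volumesInUse}'
```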
- Graceful Termination: Ensure instances are gracefully terminated, which is a best practice in Kubernetes. This allows the CSI driver to clean up volumes before the node is terminated, via the preStop lifecycle hook.
- Configure Kubelet for Graceful Node Shutdown: For unexpected shutdowns, it is highly recommended to configure the Kubelet for graceful node shutdown. Using the standard EKS-optimized AMI, you can configure the kubelet for graceful node shutdown with the following user data script:
```bash
#!/bin/bash
# Allow systemd-logind to delay shutdown long enough for the kubelet to react
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind
# Set the kubelet's graceful shutdown periods in its config file
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
systemctl restart kubelet
```
- Karpenter Best Practices:
  - Upgrade to Karpenter version ≥ v1.0.0, which waits to terminate nodes until all volumes have been detached from them.
  - When using Spot Instances with Karpenter, enable interruption handling to manage involuntary interruptions gracefully. Karpenter supports native interruption handling, which cordons, drains, and terminates nodes ahead of interruption events, maximizing workload cleanup time.
The PreStop hook executes to check for remaining `VolumeAttachment` objects when a node is being drained for termination. If the node is not being drained, the check is skipped. The hook uses Kubernetes informers to monitor the deletion of `VolumeAttachment` objects, waiting until all attachments associated with the node are removed before proceeding.
The PreStop lifecycle hook is enabled by default in the AWS EBS CSI driver. It will execute when a node is being drained for termination.
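To observe what the hook is waiting on, you can list the `VolumeAttachment` objects still associated with a draining node. A minimal sketch:

```bash
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached
```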
Version 1.25 of aws-ebs-csi-driver featured four improvements to better manage the EBS volume lifecycle for large-scale clusters.
At a high level:
- Batching EC2 `DescribeVolumes` API calls across CSI gRPC calls (#1819)
  - This greatly decreases the number of EC2 `DescribeVolumes` calls made by the driver, which significantly reduces your risk of region-level throttling of the `Describe*` EC2 API action.
- Increasing the default CSI sidecar `worker-threads` values (to 100 for all sidecars) (#1834)
  - E.g. the `external-provisioner` can be simultaneously running 100 `CreateVolume` operations, the `external-attacher` now has 100 goroutines for processing `VolumeAttachments`, etc.
  - This increases the number of in-flight EBS volume creations / attaches / modifications / deletions managed by the driver (which may increase your risk of region-level 'Mutating action' request throttling by the Amazon EC2 API).
  - Note: If you are running multiple clusters within a single AWS account and region and risk hitting your EC2 API throttling limits, see Request a limit increase and the Fine-tuning the CSI Sidecar worker-threads parameter section below.
- Increasing the default CSI sidecar `kube-api-qps` (to 20) and `kube-api-burst` (to 100) for all sidecars (#1834)
  - Each sidecar can now send a burst of up to 100 queries to the Kubernetes API server before throttling itself. It will then allow up to 20 more requests per second until it stops bursting.
  - This keeps Kubernetes objects (`PersistentVolume`, `PersistentVolumeClaim`, and `VolumeAttachment`) more synchronized with the actual state of AWS resources, at the cost of increasing the load on the K8s API server from `ebs-csi-controller` pods when many volume operations are happening at once.
- Increasing the default CSI sidecar `timeout` values (from 15s to 60s) for all sidecars (#1824)
  - E.g. the `external-attacher` will now give the driver up to 60s to report an attachment success/failure before retrying a `ControllerPublishVolume` call. The `external-attacher` no longer prematurely times out a `ControllerPublishVolume` call that would have taken 20s to return a success response.
  - This decreases the number of premature timeouts for CSI RPC calls, which reduces the number of replayed EC2 API requests made by, and waited on by, the driver (at the cost of a longer delay during a real driver timeout, e.g. a network blip leading to a lost `ControllerPublishVolume` response).
Both the EC2 API and the K8s CSI sidecars base their API throttling on the token bucket algorithm. The Amazon EC2 API Reference provides a thorough explanation and example of how this algorithm is applied: Request throttling for the Amazon EC2 API

Cluster operators can set the CSI sidecar `--kube-api-burst` (i.e. bucket size) and `--kube-api-qps` (i.e. bucket refill rate) parameters to fine-tune how strictly these containers throttle their queries toward the K8s API server. A back-of-the-envelope example follows.
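As a minimal sketch using the defaults above (bucket size 100, refill at 20 QPS), the number of queries a previously idle sidecar may send within its first N seconds is roughly:

```bash
# bucket size (--kube-api-burst), refill rate (--kube-api-qps), window in seconds
burst=100; qps=20; seconds=5
echo $((burst + qps * seconds))   # => 200 queries allowed in the first 5 seconds
```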
- Request throttling for the Amazon EC2 API | Amazon Web Services
- Managing and monitoring API throttling in your workloads | Amazon Web Services
- Reference: kube-apiserver | Kubernetes
- API Priority and Fairness | Kubernetes
In aws-ebs-csi-driver v1.25.0, we changed the following K8s CSI external sidecar parameters to more sensible defaults. See the summary section for an overview of how these parameters affect volume lifecycle management.
- `--worker-threads` (named `--workers` in the external-resizer)
- `--kube-api-burst`
- `--kube-api-qps`
- `--timeout`
The AWS EBS CSI Driver provides a set of default values intended to balance performance while reducing the risk of reaching the default EC2 rate limits for API calls.
Cluster operators can increase the `--kube-api-qps` and `--kube-api-burst` of each sidecar to keep the state of Kubernetes objects more in sync with their associated AWS resources.
Cluster operators that need greater throughput of volume operations should increase the associated sidecar's `worker-threads` value. When increased, the AWS account's EC2 API limits may need to be raised to account for the increased rate of API calls.
| Sidecar | Configuration Name | Description | EC2 API Calls Made By Driver |
| --- | --- | --- | --- |
| external-provisioner | `provisioner` | Watches `PersistentVolumeClaim` objects and triggers `CreateVolume`/`DeleteVolume` | EC2 CreateVolume/DeleteVolume |
| external-attacher | `attacher` | Watches `VolumeAttachment` objects and triggers `ControllerPublish`/`Unpublish` | EC2 AttachVolume/DetachVolume, EC2 DescribeInstances |
| external-resizer | `resizer` | Watches `PersistentVolumeClaim` objects and triggers controller-side expansion operations | EC2 ModifyVolume, EC2 DescribeVolumesModifications |
| external-snapshotter | `snapshotter` | Watches Snapshot CRD objects and triggers `CreateSnapshot`/`DeleteSnapshot` | EC2 CreateSnapshot/DeleteSnapshot, EC2 DescribeSnapshots |
Create a file named `example-ebs-csi-config-values.yaml` with the following YAML:
```yaml
sidecars:
  provisioner:
    additionalArgs:
      - "--worker-threads=101"
      - "--kube-api-burst=200"
      - "--kube-api-qps=40.0"
      - "--timeout=61s"
  resizer:
    additionalArgs:
      - "--workers=101"
```
**Note**: The external-resizer uses the `--workers` parameter instead of `--worker-threads`.
**Self-managed Helm instructions**

Pass in the configuration-values file when installing/upgrading `aws-ebs-csi-driver`:
```bash
helm upgrade --install aws-ebs-csi-driver \
  --namespace kube-system \
  --values example-ebs-csi-config-values.yaml \
  aws-ebs-csi-driver/aws-ebs-csi-driver
```
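If the above fails because the chart repository is not configured, add it first (repository URL from the aws-ebs-csi-driver project):

```bash
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
```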
**EKS-managed add-on instructions**
Pass in the add-on configuration-values file when:
Creating your add-on:
```bash
ADDON_CONFIG_FILEPATH="./example-addon-config.yaml"

aws eks create-addon \
  --cluster-name "example-cluster" \
  --addon-name "aws-ebs-csi-driver" \
  --service-account-role-arn "arn:aws:iam::123456789012:role/EBSCSIDriverRole" \
  --configuration-values "file://$ADDON_CONFIG_FILEPATH"
```
Updating your add-on:
```bash
ADDON_CONFIG_FILEPATH="./example-addon-config.yaml"

aws eks update-addon \
  --cluster-name "example-cluster" \
  --addon-name "aws-ebs-csi-driver" \
  --configuration-values "file://$ADDON_CONFIG_FILEPATH"
```
Confirm that these arguments were set by describing an `ebs-csi-controller` pod and observing the following args under the relevant sidecar container:
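For example (a sketch assuming the chart's default `app=ebs-csi-controller` label):

```bash
kubectl describe pod -n kube-system -l app=ebs-csi-controller
```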
```
Name:         ebs-csi-controller-...
...
Containers:
  ...
  csi-provisioner:
    ...
    Args:
      ...
      --worker-threads=101
      --kube-api-burst=200
      --kube-api-qps=40.0
      --timeout=61s
```
The driver supports the use of IMDSv2 (Instance Metadata Service Version 2).
To use IMDSv2 with the driver in a containerized environment like Amazon EKS, please ensure that the hop limit for IMDSv2 responses is set to 2 or greater. This is because the default hop limit of 1 is incompatible with containerized applications on Kubernetes that run in a separate network namespace from the instance.
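For example, you can raise the hop limit on an existing instance with the AWS CLI (the instance ID below is a placeholder):

```bash
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 2
```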
Warnings:

- Ext4's `bigalloc` is an experimental feature under active development. Please pay particular attention to your node's kernel version. See the ext4(5) man-page for details.
- Linux kernel release 4.15 added support for resizing ext4 filesystems using clustered allocation. Resizing volumes mounted to nodes running a Linux kernel version prior to 4.15 will fail.
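For reference, a hedged `StorageClass` sketch that formats volumes with ext4 clustered allocation (the `ext4BigAlloc` and `ext4ClusterSize` parameter names follow the driver's StorageClass parameter docs; verify them against your driver version):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-ext4-bigalloc
provisioner: ebs.csi.aws.com
parameters:
  csi.storage.k8s.io/fstype: ext4
  ext4BigAlloc: "true"       # enable ext4 clustered allocation (bigalloc)
  ext4ClusterSize: "16384"   # cluster size in bytes
volumeBindingMode: WaitForFirstConsumer
```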