Frequently Asked Questions

Volume Attachment Capacity Issues

There's a known issue where a mismatch between the reported and actual attachment capacity on nodes can result in scheduling errors and stuck workloads. This commonly occurs when volume slots are consumed after the driver starts up, which results in kube-scheduler assigning stateful pods to nodes lacking the necessary capacity to support attachments. As a consequence, volumes can become stuck in the attaching state until a slot is freed up, leading to prolonged delays in pod startup.

What causes this misalignment?

Today, CSI plugins report node attachment capacity only once, at startup, via the NodeGetInfo RPC. This static reporting fails to reflect any subsequent changes in capacity (which may occur when dynamically allocated ENIs or non-CSI devices consume attachment slots).
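You can check the attachment capacity the driver reported for a node by inspecting its CSINode object (the node name below is a placeholder):

kubectl get csinode my-node-name -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'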

What steps can be taken to mitigate this issue?

While a long-term fix is being worked on (see kubernetes/enhancements#4875), you can adopt one or more of the following mitigations:

  1. Use Dedicated EBS Instance Types: Gen7 and later EC2 instance types have dedicated EBS volume limits and are not affected by dynamic ENI attachments taking up volume slots.
  2. Enable VPC CNI's Prefix Delegation Feature: This can reduce the number of ENIs needed in your cluster. See the aws-eks-best-practices/networking docs for recommendations and further instructions.
  3. Use the --volume-attach-limit CLI Option: Configure the driver with this option to explicitly specify the volume attachment limit reported to Kubernetes. This is useful when you have a known safe limit (a Helm example follows this list).
  4. Use the --reserved-volume-attachments CLI Option: Configure the driver with this option to reserve a number of attachment slots for non-CSI volumes. These reserved slots are subtracted from the total reported to Kubernetes.
  5. Use Multiple DaemonSets: For clusters that need a mix of the above solutions across different groups of nodes, the Helm chart can construct multiple DaemonSets via the additionalDaemonSets parameter. See Additional DaemonSets for more information.
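As an illustration of option 3, the limit can also be set through the Helm chart. This is a minimal sketch that assumes the chart exposes the option as node.volumeAttachLimit (check your chart version for the exact parameter name); the value 25 is only a placeholder. Option 4 can be configured the same way through its corresponding chart parameter.

helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=25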

6-Minute Delays in Attaching Volumes

What causes 6-minute delays in attaching volumes?

After a node is preempted, pods using persistent volumes may experience a 6+ minute restart delay. This occurs when a volume is not cleanly unmounted, as indicated by Node.Status.VolumesInUse. The Attach/Detach controller in kube-controller-manager waits 6 minutes before issuing a force detach; this delay exists to protect against potential data corruption caused by unclean mounts.
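You can check whether a node still reports a volume as in use (the node name below is a placeholder):

kubectl get node my-node-name -o jsonpath='{.status.volumesInUse}'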

What behaviors have been observed that contribute to this issue?

  • The EBS CSI Driver node pod on the node may get terminated before the workload's volumes are unmounted, leading to unmount failures.
  • The operating system might shut down before volume unmount is completed, even with remaining time in the shutdownGracePeriod.

What are the symptoms of this issue?

Warning  FailedAttachVolume      6m51s              attachdetach-controller  Multi-Attach error for volume "pvc-4a86c32c-fbce-11ea-b7d2-0ad358e4e824" Volume is already exclusively attached to one node and can't be attached to another

What steps can be taken to mitigate this issue?

  1. Graceful Termination: Ensure instances are gracefully terminated, which is a best practice in Kubernetes. Graceful termination gives the CSI driver time to clean up volumes (via the preStop lifecycle hook) before the node is terminated.

  2. Configure Kubelet for Graceful Node Shutdown: For unexpected shutdowns, it is highly recommended to configure the kubelet for graceful node shutdown. Using the standard EKS-optimized AMI, you can configure the kubelet for graceful node shutdown with the following user data script (a verification example follows this list):

  #!/bin/bash
  # Allow systemd-logind to delay shutdown long enough for the kubelet to react
  echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
  systemctl restart systemd-logind
  # Set the kubelet's graceful node shutdown periods in its config file
  echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
  echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
  systemctl restart kubelet
  3. Karpenter Best Practices:
  • Upgrade to Karpenter version ≥ v1.0.0, which waits to terminate nodes until all volumes have been detached from them.
  • When using Spot Instances with Karpenter, enable interruption handling to manage involuntary interruptions gracefully. Karpenter supports native interruption handling, which cordons, drains, and terminates nodes ahead of interruption events, maximizing workload cleanup time.
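To confirm the graceful node shutdown settings from step 2 took effect, you can inspect the files that the user data script modified on the node (the paths below are the ones used in the script above):

grep InhibitDelayMaxSec /etc/systemd/logind.conf
jq '.shutdownGracePeriod, .shutdownGracePeriodCriticalPods' /etc/kubernetes/kubelet/kubelet-config.json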

What is the PreStop lifecycle hook?

The PreStop hook executes to check for remaining VolumeAttachment objects when a node is being drained for termination. If the node is not being drained, the check is skipped. The hook uses Kubernetes informers to monitor the deletion of VolumeAttachment objects, waiting until all attachments associated with the node are removed before proceeding.
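For example, you can list the VolumeAttachment objects that still reference a node (the node name below is a placeholder) to see what the hook is waiting on:

kubectl get volumeattachments -o wide | grep my-node-name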

How do I enable the PreStop lifecycle hook?

The PreStop lifecycle hook is enabled by default in the AWS EBS CSI driver. It will execute when a node is being drained for termination.

Driver performance for large-scale clusters

Summary of scalability-related changes in v1.25

Version 1.25 of aws-ebs-csi-driver featured four improvements to better manage the EBS volume lifecycle for large-scale clusters.

At a high level:

  1. Batching EC2 DescribeVolumes API Calls across CSI gRPC calls (#1819)
    • This greatly decreases the number of EC2 DescribeVolumes calls made by the driver, which significantly reduces your risk of region-level throttling of the 'Describe*' EC2 API Action
  2. Increasing the default CSI sidecar worker-threads values (to 100 for all sidecars) (#1834)
  3. Increasing the default CSI sidecar kube-api-qps (to 20) and kube-api-burst (to 100) for all sidecars (#1834)
    • Each sidecar can now send a burst of up to 100 queries to the Kubernetes API Server before throttling itself; once the burst is exhausted, it is limited to a sustained rate of 20 queries per second.
    • This keeps Kubernetes objects (PersistentVolume, PersistentVolumeClaim, and VolumeAttachment) more synchronized with the actual state of AWS resources, at the cost of increasing the load on the K8s API Server from ebs-csi-controller pods when many volume operations are happening at once.
  4. Increasing the default CSI sidecar timeout values (from 15s to 60s) for all sidecars (#1824)
    • For example, the external-attacher now gives the driver up to 60s to report attachment success or failure before retrying a ControllerPublishVolume call, so it no longer prematurely times out a call that would have returned success after, say, 20s.
    • This decreases the number of premature timeouts for CSI RPC calls, which reduces the number of replay EC2 API requests made by and waited for by the driver (at the cost of a longer delay during a real driver timeout (e.g. network blip leads to lost ControllerPublishVolume response))

EC2 and K8s CSI Sidecar Throttling Overview

Both the EC2 API and the K8s CSI sidecars base their API throttling on the token bucket algorithm. The Amazon EC2 API Reference provides a thorough explanation and example of how this algorithm is applied: Request throttling for the Amazon EC2 API

Cluster operators can set the CSI sidecar --kube-api-burst (i.e. bucket size) and --kube-api-qps (i.e. bucket refill rate) parameters to fine-tune how strictly these containers throttle their queries toward the K8s API Server. For example, with --kube-api-burst=100 and --kube-api-qps=20, an idle sidecar can send up to 100 queries immediately, after which it is limited to a sustained 20 queries per second as tokens refill.

Further Reading

Fine-tuning CSI sidecar scalability parameters

In aws-ebs-csi-driver v1.25.0, we changed the following K8s CSI external sidecar parameters to more sensible defaults. See the summary section for an overview of how these parameters affect volume lifecycle management.

  • --worker-threads (named --workers in external-resizer)
  • --kube-api-burst
  • --kube-api-qps
  • --timeout

The AWS EBS CSI Driver provides a set of default values intended to balance performance while reducing the risk of reaching the default EC2 rate limits for API calls.

Cluster operators can increase the --kube-api-qps and --kube-api-burst of each sidecar to keep the state of Kubernetes objects more in sync with their associated AWS resources.

Cluster operators that need a greater throughput of volume operations should increase the associated sidecar's worker-threads value. If you increase it, you may also need to request higher EC2 API rate limits for your AWS account to accommodate the increased rate of API calls.

| Sidecar | Configuration Name | Description | EC2 API Calls Made By Driver |
|---|---|---|---|
| external-provisioner | provisioner | Watches PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume | EC2 CreateVolume/DeleteVolume |
| external-attacher | attacher | Watches VolumeAttachment objects and triggers ControllerPublish/Unpublish | EC2 AttachVolume/DetachVolume, EC2 DescribeInstances |
| external-resizer | resizer | Watches PersistentVolumeClaim objects and triggers controller-side expansion operations | EC2 ModifyVolume, EC2 DescribeVolumesModifications |
| external-snapshotter | snapshotter | Watches Snapshot CRD objects and triggers CreateSnapshot/DeleteSnapshot | EC2 CreateSnapshot/DeleteSnapshot, EC2 DescribeSnapshots |

Sidecar Fine-tuning Examples

Create a file named example-ebs-csi-config-values.yaml with the following YAML:

sidecars:
  provisioner:
    additionalArgs:
    - "--worker-threads=101"
    - "--kube-api-burst=200"
    - "--kube-api-qps=40.0"
    - "--timeout=61s"
  resizer:
    additionalArgs:
    - "--workers=101"

Note: The external-resizer uses the --workers parameter instead of --worker-threads.

Self-managed Helm instructions

Pass in the configuration-values file when installing/upgrading aws-ebs-csi-driver

helm upgrade --install aws-ebs-csi-driver \
  --namespace kube-system \
  --values example-ebs-csi-config-values.yaml \
  aws-ebs-csi-driver/aws-ebs-csi-driver

EKS-managed add-on instructions

Pass in the add-on configuration-values file when:

Creating your add-on:

ADDON_CONFIG_FILEPATH="./example-addon-config.yaml"

aws eks create-addon \
  --cluster-name "example-cluster" \
  --addon-name "aws-ebs-csi-driver" \
  --service-account-role-arn "arn:aws:iam::123456789012:role/EBSCSIDriverRole" \
  --configuration-values "file://$ADDON_CONFIG_FILEPATH"

Updating your add-on:

ADDON_CONFIG_FILEPATH="./example-addon-config.yaml"

aws eks update-addon \
  --cluster-name "example-cluster" \
  --addon-name "aws-ebs-csi-driver" \
  --configuration-values "file://$ADDON_CONFIG_FILEPATH"

Confirm that these arguments were set by describing an ebs-csi-controller pod and observing the following args under the relevant sidecar container:
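For example, assuming the controller pods carry the default app=ebs-csi-controller label:

kubectl describe pod -n kube-system -l app=ebs-csi-controller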

Name: ebs-csi-controller-...
...
Containers:
  ...
  csi-provisioner:
    ...
    Args:
    ...
      --worker-threads=101
      --kube-api-burst=200
      --kube-api-qps=40.0
      --timeout=61s

IMDSv2 Support

The driver supports the use of IMDSv2 (Instance Metadata Service Version 2).

To use IMDSv2 with the driver in a containerized environment like Amazon EKS, please ensure that the hop limit for IMDSv2 responses is set to 2 or greater. This is because the default hop limit of 1 is incompatible with containerized applications on Kubernetes that run in a separate network namespace from the instance.
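For example, the hop limit on an existing instance can be raised with the AWS CLI (the instance ID below is a placeholder):

aws ec2 modify-instance-metadata-options \
  --instance-id i-1234567890abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 2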

CreateVolume (StorageClass) Parameters

ext4BigAlloc and ext4ClusterSize

Warnings:

  • Ext4's bigalloc is an experimental feature, under active development. Please pay particular attention to your node's kernel version. See the ext4(5) man-page for details.
  • Linux kernel release 4.15 added support for resizing ext4 filesystems using clustered allocation. Resizing volumes mounted to nodes running a Linux kernel version prior to 4.15 will fail.
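A minimal StorageClass sketch using these parameters, assuming the standard csi.storage.k8s.io/fstype parameter and the parameter names given above; the 16384-byte cluster size is only an illustrative value:

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-ext4-bigalloc
provisioner: ebs.csi.aws.com
parameters:
  csi.storage.k8s.io/fstype: ext4
  ext4BigAlloc: "true"
  ext4ClusterSize: "16384"
EOF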