This guide describes how to use Amazon EFS as Persistent storage on top of an existing Kubeflow deployment.
- For this README, we will assume that you already have an EKS cluster with Kubeflow installed, since the EFS CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. You can follow any of the other guides to complete these steps - choose either one of the AWS managed service integrated offerings or the vanilla distribution.
Important:
You must make sure you have an OIDC provider for your cluster and that it was added with eksctl >= 0.56, or, if you already have an OIDC provider in place, that it has the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM -> Identity providers -> Your OIDC -> Tags.
- At this point, you have likely cloned this repo and checked out the right branch. Let's save this path to help us navigate to the different paths used in the rest of this doc -
export GITHUB_ROOT=$(pwd)
export GITHUB_STORAGE_DIR="$GITHUB_ROOT/docs/deployment/add-ons/storage/"
- Make sure the following environment variables are set.
export CLUSTER_NAME=<clustername>
export CLUSTER_REGION=<clusterregion>
- Also, based on your setup, export the name of the user namespace you are planning to use -
export PVC_NAMESPACE=kubeflow-user-example-com
- And finally, choose a name for the EFS claim that we will create. In this guide we will use this variable as the name for the PV as well as the PVC.
export CLAIM_NAME=<efs-claim>
You can use either the Automated or the Manual setup to create the required resources. If you choose the manual route, you get a further choice between static and dynamic provisioning, so pick whichever suits your use case; the automated script currently supports only dynamic provisioning. Whichever combination you pick, be sure to follow the corresponding sections through the rest of this guide.
The script automates all the manual resource creation steps but is currently only available for the Dynamic Provisioning option. It performs the required cluster configuration, creates an EFS file system, and creates a storage class for dynamic provisioning. Once it completes, move on to section 3.0.
- Run the following commands from the `tests/e2e` directory -
cd $GITHUB_ROOT/tests/e2e
- Install the script dependencies
pip install -r requirements.txt
- Run the automated script as follows -
Note: If you want the script to create a new security group for EFS, specify a name for it here. On the other hand, if you want to use an existing security group, you can specify that name instead. Here we use the same name as the claim we are going to create -
export SECURITY_GROUP_TO_CREATE=$CLAIM_NAME
python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --efs_file_system_name $CLAIM_NAME --efs_security_group_name $SECURITY_GROUP_TO_CREATE
- The script above takes care of creating the `StorageClass (SC)`, which is a cluster-scoped resource. To create the `PersistentVolumeClaim (PVC)`, you can either use the yaml file provided in this directory or use the Kubeflow UI directly (a hedged sketch of this file is shown at the end of this section). The PVC needs to be in the namespace you will be accessing it from. Replace the `kubeflow-user-example-com` namespace specified below with the namespace of your Kubeflow user by editing the `efs/dynamic-provisioning/pvc.yaml` file accordingly -
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
The script applies some default values for the file system name, performance mode, etc. If you know what you are doing, you can see which options are customizable by executing `python utils/auto-efs-setup.py --help`.
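For orientation, here is a minimal sketch of what the `efs/dynamic-provisioning/pvc.yaml` file roughly contains, assuming the storage class created by the setup is named `efs-sc`; the exact file in this directory may differ in details such as the requested size.

```yaml
# Hedged sketch of efs/dynamic-provisioning/pvc.yaml; the file in this repo may differ slightly.
# The name and namespace below are the defaults that the yq commands above overwrite.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteMany          # EFS supports concurrent access from multiple pods
  storageClassName: efs-sc   # assumed name of the storage class created by the setup
  resources:
    requests:
      storage: 5Gi           # required by the API; EFS itself is elastic and does not enforce this
```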
If you prefer to set up each component manually, you can follow this manual guide. As mentioned, you have two options, Static and Dynamic provisioning, which you will choose between later in step 4 of this section.
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
We recommend installing the EFS CSI Driver v1.3.4 directly from the aws-efs-csi-driver GitHub repo as follows -
kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=tags/v1.3.4"
You can confirm that the EFS CSI Driver was installed into the default kube-system namespace for you using the following command -
kubectl get csidriver
NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE
efs.csi.aws.com false false Persistent 5d17h
The CSI driver's service account (created during installation) requires IAM permissions to make calls to AWS APIs on your behalf. Here, we will annotate the `efs-csi-controller-sa` service account with an IAM role which has the required permissions.
- Download the IAM policy document from GitHub as follows -
curl -o iam-policy-example.json https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/v1.3.4/docs/iam-policy-example.json
- Create the policy -
aws iam create-policy \
--policy-name AmazonEKS_EFS_CSI_Driver_Policy \
--policy-document file://iam-policy-example.json
- Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows -
eksctl create iamserviceaccount \
--name efs-csi-controller-sa \
--namespace kube-system \
--cluster $CLUSTER_NAME \
--attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/AmazonEKS_EFS_CSI_Driver_Policy \
--approve \
--override-existing-serviceaccounts \
--region $CLUSTER_REGION
- You can verify by describing the specified service account to check if it has been correctly annotated -
kubectl describe -n kube-system serviceaccount efs-csi-controller-sa
Please refer to the official AWS EFS CSI Document for detailed instructions on creating an EFS filesystem.
Note: For this README, we have assumed that you are creating your EFS Filesystem in the same VPC as your EKS Cluster.
In the following section, you have to choose between setting up dynamic provisioning or setting up static provisioning.
- Use the `$file_system_id` you recorded in section 3 above, or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the `dynamic-provisioning/sc.yaml` file by replacing `<YOUR_FILE_SYSTEM_ID>` with your `fs-xxxxxx` file system id (a hedged sketch of this storage class manifest is shown at the end of this section). You can also change it using the following command :
file_system_id=$file_system_id yq e '.parameters.fileSystemId = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml
- Create the storage class using the following command :
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml
- Verify your setup by checking which storage classes are created for your cluster. You can use the following command
kubectl get sc
- The `StorageClass` is a cluster-scoped resource, but the `PersistentVolumeClaim` needs to be in the namespace you will be accessing it from. Let's edit the `pvc.yaml` accordingly -
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
Note: The `StorageClass` is a cluster-scoped resource, which means we only need to do this step once per cluster.
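As referenced in the list above, here is a minimal sketch of what the `efs/dynamic-provisioning/sc.yaml` storage class roughly looks like; `<YOUR_FILE_SYSTEM_ID>` is the placeholder the yq command replaces, and the access-point parameters shown are typical EFS CSI values rather than values confirmed from this repo. The `pvc.yaml` referenced here is the same file sketched in the automated setup section above.

```yaml
# Hedged sketch of efs/dynamic-provisioning/sc.yaml; parameter values are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning creates an EFS access point per PVC
  fileSystemId: <YOUR_FILE_SYSTEM_ID> # replaced with your fs-xxxxxx id by the yq command above
  directoryPerms: "700"               # assumed permissions for each PVC's root directory
```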
Using this sample from the official AWS Docs, we have provided the required spec files in the sample subdirectory, but you can also create the PVC another way.
- Use the `$file_system_id` you recorded in section 3 above, or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the last line of the `static-provisioning/pv.yaml` file so that the `volumeHandle` field points to your EFS filesystem. Replace `$file_system_id` in the command below if it is not already set -
file_system_id=$file_system_id yq e '.spec.csi.volumeHandle = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
- The `PersistentVolume` and `StorageClass` are cluster-scoped resources, but the `PersistentVolumeClaim` needs to be in the namespace you will be accessing it from. Replace the `kubeflow-user-example-com` namespace specified below with the namespace of your Kubeflow user by editing the `static-provisioning/pvc.yaml` file accordingly -
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
- Now create the required `PersistentVolume`, `PersistentVolumeClaim` and `StorageClass` resources as follows (a hedged sketch of these manifests is included after the commands) -
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/sc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
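For orientation, here is a minimal sketch of what the three static-provisioning manifests roughly contain, modeled on the official AWS EFS CSI static provisioning example; the files in this repo may differ in details such as the capacity value.

```yaml
# Hedged sketch of efs/static-provisioning/{sc,pv,pvc}.yaml, modeled on the upstream
# aws-efs-csi-driver static provisioning example; details may differ from this repo.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv                # the yq command above overwrites this with $CLAIM_NAME
spec:
  capacity:
    storage: 5Gi              # placeholder; EFS does not enforce this value
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxx   # set to your file system id by the yq command above
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim                        # set to $CLAIM_NAME by the yq command above
  namespace: kubeflow-user-example-com   # set to $PVC_NAMESPACE by the yq command above
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```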
Use the following commands to ensure all resources have been deployed as expected and the PersistentVolume is correctly bound to the PersistentVolumeClaim
# Only for Static Provisioning
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
efs-pv 5Gi RWX Retain Bound kubeflow-user-example-com/efs-claim efs-sc 5d16h
# Both Static and Dynamic Provisioning
kubectl get pvc -n $PVC_NAMESPACE
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
efs-claim Bound efs-pv 5Gi RWX efs-sc 5d16h
In the following two sections we will be using this PVC to create a notebook server with Amazon EFS mounted as the workspace volume, download training data into this filesystem and then deploy a TFJob to train a model using this data.
Once you have everything set up, port-forward as needed and log in to the Kubeflow dashboard. At this point, you can also check the `Volumes` tab in Kubeflow; you should be able to see that your PVC is available for use within Kubeflow.
For more details on how to access your Kubeflow dashboard, refer to one of the deployment READMEs based on your setup. If you used the vanilla deployment, you can follow this README.
After installing Kubeflow, you can change the default storage class from `gp2` to the EFS storage class you created during the setup. For instance, if you followed the automated or manual steps, you should have a storage class named `efs-sc`. You can check your storage classes by running `kubectl get sc`.
This can be useful if your notebook configuration is set to use the default storage class (which it is by default). With the default changed, workspace volumes created for your notebooks will use your EFS storage class automatically. This is not mandatory, as you can also manually create a PVC and select the `efs-sc` class via the Volumes UI, but it simplifies notebook creation since this class is selected automatically when creating a volume in the UI. You can also decide to keep using `gp2` for workspace volumes and reserve the EFS storage class for datasets/data volumes only.
To learn more about how to change the default Storage Class, you can refer to the official Kubernetes documentation.
For instance, if you have a default class set to `gp2` and another class `efs-sc`, then you would need to do the following :
- Remove `gp2` as your default storage class
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
- Set `efs-sc` as your default storage class
kubectl patch storageclass efs-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Note: As mentioned, make sure to change your default storage class only after you have completed your Kubeflow deployment. The default Kubeflow components may not work well with a different storage class.
This step may not be necessary, but you might need to set some additional directory permissions on your worker node before you can use the filesystem as a mount point. By default, new Amazon EFS file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. If your containers are not running as root, you must change the Amazon EFS file system permissions to allow other users to modify the file system. The `set-permission-job.yaml` file is an example of how you could set these permissions to be able to use the EFS volume as your workspace in your Kubeflow notebook (a hedged sketch of such a job is shown after the commands below). Modify it accordingly if you run into similar permission issues with any other job pod.
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.spec.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
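To make the above concrete, a permissions job of this kind might look roughly like the following. This is a hedged sketch only; the provided `set-permission-job.yaml` may use a different image, mount path, or permission mode, and the sketch just shows the fields the yq commands above edit.

```yaml
# Hedged sketch of a permissions job like notebook-sample/set-permission-job.yaml;
# the image, mount path and chmod mode here are assumptions, not the repo's exact values.
apiVersion: batch/v1
kind: Job
metadata:
  name: efs-claim                        # set to $CLAIM_NAME by the yq commands above
  namespace: kubeflow-user-example-com   # set to $PVC_NAMESPACE by the yq commands above
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: set-permissions
        image: busybox
        # Give non-root notebook users (e.g. jovyan, UID 1000) write access to the EFS root.
        command: ["sh", "-c", "chmod 777 /mnt/efs"]
        volumeMounts:
        - name: efs-storage
          mountPath: /mnt/efs
      volumes:
      - name: efs-storage                # claimName is set by the third yq command above
        persistentVolumeClaim:
          claimName: efs-claim
```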
Spin up a new Kubeflow notebook server, specify the name of the PVC to be used as the workspace volume or the data volume, and specify your desired mount point. We'll assume you created a PVC with the name `efs-claim` via the Kubeflow Volumes UI or via the manual setup step for Static Provisioning. For our example here, we are using the AWS Optimized Tensorflow 2.6 CPU image provided in the notebook configuration options - `public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow`. Additionally, use the existing `efs-claim` volume as the workspace volume at the default mount point `/home/jovyan`. The server might take a few minutes to come up.
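If you prefer manifests over the UI, the notebook server created through the UI corresponds to a `Notebook` custom resource roughly along these lines; this is a hedged sketch only, with a hypothetical notebook name, and the UI may set additional fields not shown here.

```yaml
# Hedged sketch of the Notebook custom resource behind this example; the notebook name is
# hypothetical and the UI may add fields (resources, labels, annotations) not shown here.
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: efs-notebook                     # hypothetical name
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
      - name: efs-notebook
        image: public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow
        volumeMounts:
        - name: workspace
          mountPath: /home/jovyan        # default workspace mount point
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: efs-claim           # the PVC created earlier
```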
In case the server does not start up in the expected time, make sure to check -
- The Notebook Controller Logs
- The specific notebook server instance pod's logs
The following section re-uses the PVC and the Tensorflow Kubeflow Notebook created in the previous steps to download a dataset to the EFS volume. Then we spin up a TFJob which runs an image classification job using the data from the shared volume. Source: https://www.tensorflow.org/tutorials/load_data/images
Note: The following steps are run from the terminal on your gateway node connected to your EKS cluster, not from the Kubeflow Notebook, to verify that the PVC allows data to be shared as expected.
In the Kubeflow Notebook created above, use the following snippet to download the data into the `/home/jovyan/.keras` directory (which is mounted onto the EFS volume).
import pathlib
import tensorflow as tf
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
fname='flower_photos',
untar=True)
data_dir = pathlib.Path(data_dir)
In the `training-sample` directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the `$IMAGE_URI` to your registry and specify an appropriate tag -
export IMAGE_URI=<dockerimage:tag>
cd training-sample
# You will need to login to ECR for the following steps
docker build -t $IMAGE_URI .
docker push $IMAGE_URI
cd -
Once the docker image is built, replace the `<dockerimage:tag>` in the `tfjob.yaml` file, line #17 -
yq e '.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = env(IMAGE_URI)' -i training-sample/tfjob.yaml
Also, specify the name of the PVC you created -
yq e '.spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i training-sample/tfjob.yaml
Make sure to run it in the same namespace as the claim -
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i training-sample/tfjob.yaml
At this point, we are ready to train the model using the `training-sample/training.py` script and the data available on the shared volume with the Kubeflow TFJob operator as -
kubectl apply -f training-sample/tfjob.yaml
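For reference, the fields edited by the yq commands above sit in a TFJob spec shaped roughly like the following; this is a hedged sketch, not the exact contents of `training-sample/tfjob.yaml`, and the mount path and replica count are assumptions.

```yaml
# Hedged sketch of the relevant parts of training-sample/tfjob.yaml; the real file may set
# additional fields (args, resources) and a different mount path or replica count.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: image-classification-pvc
  namespace: kubeflow-user-example-com   # set to $PVC_NAMESPACE by the yq command above
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow              # TFJob expects the primary container to be named "tensorflow"
            image: <dockerimage:tag>      # set to $IMAGE_URI by the yq command above
            volumeMounts:
            - name: efs-volume
              mountPath: /home/jovyan     # assumed to match the notebook mount so training.py finds the data
          volumes:
          - name: efs-volume              # claimName set to $CLAIM_NAME by the yq command above
            persistentVolumeClaim:
              claimName: efs-claim
```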
To check that the training job is running as expected, you can check the events in the TFJob describe output as well as the job logs -
kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE
kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f
This section cleans up the resources created in this README. To clean up other resources, such as the Kubeflow deployment, please refer to the high-level README files.
kubectl delete tfjob -n $PVC_NAMESPACE image-classification-pvc
Log in to the dashboard to stop and/or terminate any Kubeflow notebooks you created for this session, or use the following command -
kubectl delete notebook -n $PVC_NAMESPACE <notebook-name>
Use the following command to delete the permissions job -
kubectl delete job -n $PVC_NAMESPACE $CLAIM_NAME
kubectl delete pvc -n $PVC_NAMESPACE $CLAIM_NAME
kubectl delete pv efs-pv
kubectl delete sc efs-sc
Use the steps in this AWS Guide to delete the EFS filesystem that you created.
- When you rerun `eksctl create iamserviceaccount` to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In such a case you may need to do one or both of the following: a. Delete the CloudFormation stack associated with this add-on role. b. Delete the `efs-csi-controller-sa` service account and then re-run the required steps. If you used the automated script, you can rerun it by specifying the same file system name so that a new one is not created.