Additional documentation from the team managing the EIDF Cluster is available at this link. If you don't have access to that repository, you can get in touch with the EIDF service desk (see this link) and request access. Likewise, if you need to flag problems with the cluster, please get in touch with the helpdesk at this link.
For running ML/NLP experiments on a Kubernetes cluster, we prepared an introductory guide, available here. If you find that anything is missing from that guide, please feel free to add to it (you all have write access) or, if you are unable to do so, please open an issue.
- WARNING: the cluster does not have any quota or permission management at the moment, so please behave: don't hoard resources or make life harder for other users, or we will have to restrict your access.
This is a guide to onboarding new users. Be aware that the service is still under development.
Full documentation on signing up is available in the EIDF Documentation.
- Open a browser and go to the EIDF Portal
- Click Login. This will redirect to SAFE
- If you have a SAFE account, use it; otherwise, register for a new SAFE account
- Return to the EIDF Portal once your SAFE account has been activated and you have logged in.
- Click on the dropdown menu "Projects"
- Click "Request Access"
- Choose from the list "eidf029 - Informatics K8s Support"
- Click "Apply"
- An approver will add you to the project and create an account.
- Log in to the EIDF Portal
- Click on the dropdown menu "Projects"
- Under "Project Management Requests", choose the applicant to the project
- Choose the approve option on the page and submit.
- Click on the dropdown menu "Projects"
- Click "Your Projects"
- Choose from the list "eidf029 - Informatics K8s Support"
- Find and click the button for "Create Account"
- Create a username for the account ending in -infk8s, e.g. dmckay-infk8s
- In the account owner drop-down box, choose the applicant you are creating the account for.
- Click "Submit"
- In the "Project Accounts", click "Manage" next to the account you have just created.
- Give "Access", not "Sudo", to the following entries: eidf-gateway, eidf029-host1
- Log in to the EIDF Portal
- Click on the dropdown menu "Projects"
- Click "Your Projects"
- Choose from the list "eidf029 - Informatics K8s Support"
- Click your account user name from your Account
- Click to view your initial password and copy/note it
- Click VDI Login
- From the project list of VMs, choose eidf029-host1_ssh
- Enter your username
- Enter the initial password
- At the password change prompt, follow the instructions.
- Use the VDI as for the initial login, except for the password change
- Optional: use the EIDF Gateway
Alternatively, you can edit the .ssh/config file (useful for VSCode):

    Host eidf
        User USERNAME
        IdentityFile PATH-TO-KEY
        HostName 10.24.5.121
        ProxyJump [email protected]

and access the cluster with ssh eidf.
(NB: This currently fails, but is not needed.)
- Run
kubectl get nodes
- Output should look like:
NAME STATUS ROLES AGE VERSION
gpu-vm00 Ready controlplane,etcd,worker 21d v1.24.4
gpu-vm01 Ready controlplane,etcd,worker 21d v1.24.4
gpu-vm02 Ready controlplane,etcd,worker 21d v1.24.4
gpu-vm03 Ready worker 21d v1.24.4
gpu-vm04 Ready worker 21d v1.24.4
gpu-vm05 Ready worker 21d v1.24.4
gpu-vm06 Ready worker 21d v1.24.4
gpu-vm07 Ready worker 21d v1.24.4
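
If you want to check the node listing from a script rather than by eye, a minimal sketch along these lines may help. It only parses text in the same shape as the output above; the function name `not_ready_nodes` and the `gpu-vm99` row are made up for illustration and are not part of the cluster or its tooling.

```python
def not_ready_nodes(kubectl_output: str) -> list[str]:
    """Return the names of nodes whose STATUS column is not exactly "Ready"."""
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    # rows[0] is the header (NAME STATUS ROLES AGE VERSION); skip it.
    return [row[0] for row in rows[1:] if len(row) > 1 and row[1] != "Ready"]


# Sample text shaped like the `kubectl get nodes` output shown above.
# The NotReady row (gpu-vm99) is hypothetical, added only to show a failure.
SAMPLE = """\
NAME STATUS ROLES AGE VERSION
gpu-vm00 Ready controlplane,etcd,worker 21d v1.24.4
gpu-vm03 Ready worker 21d v1.24.4
gpu-vm99 NotReady worker 21d v1.24.4
"""

print(not_ready_nodes(SAMPLE))  # ['gpu-vm99']
```

On the cluster, the input text could come from `subprocess.run(["kubectl", "get", "nodes"], capture_output=True, text=True, check=True).stdout`. Note that this sketch also flags combined states such as `Ready,SchedulingDisabled`, since it compares against the exact string `Ready`.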
- Open an editor of your choice to create the file test_NBody.yml
- Put the following into the file:
    apiVersion: v1
    kind: Pod
    metadata:
      generateName: sample-
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cudasample
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
        args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
        resources:
          limits:
            nvidia.com/gpu: 1
- Save the file and exit the editor
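
If you would rather generate the manifest than type it, the same Pod can be built as a plain Python dict and serialised to JSON, which kubectl accepts alongside YAML. This is only a sketch: the function name `nbody_pod` and its parameters are hypothetical, and the field values are copied from the manifest above.

```python
import json


def nbody_pod(num_bodies: int = 512000, gpus: int = 1) -> dict:
    """Build the sample n-body Pod manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # generateName lets Kubernetes append a random suffix (e.g. sample-7gdtb).
        "metadata": {"generateName": "sample-"},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "cudasample",
                    "image": "nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1",
                    "args": [
                        "-benchmark",
                        f"-numbodies={num_bodies}",
                        "-fp64",
                        "-fullscreen",
                    ],
                    # Request GPUs through the nvidia.com/gpu resource limit.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest_text = json.dumps(nbody_pod(), indent=2)
print(manifest_text)
```

Saving the printed text as, say, test_NBody.json and running `kubectl create -f test_NBody.json` should behave the same as the YAML file. Keeping `gpus` as an explicit parameter makes it harder to request more GPUs than intended by accident.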
- Run `kubectl create -f test_NBody.yml`
- This will output something like:
pod/sample-7gdtb created
- Run
kubectl get pods
- This will output something like:
pi-tt9kq 0/1 Completed 0 24h
sample-24n7n 0/1 Completed 0 24h
sample-2j5tc 0/1 Completed 0 24h
sample-2kjbx 0/1 Completed 0 24h
sample-2mnvg 0/1 Completed 0 24h
sample-4sng2 0/1 Completed 0 24h
sample-5h6sr 0/1 Completed 0 24h
sample-6bqql 0/1 Completed 0 24h
sample-7gdtb 0/1 Completed 0 39s
sample-8dnht 0/1 Completed 0 24h
sample-8pxz4 0/1 Completed 0 24h
sample-bphjx 0/1 Completed 0 24h
sample-cp97f 0/1 Completed 0 24h
sample-gcbbb 0/1 Completed 0 24h
sample-hdlrr 0/1 Completed 0 24h
sample-hkwk2 0/1 Completed 0 24h
sample-j66ck 0/1 Completed 0 24h
sample-jxhtk 0/1 Completed 0 24h
sample-lzmg8 0/1 Completed 0 24h
sample-nhrtk 0/1 Completed 0 24h
sample-rh9v7 0/1 Completed 0 24h
sample-v48jd 0/1 Completed 0 24h
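
With many pods in the listing, it can be easier to look up your own pod programmatically. The sketch below assumes the two text formats shown above (the `pod/<name> created` line and the `kubectl get pods` rows); the helper names `created_pod_name` and `pod_status` are hypothetical.

```python
from typing import Optional


def created_pod_name(create_output: str) -> str:
    """Extract "sample-7gdtb" from a line like "pod/sample-7gdtb created"."""
    resource = create_output.split()[0]  # e.g. "pod/sample-7gdtb"
    return resource.split("/", 1)[1]


def pod_status(get_pods_output: str, pod_name: str) -> Optional[str]:
    """Return the STATUS column for pod_name, or None if it is not listed."""
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Columns are NAME READY STATUS RESTARTS AGE.
        if len(fields) >= 3 and fields[0] == pod_name:
            return fields[2]
    return None


# A few rows copied from the listing above.
LISTING = """\
sample-6bqql 0/1 Completed 0 24h
sample-7gdtb 0/1 Completed 0 39s
sample-8dnht 0/1 Completed 0 24h
"""

name = created_pod_name("pod/sample-7gdtb created")
print(name, pod_status(LISTING, name))  # sample-7gdtb Completed
```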
- View the logs of the pod you ran
kubectl logs sample-7gdtb
- This will output something like:
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Fullscreen mode
> Simulation data stored in video memory
> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.0
> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]
number of bodies = 512000
512000 bodies, total time for 10 iterations: 10570.778 ms
= 247.989 billion interactions per second
= 7439.679 double-precision GFLOP/s at 30 flops per interaction
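
As a sanity check on the figures in this log, the throughput can be recomputed from the body count, iteration count and total time. The sketch below assumes an all-pairs count of n × n interactions per iteration, which is consistent with the numbers printed; the function name `nbody_throughput` is made up for illustration.

```python
def nbody_throughput(
    num_bodies: int,
    iterations: int,
    total_ms: float,
    flops_per_interaction: int = 30,
) -> tuple[float, float]:
    """Return (billion interactions per second, GFLOP/s)."""
    # Assumption: every body interacts with every body in each iteration.
    interactions = num_bodies * num_bodies * iterations
    seconds = total_ms / 1000.0
    per_second = interactions / seconds
    return per_second / 1e9, per_second * flops_per_interaction / 1e9


# Values copied from the log above.
billions, gflops = nbody_throughput(512000, 10, 10570.778)
print(f"{billions:.3f} billion interactions per second")
print(f"{gflops:.3f} GFLOP/s at 30 flops per interaction")
```

The results agree with the logged 247.989 billion interactions per second and 7439.679 GFLOP/s to within rounding in the last digit.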
- Delete your pod with:
kubectl delete pod sample-7gdtb
Follow this guide to get started, and check the following tools from the amazing @AntreasAntoniou:
- https://github.com/BayesWatch/kubeproject for general kubectl stuff and understanding what’s going on.
- https://github.com/AntreasAntoniou/kubejobs for Python-based Kubernetes job launching that covers a lot of the YAML options, but in Python class format.
- https://github.com/AntreasAntoniou/minimal-ml-template/tree/main/kubernetes for a minimal ML project that can run on a Kubernetes cluster
More details here