perf: request some benchmarks and compare them with results in native slurm #156
Comments
Hello @CrackedPoly,
Thank you for the question. In our experience, Soperator (Slurm on Kubernetes) introduces no noticeable performance degradation compared to Slurm-over-VMs configurations. If by “native Slurm” you’re referring to Slurm running on bare-metal servers, we haven’t run those tests directly ourselves. However, we can compare our results with other public data from cloud providers for context.
Performance testing
In all our tests, the nodes were the same:
2-node stable diffusion training (MLCommons benchmark v3.0)
The benchmarked performance of Soperator aligns with that of a standard Slurm-over-VMs setup, averaging ~20-20.5 seconds per 100 training steps.
64-node GPT-3 pretraining (MLCommons benchmark v4.0)
My single-run test yielded a result of 54.03 minutes, showing:
NVIDIA's results are typically optimal, so a slight difference is to be expected. Also, since I've only run it once, this result is probably not the best that Soperator can achieve. Overall, these comparisons suggest that Soperator performs on par with native Slurm setups. Training performance depends more on the hardware, and perhaps even on the temperature in the data center, than on the use of containers and K8s. Our simpler checks, such as NCCL tests, also show no difference.
Overhead Considerations
In theory, Soperator doesn’t introduce specific overhead in training workloads. While there may be minor syscall overhead due to containerization, it’s inconsequential to training performance. Furthermore:
The only potential initialization slowdown we’ve observed stems from using a shared root filesystem, which can delay library loading during startup. However, this effect is minimal and not significant for model training. You can also store these libraries on local disks instead. I apologize for the lack of formally designed benchmark results. But, based on theory and our observations, we believe there is no noticeable overhead.
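The startup delay from a shared root filesystem mentioned above can be checked by timing a library import in a fresh interpreter. The following is a minimal sketch, not a measurement used in this thread: the helper name `time_cold_import` is hypothetical, and `json` is only a stand-in for whichever heavy library a training job actually loads.

```python
import subprocess
import sys


def time_cold_import(module_name: str) -> float:
    """Return the seconds a fresh Python interpreter spends importing module_name."""
    # The timing happens inside the child process, so interpreter
    # startup itself is excluded from the reported number.
    code = (
        "import time, importlib; t = time.perf_counter(); "
        f"importlib.import_module({module_name!r}); "
        "print(time.perf_counter() - t)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    # Take the last line in case the imported module prints during import.
    return float(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    # "json" is a placeholder; substitute the library whose load time matters.
    print(f"import took {time_cold_import('json'):.4f} s")
```

Running it once on a node with the shared root filesystem and once with the libraries on a local disk gives a rough comparison. Only the first run after boot is truly cold, because later runs are served from the page cache.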
Hi @rdjjke, thank you so much for the information! Soperator seems like a very promising solution. But I want to confirm: by Slurm-over-VMs, do you mean running slurmd in VMs and spawning job processes there, or running slurmd on bare metal with jobs executed in VMs?
By Slurm-over-VMs I mean a typical Slurm installation (without Soperator) where the Slurm daemons, including slurmd, are installed on virtual machines. Jobs are child processes of slurmd, so they're also executed on virtual machines. In this case, the virtualization overhead exists (it is small, but still present for CPU, filesystem, and syscalls), but there is no Kubernetes / containerization overhead (which is so small that it can be ignored anyway). If your Slurm deployment is in a cloud (e.g. AWS, Azure, Nebius, etc.) then the compute instances you use are probably virtual machines. But there are some bare-metal clouds, which provide bare-metal hosts to their users (not VMs). They can also provide K8s clusters over bare-metal hosts. When I talked about NVIDIA and SMC in the previous message, I meant that they most likely used bare-metal setups.
I may not have described it quite clearly, so I'll try to restate it. Kubernetes takes a number of hosts (which can be virtual machines or bare-metal servers) and runs containers on them. Everything else it provides is convenient management of these containers. So the overhead of using Kubernetes is equal to the overhead of using containers, and the overhead of containers is negligibly small. BTW, Slurm jobs in the MLCommons recipes use enroot containers anyway, so they're containerized even when Soperator isn't used. In Soperator's case, these containers run inside K8s containers, which is also OK. The overhead of virtual machines exists, but it's small for model training because GPUs are usually not virtualized.
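One rough way to check the claim about syscall overhead is to time a cheap syscall in a loop and run the same script on the host, in a VM, and inside a container. This is only a sketch under stated assumptions: the helper name `mean_stat_ns` is hypothetical, and the figure includes Python interpreter overhead, so only the relative difference between environments is meaningful.

```python
import os
import time


def mean_stat_ns(iterations: int = 100_000) -> float:
    """Return the mean wall-clock nanoseconds per os.stat(".") call."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        # os.stat issues a real stat-family syscall on every call.
        os.stat(".")
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"mean stat() cost: {mean_stat_ns():.0f} ns")
```

A training step usually takes hundreds of milliseconds, so a per-syscall difference of a few hundred nanoseconds would be hard to see in the end-to-end numbers.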
Cool. Let me summarize this.
Yes. Correct.
I wonder how much overhead Soperator introduces for ML workloads compared with native Slurm. This is an important concern, and I would like to know whether you have any statistics.
Some scenarios
Single machine
Distributed
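For either scenario, the overhead can be expressed as the relative difference between median timings on the two setups. The sketch below is illustrative only: the function name `relative_overhead` is hypothetical, and the sample values 20.0 and 20.5 are just the two ends of the "seconds per 100 steps" range quoted in this thread, used to show the arithmetic. They are not a measured Soperator-versus-baseline pair.

```python
from statistics import median
from typing import Sequence


def relative_overhead(baseline_s: Sequence[float], candidate_s: Sequence[float]) -> float:
    """Return (median(candidate) - median(baseline)) / median(baseline)."""
    base = median(baseline_s)
    cand = median(candidate_s)
    if base <= 0:
        raise ValueError("baseline timings must be positive")
    return (cand - base) / base


if __name__ == "__main__":
    # Illustrative inputs only: the endpoints of the 20-20.5 s range.
    overhead = relative_overhead([20.0], [20.5])
    print(f"{overhead:.1%}")  # prints 2.5%
```

Using the median over several runs rather than a single run reduces the effect of run-to-run variance, which matters when the expected difference is small.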