Replies: 4 comments 2 replies
-
Thanks for raising this.
Yes, exactly, that was also our feedback to the authors at the time. The authors focused on a correctness comparison and did not go into performance comparisons, which would have had multiple aspects to them.
Of course, we continuously run comparisons of our own, and the speedup is pretty much what one expects: a GPU has 10-20x the memory bandwidth and performance of a modern CPU and thus gives you speedups in that range. A detailed analysis can be found in our SC22 paper:
And we did comparisons on, e.g., the Summit (OLCF) supercomputer, where we see a ~20x speedup using the GPUs of a node vs. its CPUs (which makes sense from the memory-bandwidth and flop/s ratios).
Correct. We have an option (default: off) to swap to CPU RAM if you run out of GPU memory, but it is very slow because you will be limited by device-host and host-device memory bandwidth.
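A minimal sketch of how one might switch this on at runtime, assuming the option maps to the AMReX managed-memory arena (check the manual for the exact flag name):

```
# hedged sketch: the flag name amrex.the_arena_is_managed is an assumption, verify it in the docs.
# Managed (unified) memory lets a run oversubscribe GPU RAM and page to host memory,
# at the cost of being limited by device<->host transfer bandwidth.
mpiexec -n 4 ./warpx.3d inputs_3d amrex.the_arena_is_managed=1
```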
No worries, best to start here with our manual :)
Absolutely. We have many examples in 3D here: pick one that is close and modify it for your science :) Our input sets generally work on both CPU and GPU. You can do performance tuning by setting the blocking factor larger on GPUs than on CPUs, as sketched below.
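For illustration, a minimal sketch of the relevant input-deck lines (parameter names follow the AMReX conventions WarpX uses; the values are placeholders, not tuned recommendations):

```
# illustrative tuning lines in a WarpX/AMReX input deck; values are placeholders
amr.n_cell          = 512 512 512   # total cells of a hypothetical 3D domain
amr.max_grid_size   = 256           # allow large boxes per GPU ...
amr.blocking_factor = 128           # ... vs. e.g. 16-32 that one might use per CPU core
```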
Generally, we currently run a lot in double precision, which means you want to use HPC/data-center GPUs instead of gaming GPUs. We have support for single precision, but there is ongoing work there to ensure correctness (the compute backend and precision are selected at build time; see the sketch below). Gaming GPUs have fast memory bandwidth, but nearly no double-precision flop/s and no error-correcting GPU RAM, and you want to have the latter two. Besides that, we generally run on them, e.g., we often develop on laptop/workstation/gaming GPUs, but we do not do science production runs on them.
"Newer is better": more memory bandwidth and more (DP) TFlop/s is better. In our evaluations (see the SC22 paper above), we pretty much use proportionally what is provided. We continuously try to improve performance on GPUs (Nvidia, AMD and Intel is what we support) and on CPUs. We also support AMD APUs, where ROCm supports them.
I will refrain from a specific buying recommendation here, but be advised that we work a lot with Nvidia and AMD GPUs as well as Intel GPUs. We also run on all common CPUs (AMD, Intel, ARM, IBM/Power, ...). I have not heard of Tenstorrent; one would check memory bandwidth and double-precision FLOP/s first, then check whether it supports a programming model such as CUDA, HIP or SYCL to evaluate it.
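As a rough sketch (see the documentation for the exact, current option names), selecting the compute backend and the floating-point precision happens at build time:

```
# hedged sketch of a CMake configure step for WarpX
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PRECISION=DOUBLE   # NVIDIA GPUs, double precision
# -DWarpX_COMPUTE=HIP (AMD), SYCL (Intel) or OMP (CPU-only) selects the other backends
cmake --build build -j 8
```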
-
Thanks, Axel, for the encouraging, detailed answer with a lot of useful information. You and your colleagues have done a really impressive job improving the code. I am aware of this and, as time permits, will definitely try WarpX. But please confirm that you are indeed getting substantial speedups on this specific setup. I ask because the numbers can sometimes be very surprising, and at the same time it is not clear how to interpret them. For reference, on the CPU alone this case runs in 4 min on a 128-core workstation. If the task is for some reason too small to get really valuable speedups, let's increase the task size by an order of magnitude, to dimensions of, say, 4500x4500. Accelerations >10x, rz geometry, the boosted frame and some other features could definitely be really important sometimes. The reason we are still using other codes is that for our tasks we often need features such as collisional ionization, which WarpX is still missing. Also, I am heavily speculating, but possibly because of that missing feature our own attempt to port one of the codes we were using to GPU did not yield valuable speedups.
-
Can anyone give me a hint about the actual GPU speedups for the specific task above? NVIDIA GPUs like the V100, A100, H100 or H200 are preferable, but any others are also welcome. Don't worry if the speedups are not that high; this could be because in the 2D case the Maxwell's-equations part is negligible, or something else, but that is very important to know too. Of course, you are welcome to redo the 2D case in 3D. Probably this is not hard to do in WarpX. In the EPOCH code, for example (which unfortunately is not GPU-capable and does not seem likely to become GPU-ready any time soon), it took me less than a couple of minutes. What was specifically changed: I added a third dimension of the same size with the same boundary condition, reduced all dimensions to 500x500x500, halved the particles-per-cell count to 50, and reduced the final run time to 100 fs so that this test runs in less than 10-20 min. Thanks in advance
-
I am surprised to have just a monologue here on such an interesting and important subject. Does nobody use GPUs yet? Bad manuals, or what? But without GPUs the door to all the largest supercomputers like Frontier, Aurora, etc. is now totally closed, because they are all inherently GPU machines where 90-95% of the compute power is delivered by the GPUs (as well as the electrical power consumed: a typical allocation there of 1 million node-hours is ~$400,000 in electricity cost alone, and half of that even if you don't use the GPUs), and not using the GPUs is literally prohibited as a waste of resources, unless you can prove that for some reason GPUs give no noticeable speedups for your tasks.
-
Found this paper published a few years ago (J. R. Smith et al., Phys. Plasmas 28, 074505 (2021); doi: 10.1063/5.0053109), where the authors took 4 different PIC codes and compared their execution speed, memory usage, scaling, accuracy, etc. In terms of execution speed, WarpX and some other codes like EPOCH showed pretty good results there. The comparison was done on an example of a laser pulse interacting with a simple flat target in 2D. The initial data file is shown below.
Since WarpX can also run on GPUs, my question would be: how much faster will this example run there? Please share your experience on how many GPUs are optimal to use for this specific code.
Is it correct that the minimum number of GPUs is defined by the memory requirement of your code setup and the amount of actual VRAM on the card (say the run allocates 1 TB of RAM and each card has only 40 GB of VRAM, then you need at least 25 GPUs)?
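(As a rough, purely illustrative sanity check of that arithmetic: a PIC run is usually dominated by the macro-particles, and at roughly 100 bytes per particle for double-precision positions, momenta, weight and bookkeeping, about 10^10 particles already amount to ~1 TB before counting the field arrays, so such a run clearly cannot fit on a single 40 GB card.)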
I have never used WarpX, but I suspect that with small changes this same example can probably be easily expanded to 3D if needed. Will the GPU requirements change in this case?
What kind of GPUs are best? AMD, NVIDIA, Tenstorrent? A100, H100, B200? Can recent consumer gaming cards from the RTX 3000/4000 series also serve as accelerators in the simple case of a PC/workstation?