Replies: 4 comments 2 replies
-
Thanks for raising this.
Yes, exactly, that was also our feedback to the authors at the time. The authors focused on a correctness comparison and did not go into performance comparisons, which would have had multiple aspects to them.
Of course, we continuously run comparisons of our own, and the speedup is pretty much what one expects: a GPU has 10-20x the memory bandwidth and performance of a modern CPU and thus gives you speedups in that range. A detailed analysis can be found in our SC22 paper:
And we did comparisons on, e.g., the Summit (OLCF) supercomputer, where we see a ~20x speedup using the GPUs of a node vs. its CPUs (which makes sense from the memory-bandwidth and flop/s ratios).
Correct. We have an option (default: off) to swap to CPU RAM if you run out of GPU memory, but it is very slow because you will be limited by device-host and host-device memory bandwidth.
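A minimal sketch of how one might switch this on at runtime, assuming the option maps to the AMReX managed-memory arena (check the manual for the exact flag name):

```
# hedged sketch: the flag name amrex.the_arena_is_managed is an assumption, verify it in the docs.
# Managed (unified) memory lets a run oversubscribe GPU RAM and page to host memory,
# at the cost of being limited by device<->host transfer bandwidth.
mpiexec -n 4 ./warpx.3d inputs_3d amrex.the_arena_is_managed=1
```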
No worries, best to start here with our manual :)
Absolutely. We have many examples in 3D here: pick one that is close and modify it for your science :) Our input sets generally work on both CPU and GPU. You can do performance tuning by setting the blocking factor larger on GPUs than on CPUs, as sketched below.
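For illustration, a minimal sketch of the relevant input-deck lines (parameter names follow the AMReX conventions WarpX uses; the values are placeholders, not tuned recommendations):

```
# illustrative tuning lines in a WarpX/AMReX input deck; values are placeholders
amr.n_cell          = 512 512 512   # total cells of a hypothetical 3D domain
amr.max_grid_size   = 256           # allow large boxes per GPU ...
amr.blocking_factor = 128           # ... vs. e.g. 16-32 that one might use per CPU core
```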
Generally, we currently run a lot in double precision, which means you want to use HPC/data-center GPUs instead of gaming GPUs. We have support for single precision, but there is ongoing work there to ensure correctness (the compute backend and precision are selected at build time; see the sketch below). Gaming GPUs have fast memory bandwidth, but nearly no double-precision flop/s and no error-correcting GPU RAM, and you want to have the latter two. Besides that, we generally run on them, e.g., we often develop on laptop/workstation/gaming GPUs, but we do not do science production runs on them.
"Newer is better": more memory bandwidth and more (DP) TFlop/s is better. In our evaluations (see the SC22 paper above), we pretty much use proportionally what is provided. We continuously try to improve performance on GPUs (Nvidia, AMD and Intel is what we support) and on CPUs. We also support AMD APUs, where ROCm supports them.
I will refrain from a specific buying recommendation here, but be advised that we work a lot with Nvidia and AMD GPUs as well as Intel GPUs. We also run on all common CPUs (AMD, Intel, ARM, IBM/Power, ...). I have not heard of Tenstorrent; one would check memory bandwidth and double-precision FLOP/s first, then check whether it supports a programming model such as CUDA, HIP or SYCL to evaluate it.
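As a rough sketch (see the documentation for the exact, current option names), selecting the compute backend and the floating-point precision happens at build time:

```
# hedged sketch of a CMake configure step for WarpX
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PRECISION=DOUBLE   # NVIDIA GPUs, double precision
# -DWarpX_COMPUTE=HIP (AMD), SYCL (Intel) or OMP (CPU-only) selects the other backends
cmake --build build -j 8
```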
-
Thanks, Axel, for the encouraging, detailed answer with a lot of useful information. You and your colleagues have done a really impressive job improving the code. I am aware of this and, as time permits, will definitely try WarpX. But please confirm that you are indeed getting substantial speedups on this specific setup. I ask because the numbers can sometimes be very surprising, and at the same time it is not clear how to interpret them. For reference, on the CPU alone this case runs in 4 min on a 128-core workstation. If the task is for some reason too small to get really valuable speedups, let's increase the task size by an order of magnitude, to dimensions of, say, 4500x4500. Accelerations >10x, rz geometry, the boosted frame and some other features could definitely be really important sometimes. The reason we are still using other codes is that for our tasks we often need features such as collisional ionization, which WarpX is still missing. Also, I am heavily speculating, but possibly because of that missing feature our own attempt to port one of the codes we were using to GPU did not yield valuable speedups.
-
Can anyone give me a hint about the actual GPU speedups for the specific task above? NVIDIA GPUs like the V100, A100, H100 or H200 are preferable, but any others are also welcome. Don't worry if the speedups are not that high; this could be because in the 2D case the Maxwell's-equations part is negligible, or something else, but that is very important to know too. Of course, you are welcome to redo the 2D case in 3D. Probably this is not hard to do in WarpX. In the EPOCH code, for example (which unfortunately is not GPU-capable and does not seem likely to become GPU-ready any time soon), it took me less than a couple of minutes. What was specifically changed: I added a third dimension of the same size with the same boundary condition, reduced all dimensions to 500x500x500, halved the particles-per-cell count to 50, and reduced the final run time to 100 fs so that this test runs in less than 10-20 min. Thanks in advance
-
I am surprised to have just a monologue here on such an interesting and important subject. Does nobody use GPUs yet? Bad manuals, or what? But without GPUs the door to all the largest supercomputers like Frontier, Aurora, etc. is now totally closed, because they are all inherently GPU machines where 90-95% of the compute power is delivered by the GPUs (as well as the electrical power consumed: a typical allocation there of 1 million node-hours is ~$400,000 in electricity cost alone, and half of that even if you don't use the GPUs), and not using the GPUs is literally prohibited as a waste of resources, unless you can prove that for some reason GPUs give no noticeable speedups for your tasks.
-
Found this paper published a few years ago (J. R. Smith et al., Phys. Plasmas 28, 074505 (2021); doi: 10.1063/5.0053109), where the authors took 4 different PIC codes and compared their execution speed, memory usage, scaling, accuracy, etc. In terms of execution speed, WarpX and some other codes like EPOCH showed pretty good results there. The comparison was done on an example of a laser pulse interacting with a simple flat target in 2D. The initial data file is shown below.
Since WarpX can also run on GPUs, my question would be: how much faster will this example run there? Please share your experience on how many GPUs are optimal to use for this specific code.
Is it correct that the minimum number of GPUs is defined by the memory requirement of your code setup and the amount of actual VRAM on the card (say the run allocates 1 TB of RAM and each card has only 40 GB of VRAM, then you need at least 25 GPUs)?
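(As a rough, purely illustrative sanity check of that arithmetic: a PIC run is usually dominated by the macro-particles, and at roughly 100 bytes per particle for double-precision positions, momenta, weight and bookkeeping, about 10^10 particles already amount to ~1 TB before counting the field arrays, so such a run clearly cannot fit on a single 40 GB card.)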
I have never used WarpX, but I suspect that with small changes this same example can probably be easily expanded to 3D if needed. Will the GPU requirements change in this case?
What kind of GPUs are best? AMD, NVIDIA, Tenstorrent? A100, H100, B200? Can recent consumer gaming cards from the RTX 3000/4000 series also serve as accelerators in the simple case of a PC/workstation?