Inputs for fast computation on a single PC #5213

binahn91 · 2024-09-04T14:44:30Z

binahn91
Sep 4, 2024

Dear developers,

I am currently running various simple cases on a single PC.

AMD Ryzen Threadripper 7980X, 64 physical cores (128 logical cores with SMT/hyperthreading)
Ubuntu 22.04.4 LTS

I would like to ask for advice on running WarpX as fast as possible on a single PC,
given that the physical system/configuration stays the same.

I ran sims with the following in the linux terminal.
export OMP_NUM_THREADS=128
time warpx.rz input.txt

The sim is a simple RZ grid with a uniform plasma (electrons + Ar ions), run for 100 steps.
By varying OMP_NUM_THREADS, I get the total computation time shown in the following plot.

Briefly, the sim initially has about 3 million particles on 16384 RZ cells
(no applied field and few diagnostics saved).

I am wondering why it is fastest with 16 or 32 OpenMP threads.

warpx_blocking_factor does not seem to be an option in picmi.CylindricalGrid(),
and for this [numcr, numcz]=[64, 256] case,
it is always amr.blocking_factor = 1 and amr.max_grid_size = 32,
in the sim.write_input_file() generated input text file.

Does my input need to be fixed?
(I have attached the input/output files here.)
input.txt
warpxoutput.txt

And, is there any other way to improve the speed of simulation, for a single PC?

Thank you very much!
Bin

Answered by ax3l

ax3l · 2024-09-04T19:16:13Z

ax3l
Sep 4, 2024
Maintainer

Hi @binahn91,

Awesome question!

Here is how you can optimize running on modern CPUs like your AMD Ryzen Threadripper 7980X.

Modern multi-core CPUs are set up in multiple islands ("chiplets" with their own bus rings), even within a single socket. Looking at the Ryzen Threadripper whitepaper from AMD:
https://www.amd.com/system/files/documents/tr-pro-workstation-white-paper.pdf
Also a good page: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4

It looks like there might be around 4-8 "chiplets" on your CPU. Each chiplet contributes a fraction of your total physical cores and has very fast access to its closest memory banks (RAM), but is a bit slower when it has to communicate with other chiplets (AMD calls the cross-CPU-island interconnect "Infinity Fabric").
(Example block diagram in https://www.vortez.net/news_image/14678.html ).

Your performance benchmark shows that pretty well: there is a penalty if the cooperating OpenMP threads cross chiplets.

The strategy we want to follow here is to use MPI+OpenMP. You can test this out. Assuming there are 4 islands, use 4 MPI processes (ranks) via mpirun -n 4 ./warpx ....

Now, each MPI process will spawn OpenMP threads. By default, we use the number of physical cores (64 in your case) and avoid hyperthreading (2x in this case, which gives "128 virtual cores"), because it usually adds no benefit: hyperthreads share most of the resources of a physical core, which helps only a few algorithms, usually with a speedup in the 5% range.
In short: the number of MPI processes times the number of OpenMP threads per process should equal your number of physical cores.

Try this:

export MPI_RANKS=4
export OMP_NUM_THREADS=$(( 64 / ${MPI_RANKS} ))
mpirun -n ${MPI_RANKS} ./warpx ...

Then vary the number of MPI_RANKS from 1 to 8 to see where you get best performance.
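The sweep described above could be scripted roughly as follows. This is only a sketch: PHYS_CORES=64 matches the CPU discussed here, while WARPX_CMD, DRY_RUN and the function name sweep are made-up helpers for illustration, and the executable and input file names are placeholders. With DRY_RUN=1 (the default) it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: sweep MPI ranks while keeping ranks * threads = physical cores.
PHYS_CORES=${PHYS_CORES:-64}               # physical cores of the CPU in this thread
WARPX_CMD=${WARPX_CMD:-"./warpx inputs"}   # placeholder executable and input file
DRY_RUN=${DRY_RUN:-1}                      # 1 = only print the commands

sweep() {
    local ranks threads
    for ranks in 1 2 4 8; do
        # OpenMP threads per rank, so that ranks * threads = PHYS_CORES
        threads=$(( PHYS_CORES / ranks ))
        if [ "$DRY_RUN" = "1" ]; then
            echo "OMP_NUM_THREADS=${threads} mpirun -n ${ranks} ${WARPX_CMD}"
        else
            export OMP_NUM_THREADS=${threads}
            time mpirun -n "${ranks}" ${WARPX_CMD}
        fi
    done
}

sweep
```

Set DRY_RUN=0 to actually launch and time the runs once the printed commands look right.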

Note: for MPI parallelism we use domain decomposition, for which you need at least 1 AMReX "block" per MPI rank. WarpX will issue warnings about this on startup and shutdown; please follow the guidance these warnings provide.
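As a rough way to see how many blocks a grid provides, one can estimate the count per direction as ceil(n_cell / max_grid_size). The sketch below assumes a single level without mesh refinement and ignores any further splitting AMReX may apply; count_boxes is a made-up helper name. The numbers are the ones from the original question (64 x 256 cells, amr.max_grid_size = 32).

```shell
#!/usr/bin/env bash
# Sketch: estimate the number of boxes a 2D/RZ grid is chopped into.
# Assumption: boxes per direction = ceil(n_cell / max_grid_size).
count_boxes() {
    local nr=$1 nz=$2 max_grid_size=$3
    local boxes_r=$(( (nr + max_grid_size - 1) / max_grid_size ))
    local boxes_z=$(( (nz + max_grid_size - 1) / max_grid_size ))
    echo $(( boxes_r * boxes_z ))
}

# 64 x 256 cells with amr.max_grid_size = 32 gives 2 * 8 boxes
count_boxes 64 256 32   # prints 16
```

Under this estimate, that grid could keep at most 16 MPI ranks busy; a smaller amr.max_grid_size would yield more, smaller boxes.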

Additionally, OpenMP provides "placement hints" that we can set to make sure the threads are placed close to (on the same chiplet as) the MPI process that spawns them. You can experiment with whether setting these hints is needed:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads

MPI (mpirun) also has options to ensure the processes (ranks) are spread out over your chiplets to start with, but that should be the default (you can watch it with a task monitor using 1 OMP thread and correct it if it does not spread out).
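To check placement without a task monitor, something like the following sketch may help. OMP_DISPLAY_ENV is a standard OpenMP environment variable that prints the active OpenMP settings at program start. The --report-bindings flag assumes Open MPI's mpirun (other MPI implementations use different options), and the executable and input file names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: make thread and rank placement visible before timing anything.
export OMP_PROC_BIND=spread   # placement hint from the answer above
export OMP_PLACES=threads     # placement hint from the answer above
export OMP_DISPLAY_ENV=true   # standard OpenMP variable: print settings at startup

# Show the core / cache / NUMA layout if lscpu is installed (Linux, util-linux).
if command -v lscpu >/dev/null 2>&1; then
    lscpu | grep -i -E 'model name|core|socket|numa|l3' || true
fi

# Assumption: Open MPI, whose mpirun accepts --report-bindings.
# "./warpx inputs" is a placeholder for the real executable and input file.
LAUNCH="mpirun -n 4 --report-bindings ./warpx inputs"
echo "Command to try: ${LAUNCH}"
```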

Let us know how this goes for you! :)

There are also two examples that optimize running on CPU with chiplets:

Perlmutter CPU (AMD): https://warpx.readthedocs.io/en/latest/install/hpc/perlmutter.html#running
Quartz CPU (Intel): https://warpx.readthedocs.io/en/latest/install/hpc/quartz.html#intel-xeon-e5-2695-v4-cpus

ax3l · 2024-09-04T20:05:59Z

ax3l
Sep 4, 2024
Maintainer

And, is there any other way to improve the speed of simulation, for a single PC?

Yes, there is more fine tuning possible. The most important part is to use the ideal placement of MPI ranks and OpenMP threads (see above).

Additionally, the x86 architecture has undergone many revisions, and we can have the compiler generate code tailored to your specific CPU (instead of a generic, recent x86).

To do that, you could set the environment variable

export CXXFLAGS="-march=native"

before configuring (cmake --fresh -S . -B build ...) WarpX.
You can also activate fastmath:

export CXXFLAGS="-march=native -ffast-math"  # GCC or Clang

FastMath has the potential to change the results, so try it out. It is not yet our default on CPU, but already used on our GPU builds. Thus, I do not expect major issues using it, but we will introduce systematic checking only in a few months after we are done with some updates of our test system following #5068.

march tuning usually provides another 5%-ish speedup, fastmath can be larger.

If you use a package manager, our conda packages build for generic but modern x86 CPUs. Our Spack package does march tuning by default.
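Put together, a from-source build with these flags might look like the sketch below. It assumes it is run from a WarpX source checkout; the values chosen for the WarpX_DIMS and WarpX_COMPUTE CMake options and the -j 16 job count are just one possible choice, and RUN_BUILD is a made-up switch so that the script only prints the commands unless asked to run them.

```shell
#!/usr/bin/env bash
# Sketch: configure and build WarpX from source with CPU-specific tuning.
# Drop -ffast-math to compare results against a build without it.
export CXXFLAGS="-march=native -ffast-math"   # GCC or Clang

# Assumed option values: RZ geometry, OpenMP CPU backend.
CONFIGURE="cmake --fresh -S . -B build -DWarpX_DIMS=RZ -DWarpX_COMPUTE=OMP"
BUILD="cmake --build build -j 16"

if [ "${RUN_BUILD:-0}" = "1" ] && [ -f CMakeLists.txt ] && command -v cmake >/dev/null 2>&1; then
    # CXXFLAGS is picked up by CMake at the first configure of a fresh build dir
    $CONFIGURE && $BUILD
else
    echo "$CONFIGURE"
    echo "$BUILD"
fi
```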

binahn91 · 2024-09-05T14:03:15Z

binahn91
Sep 5, 2024
Author

Dear @ax3l

Thank you so much for the quick and detailed help.
Conclusion: As you suggested, MPI+OpenMP is more efficient, even on a single multi-core PC.

I re-installed WarpX 'from source with cmake'.
I turned off SMT (Simultaneous Multithreading) for simplicity.
I chose the following test input file (not the input from my original question):
Examples: Uniform Plasma (2D / Executable: Input File)
In this input file, I only changed the following.

max_step = 4000
diag1.intervals = 1000
warpx.verbose = 0

Then I tested various combinations of MPI_RANKS and OMP_NUM_THREADS.
time mpirun -n ${MPI_RANKS} ./warpx.2d inputs_2d

The following plot shows the resulting computation time.

'x' markers denote cases with MPI_RANKS * OMP_NUM_THREADS = 64.

So, for this input case, MPI_RANKS=16 & OMP_NUM_THREADS=4 and MPI_RANKS=8 & OMP_NUM_THREADS=4
were the best performing configurations.
I did not try larger MPI_RANKS, since for this input the 'number of boxes of cells available is 16'.

Takeaway: On a single multi-core PC, it is worth trying different pairs of MPI_RANKS & OMP_NUM_THREADS, along with adjusting amr.max_grid_size & amr.blocking_factor, to find the fastest configuration.

Thank you again. This will save us much time.
Bin
