Inputs for fast computation on a single PC #5213
-
Dear developers, I am currently running various simple cases on a single PC.
I would like to ask for advice on running WarpX efficiently (fast) on a single PC. I ran the simulations with the following in the Linux terminal. The simulation is a simple RZ grid with a uniform plasma (electron+ion_Ar), run for 100 steps. Briefly, it initially has about 3 million particles on 16384 RZ cells. I am wondering why it is fastest with 16 or 32 OpenMP threads.
Does my input need to be fixed? And is there any other way to improve the simulation speed on a single PC? Thank you very much!
Replies: 3 comments
-
Hi @binahn91,
Awesome question! Here is how you can optimize running on modern CPUs like your AMD Ryzen Threadripper 7980X.
Modern multi-core CPUs are set up in multiple islands ("chiplets" with their own bus rings), even within a single socket. Looking at the Ryzen Threadripper whitepaper from AMD:
https://www.amd.com/system/files/documents/tr-pro-workstation-white-paper.pdf
Also a good page: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4
It looks like there might be around 4-8 chiplets on your CPU. Each chiplet contributes a fraction of your total physical cores and has very fast access to its closest memory banks (RAM), but is a bit slower when it has to communicate across chiplets (AMD calls this cross-CPU-island communication "Infinity Fabric"). Your performance benchmark shows that pretty well: there is a penalty if the cooperating OpenMP threads cross chiplets.
The strategy we want to follow here is to use MPI+OpenMP. You can test this out. Assuming there are 4 islands, use 4 MPI processes (ranks) via [...]
Now, each MPI process will spawn OpenMP threads. By default, we use the number of physical cores (64 in your case) and avoid Hyperthreading (2x in this case, which makes it "128 virtual cores"), because it usually adds no benefit: Hyperthreads share much of the resources of a physical core, which helps only a few algorithms get some speedup, usually in the 5% range.
Try this:

    export MPI_RANKS=4
    export OMP_NUM_THREADS=$(( 64 / ${MPI_RANKS} ))
    mpirun -n ${MPI_RANKS} ./warpx ...

Then vary the number of [...]
Note: for MPI parallelism we use domain decomposition, for which you need at least 1 AMReX "block" per MPI rank. WarpX will issue warnings about this on startup and shutdown; please follow the guidance these warnings provide.
Additionally, OpenMP provides "placement hints" that we can set to make sure the threads are placed close to (on the same chiplet as) the MPI process that spawns them. You can also experiment to see whether setting these hints is needed:

    export OMP_PROC_BIND=spread
    export OMP_PLACES=threads

MPI (mpirun) also has options to ensure the processes (ranks) are spread out over your chiplets to start with, but that should be the default (you can watch it with a task monitor using 1 OMP thread and correct it if it does not spread out).
Let us know how this goes for you! :)
There are also two examples that optimize running on CPU with chiplets: [...]
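To replace the "assuming there are 4 islands" guess with a number from your own machine, one option is to count the distinct L3 cache domains that Linux reports. This is only a sketch under assumptions not stated in the thread: that the kernel exposes cache topology under /sys/devices/system/cpu, that index3 is the L3 cache (typical on x86), and that one L3 domain corresponds to one chiplet. The function name count_l3_domains is made up for illustration.

```shell
#!/usr/bin/env bash
# Count distinct L3 cache domains: every CPU lists the CPUs it shares its
# L3 cache with, so the number of unique lists is the number of L3 "islands".
# Assumption: index3 is the L3 cache, as is typical on x86 Linux systems.
# The sysfs root is a parameter so the function can be tried on a fake tree.
count_l3_domains() {
    local root="${1:-/sys/devices/system/cpu}"
    local f n
    n=$(for f in "${root}"/cpu[0-9]*/cache/index3/shared_cpu_list; do
            if [ -r "${f}" ]; then cat "${f}"; fi
        done | sort -u | wc -l)
    # Arithmetic expansion strips any padding that wc may add
    echo "$(( n ))"
}
```

Calling count_l3_domains without arguments on the target machine prints a candidate value for MPI_RANKS; a result of 0 means the cache topology was not found at that path.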
-
Yes, there is more fine-tuning possible. The most important part is to use the ideal placement of MPI ranks and OpenMP threads (see above).
Additionally, the x86 architecture has undergone many, many revisions, and we could improve the generated code to be perfectly tailored to your specific CPU (instead of a generic, recent x86). To do that, you could set the environment variable

    export CXXFLAGS="-march=native"

before configuring ([...]

    export CXXFLAGS="-march=native -ffast-math"  # GCC or Clang

FastMath has the potential to change the results, so try it out. It is not yet our default on CPU, but it is already used in our GPU builds. Thus, I do not expect major issues using it, but we will introduce systematic checking only in a few months, after we are done with some updates of our test system following #5068.
If you use a package manager: our conda packages are built for generic but modern x86 CPUs. Our Spack package does march tuning by default.
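Before rebuilding, it can be useful to see which architecture -march=native resolves to on your machine. The sketch below relies on GCC's -Q --help=target query, which is GCC-specific and does not work with Clang; the function name show_native_march is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the -march value that the compiler selects for -march=native.
# Assumption: the compiler is GCC (g++); Clang does not support this query.
show_native_march() {
    local cxx="${1:-g++}"
    if ! command -v "${cxx}" >/dev/null 2>&1; then
        echo "compiler '${cxx}' not found" >&2
        return 1
    fi
    "${cxx}" -march=native -Q --help=target 2>/dev/null \
        | grep -E '^[[:space:]]*-march='
}
```

If the printed value is a generic target rather than your CPU family, the compiler may be too old to know the CPU, and the tuned build would likely gain little.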
-
Dear @ax3l, thank you so much for the quick and detailed help.
I tested various cases with different pairs of [...]
'x' markers are of cases with [...]
Takeaway: For a single multi-core PC, it is recommended to try different pairs of [...]
Thank you again. This will save us much time.
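A scan over rank/thread pairs like the one described here can be scripted. The following is a minimal sketch, not the script used in this thread: it assumes the pairs of interest are those where ranks times threads equals the number of physical cores, with the rank count doubling each step. The function names rank_thread_pairs and sweep are made up for illustration, and the WarpX command passed to sweep is whatever you normally run.

```shell
#!/usr/bin/env bash
# Print "ranks threads" pairs with ranks * threads == total physical cores,
# doubling the rank count each step (1, 2, 4, ...).
rank_thread_pairs() {
    local cores="$1"
    local ranks=1
    while [ "${ranks}" -le "${cores}" ]; do
        if [ $(( cores % ranks )) -eq 0 ]; then
            echo "${ranks} $(( cores / ranks ))"
        fi
        ranks=$(( ranks * 2 ))
    done
}

# Run the given command once per pair and report wall time in whole seconds.
# Usage (hypothetical executable and inputs file): sweep 64 ./warpx inputs
sweep() {
    local cores="$1"
    shift
    local ranks threads start end
    rank_thread_pairs "${cores}" | while read -r ranks threads; do
        export OMP_NUM_THREADS="${threads}"
        start=$(date +%s)
        # stdin is redirected so mpirun cannot consume the list of pairs
        mpirun -n "${ranks}" "$@" < /dev/null
        end=$(date +%s)
        echo "ranks=${ranks} threads=${threads} wall=$(( end - start ))s"
    done
}
```

For a 64-core machine this covers 1x64 through 64x1. Keep in mind the note above about needing at least one AMReX block per MPI rank, so the larger rank counts may require a finer domain decomposition.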
Hi @binahn91,
Awesome question!
Here is how you can optimize running on modern CPUs like your AMD Ryzen Threadripper 7980X.
Modern multi-core CPUs are set up in multiple islands ("chiplets" with their own bus rings), even within a single socket. Looking at the Ryzen Threadripper whitepaper from AMD:
https://www.amd.com/system/files/documents/tr-pro-workstation-white-paper.pdf
Also a good page: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4
It looks like there might be around 4-8 "chiplets" on your CPU. These chiplets contribute each a fraction to your total physical cores and have very fast access to their closest memory banks (RAM), but are a bit slower if they have to cross-c…