
How do I reduce Pilgrim overhead? #38

Open
michael-beebe opened this issue Aug 11, 2023 · 3 comments

@michael-beebe

Any scientific application (LAMMPS, WarpX) I try to plug Pilgrim into ends up unable to run to completion. For example, I used a very simple LJ LAMMPS potential with a very small problem size that finishes in about 15 seconds on a single node. When I turn on Pilgrim like so:

```bash
#!/bin/bash -l
#SBATCH ...

ml load cray-mpich/8.1.25
ml load PrgEnv-gnu/8.3.3

export PILGRIM_INSTALL=""
export PILGRIM_DEBUG=0
export PILGRIM_TIMING_MODE=ZSTD # or LOSSLESS, or AGGREGATED, i've tried them all
export PILGRIM_TRACING=ON
export PILGRIM_TRACING_MODE=DEFAULT
pilgrim_flags="--export=ALL,LD_PRELOAD=${PILGRIM_INSTALL}/.libs/libpilgrim.so"

EXE=../bin/warpx.3d.MPI.CUDA.DP.PDP.OPMD.QED
INPUTS=./inputs

export MPICH_OFI_NIC_POLICY=GPU
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
SRUN_FLAGS="--cpus-per-task=16 --cpu-bind=cores"

srun --cpu-bind=cores $pilgrim_flags bash -c "
export CUDA_VISIBLE_DEVICES=$((3-SLURM_LOCALID));
${EXE} ${INPUTS} ${GPU_AWARE_MPI}" \

${PILGRIM_INSTALL}/pilgrim2text ./pilgrim-logs
```

The job times out after 3 hours. Any suggestions to reduce the overhead so I can get the job to finish? I have been successful with small test cases but not with any "real world" apps.
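A rough way to tell slow progress apart from a hang is to watch the job's output while it runs; a minimal sketch (the Slurm job ID is a placeholder, and growing files in `./pilgrim-logs` only indicate progress if Pilgrim flushes traces during the run, which is an assumption):

```bash
# Two quick checks while the job is running (job ID 1234567 is a placeholder):
# 1) steady thermo/step output from the application means slow progress,
#    while a file that stays silent for hours suggests a hang
tail -n 20 slurm-1234567.out
# 2) if Pilgrim flushes trace files during the run (an assumption), growing
#    files in the output directory point the same way
ls -l ./pilgrim-logs
```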

@wangvsa
Collaborator

wangvsa commented Aug 11, 2023

Did you see any output suggesting the simulation was still running? 15 seconds vs. 3 hours doesn't look like an overhead issue; more likely a deadlock/blocking bug in Pilgrim. Can you share your LAMMPS configuration file? I could try it on my side.

@michael-beebe
Author

Sure thing! Here is the input I mentioned:

```
# 3d Lennard-Jones melt

variable N index off # Newton Setting
variable w index 0 # Warmup Timesteps
variable t index 10 # Main Run Timesteps
variable m index 1 # Main Run Timestep Multiplier
variable n index 0 # Use NUMA Mapping for Multi-Node
variable p index 0 # Use Power Measurement

variable x index 1
variable y index 1
variable z index 1

variable xx equal $x
variable yy equal $y
variable zz equal $z
variable rr equal floor($t*$m)

newton $N
if "$n > 0" then "processors * * * grid numa"

units lj
atom_style atomic

lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0

velocity all create 1.44 87287 loop geom

pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5

neighbor 0.3 bin
neigh_modify delay 0 every 20 check no

fix 1 all nve
thermo 1000

if "$p > 0" then "run_style verlet/power"

if "$w > 0" then "run $w"
run ${rr}
```
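Since the variables are declared with `index`, the problem size and run length can be overridden from the command line without editing the file. A sketch of such a launch, reusing the preload flags from the job script above (the `lmp` binary name and `in.lj` filename are assumptions):

```bash
# Larger box (4x4x4 unit cells per dimension) and a longer run, driven purely
# by -var overrides; the binary name and input filename are assumptions.
srun -N 1 -n 8 \
    --export=ALL,LD_PRELOAD=${PILGRIM_INSTALL}/.libs/libpilgrim.so \
    lmp -in in.lj -var x 4 -var y 4 -var z 4 -var t 1000
```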

@wangvsa
Collaborator

wangvsa commented Aug 15, 2023

Just tried your input and a few other LAMMPS configurations, and they all worked fine on my side.
Which machine were you using? It's unlikely to be the issue, but could you try some applications without GPUs?
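For reference, a CPU-only variant of the launch might look like the sketch below: it keeps only the Pilgrim preload and drops the per-rank CUDA_VISIBLE_DEVICES remapping and the GPU-aware MPI flag. The `./app_cpu` binary is a placeholder for a CPU build of the application.

```bash
# CPU-only sketch: same preload mechanism, no GPU binding or GPU-aware MPI.
# ./app_cpu is a placeholder for a CPU build of the application.
export PILGRIM_TRACING=ON
export PILGRIM_TRACING_MODE=DEFAULT
srun -N 1 -n 8 --cpu-bind=cores \
    --export=ALL,LD_PRELOAD=${PILGRIM_INSTALL}/.libs/libpilgrim.so ./app_cpu
${PILGRIM_INSTALL}/pilgrim2text ./pilgrim-logs
```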
