Update ALCF Sunspot and Aurora machine configs #6553

amametjanov · 2024-08-16T17:38:27Z

Update ALCF Sunspot and Aurora machine configs.

[BFB]

github-actions · 2024-08-16T17:40:30Z

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6553/
on branch `gh-pages` at 2024-08-31 02:24 UTC

oksanaguba · 2024-08-16T18:03:26Z

I was tagged to review this. I have no opinion regarding these modules (though the removed ones do not seem to be used in the SCREAM configs). It might be useful to list which cases you were able to build and run, and with which compilers, for this change.

omarkahmed

@amametjanov, is there a specific reason why you are using the older software stack? The latest available module on Sunspot is oneapi/eng-compiler/2024.04.15.002.

amametjanov · 2024-08-16T18:40:14Z

Yes, just a draft to get basics working. I'm testing oneapi/eng-compiler/2024.04.15.002 and will ping for (re-) review. Thanks.

rljacob · 2024-08-16T18:49:11Z

@mahf708 the CI testing is running even though this is a draft PR. Can we change that?

xylar · 2024-08-16T18:51:13Z

I don't think there's a way to prevent CI from draft PRs. At least I've never seen it.

mahf708 · 2024-08-16T18:54:41Z

> @mahf708 the CI testing is running even though this is a draft PR. Can we change that?

Yes We Can

@xylar, I think specifying the type will do, but like you, I have never seen it or used it... https://github.com/orgs/community/discussions/25722#discussioncomment-5281953

Or we can apparently query the draftness: https://stackoverflow.com/a/68349262/22990681.
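As a rough illustration of the "query the draftness" idea, here is a minimal Python sketch of a gate script a workflow step could call. It assumes the job runs under GitHub Actions, which exposes the webhook payload as a JSON file at `GITHUB_EVENT_PATH`, and that the payload for a pull request event carries a boolean `pull_request.draft` field. The function names `should_run_ci` and `load_event` are hypothetical and not part of the E3SM workflows.

```python
import json
import os


def should_run_ci(event: dict) -> bool:
    """Return False when the event describes a draft pull request, True otherwise."""
    pull_request = event.get("pull_request") or {}
    return not pull_request.get("draft", False)


def load_event(path: str) -> dict:
    """Read the webhook payload that GitHub Actions writes to disk for the job."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # GITHUB_EVENT_PATH is set by the Actions runner; outside of it we fall back
    # to an empty payload, which is treated as "not a draft".
    event_path = os.environ.get("GITHUB_EVENT_PATH")
    event = load_event(event_path) if event_path else {}
    print("run" if should_run_ci(event) else "skip")
```

The same check can likely be written directly in the workflow file as a job-level condition on the draft flag, which would avoid starting the job at all; the script form is shown only to make the logic explicit.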

mahf708 · 2024-08-16T18:58:26Z

@rljacob: I think we may want to move to labels-based testing (especially if we implement the self-hosted solution). (@xylar, for reference, https://github.com/mahf708/test-gh-runner-chrys/actions/runs/10409947659 ran on a chrysalis login node)

xylar · 2024-08-16T20:33:33Z

Good to know!

amametjanov · 2024-08-19T23:02:38Z

Testing on Sunspot:

e3sm_integration + oneapi-ifx + c6e75e3: all pass -- https://my.cdash.org/viewTest.php?buildid=2636710
e3sm_scream_v1 + oneapi-ifx + ffec269: all pass -- https://my.cdash.org/viewTest.php?buildid=2636905
e3sm_gpucxx + oneapi-ifxgpu + d5c15c3: all pass -- https://my.cdash.org/viewTest.php?buildid=2638596
e3sm_scream_v1 + oneapi-ifxgpu : build errors in homme, marking this combination beyond this PR's scope

omarkahmed · 2024-08-20T03:36:32Z

@amametjanov, did you run this branch with oneapi-ifxgpu?

To clarify, I think a few small patches are needed for cases such as F2010-MMF1 to run on oneapi-ifxgpu. However, I wanted to confirm whether that is within the scope of this PR.

omarkahmed

@amametjanov, your changes look good to me.

I would suggest considering the following changes to the oneapi-ifxgpu config:

<MAX_MPITASKS_PER_NODE compiler="oneapi-ifxgpu">48</MAX_MPITASKS_PER_NODE>
<env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4</env>

This would allow you to run in "4-CCS mode", where you can use 4 MPI ranks per PVC tile with minimal performance differences on the device side (assuming that work is balanced between ranks), but significant speed-up on the host due to the increased number of processes.
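The arithmetic behind these two settings can be sketched in Python. This is an illustration only: it assumes the `ZEX_NUMBER_OF_CCS` value is meant to be one `device:count` pair per device index 0 through 7, as the suggestion above appears to intend, and the helper names `ccs_env_value` and `tiles_per_node` are hypothetical rather than part of CIME or the machine files.

```python
def ccs_env_value(num_devices: int, ccs_per_device: int) -> str:
    """Build a comma-separated list of device:count pairs, one per device index."""
    if num_devices < 1 or ccs_per_device < 1:
        raise ValueError("num_devices and ccs_per_device must be positive")
    return ",".join(f"{device}:{ccs_per_device}" for device in range(num_devices))


def tiles_per_node(max_tasks_per_node: int, ranks_per_tile: int) -> int:
    """Number of tiles implied by a per-node rank limit and a per-tile rank count."""
    if max_tasks_per_node % ranks_per_tile != 0:
        raise ValueError("rank limit is not a multiple of ranks per tile")
    return max_tasks_per_node // ranks_per_tile


if __name__ == "__main__":
    # Eight device entries with 4 CCS each, matching the indices in the suggestion.
    print(ccs_env_value(8, 4))
    # 48 MPI tasks per node at 4 ranks per tile.
    print(tiles_per_node(48, 4))
```

Under these assumptions, 48 MPI tasks per node at 4 ranks per tile corresponds to 12 tiles per node, so each rank is paired with its own compute command streamer.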

omarkahmed

@amametjanov, I am seeing some errors on your branch when running on Aurora with multiple OpenMP threads. Removing the "--cpu-bind depth" option from mpiexec appears to resolve them. Otherwise, things are looking great!

I wanted to note that I will be on bonding leave over the next two months, so my participation during this period will be sporadic starting on Monday.

rljacob · 2024-09-05T17:33:09Z

Notes: Aurora currently down for network upgrade. Will wait for it to come back up and test again.

Update ALCF Sunspot machine config

ffdb1bc

amametjanov added the Machine Files and BFB (PR leaves answers BFB) labels Aug 16, 2024

amametjanov self-assigned this Aug 16, 2024

rljacob requested review from oksanaguba and omarkahmed August 16, 2024 17:44

oksanaguba approved these changes Aug 16, 2024

omarkahmed reviewed Aug 16, 2024

amametjanov added 4 commits August 19, 2024 22:26

Update oneapi to latest default

7c473e3

Run BGC coupled cases on at least 2 nodes

a4b7c3b

Remove -check flags from debug builds

e3e52e2

They lead to link errors in currently available oneapi versions.

Cleanup unused mpilib refs

c6e75e3

amametjanov added 2 commits August 20, 2024 03:21

Run ne4-cases at 96x1 MPIxOMP

eae53e5

Add initial sunspot machine file for eamxx

ffec269

amametjanov added 3 commits August 21, 2024 23:18

Load pre-built kokkos module for GPU builds

0b58cb0

Also export ZES_ENABLE_SYSMAN=1 to avoid ext_intel_free_memory run-time errors.

Fix a SYCL typo

226ac0f

Also avoid a call to variadic printf in a SYCL kernel within an error diagnostic log.

Let SYCL kernels call a virtual function

d5c15c3

omarkahmed approved these changes Aug 26, 2024

amametjanov added 2 commits August 28, 2024 20:35

Run in 4-CCS mode

bd79ca9

Disable openmp-offload

6ae0588

amametjanov marked this pull request as ready for review August 28, 2024 20:55

amametjanov requested review from omarkahmed and oksanaguba August 28, 2024 20:55

Update Aurora machine config

04872f3

amametjanov changed the title ~~Update ALCF Sunspot machine config~~ Update ALCF Sunspot and Aurora machine configs Aug 30, 2024

omarkahmed approved these changes Aug 30, 2024

Remove "--cpu-bind depth" in Aurora mpiexec

4b5cf35

amametjanov added a commit that referenced this pull request Sep 6, 2024

Merge branch 'azamat/machines/sunspot20240816' into next (PR #6553)

651ba75

Update ALCF Sunspot and Aurora machine configs. [BFB]

amametjanov merged commit 0ce7fcf into master Sep 6, 2024
21 checks passed

amametjanov deleted the azamat/machines/sunspot20240816 branch September 6, 2024 21:14

rljacob mentioned this pull request Sep 17, 2024

Aurora config E3SM-Project/scream#2593

Open

amametjanov added the Aurora label Nov 5, 2024
