bump Kokkos to version 4.2 #109

Closed
wants to merge 1 commit into from

Contributor

@BenWibking BenWibking commented Apr 12, 2024

This updates Kokkos to version 4.2.01.

Kokkos 4.2 fixed bugs in reductions on both CUDA and HIP: kokkos/kokkos#6197

@BenWibking BenWibking requested a review from pgrete April 12, 2024 19:26
@BenWibking
Contributor Author

@pgrete Do you know what happened to the CI here? It didn't even initialize the Docker image...

@BenWibking
Contributor Author

@pgrete It looks like there's something wrong with the driver version on the CI machine:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown
  Error: failed to start containers: 3f6b9274a47dd4451cbdae35c0788d44bf2d412fcd8d8e477d2e131700214a8b
  Error: Docker start fail with exit code 1

@pgrete
Contributor

pgrete commented Apr 15, 2024

The CI machine needs to reload the CUDA driver modules (currently prevented by a running job). Should be fixed later today.

@BenWibking
Contributor Author

@pgrete Do you have any idea what is causing the MPI failures? They look like this:

[93f25432e875:07073] *** Process received signal ***
[93f25432e875:07073] Signal: Segmentation fault (11)
[93f25432e875:07073] Signal code: Address not mapped (1)
[93f25432e875:07073] Failing at address: (nil)
[93f25432e875:07073] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb8e7f78090]
[93f25432e875:07073] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x183bf2)[0x7fb8e80b8bf2]
[93f25432e875:07073] [ 2] /opt/openmpi/lib/libopen-rte.so.40(+0x2d821)[0x7fb8e826f821]
[93f25432e875:07073] [ 3] /opt/openmpi/lib/libopen-rte.so.40(orte_show_help_recv+0x177)[0x7fb8e826fcb7]
[93f25432e875:07073] [ 4] /opt/openmpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x3e1)[0x7fb8e82cd7a1]
[93f25432e875:07073] [ 5] /opt/openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x7b3)[0x7fb8e81b3f13]
[93f25432e875:07073] [ 6] /opt/openmpi/bin/mpiexec(+0x14a1)[0x55c8789de4a1]
[93f25432e875:07073] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb8e7f59083]
[93f25432e875:07073] [ 8] /opt/openmpi/bin/mpiexec(+0x11fe)[0x55c8789de1fe]
[93f25432e875:07073] *** End of error message ***

@pgrete
Contributor

pgrete commented Apr 16, 2024

Do they happen on init?
I've seen them intermittently, and IIRC they were (under some circumstances) also related to Kokkos 4.2.
Also, have you done any performance tests with Kokkos 4.2? AthenaK reverted after discovering performance issues.

@BenWibking
Contributor Author

> Do they happen on init? I've seen them intermittently, and IIRC they were (under some circumstances) also related to Kokkos 4.2. Also, have you done any performance tests with Kokkos 4.2? AthenaK reverted after discovering performance issues.

No, they appear to happen at random timesteps in the middle of each test.

Also, no...maybe this is not a good idea. I will close this PR. Maybe things will stabilize in a few months.

@BenWibking BenWibking closed this Apr 16, 2024