
Prefer dnf5 for fedora CI #2742

Closed
wants to merge 217 commits

Conversation

dschwoerer
Contributor

dnf5 is faster, and is sufficiently feature complete for our needs
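
Roughly what this amounts to in the CI scripts, as a minimal sketch (the package set is illustrative, not the exact list the scripts install):

```bash
# Sketch only: prefer dnf5 when it is available, fall back to dnf otherwise.
if command -v dnf5 >/dev/null 2>&1; then
    DNF=dnf5
else
    DNF=dnf
fi
# Example packages; the real CI scripts install their own list.
$DNF install -y cmake gcc-c++ openmpi-devel
```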

dschwoerer and others added 30 commits February 28, 2022 13:34
* support sudo without password
* actually set a password
* Use --build-args rather than a bash script to generate the Dockerfile (see the sketch below)
* Previously the second cmake invocation failed, e.g. after some source
  changes or after changing a configure time option
* It also generated many files, some of which overwrote git-tracked files from
  autotools, while others were not git-ignored.
Fix some warnings and deprecated headers
Not comprehensive, still fails to run successfully on basic input
file (zero pivot in cyclic reduce)

Fixes:

- unused variables
- wrong variable names (including case)
- creating laplacians from wrong input sections
- some issues with staggering
MPI_DOUBLE is a macro, so this check can break, even if there is no mismatch.
Should be faster and more consistent
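
As a rough illustration of the --build-args approach mentioned in the commit list above (the Dockerfile path, ARG name, and image tags are hypothetical, not taken from this PR):

```bash
# One parameterised Dockerfile selected at build time via --build-arg,
# instead of generating a separate Dockerfile from a bash script.
docker build -f .docker/fedora/Dockerfile --build-arg MPI=openmpi -t bout-fedora-openmpi .
docker build -f .docker/fedora/Dockerfile --build-arg MPI=mpich   -t bout-fedora-mpich .
```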
@ZedThree
Member

Both OpenMPI and MPICH are failing with "bus error", but OpenMPI gives a bit more of a clue:

It appears as if there is not enough space for /dev/shm/vader_segment.fd7fd1d78666.1000.c41c0001.0 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

Bus errors can apparently happen due to not having enough physical memory for requested mallocs. The VM specs are:

  • 2-core CPU (x86_64)
  • 7 GB of RAM
  • 14 GB of SSD space

which I would've thought would be plenty.
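
One thing that might be worth checking, given the vader_segment message above (a guess, not something I've verified on the runner): how much space /dev/shm actually has inside the container, since Docker's default is only 64 MB.

```bash
# Inside the container: how big is the shared-memory mount?
df -h /dev/shm
# If it is the 64M Docker default, the job could be re-run with more, e.g.:
docker run --shm-size=1g <image> <test command>
```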

OpenMPI has a stack trace too:

[fd7fd1d78666:03983] Signal: Bus error (7)
[fd7fd1d78666:03983] Signal code: Non-existant physical address (2)
[fd7fd1d78666:03983] Failing at address: 0x7f04ac0ce000
[fd7fd1d78666:03983] [ 0] /lib64/libc.so.6(+0x3e990)[0x7f04b5c5a990]
[fd7fd1d78666:03983] [ 1] /lib64/libc.so.6(+0x16f8ba)[0x7f04b5d8b8ba]
[fd7fd1d78666:03983] [ 2] /lib64/libfabric.so.1(+0x625500)[0x7f04af005500]
[fd7fd1d78666:03983] [ 3] /lib64/libfabric.so.1(+0x60d1be)[0x7f04aefed1be]
[fd7fd1d78666:03983] [ 4] /lib64/libfabric.so.1(+0x611307)[0x7f04aeff1307]
[fd7fd1d78666:03983] [ 5] /lib64/libfabric.so.1(+0x693880)[0x7f04af073880]
[fd7fd1d78666:03983] [ 6] /lib64/libfabric.so.1(+0x5f1080)[0x7f04aefd1080]
[fd7fd1d78666:03983] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6f83)[0x7f04ae7dbf83]
[fd7fd1d78666:03983] [ 8] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x126)[0x7f04b40e19f6]
[fd7fd1d78666:03983] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x1e)[0x7f04af1cd2fe]
[fd7fd1d78666:03983] [10] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x99)[0x7f04b6262ec9]
[fd7fd1d78666:03983] [11] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x60b)[0x7f04b62a030b]
[fd7fd1d78666:03983] [12] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f04b623f462]
[fd7fd1d78666:03983] [13] /home/test/BOUT-dev/build/lib/libbout++.so.5.0.0(_ZN8BoutComm7getCommEv+0x2c)[0x7f04b821422c]
[fd7fd1d78666:03983] [14] /home/test/BOUT-dev/build/lib/libbout++.so.5.0.0(_ZN8BoutComm4rankEv+0x9)[0x7f04b8214309]
[fd7fd1d78666:03983] [15] /home/test/BOUT-dev/build/lib/libbout++.so.5.0.0(_Z14BoutInitialiseRiRPPc+0x2da)[0x7f04b7f9da2a]
[fd7fd1d78666:03983] [16] ./2fluid[0x40f44d]
[fd7fd1d78666:03983] [17] /lib64/libc.so.6(+0x2814a)[0x7f04b5c4414a]
[fd7fd1d78666:03983] [18] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f04b5c4420b]
[fd7fd1d78666:03983] [19] ./2fluid[0x40f585]

which looks to be basically the same for each test failure. You can see it's crashing in a call to BoutComm::getComm inside BoutInitialise, so this is very early on, before we've really allocated anything. So either something is already using most of the VM's RAM before we even start, or there's some other issue. I'll give it a whirl locally.
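
Since the trace dies inside mca_btl_ofi / libfabric, one quick way to narrow it down (just a guess, I haven't tried this on the runner) would be to exclude that BTL for a test run and see whether the bus error goes away:

```bash
# Disable the ofi BTL for this run; if the bus error disappears, the
# problem is in libfabric rather than in BOUT++ itself.
export OMPI_MCA_btl="^ofi"
mpirun -np 2 ./2fluid
# equivalently on the command line:
mpirun --mca btl '^ofi' -np 2 ./2fluid
```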

@dschwoerer
Contributor Author

I tried it a bit, and I can reproduce this when I run in a container without a tty.
With a tty it does work (though I think MPICH gets stuck somewhere); the bus error only shows up without a tty.

That of course adds to the fun of debugging ...
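
For reference, roughly the difference I mean (image and command are placeholders; only the presence or absence of -t matters):

```bash
# With a pseudo-tty allocated the tests pass:
docker run -it <image> <test command>
# Without a tty (closer to what CI does) the bus error reproduces:
docker run -i <image> <test command>
```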

@ZedThree
Member

Ouch. Was that just a matter of running a shell in the container and then running the tests, vs running the tests directly via the container?

@ZedThree
Member

Running ./.ci_fedora.sh locally I get a completely different error:

[684b4db46971:02206] mca_base_component_repository_open: unable to open mca_pmix_ext3x: /usr/lib64/openmpi/lib/openmpi/mca_pmix_ext3x.so: undefined symbol: pmix_value_load (ignored)
[684b4db46971:02206] [[27020,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320

@ZedThree
Member

I'm at a bit of a dead end here: MPICH in the container works locally for me. OpenMPI doesn't, but I think that's a bug in Fedora, and it fails differently from what we see in CI.

I don't really have time to debug this further, so I think we might have to just disable the Fedora CI for the time being until someone has time to investigate and fix it.

@dschwoerer
Contributor Author

This is the error you probably saw:
https://bugzilla.redhat.com/show_bug.cgi?id=2240042 (undetected ABI break in a dependency, a rebuild may fix this one)

Once that is fixed, the other error might return, and then I can go about debugging it.
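
If it helps to confirm the ABI break, the undefined symbol from your log can be checked directly against the installed libraries (the libpmix path below is the usual Fedora location, so treat it as a guess):

```bash
# Does the plugin still have unresolved pmix symbols, and does the
# installed libpmix export pmix_value_load at all?
ldd -r /usr/lib64/openmpi/lib/openmpi/mca_pmix_ext3x.so 2>&1 | grep -i pmix
nm -D /usr/lib64/libpmix.so.2 | grep pmix_value_load
```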

@dschwoerer
Contributor Author

Should be fixed now. Failing run was PETSc #2755

The issue was caused by libfabric; downgrading resolves it for now.
Reported as https://bugzilla.redhat.com/show_bug.cgi?id=2242447
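
For CI the workaround is essentially a downgrade and pin until the fixed build lands; something along these lines (a sketch, not the exact commands in this PR):

```bash
# Roll libfabric back to the previous Fedora build and check what we got.
dnf5 downgrade -y libfabric
rpm -q libfabric
```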

@ZedThree
Member

ZedThree commented Oct 6, 2023

This isn't really a blocker for 5.1, so I'll wait for the CI to finish on master and re-release

@ZedThree
Member

ZedThree commented Oct 6, 2023

Fedora builds finally working again, merging this into next

@ZedThree ZedThree changed the base branch from v5.1.0-rc to next October 6, 2023 10:11
@ZedThree
Member

ZedThree commented Oct 6, 2023

🤦 Need to wait till we've actually merged master into next first

@dschwoerer
Contributor Author

Shouldn't the fixes / improvements for CI also go to master?
The fixes for fmt should go into 5.1.1 as well?

@dschwoerer dschwoerer mentioned this pull request Oct 9, 2023
@dschwoerer
Contributor Author

@ZedThree Any ideas why there are still 200 commits?

@dschwoerer
Contributor Author

Replaced by #2768 via #2764

@dschwoerer
Contributor Author

Replaced by #2764 and #2768

@dschwoerer dschwoerer closed this Oct 13, 2023
@dschwoerer dschwoerer deleted the dnf5 branch October 13, 2023 07:19