Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openmpi: fix build on darwin #332983

Closed
wants to merge 5 commits into from
Closed

Conversation

markuskowa
Copy link
Member

Description of changes

This pulled in pmix unconditionally as a dependency and effectively disabled openmpi on darwin.
Fix for #327438

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@markuskowa markuskowa requested a review from doronbehar August 7, 2024 12:00
Copy link
Contributor

@doronbehar doronbehar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing and fixing!

@SigmaSquadron SigmaSquadron added 12.approvals: 1 This PR was reviewed and approved by one reputable person 11.by: package-maintainer This PR was created by the maintainer of the package it changes labels Aug 7, 2024
Copy link
Contributor

@nim65s nim65s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not working for me on aarch64-darwin:

openmpi> *** Configuring PMIx
openmpi> configure: WARNING: yes/no are invalid responses for --with-pmix-libdir.  Please specify a path.
openmpi> configure: error: Cannot continue

Copy link
Contributor

@doronbehar doronbehar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just about to push this commit instead of this one:

commit 56ad7bfea46efa13ad5429f8f2581c36e7819ef0
Author: Doron Behar <[email protected]>
Date:   Wed Aug 7 17:09:27 2024 +0300

    openmpi: also don't use --without-pmix-libdir, like (ofi-libdir)
    
    Fixes the configurePhase on Dawrin, see:
    https://github.com/NixOS/nixpkgs/pull/332983#pullrequestreview-2225271302

diff --git a/pkgs/development/libraries/openmpi/default.nix b/pkgs/development/libraries/openmpi/default.nix
index 38080e151f58..86d4670ed102 100644
--- a/pkgs/development/libraries/openmpi/default.nix
+++ b/pkgs/development/libraries/openmpi/default.nix
@@ -117,7 +117,6 @@ stdenv.mkDerivation (finalAttrs: {
     (lib.enableFeature fortranSupport "mpi-fortran")
     (lib.withFeatureAs stdenv.isLinux "libnl" (lib.getDev libnl))
     "--with-pmix=${if stdenv.isLinux then (lib.getDev pmix) else "internal"}"
-    (lib.withFeatureAs stdenv.isLinux "pmix-libdir" "${lib.getLib pmix}/lib")
     # Puts a "default OMPI_PRTERUN" value to mpirun / mpiexec executables
     (lib.withFeatureAs stdenv.isLinux "prrte" (lib.getBin prrte))
     (lib.withFeature enableSGE "sge")
@@ -129,9 +128,11 @@ stdenv.mkDerivation (finalAttrs: {
     (lib.enableFeature cudaSupport "dlopen")
     (lib.withFeatureAs fabricSupport "psm2" (lib.getDev libpsm2))
     (lib.withFeatureAs fabricSupport "ofi" (lib.getDev libfabric))
-    # The flag --without-ofi-libdir is not supported from some reason, so we
-    # don't use lib.withFeatureAs
-  ] ++ lib.optionals fabricSupport [ "--with-ofi-libdir=${lib.getLib libfabric}/lib" ];
+    # Any --without-*-libdir flag is not supported from some reason, so we
+    # don't use lib.withFeatureAs for these flags.
+  ]
+    ++ lib.optionals stdenv.isLinux [ "--with-pmix-libdir=${lib.getLib pmix}/lib" ]
+    ++ lib.optionals fabricSupport [ "--with-ofi-libdir=${lib.getLib libfabric}/lib" ];
 
   enableParallelBuilding = true;
 

] ++ lib.optional fabricSupport "--with-ofi-libdir=${lib.getLib libfabric}/lib"
# The pmix flags should only be set when pmix is used
++ lib.optionals stdenv.isLinux [
"--with-pmix=${lib.getDev pmix}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this change - you still want to set --with-pmix=internal no?

Copy link
Member Author

@markuskowa markuskowa Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to set this at all on darwin? "internal" should be the default if it can be built on a given platform. Otherwise it should not be built at all. Note, that pmix is feature to simplify startup of a large number process over a multiple machines. Unless you have a large cluster of darwin machines (managed by slurm or some other wlm) pmix is not of much use anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to set this at all on darwin? "internal" should be the default if it can be built on a given platform. Otherwise it should not be built at all.

Wait, are you saying that there is a difference between using --with-pmix=internal and not using at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the help text from configure:

  --with-pmix(=DIR)       Build pmix support. DIR can take one of three
                          values: "internal", "external", or a valid directory
                          name. "internal" forces Open MPI to use its internal
                          copy of pmix. "external" forces Open MPI to use an
                          external installation of pmix. Supplying a valid
                          directory name also forces Open MPI to use an
                          external installation of pmix, and adds DIR/include,
                          DIR/lib, and DIR/lib64 to the search path for
                          headers and libraries. Note that Open MPI no longer
                          supports --without-pmix. If no argument is
                          specified, Open MPI will search default locations
                          for pmix and fall back to an internal version if one
                          is not found.
  --with-pmix-libdir=DIR  Search for pmix libraries in DIR. Should only be
                          used if an external copy of pmix is being used.

I would not request to build pmix at all on non-Linux, but let openmpi's internal defaults decide if it wants to build it or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not request to build pmix at all on non-Linux, but let openmpi's internal defaults decide if it wants to build it or not.

Usually in packages with many such build options that I maintain, I prefer to track upstream's default behavior and mimic it in the Nix functional behavior ("a function always returns something" sort of). What's weird to me, is that there shouldn't be a significant difference between the bundled pmix and the one we package. Perhaps we should just try to enable pmix for all platforms?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should just try to enable pmix for all platforms?

What would be the benefit? pmix requires tight integration with a workload manager. That is already difficult enough to achieve on Linux. Why would we add the extra maintenance effort if it is a niche (if any at all) application on non-Linux.

The benefit would be that you'd handle all platforms consistently. If there will be a bug in the future with pmix only on 1 platform, you'd notice that and be able to focus on that when building pmix, and not when building a openmpi. Also, it'd be easier to apply patches to pmix when it is not a submodule. Note how they simply use a git submodule for pmix:

https://github.com/open-mpi/ompi/tree/v5.0.3/3rd-party

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I can not see that benefit. For me this just means more maintenance effort, to make sure this also builds on darwin (where it is not even needed). Do we know if openmpi even builds with the internal pmix on Darwin?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I can not see that benefit. For me this just means more maintenance effort, to make sure this also builds on darwin.

You are making sure it also builds fine on Darwin when you are building openmpi on Darwin, because you are building it from the submodule.

... (where it is not even needed). Do we know if openmpi even builds with the internal pmix on Darwin?

You know it builds with both, because they are the same - up to minor mismatches between the pmix release we build, and the one that was used as the submodule when openmpi was released.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know it builds with both, because they are the same

What I actually meant is: does openmpi build the internal pmix if no --with-pmix is specified or is it simply disabled on darwin per default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know it builds with both, because they are the same

What I actually meant is: does openmpi build the internal pmix if no --with-pmix is specified or is it simply disabled on darwin per default?

I suspect that if --with-pmix is not specified, they simply use the bundled pmix. Perhaps @nim65s could share with us the build log and we will be more sure.

@markuskowa
Copy link
Member Author

@nim65s thanks for checking. I fixed the pmix config parameters but the lib.withFeature* functions may have broken other options on darwin. Can you please check and give me feedback what else needs to be changed. I do not have access to darwin machine, which makes debugging issues with darwin nearly impossible.

@doronbehar
Copy link
Contributor

Also, very annoyingly, nixfmt is annoyed now and requires changing the indentation of all of configureFlags 😠 . I can do that in a separate commit @markuskowa after my proposed commit.

@nim65s
Copy link
Contributor

nim65s commented Aug 7, 2024

Now, in postInstall:

openmpi> substituteStream() in derivation openmpi-5.0.3: ERROR: pattern compiler=g++ doesn't match anything in file '/nix/store/4xcjjj5zghf6rb9cx2bb0k9f6m4k5wra-openmpi-5.0.3/share/openmpi/mpic++-wrapper-data.txt'

@doronbehar doronbehar changed the title openmpi: remove pmix reference from comment string openmpi: fix build on darwin Aug 7, 2024
@doronbehar doronbehar added the 6.topic: darwin Running or building packages on Darwin label Aug 7, 2024
@doronbehar
Copy link
Contributor

Now, in postInstall:

openmpi> substituteStream() in derivation openmpi-5.0.3: ERROR: pattern compiler=g++ doesn't match anything in file '/nix/store/4xcjjj5zghf6rb9cx2bb0k9f6m4k5wra-openmpi-5.0.3/share/openmpi/mpic++-wrapper-data.txt'

Thanks for reporting! Unfortunately, it will be hard to solve this remotely... Here's what I would have done, if I had a Darwin machine: Build openmpi with this temporary change applied:

diff --git i/pkgs/development/libraries/openmpi/default.nix w/pkgs/development/libraries/openmpi/default.nix
index 4af4c475ea62..93d8683fcca5 100644
--- i/pkgs/development/libraries/openmpi/default.nix
+++ w/pkgs/development/libraries/openmpi/default.nix
@@ -218,10 +218,13 @@ stdenv.mkDerivation (finalAttrs: {
         (lib.mapCartesianProduct (
           { part1, part2 }:
           ''
-            substituteInPlace "''${!outputDev}/share/openmpi/${part1}${part2}-wrapper-data.txt" \
-              --replace-fail \
-                compiler=${lib.elemAt wrapperDataSubstitutions.${part2} 0} \
-                compiler=${lib.elemAt wrapperDataSubstitutions.${part2} 1}
+            if [[ -f "''${!outputDev}/share/openmpi/${part1}${part2}-wrapper-data.txt" ]]; then
+              echo @@@@ "''${!outputDev}/share/openmpi/${part1}${part2}-wrapper-data.txt"
+              cat "''${!outputDev}/share/openmpi/${part1}${part2}-wrapper-data.txt"
+            else
+              echo @@@@ file does not exist \
+                "''${!outputDev}/share/openmpi/${part1}${part2}-wrapper-data.txt"
+            fi
           ''
         ))
         (lib.concatStringsSep "\n")

And inspect the output... We probably need to change this pattern for the Darwin... Or perhaps disable this substitution for Darwin. I wonder what comes out in those files anyway.

@markuskowa
Copy link
Member Author

Or perhaps disable this substitution for Darwin. I wonder what comes out in those files anyway.

As is, we can run the whole postInstall only on Linux. Unless we have a dawin maintainer who is willing to take of this, I do not see another option.

@doronbehar
Copy link
Contributor

Or perhaps disable this substitution for Darwin. I wonder what comes out in those files anyway.

As is, we can run the whole postInstall only on Linux. Unless we have a dawin maintainer who is willing to take of this, I do not see another option.

Yea that might be what we'll do - after all this is meant for cross compilation which I barely imagine people do that even from Linux. Not only that, this is not a regular build -> host cross compilation related, but rather a build -> target -> host substitution.

in lib.optionalString stdenv.isLinux
''
in
lib.optionalString stdenv.isLinux ''
Copy link
Contributor

@doronbehar doronbehar Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last nixfmt commit could be squashed with it's former, because the diff noise is not too large.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll squash once we have sorted out all issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll squash once we have sorted out all issues.

I wouldn't want to squash all commits, as the diff noise is large for the configureFlags change... Perhaps if we'd fix pmix on Darwin the diff noise there will be negligible.

@@ -193,7 +193,7 @@ stdenv.mkDerivation (finalAttrs: {
];
part2 = builtins.attrNames wrapperDataSubstitutions;
};
in
in lib.optionalString stdenv.isLinux
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to only condition the substitution and not all of the postInstall... We might enable multiple outputs in the future for other platforms, and you'd still want to delete all *.la files...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need extra condition for the split output moves too. If we want to enable multiple outputs in the future for other platforms we can make it more granular later? I do not know what darwin does with .la files. I know that they are not needed on Linux.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need extra condition for the split output moves too.

No you won't. That was the purpose of 2a4636b .

If we want to enable multiple outputs in the future for other platforms we can make it more granular later?

The diff will be much smaller thanks to 2a4636b .

I do not know what darwin does with .la files. I know that they are not needed on Linux.

Yea that line was peculiar to me so I didn't touch it.. I suspect they might be generated on Darwin as well, so I wouldn't touch it even now. For sure it doesn't hurt if these files are not found at all.

@nim65s
Copy link
Contributor

nim65s commented Aug 7, 2024

c5a289b build fine on aarch64-darwin

@doronbehar
Copy link
Contributor

As an alternative to this PR, I suggest:

#333016

If @nim65s would be kind enough to help us test it quicker then ofborg, that would be great :).

@nim65s nim65s mentioned this pull request Aug 7, 2024
13 tasks
@wegank wegank removed the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Aug 10, 2024
@doronbehar doronbehar closed this Aug 12, 2024
@markuskowa markuskowa deleted the fix-openmpi branch November 30, 2024 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: darwin Running or building packages on Darwin 10.rebuild-darwin: 101-500 10.rebuild-linux: 101-500 11.by: package-maintainer This PR was created by the maintainer of the package it changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants