MPI_Reduce when a process failed not really blocking #67

Open
Loay-Tabbara opened this issue Aug 17, 2022 · 3 comments

@Loay-Tabbara

Hello,

I am using "mpirun (Open MPI) 5.0.0rc6" with the latest Fenix version from the master branch.

MPI_Reduce lets some processes pass through with the initial-rank state and finish before the error is notified or handled, even though a process has already failed. If there is no barrier after the reduce, this results in a wrong sum in the code below.

This does not happen with MPI_Bcast & MPI_Allreduce.

If needed, here is a link to the C code file on easyupload; it will expire after a couple of weeks.

#include <stdio.h>
#include <fenix.h>
#include <mpi.h>
#include <signal.h>

int me, total, error_code, sum = 0, fenix_status, spares = 1, spawn_policy = 0;
int old_me;

int main(int argc, char *argv[]) {
    MPI_Comm new_comm, current_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &current_comm);

    Fenix_Init(&fenix_status, current_comm, &new_comm, &argc, &argv,
               spares, spawn_policy, MPI_INFO_NULL, &error_code);

    MPI_Comm_rank(new_comm, &me);
    MPI_Comm_size(new_comm, &total);

    if (fenix_status == FENIX_ROLE_INITIAL_RANK) {
        old_me = me;
    }

    MPI_Barrier(new_comm);
    if (me == total - 1 && fenix_status == FENIX_ROLE_INITIAL_RANK) {
        printf("killing: (%d)\n", me);
        fflush(stdout);
        raise(SIGKILL);
    }

    //MPI_Allreduce(&me, &sum, 1, MPI_INT, MPI_SUM, new_comm); // does not produce a problem
    MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, new_comm);   // produces a problem

    printf("P(%d) i was P(%d), state: %d\n", me, old_me, fenix_status);

    if (me == 0) printf("sum:%d\n", sum);
    //MPI_Barrier(new_comm); // when this barrier is commented out, wrong sums come out
    //                       // (a different sum on every run), because the processes that
    //                       // passed the reduce in the initial-rank state finish if no
    //                       // barrier exists
    Fenix_Finalize();
    MPI_Finalize();
    return 0;
}

@Loay-Tabbara (Author)

I forgot that there is a drag-and-drop function here; here is the file:
fenix-reduce-issue.c.txt

@bosilca (Contributor)

bosilca commented Feb 14, 2023

MPI_Reduce being a rooted collective, the only process where the error would be correctly reported is the root, because it will miss one of the contributions. For every other process, the reduction, especially on very short data, will appear as a non-blocking communication.

Imagine an execution scenario where the process that is expected to die is extremely late, to the point where all other processes were able to send their contributions to the root (assume the reduction is implemented as a star, with all processes sending their data directly to the root). Thus, at the moment a process sent its contribution there was no known error, the send completed successfully, and the process was able to get out of MPI_Reduce without reporting any problem. The observed behavior is legitimate in a distributed system.

You should be able to see the same behavior in MPI_Bcast in some cases. Basically, if the broadcast topology is a star and you kill the last process in the communicator, every other process will correctly receive its data, and the error will only be reported at the root. This cannot happen in non-rooted operations such as MPI_Allreduce or MPI_Barrier, because all processes need to contribute, and one of them will not be able to.
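
A minimal sketch of how to observe this non-uniform reporting directly, outside of Fenix and assuming a ULFM-enabled Open MPI with MPI_ERRORS_RETURN set on the communicator (MPIX_ERR_PROC_FAILED comes from <mpi-ext.h>): print each rank's return code from MPI_Reduce; typically only the root, which is missing a contribution, reports the failure.

#include <stdio.h>
#include <signal.h>
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_ERR_PROC_FAILED); assumes a ULFM-enabled build */

int main(int argc, char *argv[]) {
    int me, total, sum = 0, rc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &total);

    /* Report errors to the caller instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Barrier(MPI_COMM_WORLD);
    if (me == total - 1)
        raise(SIGKILL);   /* kill the last rank, as in the report above */

    rc = MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Non-root ranks frequently see MPI_SUCCESS here: their send completed
     * before the failure was detected. Only the root, which misses one
     * contribution, reliably returns an error (MPIX_ERR_PROC_FAILED). */
    printf("rank %d: MPI_Reduce returned %s\n", me,
           rc == MPI_SUCCESS ? "MPI_SUCCESS" : "an error");

    MPI_Finalize();
    return 0;
}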

@abouteiller

abouteiller commented Feb 16, 2023

Hi, this looks like normal behavior w.r.t. the outcome of the REDUCE call. You do not have a guarantee that MPI_REDUCE will produce the same error code at all ranks by default. Multiple mitigations are possible here:

  1. Use a synchronizing call before you enter Fenix_Finalize, to guarantee that an error handler is called before you leave (see the sketch after this list).
  2. Use synchronizing/uniform versions of the collective operations. This is a new feature that has recently been added to the ULFM spec (more info can be found at https://fault-tolerance.org/2022/10/31/ulfm-specification-update-4/). We have code that supports that feature, but it has not yet made its way to the Open MPI main repo. There are performance considerations here.
  3. Fenix_Finalize should probably (optionally?) perform some synchronization of its own to avoid this problem, as it can be confusing to end users that some processes exit when there is recovery work to do.
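
For mitigation 1, a minimal sketch against the program in the report: re-enable the commented-out barrier (or, assuming a ULFM-enabled Open MPI that exposes MPIX_Comm_agree via <mpi-ext.h>, use an agreement, which is uniform across survivors) between the reduce and Fenix_Finalize, so every surviving rank observes the failure before any process can exit with a wrong sum. This shows only the tail of the reporter's main() above:

    /* ... tail of the program above, after the reduce ... */
    MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, new_comm);

    /* Mitigation 1: a synchronizing call before Fenix_Finalize ensures every
     * surviving rank learns about the failure, so Fenix's error handler fires
     * before any rank can finish with a stale sum. */
    MPI_Barrier(new_comm);

    /* Alternative sketch (assumes <mpi-ext.h> and a ULFM-enabled Open MPI):
     * an agreement gives a uniform outcome at all surviving ranks.
     *
     *   int flag = 1;
     *   MPIX_Comm_agree(new_comm, &flag);
     */

    if (me == 0) printf("sum:%d\n", sum);
    Fenix_Finalize();
    MPI_Finalize();
    return 0;
}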
