MPI_Reduce when a process failed not really blocking #67

Open
Loay-Tabbara opened this issue Aug 17, 2022 · 3 comments

@Loay-Tabbara

Hello,

I am using "mpirun (Open MPI) 5.0.0rc6" with the latest Fenix version from the master branch.

MPI_Reduce lets some processes pass through with the initial-rank state and finish before the error is notified or handled, even though a process has already failed. If there is no barrier after the reduce, this results in a wrong sum in the code below.

This does not happen with MPI_Bcast & MPI_Allreduce.

If needed, here is a link to the C code file on easyupload; it will expire after a couple of weeks.

#include <stdio.h>
#include <fenix.h>
#include <mpi.h>
#include <signal.h>

int me, total, error_code, sum = 0, fenix_status, spares = 1, spawn_policy = 0;
int old_me;

int main(int argc, char *argv[]) {
    MPI_Comm new_comm, current_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &current_comm);

    Fenix_Init(&fenix_status, current_comm, &new_comm, &argc, &argv,
               spares, spawn_policy, MPI_INFO_NULL, &error_code);

    MPI_Comm_rank(new_comm, &me);
    MPI_Comm_size(new_comm, &total);

    if (fenix_status == FENIX_ROLE_INITIAL_RANK) {
        old_me = me;
    }

    MPI_Barrier(new_comm);
    if (me == total - 1 && fenix_status == FENIX_ROLE_INITIAL_RANK) {
        printf("killing: (%d)\n", me);
        fflush(stdout);
        raise(SIGKILL);
    }

    //MPI_Allreduce(&me, &sum, 1, MPI_INT, MPI_SUM, new_comm); // does not produce a problem
    MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, new_comm);   // produces a problem

    printf("P(%d) i was P(%d), state: %d\n", me, old_me, fenix_status);

    if (me == 0) printf("sum:%d\n", sum);
    //MPI_Barrier(new_comm); // when this barrier is commented out, wrong sums come out
    //                       // (a different sum on every run), because the processes that
    //                       // passed the reduce in the initial-rank state finish if no
    //                       // barrier exists
    Fenix_Finalize();
    MPI_Finalize();
    return 0;
}

@Loay-Tabbara (Author)

I forgot that there is a drag-and-drop function here; here is the file:
fenix-reduce-issue.c.txt

@bosilca (Contributor)

bosilca commented Feb 14, 2023

MPI_Reduce being a rooted collective, the only process where the error would be correctly reported is the root, because it will miss one of the contributions. For every other process, the reduction, especially on very short data, will appear as a non-blocking communication.

Imagine an execution scenario where the process that is expected to die is extremely late, to the point where all other processes were able to send their contributions to the root (assume the reduction is implemented as a star, with all processes sending their data directly to the root). Thus, at the moment a process sent its contribution there was no known error, the send completed successfully, and the process was able to get out of MPI_Reduce without reporting any problem. The observed behavior is legitimate in a distributed system.

You should be able to see the same behavior in MPI_Bcast in some cases. Basically, if the broadcast topology is a star and you kill the last process in the communicator, every other process will correctly receive its data, and the error will only be reported at the root. This cannot happen in non-rooted operations such as MPI_Allreduce or MPI_Barrier, because all processes need to contribute, and one of them will not be able to.
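
A minimal sketch of how to observe this non-uniform reporting directly, outside of Fenix and assuming a ULFM-enabled Open MPI with MPI_ERRORS_RETURN set on the communicator (MPIX_ERR_PROC_FAILED comes from <mpi-ext.h>): print each rank's return code from MPI_Reduce; typically only the root, which is missing a contribution, reports the failure.

#include <stdio.h>
#include <signal.h>
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_ERR_PROC_FAILED); assumes a ULFM-enabled build */

int main(int argc, char *argv[]) {
    int me, total, sum = 0, rc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &total);

    /* Report errors to the caller instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Barrier(MPI_COMM_WORLD);
    if (me == total - 1)
        raise(SIGKILL);   /* kill the last rank, as in the report above */

    rc = MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Non-root ranks frequently see MPI_SUCCESS here: their send completed
     * before the failure was detected. Only the root, which misses one
     * contribution, reliably returns an error (MPIX_ERR_PROC_FAILED). */
    printf("rank %d: MPI_Reduce returned %s\n", me,
           rc == MPI_SUCCESS ? "MPI_SUCCESS" : "an error");

    MPI_Finalize();
    return 0;
}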

@abouteiller

abouteiller commented Feb 16, 2023

Hi, this looks like normal behavior w.r.t. the outcome of the REDUCE call. You do not have a guarantee that MPI_REDUCE will produce the same error code at all ranks by default. Multiple mitigations are possible here:

  1. Use a synchronizing call before you enter Fenix_Finalize, to guarantee that an error handler is called before you leave (see the sketch after this list).
  2. Use synchronizing/uniform versions of the collective operations. This is a new feature that has recently been added to the ULFM spec (more info can be found at https://fault-tolerance.org/2022/10/31/ulfm-specification-update-4/). We have code that supports that feature, but it has not yet made its way to the Open MPI main repo. There are performance considerations here.
  3. Fenix_Finalize should probably (optionally?) perform some synchronization of its own to avoid this problem, as it can be confusing to end users that some processes exit when there is recovery work to do.
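
For mitigation 1, a minimal sketch against the program in the report: re-enable the commented-out barrier (or, assuming a ULFM-enabled Open MPI that exposes MPIX_Comm_agree via <mpi-ext.h>, use an agreement, which is uniform across survivors) between the reduce and Fenix_Finalize, so every surviving rank observes the failure before any process can exit with a wrong sum. This shows only the tail of the reporter's main() above:

    /* ... tail of the program above, after the reduce ... */
    MPI_Reduce(&me, &sum, 1, MPI_INT, MPI_SUM, 0, new_comm);

    /* Mitigation 1: a synchronizing call before Fenix_Finalize ensures every
     * surviving rank learns about the failure, so Fenix's error handler fires
     * before any rank can finish with a stale sum. */
    MPI_Barrier(new_comm);

    /* Alternative sketch (assumes <mpi-ext.h> and a ULFM-enabled Open MPI):
     * an agreement gives a uniform outcome at all surviving ranks.
     *
     *   int flag = 1;
     *   MPIX_Comm_agree(new_comm, &flag);
     */

    if (me == 0) printf("sum:%d\n", sum);
    Fenix_Finalize();
    MPI_Finalize();
    return 0;
}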
