Commander crash when attempting first run #168

Open
walleva1715 opened this issue Sep 28, 2023 · 0 comments

@walleva1715

Hi all,

I am running Commander3 on a cluster. I compiled the current master branch with the Intel compilers. When I attempt to run one of the tutorial parameter files, the job is terminated with the error attached below.
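For context, the job was launched roughly like this; the process count, thread setting, and parameter file name below are placeholders, not my exact job script:

```sh
# Rough sketch of the launch (placeholder values, not the exact job script)
export OMP_NUM_THREADS=1               # assumption: one OpenMP thread per MPI rank
mpirun -np 64 ./commander3 param.txt   # param.txt stands in for the tutorial parameter file
```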

Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1548)..............: 
MPIDI_OFI_mpi_init_hook(1554): 
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1548)..............: 
MPIDI_OFI_mpi_init_hook(1554): 
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libpthread-2.31.s  0000152B82E37420  Unknown               Unknown  Unknown
libmpi.so.12.0.0   0000152B81233BE1  MPIR_Err_return_c     Unknown  Unknown
libmpi.so.12.0.0   0000152B813D9ED0  MPI_Init              Unknown  Unknown
libmpifort.so.12.  0000152B829D748B  PMPI_INIT             Unknown  Unknown
commander3         000000000049276A  MAIN__                     77  commander.f90
commander3         00000000004923BD  Unknown               Unknown  Unknown
libc-2.31.so       0000152B806DD083  __libc_start_main     Unknown  Unknown
commander3         00000000004922DE  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28425,1],0]
  Exit code:    174
--------------------------------------------------------------------------

The parameter file is from the BP10 branch, and the launcher I was using is mpirun (Open MPI) 4.1.5. I did not modify much, only the output and data paths. Is this an error related to MPI, or am I running out of memory on my cluster? Looking forward to any help.
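One thing I am unsure about, so treat this as a guess rather than a diagnosis: the traceback shows MPIDI_OFI_mpi_init_hook and libmpi.so.12, which look like Intel MPI symbols, while the job was launched with Open MPI's mpirun, so the crash inside MPI_Init could come from a mismatch between the MPI the binary was linked against and the MPI used to launch it. A quick sanity check, assuming the executable is named commander3 as in the traceback:

```sh
# Which MPI library is the binary actually linked against?
ldd ./commander3 | grep -i mpi

# Which mpirun is picked up first in PATH, and what does it report?
which mpirun
mpirun --version
```

If ldd reports Intel MPI libraries, launching with Intel MPI's mpiexec (or rebuilding against Open MPI) would be the first thing to try.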
