Seg Fault when trying to use new MueLu preconditioners for Enthalpy problems #862

Open
mperego opened this issue Nov 15, 2022 · 16 comments

@mperego
Collaborator

mperego commented Nov 15, 2022

I get a segmentation fault when using the new MueLu settings provided by Ray Tuminaro for the Humboldt problem.
I tried different settings; below is the error for P1semiR1transP2const:

2: ************************************************************************
2: -- Nonlinear Solver Step 0 -- 
2: ||F|| = 7.439e+02  step = 0.000e+00  dx = 0.000e+00
2: ************************************************************************
2: 
2:  Phalanx writing graphviz file for graph of FM0Jacobian (detail = 2)
2:  Process using 'dot -Tpng -O phalanxGraphFM0Jacobian
2:  ************* Phalanx Setup **************
2:  ************ Evaluation Types ************
2:    FM0Jacobian
2:    DFM0Residual
2:    FM0Residual
2:  
2:  ******************************************
2:  Phalanx writing graphviz file for graph of DFM0Jacobian (detail = 2)
2:  Process using 'dot -Tpng -O phalanxGraphDFM0Jacobian
2:  ************* Phalanx Setup **************
2:  ************ Evaluation Types ************
2:    DFM0Jacobian
2:    FM0Jacobian
2:    DFM0Residual
2:    FM0Residual
2:  
2:  ******************************************
2: --------------------------------------------------------------------------
2: Primary job  terminated normally, but 1 process returned
2: a non-zero exit code. Per user-direction, the job has been aborted.
2: --------------------------------------------------------------------------
2: --------------------------------------------------------------------------
2: mpiexec noticed that process rank 3 with PID 0 on node s1026095 exited on signal 11 (Segmentation fault).

I didn't get much info running dbg.
To reproduce the error, build branch https://github.com/sandialabs/Albany/tree/enthalpy_muelu and run the Enthalpy tests:
ctest -R Enthalpy_Humboldt_MueLu

@mperego mperego added the bug label Nov 15, 2022
@mperego mperego self-assigned this Nov 15, 2022
@jhux2
Contributor

jhux2 commented Nov 15, 2022

@mperego I have an Albany executable on Perlmutter, but that's probably not the easiest platform to debug on. Is there another machine you'd suggest I build on?

@mperego
Collaborator Author

mperego commented Nov 15, 2022

Thanks @jhux2. You could use blake. We have scripts for building Trilinos and Albany. I think you can use the gcc modules script blake_gcc_modules_submit.sh and the cmake scripts do-cmake-trilinos-gcc-serial and do-cmake-albany-serial. I got the error with the gcc compiler.
@jewatkins do you have better advice?

@jewatkins
Collaborator

blake is probably the best option right now. The gcc build is a debug build so it will run slow but it might give you more information. You can use the binary directly: /home/projects/albany/nightlyCDashAlbanyBlake/build-gcc/AlbBuildSerialGccNoWarn/src/Albany or use the trilinos install /home/projects/albany/nightlyCDashTrilinosBlake/build-gcc/TrilinosSerialInstallGccNoWarn/

@jhux2
Contributor

jhux2 commented Nov 16, 2022

I've run the Humboldt test that's on the main branch, just as a sanity check. This uses the executable that @jewatkins pointed to. Right after the stacked timer output, which I assume comes at the end of the simulation, there are a few errors. Are these to be expected?

|   Albany Fill: State Residual: 0.00712972 - 0.011611% [1]
|   |   Phalanx::SortAndOrderEvaluators: 8.958e-06 - 0.125643% [5]
|   |   Remainder: 0.00712076 - 99.8744%
|   Albany: Output to File: 0.298793 - 0.486596% [1]
|   Remainder: 0.178301 - 0.29037%

***
*** Warning! The following Teuchos::RCPNode objects were created but have
*** not been destroyed yet.  A memory checking tool may complain that these
*** objects are not destroyed correctly.

@jewatkins
Collaborator

Yes looks like it: https://sems-cdash-son.sandia.gov/cdash/test/3060119 We should probably look into why that's happening. The final result looks correct though.

@mperego
Collaborator Author

mperego commented Jan 24, 2023

@jhux2 any updates on this?

@jhux2
Contributor

jhux2 commented Jan 24, 2023

@mperego Sorry, I've not looked at this in a while. I'll pick this back up.

@jhux2
Contributor

jhux2 commented Feb 6, 2023

@mperego I updated your branch with master and am seeing the following error. Has parsing of ice_thickness changed somehow?

180: ***************************************************************
180: **  ______   __       ______   ______   __   __   __  __     **
180: ** /\  __ \ /\ \     /\  == \ /\  __ \ /\ "-.\ \ /\ \_\ \    **
180: ** \ \  __ \\ \ \____\ \  __< \ \  __ \\ \ \-.  \\ \____ \   **
180: **  \ \_\ \_\\ \_____\\ \_____\\ \_\ \_\\ \_\\"\_\\/\_____\  **
180: **   \/_/\/_/ \/_____/ \/_____/ \/_/\/_/ \/_/ \/_/ \/_____/  **
180: **                                                           **
180: ***************************************************************
180: ** Trilinos git commit id - 62bb6ac4a8e
180: ** Albany git branch ------ enthalpy_muelu
180: ** Albany git commit id --- 75e0b13ba
180: ** Albany cxx compiler ---- GNU 10.1.0
180: ** Albany FadType --------- DFad
180: ** Albany TanFadType ------ DFad
180: ** Albany HessianVecFad  -- DFad
180: ** Simulation start time -- 2023-02-06 at 14:10:52
180: ***************************************************************
180:
180: p=1: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterName' :
180:
180:  Error, the parameter {name="Required Fields",type="Array(string)",value="{ice_thickness}"}
180:  in the parameter (sub)list "Albany Parameters->Problem"
180:  was not found in the list of valid parameters!
180:
180:  The valid parameters and types are:
180:    {
180:      "Name" : string =
180:      "Number of Spatial Processors" : int = -1
180:      "Phalanx Graph Visualization Detail" : int = 0
180:      "Use Physics-Based Preconditioner" : bool = 0
180:      "Physics-Based Preconditioner" : string = None
180:      "Initial Condition" : ParameterList = ...
180:      "Initial Condition Dot" : ParameterList = ...
180:      "Initial Condition DotDot" : ParameterList = ...
180:      "Source Functions" : ParameterList = ...
180:      "Absorption" : ParameterList = ...
180:      "Response Functions" : ParameterList = ...
180:      "Parameters" : ParameterList = ...
180:      "Random Parameters" : ParameterList = ...
180:      "Linear Combination Parameters" : ParameterList = ...
180:      "LogNormal Parameter" : ParameterList = ...
180:      "Teko" : ParameterList = ...
180:      "Hessian" : ParameterList = ...
180:      "XFEM" : ParameterList = ...
180:      "Dirichlet BCs" : ParameterList = ...
180:      "Neumann BCs" : ParameterList = ...
180:      "Adaptation" : ParameterList = ...
180:      "Overwrite Nominal Values With Final Point" : bool = 0
180:      "Number Of Time Derivatives" : int = 1
180:      "Use MDField Memoization" : bool = 0
180:      "Use MDField Memoization For Parameters" : bool = 0
180:      "Ignore Residual In Jacobian" : bool = 0
180:      "Perturb Dirichlet" : double = 0
180:      "Solution Method" : string = Steady
180:      "Homotopy Restart Step" : double = 1
180:      "Second Order" : string = No
180:      "Print Response Expansion" : bool = 1
180:      "Compute Sensitivities" : bool = 1
180:      "Constitutive Model NOX Status Test" : Teuchos::RCP<NOX::StatusTest::Generic> = Teuchos::RCP<NOX::StatusTest::Generic>{ptr=0,node=0,strong_count=0,weak_count=0}
180:      "LandIce Physical Parameters" : ParameterList = ...
180:      "LandIce Enthalpy" : ParameterList = ...
180:      "LandIce Viscosity" : ParameterList = ...
180:      "Stereographic Map" : ParameterList = ...
180:      "Basal Side Name" : string =
180:      "Needs Dissipation" : bool = 1
180:      "Needs Basal Friction" : bool = 1
180:    }
180:
180:
180:  Throw number = 1
180:

@mperego
Collaborator Author

mperego commented Feb 6, 2023

@jhux2, we cleaned up the code a bit. Please remove these lines:

    Required Fields: [ice_thickness]
    Required Basal Fields: [ice_thickness]

Element Shape: Wedge
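
For anyone updating several input files at once, here is a minimal sketch of stripping obsolete keys from a YAML input, assuming each key sits on its own line. The helper name `drop_yaml_keys` and the sample fragment are hypothetical and not taken from the actual test inputs; the set of keys to drop is passed in by the caller.

```python
def drop_yaml_keys(text, keys):
    """Return text without the lines whose key (the part before ':') is in keys."""
    kept = []
    for line in text.splitlines():
        # Treat everything before the first ':' as the key, ignoring indentation.
        key = line.split(":", 1)[0].strip()
        if ":" in line and key in keys:
            continue  # obsolete parameter, skip the whole line
        kept.append(line)
    return "\n".join(kept)


# Made-up fragment for illustration only.
sample = """Problem:
  Required Fields: [ice_thickness]
  Required Basal Fields: [ice_thickness]
  Basal Side Name: basalside
"""

cleaned = drop_yaml_keys(sample, {"Required Fields", "Required Basal Fields"})
print(cleaned)
```

This works on plain text rather than a YAML parser, so comments and formatting in the rest of the file are left untouched; it would not handle values that span multiple lines.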

@jhux2
Contributor

jhux2 commented Feb 6, 2023

Thanks, @mperego. Another error, I guess masked by the first:

    Start 180: landIce_Enthalpy_Humboldt_MueLu_P1semiR1transP2const

180: Test command: /projects/sems/install/rhel7-x86_64/sems/v2/tpl/openmpi/4.0.5/gcc/10.1.0/base/e64jpaw/bin/mpiexec "-np" "4" "/scratch/jhu/fanssie/build-albany-relwithdebinfo/src/Albany" "input_enthalpy_humboldt_muelu_P1semiR1transP2const.yaml"
180: Working Directory: /scratch/jhu/fanssie/build-albany-relwithdebinfo/tests/landIce/Enthalpy
180: Test timeout computed to be: 1500
180: ***************************************************************
180: **  ______   __       ______   ______   __   __   __  __     **
180: ** /\  __ \ /\ \     /\  == \ /\  __ \ /\ "-.\ \ /\ \_\ \    **
180: ** \ \  __ \\ \ \____\ \  __< \ \  __ \\ \ \-.  \\ \____ \   **
180: **  \ \_\ \_\\ \_____\\ \_____\\ \_\ \_\\ \_\\"\_\\/\_____\  **
180: **   \/_/\/_/ \/_____/ \/_____/ \/_/\/_/ \/_/ \/_/ \/_____/  **
180: **                                                           **
180: ***************************************************************
180: ** Trilinos git commit id - 62bb6ac4a8e
180: ** Albany git branch ------ enthalpy_muelu
180: ** Albany git commit id --- 75e0b13ba
180: ** Albany cxx compiler ---- GNU 10.1.0
180: ** Albany FadType --------- DFad
180: ** Albany TanFadType ------ DFad
180: ** Albany HessianVecFad  -- DFad
180: ** Simulation start time -- 2023-02-06 at 14:31:21
180: ***************************************************************
180: Albany_IOSS: Loading STKMesh from Exodus file  ../AsciiMeshes/Humboldt/humboldt_2d.exo
180:
180: IOSS: Using decomposition method 'RIB' for 2,611 elements on 4 mpi ranks.
180:
180: p=3: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterValue' :
180:
180:  /ascldap/users/jhu/fanssie/sources/Albany/src/disc/stk/Albany_ExtrudedSTKMeshStruct.cpp:136:
180:
180:  Throw number = 1
180:
180:  Throw test that evaluated to true: basalside_elem_name != elem2d_name
180:
180:
180:  Error in ExtrudedSTKMeshStruct: Expecting topology name of elements of 2d mesh to be Quadrilateral_4 but it is Triangle_3

@mperego
Collaborator Author

mperego commented Feb 6, 2023

@jhux2 I guess you merged with master before #888 got merged. If so, you need to put back
Element Shape: Wedge

Let me know if this is not the issue

@jhux2
Contributor

jhux2 commented Feb 6, 2023

@mperego That seems to have fixed it; I'm now back to the original error you reported. Thanks.

@jhux2
Contributor

jhux2 commented Feb 7, 2023

@mperego Here's a quick update. MueLu's setup is recursing until it exhausts stack memory, and one of the processes seg faults. I'm sifting through factory dependency information at the moment to see what's going wrong.
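
One possible cause of unbounded recursion in a dependency walk is a cycle among the dependencies; the thread does not confirm that this is what happens here. As a rough illustration only, the sketch below finds a cycle in a small dependency graph with a depth-first search. The graph, the factory names, and the function `find_cycle` are all hypothetical and are not part of MueLu.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical factories: each name maps to the factories it requests data from.
deps = {
    "FactoryA": ["FactoryB"],
    "FactoryB": ["FactoryC"],
    "FactoryC": ["FactoryA"],
}
print(find_cycle(deps))
```

A walk that follows such a graph without tracking which nodes are already on the current path would recurse until the stack runs out, which is consistent with the symptom described above.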

@mperego
Collaborator Author

mperego commented Feb 7, 2023

@jhux2 thanks for looking into that! It doesn't sound fun.

@mperego
Collaborator Author

mperego commented May 18, 2023

@jhux2 are there any updates on this issue?

@mperego
Collaborator Author

mperego commented Jun 7, 2023

Hi @jhux2, there have been some changes in Albany that need to be merged into this branch. A few additional changes are needed in the input files as well. Let me know when you plan to look into this and I'll do the merge and fix the input files.
