destroy pool fails with seg fault for multi-block #788

mark-petersen · 2021-01-13T15:09:23Z

If we run with two blocks per core in RK4, the call to mpas_pool_destroy_pool fails at the end of the first time step. The error is a seg fault, with this traceback:

MPAS-Model/src/framework/mpas_field_routines.F

Line 1717 in 057dd78

deallocate(f_cursor)

MPAS-Model/src/framework/mpas_pool_routines.F

Line 261 in 51d5624

deallocate(dptr % r2 % array, stat=local_err)

MPAS-Model/src/core_sw/mpas_sw_time_integration.F

Line 349 in 51d5624

call mpas_pool_destroy_pool(provisStatePool)

This was introduced in #578.

We can trigger this error in the shallow water core, for example, by setting config_number_of_blocks = 36 while running on 18 cores.

mark-petersen · 2021-01-13T15:27:38Z

@amametjanov and @mgduda I think this problem was introduced inadvertently, as I looked through the comments in #578 and there was no discussion or testing of multiple blocks.

@mgduda if you have a little time to triage, it would help to know if this multi-block failure is fundamental to this design, or just an oversight.

mark-petersen · 2021-01-13T15:28:31Z

@philipwjones, FYI, it turns out multiple blocks per core are probably important for local time stepping. @gcapodag and I have been experimenting with load balancing for local time-stepping, and giving each core one low-res. and one high-res. block is probably the best strategy. That is how we found this bug.

mgduda · 2021-01-13T23:44:33Z

I've been able to reproduce this issue in the SW core. If I'm not mistaken, it looks like the problem is that when we destroy a pool in one block, all blocks for the fields in that pool are deallocated; then, when destroying the same pool in the next block, we attempt to redundantly deallocate blocks of the fields in that pool.

I'll give this some thought to see if there's a clean solution. The core problem is that if we have two pointers, say, A and B, to the same memory and we deallocate that memory through one of the pointers (A), we have no way to know that B points to memory that is no longer allocated.

Here's a demonstration of the issue:

program ptrfoo

    real, pointer :: a, b

    allocate(a)
    b => a

    deallocate(a)
    deallocate(b)

    stop

end program ptrfoo
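For contrast, here is a minimal sketch of the same toy program with the second deallocation guarded. It is only an illustration, not a proposed fix for the pool code: the program name ptrfoo_guarded is made up, and it assumes the code that frees the memory through a also knows about b and can reset it by hand. That is exactly the knowledge that is missing in the multi-block case. After deallocate(a), the association status of b is undefined, so associated(b) cannot be trusted until b has been explicitly nullified.

```fortran
program ptrfoo_guarded

    implicit none

    real, pointer :: a => null(), b => null()

    allocate(a)
    b => a            ! b now aliases the memory allocated through a

    deallocate(a)     ! frees the target; a becomes disassociated
    nullify(b)        ! b's status is undefined at this point, so reset it by hand

    ! Because b was explicitly nullified, this guard is well defined
    ! and the second deallocate is skipped.
    if (associated(b)) deallocate(b)

    print *, 'a associated: ', associated(a)
    print *, 'b associated: ', associated(b)

end program ptrfoo_guarded
```

Without the nullify(b) line, the associated(b) test would not be a reliable guard, which is why a check of this kind alone cannot protect the second block's destroy call.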

mark-petersen · 2021-01-14T00:06:32Z

@mgduda thanks for your input on this. I had thought the same thing, but the allocation is within a block loop:

MPAS-Model/src/core_sw/mpas_sw_time_integration.F

Lines 113 to 118 in 51d5624

      block => domain % blocklist
      do while (associated(block))
         call mpas_pool_get_subpool(block % structs, 'mesh', meshPool)
         call mpas_pool_get_subpool(block % structs, 'state', statePool)
         allocate(provisStatePool)

So we allocate once per block. Wouldn't it make sense to also deallocate once per block? Or are you saying that the second allocate for a pointer behaves like b => a? I'm used to allocate statements for allocatable arrays, not pointers.

philipwjones · 2021-01-14T15:38:44Z

@mark-petersen if we really do need this functionality, we're going to need to implement it differently than the current linked-list method. The current implementation is incompatible with what we are doing on the GPU. Even just exposing the arrays with the extra block index at the end would be preferable.

mark-petersen added bug Framework labels Jan 13, 2021

matthewhoffman added bug Framework labels Mar 17, 2021
