destroy pool fails with seg fault for multi-block #788
Comments
@amametjanov and @mgduda I think this problem was introduced inadvertently, as I looked through the comments in #578 and there was no discussion or testing of multiple blocks. @mgduda if you have a little time to triage, it would help to know if this multi-block failure is fundamental to this design, or just an oversight.
@philipwjones, FYI, it turns out multiple blocks per core are probably important for local time stepping. @gcapodag and I have been experimenting with load balancing for local time-stepping, and giving each core one low-res. and one high-res. block is probably the best strategy. That is how we found this bug.
I've been able to reproduce this issue in the SW core. If I'm not mistaken, it looks like the problem is that when we destroy a pool in one block, all blocks for the fields in that pool are deallocated; then, when destroying the same pool in the next block, we attempt to redundantly deallocate blocks of the fields in that pool. I'll give this some thought to see if there's a clean solution. The core problem is that if we have two pointers, say, A and B, to the same memory and we deallocate that memory through one of the pointers (A), we have no way to know that B points to memory that is no longer allocated. Here's a demonstration of the issue:
program ptrfoo
    implicit none
    real, pointer :: a, b
    allocate(a)      ! a owns newly allocated memory
    b => a           ! b is a second pointer to the same memory
    deallocate(a)    ! frees the memory; b's association status is now undefined
    deallocate(b)    ! invalid: attempts to free the same memory a second time
    stop
end program ptrfoo
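One possible way to avoid the double deallocation shown above is to record explicitly which pointer owns the memory, so that only the owner deallocates and every alias is simply nullified. The following is a minimal sketch under that assumption; the type `field_view`, its `is_owner` flag, and the function `release_view` are hypothetical names for illustration and are not part of the MPAS framework.

```fortran
module owned_field_demo
   implicit none

   ! Hypothetical stand-in for one block's view of a field (not an MPAS type).
   type :: field_view
      real, dimension(:), pointer :: array => null()
      logical :: is_owner = .false.   ! .true. only for the view that allocated the data
   end type field_view

contains

   ! Release a view. Only the owner deallocates; aliases are just nullified.
   ! Returns .true. if this call actually freed memory.
   function release_view(view) result(freed)
      type(field_view), intent(inout) :: view
      logical :: freed

      freed = .false.
      if (view % is_owner) then
         ! Query the association status only for the owner, whose status is defined.
         if (associated(view % array)) then
            deallocate(view % array)
            freed = .true.
         end if
      end if

      ! Safe for owner and alias alike: nullify does not inspect the target.
      nullify(view % array)
      view % is_owner = .false.
   end function release_view

end module owned_field_demo
```

With this pattern, the view that called `allocate` sets `is_owner = .true.`, any other view that points at the same array leaves the flag `.false.`, and calling `release_view` on each view in any order frees the memory exactly once. Whether an ownership flag fits the existing pool and linked-list design is a separate question.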
@mgduda thanks for your input on this. I had thought the same thing, but the allocation is within a block loop:
MPAS-Model/src/core_sw/mpas_sw_time_integration.F
Lines 113 to 118 in 51d5624
so then we allocate once per block. So wouldn't it make sense to also deallocate once per block? Or are you saying that the second allocate for a pointer behaves as b => a? I'm used to allocate statements for allocatable arrays, not pointers.
@mark-petersen if we really do need this functionality, we're going to need to implement it differently than the current linked-list method. The current implementation is incompatible with what we are doing on the GPU. Even just exposing the arrays with the extra block index at the end would be preferable.
If we run with two blocks per core in RK4, the pool destroy fails at the end of the first time step. The error is a segmentation fault, with traceback:
MPAS-Model/src/framework/mpas_field_routines.F
Line 1717 in 057dd78
MPAS-Model/src/framework/mpas_pool_routines.F
Line 261 in 51d5624
MPAS-Model/src/core_sw/mpas_sw_time_integration.F
Line 349 in 51d5624
This was introduced in #578.
We can invoke this error in the shallow water core, for example, by setting
config_number_of_blocks = 36
but running on 18 cores.