-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/IB: Fix teardown path on memory handler callback #10171
base: master
Are you sure you want to change the base?
Conversation
src/uct/ib/base/ib_md.c
Outdated
if (status != UCS_OK) { | ||
goto err; | ||
} | ||
(void)uct_ib_dereg_mr(ctx->mrs[ctx->mr_idx]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe print a warning message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -219,6 +219,7 @@ typedef struct { | |||
void *address; | |||
size_t length; | |||
size_t first_mr_size; | |||
int mr_idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needed? I would expect the thread will not exit with a partial registration: either it return UCS_OK, or returns an error and reverts all regisrtations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. It is for the case where there is at least one failing thread. it will then be used to teardown the other threads that had succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check this by the return value of the thread function, instead of adding this var to the context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have to store per-thread because it's only when they are all joined that we know if one failed and we need to tear them down, so I think we cannot.
pthread_join could take return code and set it in ctx
but seems better to have the thread update it itself.
MAP_PRIVATE | MAP_ANONYMOUS | | ||
((ptr != nullptr) ? MAP_FIXED : 0), | ||
-1, 0); | ||
return ptr != MAP_FAILED ? ptr : nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add ( )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
test/gtest/uct/ib/test_ib_md.cc
Outdated
size_t size = UCS_MBYTE; | ||
const size_t align_mask = (8 * UCS_KBYTE) - 1; | ||
size_t lower_size = ((size / 3)) & ~align_mask; | ||
size_t upper_size = (size - (size / 2)) & ~align_mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(size - (size / 2)) -> (size / 2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
test/gtest/uct/ib/test_ib_md.cc
Outdated
/* Find an available VMA */ | ||
start = mmap_anon(nullptr, size); | ||
ASSERT_NE(nullptr, start); | ||
mid = reinterpret_cast<char*>(start) + lower_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use UCS_PTR_BYTE_OFFSET?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
test/gtest/uct/ib/test_ib_md.cc
Outdated
/* coverity[pass_freed_arg] */ | ||
start = mmap_anon(start, lower_size); | ||
last = mmap_anon(last, upper_size); | ||
EXPECT_NE(nullptr, start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- move it right after mmap_anon(start, lower_size)
- maybe better just munmap between "start+lower_size" until "last"?
you assume that after we munmap the block in line 338, no other thread will try to allocate in that area
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed both
/azp run UCX PR |
Azure Pipelines successfully started running 1 pipeline(s). |
@yosefe, could you please check it? |
What
Fix error handling path on multi-threaded registration.
Why ?
Currently we see segfault in the thread because we try to teardown the first failing registration.
More generally, a thread can fail to start or fail to properly register all its tasked
mrs
.How ?
mrs
entries corresponding to started threads.mrs
all initialized.