Add multi-objective global memory search algorithm #493
Conversation
* [Memory] Add necessary types to support memory search. WIP.
* [Memory] Implement modified DP search algorithm with memory cost. Missing base solution. WIP.
* [Memory] Complete all changes to the search procedure to support multi-objective search with global memory. A search procedure refactor is in the future plan.
Merge master and memory branches and resolve conflicts
* Save some more expressive logging
* Update format
Sync the updates from the master branch
Update of memory branch
…eric/new_lambda_loop
Merge grid search of lambda parameter
@virena This PR is large enough that I'd appreciate a review from you too to make sure we don't miss anything 🙂
If possible try to make smaller PRs in the future. This took quite a while to review and that was with not really reading the super long functions. It's too easy to miss things when there's this much code.
That said, in general looking pretty good. Left a mix of requests for changes as well as some questions just so I understand why some decisions were made. Let me know when you're ready for another round of review, or if you have any questions!
Great work! Thanks for implementing this @eric-zheng . Just a few general questions:
1. Is it possible to test the changes from this PR from a performance perspective?
2. Does the PR also fix issue Error in search #497, which has forced us to run FlexFlow in data-parallel only mode for now?
   - If so, should we remove the `--only-data-parallel` flag from `multi_gpu_tests.sh`?
3. Is there any relation between this PR and issue Modularize and refactor the search algorithm #477?
Thanks @gabrieleoliaro! In terms of performance, did you mean the performance of the generated strategy or the performance of executing the search? This change will not affect the existing search procedure, because the whole flow was duplicated and separately guarded.

I don't think this will fix issue #497, as this change doesn't affect existing Unity search. Hmm, I was able to run the complicated version of the search on DLRM, etc., but I also noticed that some models had problems. This PR probably won't fix those problems.

I think we do need to work on #477 in the future, but this PR doesn't refactor the search, as I believe that needs more thoughtful discussion. I'll leave more comments in #477. Thanks again!
Overall, looks good. +1 to creating issues for experimental functions / TODOs.
```cpp
bool SearchHelper::is_invalid<GraphCostResultWithMemory>(
    GraphCostResultWithMemory const &cost) const {
  return cost.cost == std::numeric_limits<float>::infinity();
}
```
Does this need to account for exceeding global memory usage?
Shouldn't be a big problem. Can possibly be refactored in the new PR.
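For illustration, a minimal sketch of what the memory-aware check discussed above might look like. `CostWithMemory`, the `mem_usage` field, and the `mem_budget` parameter are hypothetical stand-ins, not the PR's actual types:

```cpp
#include <cassert>
#include <limits>

// Hypothetical, simplified stand-in for GraphCostResultWithMemory.
struct CostWithMemory {
  float cost;       // run-time cost
  float mem_usage;  // global memory usage
};

// Sketch: treat a result as invalid when the run-time cost is infinite
// OR when it exceeds a memory budget (the extra check asked about above).
bool is_invalid(CostWithMemory const &c, float mem_budget) {
  return c.cost == std::numeric_limits<float>::infinity() ||
         c.mem_usage > mem_budget;
}
```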
If it's not going in this PR, let's get an issue with a full explanation created
```cpp
data_parallel_view.start_device_id = 0;
for (auto const &node : best_graph->inEdges) {
  optimal_views[node.first] = data_parallel_view;
```
I agree with Colin here, refactoring this is preferable to lambda functions. What variables have to be passed around?
```cpp
bool has_valid_strategy = false;
int best_lambda_index = -1;
int binary_search_budget = 10;
```
Should `binary_search_budget` be configurable?
Yes, it would be better to make this configurable, but I remember we said 10 should be enough for now.
It would also be better to factor it out into a config class in mem_opt.h; I will likely do that in the future.
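As an aside, the budgeted binary search over the lambda parameter discussed here can be sketched as below. This is only an illustration under the assumption that higher lambda favors run time and that feasibility (fitting in memory) is monotone in lambda; `fits_in_memory` is a hypothetical callback, not the PR's actual interface:

```cpp
#include <cassert>
#include <functional>

// Budgeted binary search for the largest feasible lambda in [0, 1].
// The budget caps the number of (expensive) search invocations,
// mirroring the hard-coded binary_search_budget = 10 in the snippet.
double find_best_lambda(std::function<bool(double)> fits_in_memory,
                        int budget) {
  double lo = 0.0, hi = 1.0;
  double best = lo;  // assumed always-feasible fallback
  for (int i = 0; i < budget; ++i) {
    double mid = (lo + hi) / 2.0;
    if (fits_in_memory(mid)) {
      best = mid;  // feasible: try weighting run time more
      lo = mid;
    } else {
      hi = mid;    // infeasible: put more weight on memory
    }
  }
  return best;
}
```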
Let's get an issue created for this too, ideally with the link in a comment
src/runtime/graph.cc
```cpp
// Be optimistic
lambdas.emplace_back(std::make_pair(1.0, MemorySearchResult{}));
auto try_result = try_one_lambda(lambdas.back());
```
Adding on to the need for a refactor -- maybe `auto` can be replaced with the actual type? I'm not sure what `.first` and `.second` are supposed to be.
I have factored this lambda out. Hopefully it is clearer now.
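For readers following along, the readability concern about `.first`/`.second` can also be addressed by naming the pair type or unpacking it with C++17 structured bindings. A small sketch; `LambdaTrial` and the simplified `MemorySearchResult` are hypothetical names for illustration:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the PR's MemorySearchResult.
struct MemorySearchResult {
  float run_time = 0.0f;
};

// Naming the pair makes .first / .second self-describing at use sites.
using LambdaTrial = std::pair<double, MemorySearchResult>;

// Structured bindings (C++17) avoid .first / .second entirely.
double first_lambda(std::vector<LambdaTrial> const &lambdas) {
  auto const &[lambda, result] = lambdas.front();
  (void)result;  // result unused in this sketch
  return lambda;
}
```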
Address comments in existing open discussions
The CI multi_gpu test has the following error, and I'm not sure why it happens:
@gabrieleoliaro @lockshaw I was wondering if you have seen any similar error before? Thanks!
I have narrowed down the issue: it happens when the mapper allocates the Linear operator. It's OOM with batch size 64 on this memory branch. However, it's still unclear why the mapper allocates more memory for the same Linear operator compared to the master branch.
I finally found out why the PR has the memory allocation bug: it was due to a memory leak in which the simulator pointer was not properly deleted in my previous implementation. The simulator allocates a certain amount of GPU memory, so we didn't have enough memory left to allocate the real operators. This taught me a lesson: we should use RAII whenever possible. Frankly, I was always trying to use smart pointers except for this one time, and a bug happened... 😢
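The RAII lesson above can be illustrated with a minimal sketch. The `Simulator` here is a hypothetical stand-in whose constructor/destructor model acquiring and releasing GPU memory; holding it in `std::unique_ptr` guarantees the release runs even on early returns or exceptions, unlike a raw `new` without a matching `delete` (the bug described above):

```cpp
#include <memory>

// Tracks "live" simulators; stands in for GPU memory held.
static int g_allocated_simulators = 0;

struct Simulator {
  Simulator() { ++g_allocated_simulators; }   // stands in for cudaMalloc
  ~Simulator() { --g_allocated_simulators; }  // stands in for cudaFree
};

// Returns true when no simulator memory is left allocated afterwards.
bool search_leaves_no_leak() {
  {
    auto sim = std::make_unique<Simulator>();
    // ... search work using *sim ...
  }  // sim destroyed here automatically; GPU memory released
  return g_allocated_simulators == 0;
}
```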
All the open discussions have been either resolved or linked to an issue. All the tests passed as well. We are trying to merge this to master soon in order to update the repo-refactor branch. Colin was also glad to merge as long as the tests pass. Thanks for all the reviews!
Description of changes:
This pull request adds memory awareness to Unity's search algorithm. Specifically, it supports a multi-objective search that balances memory and run-time costs for a given device cluster.
It also merges the content of the memory branch into master.
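One common way to realize such a multi-objective search is to scalarize the two costs with a weight; a minimal sketch of that idea, assuming a simple linear combination (the PR's actual combination may differ):

```cpp
// Scalarized objective: lambda = 1.0 -> pure run-time cost,
// lambda = 0.0 -> pure memory cost. The lambda grid / binary search
// then picks a weight whose best strategy fits in device memory.
float combined_cost(float run_time, float mem_usage, float lambda) {
  return lambda * run_time + (1.0f - lambda) * mem_usage;
}
```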