
Add multi-objective global memory search algorithm #493

Merged
merged 61 commits into master on Feb 21, 2023
Conversation

@eric-zheng (Collaborator) commented Nov 23, 2022

Description of changes:
This pull request adds memory awareness to Unity's search algorithm. Specifically, it supports a multi-objective search algorithm that balances memory and run time costs given a device cluster.

This pull request also merges the content of the memory branch into master.
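Concretely, one way to picture the multi-objective balancing is as a scalarized cost, where a weight lambda trades run time against memory. This is an illustrative sketch only; the struct and field names are hypothetical, not FlexFlow's actual API.

```cpp
#include <cassert>

// Hypothetical cost of one candidate parallelization strategy.
struct StrategyCost {
  float run_time_ms; // simulated execution time
  float memory_mb;   // global memory usage across the device cluster
};

// Scalarize the two objectives: lambda = 1.0 optimizes purely for run
// time, lambda = 0.0 purely for memory, and values in between trade off.
inline float combined_cost(StrategyCost const &c, float lambda) {
  return lambda * c.run_time_ms + (1.0f - lambda) * c.memory_mb;
}
```

In practice the two terms would need normalization to comparable scales before being mixed; the sketch omits that for brevity.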

Related Issues:

Linked Issues:

Issues closed by this PR:

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the CMake files, the build configs, or the submodules?

eric-zheng and others added 30 commits May 30, 2022 13:39
* [Memory] Add necessary types to support memory search. WIP.

* [Memory] Implement modified DP search algorithm with memory cost. Missing base solution. WIP.

* [Memory] Complete all changes to the search procedure to support multi-objective search with global memory.

A search procedure refactor is in the future plan.
Merge master and memory branches and resolve conflicts
* Save some more expressive logging

* Update format
Sync the updates from the master branch
Merge grid search of lambda parameter
@lockshaw (Collaborator) commented

@virena This PR is large enough that I'd appreciate a review from you too to make sure we don't miss anything 🙂

@lockshaw (Collaborator) left a comment

If possible, try to make smaller PRs in the future. This took quite a while to review, and that was without really reading the super long functions. It's too easy to miss things when there's this much code.

That said, in general it's looking pretty good. I left a mix of requests for changes as well as some questions, just so I understand why some decisions were made. Let me know when you're ready for another round of review, or if you have any questions!

Review threads were opened on the following files:
  • src/parallel_ops/combine.cc
  • include/flexflow/model.h
  • include/flexflow/memory_optimization.h
  • CMakeLists.txt
  • cmake/cuda.cmake
  • src/runtime/substitution.cc (five threads)
@goliaro (Collaborator) left a comment

Great work! Thanks for implementing this, @eric-zheng. Just a few general questions:

  1. Is it possible to test the changes from this PR from a performance perspective?
  2. Does the PR also fix issue "Error in search" #497, which has forced us to run FlexFlow in data-parallel-only mode for now?
    2.1. If so, should we remove the --only-data-parallel flag from multi_gpu_tests.sh?
  3. Is there any relation between this PR and issue "Modularize and refactor the search algorithm" #477?

@eric-zheng (Collaborator, Author) commented

> Great work! Thanks for implementing this, @eric-zheng. Just a few general questions:
>
>   1. Is it possible to test the changes from this PR from a performance perspective?
>   2. Does the PR also fix issue "Error in search" #497, which has forced us to run FlexFlow in data-parallel-only mode for now?
>     2.1. If so, should we remove the --only-data-parallel flag from multi_gpu_tests.sh?
>   3. Is there any relation between this PR and issue "Modularize and refactor the search algorithm" #477?

Thanks, @gabrieleoliaro!

In terms of performance, did you mean the performance of the generated strategy or the performance of executing the search itself? This change will not affect the existing search procedure, because the whole new flow is duplicated and guarded by --memory-search.

I don't think this will fix issue #497, since this change doesn't affect the existing Unity search. Hmm, I think I was able to run the complicated version of the search on DLRM, etc., but I also noticed that some models had problems. This PR probably won't fix those problems.

I think we do need to work on #477 in the future, but this PR doesn't refactor the search, as I believe that needs more thoughtful discussion. I'll leave more comments in #477.

Thanks again!
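The guard described above, where the whole memory-aware flow sits behind a flag so the default search path is untouched, can be sketched as follows. This is illustrative only; FlexFlow's real argument parsing differs, and only the flag name --memory-search comes from the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true iff the hypothetical CLI arguments enable the new
// memory-aware search; otherwise the existing search runs unchanged.
bool memory_search_enabled(std::vector<std::string> const &args) {
  return std::find(args.begin(), args.end(), "--memory-search") != args.end();
}
```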

@reyna-abhyankar (Collaborator) left a comment

Overall, looks good. +1 to creating issues for experimental functions / TODOs.

Comment on lines +1295 to +1298
bool SearchHelper::is_invalid<GraphCostResultWithMemory>(
GraphCostResultWithMemory const &cost) const {
return cost.cost == std::numeric_limits<float>::infinity();
}
Collaborator:

Does this need to account for exceeding global memory usage?

Collaborator (Author):

It shouldn't be a big problem, and it can be refactored in a follow-up PR.
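One possible shape for the refactor discussed in this thread, in which the validity check also rejects strategies that exceed a global memory budget, is sketched below. The struct fields and the budget parameter are illustrative assumptions, not the actual GraphCostResultWithMemory definition.

```cpp
#include <cassert>
#include <limits>

// Illustrative stand-in for a search result carrying both objectives.
struct CostWithMemory {
  float cost;      // simulated run time
  float memory_mb; // global memory usage
};

// Invalid if the run-time cost is infinite (no feasible strategy) or if
// the strategy would not fit in the hypothetical memory budget.
inline bool is_invalid_with_budget(CostWithMemory const &c,
                                   float mem_budget_mb) {
  return c.cost == std::numeric_limits<float>::infinity() ||
         c.memory_mb > mem_budget_mb;
}
```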

Collaborator:

If it's not going in this PR, let's get an issue with a full explanation created

data_parallel_view.start_device_id = 0;
for (auto const &node : best_graph->inEdges) {
optimal_views[node.first] = data_parallel_view;

Collaborator:

I agree with Colin here; refactoring this is preferable to lambda functions. What variables have to be passed around?


bool has_valid_strategy = false;
int best_lambda_index = -1;
int binary_search_budget = 10;
@reyna-abhyankar (Collaborator) commented Feb 1, 2023

Should binary_search_budget be configurable?

Collaborator (Author):

Yes, it would be better for this to be configurable, but I remember we agreed that 10 should be enough for now.

It would be better to factor this out into a config class in mem_opt.h; I will likely do that in the future.

Collaborator:

Let's get an issue created for this too, ideally with the link in a comment
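The budgeted search over the lambda weight discussed in this thread could look roughly like the following sketch. It is a simplification under assumptions: the real code runs the DP search at each lambda, while here a generic feasibility predicate stands in for that, and feasibility is assumed monotone in lambda.

```cpp
#include <cassert>
#include <functional>

// Binary-search the largest lambda (most run-time-weighted) whose
// strategy still fits in memory, probing at most `budget` times.
inline float find_best_lambda(std::function<bool(float)> fits_in_memory,
                              int budget = 10) {
  float lo = 0.0f; // memory-optimal end, assumed feasible
  float hi = 1.0f; // run-time-optimal end
  if (fits_in_memory(hi)) {
    return hi; // be optimistic: try pure run-time optimization first
  }
  for (int i = 0; i < budget; i++) {
    float mid = 0.5f * (lo + hi);
    if (fits_in_memory(mid)) {
      lo = mid; // feasible: weight run time more heavily
    } else {
      hi = mid; // infeasible: back off toward the memory-optimal end
    }
  }
  return lo;
}
```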


// Be optimistic
lambdas.emplace_back(std::make_pair(1.0, MemorySearchResult{}));
auto try_result = try_one_lambda(lambdas.back());
Collaborator:

Adding on to the need for a refactor: maybe auto can be replaced with the actual type? I'm not sure what .first and .second are supposed to be.

Collaborator (Author):

I have factored this lambda out. Hopefully it's clearer now.
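The change requested above, replacing auto with a named type so that .first and .second are self-explanatory, might look like this sketch. The MemorySearchResult fields shown are illustrative stand-ins for the real ones.

```cpp
#include <cassert>
#include <utility>

// Illustrative stand-in for the real MemorySearchResult.
struct MemorySearchResult {
  float run_time_ms = 0.0f;
  float memory_mb = 0.0f;
};

// Naming the pair makes call sites readable: .first is the lambda
// weight, .second the search result obtained at that weight.
using LambdaTrial = std::pair<float, MemorySearchResult>;

// Factored-out helper for the optimistic first trial at lambda = 1.0.
inline LambdaTrial make_optimistic_trial() {
  return LambdaTrial{1.0f, MemorySearchResult{}};
}
```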

@lockshaw (Collaborator) left a comment

Address comments in existing open discussions

@eric-zheng (Collaborator, Author) commented

The CI multi_gpu test fails with the following error, and I'm not sure why it happens:

resnet: /__w/FlexFlow/FlexFlow/src/mapper/mapper.cc:648: virtual void FlexFlow::FFMapper::map_task(Legion::Mapping::MapperContext, const Legion::Task&, const Legion::Mapping::Mapper::MapTaskInput&, Legion::Mapping::Mapper::MapTaskOutput&): Assertion `false' failed.
./tests/cpp_gpu_tests.sh: line 56: 55817 Aborted                 (core dumped) resnet -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel
forwardAlg
Error: Process completed with exit code 134.

@gabrieleoliaro @lockshaw I was wondering if you have seen any similar error before? Thanks!

@eric-zheng (Collaborator, Author) commented

I have narrowed the issue down: it happens when the mapper allocates the Linear operator. It runs out of memory with batch size 64 on this memory branch. However, it's still unclear why the mapper allocates more memory for the same Linear operator compared to the master branch.

@eric-zheng (Collaborator, Author) commented

I finally found out why this PR has the memory allocation bug: it was due to a memory leak in which the simulator pointer was not properly deleted in my previous implementation. I think the simulator allocates a certain amount of GPU memory, so we don't have enough memory left to allocate the real operators. This actually taught me a lesson: we should try to use RAII whenever possible. Frankly, I was always trying to use smart pointers except for this one time, and a bug happened... 😢
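The RAII lesson above can be illustrated with a minimal sketch; the Simulator class here is a stand-in that counts live instances in place of actually holding GPU memory, not FlexFlow's real simulator.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a class that acquires GPU memory in its constructor and
// must release it in its destructor.
struct Simulator {
  static int live_count; // outstanding instances (proxy for held memory)
  Simulator() { live_count++; }
  ~Simulator() { live_count--; }
};
int Simulator::live_count = 0;

// Leaky version: if the delete is forgotten (or skipped by an early
// return), the simulator's memory stays allocated and later operator
// allocations can run out of memory.
void leaky_search() {
  Simulator *sim = new Simulator();
  (void)sim; // ... search runs; nobody deletes sim ...
}

// RAII version: unique_ptr destroys the simulator on every exit path.
void raii_search() {
  auto sim = std::make_unique<Simulator>();
  // ... search runs; sim is freed automatically at scope exit
}
```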

@jiazhihao jiazhihao enabled auto-merge (squash) February 19, 2023 14:17
@eric-zheng eric-zheng dismissed lockshaw’s stale review February 21, 2023 06:33

All the open discussions have been either resolved or linked to an issue, and all the tests passed as well. We are trying to merge this into master soon in order to update the repo-refactor branch. Colin was also glad to merge as long as the tests pass. Thanks for all the reviews!

@jiazhihao jiazhihao merged commit 2c4d257 into master Feb 21, 2023
Development

Successfully merging this pull request may close these issues.

Merge improved memory usage calculation back into master
5 participants