
Add multi-objective global memory search algorithm #493

Merged
merged 61 commits into master on Feb 21, 2023
Conversation

@eric-zheng (Collaborator) commented Nov 23, 2022

Description of changes:
This pull request adds memory awareness to Unity's search algorithm. Specifically, it supports a multi-objective search algorithm that balances memory and run time costs given a device cluster.

This pull request also merges the content of the memory branch into master.
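Concretely, one way to picture the multi-objective balancing is as a scalarized cost, where a weight lambda trades run time against memory. This is an illustrative sketch only; the struct and field names are hypothetical, not FlexFlow's actual API.

```cpp
#include <cassert>

// Hypothetical cost of one candidate parallelization strategy.
struct StrategyCost {
  float run_time_ms; // simulated execution time
  float memory_mb;   // global memory usage across the device cluster
};

// Scalarize the two objectives: lambda = 1.0 optimizes purely for run
// time, lambda = 0.0 purely for memory, and values in between trade off.
inline float combined_cost(StrategyCost const &c, float lambda) {
  return lambda * c.run_time_ms + (1.0f - lambda) * c.memory_mb;
}
```

In practice the two terms would need normalization to comparable scales before being mixed; the sketch omits that for brevity.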

Related Issues:

Linked Issues:

Issues closed by this PR:

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the CMake files, the build configs, or the submodules?

eric-zheng and others added 30 commits May 30, 2022 13:39
* [Memory] Add necessary types to support memory search. WIP.

* [Memory] Implement modified DP search algorithm with memory cost. Missing base solution. WIP.

* [Memory] Complete all changes to the search procedure to support multi-objective search with global memory.

A search procedure refactor is in the future plan.
Merge master and memory branches and resolve conflicts
* Save some more expressive logging

* Update format
Sync the updates from the master branch
Merge grid search of lambda parameter
@lockshaw (Collaborator) commented

@virena This PR is large enough that I'd appreciate a review from you too to make sure we don't miss anything 🙂

@lockshaw (Collaborator) left a comment

If possible, try to make smaller PRs in the future. This took quite a while to review, and that was without really reading the super long functions. It's too easy to miss things when there's this much code.

That said, in general it's looking pretty good. I left a mix of requests for changes as well as some questions, just so I understand why some decisions were made. Let me know when you're ready for another round of review, or if you have any questions!

Review threads were opened on the following files:
  • src/parallel_ops/combine.cc
  • include/flexflow/model.h
  • include/flexflow/memory_optimization.h
  • CMakeLists.txt
  • cmake/cuda.cmake
  • src/runtime/substitution.cc (five threads)
@goliaro (Collaborator) left a comment

Great work! Thanks for implementing this, @eric-zheng. Just a few general questions:

  1. Is it possible to test the changes from this PR from a performance perspective?
  2. Does the PR also fix issue "Error in search" #497, which has forced us to run FlexFlow in data-parallel-only mode for now?
    2.1. If so, should we remove the --only-data-parallel flag from multi_gpu_tests.sh?
  3. Is there any relation between this PR and issue "Modularize and refactor the search algorithm" #477?

@eric-zheng (Collaborator, Author) commented

> Great work! Thanks for implementing this, @eric-zheng. Just a few general questions:
>
>   1. Is it possible to test the changes from this PR from a performance perspective?
>   2. Does the PR also fix issue "Error in search" #497, which has forced us to run FlexFlow in data-parallel-only mode for now?
>     2.1. If so, should we remove the --only-data-parallel flag from multi_gpu_tests.sh?
>   3. Is there any relation between this PR and issue "Modularize and refactor the search algorithm" #477?

Thanks, @gabrieleoliaro!

In terms of performance, did you mean the performance of the generated strategy or the performance of executing the search itself? This change will not affect the existing search procedure, because the whole new flow is duplicated and guarded by --memory-search.

I don't think this will fix issue #497, since this change doesn't affect the existing Unity search. Hmm, I think I was able to run the complicated version of the search on DLRM, etc., but I also noticed that some models had problems. This PR probably won't fix those problems.

I think we do need to work on #477 in the future, but this PR doesn't refactor the search, as I believe that needs more thoughtful discussion. I'll leave more comments in #477.

Thanks again!
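The guard described above, where the whole memory-aware flow sits behind a flag so the default search path is untouched, can be sketched as follows. This is illustrative only; FlexFlow's real argument parsing differs, and only the flag name --memory-search comes from the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true iff the hypothetical CLI arguments enable the new
// memory-aware search; otherwise the existing search runs unchanged.
bool memory_search_enabled(std::vector<std::string> const &args) {
  return std::find(args.begin(), args.end(), "--memory-search") != args.end();
}
```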

@reyna-abhyankar (Collaborator) left a comment

Overall, looks good. +1 to creating issues for experimental functions / TODOs.

Comment on lines +1295 to +1298
bool SearchHelper::is_invalid<GraphCostResultWithMemory>(
GraphCostResultWithMemory const &cost) const {
return cost.cost == std::numeric_limits<float>::infinity();
}
Collaborator:

Does this need to account for exceeding global memory usage?

Collaborator (Author):

It shouldn't be a big problem, and it can be refactored in a follow-up PR.
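One possible shape for the refactor discussed in this thread, in which the validity check also rejects strategies that exceed a global memory budget, is sketched below. The struct fields and the budget parameter are illustrative assumptions, not the actual GraphCostResultWithMemory definition.

```cpp
#include <cassert>
#include <limits>

// Illustrative stand-in for a search result carrying both objectives.
struct CostWithMemory {
  float cost;      // simulated run time
  float memory_mb; // global memory usage
};

// Invalid if the run-time cost is infinite (no feasible strategy) or if
// the strategy would not fit in the hypothetical memory budget.
inline bool is_invalid_with_budget(CostWithMemory const &c,
                                   float mem_budget_mb) {
  return c.cost == std::numeric_limits<float>::infinity() ||
         c.memory_mb > mem_budget_mb;
}
```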

Collaborator:

If it's not going in this PR, let's get an issue with a full explanation created

data_parallel_view.start_device_id = 0;
for (auto const &node : best_graph->inEdges) {
optimal_views[node.first] = data_parallel_view;

Collaborator:

I agree with Colin here; refactoring this is preferable to lambda functions. What variables have to be passed around?


bool has_valid_strategy = false;
int best_lambda_index = -1;
int binary_search_budget = 10;
@reyna-abhyankar (Collaborator) commented Feb 1, 2023

Should binary_search_budget be configurable?

Collaborator (Author):

Yes, it would be better for this to be configurable, but I remember we agreed that 10 should be enough for now.

It would be better to factor this out into a config class in mem_opt.h; I will likely do that in the future.

Collaborator:

Let's get an issue created for this too, ideally with the link in a comment
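The budgeted search over the lambda weight discussed in this thread could look roughly like the following sketch. It is a simplification under assumptions: the real code runs the DP search at each lambda, while here a generic feasibility predicate stands in for that, and feasibility is assumed monotone in lambda.

```cpp
#include <cassert>
#include <functional>

// Binary-search the largest lambda (most run-time-weighted) whose
// strategy still fits in memory, probing at most `budget` times.
inline float find_best_lambda(std::function<bool(float)> fits_in_memory,
                              int budget = 10) {
  float lo = 0.0f; // memory-optimal end, assumed feasible
  float hi = 1.0f; // run-time-optimal end
  if (fits_in_memory(hi)) {
    return hi; // be optimistic: try pure run-time optimization first
  }
  for (int i = 0; i < budget; i++) {
    float mid = 0.5f * (lo + hi);
    if (fits_in_memory(mid)) {
      lo = mid; // feasible: weight run time more heavily
    } else {
      hi = mid; // infeasible: back off toward the memory-optimal end
    }
  }
  return lo;
}
```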


// Be optimistic
lambdas.emplace_back(std::make_pair(1.0, MemorySearchResult{}));
auto try_result = try_one_lambda(lambdas.back());
Collaborator:

Adding on to the need for a refactor: maybe auto can be replaced with the actual type? I'm not sure what .first and .second are supposed to be.

Collaborator (Author):

I have factored this lambda out. Hopefully it's clearer now.
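The change requested above, replacing auto with a named type so that .first and .second are self-explanatory, might look like this sketch. The MemorySearchResult fields shown are illustrative stand-ins for the real ones.

```cpp
#include <cassert>
#include <utility>

// Illustrative stand-in for the real MemorySearchResult.
struct MemorySearchResult {
  float run_time_ms = 0.0f;
  float memory_mb = 0.0f;
};

// Naming the pair makes call sites readable: .first is the lambda
// weight, .second the search result obtained at that weight.
using LambdaTrial = std::pair<float, MemorySearchResult>;

// Factored-out helper for the optimistic first trial at lambda = 1.0.
inline LambdaTrial make_optimistic_trial() {
  return LambdaTrial{1.0f, MemorySearchResult{}};
}
```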

@lockshaw (Collaborator) left a comment

Address comments in existing open discussions

@eric-zheng (Collaborator, Author) commented

The CI multi_gpu test fails with the following error, and I'm not sure why it happens:

resnet: /__w/FlexFlow/FlexFlow/src/mapper/mapper.cc:648: virtual void FlexFlow::FFMapper::map_task(Legion::Mapping::MapperContext, const Legion::Task&, const Legion::Mapping::Mapper::MapTaskInput&, Legion::Mapping::Mapper::MapTaskOutput&): Assertion `false' failed.
./tests/cpp_gpu_tests.sh: line 56: 55817 Aborted                 (core dumped) resnet -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel
forwardAlg
Error: Process completed with exit code 134.

@gabrieleoliaro @lockshaw I was wondering if you have seen any similar error before? Thanks!

@eric-zheng (Collaborator, Author) commented

I have narrowed the issue down: it happens when the mapper allocates the Linear operator. It runs out of memory with batch size 64 on this memory branch. However, it's still unclear why the mapper allocates more memory for the same Linear operator compared to the master branch.

@eric-zheng (Collaborator, Author) commented

I finally found out why this PR has the memory allocation bug: it was due to a memory leak in which the simulator pointer was not properly deleted in my previous implementation. I think the simulator allocates a certain amount of GPU memory, so we don't have enough memory left to allocate the real operators. This actually taught me a lesson: we should try to use RAII whenever possible. Frankly, I was always trying to use smart pointers except for this one time, and a bug happened... 😢
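The RAII lesson above can be illustrated with a minimal sketch; the Simulator class here is a stand-in that counts live instances in place of actually holding GPU memory, not FlexFlow's real simulator.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a class that acquires GPU memory in its constructor and
// must release it in its destructor.
struct Simulator {
  static int live_count; // outstanding instances (proxy for held memory)
  Simulator() { live_count++; }
  ~Simulator() { live_count--; }
};
int Simulator::live_count = 0;

// Leaky version: if the delete is forgotten (or skipped by an early
// return), the simulator's memory stays allocated and later operator
// allocations can run out of memory.
void leaky_search() {
  Simulator *sim = new Simulator();
  (void)sim; // ... search runs; nobody deletes sim ...
}

// RAII version: unique_ptr destroys the simulator on every exit path.
void raii_search() {
  auto sim = std::make_unique<Simulator>();
  // ... search runs; sim is freed automatically at scope exit
}
```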

@jiazhihao jiazhihao enabled auto-merge (squash) February 19, 2023 14:17
@eric-zheng eric-zheng dismissed lockshaw’s stale review February 21, 2023 06:33

All the open discussions have been either resolved or linked to an issue, and all the tests passed as well. We are trying to merge this into master soon in order to update the repo-refactor branch. Colin was also glad to merge as long as the tests pass. Thanks for all the reviews!

@jiazhihao jiazhihao merged commit 2c4d257 into master Feb 21, 2023
Development

Successfully merging this pull request may close these issues.

Merge improved memory usage calculation back into master
5 participants