Add multi-objective global memory search algorithm #493

Status: Merged (Feb 21, 2023), 61 commits; the diff below shows changes from 44 commits.

Commits (61)
a998154
Initial change of search procedure with memory consideration (#278)
eric-zheng May 30, 2022
ceaa344
Add line to export clang compilation database, but not enable that.
eric-zheng Jun 2, 2022
0f66a00
Merge branch 'unify' of https://github.com/eric-zheng/FlexFlow into m…
eric-zheng Jun 29, 2022
920dc12
[Memory] Save some work
eric-zheng Jul 22, 2022
dabf46e
[Memory] Allow different run time cost factor
eric-zheng Jul 26, 2022
28ef865
Merge branch 'memory' of https://github.com/eric-zheng/FlexFlow into …
eric-zheng Aug 13, 2022
ac3aa1b
Update format
eric-zheng Aug 13, 2022
5108704
Update format again
eric-zheng Aug 13, 2022
1861725
Resolve compile error due to merge conflict
eric-zheng Aug 13, 2022
140a6ac
Merge pull request #295 from eric-zheng/memory_old
eric-zheng Aug 13, 2022
37c82ba
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Aug 13, 2022
6fde3c6
Sync the changes again (#296)
eric-zheng Aug 13, 2022
7327852
Merge branch 'master' of https://github.com/eric-zheng/FlexFlow into …
eric-zheng Aug 23, 2022
add907a
Merge pull request #299 from eric-zheng/memory
eric-zheng Aug 23, 2022
825a0c9
[Memory] Correct memory cost calculation
eric-zheng Aug 24, 2022
3c39f8c
Fix the build with CUDA_TOOLKIT_ROOT_DIR
eric-zheng Aug 24, 2022
afcaa43
[Memory] Update calculation of memory cost
eric-zheng Sep 8, 2022
ca577e3
Add logs folder to gitignore
eric-zheng Sep 8, 2022
99b4d7e
Improve dot graph representation
eric-zheng Sep 14, 2022
08072ac
[Dot] Update dot graph representation
eric-zheng Sep 21, 2022
d496816
Merge pull request #377 from eric-zheng/memory
eric-zheng Oct 26, 2022
b1abaeb
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Oct 26, 2022
1820eea
Move changes
eric-zheng Oct 26, 2022
83ca529
Quick fix to avoid bert segfault
eric-zheng Nov 2, 2022
98d20fe
Merge branch 'master' of https://github.com/eric-zheng/FlexFlow into …
eric-zheng Nov 2, 2022
ca88495
Grid search of lambda
eric-zheng Nov 2, 2022
c13ba1a
Merge pull request #3 from eric-zheng/eric/new_lambda_loop
eric-zheng Nov 2, 2022
dbe249f
[WIP] Update
eric-zheng Nov 3, 2022
6455550
[Interface] Add --memory-search argument
eric-zheng Nov 16, 2022
4c4194c
[Memory] Update memory search
eric-zheng Nov 16, 2022
1e859d9
[Interface] Save -ll:fsize info
eric-zheng Nov 17, 2022
3ff36e0
[WIP] Save per-device memory change
eric-zheng Nov 17, 2022
5bbc208
Finalize per-device max memory threshold
eric-zheng Nov 23, 2022
f8250cf
Merge branch 'master' of https://github.com/eric-zheng/FlexFlow into …
eric-zheng Nov 23, 2022
2f84e30
Update format
eric-zheng Nov 23, 2022
fa1a1de
Update comments to prepare for merging
eric-zheng Nov 23, 2022
a1b34d8
Merge pull request #492 from eric-zheng/memory
eric-zheng Nov 23, 2022
1d5bf42
[WIP] Experiments to clear the caches
eric-zheng Nov 23, 2022
4ae92c1
Fixed a memory calculation bug
eric-zheng Nov 24, 2022
22c7cf9
Update minor issues
eric-zheng Nov 24, 2022
f17a50c
Merge pull request #5 from eric-zheng/mem_experiments
eric-zheng Nov 24, 2022
dc71164
Merge branch 'flexflow:memory' into memory
eric-zheng Nov 24, 2022
e68de3e
Merge pull request #494 from eric-zheng/memory
eric-zheng Nov 24, 2022
394dae6
Merge branch 'master' into memory
goliaro Dec 1, 2022
c9c7293
Merge branch 'master' into memory
lockshaw Dec 11, 2022
4bbeae7
Merge branch 'master' into memory
lockshaw Dec 12, 2022
42877ff
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Jan 4, 2023
41719ec
Merge branch 'master' into memory
jiazhihao Jan 12, 2023
cda8ad8
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Jan 23, 2023
02c9fe7
Update based on review comments
eric-zheng Jan 23, 2023
f36d10c
Remove unnecessary include
eric-zheng Jan 23, 2023
092468c
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Jan 25, 2023
bc0340d
Update based on review
eric-zheng Jan 25, 2023
a71ec09
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Feb 2, 2023
4a283bd
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Feb 8, 2023
46f9c9c
Update based on review
eric-zheng Feb 8, 2023
734bcb0
Factor out lambda helper functions
eric-zheng Feb 8, 2023
2f88cae
Fix a bug due to moving lambda function out
eric-zheng Feb 12, 2023
7813c4d
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Feb 12, 2023
7dff97c
Merge branch 'master' of https://github.com/flexflow/FlexFlow into me…
eric-zheng Feb 15, 2023
cd53288
Fix memory leak of the cached_simulator
eric-zheng Feb 15, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -170,3 +170,6 @@ docker/config
node_modules/
package.json
yarn.lock

# Logs
logs/
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -14,6 +14,9 @@ set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${CMAKE_CURRENT_LIST_DIR}/cmake)
set(FLEXFLOW_ROOT ${CMAKE_CURRENT_LIST_DIR})
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -UNDEBUG")

# Export Clang Compilation Database
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# Set a default build type if none was specified
set(default_build_type "Debug")
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
1 change: 1 addition & 0 deletions cmake/cuda.cmake
@@ -3,6 +3,7 @@ set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
set(CUDA_ROOT ${CUDA_PATH})
set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_PATH})
list(APPEND CMAKE_PREFIX_PATH ${CUDA_ROOT})
message(STATUS "CMAKE_PREFIX_PATH: ${CMAKE_PREFIX_PATH}")
find_package(CUDA REQUIRED)

if(CUDA_FOUND)
2 changes: 2 additions & 0 deletions include/flexflow/config.h
@@ -117,6 +117,7 @@ class FFConfig {
int epochs, batchSize, printFreq;
// int inputHeight, inputWidth;
int numNodes, cpusPerNode, workersPerNode;
float device_mem; // The device (GPU) memory threshold; given by -ll:fsize
float learningRate, weightDecay;
size_t workSpaceSize;
Legion::Context lg_ctx;
@@ -155,6 +156,7 @@
int base_optimize_threshold;
bool enable_control_replication;
int python_data_loader_type;
bool perform_memory_search{false};
};

class FFIterationConfig {
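The two new FFConfig fields above are driven by command-line flags: commit 6455550 adds a --memory-search argument, and commit 1e859d9 saves the -ll:fsize value (Legion's per-GPU framebuffer size, in MB). What follows is a hypothetical sketch of how the parsing might look; the function name and loop structure are assumptions, and the PR's actual parsing lives elsewhere in the runtime.

#include <cstring> // std::strcmp
#include <string>  // std::stof

// Hypothetical sketch only; not the PR's actual parser.
void parse_memory_flags(int argc, char **argv, FlexFlow::FFConfig &config) {
  for (int i = 1; i < argc; i++) {
    if (std::strcmp(argv[i], "--memory-search") == 0) {
      config.perform_memory_search = true; // enable the memory-aware DP search
    } else if (std::strcmp(argv[i], "-ll:fsize") == 0 && i + 1 < argc) {
      // Legion flag: per-GPU framebuffer size in MB; saved as the
      // per-device memory threshold used by the search.
      config.device_mem = std::stof(argv[++i]);
    }
  }
}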
60 changes: 60 additions & 0 deletions include/flexflow/graph.h
@@ -17,6 +17,7 @@
#define _FLEXFLOW_GRAPH_H_
#include "flexflow/basic_graph.h"
#include "flexflow/graph_structures.h"
#include "flexflow/memory_optimization.h"
#include "flexflow/model.h"
#include "flexflow/utils/dot/dot_file.h"
#include "flexflow/utils/recursive_logger.h"
@@ -109,6 +110,32 @@ struct GraphCostResult {
friend std::ostream &operator<<(std::ostream &, GraphCostResult const &);
};

/**
* @brief Experimental. Hold the cost information of a PCG. To be merged with
* GraphCostResult.
*/
struct GraphCostResultWithMemory {
float cost; ///< Run time cost
MemoryUsage mem_cost; ///< Memory usage
///< Corresponding machine views (device placement views)
std::unordered_map<Node, MachineView> views;

/**
* @brief Get the multi-objective cost that combines the run time and memory
* cost.
*
* @return float Numerical value to represent the overall cost
*/
float get_multi_obj_cost() const;

static GraphCostResultWithMemory invalid();

bool operator<(GraphCostResultWithMemory const &other) const;

friend std::ostream &operator<<(std::ostream &,
GraphCostResultWithMemory const &);
};
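A minimal sketch of what get_multi_obj_cost could look like, assuming the MULTI_OBJECTIVE algorithm's linear blend of the two costs. The hard-coded factor and the lack of normalization are assumptions; the real definition is in this PR's graph.cc and may differ.

// Sketch only: assumes a linear combination weighted by the
// run_time_cost_factor from MemoryOptimConfig (here hard-coded to 0.5).
float GraphCostResultWithMemory::get_multi_obj_cost() const {
  float const run_time_cost_factor = 0.5f;
  return run_time_cost_factor * this->cost +
         (1.0f - run_time_cost_factor) * this->mem_cost.num;
}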

template <typename T>
T sequence_cost(T const &first, T const &second);

@@ -195,6 +222,20 @@ class SearchHelper {
template <typename T>
void add_operator_cost(NodeAssignment const &, float, T *) const;

template <typename T>
void add_sink_node_costs(NodeAssignment const &sink,
CostMetrics metrics,
T *result) const;

/**
* @brief Add run time cost and memory cost of the operator to the graph cost.
* This is a temp workaround and should be refactored eventually.
*/
void add_operator_cost_with_memory(NodeAssignment const &node,
float node_run_time_cost,
MemoryUsage node_mem_cost,
GraphCostResultWithMemory *cost) const;

template <typename T>
float get_cost(T const &) const;

@@ -204,6 +245,8 @@
public:
mutable std::unique_ptr<RecursiveLogger> logger;

void clear_cache();

private:
template <typename T>
T execute_nonsequence_split(std::unique_ptr<Graph> const &first_graph,
@@ -255,6 +298,8 @@ class Graph {
Graph subgraph(std::unordered_set<Node> const &nodes) const;
void contract_out_node(Node const &);
float optimal_cost() const;
// Experimental. To be merged with optimal_cost().
float optimal_cost_with_memory(float const run_time_cost_factor) const;
std::unordered_map<Node, MachineView> optimal_views() const;
void remove_input_nodes();
void duplicate_input_node(Node const &);
@@ -330,6 +375,21 @@ struct GraphOptimizeResult {
friend std::ostream &operator<<(std::ostream &, GraphOptimizeResult const &);
};

/**
* @brief Experimental. Hold the optimization results with memory information.
* To be merged with GraphOptimizeResult.
*/

[Review thread on "To be merged with GraphOptimizeResult"]
Collaborator: "To be merged" when? Feel free to either include inline or (even better) link to a GitHub issue to track this.
struct GraphOptimizeResultWithMemory {
tl::optional<Graph> graph; ///< Optimized PCG
float cost; ///< Run time cost
MemoryUsage mem_cost; ///< Memory usage
///< Corresponding machine views (device placement views)
std::unordered_map<Node, MachineView> views;

friend std::ostream &operator<<(std::ostream &,
GraphOptimizeResultWithMemory const &);
};

namespace Utils {
template <>
struct GraphStructure<FlexFlow::PCG::Graph> {
166 changes: 166 additions & 0 deletions include/flexflow/memory_optimization.h
@@ -0,0 +1,166 @@
/**
* @file memory_optimization.h
* @brief Memory optimization related stuff
*
* @copyright Copyright 2022 CMU, Facebook, LANL, MIT, and Stanford
* (alphabetical)
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <algorithm> // std::max, used by MemoryUsage::operator+=
#include <cassert>
#include <string>

namespace FlexFlow {

enum class MemoryUsageType {
// Use global memory of a PCG as the measure of memory usage. No device
// mapping consideration.
GLOBAL,

// Use the max of peak per-device memory usage among devices as the measure.
// Need associated device mapping views.
PER_DEVICE_MAX,

// Use detailed per-device memory usage as the measure. Need associated device
// mapping views.
PER_DEVICE_ALL,
};

enum class MemorySearchAlgo {
// Multi-objective DP search. Combine memory cost and run time cost into
// a single cost function, with a factor to balance them.
MULTI_OBJECTIVE,
};

/**
* @brief Config class to control memory optimizations. This should be put into
* config.h and be stored in FFConfig. But for easy turnaround, put this here
* for now.
*/
class MemoryOptimConfig {
public:
MemoryUsageType mem_usage_type; ///< How to represent memory cost
MemorySearchAlgo mem_search_algo; ///< How to search for the optimal schedule
float run_time_cost_factor; ///< The weight factor of run time cost in the
///< overall cost function; used in
///< MULTI_OBJECTIVE algorithm

MemoryOptimConfig()
: mem_usage_type{MemoryUsageType::GLOBAL},
mem_search_algo{MemorySearchAlgo::MULTI_OBJECTIVE},
run_time_cost_factor{0.5} {}
MemoryOptimConfig(float factor)
: mem_usage_type{MemoryUsageType::GLOBAL},
mem_search_algo{MemorySearchAlgo::MULTI_OBJECTIVE},
run_time_cost_factor{factor} {}
};
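Commit ca88495 ("Grid search of lambda") suggests the driver sweeps run_time_cost_factor rather than fixing it. Below is a hypothetical sketch of that sweep using the optimal_cost_with_memory entry point declared in graph.h; the function name, grid range, and step size are assumptions, and how the per-lambda schedules are finally compared (e.g., against the -ll:fsize memory threshold) is part of the PR's driver and not shown here.

#include <map>

// Hypothetical sketch: evaluate the memory-aware DP search across a grid
// of lambda (run-time cost factor) values and record the combined cost.
std::map<float, float> sweep_lambda(FlexFlow::PCG::Graph const &pcg) {
  std::map<float, float> cost_per_lambda;
  for (int step = 0; step <= 10; ++step) {
    float const lambda = 0.1f * step;
    cost_per_lambda[lambda] = pcg.optimal_cost_with_memory(lambda);
  }
  return cost_per_lambda;
}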

/**
* @brief Hold the result (including memory information) of a graph_optimize on
* a PCG.
*/
class MemorySearchResult {
public:
float run_time_cost{};
float memory_cost{};
float search_time{};
///< The max of per-device memory usage among all devices
float max_per_device_mem_all_devices{0.0};
};

namespace PCG {

/**
* @brief Class to hold memory usage information of a (sub-)PCG.
*/
class MemoryUsage {
public:
MemoryUsageType usage_type; ///< What "num" means
float num; ///< The numeric value of the memory usage
[Review thread on "num"]
Collaborator: Could we get a more useful name here? Maybe usage_in_bytes or something?
Collaborator Author: Hmm, it's not necessarily in bytes, I think. I agree that a more verbose name might be better, but a simple "num" should be easy to understand as holding a floating-point number for the usage. So, it's fine to keep this as is.
Collaborator: What units is num in then, if not bytes?
Collaborator Author: Note that it's a float, not size_t. Also, changing the name needs a lot of effort and I prefer to leave it to a future PR. I can possibly change this in the next PR. Thanks.
Collaborator: Great, can you create an issue for this?
// May need this in the future, but not for now.
// std::vector<float> nums; ///< Detailed number of usage for all devices

///
/// Public APIs
///
MemoryUsage() : usage_type{MemoryUsageType::GLOBAL}, num{0.0} {}
MemoryUsage(MemoryUsageType _usage_type, float _num)
: usage_type{_usage_type}, num{_num} {}

std::string to_string() const {
std::string type_name;
switch (usage_type) {
case MemoryUsageType::GLOBAL:
type_name = "GLOBAL";
break;
case MemoryUsageType::PER_DEVICE_MAX:
type_name = "PER_DEVICE_MAX";
break;
case MemoryUsageType::PER_DEVICE_ALL:
// Not supporting detailed per-device memory usage now.
assert(false);
break;
}
return "(MemoryUsageType:" + type_name + ", Usage:" + std::to_string(num) +
")";
}

MemoryUsage &operator+=(MemoryUsage const &rhs) {
assert(usage_type == rhs.usage_type);

// Handle the merge of memory usage differently here.
switch (usage_type) {
case MemoryUsageType::GLOBAL:
num += rhs.num;
break;
case MemoryUsageType::PER_DEVICE_MAX:
num = std::max(num, rhs.num);
break;
case MemoryUsageType::PER_DEVICE_ALL:
// Not supporting detailed per-device memory usage now.
assert(false);
break;
}

return *this;
}

/**
* @brief Combine the memory usage of two PCGs flexibly based on
* MemoryUsageType.
*/
friend MemoryUsage operator+(MemoryUsage lhs, MemoryUsage const &rhs) {
lhs += rhs;
return lhs;
}

friend std::ostream &operator<<(std::ostream &s, MemoryUsage const &usage) {
s << usage.to_string();
return s;
}
};
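A small usage sketch of the operator+ semantics above: GLOBAL usages add, while PER_DEVICE_MAX usages keep the larger peak. The numeric values are illustrative only.

#include <iostream>
#include "flexflow/memory_optimization.h"

int main() {
  using FlexFlow::MemoryUsageType;
  using FlexFlow::PCG::MemoryUsage;

  // GLOBAL: whole-PCG usage sums across sub-graphs -> Usage: 10.0
  MemoryUsage global = MemoryUsage(MemoryUsageType::GLOBAL, 4.0f) +
                       MemoryUsage(MemoryUsageType::GLOBAL, 6.0f);

  // PER_DEVICE_MAX: the larger per-device peak wins -> Usage: 6.0
  MemoryUsage per_dev = MemoryUsage(MemoryUsageType::PER_DEVICE_MAX, 4.0f) +
                        MemoryUsage(MemoryUsageType::PER_DEVICE_MAX, 6.0f);

  std::cout << global << " " << per_dev << std::endl;
}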

/**
* @brief The choice of memory optimizations applied to a Graph.
*/
class MemOptDecision {
public:
private:
};

} // namespace PCG
} // namespace FlexFlow
12 changes: 12 additions & 0 deletions include/flexflow/model.h
@@ -760,6 +760,13 @@ class FFModel {
bool only_data_parallel,
std::unique_ptr<PCG::Graph> &best_graph,
std::unordered_map<PCG::Node, MachineView> &optimal_view);
void graph_optimize(size_t budget,
bool only_data_parallel,
std::unique_ptr<PCG::Graph> &best_graph,
std::unordered_map<PCG::Node, MachineView> &optimal_view,
bool perform_memory_search,
MemoryOptimConfig new_config,
MemorySearchResult &search_result);
void mcmc_optimize(std::map<Op const *, ParallelConfig> &best,
size_t budget,
float alpha,
@@ -797,6 +804,11 @@
public:
void set_iteration_config_sequence_length(int seq_length);

/**
* @brief Clear the cache of the GraphSearchHelper and SearchHelper.
*/
void clear_graph_search_cache();

public:
size_t op_global_guid, layer_global_guid;
size_t tensor_global_guid, parallel_tensor_global_guid, node_global_guid;
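A hedged call-site sketch for the memory-aware graph_optimize overload declared above. The budget value, the surrounding setup, and the assumption that `model` is an already-constructed FFModel are all illustrative; the real caller is FlexFlow's runtime.

// Hypothetical call-site sketch only.
FlexFlow::MemoryOptimConfig config(/*run_time_cost_factor=*/0.5f);
FlexFlow::MemorySearchResult result;
std::unique_ptr<FlexFlow::PCG::Graph> best_graph;
std::unordered_map<FlexFlow::PCG::Node, FlexFlow::MachineView> optimal_views;

model.graph_optimize(/*budget=*/1000,
                     /*only_data_parallel=*/false,
                     best_graph,
                     optimal_views,
                     /*perform_memory_search=*/true,
                     config,
                     result);
// result then carries the run time cost, memory cost, search time, and the
// max per-device memory across all devices for the chosen schedule.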
7 changes: 4 additions & 3 deletions include/flexflow/parallel_tensor.h
@@ -56,9 +56,10 @@ struct ParallelDim {
return false;
}

- int size = 0;
- int degree = UNKNOWN_DEGREE;
- int parallel_idx = UNKNOWN_INDEX;
+ int size = 0;                     // Actual size of tensor
+ int degree = UNKNOWN_DEGREE;      // Degree of sharding
+ int parallel_idx = UNKNOWN_INDEX; // Runtime information, unique id of each
+                                   // degree of sharding
bool is_replica_dim = false;
};

12 changes: 11 additions & 1 deletion include/flexflow/simulator.h
@@ -17,6 +17,7 @@

#include "config.h"
#include "ffconst.h"
#include "flexflow/memory_optimization.h"
[Review thread on this include]
Collaborator: Remove, as I don't see why this additional include is necessary.
Collaborator Author: We do need this include because the simulator uses things here. I tried, and removing it will cause a compilation error.
Collaborator (lockshaw, Jan 24, 2023): I can't see any changes in the diff that would require it to be there. Maybe it should be in simulator.cc?
Collaborator Author: Unfortunately, I tried again to put this include in simulator.cc, but it has compile errors. We should leave it here.
Collaborator: There are no changes, so that shouldn't be necessary. This is just increasing build times unnecessarily. If you can't figure it out, can you show me the error you're getting?
Collaborator Author: I have moved this include to model.h. It seems like model.h needs types defined in memory_optimization.h.
#include "flexflow/operator_params.h"
#include "flexflow/utils/hash_utils.h"
#include "mpark/variant.hpp"
@@ -53,10 +54,17 @@ class FFModel;
*/
struct CostMetrics {
/**
- * @brief Return the sum of the memory usage recorded in this CostMetrics.
+ * @brief Return the sum of inputs_memory, outputs_memory, and weights_memory
+ * recorded in this CostMetrics.
*/
size_t total_memory() const;

/**
* @brief Return the sum of memory recorded in this CostMetrics, in MB
* instead of bytes.
*/
float total_memory_as_mb() const;

/**
* @brief Get the incremental difference between the total memory in
* CostMetrics and sim->offset.
@@ -76,6 +84,8 @@
// 2. we call Simulator::free_all before measuring an operator
// Therefore, the current memory usage of an operator is (size_t)sim->offset
size_t inputs_memory = 0, outputs_memory = 0, weights_memory = 0;
///< Real memory usage of Op* considering parallelization over devices
size_t op_total_mem;
};

class Device {
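A hedged sketch of the two new CostMetrics helpers declared in the simulator.h hunk above. The doc comment pins total_memory to the three listed components; the byte-to-MB conversion factor (2^20) is an assumption, and the real definitions live in simulator.cc.

// Sketch only; may differ from the PR's actual definitions.
size_t CostMetrics::total_memory() const {
  // Sum of the three per-operator memory components, in bytes.
  return inputs_memory + outputs_memory + weights_memory;
}

float CostMetrics::total_memory_as_mb() const {
  // Same sum, converted from bytes to megabytes (assumed 1 MB = 2^20 bytes).
  return static_cast<float>(total_memory()) / (1024.0f * 1024.0f);
}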