[CPU] [DEBUG CAPS] Extend performance info for dynamic shapes
Measurements for the node shapeInfer and prepareParams stages were added,
including support for aggregate tables in CSV format.
Verbose output was extended with the corresponding durations
and prepareParams cache-hit info.

[dmitrygo] Fixed ShapeInfer and PrepareParams counter invocation
[dmitrygo] Added RedefineOutputMemory counter
EgorDuplensky authored and dmitry-gorokhov committed Sep 3, 2024
1 parent 6bc2742 commit f7ade08
Showing 42 changed files with 889 additions and 108 deletions.
31 changes: 31 additions & 0 deletions src/plugins/intel_cpu/docs/debug_capabilities/perf_measurements.md
@@ -0,0 +1,31 @@
# Performance measurements

Performance counters are controlled by the plugin config option **PERF_COUNT** and are forcibly enabled in [Verbose mode](verbose.md). If enabled, *total* performance counters are aggregated by default and can be retrieved via API (for example, the *-pc* option in *benchmark_app*).

## Aggregate performance counters

If performance counters are enabled, aggregation of additional performance counters is controlled by the environment variable **OV_CPU_PERF_TABLES_PATH**:
```sh
OV_CPU_PERF_TABLES_PATH=<path/prefix> binary ...
```

\<prefix\> is optional, so a trailing slash has to be added at the end of the path if no prefix is used:
```sh
OV_CPU_PERF_TABLES_PATH="/path/to/dumpDir/" binary ...
OV_CPU_PERF_TABLES_PATH="/path/to/dumpDir/table_prefix_" binary ...
```

CSV tables with aggregate performance counters are dumped upon *ExecNetwork* destruction:
* \<prefix\>perf_modelInputs.csv\
Map of model input shapes to indexes.
* \<prefix\>perf_raw_nodes.csv\
Aggregate performance counters without processing.
* \<prefix\>perf_0_\<0_model_input_shape\>\_nodes.csv ... \<prefix\>perf_N_\<N_model_input_shape\>\_nodes.csv\
Aggregate performance counters per node for corresponding model inputs.
* \<prefix\>perf_0_\<0_model_input_shape\>\_nodeTypes.csv ... \<prefix\>perf_N_\<N_model_input_shape\>\_nodeTypes.csv\
Aggregate performance counters per node type for corresponding model inputs.
* \<prefix\>perf_all_nodeTypes.csv\
Aggregate performance counters per node type for all model inputs.\
This table is absent in the case of a single model input (the static case), since it would be identical to \<prefix\>perf_0_\<0_model_input_shape\>_nodeTypes.csv

\<model_input_shape\> in the table name is limited to 100 characters, so the complete shape is additionally stated at the end of the corresponding table and in \<prefix\>perf_modelInputs.csv.
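The file-name conventions above can be sketched with a small helper that groups the dumped tables by kind. This is a minimal illustration, not part of the plugin: the helper name and the grouping scheme are assumptions, only the `perf_*.csv` name patterns come from this document.

```python
import glob
import os

def list_perf_tables(path_prefix):
    """Group the CSV tables dumped for a given OV_CPU_PERF_TABLES_PATH value.

    path_prefix is the same <path/prefix> string passed via the environment
    variable, e.g. "/path/to/dumpDir/" or "/path/to/dumpDir/table_prefix_".
    """
    tables = {"modelInputs": [], "raw_nodes": [], "nodes": [], "nodeTypes": []}
    for path in sorted(glob.glob(path_prefix + "perf_*.csv")):
        name = os.path.basename(path)
        if name.endswith("modelInputs.csv"):
            tables["modelInputs"].append(path)      # map of input shapes to indexes
        elif name.endswith("raw_nodes.csv"):
            tables["raw_nodes"].append(path)        # counters without processing
        elif name.endswith("nodeTypes.csv"):
            # per-input perf_<N>_<shape>_nodeTypes.csv and perf_all_nodeTypes.csv
            tables["nodeTypes"].append(path)
        else:
            # per-input perf_<N>_<shape>_nodes.csv
            tables["nodes"].append(path)
    return tables
```

Note that a real `path_prefix` containing glob metacharacters (for example `[` from a shape) would need `glob.escape` on the directory part; the sketch skips that for brevity.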
22 changes: 14 additions & 8 deletions src/plugins/intel_cpu/docs/debug_capabilities/verbose.md
@@ -1,6 +1,7 @@
# Verbose mode

It is possible to enable tracing of plugin node execution to cout and to collect statistics, such as:
- inference request number
- node implementer:
* cpu (CPU plugin)
* dnnl (oneDNN library)
@@ -10,18 +11,23 @@ It is possible to enable tracing execution of plugin nodes to cout and collect s
- node algorithm
- node primitive info
- input / output ports info
- fused nodes
- execution time
- fused nodes (omitted if none)
- time measurements:
  * for the static case - total
  * for the dynamic case - total, shapeInfer, prepareParams, exec
- cache hit info (omitted if none):
  * execCacheHit (if a primitives cache hit occurred)
- etc

Format:
```sh
ov_cpu_verbose,exec,<node_implementer>,\
<node_name>:<node_type>:<node_alg>,<impl_type>,\
src:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,\
dst:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,\
post_ops:'<node_name>:<node_type>:<node_alg>;...;',\
<execution_time>
ov_cpu_verbose,<inference_request_number>,exec,<node_implementer>,
<node_name>:<node_type>:<node_alg>,<impl_type>,
src:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,
dst:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,
post_ops:'<node_name>:<node_type>:<node_alg>;...;',
time:<time_measurement>:<number>ms ...,
cacheHit:<cache_hit_info> ...
```

To turn on verbose mode the following environment variable should be used:
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/compiled_model.cpp
@@ -166,6 +166,7 @@ CompiledModel::GraphGuard::Lock CompiledModel::get_graph() const {
}
const std::shared_ptr<const ov::Model> model = m_model;
graphLock._graph.CreateGraph(model, ctx);
CPU_DEBUG_CAP_ENABLE(graphLock._graph.setNestingLevel(1));
} catch (...) {
exception = std::current_exception();
}
7 changes: 7 additions & 0 deletions src/plugins/intel_cpu/src/compiled_model.h
@@ -34,6 +34,9 @@ class CompiledModel : public ov::ICompiledModel {
m_sub_compiled_models.clear();
m_sub_memory_manager->_memorys_table.clear();
}
#ifdef CPU_DEBUG_CAPS
perfDump(*this);
#endif
}

std::shared_ptr<ov::IAsyncInferRequest> create_infer_request() const override;
@@ -49,6 +52,10 @@
"Set property to Core::compile_model during compilation");
};

#ifdef CPU_DEBUG_CAPS
friend void perfDump(const CompiledModel& execNet);
#endif

private:
std::shared_ptr<ov::ISyncInferRequest> create_sync_infer_request() const override;
friend class SyncInferRequest;
5 changes: 4 additions & 1 deletion src/plugins/intel_cpu/src/config.cpp
@@ -61,8 +61,11 @@ Config::Config() {
*/
void Config::applyDebugCapsProperties() {
// always enable perf counters for verbose mode and performance summary
if (!debugCaps.verbose.empty() || !debugCaps.summaryPerf.empty())
if (!debugCaps.verbose.empty() ||
!debugCaps.summaryPerf.empty() ||
!debugCaps.perfTablesPath.empty()) {
collectPerfCounters = true;
}
}
#endif

2 changes: 2 additions & 0 deletions src/plugins/intel_cpu/src/dnnl_extension_utils.cpp
@@ -47,6 +47,8 @@ uint8_t DnnlExtensionUtils::sizeOfDataType(dnnl::memory::data_type dataType) {

dnnl::memory::data_type DnnlExtensionUtils::ElementTypeToDataType(const ov::element::Type& elementType) {
switch (elementType) {
case ov::element::i64:
return memory::data_type::s32;
case ov::element::f32:
return memory::data_type::f32;
case ov::element::i32:
18 changes: 11 additions & 7 deletions src/plugins/intel_cpu/src/graph.cpp
@@ -1124,8 +1124,10 @@ void Graph::PullOutputData(std::unordered_map<std::size_t, ov::SoPtr<ITensor>>&
}

void Graph::InferStatic(SyncInferRequest* request, int numaId) {
CPU_DEBUG_CAP_ENABLE(const PerfKey perfKey = perfGetKey(*this));

for (const auto& node : m_executableGraphNodes) {
ExecuteNodeWithCatch(node, request, numaId);
ExecuteNodeWithCatch(node, request, numaId, perfKey);
}
}

@@ -1333,9 +1335,9 @@
/* group all the profiling macros into a single one
* to avoid cluttering a core logic */
#define VERBOSE_PERF_DUMP_ITT_DEBUG_LOG(ittScope, node, config) \
VERBOSE(node, config.debugCaps.verbose); \
PERF(node, config.collectPerfCounters); \
DUMP(node, config.debugCaps, infer_count); \
VERBOSE(node, config.debugCaps.verbose, infer_count); \
PERF(node, config.collectPerfCounters, perfKey); \
DUMP(node, config.debugCaps, nestingLevel, infer_count); \
OV_ITT_SCOPED_TASK(ittScope, node->profiling.execute); \
DEBUG_LOG(*node);

Expand All @@ -1346,7 +1348,7 @@ inline void Graph::ExecuteNode(const NodePtr& node, SyncInferRequest* request, i
node->execute(m_stream, numaId);
}

inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request, int numaId) const {
inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request, int numaId, const PerfKey perfKey) const {
VERBOSE_PERF_DUMP_ITT_DEBUG_LOG(itt::domains::intel_cpu, node, getConfig());

try {
@@ -1358,14 +1360,16 @@ inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* r

template<typename UpdateStrategy>
void Graph::InferDynamic(SyncInferRequest* request, int numaId, UpdateStrategy&& update) {
CPU_DEBUG_CAP_ENABLE(const PerfKey perfKey = perfGetKey(*this));

size_t inferCounter = 0;
for (auto stopIndx : m_executableSyncNodesInds) {
update(stopIndx);

for (; inferCounter < stopIndx; ++inferCounter) {
auto& node = m_executableGraphNodes[inferCounter];

ExecuteNodeWithCatch(node, request, numaId);
ExecuteNodeWithCatch(node, request, numaId, perfKey);
}
}
}
@@ -1399,7 +1403,7 @@ void Graph::Infer(SyncInferRequest* request) {
OPENVINO_ASSERT(IsReady(), "Wrong state of the ov::intel_cpu::Graph. Topology is not ready: ", static_cast<int>(status));
}

if (infer_count != -1) infer_count++;
CPU_DEBUG_CAP_ENABLE(infer_count++);
}

void Graph::SortTopologically() {
28 changes: 20 additions & 8 deletions src/plugins/intel_cpu/src/graph.h
@@ -72,7 +72,7 @@ class Graph {
return _name;
}

std::map<std::size_t, NodePtr>& GetInputNodesMap() {
const std::map<std::size_t, NodePtr>& GetInputNodesMap() const {
return inputNodesMap;
}

@@ -176,8 +176,6 @@

std::shared_ptr<ov::Model> dump() const;

void ResetInferCount() { infer_count = 0; }

void SortTopologically();

bool hasDynamicInput() const {
@@ -200,10 +198,6 @@
}
Status status { Status::NotReady };

// For dumping purposes. -1 - no counting, all other positive
// values mean increment it within each Infer() call
int infer_count = -1;

bool reuse_io_tensors = true;

MemoryPtr memWorkspace;
@@ -235,7 +229,7 @@
* @params request Current inference request, which is checked for cancelation
* @params numaId Numa Id to be used for an execution
*/
void ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request = nullptr, int numaId = -1) const;
void ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request = nullptr, int numaId = -1, const PerfKey perfKey = 0) const;

/**
* Execute a given \p node within \p request using \p numaId
@@ -273,6 +267,24 @@
void EnforceBF16();
void insertReorder(EdgePtr& edge, bool isOptimized, std::unordered_set<std::string>& uniqueLayerNames);
void insertConvert(EdgePtr& edge);
void resolveInPlaceDirection(const NodePtr& node) const;

#ifdef CPU_DEBUG_CAPS

public:
void setNestingLevel(const uint8_t level) { nestingLevel = level; }
void ResetInferCount() { infer_count = 0; }

private:
// Main CPU plugin execution graph has level 1,
// other ones are nested graphs used for particular nodes.
uint8_t nestingLevel = 2;
int infer_count = 0;

std::map<std::vector<VectorDims>, PerfKey> perfKeysMap;
friend PerfKey perfGetKey(Graph& graph);
friend void perfDump(const CompiledModel& execNet);
#endif // CPU_DEBUG_CAPS
};

using GraphPtr = std::shared_ptr<Graph>;
12 changes: 10 additions & 2 deletions src/plugins/intel_cpu/src/node.cpp
@@ -552,8 +552,15 @@ void Node::updateShapes() {
" with name: ",
getName());
try {
if (needShapeInfer()) {
auto result = shapeInfer();
IShapeInfer::Result result = {{}, ShapeInferStatus::skip};
{
PERF_SHAPE_INFER(this);
if (needShapeInfer()) {
result = shapeInfer();
}
}
{
PERF_PREDEFINE_OUTPUT_MEMORY(this);
if (ShapeInferStatus::success == result.status) {
redefineOutputMemory(result.dims);
}
Expand All @@ -570,6 +577,7 @@ void Node::updateDynamicParams() {
" with name: ",
getName());
try {
PERF_PREPARE_PARAMS(this);
if (isExecutable()) {
if (needPrepareParams()) {
OPENVINO_ASSERT(inputShapesDefined(),
9 changes: 8 additions & 1 deletion src/plugins/intel_cpu/src/node.h
@@ -19,6 +19,8 @@
#include "openvino/core/node.hpp"
#include <nodes/common/blocked_desc_creator.h>
#include "nodes/node_config.h"
#include "cache/multi_cache.h"
#include "utils/verbose_node_helper.h"
#include <shape_inference/shape_inference_cpu.hpp>
#include "perf_count.h"
#include "utils/debug_capabilities.h"
@@ -831,7 +833,12 @@ class Node {
// Hold output scales
std::vector<float> DQScales;

CPU_DEBUG_CAP_ENABLE(friend class Verbose);
#ifdef CPU_DEBUG_CAPS
friend class Verbose;

protected:
VerboseNodeStorage _verboseStorage;
#endif
};

#ifndef CPU_DEBUG_CAPS
14 changes: 7 additions & 7 deletions src/plugins/intel_cpu/src/nodes/common/reorder_prim.cpp
@@ -42,10 +42,10 @@ bool ReorderKey::operator==(const ReorderKey& rhs) const {
return retVal;
}

dnnl::reorder getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest) {
std::pair<dnnl::reorder, CacheEntryBase::LookUpStatus> getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest) {
auto builder = [&engine](const ReorderKey& key) {
dnnl::primitive_attr attr;
DEBUG_LOG(key.src, "->", key.dest);
@@ -58,10 +58,10 @@ dnnl::reorder getReorderPrim(MultiCachePtr cache,

ReorderKey key = {src, dest};
if (cache) {
auto result = cache->getOrCreate(key, builder);
return result.first;
return cache->getOrCreate(key, builder);
}
return builder(key);

return {builder(key), CacheEntryBase::LookUpStatus::Miss};
}

} // namespace intel_cpu
4 changes: 2 additions & 2 deletions src/plugins/intel_cpu/src/nodes/common/reorder_prim.h
@@ -9,10 +9,10 @@
namespace ov {
namespace intel_cpu {

dnnl::reorder getReorderPrim(MultiCachePtr cache,
std::pair<dnnl::reorder, CacheEntryBase::LookUpStatus> getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest);

} // namespace intel_cpu
} // namespace ov
} // namespace ov
4 changes: 2 additions & 2 deletions src/plugins/intel_cpu/src/nodes/conv.cpp
@@ -207,7 +207,7 @@ class Convolution::FusedSubgraph {
}

void infer() {
_graph->ResetInferCount();
CPU_DEBUG_CAP_ENABLE(_graph->ResetInferCount());
_graph->Infer();
}

@@ -1407,7 +1407,7 @@ void Convolution::prepareParams() {
execPtr = nullptr;
auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, builder);

VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;

if (!execPtr)
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/depth_to_space.cpp
@@ -191,6 +191,7 @@ void DepthToSpace::prepareParams() {

auto cache = context->getParamsCache();
auto result = cache->getOrCreate(attrs, builder);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
if (!result.first) {
OPENVINO_THROW("DepthToSpaceExecutor was not found for node ", getName(), ".");
}
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/eltwise.cpp
@@ -2719,6 +2719,7 @@ void Eltwise::prepareParams() {

auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}

1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/extract_image_patches.cpp
@@ -405,6 +405,7 @@ void ExtractImagePatches::prepareParams() {
};
auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}

1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/fake_quantize.cpp
@@ -1484,6 +1484,7 @@ void FakeQuantize::createPrimitive() {
return std::make_shared<FakeQuantizeJitExecutor>(key.jqp);
};
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}
}
2 changes: 1 addition & 1 deletion src/plugins/intel_cpu/src/nodes/if.cpp
@@ -252,7 +252,7 @@ void If::execute(dnnl::stream strm) {

for (auto &mapper : beforeMappers)
mapper->execute(strm);
subGraph.ResetInferCount();
CPU_DEBUG_CAP_ENABLE(subGraph.ResetInferCount());
subGraph.Infer();
for (auto &mapper : afterMappers)
mapper->execute(strm);