[CPU] [DEBUG CAPS] Extend performance info for dynamic shapes
Measurements for the node shapeInfer and prepareParams stages were added,
including support for aggregate tables in CSV format.
Verbose output was extended with the corresponding durations
and prepareParams cache-hit info.

[dmitrygo] Fixed ShapeInfer and PrepareParams counter invocation
[dmitrygo] Added RedefineOutputMemory counter
EgorDuplensky authored and dmitry-gorokhov committed Sep 3, 2024
1 parent 6bc2742 commit f7ade08
Showing 42 changed files with 889 additions and 108 deletions.
31 changes: 31 additions & 0 deletions src/plugins/intel_cpu/docs/debug_capabilities/perf_measurements.md
@@ -0,0 +1,31 @@
# Performance measurements

Performance counters are controlled by the plugin config option **PERF_COUNT** and are forcibly enabled in [Verbose mode](verbose.md). If enabled, *total* performance counters are aggregated by default and can be retrieved via API (for example, the *-pc* option in *benchmark_app*).

## Aggregate performance counters

If performance counters are enabled, aggregation of additional performance counters is controlled by the environment variable **OV_CPU_PERF_TABLES_PATH**:
```sh
OV_CPU_PERF_TABLES_PATH=<path/prefix> binary ...
```

\<prefix\> is optional, so a trailing slash has to be added at the end of the path if no prefix is used:
```sh
OV_CPU_PERF_TABLES_PATH="/path/to/dumpDir/" binary ...
OV_CPU_PERF_TABLES_PATH="/path/to/dumpDir/table_prefix_" binary ...
```

CSV tables with aggregate performance counters are dumped upon *ExecNetwork* destruction:
* \<prefix\>perf_modelInputs.csv\
Map of model input shapes to indexes.
* \<prefix\>perf_raw_nodes.csv\
Aggregate performance counters without processing.
* \<prefix\>perf_0_\<0_model_input_shape\>\_nodes.csv ... \<prefix\>perf_N_\<N_model_input_shape\>\_nodes.csv\
Aggregate performance counters per node for corresponding model inputs.
* \<prefix\>perf_0_\<0_model_input_shape\>\_nodeTypes.csv ... \<prefix\>perf_N_\<N_model_input_shape\>\_nodeTypes.csv\
Aggregate performance counters per node type for corresponding model inputs.
* \<prefix\>perf_all_nodeTypes.csv\
Aggregate performance counters per node type for all model inputs.\
This table is absent in the case of a single model input (the static case), since it would be identical to \<prefix\>perf_0_\<0_model_input_shape\>_nodeTypes.csv

\<model_input_shape\> in the table name is limited to 100 characters, so the complete shape is additionally stated at the end of the corresponding table and in \<prefix\>perf_modelInputs.csv.
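The file-name conventions above can be sketched with a small helper that groups the dumped tables by kind. This is a minimal illustration, not part of the plugin: the helper name and the grouping scheme are assumptions, only the `perf_*.csv` name patterns come from this document.

```python
import glob
import os

def list_perf_tables(path_prefix):
    """Group the CSV tables dumped for a given OV_CPU_PERF_TABLES_PATH value.

    path_prefix is the same <path/prefix> string passed via the environment
    variable, e.g. "/path/to/dumpDir/" or "/path/to/dumpDir/table_prefix_".
    """
    tables = {"modelInputs": [], "raw_nodes": [], "nodes": [], "nodeTypes": []}
    for path in sorted(glob.glob(path_prefix + "perf_*.csv")):
        name = os.path.basename(path)
        if name.endswith("modelInputs.csv"):
            tables["modelInputs"].append(path)      # map of input shapes to indexes
        elif name.endswith("raw_nodes.csv"):
            tables["raw_nodes"].append(path)        # counters without processing
        elif name.endswith("nodeTypes.csv"):
            # per-input perf_<N>_<shape>_nodeTypes.csv and perf_all_nodeTypes.csv
            tables["nodeTypes"].append(path)
        else:
            # per-input perf_<N>_<shape>_nodes.csv
            tables["nodes"].append(path)
    return tables
```

Note that a real `path_prefix` containing glob metacharacters (for example `[` from a shape) would need `glob.escape` on the directory part; the sketch skips that for brevity.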
22 changes: 14 additions & 8 deletions src/plugins/intel_cpu/docs/debug_capabilities/verbose.md
@@ -1,6 +1,7 @@
# Verbose mode

It is possible to enable tracing of plugin node execution to cout and to collect statistics, such as:
- inference request number
- node implementer:
* cpu (CPU plugin)
* dnnl (oneDNN library)
@@ -10,18 +11,23 @@ It is possible to enable tracing execution of plugin nodes to cout and collect s
- node algorithm
- node primitive info
- input / output ports info
- fused nodes
- execution time
- fused nodes (omitted if none)
- time measurements:
  * for the static case - total
  * for the dynamic case - total, shapeInfer, prepareParams, exec
- cache hit info (omitted if none):
  * execCacheHit (if a primitives cache hit occurred)
- etc

Format:
```sh
ov_cpu_verbose,exec,<node_implementer>,\
<node_name>:<node_type>:<node_alg>,<impl_type>,\
src:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,\
dst:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,\
post_ops:'<node_name>:<node_type>:<node_alg>;...;',\
<execution_time>
ov_cpu_verbose,<inference_request_number>,exec,<node_implementer>,
<node_name>:<node_type>:<node_alg>,<impl_type>,
src:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,
dst:<port_id>:<precision>::<type>:<format>:f0:<shape> ...,
post_ops:'<node_name>:<node_type>:<node_alg>;...;',
time:<time_measurement>:<number>ms ...,
cacheHit:<cache_hit_info> ...
```

To turn on verbose mode the following environment variable should be used:
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/compiled_model.cpp
@@ -166,6 +166,7 @@ CompiledModel::GraphGuard::Lock CompiledModel::get_graph() const {
}
const std::shared_ptr<const ov::Model> model = m_model;
graphLock._graph.CreateGraph(model, ctx);
CPU_DEBUG_CAP_ENABLE(graphLock._graph.setNestingLevel(1));
} catch (...) {
exception = std::current_exception();
}
7 changes: 7 additions & 0 deletions src/plugins/intel_cpu/src/compiled_model.h
@@ -34,6 +34,9 @@ class CompiledModel : public ov::ICompiledModel {
m_sub_compiled_models.clear();
m_sub_memory_manager->_memorys_table.clear();
}
#ifdef CPU_DEBUG_CAPS
perfDump(*this);
#endif
}

std::shared_ptr<ov::IAsyncInferRequest> create_infer_request() const override;
@@ -49,6 +52,10 @@
"Set property to Core::compile_model during compilation");
};

#ifdef CPU_DEBUG_CAPS
friend void perfDump(const CompiledModel& execNet);
#endif

private:
std::shared_ptr<ov::ISyncInferRequest> create_sync_infer_request() const override;
friend class SyncInferRequest;
5 changes: 4 additions & 1 deletion src/plugins/intel_cpu/src/config.cpp
@@ -61,8 +61,11 @@ Config::Config() {
*/
void Config::applyDebugCapsProperties() {
// always enable perf counters for verbose mode and performance summary
if (!debugCaps.verbose.empty() || !debugCaps.summaryPerf.empty())
if (!debugCaps.verbose.empty() ||
!debugCaps.summaryPerf.empty() ||
!debugCaps.perfTablesPath.empty()) {
collectPerfCounters = true;
}
}
#endif

2 changes: 2 additions & 0 deletions src/plugins/intel_cpu/src/dnnl_extension_utils.cpp
@@ -47,6 +47,8 @@ uint8_t DnnlExtensionUtils::sizeOfDataType(dnnl::memory::data_type dataType) {

dnnl::memory::data_type DnnlExtensionUtils::ElementTypeToDataType(const ov::element::Type& elementType) {
switch (elementType) {
case ov::element::i64:
return memory::data_type::s32;
case ov::element::f32:
return memory::data_type::f32;
case ov::element::i32:
18 changes: 11 additions & 7 deletions src/plugins/intel_cpu/src/graph.cpp
@@ -1124,8 +1124,10 @@ void Graph::PullOutputData(std::unordered_map<std::size_t, ov::SoPtr<ITensor>>&
}

void Graph::InferStatic(SyncInferRequest* request, int numaId) {
CPU_DEBUG_CAP_ENABLE(const PerfKey perfKey = perfGetKey(*this));

for (const auto& node : m_executableGraphNodes) {
ExecuteNodeWithCatch(node, request, numaId);
ExecuteNodeWithCatch(node, request, numaId, perfKey);
}
}

@@ -1333,9 +1335,9 @@
/* group all the profiling macros into a single one
* to avoid cluttering a core logic */
#define VERBOSE_PERF_DUMP_ITT_DEBUG_LOG(ittScope, node, config) \
VERBOSE(node, config.debugCaps.verbose); \
PERF(node, config.collectPerfCounters); \
DUMP(node, config.debugCaps, infer_count); \
VERBOSE(node, config.debugCaps.verbose, infer_count); \
PERF(node, config.collectPerfCounters, perfKey); \
DUMP(node, config.debugCaps, nestingLevel, infer_count); \
OV_ITT_SCOPED_TASK(ittScope, node->profiling.execute); \
DEBUG_LOG(*node);

Expand All @@ -1346,7 +1348,7 @@ inline void Graph::ExecuteNode(const NodePtr& node, SyncInferRequest* request, i
node->execute(m_stream, numaId);
}

inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request, int numaId) const {
inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request, int numaId, const PerfKey perfKey) const {
VERBOSE_PERF_DUMP_ITT_DEBUG_LOG(itt::domains::intel_cpu, node, getConfig());

try {
@@ -1358,14 +1360,16 @@ inline void Graph::ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* r

template<typename UpdateStrategy>
void Graph::InferDynamic(SyncInferRequest* request, int numaId, UpdateStrategy&& update) {
CPU_DEBUG_CAP_ENABLE(const PerfKey perfKey = perfGetKey(*this));

size_t inferCounter = 0;
for (auto stopIndx : m_executableSyncNodesInds) {
update(stopIndx);

for (; inferCounter < stopIndx; ++inferCounter) {
auto& node = m_executableGraphNodes[inferCounter];

ExecuteNodeWithCatch(node, request, numaId);
ExecuteNodeWithCatch(node, request, numaId, perfKey);
}
}
}
@@ -1399,7 +1403,7 @@ void Graph::Infer(SyncInferRequest* request) {
OPENVINO_ASSERT(IsReady(), "Wrong state of the ov::intel_cpu::Graph. Topology is not ready: ", static_cast<int>(status));
}

if (infer_count != -1) infer_count++;
CPU_DEBUG_CAP_ENABLE(infer_count++);
}

void Graph::SortTopologically() {
28 changes: 20 additions & 8 deletions src/plugins/intel_cpu/src/graph.h
@@ -72,7 +72,7 @@ class Graph {
return _name;
}

std::map<std::size_t, NodePtr>& GetInputNodesMap() {
const std::map<std::size_t, NodePtr>& GetInputNodesMap() const {
return inputNodesMap;
}

@@ -176,8 +176,6 @@

std::shared_ptr<ov::Model> dump() const;

void ResetInferCount() { infer_count = 0; }

void SortTopologically();

bool hasDynamicInput() const {
@@ -200,10 +198,6 @@
}
Status status { Status::NotReady };

// For dumping purposes. -1 - no counting, all other positive
// values mean increment it within each Infer() call
int infer_count = -1;

bool reuse_io_tensors = true;

MemoryPtr memWorkspace;
@@ -235,7 +229,7 @@
* @params request Current inference request, which is checked for cancelation
* @params numaId Numa Id to be used for an execution
*/
void ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request = nullptr, int numaId = -1) const;
void ExecuteNodeWithCatch(const NodePtr& node, SyncInferRequest* request = nullptr, int numaId = -1, const PerfKey perfKey = 0) const;

/**
* Execute a given \p node within \p request using \p numaId
@@ -273,6 +267,24 @@
void EnforceBF16();
void insertReorder(EdgePtr& edge, bool isOptimized, std::unordered_set<std::string>& uniqueLayerNames);
void insertConvert(EdgePtr& edge);
void resolveInPlaceDirection(const NodePtr& node) const;

#ifdef CPU_DEBUG_CAPS

public:
void setNestingLevel(const uint8_t level) { nestingLevel = level; }
void ResetInferCount() { infer_count = 0; }

private:
// Main CPU plugin execution graph has level 1,
// other ones are nested graphs used for particular nodes.
uint8_t nestingLevel = 2;
int infer_count = 0;

std::map<std::vector<VectorDims>, PerfKey> perfKeysMap;
friend PerfKey perfGetKey(Graph& graph);
friend void perfDump(const CompiledModel& execNet);
#endif // CPU_DEBUG_CAPS
};

using GraphPtr = std::shared_ptr<Graph>;
12 changes: 10 additions & 2 deletions src/plugins/intel_cpu/src/node.cpp
@@ -552,8 +552,15 @@ void Node::updateShapes() {
" with name: ",
getName());
try {
if (needShapeInfer()) {
auto result = shapeInfer();
IShapeInfer::Result result = {{}, ShapeInferStatus::skip};
{
PERF_SHAPE_INFER(this);
if (needShapeInfer()) {
result = shapeInfer();
}
}
{
PERF_PREDEFINE_OUTPUT_MEMORY(this);
if (ShapeInferStatus::success == result.status) {
redefineOutputMemory(result.dims);
}
Expand All @@ -570,6 +577,7 @@ void Node::updateDynamicParams() {
" with name: ",
getName());
try {
PERF_PREPARE_PARAMS(this);
if (isExecutable()) {
if (needPrepareParams()) {
OPENVINO_ASSERT(inputShapesDefined(),
9 changes: 8 additions & 1 deletion src/plugins/intel_cpu/src/node.h
@@ -19,6 +19,8 @@
#include "openvino/core/node.hpp"
#include <nodes/common/blocked_desc_creator.h>
#include "nodes/node_config.h"
#include "cache/multi_cache.h"
#include "utils/verbose_node_helper.h"
#include <shape_inference/shape_inference_cpu.hpp>
#include "perf_count.h"
#include "utils/debug_capabilities.h"
@@ -831,7 +833,12 @@ class Node {
// Hold output scales
std::vector<float> DQScales;

CPU_DEBUG_CAP_ENABLE(friend class Verbose);
#ifdef CPU_DEBUG_CAPS
friend class Verbose;

protected:
VerboseNodeStorage _verboseStorage;
#endif
};

#ifndef CPU_DEBUG_CAPS
14 changes: 7 additions & 7 deletions src/plugins/intel_cpu/src/nodes/common/reorder_prim.cpp
@@ -42,10 +42,10 @@ bool ReorderKey::operator==(const ReorderKey& rhs) const {
return retVal;
}

dnnl::reorder getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest) {
std::pair<dnnl::reorder, CacheEntryBase::LookUpStatus> getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest) {
auto builder = [&engine](const ReorderKey& key) {
dnnl::primitive_attr attr;
DEBUG_LOG(key.src, "->", key.dest);
@@ -58,10 +58,10 @@ dnnl::reorder getReorderPrim(MultiCachePtr cache,

ReorderKey key = {src, dest};
if (cache) {
auto result = cache->getOrCreate(key, builder);
return result.first;
return cache->getOrCreate(key, builder);
}
return builder(key);

return {builder(key), CacheEntryBase::LookUpStatus::Miss};
}

} // namespace intel_cpu
4 changes: 2 additions & 2 deletions src/plugins/intel_cpu/src/nodes/common/reorder_prim.h
@@ -9,10 +9,10 @@
namespace ov {
namespace intel_cpu {

dnnl::reorder getReorderPrim(MultiCachePtr cache,
std::pair<dnnl::reorder, CacheEntryBase::LookUpStatus> getReorderPrim(MultiCachePtr cache,
const dnnl::engine& engine,
const dnnl::memory::desc& src,
const dnnl::memory::desc& dest);

} // namespace intel_cpu
} // namespace ov
} // namespace ov
4 changes: 2 additions & 2 deletions src/plugins/intel_cpu/src/nodes/conv.cpp
@@ -207,7 +207,7 @@ class Convolution::FusedSubgraph {
}

void infer() {
_graph->ResetInferCount();
CPU_DEBUG_CAP_ENABLE(_graph->ResetInferCount());
_graph->Infer();
}

@@ -1407,7 +1407,7 @@ void Convolution::prepareParams() {
execPtr = nullptr;
auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, builder);

VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;

if (!execPtr)
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/depth_to_space.cpp
@@ -191,6 +191,7 @@ void DepthToSpace::prepareParams() {

auto cache = context->getParamsCache();
auto result = cache->getOrCreate(attrs, builder);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
if (!result.first) {
OPENVINO_THROW("DepthToSpaceExecutor was not found for node ", getName(), ".");
}
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/eltwise.cpp
@@ -2719,6 +2719,7 @@ void Eltwise::prepareParams() {

auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}

1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/extract_image_patches.cpp
@@ -405,6 +405,7 @@ void ExtractImagePatches::prepareParams() {
};
auto cache = context->getParamsCache();
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}

1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/nodes/fake_quantize.cpp
@@ -1484,6 +1484,7 @@ void FakeQuantize::createPrimitive() {
return std::make_shared<FakeQuantizeJitExecutor>(key.jqp);
};
auto result = cache->getOrCreate(key, buildExecutor);
VERBOSE_HELPER_NODE_PREPARE_PARAMS(result.second);
execPtr = result.first;
}
}
2 changes: 1 addition & 1 deletion src/plugins/intel_cpu/src/nodes/if.cpp
@@ -252,7 +252,7 @@ void If::execute(dnnl::stream strm) {

for (auto &mapper : beforeMappers)
mapper->execute(strm);
subGraph.ResetInferCount();
CPU_DEBUG_CAP_ENABLE(subGraph.ResetInferCount());
subGraph.Infer();
for (auto &mapper : afterMappers)
mapper->execute(strm);