Perf improve for Intel MTL CPUs #19379

Closed
wants to merge 11 commits into from
46 changes: 45 additions & 1 deletion onnxruntime/core/platform/windows/env.cc
@@ -32,6 +32,9 @@
#include "core/common/span_utils.h"
#include "core/platform/env.h"
#include "core/platform/scoped_resource.h"
#if defined(_M_X64) && !defined(_M_ARM64EC)
#include "core/platform/windows/hardware_core_enumerator.h"
#endif
#include <unsupported/Eigen/CXX11/ThreadPool>
#include <wil/Resource.h>

@@ -248,12 +251,53 @@
Sleep(static_cast<DWORD>(micros) / 1000);
}

// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID)
// CPUID leaf 0 vendor ID for "GenuineIntel", stored in EBX, ECX, EDX order: "Genu", "ntel", "ineI".
static constexpr std::array<int, 3> kVendorID_Intel = {0x756e6547, 0x6c65746e, 0x49656e69};
#endif
int WindowsEnv::DefaultNumCores() {
return std::max(1, static_cast<int>(std::thread::hardware_concurrency() / 2));
}

int WindowsEnv::GetNumPhysicalCpuCores() const {
- return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID)
// The following code was added per a request from someone at Intel. It reduces the total number of threads
// by one on Intel MTL CPUs. This is a special perf optimization for a special kind of CPU. It is based on the
// assumptions that:
// 1. All Intel CPUs have 3 levels of cache. (This is not actually true.)
// 2. A CPU core that is associated with only two levels of cache is a low-performance core and should
// not be used.
// While we make assumptions about hardware configurations, we should realize that:
// 1. Not all CPUs are real. Intel has the Software Development Emulator (SDE) and Microsoft has Hyper-V.
// 2. For many reasons, OSes and hypervisors do not always give us true information about the underlying hardware.
// The code below is very hacky, but we expect it not to cause any crash. At worst it returns 1, in which case
// no thread pool is created; that is only a perf issue and does not impact usability.
// TODO: detect if CPUID instruction is available per instructions at https://wiki.osdev.org/CPUID#Checking_CPUID_availability

int regs[4];
__cpuid(regs, 0);
bool bIsIntel =
(kVendorID_Intel[0] == regs[1]) &&
(kVendorID_Intel[1] == regs[2]) &&
(kVendorID_Intel[2] == regs[3]);
if (bIsIntel && regs[0] >= 7) {
// Query Structured Extended Feature Flags Enumeration Leaf
__cpuid(regs, 0x7);
// Bit 15 of EDX indicates whether the processor is identified as a hybrid part.
bool ishybrid = (regs[3] & (1 << 15)) != 0;
if (ishybrid) {
// NOTE: even if ishybrid is true, it doesn't mean the processor must have P-cores/E-cores.
// On Intel CPUs we assume the HardwareCoreEnumerator::DefaultIntraOpNumThreads function would never fail.
// NOTE: due to resource restrictions, we cannot test this branch in our CI build pipelines.
return static_cast<int>(std::max<uint32_t>(1, HardwareCoreEnumerator::DefaultIntraOpNumThreads()));
}
}  // Not an Intel CPU with leaf 7, or not a hybrid part: fall through to the generic path.
#endif
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
}
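
The vendor check and the hybrid-bit test above can be exercised without running CPUID at all. The following is a minimal, portable sketch of that decoding step only; `AppendRegisterBytes`, `VendorString`, and `IsHybridPart` are hypothetical helper names that do not appear in the PR, and the register values are passed in as plain integers instead of coming from `__cpuid`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the PR): appends the four bytes of a
// 32-bit register value to `out`, lowest byte first.
static void AppendRegisterBytes(std::string& out, uint32_t reg) {
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<char>((reg >> (8 * i)) & 0xFF));
  }
}

// Hypothetical helper: rebuilds the 12-character vendor string from the
// CPUID leaf 0 results. The string is laid out as EBX, then EDX, then ECX.
std::string VendorString(uint32_t ebx, uint32_t ecx, uint32_t edx) {
  std::string vendor;
  AppendRegisterBytes(vendor, ebx);
  AppendRegisterBytes(vendor, edx);
  AppendRegisterBytes(vendor, ecx);
  return vendor;
}

// Hypothetical helper: tests bit 15 of the EDX value returned by CPUID
// leaf 7 (subleaf 0), which the PR treats as the hybrid flag.
bool IsHybridPart(uint32_t leaf7_edx) {
  return (leaf7_edx & (1u << 15)) != 0;
}
```

Passing the three constants from `kVendorID_Intel` as `ebx`, `ecx`, `edx` yields the familiar vendor string, which is why the array compares against `regs[1]`, `regs[2]`, `regs[3]` in that order.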

std::vector<LogicalProcessors> WindowsEnv::GetDefaultThreadAffinities() const {
89 changes: 89 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.cc
@@ -0,0 +1,89 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

// Licensed under the MIT License.

#include "core/platform/windows/hardware_core_enumerator.h"

#include <Windows.h>
#include <assert.h>

#include <memory>
#include <utility>


namespace onnxruntime {

struct LogicalProcessorInformation {
std::unique_ptr<char[]> Buffer;
size_t Length;
};

struct CoreCounter {
uint32_t PhysicalCores = 0;
uint32_t SocDieCores = 0;
};

static LogicalProcessorInformation GetLogicalProcessorInfos(LOGICAL_PROCESSOR_RELATIONSHIP relationship) {
DWORD length = 0;
DWORD rc = GetLogicalProcessorInformationEx(relationship, nullptr, &length);

assert(rc == FALSE);

auto processorInformationBytes = std::make_unique<char[]>(length);

rc = GetLogicalProcessorInformationEx(
relationship,
reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(processorInformationBytes.get()),
&length);


assert(rc == TRUE);

return {std::move(processorInformationBytes), length};

}

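// Counts the set bits in an affinity mask by clearing the lowest set bit on each iteration.
// GROUP_AFFINITY::Mask is a KAFFINITY, which is 64 bits wide on x64, so it must not be narrowed to DWORD.
static uint32_t CountSetBits(KAFFINITY input) {
uint32_t c;
for (c = 0; input; c++) {
input &= input - 1;
}
return c;
}

static CoreCounter GetNumberOPhysicalAndEngineeringCores() {
auto logicalProcessorInformation = GetLogicalProcessorInfos(RelationAll);

CoreCounter cores;
KAFFINITY dwLevel2GroupMask = 0;
KAFFINITY dwLevel3GroupMask = 0;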
size_t read = 0;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX currentProcessorInfo = NULL;

while ((read + FIELD_OFFSET(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, Processor)) <
logicalProcessorInformation.Length) {

currentProcessorInfo =
reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(logicalProcessorInformation.Buffer.get() + read);
if ((read + currentProcessorInfo->Size) > logicalProcessorInformation.Length) {
break;
}

switch (currentProcessorInfo->Relationship) {
case RelationProcessorCore:
cores.PhysicalCores++;
break;
case RelationCache:
if (currentProcessorInfo->Cache.Level == 2) {
dwLevel2GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
} else if (currentProcessorInfo->Cache.Level == 3) {
dwLevel3GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
}
break;
}

read += currentProcessorInfo->Size;
}

cores.SocDieCores = CountSetBits(dwLevel2GroupMask & ~dwLevel3GroupMask);
return cores;
}
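
The loop above walks a buffer of variable-length records, each of which states its own size. A minimal, portable sketch of that traversal pattern follows, assuming a simplified stand-in record layout: `RecordHeader`, `kCoreRecord`, `AppendRecord`, and `CountRecords` are hypothetical names, not Windows types or anything defined in the PR. Unlike the PR's loop, the sketch also rejects a record whose stated size is smaller than its header, which would otherwise stall the walk.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX:
// a fixed header followed by a payload of arbitrary length.
struct RecordHeader {
  uint32_t relationship;  // kind of record
  uint32_t size;          // total size of the record in bytes, header included
};

constexpr uint32_t kCoreRecord = 0;  // hypothetical "processor core" record kind

// Appends one record with a zero-filled payload to the buffer.
void AppendRecord(std::vector<unsigned char>& buffer, uint32_t relationship, uint32_t payload_bytes) {
  RecordHeader header;
  header.relationship = relationship;
  header.size = static_cast<uint32_t>(sizeof(RecordHeader)) + payload_bytes;
  const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&header);
  buffer.insert(buffer.end(), bytes, bytes + sizeof(header));
  buffer.resize(buffer.size() + payload_bytes);
}

// Counts the records of one kind, stopping at the first truncated or malformed record.
uint32_t CountRecords(const std::vector<unsigned char>& buffer, uint32_t relationship) {
  uint32_t count = 0;
  size_t read = 0;
  while (read + sizeof(RecordHeader) <= buffer.size()) {
    RecordHeader header;
    std::memcpy(&header, buffer.data() + read, sizeof(header));  // copy to avoid unaligned reads
    if (header.size < sizeof(RecordHeader) || read + header.size > buffer.size()) {
      break;  // record claims fewer bytes than a header, or runs past the end of the buffer
    }
    if (header.relationship == relationship) {
      ++count;
    }
    read += header.size;  // advance by the size the record reports for itself
  }
  return count;
}
```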

uint32_t HardwareCoreEnumerator::DefaultIntraOpNumThreads() {
// # of physical cores = # of P cores + # of E cores + # of SoC cores.
// # of logical cores = # of P cores x 2 (if hyper-threading is enabled) + # of E cores + # of SoC cores.
auto cores = GetNumberOPhysicalAndEngineeringCores();
// We want to use the number of physical cores, but exclude SoC cores. Guard against unsigned wrap-around
// in case the OS or a hypervisor reports more SoC-die processors than physical cores.
return cores.PhysicalCores > cores.SocDieCores ? cores.PhysicalCores - cores.SocDieCores : 0;
}
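
To make the cache-mask heuristic concrete, here is a small portable sketch of the arithmetic. It assumes 64-bit masks and a made-up topology; `CountProcessorsWithoutL3` and `DefaultThreads` are hypothetical names used only for illustration, and the clamping in `DefaultThreads` is a choice of this sketch.

```cpp
#include <algorithm>
#include <cstdint>

// Counts set bits by clearing the lowest set bit on each iteration.
uint32_t CountSetBits(uint64_t mask) {
  uint32_t count = 0;
  for (; mask != 0; ++count) {
    mask &= mask - 1;
  }
  return count;
}

// Hypothetical helper: number of logical processors that are covered by
// some L2 cache but by no L3 cache (the PR treats these as SoC-die cores).
uint32_t CountProcessorsWithoutL3(uint64_t l2_mask, uint64_t l3_mask) {
  return CountSetBits(l2_mask & ~l3_mask);
}

// Hypothetical helper: physical cores minus SoC-die cores, clamped so the
// unsigned subtraction cannot wrap around and the result is at least 1.
uint32_t DefaultThreads(uint32_t physical_cores, uint32_t soc_cores) {
  const uint32_t usable = physical_cores > soc_cores ? physical_cores - soc_cores : 0;
  return std::max<uint32_t>(1, usable);
}
```

As an illustrative (not measured) example, take 6 hyper-threaded P-cores on logical processors 0 to 11, 8 E-cores on 12 to 19, and 2 SoC cores on 20 to 21. The L2 mask is then `0x3FFFFF`, the L3 mask is `0xFFFFF`, `CountProcessorsWithoutL3` returns 2, and with 16 physical cores `DefaultThreads(16, 2)` returns 14.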

} // namespace onnxruntime
12 changes: 12 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.h
@@ -0,0 +1,12 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

// Licensed under the MIT License.

#pragma once
#include <stdint.h>

namespace onnxruntime {
struct HardwareCoreEnumerator {
HardwareCoreEnumerator() = delete;
static uint32_t DefaultIntraOpNumThreads();
};
} // namespace onnxruntime

19 changes: 14 additions & 5 deletions onnxruntime/core/util/thread_utils.cc
@@ -93,22 +93,31 @@ static std::unique_ptr<ThreadPool>
CreateThreadPoolHelper(Env* env, OrtThreadPoolParams options) {
ThreadOptions to;
if (options.thread_pool_size <= 0) { // default
- auto default_affinities = Env::Default().GetDefaultThreadAffinities();
- if (default_affinities.size() <= 1) {
- return nullptr;
- }
- options.thread_pool_size = static_cast<int>(default_affinities.size());
if (options.auto_set_affinity) {
#ifdef _WIN32
// Only set thread affinity on Server with auto affinity.
// On client best to let OS scheduler handle.
// On big (P-Core) / little (E-Core) CPU designs affinity overrides QoS and has high power usage
if (IsWindowsServer()) {
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
#else
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
#endif
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
}
if (options.thread_pool_size <= 1) {
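
The sizing decision in this hunk can be summarized as a pure function. The sketch below is an approximation under stated assumptions: `PoolInputs`, `PoolDecision`, and `DecidePool` are hypothetical names, the platform checks are passed in as booleans in place of `#ifdef _WIN32` and `IsWindowsServer()`, and because the hunk is cut off after the `<= 1` check, the sketch assumes no pool is created in that case.

```cpp
// Hypothetical inputs mirroring the values the hunk consults.
struct PoolInputs {
  int requested_size;      // options.thread_pool_size; <= 0 means "use the default"
  bool auto_set_affinity;  // options.auto_set_affinity
  bool is_windows;         // stands in for #ifdef _WIN32
  bool is_windows_server;  // stands in for IsWindowsServer()
  int affinity_count;      // size of GetDefaultThreadAffinities()
  int physical_cores;      // result of GetNumPhysicalCpuCores()
};

// Hypothetical result type.
struct PoolDecision {
  bool create_pool;  // false when no thread pool should be created
  int size;          // number of threads when create_pool is true
  bool pin_threads;  // true when default affinities are applied
};

PoolDecision DecidePool(const PoolInputs& in) {
  PoolDecision none = {false, 0, false};
  int size = in.requested_size;
  bool pin = false;
  if (size <= 0) {  // default sizing requested
    // Affinities are used on non-Windows platforms and on Windows Server only.
    const bool use_affinities = in.auto_set_affinity && (!in.is_windows || in.is_windows_server);
    if (use_affinities) {
      if (in.affinity_count <= 1) {
        return none;
      }
      size = in.affinity_count;
      pin = true;
    } else {
      // Windows client, or auto affinity disabled: leave placement to the OS scheduler.
      size = in.physical_cores;
    }
  }
  if (size <= 1) {
    return none;  // assumption: a pool of one thread or fewer is not created
  }
  PoolDecision decision = {true, size, pin};
  return decision;
}
```

On a Windows client the result depends only on `GetNumPhysicalCpuCores()`, which is where the hybrid-CPU adjustment in `env.cc` takes effect.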