Perf improvement for Intel MTL CPUs (#19524)
### Description
See the comments inside of the changed files for more detailed
information.

The files onnxruntime/core/platform/windows/hardware_core_enumerator.cc
and onnxruntime/core/platform/windows/hardware_core_enumerator.h were
copied from the WinML source folder in this repo, with minor coding style
changes.

I had an offline discussion with Sheil. We agreed that, given the lack of
a future-proof solution, we can check in this temporary fix first and
rework it later. I will meet with @ivberg to discuss the issue in depth
and look for a long-term solution. Thanks for offering to help,
@ivberg!

### Motivation and Context
With this change, we will see about 2x perf improvement on some Intel
CPUs.
snnn authored Feb 15, 2024
1 parent 775c774 commit 660f39a
Showing 5 changed files with 162 additions and 7 deletions.
46 changes: 45 additions & 1 deletion onnxruntime/core/platform/windows/env.cc
@@ -32,6 +32,9 @@ limitations under the License.
#include "core/common/span_utils.h"
#include "core/platform/env.h"
#include "core/platform/scoped_resource.h"
#if defined(_M_X64) && !defined(_M_ARM64EC) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)
#include "core/platform/windows/hardware_core_enumerator.h"
#endif
#include <unsupported/Eigen/CXX11/ThreadPool>
#include <wil/Resource.h>

@@ -248,12 +251,53 @@ void WindowsEnv::SleepForMicroseconds(int64_t micros) const {
Sleep(static_cast<DWORD>(micros) / 1000);
}

// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)
static constexpr std::array<int, 3> kVendorID_Intel = {0x756e6547, 0x6c65746e, 0x49656e69}; // "GenuntelineI"
#endif
int WindowsEnv::DefaultNumCores() {
return std::max(1, static_cast<int>(std::thread::hardware_concurrency() / 2));
}

int WindowsEnv::GetNumPhysicalCpuCores() const {
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)
// The following code is a temporary fix for a perf problem on Intel's Meteor Lake CPUs. The Intel compute platform has
// a hybrid architecture in which some CPU cores run significantly slower than the others. If we distribute our compute
// work evenly across all CPU cores, the slowest core will drag the performance down. So, instead, we reduce the total
// number of threads to exclude the slowest cores.
// The following code is based on the assumptions that:
// 1. All Intel hybrid CPUs should have 3 levels of cache.
// 2. If a CPU core is only associated with two levels of cache, it should be a low performance CPU core and should
// not be used.
// Since we don't know what the next Intel hybrid CPU will be like, we may need to rework the following code later.
// However, no matter what, the code should not cause any crash. The worst case is that it returns 1, so that
// thread pools will not be created, which is just a perf issue and does not impact usability.
// TODO: detect if CPUID instruction is available per instructions at https://wiki.osdev.org/CPUID#Checking_CPUID_availability
int regs[4];
__cpuid(regs, 0);
bool bIsIntel =
(kVendorID_Intel[0] == regs[1]) &&
(kVendorID_Intel[1] == regs[2]) &&
(kVendorID_Intel[2] == regs[3]);
if (bIsIntel && regs[0] >= 7) {
// Query Structured Extended Feature Flags Enumeration Leaf
__cpuid(regs, 0x7);
// The bit 15 of EDX indicates if the processor is identified as a hybrid part.
bool ishybrid = regs[3] & (1 << 15);
if (ishybrid) {
// NOTE: even if ishybrid is true, it doesn't mean the processor must have P-cores and E-cores.
// On Intel CPUs we assume the HardwareCoreEnumerator::DefaultIntraOpNumThreads function would never fail.
// NOTE: due to resource restrictions, we cannot test this branch in our CI build pipelines.
return std::max(static_cast<uint32_t>(1), HardwareCoreEnumerator::DefaultIntraOpNumThreads());
} else {
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
}
} else
#endif
{
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
}
}

std::vector<LogicalProcessors> WindowsEnv::GetDefaultThreadAffinities() const {
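The vendor check and the hybrid test in `GetNumPhysicalCpuCores` can be exercised without running CPUID. The sketch below is only an illustration: `AppendRegisterBytes`, `DecodeVendorString`, and `IsHybridPart` are hypothetical helper names that do not exist in the commit. It shows that the three constants in `kVendorID_Intel` are stored in EBX, ECX, EDX order, which is why the inline comment spells them "GenuntelineI", while the readable vendor string is laid out as EBX, EDX, ECX.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Same three constants as kVendorID_Intel in the diff, in EBX, ECX, EDX order.
constexpr std::array<uint32_t, 3> kVendorIdIntel = {0x756e6547, 0x6c65746e, 0x49656e69};

// Hypothetical helper: appends the four bytes of a register, lowest byte first.
inline void AppendRegisterBytes(uint32_t reg, std::string& out) {
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<char>((reg >> shift) & 0xFF));
  }
}

// Hypothetical helper: rebuilds the vendor string from CPUID leaf 0 output.
// The string is laid out as EBX, then EDX, then ECX.
inline std::string DecodeVendorString(uint32_t ebx, uint32_t ecx, uint32_t edx) {
  std::string vendor;
  AppendRegisterBytes(ebx, vendor);
  AppendRegisterBytes(edx, vendor);
  AppendRegisterBytes(ecx, vendor);
  return vendor;
}

// Hypothetical helper: tests bit 15 of EDX as returned by CPUID leaf 7,
// the flag the diff reads into `ishybrid`.
inline bool IsHybridPart(uint32_t leaf7_edx) {
  return (leaf7_edx & (1u << 15)) != 0;
}
```

Calling `DecodeVendorString(kVendorIdIntel[0], kVendorIdIntel[1], kVendorIdIntel[2])` yields the familiar "GenuineIntel" string, so comparing `regs[1]`, `regs[2]`, `regs[3]` against the array in order is equivalent to a vendor string match.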
89 changes: 89 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.cc
@@ -0,0 +1,89 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#include "hardware_core_enumerator.h"
#include <memory>
#include <Windows.h>
#include <assert.h>

namespace onnxruntime {

struct LogicalProcessorInformation {
std::unique_ptr<char[]> Buffer;
size_t Length;
};

struct CoreCounter {
uint32_t PhysicalCores = 0;
uint32_t SocDieCores = 0;
};

static LogicalProcessorInformation GetLogicalProcessorInfos(LOGICAL_PROCESSOR_RELATIONSHIP relationship) {
DWORD length = 0;
DWORD rc = GetLogicalProcessorInformationEx(relationship, nullptr, &length);

assert(rc == FALSE);

auto processorInformationBytes = std::make_unique<char[]>(length);

rc = GetLogicalProcessorInformationEx(
relationship, reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(processorInformationBytes.get()), &length);

assert(rc == TRUE);

return {std::move(processorInformationBytes), length};
}

uint32_t CountSetBits(DWORD input) {
uint32_t c;
for (c = 0; input; c++) {
input &= input - 1;
}
return c;
}

static CoreCounter GetNumberOPhysicalAndEngineeringCores() {
auto logicalProcessorInformation = GetLogicalProcessorInfos(RelationAll);

CoreCounter cores;
DWORD dwLevel2GroupMask = 0;
DWORD dwLevel3GroupMask = 0;
size_t read = 0;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX currentProcessorInfo = NULL;

while ((read + FIELD_OFFSET(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, Processor)) < logicalProcessorInformation.Length) {
currentProcessorInfo =
reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(logicalProcessorInformation.Buffer.get() + read);
if ((read + currentProcessorInfo->Size) > logicalProcessorInformation.Length) {
break;
}

switch (currentProcessorInfo->Relationship) {
case RelationProcessorCore:
cores.PhysicalCores++;
break;
case RelationCache:
if (currentProcessorInfo->Cache.Level == 2) {
dwLevel2GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
} else if (currentProcessorInfo->Cache.Level == 3) {
dwLevel3GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
}
break;
}

read += currentProcessorInfo->Size;
}

cores.SocDieCores = CountSetBits(dwLevel2GroupMask & ~dwLevel3GroupMask);
return cores;
}

uint32_t HardwareCoreEnumerator::DefaultIntraOpNumThreads() {
// # of physical cores = # of P cores + # of E Cores + # of Soc Cores.
// # of logical cores = # of P cores x 2 (if hyper threading is enabled) + # of E cores + # of Soc Cores.
auto cores = GetNumberOPhysicalAndEngineeringCores();
// We want to use the number of physical cores, but exclude soc cores
return cores.PhysicalCores - cores.SocDieCores;
}

} // namespace onnxruntime
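The cache-mask arithmetic in `GetNumberOPhysicalAndEngineeringCores` and the final subtraction in `DefaultIntraOpNumThreads` can be sketched as pure functions. This is a hedged illustration, not the commit's code: the helper names are hypothetical, and two details are my own assumptions rather than anything the commit does. First, the masks are widened to 64 bits, because `KAFFINITY` is pointer-sized on x64 while the `DWORD` accumulators in the diff hold only 32 bits. Second, the subtraction is floored at 1 so that it cannot wrap around if the mask-derived count ever exceeds the physical core count.

```cpp
#include <cstdint>

// Kernighan's bit count, the same loop as CountSetBits, over a 64-bit mask.
inline uint32_t CountSetBits64(uint64_t mask) {
  uint32_t count = 0;
  while (mask != 0) {
    mask &= mask - 1;  // clears the lowest set bit
    ++count;
  }
  return count;
}

// Hypothetical helper: counts logical processors that appear in some L2 cache
// mask but in no L3 cache mask (the "SoC die" cores of the diff).
inline uint32_t CountProcessorsWithoutL3(uint64_t level2_mask, uint64_t level3_mask) {
  return CountSetBits64(level2_mask & ~level3_mask);
}

// Hypothetical helper: physical cores minus SoC cores, floored at 1
// (the floor is an assumption added for this sketch).
inline uint32_t ThreadsExcludingSocCores(uint32_t physical_cores, uint32_t soc_cores) {
  return physical_cores > soc_cores ? physical_cores - soc_cores : 1u;
}
```

As an illustrative input (made-up numbers, not measurements), a 22-bit L2 mask `0x3FFFFF` and a 20-bit L3 mask `0x0FFFFF` leave two processors without L3; with 16 physical cores that gives 14 threads. Like the diff, this sketch ignores processor groups and treats all masks as belonging to a single group.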
12 changes: 12 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.h
@@ -0,0 +1,12 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#pragma once
#include <stdint.h>

namespace onnxruntime {
struct HardwareCoreEnumerator {
HardwareCoreEnumerator() = delete;
static uint32_t DefaultIntraOpNumThreads();
};
} // namespace onnxruntime
19 changes: 14 additions & 5 deletions onnxruntime/core/util/thread_utils.cc
@@ -93,22 +93,31 @@ static std::unique_ptr<ThreadPool>
CreateThreadPoolHelper(Env* env, OrtThreadPoolParams options) {
ThreadOptions to;
if (options.thread_pool_size <= 0) { // default
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
if (options.auto_set_affinity) {
#ifdef _WIN32
// Only set thread affinity on Server with auto affinity.
// On client best to let OS scheduler handle.
// On big (P-Core) / little (E-Core) CPU designs affinity overrides QoS and has high power usage
if (IsWindowsServer()) {
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
#else
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
#endif
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
}
if (options.thread_pool_size <= 1) {
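Because the +/- markers are lost in this view, the resulting default sizing logic in `CreateThreadPoolHelper` is hard to read. The sketch below restates my reading of the post-change behavior as a pure function. `PoolSizingInputs`, `PoolSizingDecision`, and `DecideDefaultPoolSize` are hypothetical names, and the boolean fields stand in for `_WIN32`, `IsWindowsServer()`, and the two `Env` queries.

```cpp
// Hypothetical inputs standing in for the values the real code queries.
struct PoolSizingInputs {
  bool auto_set_affinity;      // options.auto_set_affinity
  bool is_windows;             // stands in for #ifdef _WIN32
  bool is_windows_server;      // stands in for IsWindowsServer()
  int num_default_affinities;  // GetDefaultThreadAffinities().size()
  int num_physical_cores;      // GetNumPhysicalCpuCores()
};

struct PoolSizingDecision {
  int thread_pool_size;  // 0 stands in for the early `return nullptr`
  bool pin_threads;      // true when default affinities are applied
};

// Affinities are applied only when auto affinity is requested and the platform
// is either non-Windows or Windows Server; every other case falls back to the
// physical core count and leaves scheduling to the OS.
inline PoolSizingDecision DecideDefaultPoolSize(const PoolSizingInputs& in) {
  const bool use_affinities =
      in.auto_set_affinity && (!in.is_windows || in.is_windows_server);
  if (use_affinities) {
    if (in.num_default_affinities <= 1) {
      return {0, false};
    }
    return {in.num_default_affinities, true};
  }
  return {in.num_physical_cores, false};
}
```

On a Windows client with auto affinity enabled, this path now returns the value of `GetNumPhysicalCpuCores()`, which is where the hybrid-CPU adjustment from env.cc takes effect.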
3 changes: 2 additions & 1 deletion tools/ci_build/build.py
@@ -1526,7 +1526,8 @@ def generate_build_tree(
ldflags = ["/profile", "/DYNAMICBASE"]
# Address Sanitizer libs do not have a Qspectre version, so the two cannot both be enabled.
if not args.enable_address_sanitizer:
cflags += ["/Qspectre"]
# Also enable a special perf patch that was made for Intel Meteor Lake mobile CPUs
cflags += ["/Qspectre", "/DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH"]
if config == "Release":
cflags += ["/O2", "/Ob2", "/DNDEBUG"]
elif config == "RelWithDebInfo":