Roadmap

TensorFlow-DirectML Roadmap - December 2020

Release Cadence and DirectML NuGet

Changes to TensorFlow-DirectML and DirectML often go hand in hand, since we may need new APIs or optimizations in DirectML to support training ops in TensorFlow. While TensorFlow-DirectML is still under development, one of our core requirements for releasing a new version to PyPI is that the underlying build of DirectML is stable. DirectML, like any other Windows component, comes with a fundamental guarantee that its APIs will not have breaking changes. This compatibility guarantee is obviously a good thing, but it means we have to be extra cautious when introducing a new API to DirectML, since it's effectively permanent. This is why we require extra steps, like enabling developer mode, when running builds of TensorFlow-DirectML that come from an under-development branch.

One of the recent improvements we've made to ease this burden is the introduction of a unified and broadly usable redistributable NuGet package, Microsoft.AI.DirectML. We'll be releasing this package more regularly, which means we can also release PyPI packages of TensorFlow-DirectML more frequently. We don't have a release cadence that's set in stone quite yet, but we expect to share new builds of TensorFlow-DirectML every few months.
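
For anyone tracking these releases: the PyPI package referenced throughout this page is tensorflow-directml, so picking up a new build is a standard pip install:

```
pip install tensorflow-directml
```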

Operator Coverage

The majority of the work in supporting DirectML in TensorFlow is implementing ops like add and conv2d. There are over 1100 of these primitive ops in TF 1.15, and each one requires code to run on a TF device like DirectML. The existing GPU device implements just under 600 ops, and our goal with DirectML is to have similar coverage. For more background on how this work relates to the overall DirectML backend, see the DirectML Backend Overview wiki page.
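
As a quick sanity check that the DirectML device is registered at all, TF 1.15's device_lib module can enumerate local devices. A minimal sketch (DirectML devices typically show up with the "DML" device type, though the exact name can vary by build):

```python
# Enumerate the devices that TF 1.15 can place ops on. With the
# tensorflow-directml package installed, a DirectML device (typically
# reported with the "DML" device type) should appear alongside the CPU.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
```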

Gaps in op coverage ultimately result in poor performance, since TF will (in most cases) fall back to executing an unimplemented op on a lower-priority device like the CPU. This can stall the GPU pipeline and force extra copies between system memory and dedicated hardware memory; depending on how often this occurs, it can actually be slower to run on hardware than to simply keep everything on the CPU. It's absolutely critical to avoid bouncing between CPU and DirectML devices, so this work tends to take priority over almost all other features. As of today, the DirectML backend implements ~420 operators.
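
One way to see these fallbacks in practice is TF 1.15's log_device_placement option, which prints the device chosen for every op, so anything falling back to the CPU stands out in the log. A minimal sketch:

```python
# Log where each op is placed; ops without a DirectML kernel will show up
# as placed on the CPU device in the output.
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    print(sess.run(tf.matmul(a, b)))
```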

You can find an overview of the specific ops we're planning to implement next in the Operator Roadmap wiki page. We try to focus on the ops that are frequently used, critical for perf, and have a reasonable mapping to DirectML APIs.

Performance Optimizations

Performance is central to the value of a hardware backend like DirectML! Some models still need more ops implemented before end-to-end performance is meaningful, but coverage is now good enough to start looking at perf on well-covered models and their ops.

Here are some of the things we're looking at next:

  • DirectML's current shaders are mostly optimized for inference (with NCHW tensor layouts); however, we're improving DirectML's handling of the layouts more common in TF and training (see the layout sketch after this list). Some of these changes will be included in DirectML 1.5.0, which is undergoing testing right now.
  • We're working closely with hardware partners to ensure they can tune their drivers to work optimally with common models.
  • We've noticed that "emulating" certain kernels (i.e. a CPU implementation that explicitly copies between host and device memory) can improve overall device placement and ensure DirectML kernels are selected more often. Even though the emulated kernel still involves memory transfers, we've seen decent (5-10%) improvements in some cases.
  • Internally, we've been building up additional infrastructure for automated performance tracking on popular models and benchmarks. This infrastructure will guide additional optimization work in the coming months.
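
To make the layout point above concrete, here's a small TF 1.15 sketch contrasting the two conventions. This only builds the graph; actually executing the NCHW path generally requires a device that supports channels-first convolutions (stock CPU kernels in TF 1.15 do not).

```python
# NHWC (channels-last) is TF's default layout; NCHW (channels-first) is the
# layout many inference-optimized paths prefer.
import tensorflow as tf

nhwc_input = tf.random.uniform([1, 32, 32, 3])   # batch, height, width, channels
filters = tf.random.uniform([3, 3, 3, 16])       # kh, kw, in_channels, out_channels

out_nhwc = tf.nn.conv2d(nhwc_input, filters, strides=[1, 1, 1, 1],
                        padding="SAME", data_format="NHWC")

# The same convolution with a channels-first layout.
nchw_input = tf.transpose(nhwc_input, [0, 3, 1, 2])
out_nchw = tf.nn.conv2d(nchw_input, filters, strides=[1, 1, 1, 1],
                        padding="SAME", data_format="NCHW")
```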

Future and TensorFlow 2

Many of you have asked about our plans for supporting TensorFlow 2.x. Our focus right now is solid support of TF 1.15: good operator coverage, performance, and reliability. There's still quite a bit of work here, especially with performance, so we don't want to jump into multiple versions of TF until we have a good non-preview release of 1.15.

That said, we expect that much of the engineering effort that has gone into TF 1.15 (mainly kernels) will carry over to newer versions of TF as well. The device runtime is likely to be a bit different: there are different proposals for how this might work (e.g. composable device backends), so we'll weigh the options once we're closer to wrapping up 1.15. Building both TF 1.15 and 2.x out of this fork is also an option, but ideally we want the DirectML backend to integrate with the main repository at some point.

Finally, ARM support is something that we know many of you are interested in. DirectML itself already supports ARM and ARM64, but there's more work to do here! We do not have plans to support ARM in TF 1.15 (for now), but it's an important issue that we will likely address in tandem with TF2 support.
