HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason() #359

Open
HeloWong opened this issue Apr 17, 2023 · 7 comments

@HeloWong

Envs:
TensorFlow 2.12
tensorflow_directml_plugin-0.5.0-cp39-cp39-win_amd64.whl
Python 3.9

Error:
F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason(), and Python restarts

I built the newest tensorflow_directml_plugin 0.5.0, but when I run the MNIST example on TF, this error happens:
F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason()
GPU memory and shared memory also grow substantially.
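For reference, a minimal sketch of the kind of MNIST script this seems to point at, assuming the standard Keras tutorial model; it may not be the exact script HeloWong ran:

```python
# Hypothetical minimal repro: the standard Keras MNIST tutorial model.
# This is an assumption, not necessarily the exact script from the report.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0  # scale pixel values to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5)  # the crash reportedly happens during training
```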


FabricatiDiem commented Apr 22, 2023

I have the exact same issue and message running a simple Keras model. Fresh install, etc.

maggie1059 (Collaborator)

Hi @HeloWong, @FabricatiDiem, would you mind including the models that you saw this issue with? I'm not seeing this repro on the Keras tutorial model for MNIST, so it would be helpful for me to test using the scripts you're seeing this with. Please also double-check that your environment is using keras==2.12, as this latest version of the plugin is not compatible with previous versions of keras.
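For anyone checking their environment, a quick sketch of the version check being asked for here; the version numbers are the ones mentioned in this thread:

```python
# Quick environment sanity check (expected versions taken from this thread:
# plugin 0.5.0 needs tensorflow 2.12 and keras 2.12).
import tensorflow as tf
import keras

print("tensorflow:", tf.__version__)  # expected: 2.12.x
print("keras:", keras.__version__)    # expected: 2.12.x; older keras versions are not compatible
```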


FabricatiDiem commented Apr 26, 2023

This is a minimal example that somewhat more closely aligns with my actual use case: https://gist.github.com/FabricatiDiem/07b8645faabb1ea0a887550a0544ea9d

Note that the example works without error under WSL2 + Docker. It also tends to work if I tweak it, e.g. by removing the sparse representation (not feasible in my real use case), by making the feature space smaller, or by reducing the width of the network. It could be a memory issue, but I'm not seeing any memory-related errors, and if it were that, I would expect it to affect the Docker version too.
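For anyone who doesn't want to open the gist, a rough sketch of the shape of model being described (sparse input over a large feature space feeding wide dense layers). This is not the linked gist; the feature dimension and layer widths below are placeholders:

```python
# Rough sketch only: sparse input over a large, mostly-empty feature space,
# feeding wide Dense layers. Sizes are placeholders, not values from the gist.
import numpy as np
import scipy.sparse as sp
import tensorflow as tf

n_samples, n_features = 10_000, 50_000
x = sp.random(n_samples, n_features, density=0.001, format="coo", dtype=np.float32)
y = np.random.randint(0, 2, size=(n_samples, 1)).astype(np.float32)

# Convert the scipy COO matrix into a tf.SparseTensor (indices must be int64).
x_sparse = tf.sparse.reorder(tf.SparseTensor(
    indices=np.stack([x.row, x.col], axis=1).astype(np.int64),
    values=x.data,
    dense_shape=x.shape,
))

inputs = tf.keras.Input(shape=(n_features,), sparse=True)
h = tf.keras.layers.Dense(1024, activation="relu")(inputs)  # "wide" hidden layers
h = tf.keras.layers.Dense(1024, activation="relu")(h)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x_sparse, y, epochs=3, batch_size=256)
```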

Also, just upgrading Keras to 2.12 breaks TF entirely for me. I'm using a fresh install of the latest tensorflow-directml-plugin package, which installs TF 2.10 and a bunch of other dependencies. I'm not able to try out the bleeding-edge GitHub version on my local setup, so if it's already fixed but not yet released, I'm fine with my WSL2 + Docker setup until there's a new release.

Thanks for looking at the issue.

Edit: For completeness, my NVIDIA system information can be found here: https://gist.github.com/FabricatiDiem/fe0667aff7dc529a9b439112194f34b6

#341 looks similar, but I'm not sure.


radudiaconu0 commented May 4, 2023

I have the same issue on my AMD GPU with the latest driver (23.4.3): on the SqueezeNet example at epoch 38 and on the MNIST example at epoch 11. I built the plugin from source with tensorflow-cpu 2.12.

@NateAGeek

@radudiaconu0

Any update here?

PatriceVignola (Contributor)

I apologize for the delay. We had to pause development of this plugin until further notice. For the time being, all the latest DirectML features and performance improvements are going into onnxruntime for inference scenarios. We'll update this issue if/when things change.
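For anyone landing here looking for the inference path mentioned above, a minimal sketch of running a model through onnxruntime with the DirectML execution provider; it assumes the onnxruntime-directml package is installed, and "model.onnx" is a placeholder path rather than a file from this thread:

```python
# Minimal sketch: inference via onnxruntime's DirectML execution provider.
# Assumes `pip install onnxruntime-directml`; "model.onnx" is a placeholder.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

inp = session.get_inputs()[0]
# Replace dynamic dimensions (None or symbolic names) with 1 for a dummy run.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.zeros(shape, dtype=np.float32)

outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```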
