
[E2E] Tts_angular NotImplementedError: The operator 'aten::_thnn_fused_lstm_cell' is not currently implemented for the XPU device #1231

Open
mengfei25 opened this issue Dec 30, 2024 · 3 comments


@mengfei25
Contributor

🐛 Describe the bug

This test passed as of Aug 6; see #495.

python benchmarks/dynamo/torchbench.py --accuracy --amp --amp-dtype float16 -d xpu -n10 --training --only tts_angular --backend=inductor

xpu  train tts_angular                        
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2738, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 460, in forward_and_backward_pass
    pred = mod(*cloned_inputs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/tts_angular/model.py", line 59, in forward
    d = self.layers(x)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/tts_angular/model.py", line 18, in forward
    o, (_, _) = self.lstm(x)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 1124, in forward
    result = _VF.lstm(
NotImplementedError: The operator 'aten::_thnn_fused_lstm_cell' is not currently implemented for the XPU device. Please open a feature on https://github.com/intel/torch-xpu-ops/issues. You can set the environment variable `PYTORCH_ENABLE_XPU_FALLBACK=1` to use the CPU implementation as a fallback for XPU unimplemented operators. WARNING: this will bring unexpected performance compared with running natively on XPU.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4873, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 372, in load_model
    self.validate_model(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2740, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run
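
For reference, a minimal sketch that reproduces the failure outside the benchmark harness (shapes are hypothetical; any nn.LSTM forward on the xpu device should hit the same dispatch path):

import torch
import torch.nn as nn

# Minimal repro sketch: an LSTM forward on xpu dispatches through
# _VF.lstm into the fused cell kernel, aten::_thnn_fused_lstm_cell.
device = torch.device("xpu")
lstm = nn.LSTM(input_size=40, hidden_size=256, num_layers=3, batch_first=True).to(device)
x = torch.randn(8, 100, 40, device=device)
out, (h, c) = lstm(x)  # raises NotImplementedError on this torch-xpu-ops build

As the error message suggests, running with PYTORCH_ENABLE_XPU_FALLBACK=1 works around the failure by executing the operator on CPU, at a performance cost.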

Versions

Device PVC 1100
OS Ubuntu 22.04 LTS
Driver 803.61
Torch release/2.6 (31b520a59915990a9ea2d1f4c1d90a18a3a90dfe)
Torch-xpu-ops release/2.6 (214f33b)
Triton e98b6fcb8df5b44eb0d0addb6767c573d37ba024
Transformers 243e186efbf7fb93328dd6b34927a4e8c8f24395
Torchvision d23a6e1664d20707c11781299611436e1f0c104f
Torchaudio 332760d4b300f00a0d862e3cfe1495db3b1a14f9
Torchbench 03cde49eba0580ed17f9ae2250832fd8af4ed756
Timm ac3470188b914c5d7a5058a7e28b9eb685a62427
Bundle 2025.0.1.20241113

@mengfei25
Contributor Author

demucs has the same issue.

@ekaakurniawan

I'm hitting the same issue here. Please find the details below.

https://github.com/ekaakurniawan/DLND/blob/a770-project4/P4-Generating-TV-Script/dlnd_tv_script_generation.ipynb

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[17], line 69
     63         hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device),
     64                   weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device))
     66 """
     67 DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
     68 """
---> 69 tests.test_rnn(RNN, device)

File ~/Workspace/DLND/P4-Generating-TV-Script/problem_unittests.py:161, in test_rnn(RNN, device)
    158 rnn.to(device)
    159 b = b.to(device)
--> 161 output, hidden_out = rnn(b, hidden)
    163 assert_test = AssertTest({
    164                          'Input Size': vocab_size,
    165                          'Output Size': output_size,
   (...)
    169                          'Sequence Length': sequence_length,
    170                          'Input': b})
    172 # initialization

File ~/Workspace/pytorch_arc/pytorch_arc_env/lib/python3.12/site-packages/torch/nn/modules/module.py:1739, in Module._wrapped_call_impl(self, *args, **kwargs)
   1737     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738 else:
-> 1739     return self._call_impl(*args, **kwargs)

File ~/Workspace/pytorch_arc/pytorch_arc_env/lib/python3.12/site-packages/torch/nn/modules/module.py:1750, in Module._call_impl(self, *args, **kwargs)
   1745 # If we don't have any hooks, we want to skip the rest of the logic in
   1746 # this function, and just call forward.
   1747 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1748         or _global_backward_pre_hooks or _global_backward_hooks
   1749         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750     return forward_call(*args, **kwargs)
   1752 result = None
   1753 called_always_called_hooks = set()

Cell In[17], line 43, in RNN.forward(self, nn_input, hidden)
     40 batch_size = nn_input.size(0)
     42 nn_output = self.embedding_layer(nn_input.long())
---> 43 nn_output, hidden = self.lstm_layer(nn_output, hidden)
     44 nn_output = nn_output.contiguous().view(-1, self.hidden_dim)
     45 nn_output = self.dropout_layer(nn_output)

File ~/Workspace/pytorch_arc/pytorch_arc_env/lib/python3.12/site-packages/torch/nn/modules/module.py:1739, in Module._wrapped_call_impl(self, *args, **kwargs)
   1737     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738 else:
-> 1739     return self._call_impl(*args, **kwargs)

File ~/Workspace/pytorch_arc/pytorch_arc_env/lib/python3.12/site-packages/torch/nn/modules/module.py:1750, in Module._call_impl(self, *args, **kwargs)
   1745 # If we don't have any hooks, we want to skip the rest of the logic in
   1746 # this function, and just call forward.
   1747 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1748         or _global_backward_pre_hooks or _global_backward_hooks
   1749         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750     return forward_call(*args, **kwargs)
   1752 result = None
   1753 called_always_called_hooks = set()

File ~/Workspace/pytorch_arc/pytorch_arc_env/lib/python3.12/site-packages/torch/nn/modules/rnn.py:1124, in LSTM.forward(self, input, hx)
   1121         hx = self.permute_hidden(hx, sorted_indices)
   1123 if batch_sizes is None:
-> 1124     result = _VF.lstm(
   1125         input,
   1126         hx,
   1127         self._flat_weights,  # type: ignore[arg-type]
   1128         self.bias,
   1129         self.num_layers,
   1130         self.dropout,
   1131         self.training,
   1132         self.bidirectional,
   1133         self.batch_first,
   1134     )
   1135 else:
   1136     result = _VF.lstm(
   1137         input,
   1138         batch_sizes,
   (...)
   1145         self.bidirectional,
   1146     )

NotImplementedError: The operator 'aten::_thnn_fused_lstm_cell' is not currently implemented for the XPU device. Please open a feature on https://github.com/intel/torch-xpu-ops/issues. You can set the environment variable `PYTORCH_ENABLE_XPU_FALLBACK=1` to use the CPU implementation as a fallback for XPU unimplemented operators. WARNING: this will bring unexpected performance compared with running natively on XPU.
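
As a stopgap until the operator is implemented, the fallback mentioned in the error message can be enabled from Python; a minimal sketch (setting the variable before importing torch, which is the safe ordering in a notebook) is:

import os

# Opt in to the CPU fallback for operators the XPU backend does not
# implement yet. Per the warning in the error message, the LSTM will
# then run on CPU rather than natively on the XPU device.
os.environ["PYTORCH_ENABLE_XPU_FALLBACK"] = "1"

import torch  # import after setting the variable so the dispatcher sees it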
$ python collect_env.py
Collecting environment information...
PyTorch version: 2.6.0+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               24
On-line CPU(s) list:                  0-23
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) Ultra 9 285K
CPU family:                           6
Model:                                198
Thread(s) per core:                   1
Core(s) per socket:                   1
Socket(s):                            24
Stepping:                             2
CPU(s) scaling MHz:                   28%
CPU max MHz:                          5100.0000
CPU min MHz:                          800.0000
BogoMIPS:                             7372.80
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault intel_ppin ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni lam wbnoinvd dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid bus_lock_detect movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            768 KiB (20 instances)
L1i cache:                            1.3 MiB (20 instances)
L2 cache:                             40 MiB (12 instances)
L3 cache:                             36 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-23
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] pytorch-triton-xpu==3.2.0
[pip3] torch==2.6.0+xpu
[pip3] torchaudio==2.6.0+xpu
[pip3] torchvision==0.21.0+xpu
[pip3] triton==3.2.0
[conda] Could not collect

@xytintel
Contributor

xytintel commented Jan 3, 2025

This operator has already been cherry-picked to release/2.6: #1233
