How can I debug a reproducible error? #20792

Open
MithrilMan opened this issue May 23, 2024 · 7 comments
Labels
- ep:CUDA: issues related to the CUDA execution provider
- model:transformer: issues related to a transformer model (BERT, GPT2, Hugging Face, Longformer, T5, etc.)
- platform:windows: issues related to the Windows platform
- stale: issues that have not been addressed in a while; categorized by a bot

MithrilMan commented May 23, 2024

Describe the issue

I'm quite new to onnxruntime. I'm using it from C#, and my first project is TTS: I generate phonemes with espeak-ng, convert them to the voice's phoneme IDs, and then generate audio with a VITS voice taken from rhasspy/piper.
It works most of the time, but sometimes the model returns errors.
I'm using onnxruntime 1.18 and the problem occurs on both CPU and GPU. I'd like to understand how to debug and fix this kind of problem, so let's focus on GPU.

I've found a specific input that triggers an error most of the time.
The error is:

2024-05-23 21:53:42.7549611 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running Reshape node. Name:'/Reshape_1' Status Message: C:\a\_work\1\s\onnxruntime\core/providers/cpu/tensor/reshape_helper.h:30 onnxruntime::ReshapeHelper::ReshapeHelper i < input_shape.NumDimensions() was false. The dimension with value zero exceeds the dimension size of the input tensor.

The first run works; subsequent runs give the error above (but only with specific inputs).
I managed to isolate the phonemes (and therefore the input) that trigger it, but I don't know how to fix it and I'm clueless.

To reproduce

I've seen people mention Netron, so I opened the model in it. The offending node is this one:
[three screenshots of the offending Reshape node in Netron]

The model is this https://huggingface.co/rhasspy/piper-voices/tree/main/it/it_IT/riccardo/x_low

The input that triggers the error is:

[1,28,0,30,0,27,0,38,0,18,0,66,0,35,0,22,0,120,0,14,0,25,0,27,0,3,0,23,0,27,0,26,0,3,0,24,0,18,0,3,0,17,0,27,0,25,0,120,0,14,0,26,0,17,0,18,0,3,0,10,0,24,0,18,0,3,0,28,0,30,0,27,0,34,0,27,0,23,0,14,0,32,0,31,0,21,0,120,0,27,0,26,0,74,0,3,0,18,0,3,0,24,0,18,0,3,0,30,0,21,0,19,0,24,0,18,0,31,0,31,0,22,0,120,0,27,0,26,0,74,0,2]

I initialize the InferenceSession this way:

// Configure the CUDA execution provider.
using var cudaOptions = new OrtCUDAProviderOptions();

var providerOptionsDict = new Dictionary<string, string>
{
   ["device_id"] = "0",
   ["cudnn_conv_algo_search"] = "HEURISTIC"
};

cudaOptions.UpdateOptions(providerOptionsDict);

// Register CUDA first, then the CPU provider.
using var opts = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptions);
opts.AppendExecutionProvider_CPU();
_session = new InferenceSession(voiceModelPath, opts);

and I keep the _session alive. Whenever I need to generate speech, I call this code:

private float[] SynthesizeIdsToRaw(long[] phonemeIds, float lengthScale = 1.0f, float noiseScale = 0.667f, float noiseW = 0.8f)
{
   EnsureHasSession();

   var inputLength = phonemeIds.Length;

   var inputs = new Dictionary<string, OrtValue>
   {
      ["input"] = OrtValue.CreateTensorValueFromMemory(phonemeIds, [1, inputLength]),
      ["input_lengths"] = OrtValue.CreateTensorValueFromMemory(new long[] { inputLength }, [1]),
      ["scales"] = OrtValue.CreateTensorValueFromMemory(new[] { noiseScale, lengthScale, noiseW }, [3])
   };

   try
   {
      using var runOptions = new RunOptions() { LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE, LogVerbosityLevel = 1 };
      using var results = _session.Run(runOptions, inputs, _session.OutputNames);
      var audio = results[0].GetTensorDataAsSpan<float>().ToArray();
      return audio;
   }
   catch (Exception ex)
   {
      _logger.LogError(ex, "Error synthesizing audio");
      return [];
   }
   finally
   {
      foreach (var input in inputs.Values)
      {
         input.Dispose();
      }
   }
}

Is this code correct? I can't find proper, up-to-date, detailed documentation. I'm not into model generation/training; I'm very proficient in C# but quite new to the technical side of model development.

Urgency

I want to solve the problem, but the most important thing to me is to understand how to debug these issues rather than being handed a ready-made solution.

Thanks

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18

ONNX Runtime API

C#

Architecture

X64

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

github-actions bot added the ep:CUDA, model:transformer, and platform:windows labels May 23, 2024
yuslepukhin (Member) commented May 28, 2024

You have a model with dynamic dimensions. This means that a dimension can vary, typically with the input shape.
A variable dimension often affects output shapes, with a cascading effect downstream, which means the inputs and outputs of other nodes must have variable dimensions too.
It is not always easy to make that work in all cases.

In this case Reshape takes a data input whose first and last dimensions are dynamic, meaning you have multiple variable dimensions.

The error message indicates that the requested shape has a zero in at least one position while the allowzero attribute is false (0), according to the image you posted. With allowzero at 0, a zero means "copy that dimension from the input", and here the input tensor has no dimension at that index.

To debug this, you have a couple of options:
1) Visually trace the data flow (this is often possible, but not always) and work out where the zero is coming from, and whether it should be there or whether Reshape should allow it.
2) Rebuild from source (C++), enable some of the debugging facilities, and dump the input/output shapes or store them in a SQLite database.
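To make the failing check concrete, here is a rough sketch of how a Reshape target shape gets resolved when allowzero is 0, written from the ONNX operator description. `ResolveReshape` is a hypothetical helper for illustration only, not ONNX Runtime's actual code, and the shapes used are made up rather than taken from the model in this issue.

```csharp
using System;
using System.Linq;

// Hypothetical helper: resolves a Reshape target shape following the ONNX rules.
// -1 means "infer this dimension"; 0 means "copy this dimension from the input"
// unless allowZero is true.
static long[] ResolveReshape(long[] inputShape, long[] requestedShape, bool allowZero = false)
{
    var resolved = (long[])requestedShape.Clone();
    long inputSize = inputShape.Aggregate(1L, (a, b) => a * b);
    long knownSize = 1;
    int unknownIndex = -1;

    for (int i = 0; i < resolved.Length; i++)
    {
        if (resolved[i] < -1)
            throw new ArgumentException($"Dimension {i} is {resolved[i]}; values below -1 are not allowed.");

        if (resolved[i] == -1)
        {
            if (unknownIndex != -1)
                throw new ArgumentException("At most one dimension can be -1.");
            unknownIndex = i;
            continue;
        }

        if (resolved[i] == 0 && !allowZero)
        {
            // The input must actually have a dimension at index i to copy from.
            // This is the condition that fails in the reported error.
            if (i >= inputShape.Length)
                throw new ArgumentException(
                    $"Zero at index {i}, but the input only has {inputShape.Length} dimension(s).");
            resolved[i] = inputShape[i];
        }

        knownSize *= resolved[i];
    }

    if (unknownIndex != -1)
    {
        if (knownSize == 0 || inputSize % knownSize != 0)
            throw new ArgumentException("Cannot infer the -1 dimension from the input size.");
        resolved[unknownIndex] = inputSize / knownSize;
    }
    else if (knownSize != inputSize)
    {
        throw new ArgumentException("Element count of the requested shape differs from the input.");
    }

    return resolved;
}

// Rank-3 input, zero at index 0: the zero copies input dimension 0.
Console.WriteLine(string.Join(",", ResolveReshape(new long[] { 2, 3, 4 }, new long[] { 0, -1 }))); // 2,12

// Rank-1 input, zero at index 1: there is no input dimension 1 to copy, so this throws.
try { ResolveReshape(new long[] { 6 }, new long[] { -1, 0 }); }
catch (ArgumentException ex) { Console.WriteLine(ex.Message); }
```

So when tracing the graph, the two things to watch are the rank of the tensor arriving at the Reshape data input and the values of its shape input for the failing run.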

MithrilMan (Author) commented May 29, 2024

@yuslepukhin Thanks for the feedback. The model is a https://github.com/rhasspy/piper model that was probably converted from a ckpt to an ONNX model by their converter. So do you mean the problem could be in the export script?

What I find strange is that with onnxruntime v1.16 I didn't have these errors.

This is the code I suppose was used to convert that model: https://github.com/rhasspy/piper/blob/master/src/python/piper_train/export_onnx.py

About the debugging: can you trace visually in an interactive way? There are tons of nodes, so it would be nice to be able to set a breakpoint and then inspect the call stack :)

yuslepukhin (Member) commented

It may be a converter issue, or it may be an issue in the original model.
The call stack would not help you, as it is virtually the same when running any kernel.
You can put a breakpoint and stop when a particular node is executing, no problem.
What is of interest is the input shape, how the shape changes from node to node, and how one arrives at a dimension value of zero.

You can insert some code here to interrogate the kernel inputs/outputs and print their shapes: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/sequential_executor.cc#L452

MithrilMan (Author) commented

Thanks, unluckily I'm using the C# version of onnxruntime so I can't tap into the cc

yuslepukhin (Member) commented

> Thanks, unluckily I'm using the C# version of onnxruntime so I can't tap into the cc

There are no different versions of onnxruntime. There is only one. C#, like the other language bindings, is just an interface for consuming onnxruntime.
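Because the bindings only wrap the native runtime, some of its diagnostics can be switched on from C# through SessionOptions without rebuilding anything. The following is a minimal sketch, not a fix: it assumes the Microsoft.ML.OnnxRuntime NuGet package, a hypothetical model path, and the CPU provider only (the error was reported on CPU as well). It will not stop on a node the way a native breakpoint would, and whether turning off graph optimizations or memory-pattern reuse changes the behaviour here is an open question to test, not a known cause.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

// Hypothetical path; point it at the voice model under test.
const string modelPath = "voice.onnx";

using var opts = new SessionOptions
{
    // Verbose native logging for the whole session, not just a single Run call.
    LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE,
    LogVerbosityLevel = 1,

    // Keep the executed graph as close as possible to the file opened in Netron.
    GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL,

    // The first run works and later ones fail, so try with memory-pattern reuse off.
    EnableMemoryPattern = false,

    // Write a JSON trace with one event per executed node.
    EnableProfiling = true,
    ProfileOutputPathPrefix = "ort_profile"
};

using var session = new InferenceSession(modelPath, opts);

// ... run the failing input several times here ...

// Flushes the trace and returns the name of the file that was written.
string profileFile = session.EndProfiling();
Console.WriteLine($"Profile written to {profileFile}");
```

If the error disappears with one of these switches flipped, that narrows down where to look; if it does not, the log and trace at least show which nodes ran before /Reshape_1 failed.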

MithrilMan (Author) commented Jun 5, 2024

Yes, I mean I'm consuming the C# NuGet packages, which are wrappers around the native onnxruntime, so I can't put a breakpoint in the .cc code, or at least I think so. I know NuGet packages can include symbols, but I thought those were only for .NET code. It would be nice to be able to put breakpoints in the .cc code too.
Am I wrong?

[two screenshots]

Where can I find up-to-date technical documentation about onnxruntime?
I'd like to dig into it.

github-actions bot (Contributor) commented Jul 6, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label Jul 6, 2024