Describe the issue
We export ONNX transformer encoder models in OML4Py with the tokenizer attached to the bottom of the model, so the ONNX model accepts a string tensor input and returns the embedding vector. We use the tokenizer operations from onnxruntime-extensions, which are CPU-only, and have wrapped them in an ONNX graph which batches, pads, and truncates the tokenized representation using the `SequenceMap` operation.
When increasing the batch size we've noticed that much of the runtime is spent in the `Loop` op which `SequenceMap` uses, which is very odd considering it doesn't actually do very much. After some investigation we determined that this was due to most of the ops in the tokenizer graph being placed on the GPU rather than the CPU, even though the subgraph we're looping over must be placed on the CPU due to the presence of the `BertTokenizer` op. We would like the whole tokenization graph, including the `Loop` op, to be placed on the CPU EP, but there doesn't appear to be a way to control op placement in the C API. Alternatively, there might be a bug in the way that ops are placed on the CPU: this code looks like it should fall back the whole subgraph and loop to CPU, but I don't understand it well enough to tell whether that's part of the issue.
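To make concrete what the `SequenceMap` wrapper computes per batch, here is a minimal pure-Python sketch of the batching logic. The padding scheme (right-padding to the longest sequence, capped at a model maximum) and the `pad_id` default are assumptions for illustration, not taken from the exported graph:

```python
def batch_pad_truncate(seqs, model_max_len, pad_id=0):
    """Sketch of the per-batch work the SequenceMap wrapper performs:
    truncate each tokenized sequence to a common target length and
    right-pad the shorter ones so the batch forms a rectangular tensor."""
    # Target length: longest sequence in the batch, capped at the model max.
    target = min(max(len(s) for s in seqs), model_max_len)
    return [s[:target] + [pad_id] * (target - len(s[:target])) for s in seqs]


# Example: one sequence padded, none truncated (target length is 3).
batch_pad_truncate([[1, 2, 3], [4]], model_max_len=8)
# -> [[1, 2, 3], [4, 0, 0]]
```

The point of the report is that this per-sequence work is trivial, so the `Loop` op that implements it should not dominate the runtime; it only does when intermediate tensors bounce between CPU and GPU.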
To reproduce
Run the supplied `bert-tok.onnx` graph with a batch size of 100 and the CUDA EP enabled; most of the runtime is spent in the tokenization loop operation, transferring data between CPU and GPU.
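As a rough sketch of the repro using the Python API for brevity (the report uses the C API; the `run_with_profiling` helper, its defaults, and the dummy input text are ours), enabling session profiling makes the per-op time, including the tokenization `Loop` and the `Memcpy` nodes it triggers, visible in the emitted trace:

```python
def run_with_profiling(model_path, batch_size=100):
    """Sketch: run the tokenizer+encoder model under the CUDA EP with
    profiling enabled, then return the path of the JSON profile, where
    the time spent in the tokenization Loop and Memcpy nodes shows up."""
    # Lazy imports keep this sketch importable without onnxruntime installed.
    import numpy as np
    import onnxruntime as ort
    from onnxruntime_extensions import get_library_path

    so = ort.SessionOptions()
    # Register the onnxruntime-extensions custom ops (BertTokenizer etc.).
    so.register_custom_ops_library(get_library_path())
    so.enable_profiling = True
    sess = ort.InferenceSession(
        model_path, so,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    texts = np.array(["some input text"] * batch_size)
    input_name = sess.get_inputs()[0].name
    sess.run(None, {input_name: texts})
    return sess.end_profiling()  # path to the JSON trace file
```

The resulting trace can be opened in `chrome://tracing`; the symptom described above appears as `Loop` plus `MemcpyToHost`/`MemcpyFromHost` entries dominating the timeline.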
Urgency
This performance issue prevents the use of GPUs to accelerate our models.
Platform
Linux
OS Version
OL8
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.20.0
ONNX Runtime API
C
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
bert-tok.onnx.zip
Is this a quantized model?
No