Onnx granite #2043
Conversation
(force-pushed from e4b94a4 to 29fc359)
Can you add this architecture to the test suite using a small tiny-random model, and push it to the hub? (You can create it with
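Purely as an illustration of what is being asked for, here is a hedged sketch of one way to create and push such a tiny random Granite checkpoint; the config values, tokenizer source, and repo id are illustrative rather than what was actually used:

```python
# Hedged sketch: build a tiny GraniteForCausalLM with random weights for tests.
# Config values, tokenizer source, and target repo id are illustrative.
from transformers import AutoTokenizer, GraniteConfig, GraniteForCausalLM

config = GraniteConfig(
    vocab_size=49152,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,  # keep GQA enabled so the grouped-attention path is exercised
)
model = GraniteForCausalLM(config)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-instruct")

model.save_pretrained("tiny-random-GraniteForCausalLM")
tokenizer.save_pretrained("tiny-random-GraniteForCausalLM")
# model.push_to_hub("hf-internal-testing/tiny-random-GraniteForCausalLM")
# tokenizer.push_to_hub("hf-internal-testing/tiny-random-GraniteForCausalLM")
```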
I've created a tiny-random model: https://huggingface.co/hf-internal-testing/tiny-random-GraniteForCausalLM. Unfortunately, ONNX export currently fails with the following error (see log). Investigating why 🤔
Turns out it is a known (and already fixed) bug: pytorch/pytorch#135615. Upgrading to torch nightly fixes it 👍
Oooh great! It's the error I've been seeing when exporting CLIP with SDPA.
Thanks for the quick review! I realized I missed a lot of context on the issue itself, so I will update with details of how I tested this locally.

And just to confirm: I tested the ONNX model with Transformers.js, and it matches the Python version exactly 👍
Perfect, thanks a lot @xenova!

As for the minimum PyTorch version, it can be set with
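Presumably this refers to the version guards on the export config class; a hedged sketch of what that might look like, following the pattern other optimum export configs use (the class name and version numbers are illustrative):

```python
# Hedged sketch: pin minimum library versions on an ONNX export config, in the
# style used by other configs in optimum. Names and versions are illustrative.
from packaging import version

from optimum.exporters.onnx.model_configs import LlamaOnnxConfig


class GraniteOnnxConfig(LlamaOnnxConfig):
    # Granite support landed in transformers 4.45, and the export needs the
    # SDPA fix from pytorch/pytorch#135615, first released in torch 2.5.
    MIN_TRANSFORMERS_VERSION = version.parse("4.45.0")
    MIN_TORCH_VERSION = version.parse("2.5.0")
```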
@gabe-l-hart Let me know if you need any help with the final steps (as outlined above)! 🤗 PyTorch 2.5 was just released, which should fix the above issue (untested; the previous nightly version did fix the bug).
Thank you! I've been sidetracked on other threads, but hope to get back to this one later in the week.

No worries! I'm doing the conversions of the latest set of released models with this branch. Will report back when complete. Also, it looks like https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-instruct is tagged as

FYI: I can confirm that https://huggingface.co/onnx-community/granite-3.0-2b-instruct exported correctly and works in Transformers.js 👍 The MoE model is slightly more complicated: it throws an error when performing top_k on a tensor whose dim is < k. Looks like we need to insert a min op there.
(force-pushed from 29fc359 to 5925174)
Ok, I've added

When I looked at this myself, I got stuck on the "more complicated" part and punted on it for this PR. Is adding this
You can also add the architecture for testing with onnxruntime 😄
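The onnxruntime-side test changes typically amount to registering the architecture and its tiny model; a hedged sketch of the kind of entries involved (file locations and variable names are from memory and may not match the test suite exactly):

```python
# Hedged sketch of the test registrations being requested; the exact file and
# variable names in optimum's test suite may differ.

# tests/onnxruntime/utils_onnxruntime_tests.py
MODEL_NAMES = {
    # ... existing entries ...
    "granite": "hf-internal-testing/tiny-random-GraniteForCausalLM",
}

# tests/onnxruntime/test_modeling.py (causal LM integration tests)
SUPPORTED_ARCHITECTURES = [
    # ... existing entries ...
    "granite",
]
```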
(force-pushed from f936bad to 22de7b7)
Thanks for pointing that out! I think I've got that added too. I tried running the
LGTM ✅
I'll do some final tests in my conversion script, and will merge soon. Thanks @gabe-l-hart 👍
Thanks @xenova! Do you have any further insight into what we would need to add to support the MoE models? I haven't been able to dig deep on it, and it looked like you had a thought around inserting a min op.
(force-pushed from 22de7b7 to 52da69f)
Thanks @gabe-l-hart for the PR! Everything looks good in my conversion script too (for Transformers.js models). As for the MoE models, I think this would require a modification to the transformers modeling code (here, I think), just ensuring that the value passed to top_k never exceeds the size of the dimension it is applied over.
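A hedged sketch of the kind of clamp being discussed; the function below is a stand-in for the MoE routing step, not the actual transformers GraniteMoe code:

```python
import torch


def route_tokens(router_logits: torch.Tensor, top_k: int):
    # Stand-in for the MoE router step discussed above: clamp k so that
    # torch.topk never receives a k larger than the dimension being reduced,
    # which is what the exported graph trips over at runtime.
    k = min(top_k, router_logits.size(-1))
    top_values, top_indices = torch.topk(router_logits, k, dim=-1)
    return top_values, top_indices
```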
Thank you! I'll dig in and see if I can get the MoE ones working with tweaks in transformers.

Great! The

Here's a tiny random model which you can use for testing: https://huggingface.co/hf-internal-testing/tiny-random-GraniteMoeForCausalLM

Awesome, thank you!
It seems that the granite tests with onnxruntime are not passing on main, any idea why?
Ah, that's an issue with the tiny model. My bad! Will re-export with the correct vocab size (49152, not 32000).

Fixed 👍

Thank you! I've been in-and-out today and haven't had a chance to look into it. Appreciate the quick fix.
@xenova Can you share the export command you're running that produced this error? I'm working on the MoE support and I'm not seeing a case where the top_k error occurs.

The command I'm running is:

optimum-cli export onnx --model /Users/ghart/models/granite-3.0-1b-a400m-instruct /Users/ghart/models/granite-3.0-1b-a400m-instruct-onnx/ --task text-generation-with-past

When I run this, I see that it completes successfully, but the verification step shows some huge diffs, indicating that the converted model is not valid:
The error occurs during runtime 😅 ... the export process seems to produce a valid .onnx graph, but throws an error when actually running the model.
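For context, a hedged sketch of how one might exercise the exported graph at runtime with onnxruntime via optimum (the local directory and prompt are illustrative):

```python
# Hedged sketch: load the exported model with onnxruntime via optimum and run a
# short generation, which is the step where the MoE top_k error surfaces.
# The directory name and prompt are illustrative.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "granite-3.0-1b-a400m-instruct-onnx"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = ORTModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```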
What does this PR do?
This PR adds support for models using IBM's GraniteForCausalLM architecture when converting to ONNX. The key changes are:
- Allow users to opt into using `transformers>=4.45` for `onnx` conversions (no longer needed)
- Add `"granite"` to model configs and tasks
- Add `"granite"` as a `model_type` that uses grouped attention

NOTE: I encountered an issue very similar to the one discussed in #1835. The root cause for me was the need to add `"granite"` to the list of models requiring Grouped Query Attention in `modeling_decoder.py`. I don't believe that is the root cause for #1835, since `"llama"` is already present there, but it is likely a similar issue showing up in the inference module using `num_attention_heads` instead of `num_key_value_heads`.
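A hedged sketch of the kind of branch involved; the names below are illustrative, not the exact contents of `modeling_decoder.py`:

```python
# Hedged sketch of why the model-type list matters for GQA; names here are
# illustrative, not the actual modeling_decoder.py implementation.
from transformers import PretrainedConfig

GQA_MODEL_TYPES = {"llama", "mistral", "granite"}  # "granite" is the new entry


def num_kv_heads(config: PretrainedConfig) -> int:
    # Architectures using grouped-query attention must be listed explicitly;
    # otherwise the past key/value cache is sized with num_attention_heads
    # instead of num_key_value_heads, producing shape mismatches at inference.
    if config.model_type in GQA_MODEL_TYPES:
        return config.num_key_value_heads
    return config.num_attention_heads
```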
Rationale
This PR specifically addresses the `"GraniteForCausalLM"` architecture for IBM's forthcoming Granite family of models. The current ibm/PowerLM-3b model uses this architecture and can be used as a placeholder for testing until the new models are released. The one exception is that the PowerLM model has `num_attention_heads` and `num_key_value_heads` set to match (no Grouped Query Attention), whereas the new models will use GQA (thus the need for the change to ensure GQA is used for `"granite"` at inference time).

Testing
When testing locally, I had the following dependency versions:
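For reproducing the setup, a small snippet that prints the versions of the relevant packages in the current environment:

```python
# Print installed versions of the packages relevant to this export path.
from importlib.metadata import version

for pkg in ("torch", "transformers", "optimum", "onnx", "onnxruntime"):
    print(f"{pkg}: {version(pkg)}")
```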
To test the conversion, I did the following:
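The CLI command quoted earlier in the thread covers the export; as a hedged sketch, the same step via the programmatic entry point (model id and output directory are illustrative):

```python
# Hedged sketch: programmatic equivalent of the optimum-cli export command
# quoted earlier in the thread. Model id and output directory are illustrative.
from optimum.exporters.onnx import main_export

main_export(
    "ibm-granite/granite-3.0-2b-instruct",
    output="granite-3.0-2b-instruct-onnx",
    task="text-generation-with-past",
)
```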
To evaluate the output side-by-side with the source model, I used the following script:
side_by_side.py
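As a hedged sketch of what such a side-by-side comparison might look like (the paths, prompt, and generation settings are illustrative, not the contents of the original side_by_side.py):

```python
# Hedged sketch of a side-by-side check: run the same prompt through the
# original PyTorch model and the ONNX export and compare the generated text.
# Paths, prompt, and settings are illustrative.
import torch
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

source_id = "ibm-granite/granite-3.0-2b-instruct"
onnx_dir = "granite-3.0-2b-instruct-onnx"
prompt = "Write a haiku about ONNX export."

tokenizer = AutoTokenizer.from_pretrained(source_id)
inputs = tokenizer(prompt, return_tensors="pt")

pt_model = AutoModelForCausalLM.from_pretrained(source_id, torch_dtype=torch.float32)
ort_model = ORTModelForCausalLM.from_pretrained(onnx_dir)

with torch.no_grad():
    pt_out = pt_model.generate(**inputs, max_new_tokens=32, do_sample=False)
ort_out = ort_model.generate(**inputs, max_new_tokens=32, do_sample=False)

print("pytorch:", tokenizer.decode(pt_out[0], skip_special_tokens=True))
print("onnx   :", tokenizer.decode(ort_out[0], skip_special_tokens=True))
```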
Before submitting
Who can review?