
Onnx granite #2043

Merged: 9 commits from OnnxGranite into huggingface:main on Oct 31, 2024

Conversation

@gabe-l-hart (Contributor) commented Oct 8, 2024

What does this PR do?

This PR adds support for models using IBM's GraniteForCausalLM architecture when converting to ONNX. The key changes are:

  • Allow users to opt into using transformers>=4.45 for ONNX conversions (no longer needed)
  • Add "granite" to model configs and tasks
  • Add "granite" as a model_type that uses grouped attention

NOTE: I encountered an issue very similar to the one discussed in #1835. The root cause for me was the need to add "granite" to the list of models requiring Grouped Query Attention in modeling_decoder.py. I don't believe that is the root cause for #1835 since "llama" is already present there, but it is likely a similar issue showing up in the inference module using num_attention_heads instead of num_key_value_heads.

Rationale

This PR specifically addresses the "GraniteForCausalLM" architecture for IBM's forthcoming Granite family of models. The current ibm/PowerLM-3b model uses this architecture and can serve as a placeholder for testing until the new models are released. The one exception is that the PowerLM model has num_attention_heads and num_key_value_heads set to match (no Grouped Query Attention), whereas the new models will use GQA (hence the change to ensure GQA is applied for "granite" at inference time).
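
To make the grouped-attention distinction concrete, here is a hypothetical sketch (illustrative only, not the actual optimum modeling_decoder.py source) of why the model type matters when shaping the KV cache:

# Hypothetical sketch: GQA models size their past key/value tensors with
# num_key_value_heads rather than num_attention_heads ("granite" is the entry
# added by this PR).
MODEL_TYPES_WITH_GQA = {"llama", "mistral", "granite"}

def past_kv_num_heads(config) -> int:
    """Number of heads in the cached key/value tensors for a given config."""
    if config.model_type in MODEL_TYPES_WITH_GQA:
        return config.num_key_value_heads
    return config.num_attention_heads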

Testing

When testing locally, I had the following dependency versions:

onnx==1.16.2
onnxruntime==1.19.2
torch==2.4.1
torchvision==0.19.1
transformers==4.45.2

To test the conversion, I did the following:

optimum-cli export onnx \
  --model $HOME/models/powerlm-3b \
  $HOME/models/powerlm-3b-onnx \
  --task text-generation-with-past

To evaluate the output side-by-side with the source model, I used the following script:

side_by_side.py
"""
Simple function to run and time pre and post optimized models
"""

# Standard
from datetime import timedelta
import argparse
import os
import time

# Third Party
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def maybe_to_device(inputs: dict[str, torch.Tensor], device: str | None):
    """Send inputs to the device if desired"""
    if device:
        for k, v in inputs.items():
            inputs[k] = v.to(device)

def run_and_time(
    label: str,
    model_path: str,
    model_class: type[ORTModelForCausalLM] | type[AutoModelForCausalLM],
    prompt: str,
    device: str | None,
    **kwargs,
):
    start_time = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = model_class.from_pretrained(model_path, device_map=device)
    load_end_time = time.time()
    inputs = tokenizer(prompt, return_tensors="pt")

    tok_end_time = time.time()
    maybe_to_device(inputs, device)
    outputs = model.generate(**inputs, **kwargs)
    gen_end_time = time.time()
    res = tokenizer.decode(outputs[0])
    end_time = time.time()
    print(f"------ {label} ------")
    print(res)
    print(f"Total Time: {timedelta(seconds=end_time-start_time)}")
    print(f"Load Time: {timedelta(seconds=load_end_time-start_time)}")
    print(f"Generate Time: {timedelta(seconds=gen_end_time-tok_end_time)}")
    print()

# Defaults
home = os.getenv("HOME")
assert home, "Need $HOME!"
orig_model_path = f"{home}/models/PowerLM-3b"
onnx_model_path = f"{home}/models/PowerLM-3b-onnx-O4"
prompt = "Write a code to find the maximum value in a list of numbers."
device = "cuda" if torch.cuda.is_available() else None

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--prompt", "-p", default=prompt)
    parser.add_argument("--raw-model", "-r", type=str, default=None)
    parser.add_argument("--onnx-model", "-o", type=str, default=None)
    parser.add_argument("--device", "-d", type=str, default=device)
    args = parser.parse_args()
    if args.raw_model:
        run_and_time("Transformers", args.raw_model, AutoModelForCausalLM, args.prompt, args.device, max_new_tokens=100)
    if args.onnx_model:
        run_and_time("ONNX", args.onnx_model, ORTModelForCausalLM, args.prompt, args.device, max_new_tokens=100)


if __name__ == "__main__":
    main()
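
For example, the script can be run against both the original and converted models like this (paths follow the conversion command above):

python side_by_side.py \
  --raw-model $HOME/models/powerlm-3b \
  --onnx-model $HOME/models/powerlm-3b-onnx \
  --prompt "Write a code to find the maximum value in a list of numbers."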

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • (N/A) Did you make sure to update the documentation with your changes?
    • This is a model addition and there is no model-specific documentation that I can find
  • (N/A) Did you write any new necessary tests?
    • This is a model addition and there are no model-specific tests that I can find

Who can review?

@IlyasMoutawwakil (Member):

Can you add this architecture to the test suite using a tiny random model and push it to the Hub? (You can create it with GraniteForCausalLM.from_config().)

@xenova (Contributor) commented Oct 11, 2024

I've created a tiny-random model: https://huggingface.co/hf-internal-testing/tiny-random-GraniteForCausalLM.

Unfortunately, ONNX export currently fails

optimum-cli export onnx -m hf-internal-testing/tiny-random-GraniteForCausalLM /tmp/

with the following error:

See log
Framework not specified. Using pt to export the model.
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: GraniteForCausalLM *****
Using framework PyTorch: 2.4.1+cu121
Overriding 1 configuration item(s)
	- use_cache -> True
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py:447: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  or len(self.key_cache[layer_idx]) == 0  # the layer has no cache
/usr/local/lib/python3.10/dist-packages/transformers/models/granite/modeling_granite.py:982: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py:432: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  elif len(self.key_cache[layer_idx]) == 0:  # fills previously skipped layers; checking for tensor causes errors
Traceback (most recent call last):
  File "/usr/local/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/optimum/commands/optimum_cli.py", line 208, in main
    service.run()
  File "/usr/local/lib/python3.10/dist-packages/optimum/commands/export/onnx.py", line 265, in run
    main_export(
  File "/usr/local/lib/python3.10/dist-packages/optimum/exporters/onnx/__main__.py", line 374, in main_export
    onnx_export_from_model(
  File "/usr/local/lib/python3.10/dist-packages/optimum/exporters/onnx/convert.py", line 1188, in onnx_export_from_model
    _, onnx_outputs = export_models(
  File "/usr/local/lib/python3.10/dist-packages/optimum/exporters/onnx/convert.py", line 782, in export_models
    export(
  File "/usr/local/lib/python3.10/dist-packages/optimum/exporters/onnx/convert.py", line 887, in export
    export_output = export_pytorch(
  File "/usr/local/lib/python3.10/dist-packages/optimum/exporters/onnx/convert.py", line 583, in export_pytorch
    onnx_export(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 551, in export
    _export(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1648, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1174, in _model_to_graph
    graph = _optimize_graph(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 714, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1997, in _run_symbolic_function
    return symbolic_fn(graph_context, *inputs, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/symbolic_helper.py", line 292, in wrapper
    return fn(g, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/symbolic_opset14.py", line 177, in scaled_dot_product_attention
    query_scaled = g.op("Mul", query, g.op("Sqrt", scale))
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 93, in op
    return _add_op(self, opname, *raw_args, outputs=outputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 244, in _add_op
    inputs = [_const_if_tensor(graph_context, arg) for arg in args]
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 244, in <listcomp>
    inputs = [_const_if_tensor(graph_context, arg) for arg in args]
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 276, in _const_if_tensor
    return _add_op(graph_context, "onnx::Constant", value_z=arg)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 252, in _add_op
    node = _create_node(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 312, in _create_node
    _add_attribute(node, key, value, aten=aten)
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py", line 363, in _add_attribute
    return getattr(node, f"{kind}_")(name, value)
TypeError: z_(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C.Node, arg0: str, arg1: torch.Tensor) -> torch._C.Node

Invoked with: %427 : Tensor = onnx::Constant(), scope: transformers.models.granite.modeling_granite.GraniteForCausalLM::/transformers.models.granite.modeling_granite.GraniteModel::model/transformers.models.granite.modeling_granite.GraniteDecoderLayer::layers.0/transformers.models.granite.modeling_granite.GraniteSdpaAttention::self_attn
, 'value', 1.0 
(Occurred when translating scaled_dot_product_attention).

Investigating why 🤔

@xenova (Contributor) commented Oct 11, 2024

Turns out it is a known (and fixed) bug: pytorch/pytorch#135615

Upgrading to torch nightly fixes it 👍

@IlyasMoutawwakil (Member):

Upgrading to torch nightly fixes it 👍

Oooh great! It's the error I've been seeing when exporting CLIP with SDPA.

@gabe-l-hart (Contributor, Author):

Thanks for all the quick reviews! I realized I missed a lot of context on the issue itself, so I will update it with details of how I tested this locally.

@xenova (Contributor) commented Oct 12, 2024

and just to confirm, I tested the ONNX model with Transformers.js, and it matches the python version exactly 👍
This PR should be good to merge once we add the tiny random test (https://huggingface.co/hf-internal-testing/tiny-random-GraniteForCausalLM). Not exactly sure how to handle the minimum pytorch version though (cc @IlyasMoutawwakil)

@echarlaix (Collaborator):

and just to confirm, I tested the ONNX model with Transformers.js, and it matches the python version exactly 👍

Perfect, thanks a lot @xenova

This PR should be good to merge once we add the tiny random test (https://huggingface.co/hf-internal-testing/tiny-random-GraniteForCausalLM). Not exactly sure how to handle the minimum pytorch version though (cc @IlyasMoutawwakil)

For the minimum pytorch version, it can be set with MIN_TORCH_VERSION

MIN_TORCH_VERSION = version.parse("1.11")
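
For illustration, a model-specific export config could pin a newer minimum along these lines (a hypothetical sketch; the class name, base class, and version value are assumptions, not necessarily the code merged in this PR):

# Hypothetical sketch of pinning a minimum torch version for the Granite export config.
from packaging import version
from optimum.exporters.onnx.model_configs import LlamaOnnxConfig

class GraniteOnnxConfig(LlamaOnnxConfig):
    # torch < 2.5 hits the scaled_dot_product_attention export bug discussed above
    MIN_TORCH_VERSION = version.parse("2.5.0")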

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova (Contributor) commented Oct 22, 2024

@gabe-l-hart Let me know if you need any help with the final steps (as outlined above)! 🤗 PyTorch 2.5 was just released, which should fix the above issue (untested, but the previous nightly version did fix the bug).

@gabe-l-hart (Contributor, Author):

Thank you! I've been sidetracked on other threads, but hope to get back to this one later in the week.

@xenova (Contributor) commented Oct 22, 2024

No worries! I'm doing the conversions of the latest set of released models with this branch. Will report back when complete.

Also, looks like https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-instruct is tagged as granitemoe, which we can also do in this PR.

@xenova (Contributor) commented Oct 22, 2024

FYI: I can confirm that https://huggingface.co/onnx-community/granite-3.0-2b-instruct exported correctly and works in Transformers.js 👍

The MoE model is slightly more complicated, and throws an error when performing top_k on a tensor whose dim is < k. Looks like we need to insert a min op there.

gabe-l-hart added a commit to gabe-l-hart/optimum that referenced this pull request Oct 22, 2024
@gabe-l-hart (Contributor, Author):

Ok, I've added MIN_TORCH_VERSION and put "granite" into the mapping in exporters_utils.py.

The MoE model is slightly more complicated, and throws an error when performing top_k on a tensor whose dim is < k. Looks like we need to insert a min op there.

When I looked at this myself, I got stuck on the "more complicated" part and punted on it for this PR. Is adding this min op something you can point me in the right direction on?

@IlyasMoutawwakil (Member):

you can also add the architecture for testing with onnxruntime 😄

gabe-l-hart added a commit to gabe-l-hart/optimum that referenced this pull request Oct 25, 2024
@gabe-l-hart (Contributor, Author):

Thanks for pointing that out! I think I've got that added too. I tried running the onnxruntime tests locally but ran myself out of disk space (was already pushing it with so many models!)

@xenova (Contributor) left a review:

LGTM ✅

I'll do some final tests in my conversion script, and will merge soon. Thanks @gabe-l-hart 👍

@gabe-l-hart (Contributor, Author):

Thanks @xenova! Do you have any further insight on what we would need to add to support the MoE models? I haven't been able to dig deep on it and it looked like you had a thought around inserting a min operation somewhere.

@xenova xenova merged commit 7e8d857 into huggingface:main Oct 31, 2024
@gabe-l-hart gabe-l-hart deleted the OnnxGranite branch October 31, 2024 21:27
@xenova (Contributor) commented Oct 31, 2024

Thanks @gabe-l-hart for the PR! Everything looks good in my conversion script too (for Transformers.js models).

As for the MoE models, I think this would require a modification to the transformers modelling code (here I think), just ensuring that the top_k parameter is less than the available number of options. Here is the error message I get when exporting with the same config as granite:

Error: Non-zero status code returned while running TopK node. Name:'/model/layers.0/block_sparse_moe/router/TopK_1' Status Message: k argument [256] should not be greater than specified axis dim value [200]
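
The kind of fix being discussed might look roughly like this (a hedged sketch of the idea only, not the actual transformers GraniteMoe routing code; the function name is hypothetical):

import torch

def select_experts(router_logits: torch.Tensor, top_k: int):
    # Clamp k so that the exported TopK node never receives a k larger than the
    # axis size, which is what triggers the ONNX Runtime error above.
    k = min(top_k, router_logits.shape[-1])
    weights, selected_experts = torch.topk(router_logits, k, dim=-1)
    return weights, selected_experts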

@gabe-l-hart (Contributor, Author):

Thank you! I'll dig in and see if I can get the MoE ones working with tweaks in transformers

@xenova (Contributor) commented Oct 31, 2024

Great! The TopK operator in ONNX/ONNX Runtime is a bit stricter than PyTorch's implementation, which seems to be what's causing the error.

@xenova (Contributor) commented Oct 31, 2024

Here's a tiny random model which you can use for testing: https://huggingface.co/hf-internal-testing/tiny-random-GraniteMoeForCausalLM

@gabe-l-hart (Contributor, Author):

Awesome, thank you!

@IlyasMoutawwakil (Member) commented Nov 1, 2024

It seems that the granite tests with onnxruntime are not passing on main. Any idea why?

=========================== short test summary info ============================
FAILED onnxruntime/test_modeling.py::ORTModelForCausalLMIntegrationTest::test_compare_merged_and_not_merged_models_outputs_09_granite_True - RuntimeError: Error in execution: Non-zero status code returned while running Gather node. Name:'/model/embed_tokens/Gather' Status Message: indices element out of data bounds, idx=35790 must be within the inclusive range [-32000,31999]
FAILED onnxruntime/test_modeling.py::ORTModelForCausalLMIntegrationTest::test_pipeline_ort_model_18_granite_False - onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/model/embed_tokens/Gather' Status Message: indices element out of data bounds, idx=35790 must be within the inclusive range [-32000,31999]
FAILED onnxruntime/test_modeling.py::ORTModelForCausalLMIntegrationTest::test_pipeline_ort_model_19_granite_True - RuntimeError: Error in execution: Non-zero status code returned while running Gather node. Name:'/model/embed_tokens/Gather' Status Message: indices element out of data bounds, idx=35790 must be within the inclusive range [-32000,31999]
FAILED onnxruntime/test_modeling.py::ORTModelForCausalLMIntegrationTest::test_compare_with_and_without_past_key_values_09_granite - onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/model/embed_tokens/Gather' Status Message: indices element out of data bounds, idx=35790 must be within the inclusive range [-32000,31999]
==== 4 failed, 891 passed, 909 skipped, 3041 warnings in 2021.46s (0:33:41) ====

@xenova (Contributor) commented Nov 1, 2024

it seems that granite tests with onnxruntime are not passing on main, any idea why ?

Ah that's an issue with the tiny model. My bad! Will re-export with the correct vocab size (49152, not 32000)
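
For reference, a tiny checkpoint with a matching vocabulary could be regenerated along these lines (a hedged sketch; the hyperparameters and output path are illustrative, not the actual hf-internal-testing configuration):

from transformers import GraniteConfig, GraniteForCausalLM

config = GraniteConfig(
    vocab_size=49152,        # must cover every token id the tokenizer can emit
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
)
GraniteForCausalLM(config).save_pretrained("tiny-random-GraniteForCausalLM")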

@xenova (Contributor) commented Nov 1, 2024

Fixed 👍

@gabe-l-hart (Contributor, Author):

Thank you! I've been in-and-out today and hadn't had a chance to look into it. Appreciate the quick fix.

@gabe-l-hart (Contributor, Author) commented Nov 5, 2024

@xenova Can you share the export command you're running that produced this error?

I'm working on the MoE support and I'm not seeing a case where self.top_k is greater than the size of the given dimension. I've tried with both the tiny model and with granite-3.0-1b-a400m-instruct.

The command I'm running is:

optimum-cli export onnx --model /Users/ghart/models/granite-3.0-1b-a400m-instruct /Users/ghart/models/granite-3.0-1b-a400m-instruct-onnx/ --task text-generation-with-past

When I run this, I see that it completes successfully, but the verification step shows some huge diffs indicating that the converted model is not valid:

- logits: max diff = 36.944740295410156
- present.1.key: max diff = 17.91434097290039
- present.1.value: max diff = 4.968014717102051
- present.2.key: max diff = 26.278032302856445
- present.2.value: max diff = 4.0905561447143555
- present.3.key: max diff = 26.25764274597168
- present.3.value: max diff = 4.223917007446289
- present.4.key: max diff = 23.928207397460938
- present.4.value: max diff = 3.691723108291626
- present.5.key: max diff = 33.76817321777344
- present.5.value: max diff = 3.630763530731201
- present.6.key: max diff = 34.154476165771484
- present.6.value: max diff = 6.545462608337402
- present.7.key: max diff = 28.924449920654297
- present.7.value: max diff = 6.081088542938232
- present.8.key: max diff = 36.824974060058594
- present.8.value: max diff = 5.863564491271973
- present.9.key: max diff = 36.66212844848633
- present.9.value: max diff = 5.411710262298584
- present.10.key: max diff = 29.212482452392578
- present.10.value: max diff = 6.31791353225708
- present.11.key: max diff = 40.33808898925781
- present.11.value: max diff = 6.859192371368408
- present.12.key: max diff = 36.58208465576172
- present.12.value: max diff = 6.555930137634277
- present.13.key: max diff = 31.688520431518555
- present.13.value: max diff = 7.645174503326416
- present.14.key: max diff = 34.46550750732422
- present.14.value: max diff = 8.418960571289062
- present.15.key: max diff = 30.98021697998047
- present.15.value: max diff = 11.715303421020508
- present.16.key: max diff = 29.858247756958008
- present.16.value: max diff = 13.081101417541504
- present.17.key: max diff = 33.57026672363281
- present.17.value: max diff = 14.176477432250977
- present.18.key: max diff = 39.252296447753906
- present.18.value: max diff = 12.325156211853027
- present.19.key: max diff = 30.54557228088379
- present.19.value: max diff = 13.315317153930664
- present.20.key: max diff = 27.43828773498535
- present.20.value: max diff = 17.336055755615234
- present.21.key: max diff = 30.883085250854492
- present.21.value: max diff = 18.55562973022461
- present.22.key: max diff = 31.344200134277344
- present.22.value: max diff = 23.377347946166992
- present.23.key: max diff = 24.807292938232422
- present.23.value: max diff = 26.19655990600586.

@xenova (Contributor) commented Nov 6, 2024

@xenova Can you share the export command you're running that produced #2043 (comment)?

The error occurs during runtime 😅 ... the export process seems to produce a valid .onnx graph, but throws an error when actually running the model.
