
[JS/Web] External weights load #18535

Closed · wants to merge 1 commit

Conversation

@dakenf (Contributor) commented Nov 21, 2023

Description

A much cleaner approach to loading external weights for the wasm/webgpu providers, via ExecutionProviderOption.
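For context, a hypothetical sketch of how this could look from the user's side. The `externalWeights` field below is made up for illustration only; the real option shape proposed in this PR lives in the diff:

```ts
import * as ort from 'onnxruntime-web';

// Hypothetical sketch only: `externalWeights` is an illustrative name, not
// the actual field added by this PR - see the diff for the real option.
const weights = new Uint8Array(
  await (await fetch('model_weights.bin')).arrayBuffer(),
);

const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: [
    {
      name: 'webgpu',
      // Hand the raw weights blob to the EP instead of embedding it in the
      // protobuf, sidestepping protobuf's 2GB limit.
      externalWeights: weights,
    } as any, // cast because the illustrative field is not in the typings
  ],
});
```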

@fs-eire (Contributor) commented Nov 22, 2023

Thanks very much for the PR. External weights support is one of the most important features for ort-web to support large models. However, it seems that this change still cannot break the 4GB memory limit of wasm32. @guschmue is working on support in ORT core for loading models > 4GB under wasm32 by not reading the data into WebAssembly memory at all - still WIP.

For webgpu, the 4GB memory size is a hard limit of wasm32; to break it, a change in ORT core is required to skip loading the weights into wasm memory and instead create GPU buffers directly.

@dakenf (Contributor, Author) commented Nov 22, 2023

Well, since it mmaps chunks of the weights file into memory for each layer and then frees them, it might be possible to fit more than 4GB with 32-bit. I've loaded a 2.2GB latent consistency model and run it without any out-of-memory issues on a wasm32 build. However, I got NaN as a result, but that's most likely because the new code is not compatible with the Tensor class from transformers.js. Will do some more tests closer to the weekend.

@guschmue (Contributor):

Supporting the external data format is super high on our list.
We are not 100% sure if we want to use FS or if we want to pass in a dict from JS.
But OK, this sure is the easiest way to do it since we don't need to change ORT core.
Thinking ...

@guschmue (Contributor) commented Nov 28, 2023

Should add - we are thinking about whether we can pass the weights by reference so they don't get copied into the wasm heap until needed. ORT would call the EP to put them into the right place, so webgpu could copy them directly from JS to the GPU.
That would happen in SessionFinalize(), when the graph is assigned to the EP.
It needs some changes in ORT - ORT is looking at the data for some reason, and we need to make it not do that.
The point would be to not require so much space on the wasm heap and to push out the need for wasm64 for some time.
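To make the idea concrete, here is a rough sketch - every name in it (`weightMap`, `loadChunk`, `onRequestWeight`, `gpuWrite`) is hypothetical, invented only to illustrate the "weights by reference" flow, not a real ort-web hook:

```ts
// Hypothetical sketch of the "pass weights by reference" idea. None of
// these names exist in ort-web; they only illustrate the data flow.
const weightMap = new Map<string, Uint8Array>();

async function loadChunk(url: string): Promise<Uint8Array> {
  return new Uint8Array(await (await fetch(url)).arrayBuffer());
}

weightMap.set('unet.down.0.weight', await loadChunk('w0.bin'));

// Imagined callback: during SessionFinalize, ORT asks the JS side for each
// initializer by name, and the EP writes it straight into a GPU buffer -
// the bytes never land on the wasm heap.
function onRequestWeight(name: string, gpuWrite: (data: Uint8Array) => void) {
  const data = weightMap.get(name);
  if (!data) throw new Error(`missing external weight: ${name}`);
  gpuWrite(data); // e.g. device.queue.writeBuffer under the hood
  weightMap.delete(name); // drop the JS copy once it lives on the device
}
```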

@fs-eire (Contributor) commented Nov 28, 2023

As per our discussion in the team meeting, using MEMFS is an option. However, we need a build flag to allow this feature to be enabled/disabled. Please add a build-time flag --enable_wasm_memfs, defaulting to false, in tools\ci_build\build.py, add the corresponding definition -Donnxruntime_ENABLE_WEBASSEMBLY_MEMFS there as well, and use it in cmake\onnxruntime_webassembly.cmake.

If the flag is ON, ort-web should be able to work with external data via MEMFS; otherwise, ort-web should be unchanged (no unnecessary FS modules or extra unused JS code).

I will later review the rest of the changes in this PR.

@QimingZheng:
> As per our discussion in the team meeting, using MEMFS is an option. […]

Hey fs-eire, do you mean MEMFS will be enabled in a future release to support large models? Does the onnxruntime-js team have any plan for this?

@fs-eire (Contributor) commented Nov 30, 2023

> Hey fs-eire, do you mean MEMFS will be enabled in a future release to support large models? Does the onnxruntime-js team have any plan for this?

Before answering this question, let's take a look at the problems we are trying to resolve.

Problem 1: Large models (> 2GB) do not work in ONNX Runtime. This is because of the 2GB hard limit of protobuf, the format the ONNX model uses. To resolve this problem, the ONNX Runtime team introduced the external data feature: a raw data file containing the model weights, which can be very large, plus a corresponding ONNX model that refers to the weights by offset and length within that raw data file. This works in ONNX Runtime, but when it comes to WebAssembly, two new problems appear:

Problem 2: An incompatible file system API is used by the external data feature. ONNX Runtime uses a synchronous file I/O API to read the external data when initializing the model, but on the web we don't have any sync I/O APIs. This is the technical blocker for ONNX Runtime Web using the external data feature without modifying anything. Emscripten, the WebAssembly C++ compiler, offers the MEMFS utility to simulate an in-memory file system, so the synchronous file read API works against pre-loaded data. This is what this PR resolves.
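For illustration, preloading data into MEMFS from the JS side looks roughly like this. `FS.writeFile` is Emscripten's real file system API; the `Module` declaration and the file name are assumptions for this sketch:

```ts
// Sketch: stage the external weights in Emscripten's in-memory file system
// so that ORT's synchronous file reads find them during model init.
// Assumes the wasm build exposes the FS runtime object on `Module`.
declare const Module: {
  FS: { writeFile(path: string, data: Uint8Array): void };
};

const bytes = new Uint8Array(
  await (await fetch('model.onnx_data')).arrayBuffer(),
);

// Write under the same relative path the .onnx file references for its
// external data, so the synchronous read inside wasm resolves it.
Module.FS.writeFile('model.onnx_data', bytes);
```

Note the cost this implies: the weights pass through (and occupy) the wasm heap, which is why this only extends the reachable model size rather than removing the wasm32 ceiling.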

Problem 3: The 4GB hard limit of the wasm32 memory space. Because wasm32 uses 32-bit pointers, the memory space can be no more than 4GB. There are two ways to resolve this problem:

  • Use WASM64, the 64-bit WebAssembly proposal. However, there are a few concerns with WASM64:
    • standard and browser support is not yet ready
    • memory usage will be huge - it really needs a lot of memory to work. ORT is known to consume 2x~3x the model size in memory during the initialization phase.
    • one more variant of the ort-web wasm artifacts to publish (we still need wasm32)
  • Do not load the weights into WebAssembly memory at all; load them into WebGPU or WebNN directly. As long as they don't get loaded into the WebAssembly memory space, the 4GB limit does not apply. This solution does not work for the CPU EP, but we assume that a GB-scale model on pure CPU is hardly a real use case. @guschmue is now investigating this solution (see the sketch after this list).
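A minimal sketch of that second option using the standard WebGPU API - the URL is made up, and it only illustrates that the bytes flow fetch → JS ArrayBuffer → GPU buffer without ever entering WebAssembly memory:

```ts
// Sketch: upload one weight blob straight from JS to the GPU, bypassing the
// wasm heap entirely (standard WebGPU API; float32 weights assumed, so the
// byte length is already 4-byte aligned as writeBuffer requires).
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not available');
const device = await adapter.requestDevice();

const weightBytes = new Uint8Array(
  await (await fetch('unet_block0.bin')).arrayBuffer(),
);

const gpuBuffer = device.createBuffer({
  size: weightBytes.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(gpuBuffer, 0, weightBytes);
// The 4GB wasm32 limit never applies: the data only lived in JS and VRAM.
```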

Now let's answer your question. We are not sure whether MEMFS will be enabled by default in the future, but we will keep the ability to build from source with this feature enabled or disabled. It resolves problem 2, which extends the model size supported by ort-web from 2GB to ~4GB, but still does not work for models > 4GB. Personally, I don't want to add too many things to ort-web, as the artifacts are already very large and our CI pipeline takes more and more time to build. So we will think carefully about whether features like MEMFS and WASM64 should be enabled. We will try to figure out the answer eventually, when a real use of a huge model in the browser is born (instead of a very "cool" demo). Until then we keep looking and keep things under control.

@QimingZheng commented Nov 30, 2023

> Before answering this question, let's take a look at the problems we are trying to resolve. […]

Thank you for such a detailed explanation!

Being able to support models between 2GB and 4GB can unblock many stable diffusion models on the web, I believe (e.g. https://huggingface.co/runwayml/stable-diffusion-v1-5). I think this is a "cool" and real huge model that is worth attention. Thank you anyway; if there are plans to support it, I'm happy to help by building an SD demo (currently blocked by the large-model-size issue - the unet model is about 3.4 GB, which falls perfectly into the 2~4GB range).

@dakenf (Contributor, Author) commented Nov 30, 2023

> Being able to support models between 2GB and 4GB can unblock many stable diffusion models on the web, I believe […]

It is already done here: https://islamov.ai/diffusers.js/
64-bit support and external data loading are working as a proof of concept, and we just need to figure out the best way to move the changes to the main repo.

I think the best way would be to support wasm32 with some way to load weights directly into WebGPU memory, because:

  1. WebKit does not support wasm64 at all, but I've heard rumors they will bring WebGPU back to Safari with the visionOS release. And since the iPhone 15 can run huge games, it would be possible to run a diffusion model or LLM in a mobile browser.
  2. wasm64 seems to be slower than wasm32.

@QimingZheng:
> It is already done here: https://islamov.ai/diffusers.js/

Yeah, I also noticed that project before (just found out you're the author), but it is based on a modified version of onnxruntime-js; looking forward to seeing official support for large models ;)

@guschmue (Contributor) commented Dec 5, 2023

We totally agree that we need the external data format, and a way to deal with models > 4GB after that.

Using FS is an easy way to add support for the external data format, but longer term we want to be able to pass in a dictionary with the external data. The reason is that with FS the data goes through the wasm heap, while with the dictionary we think we can add some method for an EP to copy the data directly from the JS heap to the device without going through the wasm heap.

We are thinking of merging this PR but making it a build option, and in the very near term adding support for the dictionary, followed by the wasm heap bypass. The latter needs some changes in onnxruntime, but we think we can make that work.
Adding the wasm heap bypass would save lots of memory and get us much more mileage out of wasm32.

@fs-eire (Contributor) commented Jan 17, 2024

External data support is implemented in #19087 and merged into the main branch as a replacement for this PR.
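For readers landing here later: recent ort-web releases expose the merged feature through an `externalData` session option. This is a hedged sketch of my understanding of that API; verify the exact shape against the ort-web docs for your version:

```ts
import * as ort from 'onnxruntime-web';

// Sketch of the externalData session option added by #19087 (field names
// as I understand recent ort-web releases - check the docs to confirm).
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['wasm'],
  externalData: [
    // `data` can be a URL (or a Uint8Array); `path` must match the relative
    // path recorded in the .onnx model's external-data references.
    { data: 'model.onnx_data', path: 'model.onnx_data' },
  ],
});
```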

@fs-eire closed this Jan 17, 2024