
Commit

Update 2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md
tqchen authored Jun 17, 2024
1 parent c2ba8be commit 02e91d0
Showing 1 changed file with 1 addition and 1 deletion.
@@ -38,7 +38,7 @@ The architecture of WebLLM can be split into three parts, as shown in Figure 2.

**Web Workers** We hide all the heavy computation in the background thread through different kinds of Web Workers (e.g. Service Workers), simplifying web app development while ensuring a smooth user interface. Under the hood, the `ServiceWorkerMLCEngine` communicates with an internal `MLCEngine` in the worker thread via message-passing, forwarding the OpenAI-API request while getting responses streamed back. The `MLCEngine` loads the specified model, executes the WGSL kernels with WebGPU (which translates kernels to native GPU code), and runs non-kernel functions with WebAssembly. Everything happens inside the worker thread of the browser with near-native performance.
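
As a sketch of this split (assuming the `@mlc-ai/web-llm` package; the model id, file names, and registration details are illustrative):

```typescript
// sw.ts -- runs inside the Service Worker: hosts the real MLCEngine and
// answers the page's message-passing requests.
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler: ServiceWorkerMLCEngineHandler;
self.addEventListener("activate", () => {
  handler = new ServiceWorkerMLCEngineHandler();
});
```

```typescript
// main.ts -- runs on the page: ServiceWorkerMLCEngine is a thin proxy that
// forwards OpenAI-style requests to the worker and streams responses back.
import { CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";

await navigator.serviceWorker.register("/sw.js"); // register the worker first

const engine = await CreateServiceWorkerMLCEngine(
  "Llama-3-8B-Instruct-q4f16_1-MLC", // illustrative prebuilt model id
  { initProgressCallback: (p) => console.log(p.text) },
);
```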

- **Compile time** The model that the `MLCEngine` loads is compiled ahead of time and hosted online. We leverage MLC-LLM and TVM to compile any open-source model (e.g. Llama3, Phi3) into two components: converted/quantized model weights and a WASM library. The WASM library contains both compute kernels in WGSL (e.g. prefill, decode) and non-kernel functions in WebAssembly (e.g. BNFGrammar for JSON mode). WebLLM provides prebuilt models while allowing users to bring their own models. Note that the weights and wasm are downloaded once and cached locally.
+ **Compile time** The model that the `MLCEngine` loads is compiled ahead of time and hosted online. We leverage MLC-LLM and Apache TVM to compile any open-source model (e.g. Llama3, Phi3) into two components: converted/quantized model weights and a WASM library. The WASM library contains both compute kernels in WGSL (e.g. prefill, decode) and non-kernel functions in WebAssembly (e.g. BNFGrammar for JSON mode). WebLLM provides prebuilt models while allowing users to bring their own models. Note that the weights and wasm are downloaded once and cached locally.
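
To bring your own model, you host the converted weights and compiled WASM library yourself and register them through an app config. A minimal sketch, assuming recent `@mlc-ai/web-llm` field names; the URLs and ids are placeholders:

```typescript
import { CreateMLCEngine, type AppConfig } from "@mlc-ai/web-llm";

// Placeholder URLs: converted/quantized weights plus the WGSL+WASM library
// produced by MLC-LLM. Both are fetched on first use, then served from the
// local cache on later page loads.
const appConfig: AppConfig = {
  model_list: [
    {
      model: "https://example.com/my-model-q4f16_1-MLC", // weights (placeholder)
      model_id: "MyModel-q4f16_1-MLC",
      model_lib: "https://example.com/my-model-webgpu.wasm", // WASM library (placeholder)
    },
  ],
};

const engine = await CreateMLCEngine("MyModel-q4f16_1-MLC", { appConfig });
```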

## Use WebLLM via OpenAI-style API
As we build an in-browser inference engine, it is important to offer a set of APIs that developers already know and find easy to use. We therefore adopt the OpenAI-style API: developers can treat WebLLM as a drop-in substitute for the OpenAI API, but with any open-source model and fully local inference.
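
For example, a streaming chat completion mirrors its OpenAI counterpart. A minimal sketch, with an illustrative prebuilt model id:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

// Same request shape as openai.chat.completions.create, but runs locally:
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
  stream: true,
});
for await (const chunk of chunks) {
  // Log each streamed delta as it arrives from the local engine.
  console.log(chunk.choices[0]?.delta?.content ?? "");
}
```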
