From 02e91d0e525abb6f682f05d9e4fd2940b7130de1 Mon Sep 17 00:00:00 2001
From: Tianqi Chen
Date: Mon, 17 Jun 2024 08:30:27 -0400
Subject: [PATCH] Update
 2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md

---
 ...webllm-a-high-performance-in-browser-llm-inference-engine.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md b/_posts/2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md
index 5787b96..aa80f79 100644
--- a/_posts/2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md
+++ b/_posts/2024-06-13-webllm-a-high-performance-in-browser-llm-inference-engine.md
@@ -38,7 +38,7 @@ The architecture of WebLLM can be split into three parts, as shown in Figure 2.
 
 **Web Workers** We hide all the heavy computation in the background thread through different kinds of Web Workers (e.g. Service Workers), simplifying web app development while ensuring a smooth user interface. Under the hood, the `ServiceWorkerMLCEngine` communicates with an internal `MLCEngine` in the worker thread via message-passing, forwarding the OpenAI-API request while getting responses streamed back. The `MLCEngine` loads the specified model, executes the WGSL kernels with WebGPU (which translates kernels to native GPU code), and runs non-kernel functions with WebAssembly. Everything happens inside the worker thread of the browser with near-native performance.
 
-**Compile time** The model that the `MLCEngine` loads in is compiled ahead of time and hosted online. We leverage MLC-LLM and TVM to compile any open-source model (e.g. Llama3, Phi3) into two components: converted/quantized model weights and a WASM library. The WASM library contains both compute kernels in WGSL (e.g. prefill, decode) and non-kernel functions in WebAsembly (e.g. BNFGrammar for JSON mode). WebLLM provides prebuilt models while allowing users to bring their own models. Note that the weights and wasm are downloaded once and cached locally.
+**Compile time** The model that the `MLCEngine` loads in is compiled ahead of time and hosted online. We leverage MLC-LLM and Apache TVM to compile any open-source model (e.g. Llama3, Phi3) into two components: converted/quantized model weights and a WASM library. The WASM library contains both compute kernels in WGSL (e.g. prefill, decode) and non-kernel functions in WebAsembly (e.g. BNFGrammar for JSON mode). WebLLM provides prebuilt models while allowing users to bring their own models. Note that the weights and wasm are downloaded once and cached locally.
 
 ## Use WebLLM via OpenAI-style API
 As we build an in-browser inference engine, it is important to design a set of APIs that developers are familiar with and find easy to use. Thus, we choose to adopt OpenAI-style API. Developers can treat WebLLM as an in-place substitute for OpenAI API – but with any open source models with local inference.
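For context, the `ServiceWorkerMLCEngine` setup that the patched section describes splits into a worker-side handler and a page-side engine. Below is a minimal sketch, assuming the `@mlc-ai/web-llm` npm package and its `ServiceWorkerMLCEngineHandler` and `CreateServiceWorkerMLCEngine` exports; names, signatures, and the model id are illustrative and may differ across versions.

```typescript
// sw.ts — runs inside the service worker and hosts the internal MLCEngine.
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler: ServiceWorkerMLCEngineHandler;

self.addEventListener("activate", () => {
  // The handler receives OpenAI-style requests forwarded from the page
  // and streams responses back over the message channel.
  handler = new ServiceWorkerMLCEngineHandler();
});

// main.ts — runs on the UI thread; the heavy computation stays in the worker.
import { CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";

if ("serviceWorker" in navigator) {
  navigator.serviceWorker.register("/sw.js", { type: "module" });
}

// "Llama-3-8B-Instruct-q4f32_1-MLC" is an assumed prebuilt model id.
const engine = await CreateServiceWorkerMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");
```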
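The OpenAI-style API mentioned at the end of the section then reads as a drop-in for the OpenAI client. Another sketch under the same assumptions (the `CreateMLCEngine` factory and the prebuilt model id are taken from the WebLLM docs but may vary by version):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// The first call downloads the quantized weights and WASM library;
// subsequent loads reuse the local cache.
const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");

// Same request/response shape as OpenAI's chat completions API,
// but the model runs locally on WebGPU.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
});
console.log(reply.choices[0].message.content);
```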