diff --git a/DEVELOPERS.md b/DEVELOPERS.md
deleted file mode 100644
index 9e2bb1f..0000000
--- a/DEVELOPERS.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# Developers
-
-This note is for developers who want to contribute to the gpu.cpp library.
-
-## Design Objectives
-
-1. Maximal leverage. Maximize the space of implementations that this
-   library is useful for with the least amount of implementation complexity.
-
-2. Minimize integration complexity. Whereas the common integration pattern for
-   custom low-level GPU algorithm code is to integrate it into an existing
-   engine (e.g. an inference runtime, or a compiler), the objective of gpu.cpp
-   is to enable adding GPU computation code inside your own project with a
-   minimal amount of integration complexity.
-
-3. High ceiling on low-level control.
-   - Direct control of on-device GPU code unconstrained by a fixed set of ops
-   - Direct control of on-device GPU memory management
-
-## Separating Resource Acquisition and Dispatch
-
-We can think of the use of gpu.cpp as GPU resources, modeled by the type
-definitions of the library, and actions on GPU resources, modeled by the
-functions of the library.
-
-The key functions can be further subdivided into two categories in relation to
-when the GPU computation occurs:
-
-1) Ahead-of-time preparation of resources and state: these are functions that
-   acquire resources and prepare state for GPU computation. These are less
-   performance critical.
-
-2) Performance-critical dispatch of GPU computation: these are functions that
-   dispatch GPU computation to the GPU, usually in a tight hot-path loop.
-
-This pattern is different from non-performance-critical application code, where
-resource acquisition is often interleaved with computation throughout the
-program execution.
-
-This is the pattern for performance-critical GPU computation that gpu.cpp is
-intended for. Some example use cases that fit this are custom neural network
-inference engines, render loops, simulation loops, etc.
-
-We'll see how the functions and types of the library are organized around these
-two types of actions.
-
-## Resource Type Definitions and Acquisition
-
-The main resources are:
-
-- `GPUContext` - the state of resources for interacting with the GPU.
-- `GPUTensor` - a buffer of data on the GPU.
-- `ShaderCode` - the code for a shader program that can be dispatched to the
-  GPU. This is a thin wrapper around a WGSL string but also includes the
-  workgroup size the code is designed to run with.
-- `Kernel` - a GPU program that can be dispatched to the GPU. This accepts a
-  `ShaderCode` and a list of `GPUTensor` resources to bind for the dispatch
-  computation.
-- `MultiKernel` - a collection of kernels that can be dispatched to the GPU.
-
-Resources are acquired using the `Create` functions. These are assumed to be
-ahead-of-time and not performance critical.
-
-- `GPUContext CreateGPUContext(...)` - creates a GPU context.
-- `GPUTensor CreateTensor(...)` - creates and allocates a buffer for a tensor
-  on the GPU.
-- `Kernel CreateKernel(...)` - creates and prepares a kernel on the GPU,
-  including underlying GPU buffer data bindings and the compute pipeline for
-  the shader code.
-- `MultiKernel CreateMultiKernel(...)` - same as `CreateKernel`, but for
-  multiple kernels to be dispatched together.
-
-There are a few supporting types in addition to these. `Shape` is a simple type
-to specify the shape of a tensor. `KernelDesc` and `MultiKernelDesc` are
-effectively.
-`TensorPool` manages `GPUTensor` resources and is used as context
-for allocating and deallocating tensor data on the GPU. In practice
-`TensorPool` is managed as a member variable of `GPUContext`.
-
-## Dispatching GPU Computation
-
-GPU computation is launched using the `Dispatch` functions. These are assumed
-to be performance critical.
-
-- `void DispatchKernel(...)` - dispatches a single kernel to the GPU.
-- `void DispatchMultiKernel(...)` - dispatches multiple kernels to the GPU.
diff --git a/Makefile b/Makefile
index 9aaaed0..9890711 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 NUM_JOBS=$(shell nproc)
 CXX=clang++
-.PHONY: default examples/hello_world/build/hello_world tests libgpu debug build check-entr check-clang clean-build clean all watch-tests
+.PHONY: default examples/hello_world/build/hello_world tests libgpu debug build check-entr check-clang clean-build clean all watch-tests docs
 GPUCPP ?= $(PWD)
 LIBDIR ?= $(GPUCPP)/third_party/lib
@@ -24,7 +24,7 @@ all: dawnlib check-clang check-linux-vulkan
 	cd examples/physics && make build/physics
 	cd examples/render && make build/render

-docs:
+docs: Doxyfile
 	doxygen Doxyfile

################################################################################
diff --git a/README.md b/README.md
index 38fe336..d373536 100644
--- a/README.md
+++ b/README.md
@@ -8,9 +8,77 @@
GPU code in C++ projects and have it run on Nvidia, Intel, AMD, and other GPUs.
The same C++ code can work on a wide variety of laptops, workstations, mobile
devices or virtually any hardware with Vulkan, Metal, or DirectX support.

-## Hello World: A GELU Kernel
+## Technical Objectives: Lightweight, Fast Iteration, and Low Boilerplate
+
+With gpu.cpp we want to enable a high-leverage library for individual developers and researchers to incorporate GPU computation into programs, relying on nothing more than a standard C++ compiler as tooling. Our goals are:
+
+- High power-to-weight ratio API: provide the smallest API surface area that can cover the full range of GPU compute needs.
+- Fast compile/run cycles: ensure projects can build nearly instantaneously; compile/run cycles should be <5 seconds on a modern laptop.
+- Minimal dependencies and tooling overhead: a standard clang C++ compiler should be enough, with no external library dependencies beyond the WebGPU native implementation.
+
+The implementation aims for a small API surface area with minimal boilerplate. There are a small number of library operations that carry out a broad range of low-level GPU operations. We avoid abstractions that add layers of indirection, keeping the mapping between the gpu.cpp library and the raw WebGPU API clear when it's needed.
+
+In this spirit of fast experimentation, we also want near-instantaneous C++ builds, taking no more than a second or two even on modestly capable personal computing devices. With this in mind, we not only keep the API surface area small but also keep the implementation small, and we provide a prebuilt binary of the Dawn native WebGPU implementation.
+
+The core library implementation in the header-only `gpu.h` source code is around 1000 lines of code. In addition to enabling instantaneous, semi-interactive compilation cycles, the small implementation surface area keeps the maintenance burden low and the velocity of improvements high.
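+
+As a concrete sketch of what "a standard C++ compiler as tooling" means, a project can be as small as a single source file including `gpu.h`. The compile command shown in the comment below is adapted from the FLAGS in examples/hello_world/Makefile (the Makefile is the source of truth; paths and flags may differ on your system):
+
+```
+// main.cpp - minimal gpu.cpp program; compile with something like:
+//   clang++ -stdlib=libc++ -std=c++17 -I$(GPUCPP) -I$(GPUCPP)/third_party/headers \
+//       -L$(GPUCPP)/third_party/lib main.cpp -ldl -ldawn
+#include <cstdio>
+#include "gpu.h"
+
+int main() {
+  gpu::Context ctx = gpu::createContext(); // acquire the GPU device context
+  printf("Hello gpu.cpp - context created.\n");
+  return 0;
+}
+```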
+We also pre-build Google's Dawn WebGPU implementation as a shared library binary. This allows builds to link against the shared library and incorporate Google's powerful native WebGPU implementation without paying the cost of re-compiling Dawn during development cycles.
+
+For more advanced users and release deployments, we include `cmake` examples for building both Dawn and gpu.cpp end-to-end, but this is neither required nor recommended for most users getting started.
+
+## Quick Start: Building and Running
+
+To build a gpu.cpp project, you will need to have installed on your system:
+
+- the `clang++` compiler, with support for C++17.
+- `python3`, to run the script which downloads the Dawn shared library.
+- `make`, to build the project.
+- Only on Linux systems - Vulkan drivers. If Vulkan is not installed, you can run `sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools` to install them.
-As a real-world test case, we start with an example of a GPU kernel from neural networks.
+
+The only library dependency of gpu.cpp is a WebGPU implementation. Currently we support the Dawn native backend, but we plan to support other targets and WebGPU implementations (web browsers, or other native implementations such as wgpu). We currently support MacOS, Linux, and Windows (via WSL).
+
+Optionally, Dawn can be built from scratch together with gpu.cpp using the cmake build scripts provided - see the `-cmake` targets in the Makefile. However, this is recommended for advanced users only: building the Dawn dependency with cmake takes much longer than using the precompiled Dawn shared library.
+
+After cloning the repo, from the top-level gpu.cpp directory, you should be able to build and run the hello world GELU example by typing:
+
+```
+make
+```
+
+The first time you build and run the project this way, it will automatically download a prebuilt shared library for the Dawn native WebGPU implementation (using the setup.py script). This places the Dawn shared library in the third_party/lib directory. Afterwards you should see `libdawn.dylib` on MacOS or `libdawn.so` on Linux. This download only occurs once.
+
+The build process itself should take a few seconds. If the build and execution are successful, you should see the output of the GELU computation:
+
+```
+Hello gpu.cpp!
+--------------
+
+ gelu(0.00) = 0.00
+ gelu(0.10) = 0.05
+ gelu(0.20) = 0.12
+ gelu(0.30) = 0.19
+ gelu(0.40) = 0.26
+ gelu(0.50) = 0.35
+ gelu(0.60) = 0.44
+ gelu(0.70) = 0.53
+ gelu(0.80) = 0.63
+ gelu(0.90) = 0.73
+ gelu(1.00) = 0.84
+ gelu(1.10) = 0.95
+ ...
+
+Computed 10000 values of GELU(x)
+```
+
+If you need to clean up the build artifacts, you can run:
+
+```
+make clean
+```
+
+## Hello World Tutorial: A GELU Kernel
+
+As a real-world example for how to use gpu.cpp, let's start with a practical-but-simple example of a GPU kernel from neural networks.

GELU is a non-linear, embarrassingly parallel operation often used in modern large language model transformer-based architectures.

@@ -79,69 +147,41 @@ int main(int argc, char **argv) {

Here we see the GPU code is quoted in a domain specific language called WGSL (WebGPU Shading Language). In a larger project, you might store this code in a separate file to be loaded at runtime (see [examples/shadertui](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/shadertui) for a demonstration of live WGSL code re-loading).
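+
+As a rough sketch of that file-loading pattern (the `loadWGSL` helper below is illustrative only and not part of the gpu.cpp API):
+
+```
+#include <fstream>
+#include <sstream>
+#include <string>
+
+// Illustrative helper, not part of gpu.h: read a WGSL source file into a
+// string at runtime so shaders can be edited without recompiling the host.
+std::string loadWGSL(const std::string &path) {
+  std::ifstream file(path);
+  std::stringstream buffer;
+  buffer << file.rdbuf(); // slurp the whole file
+  return buffer.str();
+}
+```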
-The CPU code in main() sets up the host coordination for the GPU computation. The ahead-of-time resource acquisition functions are prefaced with `create*`, such as:
-
-- `createContext()` - constructs a reference to the GPU device context.
-- `createTensor()` - constructs a contiguous buffer on the GPU.
-- `createShader()` - constructs WGSL code to run on the GPU.
-- `createKernel()` - constructs resources for the GPU computation, which combines bindings to GPU buffers from `createTensor()` with the computation definition from `createShader()`.
-
-In this example, the GELU computation is performed only once and the program immediately exits so preparing resources and dispatch are side-by-side. Other examples in the [examples/](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/) directory illustrate how resource acquisition is prepared ahead of time and dispatch occurs in the hot path like a render, model inference, or simulation loop.
-
-The dispatch occurs asynchronously via the `dispatchKernel()` invocation. Each dispatch is associated with a promise and a future so that `wait()` can block until the GPU computation is complete. `toCPU()` moves data from the GPU to CPU, while `toGPU()` (not needed in this example since the data movement is handled when the `Tensor` is initialized) is used when the converse data movement to the GPU is needed.
+The CPU code in main() sets up the host coordination for the GPU computation.
+We can think of the use of the gpu.cpp library as a collection of GPU nouns and
+verbs.
-
-This example is available in [examples/hello_world/run.cpp](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/hello_world/run.cpp).
-
-## Quick Start: Building and Running
-
-To build a gpu.cpp project, you will need to have installed on your system:
-
-- `clang++` compiler installed with support for C++17.
-- `python3` and above, to run the script which downloads the Dawn shared library.
-- `make` to build the project.
-- Only on Linux systems - Vulkan drivers. If Vulkan is not installed, you can run `sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools` to install them.
-
-The only library dependency of gpu.cpp is a WebGPU implementation. Currently we support the Dawn native backend, but we plan to support other targets and WebGPU implementations (web browsers or other native implementations such as wgpu). Currently we support MacOS, Linux, and Windows (via WSL).
-
-Optionally, Dawn can be built from scratch with gpu.cpp using the cmake build scripts provided - see the -cmake targets in the Makefile. However, this is recommended for advanced users only. Building Dawn dependencies with cmake takes much longer than using the precompiled Dawn shared library.
-
-After cloning the repo, from the top-level gpu.cpp, you should be able to build and run the hello world GELU example by typing:
+The "nouns" are GPU resources modeled by the type definitions of the library,
+and the "verbs" are actions on GPU resources, modeled by the functions of the
+library. The ahead-of-time resource acquisition functions are prefaced with
+`create*` (see the sketch just after this list), such as:
-
-```
-make
-```
+
+- `createContext()` - constructs a reference to the GPU device context (`Context`).
+- `createTensor()` - acquires a contiguous buffer on the GPU (`Tensor`).
+- `createShader()` - constructs a WGSL code string to run on the GPU (`ShaderCode`).
+- `createKernel()` - constructs a handle to resources for the GPU computation (`Kernel`), combining bindings to GPU buffers from `createTensor()` with the computation definition from `createShader()`.
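+
+For illustration, here is roughly what the ahead-of-time setup looks like when these pieces are put together. This is a condensed sketch of [examples/hello_world/run.cpp](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/hello_world/run.cpp); it assumes `kGelu` is defined as the WGSL string shown above, and `gpu.h` is the source of truth for exact signatures:
+
+```
+#include <array>
+#include "gpu.h" // header-only; assumes kGelu holds the WGSL shown above
+
+using namespace gpu;
+
+int main() {
+  static constexpr size_t N = 3072;
+  std::array<float, N> inputArr;
+  for (size_t i = 0; i < N; ++i) {
+    inputArr[i] = static_cast<float>(i) / 10.0f; // dummy input data
+  }
+  Context ctx = createContext();
+  // createTensor copies inputArr to the GPU here - no separate toGPU() needed
+  Tensor input = createTensor(ctx, Shape{N}, kf32, inputArr.data());
+  Tensor output = createTensor(ctx, Shape{N}, kf32);
+  // Bind the two buffers and the shader into a dispatchable kernel
+  Kernel op = createKernel(ctx, createShader(kGelu, 256, kf32),
+                           Bindings{input, output}, /* nthreads */ {N, 1, 1});
+  // ... dispatch and readback are covered below ...
+  return 0;
+}
+```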
-The first time you build and run the project this way, it will download a prebuilt shared library for the Dawn native WebGPU implementation automatically (using the setup.py script). This places the Dawn shared library in the third_party/lib directory. Afterwards you should see `libdawn.dylib` on MacOS or `libdawn.so` on Linux. This download only occurs once.
+These resource acquisition functions are tied to resource types for interacting with the GPU:
-
-The build process itself should take a few seconds. If the build and execution are successful, you should see the output of the GELU computation:
-
-```
-Hello gpu.cpp!
---------------
+
+- `Context` - a handle to the state of resources for interacting with the GPU device.
+- `Tensor` - a buffer of data on the GPU.
+- `ShaderCode` - the code for a shader program that can be dispatched to the
+  GPU. This is a thin wrapper around a WGSL string but also includes the
+  workgroup size the code is designed to run with.
+- `Kernel` - a GPU program that can be dispatched to the GPU. This accepts a
+  `ShaderCode` and a list of `Tensor` resources to bind for the dispatch
+  computation. Its `Bindings` argument is a list of `Tensor` instances that should map to the bindings declared at the top of the WGSL code. In this example there are two bindings, corresponding to the `input` buffer on the GPU and the `output` buffer on the GPU.
-
- gelu(0.00) = 0.00
- gelu(0.10) = 0.05
- gelu(0.20) = 0.12
- gelu(0.30) = 0.19
- gelu(0.40) = 0.26
- gelu(0.50) = 0.35
- gelu(0.60) = 0.44
- gelu(0.70) = 0.53
- gelu(0.80) = 0.63
- gelu(0.90) = 0.73
- gelu(1.00) = 0.84
- gelu(1.10) = 0.95
- ...
+
+In this example, the GELU computation is performed only once and the program immediately exits, so preparing resources and dispatch sit side by side. Other examples in the [examples/](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/) directory illustrate how resource acquisition is prepared ahead of time and dispatch occurs in the hot path, like a render, model inference, or simulation loop.
-
-Computed 10000 values of GELU(x)
-```
+
+Besides the `create*` resource acquisition functions, there are a few more "verbs" in the gpu.cpp library for dispatching execution to the GPU and handling data movement (sketched below):
-
-If you need to clean up the build artifacts, you can run:
+
+- `dispatchKernel()` - dispatches a `Kernel` to the GPU for computation. This is an asynchronous operation that returns immediately.
+- `wait()` - blocks until the GPU computation is complete. This is a standard C++ future/promise pattern.
+- `toCPU()` - moves data from the GPU to the CPU. This is a synchronous operation that blocks until the data is copied.
+- `toGPU()` - moves data from the CPU to the GPU. This is a synchronous operation that blocks until the data is copied. In this particular example, `toGPU()` is not used because the program's only CPU-to-GPU data movement happens when `createTensor()` is called.
-
-```
-make clean
-```
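+
+Continuing the setup sketch above (again condensed from [examples/hello_world/run.cpp](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/hello_world/run.cpp); exact signatures live in `gpu.h`), the dispatch half pairs each dispatch with a promise/future:
+
+```
+  // Replaces the "dispatch and readback" placeholder in the setup sketch;
+  // add <future> and <cstdio> to the includes.
+  std::array<float, N> outputArr;
+  std::promise<void> promise;
+  std::future<void> future = promise.get_future();
+  dispatchKernel(ctx, op, promise); // asynchronous - returns immediately
+  wait(ctx, future);                // block until the GPU signals completion
+  toCPU(ctx, output, outputArr.data(), sizeof(outputArr)); // GPU -> CPU copy
+  printf("gelu(%.2f) = %.2f\n", inputArr[0], outputArr[0]);
+```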
+
+This example is available in [examples/hello_world/run.cpp](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/hello_world/run.cpp).

## Other Examples: Matrix Multiplication, Physics Sim, and SDF Rendering

@@ -168,10 +208,6 @@ Interestingly, with a starting example, LLMs such as Claude 3.5 Sonnet can be qu

shadertui example animated gif

-## Troubleshooting
-
-If you run into issues building the project, please open an issue.
-
## Who is gpu.cpp for?

gpu.cpp is aimed at enabling projects requiring portable on-device GPU computation with minimal implementation complexity and friction. Some example use cases are:

@@ -194,21 +230,6 @@ We want to make it easier for a broader range of projects to harness the power o

gpu.cpp lets us implement and drop in any algorithm with fine-grained control of data movement and GPU code, and explore outside the boundaries of what is supported by existing production-oriented inference runtimes. At the same time, we can write code that is portable and immediately usable on a wide variety of GPU vendors and compute form factors - workstations, laptops, mobile, or even emerging hardware platforms such as AR/VR and robotics.

-## Technical Objectives: Lightweight, Fast Iteration, and Low Boilerplate
-
-With gpu.cpp we want to enable a high-leverage library for individual developers and researchers to incorporate GPU computation into programs relying on nothing more than a standard C++ compiler as tooling. Our goals are:
-
-- High power-to-weight ratio API: Provide the smallest API surface area that can cover the full range of GPU compute needs.
-- Fast compile/run cycles: Ensure projects can build nearly instantaneously; compile/run cycles should be <5 seconds on a modern laptop.
-- Minimal dependencies and tooling overhead: A standard clang C++ compiler should be enough, with no external library dependencies beyond the WebGPU native implementation.
-
-The implementation aims for a small API surface area with minimal boilerplate. There are a small number (about a dozen) of library operations to carry out a broad range of low-level GPU operations. We avoid abstractions that add layers of indirection, keeping the mapping between the gpu.cpp library and the raw WebGPU API clear when it's needed.
-
-In this spirit of lightweight experimentation, we also want fast iteration - instantaneous C++ builds taking no more than a second or two even on modestly capable personal computing devices. With this in mind, we not only keep the API surface area small but also keep the implementation small, and we provide a prebuilt binary of the Dawn native WebGPU implementation.
-
-The core library implementation in the header-only `gpu.h` source code is around 1000 lines of code. In addition to enabling instantaneous, semi-interactive compilation cycles, the small implementation surface area keeps the maintenance burden low and the velocity of improvements high. We also pre-build Google's Dawn WebGPU implementation as a shared library binary. This allows builds to link against the shared library and incorporate Google's powerful native WebGPU implementation without paying the cost of re-compiling Dawn during development cycles. For more advanced users and release deployments, we include `cmake` examples for building both Dawn and gpu.cpp end-to-end.
-
## What gpu.cpp is not

gpu.cpp is meant for developers with some familiarity with C++ and GPU programming. It is not a high-level numerical computing or machine learning framework or inference engine, though it can be used in support of such implementations.

@@ -229,6 +250,10 @@ Finally, the focus of gpu.cpp is general-purpose GPU computation rather than ren

*More Use Case Examples and Tests* - Expect an iteration loop of use cases to design tweaks and improvements, which in turn make the use cases cleaner and easier to write. One short-term use case is to flesh out the kernels from [llm.c](https://github.com/karpathy/llm.c) in WebGPU form.
As these mature into a reusable kernel library, we hope to help realize the potential for WebGPU compute in AI.

+## Troubleshooting
+
+If you run into issues building the project, please open an issue.
+
## Acknowledgements

gpu.cpp makes use of:
diff --git a/examples/hello_world/Makefile b/examples/hello_world/Makefile
index 90cd367..b5b1eed 100644
--- a/examples/hello_world/Makefile
+++ b/examples/hello_world/Makefile
@@ -6,11 +6,26 @@ NUM_JOBS?=$(shell nproc)

FLAGS=-stdlib=libc++ -std=c++17 -I$(GPUCPP) -I$(GPUCPP)/third_party/headers -L$(GPUCPP)/third_party/lib run.cpp -ldl -ldawn
TARGET=hello_world

-run: ./build/$(TARGET)
+run: ./build/$(TARGET) dawnlib
	$(LIBSPEC) && ./build/$(TARGET)

+# Download the prebuilt libdawn shared library via setup.py if it is not already present
+dawnlib: $(if $(wildcard $(GPUCPP)/third_party/lib/libdawn.so $(GPUCPP)/third_party/lib/libdawn.dylib),,run_setup)
+
+run_setup: check-python
+	cd $(GPUCPP) && python3 setup.py
+
+all: dawnlib check-clang check-linux-vulkan
+	cd examples/gpu_puzzles && make build/gpu_puzzles
+	cd examples/hello_world && make build/hello_world
+	cd examples/matmul && make build/mm
+	cd examples/physics && make build/physics
+	cd examples/render && make build/render
+
build/$(TARGET): run.cpp
	mkdir -p build && $(CXX) $(FLAGS) -DNDEBUG -o ./build/$(TARGET)

clean:
	read -r -p "This will delete the contents of build/*. Are you sure? [CTRL-C to abort] " response && rm -rf build/*
+
+check-python:
+	@command -v python3 >/dev/null 2>&1 || { echo >&2 "Python needs to be installed and in your path."; exit 1; }
diff --git a/setup.py b/setup.py
index 7493810..4e28348 100644
--- a/setup.py
+++ b/setup.py
@@ -1,6 +1,7 @@
import os
import platform
import sys
+import ssl
import urllib.request
from pathlib import Path
@@ -28,13 +29,14 @@ def report_progress(block_num, block_size, total_size):
        print(f"\rDownloaded {total_downloaded // (1024 * 1024)} MB", end="")

    try:
+        # Use the stdlib SSL context so the download also works on systems
+        # where default certificate verification fails
+        ssl._create_default_https_context = ssl._create_stdlib_context
        urllib.request.urlretrieve(url, output_filename, reporthook=report_progress)
        print(f"\nDownloaded {output_filename}")
        return True
    except Exception as e:
        print(f"\nFailed to download {output_filename}")
        print(f"Error: {str(e)}")
-        return False
+        sys.exit(1)

def check_os(os_name):
    print("\nChecking System")
@@ -63,7 +65,7 @@ def download_dawn(os_name):

    if not outfile or not url:
        print(f"No download information for {os_name}")
-        return
+        sys.exit(1)

    print(f"  URL : {url}")
    print(f"  Download File : {outfile}\n")

    if Path(outfile).exists():
        print(f"  File {outfile} already exists, skipping.")
-        return
+        sys.exit(0)

    Path(outfile).parent.mkdir(parents=True, exist_ok=True)
    download_file(url, outfile)
diff --git a/tutorial.md b/tutorial.md
deleted file mode 100644
index db98d57..0000000
--- a/tutorial.md
+++ /dev/null
@@ -1,328 +0,0 @@
-  ____ _____ __ __      _________  ____
- / __ `/ __ \/ / / //  ___/ __ \/ __ \
-/ /_/ / /_/ / /_/ // /__/ /_/ / /_/ /
-\__, / .___/\__,_(_)___/ .___/ .___/
-/____/_/      /_/      /_/
-
-Intro
--------
-
-This is a brief intro to the gpu.cpp library.
-
-You can use the library by simply including the gpu.h header:
-
-    #include "gpu.h"
-
-and starting with a build template (see examples/hello_gpu/ for a template
-project that builds the library).
-
-# First Program
-
-Before diving into the details of the library, let's test out some code to
-perform a simple GPU computation - a GELU activation function. These activation
-functions are common in deep learning large language models.
-
-The code is broken into two parts:
-
-*The code that runs on the GPU*
-
-.. is written in WGSL, the WebGPU Shading Language. WGSL is a domain specific
-language for writing GPU compute kernels that approximately maps to the
-computations available on the GPU. If you are familiar with CUDA, this is
-similar to writing a CUDA kernel.
-
-*The code that runs on the host (CPU)*
-
-.. is written in C++ and uses the gpu.cpp library, which invokes the WebGPU C API.
-
-We'll see a WGSL example later; for now, here is the host CPU C++ code that
-uses gpu.cpp to run the GELU activation function:
-
-```
-int main(int argc, char **argv) {
-  printf("\nHello, gpu.cpp\n\n");
-  Context ctx = CreateContext();
-  static constexpr size_t N = 3072;
-  std::array<float, N> inputArr, outputArr;
-  for (int i = 0; i < N; ++i) {
-    inputArr[i] = static_cast<float>(i) / 2.0; // dummy input data
-  }
-  Tensor input = CreateTensor(ctx, Shape{N}, kf32, inputArr.data());
-  Tensor output = CreateTensor(ctx, Shape{N}, kf32);
-  std::promise<void> promise;
-  std::future<void> future = promise.get_future();
-  Kernel op = CreateKernel(ctx, CreateShader(kGelu, 256, kf32), Bindings{input, output},
-                           /* nthreads */ {N, 1, 1});
-  DispatchKernel(ctx, op, promise);
-  Wait(ctx, future);
-  ToCPU(ctx, output, outputArr.data(), sizeof(outputArr));
-  for (int i = 0; i < 32; ++i) {
-    printf("out[%d] : gelu(%.2f) = %.2f\n", i, inputArr[i], outputArr[i]);
-  }
-  printf("...\n\n");
-  return 0;
-}
-```
-
-gpu.cpp vs. the raw WebGPU API
-------------------------------
-
-The main responsibility of the types and functions of the library is to make it
-simple to represent these common building blocks of computation.
-
-If you look at `examples/webgpu_intro/run.cpp` you can learn more about what
-it's like to interact directly with the WebGPU API.
-
-Design Objectives of gpu.cpp
-----------------------------
-
-1. Maximal leverage. Maximize the space of implementations that this
-   library is useful for with the least amount of implementation complexity.
-
-2. Minimize integration complexity. Whereas the common pattern for custom
-   low-level GPU algorithm code is to integrate it into an existing engine (e.g.
-   an inference runtime, or a compiler), the objective of gpu.cpp is to enable
-   adding GPU computation code inside your own project with a minimal amount of
-   integration complexity.
-
-3. High ceiling on low-level control.
-   - Direct control of on-device GPU code unconstrained by a fixed set of ops
-   - Direct control of on-device GPU memory management
-
-Separating Resource Acquisition and Dispatch
---------------------------------------------
-
-We can think of the use of the gpu.cpp library as a collection of GPU nouns and
-verbs.
-
-The "nouns" are GPU resources modeled by the type definitions of the library,
-and the "verbs" are actions on GPU resources, modeled by the functions of the
-library.
-
-The key functions can be further subdivided into two categories in relation to
-when the GPU computation occurs:
-
-1) Ahead-of-time GPU resource preparation: these are functions that
-   acquire resources and prepare state for GPU computation. These are less
-   performance critical.
-
-2) Performance-critical dispatch of GPU computation: these are functions that
-   dispatch GPU computation to the GPU, usually in a tight hot-path loop.
-
-Ahead-of-time GPU Resource Preparation
---------------------------------------
-
-In the next sections, we'll look at ahead-of-time GPU resource preparation.
-
-These are functions that acquire resources and prepare state for GPU
-computation. They are assumed to be less performance critical and not on hot
-code paths.
-
-Preparing GPU Resources I: Resource Type Definitions
-----------------------------------------------------
-
-The main resources are:
-
-- `Context` - the state of resources for interacting with the GPU.
-- `Tensor` - a buffer of data on the GPU.
-- `ShaderCode` - the code for a shader program that can be dispatched to the
-  GPU. This is a thin wrapper around a WGSL string but also includes the
-  workgroup size the code is designed to run with.
-- `Kernel` - a GPU program that can be dispatched to the GPU. This accepts a
-  `ShaderCode` and a list of `Tensor` resources to bind for the dispatch
-  computation.
-
-Preparing GPU Resources II: Acquiring GPU Resources with `Create*()` Functions
-------------------------------------------------------------------------------
-
-Resources are acquired using the `Create` functions. These are assumed to be
-ahead-of-time and not performance critical.
-
-- `Context CreateContext(...)` - creates a GPU context.
-- `Tensor CreateTensor(...)` - creates and allocates a buffer for a tensor
-  on the GPU.
-- `Kernel CreateKernel(...)` - creates and prepares a kernel on the GPU,
-  including underlying GPU buffer data bindings and the compute pipeline for
-  the shader code.
-- `MultiKernel CreateMultiKernel(...)` - same as `CreateKernel`, but for
-  multiple kernels to be dispatched together.
-
-There are a few supporting types in addition to these. `Shape` is a simple type
-to specify the shape of a tensor. `KernelDesc` and `MultiKernelDesc` are
-effectively. `TensorPool` manages `Tensor` resources and is used as context
-for allocating and deallocating tensor data on the GPU. In practice
-`TensorPool` is managed as a member variable of `Context`.
-
-`CreateContext()` creates a Context
--------------------------------------
-
-Let's zoom in a bit on the invocation of these Create functions, starting with
-CreateContext:
-
-The Context is the main entry point for interacting with the GPU. It
-represents the state of the GPU and is used to allocate resources and execute
-kernels.
-
-In your program, you can create a Context like this:
-
-    Context ctx = CreateContext();
-
-`CreateTensor()` allocates a Tensor on the GPU
------------------------------------------------
-
-As a low-level library, gpu.cpp primarily deals with flat arrays of data either
-on the CPU or GPU.
-
-The main data structure is the GPUArray, which represents a flat buffer of
-values on the GPU. Tensor is a thin wrapper around GPUArray that adds shape
-metadata.
-
-In most applications, you may prepare arrays or allocated
-chunks on the CPU (e.g. for model weights or input data), and then:
-
-    std::array<float, N> inputArr;
-    std::array<float, N> outputArr;
-    for (int i = 0; i < N; ++i) {
-      inputArr[i] = static_cast<float>(i); // dummy input data
-    }
-    Tensor input = CreateTensor(ctx, {N}, kf32, inputArr.data());
-    Tensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
-
-Let's try creating some data on the GPU now.
-
-```
-  std::array<float, 3072> inputArr;
-  std::array<float, 3072> outputArr;
-  for (int i = 0; i < 3072; ++i) {
-    inputArr[i] = static_cast<float>(i); // dummy input data
-  }
-  Tensor input = CreateTensor(ctx, {3072}, kf32, inputArr.data());
-  Tensor output = CreateTensor(ctx, {3072}, kf32, outputArr.data());
-  fprintf(stdout, "\nSuccessfully created input and output tensors.\n\n");
-  wait();
-```
-
-Create a Kernel with `CreateKernel()`
--------------------------------------
-
-Reviewing our GELU example, after using `CreateTensor()` to allocate and
-bind buffers for input and output data, we can use `CreateKernel()` to create a
-kernel.
-
-```
-  // Previously: Create the input and output tensors
-  Tensor input = CreateTensor(ctx, {N}, kf32, inputArr.data());
-  Tensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
-
-  // ...
-
-  Kernel op =
-      CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output, /*nthreads*/{N, 1, 1});
-```
-
-Note this *does not* run the kernel, it just prepares the kernel as a resource
-to be dispatched later.
-
-There are four arguments to `CreateKernel()`:
-
-- `Context` - the context for the GPU
-- `ShaderCode` - the shader code for the kernel
-- `Tensor` - the input tensor. Even though the kernel is not executed,
-  `Tensor` provides a handle to the buffers on the GPU to be loaded when the
-  kernel is run. If there's more than one input, `Tensors` can be used, which
-  is an ordered collection of `Tensor`.
-- `Tensor` - the output tensor. As with the input tensor, the values are not
-  important at this point; the underlying reference to the GPU buffer is bound
-  to the kernel so that when the kernel is dispatched, it will know where to
-  write the output data.
-
-The kGELU string that goes into ShaderCode is the WGSL shader code for the
-kernel. We'll look at this next.
-
-WGSL Compute Kernels are Programs that run Computation on the GPU
------------------------------------------------------------------
-
-Device code in WebGPU uses the WGSL shading language. In addition to mechanisms
-for invoking WGSL shaders as compute kernels as shown so far, you can write
-your own WGSL shaders and use the same mechanisms to invoke them.
-
-Here is an example of a custom WGSL shader that implements the GELU activation:
-
-```
-const GELU_SCALING_FACTOR: f32 = 0.7978845608028654; // sqrt(2.0 / PI)
-@group(0) @binding(0) var<storage, read_write> inp: array<f32>;
-@group(0) @binding(1) var<storage, read_write> out: array<f32>;
-@compute @workgroup_size(256, 1, 1)
-fn main(
-    @builtin(global_invocation_id) GlobalInvocationID: vec3<u32>) {
-  let i: u32 = GlobalInvocationID.x;
-  if (i < arrayLength(&inp)) {
-    let x: f32 = inp[i];
-    // select is more stable for larger values of x
-    out[i] = select(0.5 * x * (1.0 + tanh(GELU_SCALING_FACTOR
-             * (x + .044715 * x * x * x))), x, x > 10.0);
-  }
-}
-```
-
-If you are familiar with CUDA, this is pretty similar to the code you would
-find in a CUDA kernel. Like a CUDA kernel, there are invocation ids that are
-passed in.
-
-The `@group(0)` and `@binding(0)` annotations are used to specify the binding
-points for the input and output buffers. The `@compute` annotation specifies
-that this is a compute kernel. The `@workgroup_size(256, 1, 1)` annotation
-specifies the workgroup size for the kernel.
-
-Performance critical dispatch of GPU computation
-------------------------------------------------
-
-The past few sections have covered the ahead-of-time GPU resource preparation,
-consisting of `Create*()` and supporting functions.
-
-None of these actually execute computation on the GPU yet.
-
-Next we'll look at the dispatch functions, which asynchronously dispatch a
-kernel for execution.
-
-Dispatch a kernel for execution with `DispatchKernel()`
--------------------------------------------------------
-
-After creating a kernel, you can dispatch it for execution on the GPU using
-`DispatchKernel()`.
-
-```
-  // Previously: Create the kernel
-  Kernel op =
-      CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output);
-
-  // ...
-
-  DispatchKernel(ctx, op);
-  Wait(ctx, op.future);
-  ToCPU(ctx, output, outputArr.data(), sizeof(outputArr));
-}
-```
-
-Note that the kernel is executed asynchronously on the GPU; in other words,
-execution will continue on the CPU while the GPU is running the kernel.
-
-To wait for the kernel to finish, you can use `Wait(ctx, op.future)`. This will
-block until the kernel has finished executing.
-
-Note the output of the kernel (if any) is written to the output tensor on the
-GPU. It is not copied back to the CPU by default until you call `ToCPU()` to
-copy the data back to the CPU.
-
-This is intentional, to allow for efficient pipelining of GPU computation and
-reuse of GPU resources without copying data back and forth unless it's
-specified.
-
-Resetting the Command Buffer
------------------------------
-
-(( TODO ))