Update overview docs in README.md for sharktank and shortfin. (#388)
Progress on #359:

* Highlight each sub-project in the root README.md
* Highlight key folders for working with SDXL and llama
* Add badges for build status, PyPI packages, and the Apache-2.0 license
* Standardize some capitalization and fix some link redirects

This is just the first step in getting the repository ready for wider
attention. The development instructions will need more cleanup and/or
movement into the sub-projects, and I've added some TODOs for what to
document next.

---------

Co-authored-by: Marius Brehler <[email protected]>
ScottTodd and marbre authored Oct 31, 2024
1 parent 6c7d4a4 commit c7dd7fa
Showing 5 changed files with 92 additions and 41 deletions.
63 changes: 57 additions & 6 deletions README.md
@@ -3,10 +3,61 @@
**WARNING: This is an early preview that is in progress. It is not ready for
general use.**

![GitHub License](https://img.shields.io/github/license/nod-ai/SHARK-Platform)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

<!-- TODO: high level overview, features when components are used together -->

## Development Getting Started
## Sub-projects

### [`sharktank/`](./sharktank/)

[![PyPI version](https://badge.fury.io/py/sharktank.svg)](https://badge.fury.io/py/sharktank) [![CI - sharktank](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-sharktank.yml/badge.svg?event=push)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-sharktank.yml?query=event%3Apush)

The SHARK Tank sub-project contains a collection of model recipes and
conversion tools to produce inference-optimized programs.

<!-- TODO: features list here? -->

* See the [SHARK Tank Programming Guide](./docs/programming_guide.md) for
information about core concepts, the development model, dataset management,
and more.
* See [Direct Quantization with SHARK Tank](./docs/quantization.md)
for information about quantization support.

### [`shortfin/`](./shortfin/)

<!-- TODO: features list here? -->

[![PyPI version](https://badge.fury.io/py/shortfin.svg)](https://badge.fury.io/py/shortfin) [![CI - shortfin](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_linux_x64-libshortfin.yml/badge.svg?event=push)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_linux_x64-libshortfin.yml?query=event%3Apush)

The shortfin sub-project is SHARK's high-performance inference library and
serving engine.

* API documentation for shortfin is available on
[readthedocs](https://shortfin.readthedocs.io/en/latest/).

### [`tuner/`](./tuner/)

[![CI - Tuner](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-tuner.yml/badge.svg?event=push)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-tuner.yml?query=event%3Apush)

The Tuner sub-project assists with tuning program performance by searching for
optimal parameter configurations to use during model compilation.

## Support matrix

<!-- TODO: version requirements for Python, ROCm, Linux, etc. -->

### Models

Model name | Model recipes | Serving apps
---------- | ------------- | ------------
SDXL | [`sharktank/sharktank/models/punet/`](https://github.com/nod-ai/SHARK-Platform/tree/main/sharktank/sharktank/models/punet) | [`shortfin/python/shortfin_apps/sd/`](https://github.com/nod-ai/SHARK-Platform/tree/main/shortfin/python/shortfin_apps/sd)
llama | [`sharktank/sharktank/models/llama/`](https://github.com/nod-ai/SHARK-Platform/tree/main/sharktank/sharktank/models/llama) | [`shortfin/python/shortfin_apps/llm/`](https://github.com/nod-ai/SHARK-Platform/tree/main/shortfin/python/shortfin_apps/llm)

## Development getting started

<!-- TODO: Remove or update this section. Common setup for all projects? -->

Use this as a guide to get started developing the project using pinned,
pre-release dependencies. You are welcome to deviate as you see fit, but
@@ -22,7 +22,7 @@ python -m venv --prompt sharktank .venv
source .venv/bin/activate
```

### Install PyTorch for Your System
### Install PyTorch for your system

If no explicit action is taken, the default PyTorch version will be installed.
This will give you a current CUDA-based version. Install a different variant
@@ -40,7 +91,7 @@ pip install -r pytorch-cpu-requirements.txt
pip install -r pytorch-rocm-requirements.txt
```

### Install Development Packages
### Install development packages

```
# Clone and install editable iree-turbine dep in deps/
@@ -51,14 +102,14 @@ pip install -f https://iree.dev/pip-release-links.html --src deps \
pip install -r requirements.txt -e sharktank/ shortfin/
```

### Running Tests
### Running tests

```
pytest sharktank
pytest shortfin
```

### Optional: Pre-commits and developer settings
### Optional: pre-commits and developer settings

This project is set up to use the `pre-commit` tooling. To install it in
your local repo, run: `pre-commit install`. After this point, when making
4 changes: 2 additions & 2 deletions docs/model_cookbook.md
@@ -1,7 +1,7 @@
# Model cookbook

Note: These are early notes and commands that the sharktank team is using and
will turn into proper docs later.
Note: These are early notes and commands that the SHARK-Platform team is using
and will turn into proper docs later.

## Diagrams

58 changes: 29 additions & 29 deletions docs/quantization.md
@@ -4,10 +4,10 @@ author: Stella Laurenzo
date: June 30, 2024
---

# Direct Quantization with sharktank
# Direct Quantization with SHARK Tank

As a toolkit for building and adapting PyTorch based models for deployment,
sharktank provides rich quantization support. By targeting the
SHARK Tank provides rich quantization support. By targeting the
[IREE compiler](https://github.com/iree-org/iree) for optimizations, we can
strike a balance with our quantization setup that:

@@ -36,7 +36,7 @@ supports these indirect schemes -- effectively using compiler transformations
under the covers to do opaque model transformations that mirror a subset of
what is exposed directly to the user in the rest of this document.

As an alternative, when developing sharktank and bringing up the initial
As an alternative, when developing SHARK Tank and bringing up the initial
models, we wanted something more flexible, easier to debug/extend, and
less burdened by the need to reduce everything to a lowest common denominator
in order to fit into fixed-function op sets that are very expensive to change.
@@ -63,12 +63,12 @@ amount of Python code implementing direct math and packing schemes.
drop-in replacements for subsets of the functionality available in stock
PyTorch modules like `Linear` and `Conv2D`.
2. Types/Ops: The `nn.Module` implementations we provide are built in terms
of sharktank custom
[`InferenceTensor`](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/types/tensors.py#L153)
and [polymorphic functional ops library](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/ops/signatures.py).
of SHARK Tank custom
[`InferenceTensor`](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/types/tensors.py#L153)
and [polymorphic functional ops library](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/signatures.py).
3. Op specializations for optimized subsets of op type signatures and features
(for example, [an optimized affine quantized linear specialization for
supported combinations of `TensorScaledLayout` arguments](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/ops/qlinear_impls.py)).
supported combinations of `TensorScaledLayout` arguments](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/qlinear_impls.py)).

(TODO: good place for a diagram)
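
To make the module layer concrete, here is a minimal, self-contained sketch of
a drop-in `Linear`-style replacement whose weight arrives pre-quantized and is
dequantized at call time. This is plain PyTorch written purely for
illustration; the class and argument names are invented and are not the actual
sharktank modules.

```python
import torch
import torch.nn as nn


class FakeQuantLinear(nn.Module):
    """Illustrative stand-in for a quantized Linear layer (not the sharktank
    implementation): holds an int8 weight plus a per-tensor scale and
    dequantizes on the fly in forward()."""

    def __init__(self, qweight: torch.Tensor, scale: torch.Tensor, bias=None):
        super().__init__()
        self.register_buffer("qweight", qweight)  # int8, shape (out, in)
        self.register_buffer("scale", scale)      # float32 scalar
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.qweight.to(x.dtype) * self.scale  # dequantize
        return torch.nn.functional.linear(x, weight, self.bias)


# Same call signature as a stock nn.Linear(16, 8).
layer = FakeQuantLinear(
    qweight=torch.randint(-127, 128, (8, 16), dtype=torch.int8),
    scale=torch.tensor(0.02),
)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```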

@@ -78,18 +78,18 @@ amount of Python code implementing direct math and packing schemes.
Available modules that support direct quantization (TODO: refactor to use
torch "Module" terminology and naming schemes consistently):

* [`LinearLayer`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/layers/linear.py)
* [convolution layers](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/layers/conv.py)
* [`LinearLayer`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/layers/linear.py)
* [convolution layers](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/layers/conv.py)

Note that most sharktank modules extend
[`ThetaLayer`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/layers/base.py#L63),
[`ThetaLayer`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/layers/base.py#L63),
which calls for a bit of explanation. Traditional PyTorch Modules directly
instantiate their backing parameters in their constructor. For dataset-heavy
and polymorphic implementations like we commonly see in quantization and
distribution, however, it can be beneficial to separate these concerns.

The `ThetaLayer` simply takes a
[`Theta` object](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/theta.py#L74),
[`Theta` object](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/theta.py#L74),
which is a tree-structured bag of native `torch.Tensor` or `InferenceTensor`
instances, and it adopts the tensors in the bag as its own vs creating them.
For those familiar with the concept, this is a form of dependency-injection
@@ -114,7 +114,7 @@ tree to a specific Module instance.
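
For illustration, the injection pattern looks roughly like the following toy
sketch. `TinyTheta` and `InjectedLinear` are invented stand-ins, not the real
`Theta`/`ThetaLayer` classes, which add naming, serialization, and
`InferenceTensor` support on top.

```python
import torch
import torch.nn as nn


class TinyTheta:
    """Toy stand-in for Theta: a tree-structured bag of tensors addressed
    by path. Not the sharktank implementation."""

    def __init__(self, tree: dict):
        self.tree = tree

    def tensor(self, *path: str) -> torch.Tensor:
        node = self.tree
        for part in path:
            node = node[part]
        return node


class InjectedLinear(nn.Module):
    """ThetaLayer-style module: it adopts tensors from the bag instead of
    constructing its own parameters (dependency injection of weights)."""

    def __init__(self, theta: TinyTheta, prefix: str):
        super().__init__()
        self.weight = theta.tensor(prefix, "weight")
        self.bias = theta.tensor(prefix, "bias")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, self.weight, self.bias)


theta = TinyTheta({"ffn": {"weight": torch.randn(8, 16), "bias": torch.zeros(8)}})
layer = InjectedLinear(theta, "ffn")
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
```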

We've already met the `Theta` object above, which holds a tree of something
called an
[`InferenceTensor`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L153).
[`InferenceTensor`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L153).
Now we describe what this is. Note that presently, `InferenceTensor` is not a
`torch.Tensor` but its own `ABC` type that:

@@ -140,11 +140,11 @@ pipelines.
There is a growing list of `InferenceTensor` sub-types, many of which are
related to quantization:

* [`PrimitiveTensor`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L286):
* [`PrimitiveTensor`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L286):
A simple composition of a single `torch.Tensor`. This is often used
interchangeably with a `torch.Tensor` but is present for completeness of
the type hierarchy and to be able to type select on.
* [`QuantizedTensor`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L372):
* [`QuantizedTensor`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L372):
Abstract base class of all quantized tensors, providing two primary operations:

* `unpack`: Accesses the backing `QuantizedLayout` of the tensor, which is
@@ -154,12 +154,12 @@ related to quantization:
layout, this explodes it into a canonical representation of individual
tensors which can be algebraically implemented individually/generically).

* [`PlanarQuantizedTensor`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L408):
* [`PlanarQuantizedTensor`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L408):
Concrete implementation for all non-packed quantized tensors that can be
losslessly represented by a layout based on individual tensor components.
All `QuantizedTensor` instances can be converted to a `PlanarQuantizedTensor`.

* [`QuantizerTensor`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L408):
* [`QuantizerTensor`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L408):
(note the "r" in the name) An abstract `InferenceTensor` that exposes a
`quantize(torch.Tensor | InferenceTensor) -> QuantizedTensor` operation used
to transform an arbitrary tensor to a quantized form. There are a handful
@@ -178,7 +178,7 @@ manipulate tensor contents via `QuantizedLayout`, but we haven't yet defined
that. The *Tensor types are structural and exist to give identity, but the
`QuantizedLayout` is where the "magic happens".

[`QuantizedLayout`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/tensors.py#L44)
[`QuantizedLayout`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/tensors.py#L44)
is an `ABC`, supporting:

* Serialization/interop with parameter archives.
@@ -193,7 +193,7 @@ is an `ABC`, supporting:
There are a number of implementations, as every quantization scheme typically
needs at least one concrete `QuantizedLayout`. Simple schemes like affine
quantization can be fully defined in terms of a single
[`TensorScaledLayout`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/layouts.py#L43).
[`TensorScaledLayout`](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/types/layouts.py#L43).
Packed schemes like those found in inference engines such as GGML and XNNPACK,
by contrast, optimally require both a packed layout and a planar layout.
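
For orientation, the affine (tensor-scaled) round trip that such a layout has
to capture is just the usual scale/offset math. The sketch below is the
generic formula in plain PyTorch, not the `TensorScaledLayout` implementation
itself:

```python
import torch


def affine_quantize(x: torch.Tensor, scale: float, offset: int,
                    dtype=torch.int8) -> torch.Tensor:
    """q = clamp(round(x / scale) + offset): the packed integer plane."""
    qmin, qmax = torch.iinfo(dtype).min, torch.iinfo(dtype).max
    return torch.clamp(torch.round(x / scale) + offset, qmin, qmax).to(dtype)


def affine_dequantize(q: torch.Tensor, scale: float, offset: int) -> torch.Tensor:
    """x ~= (q - offset) * scale: what a planar layout lets ops recompute."""
    return (q.to(torch.float32) - offset) * scale


x = torch.randn(4, 4)
scale, offset = x.abs().max().item() / 127.0, 0
q = affine_quantize(x, scale, offset)
x_hat = affine_dequantize(q, scale, offset)
print((x - x_hat).abs().max())  # small quantization error
```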

@@ -224,7 +224,7 @@ interpreting/transforming using their natively defined forms.
Previously, we found a rich type system defining all manner of layouts and
quantization schemes, but what can be done with it? That is where the
sharktank functional op library comes in. These
[logical ops](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/ops/signatures.py)
[logical ops](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/signatures.py)
provide the building blocks to implement built-in and custom `nn.Module`
implementations operating on `InferenceTensor` (and torch.Tensor) types.

@@ -239,12 +239,12 @@ implementation at any needed level of granularity:
structures and preserve it when computing (when combined with a
fusing compiler, this alone provides decent fallback implementations for a
variety of "weight compression" oriented techniques). See
[some examples](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/ops/custom_impls.py#L51).
[some examples](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/custom_impls.py#L51).
* Pure-Torch decompositions for algebraic techniques like affine quantization
(when combined with a fusing compiler, this alone is sufficient for
optimization). See
[qlinear](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/ops/qlinear_impls.py) and
[qconv](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/ops/qconv_impls.py)
[qlinear](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/qlinear_impls.py) and
[qconv](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/ops/qconv_impls.py)
implementations of actual affine quantized decompositions.
* Completely custom packed/optimized implementation. These can be written to
activate on any level of detail of the type hierarchy. The implementation
@@ -280,8 +280,8 @@ level. Some examples:
[tensor trace/print](https://github.com/iree-org/iree-turbine/blob/main/iree.turbine/ops/iree.py#L52)
* [Simple linalg based template expansion](https://github.com/iree-org/iree-turbine/blob/main/iree.turbine/ops/_jinja_test_ops.py#L28)
(see backing example [jinja template](https://github.com/iree-org/iree-turbine/blob/main/iree.turbine/ops/templates/test_add_jinja.mlir)).
* Optimal linalg-based [8-bit block scaled mmt for weight compression](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/kernels/mmt_block_scaled_q8.py)
(see backing [jinja template](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/kernels/templates/mmt_block_scaled_q8_3d.mlir)).
* Optimal linalg-based [8-bit block scaled mmt for weight compression](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/kernels/mmt_block_scaled_q8.py)
(see backing [jinja template](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/kernels/templates/mmt_block_scaled_q8_3d.mlir)).
* DSL based [like this fused attention kernel](https://github.com/iree-org/iree-turbine/blob/main/tests/kernel/fused_attention_test.py#L20)
(note that in this case, the DSL exports to the underlying IR-based registration
mechanism used in the previous examples).
@@ -292,8 +292,8 @@ Since all of these types of custom kernels are just defined with simple Python
tooling, they are really fast to iterate on. The linalg based kernels specifically
tend to be highly portable, and we don't hesitate to write one of those when
we need something specific that PyTorch doesn't provide out of the box
(i.e. [proper mixed-precision integer conv](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/kernels/conv_2d_nchw_fchw.py)
([template](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/kernels/templates/conv_2d_nchw_fchw.mlir))).
(i.e. [proper mixed-precision integer conv](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/kernels/conv_2d_nchw_fchw.py)
([template](https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/kernels/templates/conv_2d_nchw_fchw.mlir))).

## Dataset transformation

@@ -307,7 +307,7 @@ We take a practical approach to this, writing implementation specific converters
where needed, and taking advantage of industry-standard consolidation points
where available (like GGUF) in order to cover a wider surface area.

Behind both is the notion of a [`Dataset`](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/types/theta.py#L263),
Behind both is the notion of a [`Dataset`](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/types/theta.py#L263),
which combines some set of hyper-parameters with a root `Theta` object
(typically representing the layer-tree of frozen tensors). Datasets can be
losslessly persisted to IREE IRPA files, which can then be loaded by either
@@ -321,9 +321,9 @@ transform, shard, etc.
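
Conceptually, then, a `Dataset` is just that pairing of hyper-parameters with a
layer-tree of tensors. The sketch below shows only this mental model in plain
Python (invented structure, no IRPA serialization), including the flat
dotted-name form a parameter archive would store:

```python
import torch


# Conceptual shape of a Dataset: hyper-parameters plus a layer-tree of frozen
# tensors. The real sharktank Dataset adds IRPA persistence and
# InferenceTensor support; this is only the mental model.
hyperparams = {"hidden_dim": 16, "block_count": 2, "quantization": "int8"}
root_theta = {
    "blk": {
        "0": {"attn_q": {"weight": torch.randn(16, 16)}},
        "1": {"attn_q": {"weight": torch.randn(16, 16)}},
    },
    "output": {"weight": torch.randn(32, 16)},
}
dataset = {"properties": hyperparams, "root_theta": root_theta}


def flatten(tree, prefix=""):
    """Walk the layer-tree into flat 'blk.0.attn_q.weight'-style names,
    the form a parameter archive would store."""
    for key, value in tree.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, name + ".")
        else:
            yield name, value


for name, tensor in flatten(dataset["root_theta"]):
    print(name, tuple(tensor.shape))
```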

See some examples:

* [models/punet/tools/import_hf_dataset.py](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/models/punet/tools/import_hf_dataset.py) :
* [models/punet/tools/import_hf_dataset.py](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/models/punet/tools/import_hf_dataset.py) :
Creating a `Dataset` object from an HF diffusers safetensors file and config.json.
* [models/punet/tools/import_brevitas_dataset.py](https://github.com/nod-ai/sharktank/blob/quant_docs/sharktank/sharktank/models/punet/tools/import_brevitas_dataset.py) :
* [models/punet/tools/import_brevitas_dataset.py](https://github.com/nod-ai/SHARK-Platform/blob/quant_docs/sharktank/sharktank/models/punet/tools/import_brevitas_dataset.py) :
Creates a quantized `Dataset` by combining:

* HF diffusers `config.json`
2 changes: 1 addition & 1 deletion sharktank/sharktank/ops/custom_impls.py
@@ -30,7 +30,7 @@


# Fused FP matmul.
# Disabled: See https://github.com/nod-ai/sharktank/issues/44
# Disabled: See https://github.com/nod-ai/SHARK-Platform/issues/44
# @matmul.override(Tensor, Tensor)
# def matmul_mmtfp_tensor_tensor(lhs, rhs, *, transpose_rhs: bool):
# lhs = unbox_tensor(lhs)
6 changes: 3 additions & 3 deletions sharktank/tests/ops/ops_test.py
@@ -136,7 +136,7 @@ def testMatchFail(self):
):
ops.matmul(1, 2)

@unittest.skip("https://github.com/nod-ai/sharktank/issues/44")
@unittest.skip("https://github.com/nod-ai/SHARK-Platform/issues/44")
def testTorchImplTransposedRHS(self):
ops._registry._test_enable_last_op_dispatch(True)
t1 = torch.rand(32, 16, dtype=torch.float32)
@@ -149,7 +149,7 @@ def testTorchImplTransposedRHS(self):
ops.custom_impls.matmul_mmtfp_tensor_tensor,
)

@unittest.skip("https://github.com/nod-ai/sharktank/issues/44")
@unittest.skip("https://github.com/nod-ai/SHARK-Platform/issues/44")
def testTorchImplNonTransposedRHS(self):
ops._registry._test_enable_last_op_dispatch(True)
t1 = torch.rand(32, 16, dtype=torch.float32)
@@ -162,7 +162,7 @@ def testTorchImplNonTransposedRHS(self):
ops.custom_impls.matmul_mmtfp_tensor_tensor,
)

@unittest.skip("https://github.com/nod-ai/sharktank/issues/44")
@unittest.skip("https://github.com/nod-ai/SHARK-Platform/issues/44")
def testTorchImplTransposedPrimitiveRHS(self):
ops._registry._test_enable_last_op_dispatch(True)
t1 = torch.rand(32, 16, dtype=torch.float32)
