[nnpackage] Define block quantization type on circle format #13743

hseok-oh · 2024-08-22T07:12:49Z

What?

Let's support a block quantization data type in the circle format to support LLM models.

Why?

To support LLM models, we need weight quantization that is small in size and has small precision loss.
So we need to introduce chunk quantization such as the quantization types of ggml (llama.cpp).

To represent this, we need to extend the circle schema's QuantizationParameters table and/or QuantizationDetails union.

Related issue: #13742


hseok-oh · 2024-08-27T10:41:40Z

Below is a schema draft to represent ggml's quantization types (block quantization).
Please share your opinions on it.
@seanshpark @mhs4670go @jinevening @chunseoklee @glistening @jyoungyun @ragmani

https://github.com/Samsung/ONE/pull/13758/files#diff-8b2942eef0fd7474ef49ec5245f9d288bd6d62c94ef20689c24edf07ce77c095

// Block quantization: from ggml quantization (https://github.com/ggerganov/ggml)
table CircleBlockQuantization {
  name:string;
}

// Represents a specific quantization technique's parameters.
union QuantizationDetails {
  CustomQuantization,
  CircleBlockQuantization
}

// Parameters for converting a quantized tensor back to float.
table QuantizationParameters {
  // These four parameters are the asymmetric linear quantization parameters.
  // Given a quantized value q, the corresponding float value f should be:
  //   f = scale * (q - zero_point)
  // For other quantization types, the QuantizationDetails below is used.
  // NOTE min/max values are valid if
  // 1. length of min/max == 0 or
  // 2. length of min/max == length of scale/zero_point
  // Otherwise, min/max are not valid (undefined behavior).
  min:[float];
  max:[float];
  scale:[float];  // For dequantizing the tensor's values.
  zero_point:[long];
  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;
  // Specifies the dimension of the Tensor's shape that the scales and
  // zero_points correspond to. For example, a tensor t, with dims=[4, 3, 2, 1]
  // with quantization params:
  //   scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantized_dimension=1
  // will be quantized across the second dimension of t.
  //   t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
  //   t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
  //   t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
  quantized_dimension:int;
}
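
To make the affine formula and the quantized_dimension comment in this schema concrete, here is a minimal Python sketch. The helper name `dequantize_per_axis` and the nested-list tensor layout are assumptions made only for this illustration; they are not part of circle, luci, or onert.

```python
def dequantize_per_axis(q, scale, zero_point, quantized_dimension):
    """Apply f = scale * (q - zero_point) to a 2-D tensor given as nested lists.

    Hypothetical helper: scale[i] and zero_point[i] apply to index i along
    quantized_dimension (0 = rows, 1 = columns).
    """
    out = []
    for r, row in enumerate(q):
        out_row = []
        for c, value in enumerate(row):
            i = r if quantized_dimension == 0 else c
            out_row.append(scale[i] * (value - zero_point[i]))
        out.append(out_row)
    return out


# Two rows, three columns; one (scale, zero_point) pair per column.
q = [[1, 2, 3],
     [2, 4, 6]]
f = dequantize_per_axis(q, scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3],
                        quantized_dimension=1)
```

Each column uses its own scale and zero point, which is the per-channel case that the existing scale and zero_point fields cover. Block quantization does not fit this shape, because its parameters change every 32 elements rather than along one tensor dimension.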

This introduces a new table, CircleBlockQuantization, as a member of the QuantizationDetails union used by the details field. The details field has not been used so far, so this would be its first use. CircleBlockQuantization has a name field that holds the ggml quantization type name (e.g. Q4_0, Q4_1, Q8_0). If the details field has a value, the other QuantizationParameters fields are not used to decide the quantization type.
Quantization parameters such as scales are stored in the buffer together with the quantized values, so there is no field for the per-block deltas (scales). This is the same policy as ggml's quantization and dequantization.

Below is the Q4_0 block structure as stored in the buffer.
https://github.com/ggerganov/ggml/blob/2438d62cb9290b5b5dc6228dec76fe81cf64238e/src/ggml-common.h#L144-L149

#define QK4_0 32
typedef struct {
    ggml_half d;           // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2, "wrong q4_0 block size/padding");

(ggml_half: fp16)
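
As a rough illustration of how one such block could be decoded, here is a Python sketch that unpacks a single Q4_0 block into 32 floats. It assumes little-endian storage and the nibble order and offset of 8 used by ggml's reference dequantization; the function name `dequantize_q4_0_block` is hypothetical and is not an onert or ggml API.

```python
import struct

QK4_0 = 32                      # elements per block, as in the struct above
BLOCK_SIZE = 2 + QK4_0 // 2     # fp16 delta + 16 bytes of packed nibbles = 18


def dequantize_q4_0_block(block: bytes) -> list:
    """Dequantize one 18-byte Q4_0 block into 32 floats.

    Assumption: the low nibble of qs[j] is element j, the high nibble is
    element j + 16, and each 4-bit value is stored with an offset of 8.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected %d bytes, got %d" % (BLOCK_SIZE, len(block)))
    (d,) = struct.unpack_from("<e", block, 0)   # "e" is IEEE 754 half precision
    qs = block[2:]
    half = QK4_0 // 2
    out = [0.0] * QK4_0
    for j in range(half):
        out[j] = ((qs[j] & 0x0F) - 8) * d
        out[j + half] = ((qs[j] >> 4) - 8) * d
    return out


# Hand-built block: delta = 0.5, every byte has low nibble 9 and high nibble 7.
block = struct.pack("<e", 0.5) + bytes([0x79] * 16)
values = dequantize_q4_0_block(block)
```

The scale travels inside the block itself, so decoding needs only the tensor's type and its raw buffer, with no separate quantization parameter table.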

Addition:
@glistening's comment in #13693 (comment)

Is the prefix Circle necessary to avoid name conflicts in the flatbuffers-generated files? I guess GGMLBlockQuantization may be better, as @jinevening suggested offline. It makes it clear what CircleBlockQuantization means.

jinevening · 2024-08-28T01:46:35Z

How about adding a new dtype (QK4_0, etc) rather than extending CircleQuantParam? If parameters are saved with weights, we may not need additional data structure for qparam.

Why?

Easy interpretation: We can identify the new quantized tensors simply by their dtype (no need to look at the quantparam). Also, it is a bit difficult to know that Q4_0 is U4 (not S4) and Q8_0 is S8 (not U8).
Better SW design: CircleQuantParam keeps a single responsibility. It is used only for affine quantization.
Reduce side effect: CircleQuantParam is used in many places, so I'd like to minimize side effects.

hseok-oh · 2024-08-28T06:14:29Z

@jinevening I've updated circle schema based on your comment

enum TensorType : byte {
  UINT4 = -1,
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
  Q4_0 = 18,
  Q4_1 = 19,
  Q8_0 = 20,
  Q8_1 = 21,
}

There is no issue on the runtime side with using this spec.
@seanshpark @mhs4670go Is it OK to use these types in the compiler?

seanshpark · 2024-08-28T06:25:48Z

UINT4 = -1 was added because it does not exist in tflite. Do the new Qx_y types exist in tflite?

hseok-oh · 2024-08-28T07:10:15Z

UINT4 = -1 was added because it does not exist in tflite. Do the new Qx_y types exist in tflite?

No. I'll update them to use negative values.

hseok-oh · 2024-08-28T07:44:21Z

Updated

// The type of data stored in a tensor.
// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // The entries below use negative values for TensorTypes that do not exist in the TensorFlow Lite schema
  UINT4 = -1,
  Q4_0 = -2,
  Q4_1 = -3,
  Q8_0 = -4,
  Q8_1 = -5,
}
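
With the type carried by TensorType and the scales stored inside the data, the expected buffer size follows from the dtype and the element count alone. The Python sketch below shows that arithmetic for three of the draft types. It assumes the ggml layouts in which the per-block scale (and minimum, for Q4_1) is stored as fp16; the helper `block_quantized_nbytes` is hypothetical, and Q8_1 is left out to keep the sketch short.

```python
# Draft enum values from the schema above.
Q4_0, Q4_1, Q8_0 = -2, -3, -4

# (elements per block, bytes per block), under the fp16-scale assumption.
BLOCK_LAYOUT = {
    Q4_0: (32, 2 + 16),      # fp16 d + 32 packed nibbles
    Q4_1: (32, 2 + 2 + 16),  # fp16 d + fp16 m + 32 packed nibbles
    Q8_0: (32, 2 + 32),      # fp16 d + 32 int8 values
}


def block_quantized_nbytes(tensor_type: int, num_elements: int) -> int:
    """Hypothetical helper: expected buffer size of a block-quantized tensor."""
    elems, nbytes = BLOCK_LAYOUT[tensor_type]
    if num_elements % elems != 0:
        raise ValueError("element count must be a multiple of the block size")
    return (num_elements // elems) * nbytes


# A 4096 x 4096 weight matrix stored as Q4_0.
size_q4_0 = block_quantized_nbytes(Q4_0, 4096 * 4096)
```

A loader could use a check of this kind to validate a buffer before handing it to a kernel, since a size mismatch would indicate a wrong dtype or a truncated file.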

seanshpark · 2024-08-28T07:53:25Z

The negative-value items are placed at the end. Does the generated header code have any problem with that?

hseok-oh · 2024-08-28T08:25:14Z

The negative-value items are placed at the end. Does the generated header code have any problem with that?

No problem. I checked generated header code.

hseok-oh · 2024-08-29T01:08:21Z

If there are no more opinions, I'll first update the generated header file for the runtime (runtime/libs/circle-schema/include/circle_schema_generated.h) based on this schema.
IMO, we can update the schema file, together with a schema version bump, after the 1.29.0 release (#13796) is finished.

glistening · 2024-08-29T01:09:22Z

I've just found @jinevening's suggestion. I think we need a prefix before Q4_0 (e.g. BLK_Q4_0 or GGML_Q4_0). Without a prefix, it may be mistaken for simple affine quantization.

(ADD)

I've found the comment on Q4_0, ... at the top.

// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {

It would be better to move the comment to immediately before Q4_0, ...
Still, personally I prefer more specific names over a comment.

However, if others are OK with it, I won't oppose.

glistening · 2024-08-29T01:27:08Z

@jinevening

Reduce side effect: CircleQuantParam is used in many places, so I'd like to minimize side effects.

What do you mean by CircleQuantParam?

Assuming you mean QuantizationParameters in the circle schema: if something goes wrong because of QuantizationDetails, that means the code has a bug. It should check whether QuantizationDetails is null or not.

  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;

I think using QuantizationDetails is not a problem. But as @jinevening suggested, if it only has a name, we don't need to extend it. I agree with adding only the TensorType values.

hseok-oh · 2024-08-29T01:33:45Z

I think we agree to use new TensorType values for ggml block quantization, so as the next step I'll update the runtime's generated header file.
We can still change the type names until the circle schema version is bumped. That causes no implementation issue as long as we don't change the enum's actual values, because a flatbuffers file does not store the enum's name string.

It may even be OK to change an enum name after release, because the name is used only for printing.
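
The point about names versus values can be illustrated with a small Python sketch. It models only the byte-level encoding implied by `enum TensorType : byte` and does not use the FlatBuffers library; the class `RenamedTensorType` and the name `GGML_Q4_0` are hypothetical, taken from the prefix suggested in this thread.

```python
import struct
from enum import IntEnum


class TensorType(IntEnum):
    """A subset of the draft enum; values must fit in a signed byte."""
    UINT4 = -1
    Q4_0 = -2


class RenamedTensorType(IntEnum):
    """Same values with a hypothetical prefixed name."""
    UINT4 = -1
    GGML_Q4_0 = -2


# Only the numeric value is written: one signed byte.
stored = struct.pack("<b", TensorType.Q4_0)
(raw,) = struct.unpack("<b", stored)

# The same byte decodes under either naming; only the printed name differs.
old_name = TensorType(raw).name
new_name = RenamedTensorType(raw).name
```

Note that the byte must be read as signed: the stored value for -2 is 0xFE, which would read as 254 if interpreted as unsigned.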

jinevening · 2024-08-29T01:55:14Z

What do you mean by CircleQuantParam?

It refers to the existing C++ class in luci. It has been used only for affine quantization.

hseok-oh · 2024-09-09T02:21:49Z

Schema is updated.

hseok-oh added the type/discussion label Aug 22, 2024

hseok-oh added this to [ONE] onert - LLM support Aug 23, 2024

hseok-oh mentioned this issue Aug 26, 2024

PoC: Block weight quantize tool for LLM [skip ci] #13758

Draft

hseok-oh moved this to Ready to Start in [ONE] onert - LLM support Aug 27, 2024

hseok-oh changed the title ~~[nnpackage] Define chunk quantization type on circle format~~ [nnpackage] Define block quantization type on circle format Aug 27, 2024

hseok-oh moved this from Ready to Start to In Progress in [ONE] onert - LLM support Aug 29, 2024

hseok-oh self-assigned this Aug 29, 2024

This was referenced Aug 29, 2024

[onert] Support block quantization type on loader #13835

Closed

[onert] Update circle schema generated header for block quantization #13837

Merged

[onert] Support block quantization in CircleLoader #13884

Merged

hseok-oh added this to the ONERT LLM Milestone 1 milestone Sep 2, 2024

hseok-oh mentioned this issue Sep 4, 2024

[nnpackage] Update circle schema #13930

Merged

hseok-oh closed this as completed Sep 9, 2024

github-project-automation bot moved this from In Progress to Done in [ONE] onert - LLM support Sep 9, 2024
