[nnpackage] Define block quantization type on circle format #13743

Closed
hseok-oh opened this issue Aug 22, 2024 · 14 comments

@hseok-oh
Contributor

hseok-oh commented Aug 22, 2024

What?

Let's support a block quantization data type in the circle format to support LLM models.

Why?

To support LLM models, we need small-size weight quantization with small precision loss.
So we need to introduce chunk (block) quantization, such as the quantization types of ggml (llama.cpp).

To represent this, we need to extend the circle schema's QuantizationParameters table and/or QuantizationDetails union.

Related issue: #13742

@hseok-oh hseok-oh added the type/discussion We need discussion. Discussion itself can help. Even without conclusions! label Aug 22, 2024
@hseok-oh hseok-oh moved this to Ready to Start in [ONE] onert - LLM support Aug 27, 2024
@hseok-oh hseok-oh changed the title [nnpackage] Define chunk quantization type on circle format [nnpackage] Define block quantization type on circle format Aug 27, 2024
@hseok-oh
Contributor Author

Below is a schema draft to represent ggml's quantization types (block quantization).
Please share your opinion on it.
@seanshpark @mhs4670go @jinevening @chunseoklee @glistening @jyoungyun @ragmani

https://github.com/Samsung/ONE/pull/13758/files#diff-8b2942eef0fd7474ef49ec5245f9d288bd6d62c94ef20689c24edf07ce77c095

// Block quantization: from ggml quantization (https://github.com/ggerganov/ggml)
table CircleBlockQuantization {
  name:string;
}

// Represents a specific quantization technique's parameters.
union QuantizationDetails {
  CustomQuantization,
  CircleBlockQuantization
}

// Parameters for converting a quantized tensor back to float.
table QuantizationParameters {
  // These four parameters are the asymmetric linear quantization parameters.
  // Given a quantized value q, the corresponding float value f should be:
  //   f = scale * (q - zero_point)
  // For other quantization types, the QuantizationDetails below is used.
  // NOTE min/max values are valid if
  // 1. length of min/max == 0 or
  // 2. length of min/max == length of scale/zero_point
  // Otherwise, min/max are not valid (undefined behavior).
  min:[float];
  max:[float];
  scale:[float];  // For dequantizing the tensor's values.
  zero_point:[long];
  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;
  // Specifies the dimension of the Tensor's shape that the scales and
  // zero_points correspond to. For example, a tensor t, with dims=[4, 3, 2, 1]
  // with quantization params:
  //   scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantized_dimension=1
  // will be quantized across the second dimension of t.
  //   t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
  //   t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
  //   t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
  quantized_dimension:int;
}

This introduces a new table, CircleBlockQuantization, as a member of the QuantizationDetails union used by the details field. The details field has not been used so far, so this would be its first use. CircleBlockQuantization has a name field that holds the ggml quantization type name (e.g. Q4_0, Q4_1, Q8_0). If the details field has a value, the other QuantizationParameters fields are not used to decide the quantization type.
Quantization parameters such as scales are stored in the buffer together with the quantized values, so there is no schema field for the per-block deltas (scales). This is the same policy as ggml's quantization and dequantization.

Below is the Q4_0 block structure as stored in the buffer.
https://github.com/ggerganov/ggml/blob/2438d62cb9290b5b5dc6228dec76fe81cf64238e/src/ggml-common.h#L144-L149

#define QK4_0 32
typedef struct {
    ggml_half d;           // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2, "wrong q4_0 block size/padding");

(ggml_half: fp16)
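
To make the "scales live in the buffer" point concrete, here is a minimal sketch of dequantizing one Q4_0 block. It assumes ggml_half is the raw 16-bit IEEE 754 half pattern, that the low nibble of qs[j] holds element j and the high nibble holds element j + 16, and that each 4-bit value is offset by 8 before scaling. The helper names half_to_float and dequantize_block_q4_0 are hypothetical and are not taken from ONE or ggml.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef uint16_t ggml_half; /* assumption: raw IEEE 754 binary16 bits */

typedef struct {
    ggml_half d;           /* delta (per-block scale) */
    uint8_t qs[QK4_0 / 2]; /* 32 4-bit quants, two per byte */
} block_q4_0;

static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2,
              "wrong q4_0 block size/padding");

/* Hypothetical helper: convert binary16 bits to float. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign; /* signed zero */
        } else {
            /* subnormal half: shift until the implicit bit appears */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                ++shift;
            }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* normal */
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: expand one block into 32 floats. */
static void dequantize_block_q4_0(const block_q4_0 *b, float out[QK4_0]) {
    const float d = half_to_float(b->d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (b->qs[j] & 0x0F) - 8; /* element j */
        const int hi = (b->qs[j] >> 4) - 8;   /* element j + 16 */
        out[j] = (float)lo * d;
        out[j + QK4_0 / 2] = (float)hi * d;
    }
}
```

Because the delta sits inside each 18-byte block, a reader of the buffer needs only the quantization type to decode it, which is why the draft carries no scale field.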


Addition:
@glistening 's comment #13693 (comment)

Is the prefix Circle necessary to avoid name conflicts in the flatbuffers-generated files? I guess GGMLBlockQuantization may be better, as @jinevening suggested offline. It makes clear what CircleBlockQuantization means.

@jinevening
Contributor

How about adding new dtypes (QK4_0, etc.) rather than extending CircleQuantParam? If the parameters are saved with the weights, we may not need an additional data structure for the qparam.

Why?

  1. Easy interpretation: We can identify the new quantized tensors simply by their dtype (no need to look at the quantparam). Also, it is a bit difficult to know that Q4_0 is U4 (not S4) and Q8_0 is S8 (not U8).
  2. Better SW design: CircleQuantParam will have a single responsibility. It is used only for affine quantization.
  3. Reduce side effects: CircleQuantParam is used in many places, so I'd like to minimize side effects.

@hseok-oh
Contributor Author

@jinevening I've updated the circle schema based on your comment:

enum TensorType : byte {
  UINT4 = -1,
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
  Q4_0 = 18,
  Q4_1 = 19,
  Q8_0 = 20,
  Q8_1 = 21,
}

There is no issue on the runtime side with using this spec.
@seanshpark @mhs4670go Is it OK to use these types in the compiler?

@seanshpark
Contributor

UINT4 = -1 was added as a type that is not in tflite. So, do the new Qx_y types exist in tflite?

@hseok-oh
Contributor Author

UINT4 = -1 was added as a type that is not in tflite. So, do the new Qx_y types exist in tflite?

No. I'll update them to use negative values.

@hseok-oh
Contributor Author

Updated

// The type of data stored in a tensor.
// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // The entries below use negative values to represent TensorTypes that do not exist in the TensorFlow Lite schema
  UINT4 = -1,
  Q4_0 = -2,
  Q4_1 = -3,
  Q8_0 = -4,
  Q8_1 = -5,
}
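
With the type encoded in TensorType, a consumer can recognize a block-quantized tensor and size its buffer from the dtype alone. The following is a rough sketch under stated assumptions: the TT_ prefixed constants mirror only a subset of the draft values above, the helper names are hypothetical, and the byte sizes cover only Q4_0 (18 bytes per 32 elements, as in the struct quoted earlier) and Q8_0 (assumed to be one fp16 delta plus 32 int8 quants, per ggml). Q4_1 and Q8_1 are deliberately left out of the size table.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the draft TensorType values; the TT_ prefix is hypothetical. */
enum {
    TT_FLOAT32 = 0,
    TT_INT4 = 17,
    TT_UINT4 = -1,
    TT_Q4_0 = -2,
    TT_Q4_1 = -3,
    TT_Q8_0 = -4,
    TT_Q8_1 = -5
};

#define BLOCK_ELEMS 32 /* elements per block (QK4_0 in ggml) */

static_assert(BLOCK_ELEMS % 2 == 0, "two 4-bit quants are packed per byte");

/* True for the dtypes that follow the ggml block layout. */
static int is_ggml_block_type(int t) {
    return t == TT_Q4_0 || t == TT_Q4_1 || t == TT_Q8_0 || t == TT_Q8_1;
}

/* Bytes per block; 0 means "not covered by this sketch". */
static size_t block_bytes(int t) {
    switch (t) {
    case TT_Q4_0:
        return 2 + BLOCK_ELEMS / 2; /* fp16 delta + 16 bytes of nibbles */
    case TT_Q8_0:
        return 2 + BLOCK_ELEMS; /* fp16 delta + 32 int8 quants (assumed) */
    default:
        return 0;
    }
}

/* Buffer size for n elements, or 0 if n is not a whole number of blocks. */
static size_t buffer_bytes(int t, size_t n) {
    const size_t bb = block_bytes(t);
    if (bb == 0 || n % BLOCK_ELEMS != 0) {
        return 0;
    }
    return n / BLOCK_ELEMS * bb;
}
```

For example, a Q4_0 tensor with 64 elements would occupy two blocks, that is 36 bytes, without consulting any QuantizationParameters field.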

@seanshpark
Contributor

The negative-value items are placed at the end. Does the generated header code have any problem with that?

@hseok-oh
Contributor Author

hseok-oh commented Aug 28, 2024

The negative-value items are placed at the end. Does the generated header code have any problem with that?

No problem. I checked the generated header code.
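
As a side illustration of why negative entries are harmless for storage: a flatbuffers enum declared as byte is a signed 8-bit value, so -1 through -5 fit and round-trip through a single stored byte. The sketch below is not the generated header. The TT_ constants and both helper names are hypothetical, and it only models the byte-level encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Subset of the draft values; the TT_ prefix is hypothetical. */
enum { TT_INT4 = 17, TT_UINT4 = -1, TT_Q4_0 = -2, TT_Q8_1 = -5 };

static_assert(TT_Q8_1 >= INT8_MIN, "TensorType values must fit in a signed byte");

/* Encode: conversion to uint8_t is well defined (modulo 256). */
static uint8_t tensor_type_to_byte(int t) {
    return (uint8_t)t;
}

/* Decode: map 128..255 back to -128..-1 without relying on
 * implementation-defined narrowing conversions. */
static int tensor_type_from_byte(uint8_t raw) {
    return raw >= 128 ? (int)raw - 256 : (int)raw;
}
```

Under this encoding Q4_0 is stored as the byte 0xFE, and the position of an entry in the enum declaration plays no role.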

@hseok-oh
Contributor Author

hseok-oh commented Aug 29, 2024

If there are no more opinions, I'll first update the generated header file for the runtime (runtime/libs/circle-schema/include/circle_schema_generated.h) based on this schema.
IMO, we can update the schema file together with a schema version bump after the 1.29.0 release (#13796) is finished.

@glistening
Contributor

glistening commented Aug 29, 2024

I've just found @jinevening's suggestion. I think we need a prefix before Q4_0 (e.g. BLK_Q4_0 or GGML_Q4_0). Without a prefix, it may be mistaken for simple affine quantization.

(ADD)

I've found the comment about Q4_0, ... at the top.

// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {

It would be better to move the comment to immediately before Q4_0, ...
Still, I personally prefer more specific names over a comment.

However, if others are OK with it, I won't oppose.

@glistening
Contributor

glistening commented Aug 29, 2024

@jinevening

  3. Reduce side effects: CircleQuantParam is used in many places, so I'd like to minimize side effects.

What do you mean by CircleQuantParam?

Assuming you mean QuantizationParameters in the circle schema: if something goes wrong because of QuantizationDetails, that means the code has a bug. It should check whether QuantizationDetails is null or not.

  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;

I think using QuantizationDetails is not a problem. But as @jinevening suggested, if it only holds a name, we don't need to extend it. I agree with adding only the TensorType values.

@hseok-oh hseok-oh moved this from Ready to Start to In Progress in [ONE] onert - LLM support Aug 29, 2024
@hseok-oh hseok-oh self-assigned this Aug 29, 2024
@hseok-oh
Contributor Author

hseok-oh commented Aug 29, 2024

I think we agree on using new TensorType values for ggml block quantization, so as the next step I'll update the runtime's generated header file.
We can still change the type names until the circle schema version is bumped. This causes no implementation issues as long as we don't change the enum's actual values, because a flatbuffers file does not store the enum's name strings.

And it will probably be OK to change the enum names even after the release, because the names are used only for printing.

@jinevening
Contributor

What do you mean by CircleQuantParam?

It refers to an existing C++ class in luci. It has been used for affine quantization only.

@hseok-oh
Contributor Author

hseok-oh commented Sep 9, 2024

Schema is updated.

@hseok-oh hseok-oh closed this as completed Sep 9, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in [ONE] onert - LLM support Sep 9, 2024