I am trying to apply SmoothQuant during W8A8 quantization of meta-llama/Llama-3.2-11B-Vision-Instruct, where I ignore all of the modules except for language_model. However, I find that it crashes when going through the vision model that I have chosen to ignore.

Error:
File "/home/mgoin/code/llm-compressor/src/llmcompressor/modifiers/smoothquant/base.py", line 276, in _apply_smoothing
self.scales_[mapping.smooth_name].max_channel_vals
KeyError: 'vision_model.transformer.layers.0.input_layernorm'
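From the error, it looks like SmoothQuant's mapping resolution still walks the vision tower even though it is in the ignore list: a generic layernorm pattern matches vision_model.transformer.layers.0.input_layernorm, but no calibration scales were ever collected for it, so the lookup fails. A quick check with plain PyTorch and re (nothing llm-compressor specific; the ".*input_layernorm" pattern is my assumption based on the LLaMA-style default mappings, and model is the Mllama model loaded in the repro below) shows such a pattern does hit the vision tower:

import re

# Assumed pattern: which modules would a generic ".*input_layernorm" regex
# (my guess at SmoothQuant's default smooth-layer pattern) resolve to?
smooth_pattern = re.compile(r".*input_layernorm")
vision_hits = [
    name
    for name, _ in model.named_modules()
    if smooth_pattern.match(name) and name.startswith("vision_model")
]
# Expect entries like 'vision_model.transformer.layers.0.input_layernorm',
# i.e. exactly the module name in the KeyError above.
print(vision_hits[:3])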
Code to trigger:
from datasets import load_dataset
from transformers import AutoTokenizer, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = model_class.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
processor = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 4
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": processor.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return processor(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)
print(ds)

# Configure algorithms. In this case, we:
#   * apply SmoothQuant to make the activations easier to quantize
#   * quantize the weights to int8 with GPTQ (static per channel)
#   * quantize the activations to int8 (dynamic per token)
# Note: set sequential_update: true in the recipe to reduce memory
ignore = ["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"]
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, ignore=ignore),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=ignore),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = processor("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(processor.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
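One workaround I have been considering but have not verified: scope the SmoothQuant mappings to the language model explicitly instead of relying on the defaults, so the smooth layers never resolve to vision_model modules. This sketch assumes SmoothQuantModifier's mappings argument takes [[balance_layer_regexes], smooth_layer_regex] pairs mirroring the library's LLaMA-style defaults:

# Untested sketch: restrict SmoothQuant to language_model layers only,
# reusing the `ignore` list defined in the repro above. The mapping format
# below is an assumption based on the library's default mappings.
language_model_mappings = [
    [
        ["re:language_model.*q_proj", "re:language_model.*k_proj", "re:language_model.*v_proj"],
        "re:language_model.*input_layernorm",
    ],
    [
        ["re:language_model.*gate_proj", "re:language_model.*up_proj"],
        "re:language_model.*post_attention_layernorm",
    ],
]

recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        ignore=ignore,
        mappings=language_model_mappings,
    ),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=ignore),
]

If the mapping resolution already respected the ignore list this should be a no-op, but if not it at least avoids mapping onto modules that never get calibration scales.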
@markurtz @dsikka @kylesayrs Hey, just wanted to follow up on this. I'm facing the same issue on my end. Is this issue supposed to be resolved, or is it currently unsupported?