Releases: intel/auto-round
Intel® auto-round v0.3.1 Release
Release Highlights:
New Features:
Full-Range Symmetric Quantization: We’ve introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths such as 2-bit (see the sketch after this list).
Command-Line Support: You can now quantize models from the command line: `auto-round --model xxx --format xxx`.
Enhanced AutoRound Format: The AutoRound format can now leverage the auto_awq kernel by exporting to the `auto_round:awq` format.
Default Export Format Change: The default export format is now `auto_round` instead of `auto_gptq`.
Multi-threaded Packing: Up to a 2x speedup in the packing phase.
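As a rough illustration, the features above can be combined through the Python API. A minimal sketch, assuming the public AutoRound API; the model choice is purely illustrative:

```python
# Sketch of v0.3.1 features: full-range symmetric quantization and the new
# default auto_round export format. Exact argument names are assumptions
# based on the public AutoRound API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # hypothetical small model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=2,          # low bit width, where full-range symmetric helps most
    group_size=128,
    sym=True,        # full-range symmetric quantization
)
autoround.quantize()

# auto_round is now the default export format; use "auto_round:awq" to
# target the auto_awq kernel instead.
autoround.save_quantized("./opt-125m-w2g128-sym", format="auto_round")
```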
Bug Fixes:
Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers 4.45.2.
Mutable Default Values Issue: Addressed problems related to mutable default values.
3-Bit Packing Bug: Fixed a 3-bit packing bug in the AutoGPTQ format.
Intel® auto-round v0.3 Release
Highlights:
- Broader Device Support:
  - Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue.
- New Recipes and Model Releases:
  - Published numerous recipes on the Low Bit Open LLM Leaderboard, showcasing impressive results on LLaMa 3.1 and other leading models.
- Experimental Features:
  - Introduced several experimental features, including activation quantization and `mx_fp`, with promising outcomes in AutoRound (see the sketch after this list).
- Multimodal Model Support:
  - Extended capabilities for tuning and inference across several multimodal models.
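Since activation quantization is still experimental in this release, the sketch below should be read as an assumption about the API rather than a stable recipe; the `act_bits` knob and the model name are illustrative:

```python
# Sketch of experimental activation quantization; act_bits is an assumption
# based on the public AutoRound API around this release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    act_bits=8,  # experimental: quantize activations alongside weights
)
autoround.quantize()
```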
Lowlights:
- Implemented support for `low_cpu_mem_usage`, the `auto_awq` export format, calibration dataset concatenation, and calibration datasets with chat templates (see the sketch below).
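A minimal sketch of the `low_cpu_mem_usage` and `auto_awq` paths; those two names come from these notes, while the surrounding arguments and the exact format string are assumptions:

```python
# Sketch of reduced-host-memory tuning plus auto_awq export; argument names
# beyond low_cpu_mem_usage are assumptions based on the public AutoRound API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    low_cpu_mem_usage=True,        # limit host RAM during tuning
    dataset="NeelNanda/pile-10k",  # default calibration set; local files also work
)
autoround.quantize()
autoround.save_quantized("./opt-125m-w4g128-awq", format="auto_awq")
```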
Intel® auto-round v0.2 Release
Overview
We added support for the Intel XPU format and implemented lm-head quantization and inference, reducing the model size from 5.4GB to 4.7GB for LLAMA3 at W4G128. We also added support for both local and mixed online datasets for calibration. By optimizing memory usage and tuning costs, the calibration process now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models with 512 samples when `disable_low_gpu_mem_usage` is set, as sketched below.
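A rough sketch of that calibration setup; `nsamples=512` mirrors the number quoted above, `low_gpu_mem_usage=False` corresponds to the `disable_low_gpu_mem_usage` setting from these notes, and the remaining arguments are assumptions based on the public AutoRound API:

```python
# Sketch of the 512-sample calibration run described above; the model is a
# small stand-in for the 7B/70B models quoted in the timing numbers.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=512,             # 512 calibration samples, as quoted above
    low_gpu_mem_usage=False,  # trade GPU memory for faster tuning
)
autoround.quantize()
```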
Others:
More accuracy data is presented in the [paper](https://arxiv.org/pdf/2309.05516) and on the [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard).
More technical details are available in the [paper](https://arxiv.org/pdf/2309.05516).
Known issues:
In some scenarios there is a large discrepancy between the GPTQ model and the QDQ model for asymmetric quantization. We are working on a fix.
Intel® auto-round v0.1 Release
Overview
AutoRound introduces an innovative weight-only quantization algorithm designed specifically for low-bit LLM inference, approaching near-lossless compression for a range of popular models, including gemma-7B, Mistral-7B, Mixtral-8x7B-v0.1, Mixtral-8x7B-Instruct-v0.1, Phi-2, LLAMA2, and more at W4G128. AutoRound consistently outperforms established methods across the majority of scenarios at W4G128, W4G-1, W3G128, and W2G128.
Key Features
- Wide Model Support: AutoRound caters to a diverse range of model families. About 20 model families have been verified.
- Export Flexibility: Effortlessly export quantized models to ITREX[1] and AutoGPTQ[2] formats for seamless deployment on Intel CPU and Nvidia GPU platforms, respectively (see the sketch after this list).
- Device Compatibility: Compatible with tuning devices including Intel CPUs, Intel Gaudi2, and Nvidia GPUs.
- Dataset Flexibility: AutoRound supports calibration with Pile10k and MBPP datasets, with easy extensibility to incorporate additional datasets.
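To illustrate the export flexibility above, a minimal sketch; the format strings and entry points follow the present-day AutoRound API and may differ from the exact v0.1 interfaces:

```python
# Sketch of W4G128 tuning followed by per-backend export; format strings are
# assumptions and may not match the exact v0.1 API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)  # W4G128
autoround.quantize()

# One export per target backend: ITREX for Intel CPUs, AutoGPTQ for Nvidia GPUs.
autoround.save_quantized("./opt-125m-itrex", format="itrex")
autoround.save_quantized("./opt-125m-gptq", format="auto_gptq")
```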
Examples
- Explore language modeling and code generation examples to unlock the full potential of AutoRound.
Additional Benefits
- Pre-Quantized Models: Access a variety of pre-quantized models on Hugging Face for immediate integration into your projects, with more models under review and coming soon.
- Comprehensive Accuracy Data: Simplified user deployment with extensive accuracy data provided.
Known issues:
- baichuan-inc/Baichuan2-13B-Chat has some issues; we will support it soon.
Reference:
[1] https://github.com/intel/intel-extension-for-transformers
[2] https://github.com/AutoGPTQ/AutoGPTQ