Scripts and tools for optimizing quantizations in llama.cpp with GGUF imatrices.

❯ llama-gguf-optimize v0.6

Optimize. Quantize. Perfect the Efficiency.

Built with the following tools and technologies:

Jupyter, Pydantic, YAML, SciPy, Python, Optuna, NumPy



Overview

Llama-gguf-optimize is the result of work and research on creating high-quality quantizations for multilingual models, specifically the salamandra series. With a focus on preserving language diversity, the project leverages llama.cpp's importance matrix approach to minimize quantization loss across distinct language domains. Existing importance matrix datasets often lack even basic multilingual support, so this project includes scripts to generate custom importance matrices and refined quantizations tailored to datasets beyond common sources like WikiText. While initially addressing quantization needs for 2B models, llama-gguf-optimize has grown into a broader resource, providing insights and tools for other researchers facing similar challenges.

It currently contains new scripts to help analyze dataset performance in imatrix-based quantization, the notes in on_kl-divergence-optimization.md (which summarize various llama.cpp discussions regarding imatrices), and an Iterative Quantization and Comparison usage guide, which will eventually detail the main flows envisioned for this project and currently details the primary one. It also contains the original notes that informed the search to create these datasets, along with the Jupyter notebooks used to generate the datasets and the quantized models. These notes are in some cases incorrect, but will eventually be updated with later insights. The notebooks will be generalized into scripts to help others through the process, regardless of the datasets they are using.

The imatrix_dataset.py script handles dataset sampling and importance-matrix generation for specific domains (for example, languages) in Hugging Face datasets. It can be customized via plugins in the src/imatrix_dataset directory, enabling flexible integration with various data sources, and it complements the quantization process by generating importance matrices that optimize models for specific languages and contexts.
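
As a rough illustration, a data-source plugin might look like the sketch below. The class, dataset, and method names here are hypothetical; the actual interface is defined by the plugin code in src/imatrix_dataset.

    # Hypothetical plugin sketch -- the real interface lives in src/imatrix_dataset.
    from datasets import load_dataset

    class WikipediaPlugin:
        """Illustrative data-source plugin yielding text samples for one language."""

        def __init__(self, lang: str = "ca", num_samples: int = 1000):
            self.lang = lang
            self.num_samples = num_samples

        def samples(self):
            # Stream a Hugging Face dataset and yield raw text for sampling.
            ds = load_dataset("wikimedia/wikipedia", f"20231101.{self.lang}",
                              split="train", streaming=True)
            for i, row in enumerate(ds):
                if i >= self.num_samples:
                    break
                yield row["text"]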

The quantize.py script facilitates the quantization of Large Language Models (LLMs) using llama.cpp's quantization tools. It provides a streamlined way to quantize models with various options, including specifying the quantization type, output directory, and base model. It also allows for perplexity measurement and summarization to assess the impact of quantization on model performance. This script is particularly useful for researchers and developers aiming to optimize LLMs for deployment in resource-constrained environments.

The best_bub.py script is a performance optimization tool developed to fine-tune batch (--batch) and ubatch (--ubatch) parameters for logit generation processes in llama.cpp, such as those in llama-perplexity or llama-imatrix. Using a fully Bayesian approach, this script explores runtime configurations tailored to your model’s context size and available memory resources, achieving notable time savings over default configurations (with a 33% improvement observed in the author's case). best_bub.py employs Optuna’s TPESampler for intelligent sampling and incorporates Bayesian metrics and a MedianPruner to refine trial evaluations based on runtime performance trends. This approach ensures optimal parameter selection while adapting to real-time memory constraints and model-specific behavior.
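
In spirit, the search resembles the Optuna setup below. This is a minimal sketch with a toy timing stand-in; the actual script's harness, Bayesian convergence metrics, and memory-aware constraints are not shown.

    import optuna

    def time_logit_generation(batch: int, ubatch: int) -> float:
        # Hypothetical stand-in: a real harness would time a llama.cpp logit
        # pass (e.g., llama-perplexity) at these batch/ubatch sizes.
        return 1.0 / batch + 0.001 * ubatch  # toy cost surface

    def objective(trial: optuna.Trial) -> float:
        batch = trial.suggest_int("batch", 128, 4096, log=True)
        ubatch = trial.suggest_int("ubatch", 32, batch, log=True)
        runtime = 0.0
        for step in range(10):  # repeated timing passes per trial
            runtime = time_logit_generation(batch, ubatch)
            trial.report(runtime, step)
            if trial.should_prune():  # MedianPruner cuts slow configurations early
                raise optuna.TrialPruned()
        return runtime

    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(),
        pruner=optuna.pruners.MedianPruner(n_warmup_steps=3),
    )
    study.optimize(objective, n_trials=50)
    print(study.best_params)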

The kl_d_bench.py script coordinates the generation and comparison of logits across models, running through the dataset one chunk at a time. By handling each chunk sequentially, it keeps storage needs low—requiring only enough space for the current and previous chunks—while ensuring consistency and smooth progress. kl_d_bench.py can easily pause and pick up where it left off. Though currently optimized for single-chunk processing, future updates could allow multi-chunk handling.
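
A minimal sketch of that loop, with an illustrative signature rather than kl_d_bench.py's actual API:

    from typing import Any, Callable, Iterable

    def run_chunked(
        chunks: Iterable[Any],
        generate: Callable[[Any, int], None],  # write baseline/target logits for a chunk
        compare: Callable[[int], None],        # append KL stats for that chunk
        free: Callable[[int], None],           # reclaim the chunk's logit storage
    ) -> None:
        """Illustrative kl_d_bench-style loop: disk use stays O(1) in chunks."""
        for idx, chunk in enumerate(chunks):
            generate(chunk, idx)
            compare(idx)
            free(idx)  # only the current and previous chunks remain on disk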

The generate_logits.py script is a specialized tool designed to generate and store logits efficiently from a llama.cpp model for large datasets. Built to overcome limitations of storage and memory efficiency in previous tools, it uses the HDF5 format for compact storage and a reuse list to manage processed chunks, making it resumable and able to operate within limited disk space. These optimizations make generate_logits.py particularly useful for quantization analysis and similar downstream tasks that require consistent, efficient logit generation for models with large vocabulary sizes or extensive datasets.
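
The storage pattern might be sketched as below; the dataset names, dtype, and completion marker are illustrative assumptions rather than the script's actual HDF5 layout (which additionally tracks a reuse list of freed chunks).

    import h5py
    import numpy as np

    def write_chunk(path: str, chunk_idx: int, logits: np.ndarray, n_chunks: int) -> None:
        """Write one chunk's logits, compressed, and mark it complete for resuming."""
        with h5py.File(path, "a") as f:
            if "logits" not in f:
                f.create_dataset("logits", shape=(n_chunks, *logits.shape),
                                 dtype="float32", compression="gzip")
                f.create_dataset("processed", shape=(n_chunks,), dtype=bool)
            f["logits"][chunk_idx] = logits
            f["processed"][chunk_idx] = True  # a rerun skips chunks marked here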

The compare_logits.py script is a specialized tool for comparing sets of logits on a chunk-by-chunk basis, providing detailed KL-divergence metrics essential for quantization analysis. By calculating statistics such as median, standard deviation, and specific percentiles (e.g., 90th, 95th, 99th) for each chunk, it highlights outliers where quantization diverges most from baseline. These metrics, stored in an HDF5 format for efficient storage and resumability, can support the evaluation and calibration of quantization quality, particularly for fine-tuning dataset importance.
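
The per-chunk statistics amount to a few lines of NumPy/SciPy. The following reproduces the math in a sketch, not the script itself:

    import numpy as np
    from scipy.special import log_softmax

    def chunk_kl_stats(baseline_logits: np.ndarray, target_logits: np.ndarray) -> dict:
        """Per-token KL divergence for one chunk, summarized by the statistics above."""
        logp = log_softmax(baseline_logits, axis=-1)  # log-softmax for stability
        logq = log_softmax(target_logits, axis=-1)
        kl = np.sum(np.exp(logp) * (logp - logq), axis=-1)  # KL per token
        return {
            "median": float(np.median(kl)),
            "std": float(np.std(kl)),
            **{f"p{q}": float(np.percentile(kl, q)) for q in (90, 95, 99)},
        }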

(Video walkthrough: usage-guide.notebooklm.webm)

Features

  • ⚙️ Architecture: The project uses a modular src directory with scripts for quantization, model optimization, and logging. It adheres to Python best practices and leverages external libraries for machine-learning tasks.
  • 🔩 Code Quality: High-quality code (continually improving) maintained through static type checking (py.typed), documentation, and consistent use of tools like Optuna and PyTorch for optimization and model execution.
  • 📄 Documentation: Comprehensive documentation is available, along with configuration details in pyproject.toml and requirements.txt. Additional markdown files provide insight into the repository's goals and methodologies.
  • 🔌 Integrations: Key integrations with machine-learning libraries (PyTorch, NumPy), optimization tools (Optuna), and data-handling modules (HDF5). External dependencies are well managed and specified in requirements.txt.
  • 🧩 Modularity: The codebase is highly modular, with functionality split across scripts in the src directory. Core functions for quantization and logging live in dedicated files (library.py, gguf_optimize_logging.py), enhancing reuse.
  • 🧪 Testing: Unit tests (run via unittest) validate functionality such as the KL-divergence calculations in compare_logits.py.
  • ⚡️ Performance: Optimized for performance and memory usage, with logging configuration that adjusts dynamically to runtime needs (gguf_optimize_logging.py).
  • 🛡️ Security: No explicit security measures are in place, but versioning and static type checking enhance maintainability and reliability, indirectly supporting secure coding practices.
  • 🔗 Dependencies: Managed through pyproject.toml and listed in requirements.txt, including PyTorch for deep learning, NumPy for numerical computation, HDF5 for dataset handling, and Optuna for optimization.

Compare logits script

The compare_logits.py script calculates KL-divergence between two models' logits to evaluate differences in their predictions. This analysis supports model comparison and benchmarking, especially in the context of quantization.
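
Concretely, the quantity computed per token is the KL divergence between the baseline distribution $P$ and the quantized model's distribution $Q$, evaluated from log-softmax values for numerical stability:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \left( \log p_i - \log q_i \right)$$

These per-token values are then aggregated into chunk-level statistics.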

Core Features

  • Chunk-wise KL-Divergence Analysis:

    • Computes KL-divergence with numerical stability through log-softmax probabilities.
    • Supports chunk-based processing for large datasets.
  • Early Stopping:

    • Dynamically estimates the minimum dataset size needed for robust comparisons.
    • Employs Bayesian prior updates and Beta distribution modeling to determine stopping points.
    • Uses the Kuiper test to assess the statistical significance of observed differences.
  • Efficient and Resumable:

    • Resumable processing with support for specifying a starting chunk.
    • Outputs chunk-specific and cumulative statistics in HDF5 format for detailed analysis.

For a detailed explanation of the methodology and statistical foundations, see the compare_logits_specification.md document.
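
To give the flavor of the stopping rule, here is a simplified, self-contained sketch. It models each chunk as a Bernoulli observation of whether that chunk materially changed the running estimate and maintains a Beta posterior over the change rate; the script's actual criterion is more involved (see the specification), incorporating the Kuiper test and EMA smoothing.

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(0)
    # Simulated observation stream: True when a chunk still moves the estimate.
    observations = rng.random(200) < 0.02  # a mostly-converged run

    a, b = 1.0, 1.0  # uniform Beta(1, 1) prior
    for n, changed in enumerate(observations, start=1):
        a += changed
        b += 1 - changed
        # Upper edge of a one-sided 95% credible interval on the change rate
        if beta.ppf(0.95, a, b) < 0.05:
            print(f"early stop after {n} chunks")
            break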


Modules

[root]
File Summary
requirements.txt Lists essential external libraries, ensuring a consistent development environment across setups. It highlights dependencies crucial for model optimization and data processing, supporting the repository's focus on advanced quantization techniques and optimization benchmarks.
pyproject.toml Defines project metadata and dependencies for llama-gguf-optimize, ensuring compatible Python version and listing essential packages for machine learning tasks. Streamlines building and managing project versions with Hatch tooling support.
src
File Summary
version.py Defines version for package management.
library.py Custom library tag used in Python packages to indicate the library's source, facilitating identification and tracking of dependencies within the repository's ecosystem.
compare_logits.py Compares logits from two models using KL-divergence. Processes data in chunks with resumability, detailed statistical outputs, and optional early stopping to optimize comparisons.
gguf_optimize_model_fns.py Estimates model parameters and precision for optimization within repository architecture. Utilizes metadata for parameter estimation and calculates bits per weight to assess model efficiency, logging critical information for debugging and verification.
quantize.py Quantizes models using llama.cpp's quantization tools, with options for quantization type, output directory, and base model; also supports perplexity measurement and summarization.
generate_logits.py Generates and stores logits from a model over a dataset in HDF5 format. Supports resumability, context size verification, compression, and enhanced token processing.
gguf_optimize_logging.py Configures logging for library operations, setting up message formats and output levels to standard out, facilitating consistent logging across modules with versioning information included in debug mode outputs.
imatrix_dataset.py Manages dataset sampling and i-matrix generation with enhanced token balancing, chunking, and optional shuffling. It integrates plugin support (in src/imatrix_dataset) for handling different data sources, allowing for flexible dataset sampling and importance matrix generation.
kl_d_bench.py Orchestrates dataset processing to generate and compare model logits: manages dataset input, validates parameters and required arguments, enforces mutually exclusive options, sets the logging level, and executes the main function.
src/extras
File Summary
analyze_comparison_progress_from_logs.py Visualizes early stopping factors and projects progress when analyzing logs from compare_logits.py runs. Can export raw data or maintain a live update of graphs.
append_overall.py Helper script to compute and add the "overall" property to a comparison output file when interrupted or incomplete.
best_bub.py Automates the search for the best --batch and --ubatch configuration (dubbed BUB in this context) that maximizes inference speed. Key features: parameter tuning with Optuna for hyperparameter optimization; performance evaluation via llama_cpp and scientific computing libraries; and multiprocessing to distribute parameter tuning across processes, enhancing computational efficiency. Overall, it is a key component for efficiently optimizing model configurations for speed.
composite_comparsion.py (Archetypal script) Evaluates multiple comparisons (and quantizations) with a suggested metric. Provides chunk-by-chunk scores, overall KL-divergence curves, and a 3D manifold visualization.
read_kl_d_benchmarks.py Extracts and displays KL-divergence statistics from comparison HDF5 files, optionally filtering by chunk range or including overall metrics.
reshape_logits.py Reshapes large logit chunks into smaller, evenly divided chunks, useful for experimenting with settings over many chunks.
unfree.py Resets the freed_chunks dataset in HDF5 logit files, allowing interrupted processes to resume cleanly.
visualize_results.py Visualizes chunk-by-chunk KL-divergence outputs from a comparison as 3D manifolds, with options for debugging and sampling.

Getting Started

Prerequisites

  • Python: version 3.12 (versions 3.6 and above may also work)
  • Ensure the necessary libraries from requirements.txt are installed.

Installation

  1. Clone the repository:

    ❯ git clone <repository_url>
  2. Navigate to the project directory:

    ❯ cd <project_directory>
  3. Install the required dependencies (optional if using uv):

    ❯ pip install -r requirements.txt

Usage

Note: there is a usage guide!

Each script in llama-gguf-optimize can be run independently, offering a range of model optimization, logit generation, and comparison capabilities:

  • Data Sampling for Importance Matrix with imatrix_dataset.py
    The imatrix_dataset.py script generalizes the data sampling process, enabling the generation of importance matrices for specific languages. It supports custom data sources through plugins in src/imatrix_dataset.

    ❯ uv run src/imatrix_dataset.py --langs <languages> --num-samples <samples_count> --skip-samples <skip_count> --output <output_file>

    Additional options:

    ❯ uv run src/imatrix_dataset.py --help
  • Model quantization and perplexity analysis with quantize.py

    ❯ uv run src/quantize.py quantize --model-name <model_name> --base-model <path_to_model> --config <config_file>

    To measure perplexity:

    ❯ uv run src/quantize.py perplexity --base-model-name <model_name> --config <config_file> --dataset <ppl_test_data>

    See all options with: ❯ uv run src/quantize.py --help

  • Optimize Batch Sizes with best_bub.py
    Run the best_bub.py script to optimize batch (--batch) and ubatch (--ubatch) parameters:

    ❯ uv run src/extras/best_bub.py --model <path_to_model> --context-size <size> [optional model parameters]...

    For a full list of options, see:

    ❯ uv run src/extras/best_bub.py --help
  • Generate Logits with generate_logits.py
    Use generate_logits.py to generate and save logits to an HDF5 file:

    ❯ uv run src/generate_logits.py --model <path_to_model> --dataset <path_to_dataset> --output <output_file>

    Resumable processing is supported, with chunk management informed by the calculated context size and token requirements.

    For additional options:

    ❯ uv run src/generate_logits.py --help
  • Compare Logits with compare_logits.py
    The compare_logits.py script calculates KL-divergence between two HDF5 logit files:

    ❯ uv run src/compare_logits.py <baseline_file> <target_file> --output_file <output_file>

    Access more options with:

    ❯ uv run src/compare_logits.py --help
  • Orchestrate Logit Generation and Comparison with kl_d_bench.py
    Run kl_d_bench.py to manage logit generation and comparison in a synchronized workflow:

    ❯ uv run src/kl_d_bench.py --baseline-model <baseline_model> --target-model <target_model> --dataset <path_to_dataset> --output-file <output_file>

    For further options:

    ❯ uv run src/kl_d_bench.py --help

Running Tests

Execute the full test suite using:

❯ PYTHONPATH=src uv run --module unittest discover -s src/tests

Project Roadmap

  • [x] v0.1: best_bub.py script.
  • [x] v0.3: generation and comparison scripts.
  • [x] v0.5: KL-divergence comparison script.
  • [x] v0.5.n: Usage guides. Convert Jupyter notebooks to general scripts.
  • [x] v0.6: Add early-stopping prediction capability to compare_logits.
  • [ ] v0.6.n: Audit and update on_perplexity.md for citations and accuracy.
  • [ ] v0.7: Allow specifying ema-decay, and clamp values for $\theta_E$ and $\theta_P$ in early stopping.
  • [ ] v1.0: PyPI submission, GitHub Actions, changelog.

Contributing

Contributions are welcome! Here are several ways you can contribute:

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your own account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    ❯ git clone <your_fork_url>
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    ❯ git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    ❯ git commit -m 'Implemented new feature x.'
  6. Push to Your Fork: Push the changes to your forked repository.
    ❯ git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


License

This project is licensed under the GNU Lesser General Public License. For more details, refer to the LICENSE file.


