
Feature Request: Layer-wise (Mixed Precision) Quantization Configuration #3571

@lhpqaq

Description


Summary

It would be highly valuable if whisper.cpp supported mixed-precision, layer-wise quantization configuration, allowing users to specify different quantization types for different layers or tensors when converting a model. This would enable more flexible and better-optimized deployments that balance memory, accuracy, and performance against user and hardware requirements.

Motivation

  • Real-world edge deployments and research increasingly require fine-grained quantization strategies to optimize for specific accuracy, performance, and size constraints.
  • Currently, whisper.cpp supports applying only a single quantization type per model conversion. There is no straightforward way to assign different quantization types (e.g., Q8_0 for the encoder, Q4_0 for the decoder, FP16 for attention layers) to specific layers or tensor name patterns.
  • llama.cpp already provides a tensor_types-like configuration for flexible per-layer quantization control. Bringing similar options to whisper.cpp would enable hardware-aware and use-case-tailored deployment.

Proposed Solution

  • Provide a mechanism (such as a config file or a command-line flag) that lets users specify the desired quantization type per layer, per group of layers, or by regex/pattern match on tensor names (see the sketch after this list).
  • Default to the current unified quantization type when no mapping is provided, ensuring backward compatibility.
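
A minimal sketch of how such an override could be resolved during quantization, assuming an ordered list of user-supplied regex → type rules with the current single type as fallback. The rule format, type names, and tensor names below are illustrative assumptions, not existing whisper.cpp options or APIs; the real converter would use ggml's type enum.

```cpp
// Hypothetical sketch: resolve a per-tensor quantization override from an
// ordered list of regex -> type rules, falling back to the single default
// type (current behaviour) when no rule matches.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Stand-in for ggml's type enum, to keep the sketch self-contained.
enum class qtype { F16, Q8_0, Q5_0, Q4_0 };

struct qrule {
    std::regex pattern; // matched against the tensor name
    qtype      type;    // quantization type to apply on a match
};

// First matching rule wins; unmatched tensors keep the default type.
static qtype resolve_qtype(const std::string & tensor_name,
                           const std::vector<qrule> & rules,
                           qtype default_type) {
    for (const auto & rule : rules) {
        if (std::regex_search(tensor_name, rule.pattern)) {
            return rule.type;
        }
    }
    return default_type;
}

int main() {
    // Example mapping: keep attention projections in FP16, the rest of the
    // encoder at Q8_0, and the rest of the decoder at Q4_0. Tensor names are
    // illustrative of whisper-style naming, not an exact list.
    const std::vector<qrule> rules = {
        { std::regex("attn.*\\.(query|key|value)\\.weight"), qtype::F16  },
        { std::regex("^encoder\\."),                         qtype::Q8_0 },
        { std::regex("^decoder\\."),                         qtype::Q4_0 },
    };

    const char * names[] = {
        "encoder.blocks.0.mlp.0.weight",
        "decoder.blocks.3.attn.query.weight",
        "decoder.blocks.3.mlp.0.weight",
    };
    for (const char * name : names) {
        const qtype t = resolve_qtype(name, rules, /*default_type=*/qtype::Q5_0);
        std::printf("%s -> %d\n", name, static_cast<int>(t));
    }
    return 0;
}
```

The same rule list could be populated from a config file or repeated command-line flags; since unmatched tensors fall back to the default type, the existing single-type workflow is unchanged.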

Thank you for considering this improvement to whisper.cpp!

Do you think this is feasible? If so, I can try to implement it myself. Do you have any suggestions or advice?
