Summary
It would be highly valuable if whisper.cpp supported mixed-precision, layer-wise quantization configuration, allowing users to specify different quantization types for different layers or tensors when converting models. This would enable more flexible and optimized deployments, balancing memory, accuracy, and performance against user and hardware requirements.
Motivation
- Real-world edge deployments and research increasingly require fine-grained quantization strategies to optimize for specific accuracy, performance, and size constraints.
- Currently, whisper.cpp applies a single quantization type across the whole model during conversion. There is no straightforward way to assign different quantization types (e.g., Q8_0 for the encoder, Q4_0 for the decoder, FP16 for attention layers) to specific layers or tensor-name patterns.
- llama.cpp already provides a tensor_types-like configuration for flexible per-layer quantization control. Bringing similar options to whisper.cpp would enable hardware-aware, use-case-tailored deployments.
Proposed Solution
- Provide a mechanism (such as a config file or a command-line flag) for users to specify the desired quantization type for each layer, for a group of layers, or by regex/pattern match on tensor names (see the sketch after this list).
- Default to the current unified quantization type when no mapping is provided, ensuring backward compatibility.
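To make the mapping concrete, below is a minimal sketch of what a per-tensor type resolver could look like. To be clear, none of this exists in whisper.cpp today: quant_rule, resolve_tensor_type, the rule syntax, and the example tensor names are all hypothetical; only ggml_type and ggml_type_name are real ggml API. The idea would be for the quantization loop to call a resolver like this once per tensor instead of using one global target type.

```cpp
// Hypothetical sketch only -- not existing whisper.cpp code.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

#include "ggml.h" // ggml_type, ggml_type_name

// One user-supplied rule: tensor names matching `pattern` are quantized
// to `type`. Rules could come from a config file or from repeated CLI
// flags, e.g. --tensor-type "^decoder\.=q4_0" (syntax is illustrative).
struct quant_rule {
    std::regex pattern;
    ggml_type  type;
};

// First matching rule wins; with an empty rule list every tensor falls
// back to `default_type`, which preserves today's single-type behavior.
static ggml_type resolve_tensor_type(
        const std::string             & tensor_name,
        const std::vector<quant_rule> & rules,
        ggml_type                       default_type) {
    for (const auto & rule : rules) {
        if (std::regex_search(tensor_name, rule.pattern)) {
            return rule.type;
        }
    }
    return default_type;
}

int main() {
    // Example policy from the motivation above: keep attention weights
    // in FP16, quantize the rest of the decoder harder than the encoder.
    const std::vector<quant_rule> rules = {
        { std::regex(R"(\.attn.*\.weight$)"), GGML_TYPE_F16  },
        { std::regex(R"(^decoder\.)"),        GGML_TYPE_Q4_0 },
        { std::regex(R"(^encoder\.)"),        GGML_TYPE_Q8_0 },
    };

    // Illustrative tensor names in the spirit of whisper.cpp's naming.
    const char * names[] = {
        "encoder.blocks.0.mlp.0.weight",
        "decoder.blocks.3.attn.query.weight",
        "decoder.blocks.3.mlp.0.weight",
    };

    for (const char * name : names) {
        const ggml_type type = resolve_tensor_type(name, rules, GGML_TYPE_Q5_0);
        printf("%-40s -> %s\n", name, ggml_type_name(type));
    }
    return 0;
}
```

Ordering rules so the most specific pattern comes first keeps the semantics easy to reason about, and an empty rule list degrades exactly to the current single-type behavior, which covers the backward-compatibility point above.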
Thank you for considering this improvement to whisper.cpp!
Do you think this is feasible? If so, I can try to implement it myself. Do you have any suggestions or advice?