Conversation

@tingqli tingqli commented Nov 15, 2025

The KV projection in cross-attention is computed at every decoding step, which is redundant since encoder_outputs does not change during the decoding phase. This PR adds a simple caching mechanism in cross-attention to avoid recomputing it. In my test case (batch-size=32, beam-size=3, audio-length=20s), end-to-end latency dropped from 20 seconds to 6.1 seconds on an H20.
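The idea can be sketched as follows: since the keys and values of cross-attention depend only on encoder_outputs, they can be projected once on the first decoding step and reused afterward. This is a minimal, hypothetical single-head illustration (class and method names are my own, not the PR's code), not the actual implementation:

```python
import math
import torch
import torch.nn as nn


class CachedCrossAttention(nn.Module):
    """Single-head cross-attention that caches the K/V projections of
    encoder_outputs across decoding steps (illustrative sketch only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self._kv_cache = None  # (k, v) for the current encoder_outputs

    def reset_cache(self):
        # Call once per new utterance / batch before decoding starts.
        self._kv_cache = None

    def forward(self, query: torch.Tensor, encoder_outputs: torch.Tensor):
        # Project encoder outputs to K/V only on the first decoding step;
        # encoder_outputs is constant for the whole decode, so reuse after.
        if self._kv_cache is None:
            self._kv_cache = (self.k_proj(encoder_outputs),
                              self.v_proj(encoder_outputs))
        k, v = self._kv_cache
        q = self.q_proj(query)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v
```

With T decoding steps, this reduces the K/V projection cost from O(T) matmuls over the encoder sequence to O(1), which matters most for long audio and large beam sizes.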

@kaituoxu
Collaborator

Thanks for your PR; we will review it.
