[docs] kernels#13139

Open
stevhliu wants to merge 2 commits into huggingface:main from stevhliu:kernels

Conversation

@stevhliu
Member

Adds a kernels section to the Accelerate inference docs:

  • cross-links to the Attention backends docs, which demonstrate support for loading attention kernels with set_attention_backend
  • defers to the blog post and pipeline integration guide for details on implementing non-attention kernels, since this is more involved and already well-documented there

@stevhliu stevhliu requested a review from sayakpaul February 13, 2026 17:07
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member) left a comment


Thanks a lot for prioritizing it.


[Kernels](https://huggingface.co/docs/kernels/index) is a library for building, distributing, and loading optimized compute kernels on the [Hub](https://huggingface.co/kernels-community). It supports [attention](./attention_backends#set_attention_backend) kernels and custom CUDA kernels for operations like RMSNorm.
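
For example, a Hub-backed attention kernel can be enabled with a single call. A minimal sketch, assuming the checkpoint is illustrative and that the `"_flash_3_hub"` backend (FlashAttention-3 served from the Hub) is available for your hardware:

```py
import torch
from diffusers import DiffusionPipeline

# load a pipeline as usual (checkpoint is illustrative)
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# swap the transformer's attention implementation for a kernel
# downloaded from the Hub instead of a locally installed library
pipeline.transformer.set_attention_backend("_flash_3_hub")
```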

The [Diffusers Pipeline Integration](https://github.com/huggingface/kernels/blob/main/skills/cuda-kernels/references/diffusers-integration.md) guide shows how to integrate a kernel: create a custom optimized attention processor, patch all matching modules in the model, and inject the kernel into the pipeline.
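
Roughly, the flow looks like this. This is a sketch, not the guide's exact code; it uses the `kernels-community/activation` example kernel from the kernels README, and the patching helper is hypothetical:

```py
import torch
from kernels import get_kernel

# download an optimized activation kernel from the Hub
activation = get_kernel("kernels-community/activation")

class HubGELU(torch.nn.Module):
    # drop-in module whose forward dispatches to the Hub kernel
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        activation.gelu_fast(out, x)
        return out

# hypothetical helper: walk the model and swap matching modules
def patch_gelu(model: torch.nn.Module) -> None:
    for module in model.modules():
        for name, child in module.named_children():
            if isinstance(child, torch.nn.GELU):
                setattr(module, name, HubGELU())

# call patch_gelu(pipeline.transformer) before running the pipeline
```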
Member


The kernel skill basically lets users get an agent to write custom kernels for a model and hardware. It's not specific to the attention processor; it covers other modules as well, such as RMSNorm. Should we make that clearer?

Member Author


lmk if this is clearer!

> [!TIP]
> Install the [add cuda-kernels](https://github.com/huggingface/kernels/blob/main/skills/cuda-kernels/SKILL.md) skill to teach Claude or Codex how to write a kernel. The [Custom kernels for all from Codex and Claude](https://huggingface.co/blog/custom-cuda-kernels-agent-skills) blog post covers this in more detail.

For example, a custom RMSNorm kernel with [torch.compile](#torchcompile) speeds up LTX-Video generation 1.43x on an H100.
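
Combining a patched kernel with torch.compile might look like this. A sketch under stated assumptions: the LTX-Video checkpoint is real, but the kernel patching step is the hypothetical helper from above and the compile settings are defaults:

```py
import torch
from diffusers import LTXPipeline

pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

# after patching in the custom kernel, compile the transformer
# so the kernel call is captured in the compiled graph
pipeline.transformer = torch.compile(pipeline.transformer)
```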
Member


It wasn't just RMSNorm; other modules were also implemented with custom kernels.

Member Author


I mention RMSNorm as an example only for the benchmark results below.

@sayakpaul (Member) left a comment


Thanks! I would like to also see what @burtenshaw thinks about this.
