
Conversation

@miguelmartin75 commented Dec 17, 2025

What does this PR do?

This PR adds Cosmos Predict2.5 Base. It has been tested using the 2B model checkpoint; the official HF checkpoint is here. The converted checkpoints have yet to be uploaded to HF.

This change is largely based on the previous Predict1/Predict2 support done by @a-r-r-o-w

Testing:

  • Aside from the unit tests, the examples in the Cosmos Predict2.5 README have been checked against, and the inference pipeline produces similar outputs.
  • These are included in the docstring for the pipeline

Additions

Pipelines:

  • A base pipeline which handles all modes for the Predict2.5 Base checkpoint: Text2World, Image2World, Video2World
    • This pipeline loads the Reason1 checkpoint via Qwen2_5_VLForConditionalGeneration
    • It also supports image output when num_frames=1 is passed
    • Limitations: the pipeline assumes batch_size == 1
  • Three derivative pipelines are built on top of the base pipeline to handle the official modes: Text2World, Image2World, Video2World
  • Unit tests are added, based on the tests present for Cosmos Predict2

Scheduler:

  • FlowUniPCMultistepScheduler: a new scheduler that runs the UniPC algorithm on the EDM noise schedule (Karras sigmas), since Predict2.5 uses flow matching. The name can be changed.
  • This is done to match the predict2.5 codebase
  • Unit tests covering the basics are added. Note that the UniPC algorithm itself is not unit tested.

Model changes:

  • Modified CosmosTransformer to accept an optional cross-attention projection layer (used for text embeddings from Reason1)

Scripts:

  • Extended scripts/convert_cosmos_to_diffusers.py to support Predict2.5

Who can review?

@miguelmartin75 miguelmartin75 changed the title Cosmos/predict2.5 base pr ready Cosmos Predict2.5 Base Model Dec 17, 2025
@miguelmartin75 miguelmartin75 changed the title Cosmos Predict2.5 Base Model Cosmos Predict2.5 Base: inference pipeline, checkpoint conversion & scheduler Dec 17, 2025
@miguelmartin75 miguelmartin75 changed the title Cosmos Predict2.5 Base: inference pipeline, checkpoint conversion & scheduler Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion Dec 17, 2025
scheduler = FlowUniPCMultistepScheduler()

# NOTE: using Qwen2 VL instead for tests (reason1 is based on 2.5)
text_encoder = Qwen2VLForConditionalGeneration.from_pretrained(
Author

Is there an internal Qwen2_5_VL model to test with?

Member

Would something like this work?

text_encoder = Qwen2_5_VLForConditionalGeneration(config)

@sayakpaul (Member) left a comment

Thanks for the PR!

I left some comments. My major comments are on separating the pipelines from one another instead of inheriting from one another.

Let's also add docs?

hidden_size, patch_size[0] * patch_size[1] * patch_size[2] * out_channels, bias=False
)

self.use_crossattn_projection = use_crossattn_projection
Member

We don't have to assign the variables like this because register_to_config will have them included in the model.config:

Suggested change
self.use_crossattn_projection = use_crossattn_projection

else:
assert False

if self.use_crossattn_projection:
Member

Suggested change
if self.use_crossattn_projection:
if self.config.use_crossattn_projection:
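
The reviewer's point is that register_to_config captures constructor arguments onto model.config. A minimal sketch of that idea (illustrative only, not the actual diffusers implementation, which lives in ConfigMixin):

```python
# Illustrative sketch of the register_to_config pattern; the real
# diffusers decorator is more involved (handles positional args, etc.).
import functools
from types import SimpleNamespace

def register_to_config(init):
    @functools.wraps(init)
    def wrapper(self, **kwargs):
        # Capture constructor kwargs on self.config so the module body
        # can read self.config.<name> instead of storing each attribute.
        self.config = SimpleNamespace(**kwargs)
        init(self, **kwargs)
    return wrapper

class Transformer:
    @register_to_config
    def __init__(self, use_crossattn_projection=False, hidden_size=16):
        pass  # no `self.use_crossattn_projection = ...` needed

model = Transformer(use_crossattn_projection=True, hidden_size=32)
print(model.config.use_crossattn_projection)  # True
```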

"""


class Cosmos2_5_PredictBase(DiffusionPipeline):
Member

Suggested change
class Cosmos2_5_PredictBase(DiffusionPipeline):
class Cosmos2_5_PredictBasePipeline(DiffusionPipeline):

And elsewhere


return prompt_embeds, negative_prompt_embeds

def prepare_latents(
Member

Just checking if this is similar to any of the other existing Cosmos pipelines? If so, maybe this can be supplemented with a "# Copied from ..." comment?

self._current_timestep = None

if not output_type == "latent":
assert self.latents_mean is not None and self.latents_std is not None, (
Member

We avoid assert statements like this in the modules. If we really want to raise, let's raise Errors instead.

video = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]

assert self.safety_checker is not None
self.safety_checker.to(device)
Member

We don't have to do it in this PR, but we could have a little utility like run_safety_checker() inside the pipelines and copy it over all the cosmos pipelines that require it (much akin to encode_prompt(), for example).

But this is not merge-blocking.
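
Such a utility might look roughly like the following sketch. Everything here is hypothetical (the helper name run_safety_checker comes from the comment above; the checker interface, a callable returning a per-frame flag, is an assumption):

```python
# Hypothetical sketch of a run_safety_checker() helper; the checker
# interface (a callable returning a bool per frame) is an assumption.
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # stand-in for an image tensor

def run_safety_checker(
    frames: Sequence[Frame],
    checker: Callable[[Frame], bool],
) -> Tuple[List[Frame], List[bool]]:
    """Blank out frames the checker flags and report which were flagged,
    mirroring how other pipelines replace unsafe outputs."""
    flags = [checker(f) for f in frames]
    safe = [([0.0] * len(f) if bad else f) for f, bad in zip(frames, flags)]
    return safe, flags

# Usage with a trivial stub checker that flags all-negative frames:
frames = [[0.1, 0.2], [-1.0, -2.0]]
safe, flags = run_safety_checker(frames, lambda f: all(x < 0 for x in f))
print(flags)  # [False, True]
print(safe)   # [[0.1, 0.2], [0.0, 0.0]]
```

A helper like this could then be copied across the Cosmos pipelines via "# Copied from ..." comments, much like encode_prompt().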

return CosmosPipelineOutput(frames=video)


class Cosmos2_5_PredictText2World(Cosmos2_5_PredictBase):
Member

This feels quite odd. We usually maintain one pipeline class per pipeline file. That's been followed in the existing cosmos implementations as well:
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cosmos

We don't inherit from pipelines like Cosmos2_5_PredictText2World(Cosmos2_5_PredictBase), either. Instead, we use "# Copied from ..." (we will unbloat quite a bit of that in #12820, but that's a separate thing).

Collaborator

Could we follow the same structure as Cosmos2, with separate files for each pipeline?
https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/cosmos

Author

I will remove these pipelines

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu December 17, 2025 04:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu (Collaborator) left a comment

Thanks for the PR! I left some feedback. Mainly:

  1. Can we look into supporting the scheduler from the existing UniPCMultistepScheduler with the flow matching options (use_flow_sigmas, prediction_type="flow_prediction")? I'm ok adding this if it requires a lot of changes or just doesn't make sense to use the existing one, but wanted to check first.

  2. For the pipeline, can we combine the 3 pipelines into one Cosmos2_5PredictPipeline that inherits directly from DiffusionPipeline? The current design isn't how we typically structure pipelines in diffusers. This isn't ideal since our pipelines are normally task-based (text2image, image2video, etc.), but I have to admit it's getting increasingly difficult to keep that pattern without a huge portion of duplicated code. I think a single unified pipeline is reasonable here.

return sigmas, timesteps


class FlowUniPCMultistepScheduler(SchedulerMixin, ConfigMixin):
Collaborator

The existing UniPC multistep scheduler also supports flow matching: https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_unipc_multistep.py#L212
Do you think we will be able to reuse it?

@miguelmartin75 (Author) commented Dec 17, 2025

This class doesn't output the sigma values needed when setting use_flow_sigmas=True; basically, what I want is for use_karras_sigmas=True and use_flow_sigmas=True to work at the same time (with arguments for sigma_max and sigma_min).

After reading the code in the UniPC multistep scheduler, I have modified the class to support the above. I've updated the PR with this change (it's ~10 LOC); see the change here. Should I remove the FlowUniPC scheduler and keep this small change with an explicit test case for this setting?

Here are the outputs from a script for 35 steps (with the change applied to the branch). Without the change, setting both to True is equivalent to use_karras_sigmas=True, as that branch is hit first in the if/elif block:

FlowUniPCMultistepScheduler:
Timesteps (len=35)
tensor([995, 994, 993, 991, 990, 988, 986, 983, 980, 976, 971, 965, 957, 948,
        936, 920, 901, 877, 847, 809, 763, 708, 642, 568, 486, 402, 319, 243,
        176, 122,  81,  51,  31,  18,   9])
Sigmas: (len=36)
tensor([0.9950, 0.9942, 0.9932, 0.9920, 0.9905, 0.9887, 0.9865, 0.9839, 0.9806,
        0.9766, 0.9717, 0.9655, 0.9578, 0.9482, 0.9360, 0.9208, 0.9016, 0.8775,
        0.8473, 0.8098, 0.7637, 0.7081, 0.6427, 0.5683, 0.4870, 0.4025, 0.3195,
        0.2430, 0.1768, 0.1229, 0.0817, 0.0519, 0.0315, 0.0181, 0.0099, 0.0000])
UniPCMultistepScheduler(**{'use_flow_sigmas': True}):
Timesteps (len=35)
tensor([999, 970, 941, 913, 884, 856, 827, 799, 770, 742, 713, 685, 656, 627,
        599, 570, 542, 513, 485, 456, 428, 399, 371, 342, 313, 285, 256, 228,
        199, 171, 142, 114,  85,  57,  28])
Sigmas: (len=36)
tensor([0.9990, 0.9705, 0.9419, 0.9134, 0.8848, 0.8563, 0.8277, 0.7992, 0.7707,
        0.7421, 0.7136, 0.6850, 0.6565, 0.6279, 0.5994, 0.5709, 0.5423, 0.5138,
        0.4852, 0.4567, 0.4281, 0.3996, 0.3711, 0.3425, 0.3140, 0.2854, 0.2569,
        0.2283, 0.1998, 0.1713, 0.1427, 0.1142, 0.0856, 0.0571, 0.0285, 0.0000])
torch.allclose?:  False

UniPCMultistepScheduler(**{'use_karras_sigmas': True}):
Timesteps (len=35)
tensor([999, 999, 991, 975, 957, 939, 920, 900, 880, 858, 835, 810, 784, 757,
        727, 695, 661, 623, 583, 538, 488, 434, 375, 312, 248, 188, 136,  94,
         62,  39,  23,  12,   6,   2,   0])
Sigmas: (len=36)
tensor([2.0000e+02, 1.7084e+02, 1.4539e+02, 1.2327e+02, 1.0410e+02, 8.7544e+01,
        7.3297e+01, 6.1086e+01, 5.0662e+01, 4.1801e+01, 3.4303e+01, 2.7989e+01,
        2.2698e+01, 1.8289e+01, 1.4636e+01, 1.1626e+01, 9.1637e+00, 7.1625e+00,
        5.5482e+00, 4.2564e+00, 3.2315e+00, 2.4258e+00, 1.7988e+00, 1.3162e+00,
        9.4913e-01, 6.7355e-01, 4.6961e-01, 3.2106e-01, 2.1476e-01, 1.4017e-01,
        8.8994e-02, 5.4745e-02, 3.2478e-02, 1.8473e-02, 1.0000e-02, 0.0000e+00])
torch.allclose?:  False

UniPCMultistepScheduler(**{'use_karras_sigmas': True, 'use_flow_sigmas': True}):
Timesteps (len=35)
tensor([995, 994, 993, 991, 990, 988, 986, 983, 980, 976, 971, 965, 957, 948,
        936, 920, 901, 877, 847, 809, 763, 708, 642, 568, 486, 402, 319, 243,
        176, 122,  81,  51,  31,  18,   9])
Sigmas: (len=36)
tensor([0.9950, 0.9942, 0.9932, 0.9920, 0.9905, 0.9887, 0.9865, 0.9839, 0.9806,
        0.9766, 0.9717, 0.9655, 0.9578, 0.9482, 0.9360, 0.9208, 0.9016, 0.8775,
        0.8473, 0.8098, 0.7637, 0.7081, 0.6427, 0.5683, 0.4870, 0.4025, 0.3195,
        0.2430, 0.1768, 0.1229, 0.0817, 0.0519, 0.0315, 0.0181, 0.0099, 0.0000])
torch.allclose?:  True

Note: I have a bug in FlowUniPCMultistepScheduler where it outputs len=37 sigmas when I set num_inference_steps=35, so for comparison I called set_timesteps(num_inference_steps - 1).
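
The matching schedules above (FlowUniPCMultistepScheduler and UniPC with both flags) can be reproduced with a few lines of plain Python. This is a sketch under the assumption that the schedule uses standard Karras rho-spacing (rho=7.0) with sigma_max=200 and sigma_min=0.01, and maps each EDM sigma s to a flow sigma via s / (1 + s); it is not the diffusers API:

```python
# Sketch only; assumes rho=7.0, sigma_min=0.01, sigma_max=200.0 and the
# flow transform sigma_flow = sigma / (1 + sigma). The trailing 0.0 that
# diffusers appends to the sigma list is omitted here.
def karras_sigmas(n, sigma_min=0.01, sigma_max=200.0, rho=7.0):
    """EDM/Karras sigma spacing over n steps (descending)."""
    max_inv, min_inv = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    ramp = [i / (n - 1) for i in range(n)]
    return [(max_inv + r * (min_inv - max_inv)) ** rho for r in ramp]

def flow_sigmas(n, **kw):
    """Map EDM sigmas into [0, 1) flow-matching sigmas."""
    return [s / (1 + s) for s in karras_sigmas(n, **kw)]

sigmas = flow_sigmas(35)
timesteps = [int(s * 1000) for s in sigmas]
print(sigmas[0], sigmas[-1])  # ~0.9950 down to ~0.0099
print(timesteps[0])           # 995
```

Under these assumptions the endpoints match the tensors printed above (0.9950 ... 0.0099, timestep 995), which is consistent with the flow sigmas being Karras sigmas pushed through s / (1 + s).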


For additional context:

I created a new scheduler because the UniPC class was forked in the predict2.5 codebase (which I refactored). I did look into trying to reuse UniPC, but due to the settings above and the complexity of the class, I decided to just clean up this fork. In my opinion, the UniPC class is quite complex because:

  1. It is initialized with a VP-type noise schedule, but this schedule is not enforced for inference: https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_unipc_multistep.py#L251-L255
    • e.g. with use_flow_sigmas=True or use_karras_sigmas=True this constraint is not enforced; use_flow_sigmas=True is only applied in set_timesteps, which is used for inference
  2. betas and other parameters are not needed
  3. The EDM noise schedule can be achieved via use_karras_sigmas=True, but this is in conflict with use_flow_sigmas=True
  4. It has many parameters

In my opinion, a simpler design for this class would use composition to obtain the sigmas that UniPC operates on. The class already uses some composition, e.g. the solver_p argument overrides the predictor step, but it doesn't do the same for the sigmas/timesteps that UniPC runs on. That said, I do understand the philosophy behind diffusers; copying the code is preferred.
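
The composition idea could look roughly like this sketch (all names here, such as SigmaSchedule and ComposedUniPCScheduler, are hypothetical and not diffusers API): the noise schedule is supplied as a callable, and the solver only consumes sigmas.

```python
# Hypothetical design sketch; the solver is agnostic to how its sigmas
# are produced, so Karras/flow/linear spacing can be swapped freely.
from typing import Callable, List

SigmaSchedule = Callable[[int], List[float]]  # num_steps -> descending sigmas

def linear_flow_schedule(n: int) -> List[float]:
    """One possible schedule: evenly spaced flow sigmas in (0, 1)."""
    return [(n - i) / (n + 1) for i in range(n)]

class ComposedUniPCScheduler:
    """Solver shell that receives its sigma schedule via injection."""

    def __init__(self, schedule: SigmaSchedule):
        self.schedule = schedule
        self.sigmas: List[float] = []
        self.timesteps: List[int] = []

    def set_timesteps(self, num_inference_steps: int) -> None:
        # The schedule is injected, so changing the spacing never
        # touches the solver code; append the final 0.0 as diffusers does.
        self.sigmas = self.schedule(num_inference_steps) + [0.0]
        self.timesteps = [int(s * 1000) for s in self.sigmas[:-1]]

sched = ComposedUniPCScheduler(linear_flow_schedule)
sched.set_timesteps(35)
print(len(sched.sigmas))  # 36: one sigma per step plus the final 0.0
```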

)


class Cosmos2_5_PredictImage2World(Cosmos2_5_PredictBase):
Collaborator

All our pipelines should inherit from DiffusionPipeline, so in the future we could consider supporting Cosmos in Modular Diffusers: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview

For this PR, to get it in ASAP, I think the best option is to keep a single Cosmos2_5_PredictPipeline that handles all 3 use cases

Author

I will remove these three specialized pipelines; the base pipeline is more flexible and provides relevant default arguments for each mode in 2World

@miguelmartin75 miguelmartin75 force-pushed the cosmos/predict2.5-base-pr-ready branch from d969b9a to 94d2b41 Compare December 17, 2025 20:17