Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion #12852
Conversation
- New scheduler: scheduling_flow_unipc_multistep.py
- Changes to TransformerCosmos for text embeddings via crossattn_proj
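For context, a minimal sketch of what the crossattn_proj idea amounts to; the dimensions and variable names here are illustrative, not the PR's actual module:

```python
import torch
import torch.nn as nn

# Illustrative only: project text-encoder hidden states (e.g. from Qwen2.5-VL / Reason1)
# down to the transformer's cross-attention width before they are used as
# encoder_hidden_states. Dimensions are made up for the example.
text_hidden_dim, cross_attn_dim = 3584, 1024
crossattn_proj = nn.Linear(text_hidden_dim, cross_attn_dim, bias=False)

text_embeds = torch.randn(1, 512, text_hidden_dim)    # (batch, seq_len, text_hidden_dim)
encoder_hidden_states = crossattn_proj(text_embeds)   # (batch, seq_len, cross_attn_dim)
```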
```python
scheduler = FlowUniPCMultistepScheduler()

# NOTE: using Qwen2 VL instead for tests (reason1 is based on 2.5)
text_encoder = Qwen2VLForConditionalGeneration.from_pretrained(
```
Is there an internal Qwen2_5_VL model to test with?
Would something like this work?
```python
text_encoder = Qwen2_5_VLForConditionalGeneration(config)
```
sayakpaul left a comment:
Thanks for the PR!
I left some comments. My major comments are on separating the pipelines from one another instead of inheriting from one another.
Let's also add docs?
```python
    hidden_size, patch_size[0] * patch_size[1] * patch_size[2] * out_channels, bias=False
)

self.use_crossattn_projection = use_crossattn_projection
```
We don't have to assign the variables like this because register_to_config will have them included in the model.config:
```python
@register_to_config
```

```python
self.use_crossattn_projection = use_crossattn_projection
```
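For illustration, a minimal sketch (a hypothetical module, not the PR's transformer) of how `@register_to_config` exposes the init kwargs on `model.config`:

```python
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin


class TinyModel(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(self, use_crossattn_projection: bool = False):
        # No manual `self.use_crossattn_projection = ...` needed; the decorator
        # records the init kwargs on the config.
        super().__init__()


model = TinyModel(use_crossattn_projection=True)
print(model.config.use_crossattn_projection)  # True
```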
```python
else:
    assert False

if self.use_crossattn_projection:
```
```diff
- if self.use_crossattn_projection:
+ if self.config.use_crossattn_projection:
```
| """ | ||
|
|
||
|
|
||
| class Cosmos2_5_PredictBase(DiffusionPipeline): |
```diff
- class Cosmos2_5_PredictBase(DiffusionPipeline):
+ class Cosmos2_5_PredictBasePipeline(DiffusionPipeline):
```
And elsewhere
```python
return prompt_embeds, negative_prompt_embeds

def prepare_latents(
```
Just checking: is this similar to any of the other existing Cosmos pipelines? If so, maybe this can be supplemented with a "# Copied from ..." comment?
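For reference, this is roughly what the convention looks like; the source path and signature below are illustrative and would need to point at whichever existing Cosmos pipeline actually matches:

```python
# Copied from diffusers.pipelines.cosmos.pipeline_cosmos2_text2image.Cosmos2TextToImagePipeline.prepare_latents
def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
    ...
```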
```python
self._current_timestep = None

if not output_type == "latent":
    assert self.latents_mean is not None and self.latents_std is not None, (
```
We avoid assert statements like this in the modules. If we really want to raise, let's raise Errors instead.
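A small sketch of the suggested pattern; the helper name is made up, but it mirrors the check in the quoted snippet:

```python
def _check_latents_stats(latents_mean, latents_std):
    # Raise a proper error instead of using `assert`, so the check survives
    # `python -O` and produces an actionable message.
    if latents_mean is None or latents_std is None:
        raise ValueError(
            "`latents_mean` and `latents_std` must be set before decoding latents."
        )
```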
```python
video = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]

assert self.safety_checker is not None
self.safety_checker.to(device)
```
We don't have to do it in this PR, but we could have a little utility like run_safety_checker() inside the pipelines and copy it over all the cosmos pipelines that require it (much akin to encode_prompt(), for example).
But this is not merge-blocking.
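A rough sketch of what such a utility might look like; the method name comes from the comment above, while the body and the checker's call signature are assumptions:

```python
def run_safety_checker(self, frames, device):
    # Hypothetical helper, meant to be copied across the Cosmos pipelines that need it
    # (similar to how encode_prompt() is shared via "# Copied from ..." comments).
    if self.safety_checker is None:
        raise ValueError("A `safety_checker` must be loaded to run the safety check.")
    self.safety_checker.to(device)
    # The exact checker interface is assumed here; it should return filtered frames.
    return self.safety_checker(frames)
```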
```python
return CosmosPipelineOutput(frames=video)


class Cosmos2_5_PredictText2World(Cosmos2_5_PredictBase):
```
This feels quite odd. We usually maintain one pipeline class per pipeline file. That's been followed in the existing Cosmos implementations as well:
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cosmos
We don't inherit from pipelines like `Cosmos2_5_PredictText2World(Cosmos2_5_PredictBase)`, either. Instead, we use "# Copied from ..." comments (we will unbloat quite a bit of that in #12820, but that's a separate thing).
Could we follow the same structure as Cosmos2, with separate files for each pipeline?
https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/cosmos
I will remove these pipelines
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
yiyixuxu left a comment:
Thanks for the PR! I left some feedback. Mainly:

- Can we look into supporting the scheduler from the existing `UniPCMultistepScheduler` with the flow matching options (`use_flow_sigmas`, `prediction_type="flow_prediction"`)? I'm OK adding this if it requires a lot of changes or it just doesn't make sense to use the existing one, but wanted to check first. (A quick sketch follows this list.)
- For the pipeline, can we combine the 3 pipelines into one `Cosmos2_5PredictPipeline` that inherits directly from `DiffusionPipeline`? The current design isn't how we typically structure pipelines in diffusers. This isn't ideal since our pipelines are normally task-based (text2image, image2video, etc.), but I have to admit it's getting increasingly difficult to keep that pattern without a huge portion of duplicated code. I think a single unified pipeline is reasonable here.
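For reference, this is the kind of configuration of the existing scheduler being suggested in the first point; a sketch only, not necessarily the final set of kwargs the pipeline would need:

```python
from diffusers import UniPCMultistepScheduler

# Existing scheduler configured with its flow-matching options.
scheduler = UniPCMultistepScheduler(
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
)
scheduler.set_timesteps(num_inference_steps=35)
print(scheduler.timesteps)
print(scheduler.sigmas)
```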
```python
return sigmas, timesteps


class FlowUniPCMultistepScheduler(SchedulerMixin, ConfigMixin):
```
The existing UniPC multistep scheduler also supports flow matching: https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_unipc_multistep.py#L212
Do you think we will be able to reuse it?
This class doesn't output the sigma values needed when setting `use_flow_sigmas=True`; basically what I want is for `use_karras_sigmas=True` and `use_flow_sigmas=True` to work at the same time (with arguments for `sigma_max` and `sigma_min`).
After reading the code in the UniPC multistep scheduler, I have modified the class to support the above. I've updated the PR with this change (it's ~10 LOC) - see the change here. Should I remove the FlowUniPC scheduler and keep this small change with an explicit test case for this setting?
Here are the outputs from a script for 35 steps (with the change applied to the branch). Without the change, setting both to True is equivalent to `use_karras_sigmas=True`, as that branch is hit first in the if/elif block.
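A minimal sketch of the kind of script that produces the comparison output below; it only covers the existing `UniPCMultistepScheduler` configurations, since `FlowUniPCMultistepScheduler` is the class added in this branch:

```python
from diffusers import UniPCMultistepScheduler

num_inference_steps = 35
configs = [
    {"use_flow_sigmas": True},
    {"use_karras_sigmas": True},
    {"use_karras_sigmas": True, "use_flow_sigmas": True},
]

for kwargs in configs:
    scheduler = UniPCMultistepScheduler(**kwargs)
    scheduler.set_timesteps(num_inference_steps)
    print(f"UniPCMultistepScheduler(**{kwargs}):")
    print("Timesteps", scheduler.timesteps)
    print("Sigmas", scheduler.sigmas)
    # In the full comparison, the sigmas are then checked with torch.allclose
    # against the sigmas produced by the FlowUniPCMultistepScheduler from this branch.
```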
```text
FlowUniPCMultistepScheduler:
Timesteps (len=35)
tensor([995, 994, 993, 991, 990, 988, 986, 983, 980, 976, 971, 965, 957, 948,
        936, 920, 901, 877, 847, 809, 763, 708, 642, 568, 486, 402, 319, 243,
        176, 122,  81,  51,  31,  18,   9])
Sigmas (len=36)
tensor([0.9950, 0.9942, 0.9932, 0.9920, 0.9905, 0.9887, 0.9865, 0.9839, 0.9806,
        0.9766, 0.9717, 0.9655, 0.9578, 0.9482, 0.9360, 0.9208, 0.9016, 0.8775,
        0.8473, 0.8098, 0.7637, 0.7081, 0.6427, 0.5683, 0.4870, 0.4025, 0.3195,
        0.2430, 0.1768, 0.1229, 0.0817, 0.0519, 0.0315, 0.0181, 0.0099, 0.0000])

UniPCMultistepScheduler(**{'use_flow_sigmas': True}):
Timesteps (len=35)
tensor([999, 970, 941, 913, 884, 856, 827, 799, 770, 742, 713, 685, 656, 627,
        599, 570, 542, 513, 485, 456, 428, 399, 371, 342, 313, 285, 256, 228,
        199, 171, 142, 114,  85,  57,  28])
Sigmas (len=36)
tensor([0.9990, 0.9705, 0.9419, 0.9134, 0.8848, 0.8563, 0.8277, 0.7992, 0.7707,
        0.7421, 0.7136, 0.6850, 0.6565, 0.6279, 0.5994, 0.5709, 0.5423, 0.5138,
        0.4852, 0.4567, 0.4281, 0.3996, 0.3711, 0.3425, 0.3140, 0.2854, 0.2569,
        0.2283, 0.1998, 0.1713, 0.1427, 0.1142, 0.0856, 0.0571, 0.0285, 0.0000])
torch.allclose?: False

UniPCMultistepScheduler(**{'use_karras_sigmas': True}):
Timesteps (len=35)
tensor([999, 999, 991, 975, 957, 939, 920, 900, 880, 858, 835, 810, 784, 757,
        727, 695, 661, 623, 583, 538, 488, 434, 375, 312, 248, 188, 136,  94,
         62,  39,  23,  12,   6,   2,   0])
Sigmas (len=36)
tensor([2.0000e+02, 1.7084e+02, 1.4539e+02, 1.2327e+02, 1.0410e+02, 8.7544e+01,
        7.3297e+01, 6.1086e+01, 5.0662e+01, 4.1801e+01, 3.4303e+01, 2.7989e+01,
        2.2698e+01, 1.8289e+01, 1.4636e+01, 1.1626e+01, 9.1637e+00, 7.1625e+00,
        5.5482e+00, 4.2564e+00, 3.2315e+00, 2.4258e+00, 1.7988e+00, 1.3162e+00,
        9.4913e-01, 6.7355e-01, 4.6961e-01, 3.2106e-01, 2.1476e-01, 1.4017e-01,
        8.8994e-02, 5.4745e-02, 3.2478e-02, 1.8473e-02, 1.0000e-02, 0.0000e+00])
torch.allclose?: False

UniPCMultistepScheduler(**{'use_karras_sigmas': True, 'use_flow_sigmas': True}):
Timesteps (len=35)
tensor([995, 994, 993, 991, 990, 988, 986, 983, 980, 976, 971, 965, 957, 948,
        936, 920, 901, 877, 847, 809, 763, 708, 642, 568, 486, 402, 319, 243,
        176, 122,  81,  51,  31,  18,   9])
Sigmas (len=36)
tensor([0.9950, 0.9942, 0.9932, 0.9920, 0.9905, 0.9887, 0.9865, 0.9839, 0.9806,
        0.9766, 0.9717, 0.9655, 0.9578, 0.9482, 0.9360, 0.9208, 0.9016, 0.8775,
        0.8473, 0.8098, 0.7637, 0.7081, 0.6427, 0.5683, 0.4870, 0.4025, 0.3195,
        0.2430, 0.1768, 0.1229, 0.0817, 0.0519, 0.0315, 0.0181, 0.0099, 0.0000])
torch.allclose?: True
```
Note: I have a bug in FlowUniPCMultistepScheduler, as it outputs len=37 sigmas when I set num_inference_steps=35; so for the comparison above I called set_timesteps(num_inference_steps - 1).
For additional context:
I created a new scheduler because the UniPC class was forked in the predict2.5 codebase (which I refactored). I did look into trying to reuse UniPC, but due to the above settings and the complexity of the class, I decided to just clean up this fork. In my opinion, the UniPC class is quite complex because:

- It is initialized with a VP-type noise schedule, but this schedule is not enforced for inference: https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_unipc_multistep.py#L251-L255
  - e.g. with `use_flow_sigmas=True` or `use_karras_sigmas=True` this constraint is not enforced, i.e. `use_flow_sigmas=True` is only applied in `set_timesteps`, which is for inference
  - e.g. `betas` and other parameters are not needed
- The EDM noise schedule can be achieved via `use_karras_sigmas=True`, but this is in conflict with `use_flow_sigmas=True`
- Many parameters

In my opinion, a simpler design for this class could use composition to get the sigmas that UniPC operates on. The class does use some composition already, e.g. the argument `solver_p` overrides the predictor step, but it doesn't do so to get the sigmas/timesteps to perform UniPC on. But I do understand the philosophy behind diffusers: copying the code is preferred.
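To make the composition point concrete, a rough sketch of a standalone sigma provider that a UniPC update could consume; the Karras ramp and the sigma/(1+sigma) flow conversion are chosen to be consistent with the sigma values printed above (sigma_max=200, sigma_min=0.01), but the function itself is hypothetical:

```python
import torch

def karras_flow_sigmas(
    num_steps: int, sigma_min: float = 0.01, sigma_max: float = 200.0, rho: float = 7.0
) -> torch.Tensor:
    """Hypothetical standalone sigma provider: EDM/Karras ramp mapped to flow-matching sigmas."""
    ramp = torch.linspace(0, 1, num_steps)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    karras = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho  # descending sigma_max -> sigma_min
    return karras / (1 + karras)  # flow-matching sigmas in (0, 1)

print(karras_flow_sigmas(35)[:3])  # approx tensor([0.9950, 0.9942, 0.9932]) with the defaults above
```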
```python
)


class Cosmos2_5_PredictImage2World(Cosmos2_5_PredictBase):
```
All our pipelines should inherit from DiffusionPipeline, so in the future we could consider supporting Cosmos from Modular Diffusers: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview
For this PR, to get it in ASAP, I think the best option is to keep a single Cosmos2_5_PredictPipeline that handles all 3 use cases.
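As one possible shape for that, a hedged sketch of how a single pipeline could dispatch across the three use cases based on the conditioning it receives; the names are illustrative:

```python
def _infer_mode(image=None, video=None) -> str:
    # Illustrative dispatch for a unified pipeline: the mode follows from what
    # conditioning is provided rather than from separate task-specific classes.
    if video is not None:
        return "video2world"
    if image is not None:
        return "image2world"
    return "text2world"

print(_infer_mode())                 # "text2world"
print(_infer_mode(image="img.png"))  # "image2world"
```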
I will remove these three specialized pipelines; the base pipeline is more flexible, with relevant default arguments for each mode in 2World.
Force-pushed from d969b9a to 94d2b41.
What does this PR do?
This PR adds Cosmos Predict2.5 Base. It has been tested using the 2B model checkpoint; the official HF checkpoint is here. The converted checkpoints have yet to be uploaded to HF.
This change is largely based on the previous predict1/predict2 support done by @a-r-r-o-w.
Testing:
Additions
Pipelines:
- `Qwen2_5_VLForConditionalGeneration`
- `num_frames=1`
- `batch_size == 1`

Scheduler:
- `FlowUniPCMultistepScheduler`: a new scheduler that applies the EDM noise schedule (Karras sigmas) with the UniPC algorithm, since Predict2.5 uses flow matching. This name can be changed.

Model changes:
- `CosmosTransformer` now accepts an optional cross-attention projection layer (used for text embeddings from Reason1).

Scripts:
- `scripts/convert_cosmos_to_diffusers.py` updated to support Predict2.5.

Who can review?