### Describe the bug

Hello, when I run the training script provided in `examples/dreambooth` directly with the following command, I encounter the “Attempting to unscale FP16 gradients.” error:
```bash
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="tensorboard" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50
```
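For context on where the message comes from: PyTorch's AMP gradient scaler raises this error whenever the gradients it is asked to unscale belong to fp16 parameters. A minimal sketch (my own illustration, not code from the training script) that reproduces the same message:

```python
import torch

# Minimal sketch of the failure mode: GradScaler refuses to unscale gradients
# that are stored in fp16, which is exactly the error reported above.
model = torch.nn.Linear(4, 4).cuda().half()            # trainable weights stored in fp16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)   # ValueError: Attempting to unscale FP16 gradients.
```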
I then tried the solution proposed in PR #6554 and found that the following command does not trigger this issue:
```bash
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --gradient_checkpointing \
  --seed="0" \
  --report_to="tensorboard"
```
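As I understand it, the usual mitigation for this class of error is to keep only the trainable (LoRA) parameters in fp32 while the frozen base weights stay in fp16, e.g. via `diffusers.training_utils.cast_training_params`. A toy sketch of that pattern (my own, assuming `cast_training_params` upcasts only parameters with `requires_grad=True`):

```python
import torch
from diffusers.training_utils import cast_training_params

# Toy stand-in for "frozen fp16 base model + small trainable adapter".
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
model.requires_grad_(False)
model[1].requires_grad_(True)          # pretend this layer is the LoRA adapter
model.to(dtype=torch.float16)          # everything is fp16 at this point

# Upcast only the trainable parameters back to fp32; frozen weights stay fp16,
# so GradScaler.unscale_() no longer sees fp16 gradients.
cast_training_params(model, dtype=torch.float32)
print({name: p.dtype for name, p in model.named_parameters() if p.requires_grad})
```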
I then aligned the two commands argument by argument and found that the error is triggered specifically when the `--validation_prompt` argument is provided.
Does this indicate that there is still a bug related to validation in this script?
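In case it helps narrow this down, here is a small diagnostic I would drop around the validation block (a hypothetical helper of my own, not part of the script) to check whether the trainable parameters change dtype once validation runs:

```python
def report_trainable_dtypes(model, tag=""):
    # Hypothetical debugging helper: print the dtypes of every parameter the
    # optimizer is expected to update.
    dtypes = {p.dtype for p in model.parameters() if p.requires_grad}
    print(f"[{tag}] trainable param dtypes: {dtypes}")

# Calling this on the unet before and after the validation step should show
# whether fp16 sneaks into the trainable parameters when --validation_prompt is set.
```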
### Reproduction
```bash
accelerate launch examples/dreambooth/train_dreambooth_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="tensorboard" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50
```
### Logs

### System Info
- 🤗 Diffusers version: 0.37.0.dev0
- Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.9.1+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 1.4.1
- Transformers version: 5.1.0
- Accelerate version: 1.12.0
- PEFT version: 0.18.1
- Bitsandbytes version: not installed
- Safetensors version: 0.7.0
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 3060, 12288 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
### Who can help?
No response