
Cannot run FLUX.2-klein under AMD Radeon 890M with ROCm build #1235

@HemanLin-cl

Description


Hi all,

I tried the latest build sd-master-43e829f-bin-win-rocm-x64.zip on an AMD Radeon 890M (Ryzen AI 9 HX370), but it fails with a ROCm error.

Command-line arguments used
sd-cli.exe --diffusion-model ../models/Flux/flux-2-klein-4b.safetensors --vae ../models/Flux/flux2-vae.safetensors --llm ../models/text_encoders/qwen_3_4b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa --vae-conv-direct --clip-on-cpu

Logs / error messages / stack trace

[DEBUG] main.cpp:500 - version: stable-diffusion.cpp version unknown, commit 43e829f
[DEBUG] main.cpp:501 - System Info:
SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | VSX = 0 |
[DEBUG] main.cpp:502 - SDCliParams {
mode: img_gen,
output_path: "output.png",
verbose: true,
color: false,
canny_preprocess: false,
convert_name: false,
preview_method: none,
preview_interval: 1,
preview_path: "preview.png",
preview_fps: 16,
taesd_preview: false,
preview_noisy: false
}
[DEBUG] main.cpp:503 - SDContextParams {
n_threads: 12,
model_path: "",
clip_l_path: "",
clip_g_path: "",
clip_vision_path: "",
t5xxl_path: "",
llm_path: "../models/text_encoders/qwen_3_4b.safetensors",
llm_vision_path: "",
diffusion_model_path: "../models/Flux/flux-2-klein-4b.safetensors",
high_noise_diffusion_model_path: "",
vae_path: "../models/Flux/flux2-vae.safetensors",
taesd_path: "",
esrgan_path: "",
control_net_path: "",
embedding_dir: "",
embeddings: {
}
wtype: NONE,
tensor_type_rules: "",
lora_model_dir: ".",
photo_maker_path: "",
rng_type: cuda,
sampler_rng_type: NONE,
flow_shift: INF
offload_params_to_cpu: true,
enable_mmap: false,
control_net_cpu: false,
clip_on_cpu: true,
vae_on_cpu: false,
diffusion_flash_attn: true,
diffusion_conv_direct: false,
vae_conv_direct: true,
circular: false,
circular_x: false,
circular_y: false,
chroma_use_dit_mask: true,
qwen_image_zero_cond_t: false,
chroma_use_t5_mask: false,
chroma_t5_mask_pad: 1,
prediction: NONE,
lora_apply_mode: auto,
vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:504 - SDGenerationParams {
loras: "{
}",
high_noise_loras: "{
}",
prompt: "a lovely cat",
negative_prompt: "",
clip_skip: -1,
width: -1,
height: -1,
batch_count: 1,
init_image_path: "",
end_image_path: "",
mask_image_path: "",
control_image_path: "",
ref_image_paths: [],
control_video_path: "",
auto_resize_ref_image: true,
increase_ref_index: false,
pm_id_images_dir: "",
pm_id_embed_path: "",
pm_style_strength: 20,
skip_layers: [7, 8, 9],
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 4, eta: 0.00, shifted_timestep: 0),
high_noise_skip_layers: [7, 8, 9],
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
custom_sigmas: [],
cache_mode: "",
cache_option: "",
cache: disabled (threshold=1, start=0.15, end=0.95),
moe_boundary: 0.875,
video_frames: 1,
fps: 16,
vace_strength: 1,
strength: 0.75,
control_strength: 0.9,
seed: 42,
upscale_repeats: 1,
upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:164 - Using CUDA backend
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:78 - ggml_cuda_init: found 1 ROCm devices:
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:78 - Device 0: AMD Radeon(TM) 890M Graphics, gfx1150 (0x1150), VMM: no, Wave Size: 32
[INFO ] stable-diffusion.cpp:258 - loading diffusion model from '../models/Flux/flux-2-klein-4b.safetensors'
[INFO ] model.cpp:373 - load ../models/Flux/flux-2-klein-4b.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/Flux/flux-2-klein-4b.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:305 - loading llm from '../models/text_encoders/qwen_3_4b.safetensors'
[INFO ] model.cpp:373 - load ../models/text_encoders/qwen_3_4b.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/text_encoders/qwen_3_4b.safetensors', prefix = 'text_encoders.llm.'
[INFO ] stable-diffusion.cpp:319 - loading vae from '../models/Flux/flux2-vae.safetensors'
[INFO ] model.cpp:373 - load ../models/Flux/flux2-vae.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/Flux/flux2-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:335 - Version: Flux.2 klein
[INFO ] stable-diffusion.cpp:363 - Weight type stat: f32: 248 | bf16: 547
[INFO ] stable-diffusion.cpp:364 - Conditioner weight type stat: bf16: 398
[INFO ] stable-diffusion.cpp:365 - Diffusion model weight type stat: bf16: 149
[INFO ] stable-diffusion.cpp:366 - VAE weight type stat: f32: 248
[DEBUG] stable-diffusion.cpp:368 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:427 - CLIP: Using CPU backend
[DEBUG] stable-diffusion.cpp\llm.hpp:285 - merges size 151387
[DEBUG] stable-diffusion.cpp\llm.hpp:317 - vocab size: 151669
[DEBUG] stable-diffusion.cpp\llm.hpp:1139 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[INFO ] stable-diffusion.cpp\flux.hpp:1353 - flux: depth = 5, depth_single_blocks = 20, guidance_embed = false, context_in_dim = 7680, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:573 - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - qwen3 params backend buffer size = 8414.50 MB(RAM) (398 tensors)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - flux params backend buffer size = 7392.03 MB(RAM) (149 tensors)
[INFO ] stable-diffusion.cpp:624 - Using Conv2d direct in the vae model
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - vae params backend buffer size = 96.72 MB(RAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:752 - loading weights
[DEBUG] model.cpp:1381 - using 12 threads for model loading
[DEBUG] model.cpp:1403 - loading tensors from ../models/Flux/flux-2-klein-4b.safetensors
|=========> | 149/795 - 10.44it/s
[DEBUG] model.cpp:1403 - loading tensors from ../models/text_encoders/qwen_3_4b.safetensors
|==================================> | 547/795 - 18.93it/s
[DEBUG] model.cpp:1403 - loading tensors from ../models/Flux/flux2-vae.safetensors
|==================================================| 795/795 - 27.12it/s
[INFO ] model.cpp:1629 - loading tensors completed, taking 29.32s (process: 0.00s, read: 27.90s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:787 - finished loaded file
[INFO ] stable-diffusion.cpp:860 - total params memory size = 15903.24MB (VRAM 7488.74MB, RAM 8414.50MB): text_encoders 8414.50MB(RAM), diffusion_model 7392.03MB(VRAM), vae 96.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:939 - running in Flux2 FLOW mode
[DEBUG] stable-diffusion.cpp:3472 - generate_image 512x512
[INFO ] stable-diffusion.cpp:3506 - sampling using Euler method
[DEBUG] stable-diffusion.cpp\denoiser.hpp:703 - Flux2FlowDenoiser: set shift to 2.031
[INFO ] stable-diffusion.cpp\denoiser.hpp:403 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3633 - TXT2IMG
[DEBUG] stable-diffusion.cpp\conditioner.hpp:1679 - parse '<|im_start|>user
a lovely cat<|im_end|>
<|im_start|>assistant

' to [['<|im_start|>user
', 1], ['a lovely cat', 1], ['<|im_end|>
<|im_start|>assistant

', 1], ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "<|im_start|>user
" to tokens ["<|im_start|>", "user", "Ċ", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "a lovely cat" to tokens ["a", "Ġlovely", "Ġcat", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "<|im_end|>
<|im_start|>assistant

" to tokens ["<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", "", "ĊĊ", "", "ĊĊ", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:203 - token length: 512
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1762 - qwen3 compute buffer size: 74.00 MB(RAM)
[DEBUG] stable-diffusion.cpp\conditioner.hpp:1923 - computing condition graph completed, taking 11443 ms
[INFO ] stable-diffusion.cpp:3250 - get_learned_condition completed, taking 11446 ms
[INFO ] stable-diffusion.cpp:3361 - generating image: 1/1 - seed 42
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:1862 - flux offload params (7392.03 MB, 149 tensors) to runtime backend (ROCm0), taking 3.23s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1762 - flux compute buffer size: 356.00 MB(VRAM)
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - ROCm error: invalid device function
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - current device: 0, in function ggml_cuda_op_mul_mat at D:/a/stable-diffusion.cpp/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1722
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - hipGetLastError()
D:/a/stable-diffusion.cpp/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error

In the build workflow (workflows/build.yml), "gfx1150" does not appear in the list of GPU targets. Is this issue related to that? From what I understand, "ROCm error: invalid device function" usually means the binary contains no kernels compiled for my GPU architecture (gfx1150). A rough sketch of the rebuild I have in mind is below.
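If missing kernels are the cause, I guess rebuilding from source with gfx1150 added to the HIP target list might help. This is only a sketch based on the hipBLAS build instructions I remember from the README; the exact flag names (SD_HIPBLAS, AMDGPU_TARGETS) are my assumption and may differ in the current tree:

rem configure with HIP enabled and gfx1150 in the offload target list (flag names assumed)
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1150
rem build the CLI
cmake --build . --config Release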

Do I need to install ROCm- or HIP-related packages for the prebuilt binary to work?

Or does the Radeon 890M simply not work with ROCm? (It appears to be gfx1150, which may only be supported in ROCm 7.0 or later?)
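In case it is useful: on Linux, people seem to work around unsupported gfx targets with the HSA_OVERRIDE_GFX_VERSION environment variable, which forces the runtime to pick kernels built for a nearby architecture. I am not sure this variable is honored by the Windows HIP runtime, and gfx1100 kernels may not be compatible with gfx1150, so treat this as a guess:

rem pretend the GPU is gfx1100 so prebuilt gfx1100 kernels are selected (may have no effect on Windows)
set HSA_OVERRIDE_GFX_VERSION=11.0.0
sd-cli.exe --diffusion-model ../models/Flux/flux-2-klein-4b.safetensors --vae ../models/Flux/flux2-vae.safetensors --llm ../models/text_encoders/qwen_3_4b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa --vae-conv-direct --clip-on-cpu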

Thanks
