
Cannot run FLUX.2-klein under AMD Radeon 890M with ROCm build #1235

@HemanLin-cl

Description


Hi all,

I tried the latest build sd-master-43e829f-bin-win-rocm-x64.zip on an AMD Radeon 890M (Ryzen AI 9 HX370), but it fails with a ROCm error.

Command-line arguments used
sd-cli.exe --diffusion-model ../models/Flux/flux-2-klein-4b.safetensors --vae ../models/Flux/flux2-vae.safetensors --llm ../models/text_encoders/qwen_3_4b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa --vae-conv-direct --clip-on-cpu

Logs / error messages / stack trace

[DEBUG] main.cpp:500 - version: stable-diffusion.cpp version unknown, commit 43e829f
[DEBUG] main.cpp:501 - System Info:
SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | VSX = 0 |
[DEBUG] main.cpp:502 - SDCliParams {
mode: img_gen,
output_path: "output.png",
verbose: true,
color: false,
canny_preprocess: false,
convert_name: false,
preview_method: none,
preview_interval: 1,
preview_path: "preview.png",
preview_fps: 16,
taesd_preview: false,
preview_noisy: false
}
[DEBUG] main.cpp:503 - SDContextParams {
n_threads: 12,
model_path: "",
clip_l_path: "",
clip_g_path: "",
clip_vision_path: "",
t5xxl_path: "",
llm_path: "../models/text_encoders/qwen_3_4b.safetensors",
llm_vision_path: "",
diffusion_model_path: "../models/Flux/flux-2-klein-4b.safetensors",
high_noise_diffusion_model_path: "",
vae_path: "../models/Flux/flux2-vae.safetensors",
taesd_path: "",
esrgan_path: "",
control_net_path: "",
embedding_dir: "",
embeddings: {
}
wtype: NONE,
tensor_type_rules: "",
lora_model_dir: ".",
photo_maker_path: "",
rng_type: cuda,
sampler_rng_type: NONE,
flow_shift: INF
offload_params_to_cpu: true,
enable_mmap: false,
control_net_cpu: false,
clip_on_cpu: true,
vae_on_cpu: false,
diffusion_flash_attn: true,
diffusion_conv_direct: false,
vae_conv_direct: true,
circular: false,
circular_x: false,
circular_y: false,
chroma_use_dit_mask: true,
qwen_image_zero_cond_t: false,
chroma_use_t5_mask: false,
chroma_t5_mask_pad: 1,
prediction: NONE,
lora_apply_mode: auto,
vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:504 - SDGenerationParams {
loras: "{
}",
high_noise_loras: "{
}",
prompt: "a lovely cat",
negative_prompt: "",
clip_skip: -1,
width: -1,
height: -1,
batch_count: 1,
init_image_path: "",
end_image_path: "",
mask_image_path: "",
control_image_path: "",
ref_image_paths: [],
control_video_path: "",
auto_resize_ref_image: true,
increase_ref_index: false,
pm_id_images_dir: "",
pm_id_embed_path: "",
pm_style_strength: 20,
skip_layers: [7, 8, 9],
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 4, eta: 0.00, shifted_timestep: 0),
high_noise_skip_layers: [7, 8, 9],
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
custom_sigmas: [],
cache_mode: "",
cache_option: "",
cache: disabled (threshold=1, start=0.15, end=0.95),
moe_boundary: 0.875,
video_frames: 1,
fps: 16,
vace_strength: 1,
strength: 0.75,
control_strength: 0.9,
seed: 42,
upscale_repeats: 1,
upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:164 - Using CUDA backend
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:78 - ggml_cuda_init: found 1 ROCm devices:
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:78 - Device 0: AMD Radeon(TM) 890M Graphics, gfx1150 (0x1150), VMM: no, Wave Size: 32
[INFO ] stable-diffusion.cpp:258 - loading diffusion model from '../models/Flux/flux-2-klein-4b.safetensors'
[INFO ] model.cpp:373 - load ../models/Flux/flux-2-klein-4b.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/Flux/flux-2-klein-4b.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:305 - loading llm from '../models/text_encoders/qwen_3_4b.safetensors'
[INFO ] model.cpp:373 - load ../models/text_encoders/qwen_3_4b.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/text_encoders/qwen_3_4b.safetensors', prefix = 'text_encoders.llm.'
[INFO ] stable-diffusion.cpp:319 - loading vae from '../models/Flux/flux2-vae.safetensors'
[INFO ] model.cpp:373 - load ../models/Flux/flux2-vae.safetensors using safetensors format
[DEBUG] model.cpp:507 - init from '../models/Flux/flux2-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:335 - Version: Flux.2 klein
[INFO ] stable-diffusion.cpp:363 - Weight type stat: f32: 248 | bf16: 547
[INFO ] stable-diffusion.cpp:364 - Conditioner weight type stat: bf16: 398
[INFO ] stable-diffusion.cpp:365 - Diffusion model weight type stat: bf16: 149
[INFO ] stable-diffusion.cpp:366 - VAE weight type stat: f32: 248
[DEBUG] stable-diffusion.cpp:368 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:427 - CLIP: Using CPU backend
[DEBUG] stable-diffusion.cpp\llm.hpp:285 - merges size 151387
[DEBUG] stable-diffusion.cpp\llm.hpp:317 - vocab size: 151669
[DEBUG] stable-diffusion.cpp\llm.hpp:1139 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[INFO ] stable-diffusion.cpp\flux.hpp:1353 - flux: depth = 5, depth_single_blocks = 20, guidance_embed = false, context_in_dim = 7680, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:573 - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - qwen3 params backend buffer size = 8414.50 MB(RAM) (398 tensors)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - flux params backend buffer size = 7392.03 MB(RAM) (149 tensors)
[INFO ] stable-diffusion.cpp:624 - Using Conv2d direct in the vae model
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1950 - vae params backend buffer size = 96.72 MB(RAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:752 - loading weights
[DEBUG] model.cpp:1381 - using 12 threads for model loading
[DEBUG] model.cpp:1403 - loading tensors from ../models/Flux/flux-2-klein-4b.safetensors
|=========> | 149/795 - 10.44it/s
[DEBUG] model.cpp:1403 - loading tensors from ../models/text_encoders/qwen_3_4b.safetensors
|==================================> | 547/795 - 18.93it/s
[DEBUG] model.cpp:1403 - loading tensors from ../models/Flux/flux2-vae.safetensors
|==================================================| 795/795 - 27.12it/s
[INFO ] model.cpp:1629 - loading tensors completed, taking 29.32s (process: 0.00s, read: 27.90s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:787 - finished loaded file
[INFO ] stable-diffusion.cpp:860 - total params memory size = 15903.24MB (VRAM 7488.74MB, RAM 8414.50MB): text_encoders 8414.50MB(RAM), diffusion_model 7392.03MB(VRAM), vae 96.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:939 - running in Flux2 FLOW mode
[DEBUG] stable-diffusion.cpp:3472 - generate_image 512x512
[INFO ] stable-diffusion.cpp:3506 - sampling using Euler method
[DEBUG] stable-diffusion.cpp\denoiser.hpp:703 - Flux2FlowDenoiser: set shift to 2.031
[INFO ] stable-diffusion.cpp\denoiser.hpp:403 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3633 - TXT2IMG
[DEBUG] stable-diffusion.cpp\conditioner.hpp:1679 - parse '<|im_start|>user
a lovely cat<|im_end|>
<|im_start|>assistant

' to [['<|im_start|>user
', 1], ['a lovely cat', 1], ['<|im_end|>
<|im_start|>assistant

', 1], ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "<|im_start|>user
" to tokens ["<|im_start|>", "user", "Ċ", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "a lovely cat" to tokens ["a", "Ġlovely", "Ġcat", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:259 - split prompt "<|im_end|>
<|im_start|>assistant

" to tokens ["<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", "", "ĊĊ", "", "ĊĊ", ]
[DEBUG] stable-diffusion.cpp\llm.hpp:203 - token length: 512
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1762 - qwen3 compute buffer size: 74.00 MB(RAM)
[DEBUG] stable-diffusion.cpp\conditioner.hpp:1923 - computing condition graph completed, taking 11443 ms
[INFO ] stable-diffusion.cpp:3250 - get_learned_condition completed, taking 11446 ms
[INFO ] stable-diffusion.cpp:3361 - generating image: 1/1 - seed 42
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:1862 - flux offload params (7392.03 MB, 149 tensors) to runtime backend (ROCm0), taking 3.23s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1762 - flux compute buffer size: 356.00 MB(VRAM)
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - ROCm error: invalid device function
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - current device: 0, in function ggml_cuda_op_mul_mat at D:/a/stable-diffusion.cpp/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1722
[ERROR] stable-diffusion.cpp\ggml_extend.hpp:84 - hipGetLastError()
D:/a/stable-diffusion.cpp/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error

In the build workflow (workflows/build.yml), "gfx1150" does not appear in the list of GPU targets. Is this issue related to that? From what I understand, "ROCm error: invalid device function" usually means the binary contains no kernels compiled for my GPU architecture (gfx1150). A rough sketch of the rebuild I have in mind is below.
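If missing kernels are the cause, I guess rebuilding from source with gfx1150 added to the HIP target list might help. This is only a sketch based on the hipBLAS build instructions I remember from the README; the exact flag names (SD_HIPBLAS, AMDGPU_TARGETS) are my assumption and may differ in the current tree:

rem configure with HIP enabled and gfx1150 in the offload target list (flag names assumed)
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1150
rem build the CLI
cmake --build . --config Release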

Do I need to install ROCm- or HIP-related packages for the prebuilt binary to work?

Or does the Radeon 890M simply not work with ROCm? (It appears to be gfx1150, which may only be supported in ROCm 7.0 or later?)
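In case it is useful: on Linux, people seem to work around unsupported gfx targets with the HSA_OVERRIDE_GFX_VERSION environment variable, which forces the runtime to pick kernels built for a nearby architecture. I am not sure this variable is honored by the Windows HIP runtime, and gfx1100 kernels may not be compatible with gfx1150, so treat this as a guess:

rem pretend the GPU is gfx1100 so prebuilt gfx1100 kernels are selected (may have no effect on Windows)
set HSA_OVERRIDE_GFX_VERSION=11.0.0
sd-cli.exe --diffusion-model ../models/Flux/flux-2-klein-4b.safetensors --vae ../models/Flux/flux2-vae.safetensors --llm ../models/text_encoders/qwen_3_4b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa --vae-conv-direct --clip-on-cpu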

Thanks
