
Failed to create RBD storage pool after KVM agent upgrade from 4.20 to 4.22: "org.libvirt.LibvirtException: failed to create the RBD IoCTX" #12096

@tuanhoangth1603

Description


Problem

After upgrading the KVM agent on a compute node from CloudStack 4.20 to 4.22, the agent fails to recreate or reconnect to the existing RBD primary storage pool. During pool initialization the agent logs a LibvirtException asking whether the RBD pool exists, even though the pool is present and healthy on the Ceph cluster. This prevents the host from fully reconnecting and handling VM operations (e.g., volume attach/detach).
The issue appears tied to changes in libvirt (8.0+) or the Ceph client libraries after the upgrade, causing IoCTX creation to fail, apparently due to a stale libvirt secret or cached client state. Notably, a full reboot of the compute node resolves the issue immediately, allowing the pool and secret to be recreated cleanly. However, a reboot means downtime for the VMs running on that node, which is unacceptable in production.
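A possible non-reboot workaround, assuming the stale state lives in the libvirt secret/pool definitions rather than elsewhere: drop the agent-defined pool and secret and restart the daemons so the agent can redefine them on reconnect. This is a sketch only; the UUIDs below are placeholders to be taken from `virsh secret-list` / `virsh pool-list --all`, and whether this avoids the reboot in this case is unverified.

```shell
# Inspect the secret and pool the agent defined for the Ceph user
virsh secret-list
virsh pool-list --all

# Remove the stale definitions so the agent can recreate them cleanly
virsh pool-destroy <pool-name-or-uuid>    # placeholder
virsh pool-undefine <pool-name-or-uuid>   # placeholder
virsh secret-undefine <secret-uuid>       # placeholder

# Restart libvirtd to drop any cached RBD client state, then the agent
systemctl restart libvirtd
systemctl restart cloudstack-agent
```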

Versions

CloudStack version: Management server upgraded to 4.22.0 (from 4.20.0)
Agent version: KVM agent upgraded from 4.20.0 to 4.22.0 on compute nodes
Hypervisor: KVM
Primary Storage: Ceph RBD (pool name: cloudstack-zone1; Ceph version: 14)
OS on compute nodes: Ubuntu 20.04

Steps to reproduce

  1. Upgrade the management server to 4.22.0.
  2. Upgrade the KVM agent to 4.22.0.
  3. Observe the error in agent.log: Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to create the RBD IoCTX. Does the pool 'cloudstack-zone1' exist?

I also ran the following commands on the Ceph cluster, but the error persists:

# ceph config set mon auth_expose_insecure_global_id_reclaim false
# ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
# ceph config set mon auth_allow_insecure_global_id_reclaim false
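Before touching libvirt, it may help to confirm that Ceph itself is reachable from the compute node with the same credentials the agent uses, which separates a Ceph auth problem from a libvirt secret mismatch. A hedged sketch; `client.cloudstack` / `--id cloudstack` are assumptions, so substitute the actual Ceph user configured for the primary storage:

```shell
ceph -s                                      # is the cluster reachable from this node?
ceph auth get client.cloudstack              # assumed user; do the key and caps still exist?
rbd -p cloudstack-zone1 ls --id cloudstack   # can this user open an IoCTX on the pool?
```

If the `rbd ls` succeeds here while libvirt still fails, the problem is likely confined to the libvirt secret or cached state on the node.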

Expected Behavior
The agent should successfully redefine the RBD storage pool using the existing Ceph configuration (monitors, secrets) without failure, allowing seamless host reconnection post-upgrade.

Actual Behavior
Agent logs show repeated failures to create the RBD IoCTX, followed by cleanup of the libvirt secret. The host remains in "Disconnected" or "Alert" state in the UI until manual intervention. A full reboot of the compute node resolves the issue immediately, but rebooting is not an acceptable workaround.
