KVM HA not functioning #12088

@jpt1624

problem

Hello, I am having issues with getting HA to function with my two KVM hosts. The cluster, the two hosts, and a test virtual machine each have HA enabled.

The KVM hosts have OOB management configured using ipmitool.

For testing, I have a virtual machine with an HA supported policy running on KVM-02. I power off KVM-02 abruptly to see if the virtual machine will automatically migrate over to KVM-01.

What occurs is the following:

The CloudStack management server determines that KVM-02 is Disconnected:

Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Disconnected].

KVM-02 is then set to the DOWN status (supposedly):

{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Down].

KVM-01 then checks connectivity with KVM-02, which also returns a status of DOWN:

Neighbouring Host {"id":43,"name":"kvm-01","type":"Routing","uuid":"025ccefd-1696-43c9-9a2c-e045968d2efa"} returned status [Down] for the investigated Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}.

The shared storage volume mounted onto KVM-02 is checked for any recent writes:

Checking VM activity for Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} on storage pool [StoragePool {"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}].

Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} does not have activity on storage pool [StoragePool {"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]
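To line these messages up against what the API reports, it can help to pull the status transitions out of the management server log. Below is a minimal sketch that only assumes the message format matches the excerpts quoted above; `host_status_events` is a hypothetical helper name, not part of CloudStack.

```python
import json
import re

# Matches the host JSON blob followed by "has the status [X]", as in the
# excerpts above. The leading "Host " is not required because one of the
# quoted lines omits it.
STATUS_RE = re.compile(r'(\{[^{}]*\})\s+has the status \[(\w+)\]')


def host_status_events(lines):
    """Return (host name, status) pairs in the order they appear."""
    events = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a status-transition message
        host = json.loads(match.group(1))
        events.append((host["name"], match.group(2)))
    return events


sample = [
    'Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Disconnected].',
    '{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Down].',
    'Checking VM activity for Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} on storage pool.',
]

for name, status in host_status_events(sample):
    print(name, status)
```

The neighbour-check and storage-activity messages are skipped on purpose, since they do not contain the "has the status" phrase.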

Also while these are occurring, the API states that the status for KVM-02 is UP:

Image

After about 10-15 minutes, KVM-02 progresses to the ALERT state. I am not sure why it takes this long, because the setting for this condition is configured for 5 checks:

Image

Once KVM-02 is in the ALERT state, the HA task tries to power it off (presumably to prevent split brain before moving the virtual machines over):

Image

This command fails because KVM-02 is already OFF.

Image

CloudStack keeps retrying the power-off until I manually issue the OOB power-on command, after which its power-off command succeeds. Once that happens, KVM-02 is finally marked as DOWN:

Image Image Image
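To confirm out of band that the host really is off while the power-off attempts are failing, the chassis state can be queried directly with ipmitool. This is a rough diagnostic sketch, not how CloudStack performs the check: the `lanplus` interface is an assumption (I have only stated that ipmitool is used), and the function names and any address or credentials passed in are placeholders.

```python
import subprocess


def power_status_command(address, username, password):
    """Build the ipmitool invocation that asks the BMC for the chassis power state."""
    # Assumption: the BMC is reachable over the lanplus interface.
    return [
        "ipmitool", "-I", "lanplus",
        "-H", address, "-U", username, "-P", password,
        "chassis", "power", "status",
    ]


def parse_power_state(output):
    """Map ipmitool's 'Chassis Power is on/off' line to 'on', 'off', or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def query_power_state(address, username, password, timeout=10):
    """Run ipmitool and return the parsed state; 'unknown' if the call fails."""
    try:
        result = subprocess.run(
            power_status_command(address, username, password),
            capture_output=True, text=True, timeout=timeout, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    if result.returncode != 0:
        return "unknown"
    return parse_power_state(result.stdout)
```

If this reports `off` while the HA task is still failing to power the host off, that would support the reading that the failure comes from the host already being off rather than from the BMC being unreachable.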

Here are the investigators configured:

Image

versions

CloudStack: 4.22.0.0
KVM-01: 4.22.0.0
KVM-02: 4.22.0.0

The steps to reproduce the bug

  1. Create KVM cluster
  2. Assign two KVM hosts under cluster
  3. Enable HA for cluster and KVM hosts
  4. Configure OOB management for KVM hosts
  5. Create test VM under a KVM host with HA supported policy
  6. Assign SimpleInvestigator, PingInvestigator, and KVMInvestigator under HA investigators order.
  7. Power off KVM host abruptly to simulate failure scenario.

What to do about it?

I am not sure whether my configuration is incorrect or whether there is an underlying issue. The behavior is confusing. Please let me know if I can provide anything else.

Thanks!
