@@ -1228,12 +1228,38 @@ public void run(MessageReply reply) {
}

GetVolumeWatchersReply rly = (GetVolumeWatchersReply)reply;
List watchers = rly.getWatchers();
List<String> watchers = rly.getWatchers();
if (watchers == null || watchers.isEmpty()) {
completion.success();
return;
}

// Filter out stale watchers from disconnected hosts (ZSTAC-73476)
// When libvirt hangs on a host, ceph rbd watchers become stale but persist,
// blocking VM start on other hosts. Only watchers from connected hosts are valid.
Set<String> disconnectedHostIps = new HashSet<>(Q.New(HostVO.class)
.select(HostVO_.managementIp)
.notEq(HostVO_.status, HostStatus.Connected)
.listValues());
if (!disconnectedHostIps.isEmpty()) {
List<String> activeWatchers = new ArrayList<>();
for (String watcher : watchers) {
String watcherIp = extractWatcherIp(watcher);
if (watcherIp != null && disconnectedHostIps.contains(watcherIp)) {
logger.info(String.format("filtered stale watcher from disconnected host[ip:%s] for volume[uuid:%s]: %s",
watcherIp, msg.getVolumeUuid(), watcher));
continue;
}
activeWatchers.add(watcher);
}
watchers = activeWatchers;
}
Comment on lines +1237 to +1256
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for EXTRA_IPS usage in Ceph plugin and related Ceph connection code
rg -n "EXTRA_IPS" plugin/ceph --type=java -A 3 -B 3

Repository: MatheMatrix/zstack

Length of output: 4661


🏁 Script executed:

# Search for Ceph client connection IP configuration or setup
rg -n -C 5 "monitorAddress|ceph.*ip|RBDClient.*connect|storageNetworkInterface" plugin/ceph --type=java

Repository: MatheMatrix/zstack

Length of output: 10579


🏁 Script executed:

# Look for extractWatcherIp implementation and watcher format documentation
rg -n "extractWatcherIp|watchers|rbd status" plugin/ceph --type=java -A 5 -B 2

Repository: MatheMatrix/zstack

Length of output: 12838


🏁 Script executed:

# Search for how watchers are obtained and their expected format
rg -n "GetVolumeWatchers|ceph.*watcher|libvirt.*watcher" plugin/ceph --type=java -A 3

Repository: MatheMatrix/zstack

Length of output: 8241


🏁 Script executed:

# Check if there are existing comments or tests about watcher IP sources
rg -n "watcher.*ip|disconnected.*host|stale.*watcher" plugin/ceph --type=java -A 5 -B 2

Repository: MatheMatrix/zstack

Length of output: 3726


🏁 Script executed:

# Search in the full repository for how Ceph is configured to use specific networks
rg -n "CephMonitorAddress|CephMonitor.*ip" . --type=java | head -20

Repository: MatheMatrix/zstack

Length of output: 44


🏁 Script executed:

# Look at the actual file around the changed lines to get more context
sed -n '1220,1300p' plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java

Repository: MatheMatrix/zstack

Length of output: 3590


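Stale-watcher filtering for disconnected hosts must also account for EXTRA_IPS

Lines 1240-1243 build the set of disconnected-host IPs from HostVO_.managementIp only, whereas the Ceph Monitor disconnect handling at lines 537-540 uses both managementIp and HostSystemTags.EXTRA_IPS. If the Ceph RBD client actually connects from a data-network or storage-network IP (recorded in EXTRA_IPS), stale watchers from those IPs are not filtered out and the VM remains blocked.

Suggest also handling EXTRA_IPS, consistent with the Ceph Mon disconnect logic:

Proposed fix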
-                Set<String> disconnectedHostIps = new HashSet<>(Q.New(HostVO.class)
-                        .select(HostVO_.managementIp)
-                        .notEq(HostVO_.status, HostStatus.Connected)
-                        .listValues());
+                Set<String> disconnectedHostIps = new HashSet<>();
+                List<HostVO> disconnectedHosts = Q.New(HostVO.class)
+                        .notEq(HostVO_.status, HostStatus.Connected)
+                        .list();
+                for (HostVO host : disconnectedHosts) {
+                    disconnectedHostIps.add(host.getManagementIp());
+                    String extraIps = HostSystemTags.EXTRA_IPS.getTokenByResourceUuid(
+                            host.getUuid(), HostSystemTags.EXTRA_IPS_TOKEN);
+                    if (!Strings.isEmpty(extraIps)) {
+                        disconnectedHostIps.addAll(Arrays.asList(extraIps.split(",")));
+                    }
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java`
around lines 1237 - 1256, The current stale-watcher filtering builds
disconnectedHostIps only from HostVO_.managementIp, missing additional addresses
stored in HostSystemTags.EXTRA_IPS; update the logic in
CephPrimaryStorageFactory where disconnectedHostIps is constructed so it also
queries and parses EXTRA_IPS for each non-Connected HostVO (split/tag value
parsing as done for Ceph Monitor disconnect handling), add those extra IPs into
the disconnectedHostIps set, and then continue to use that enriched set in the
existing loop that calls extractWatcherIp(watcher) to filter watchers so stale
watchers originating from EXTRA_IPS are correctly removed.
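
As a rough illustration of the IP-collection step described in this comment, the sketch below merges each disconnected host's management IP with its comma-separated extra IPs, using plain Java collections. `HostAddresses` and `DisconnectedIpCollector` are hypothetical stand-ins, not ZStack classes; the real code would read these values from `HostVO` and `HostSystemTags.EXTRA_IPS`. The comma-separated tag format is taken from the proposed fix above, and the trimming of whitespace is an extra precaution assumed here, not something the source confirms is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one disconnected host; not the real HostVO.
class HostAddresses {
    final String managementIp;
    final String extraIps; // assumed comma-separated; may be null or empty

    HostAddresses(String managementIp, String extraIps) {
        this.managementIp = managementIp;
        this.extraIps = extraIps;
    }
}

class DisconnectedIpCollector {
    // Collects every known address of the given hosts into one set.
    static Set<String> collect(List<HostAddresses> disconnectedHosts) {
        Set<String> ips = new HashSet<>();
        for (HostAddresses host : disconnectedHosts) {
            if (host.managementIp != null) {
                ips.add(host.managementIp);
            }
            if (host.extraIps == null) {
                continue;
            }
            for (String ip : host.extraIps.split(",")) {
                String trimmed = ip.trim();
                // Skip empty tokens such as those produced by "" or "a,,b".
                if (!trimmed.isEmpty()) {
                    ips.add(trimmed);
                }
            }
        }
        return ips;
    }

    public static void main(String[] args) {
        List<HostAddresses> hosts = Arrays.asList(
                new HostAddresses("10.0.0.5", "172.16.0.5, 192.168.10.5"),
                new HostAddresses("10.0.0.6", null));
        System.out.println(collect(hosts));
    }
}
```

The resulting set could then be passed to the existing watcher loop unchanged, since that loop only calls `contains` on it.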


if (watchers.isEmpty()) {
completion.success();
return;
}

String installPath = Q.New(VolumeVO.class)
.eq(VolumeVO_.uuid, msg.getVolumeUuid())
.select(VolumeVO_.installPath)
@@ -1244,6 +1270,26 @@ public void run(MessageReply reply) {
});
}

/**
* Extract IP address from rbd watcher string.
* Format: "watcher=IP:port/nonce client.ID cookie=COOKIE"
*/
private String extractWatcherIp(String watcher) {
if (watcher == null) {
return null;
}
int idx = watcher.indexOf("watcher=");
if (idx < 0) {
return null;
}
String rest = watcher.substring(idx + 8);
int colonIdx = rest.indexOf(':');
if (colonIdx <= 0) {
return null;
}
return rest.substring(0, colonIdx);
}

@Override
public void preReleaseVmResource(VmInstanceSpec spec, Completion completion) {
completion.success();
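
The `extractWatcherIp` helper in this hunk takes everything between `watcher=` and the first colon. A standalone sketch of the same parsing is given below, with one added branch for bracketed IPv6 addresses. The bracketed form (`watcher=[addr]:port/nonce`) is an assumption about how such addresses might be rendered; the hunk only documents the `watcher=IP:port/nonce client.ID cookie=COOKIE` format. `WatcherAddressParser` is a hypothetical class name used for illustration only.

```java
class WatcherAddressParser {
    private static final String PREFIX = "watcher=";

    // Returns the address part of a watcher string, or null if it cannot be parsed.
    static String extractIp(String watcher) {
        if (watcher == null) {
            return null;
        }
        int idx = watcher.indexOf(PREFIX);
        if (idx < 0) {
            return null;
        }
        String rest = watcher.substring(idx + PREFIX.length());
        // Assumed IPv6 form: the address is wrapped in brackets before the port.
        if (rest.startsWith("[")) {
            int close = rest.indexOf(']');
            return close > 1 ? rest.substring(1, close) : null;
        }
        // IPv4 form from the documented format: the address ends at the first colon.
        int colonIdx = rest.indexOf(':');
        return colonIdx > 0 ? rest.substring(0, colonIdx) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIp("watcher=10.0.0.5:0/123456 client.4567 cookie=1"));
        System.out.println(extractIp("watcher=[2001:db8::5]:0/42 client.1 cookie=2"));
    }
}
```

Using `PREFIX.length()` in place of the literal `8` keeps the offset tied to the prefix string if it ever changes.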