<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391

Open
MatheMatrix wants to merge 2 commits into 5.5.12 from sync/ye.zou/fix/ZSTAC-67534
@MatheMatrix
Owner

Summary

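  • In large-scale clusters (3000+ physical hosts), hosts are frequently misreported as disconnected and then reconnected
  • Root cause: the ping timeout is a static value; in large-scale environments ping responses naturally take longer, so healthy hosts are misjudged as timed out
  • Fix: implement an adaptive timeout based on an EMA (exponential moving average) that tracks ping response time per host, with timeout = max(configured, EMA * 3.0)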
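
The formula in the last bullet can be sketched as follows. This is a minimal illustration, not the code from the PR: the class name `AdaptivePingTimeoutSketch` and its method names are hypothetical, while `EMA_ALPHA = 0.2`, `EMA_SAFETY_FACTOR = 3.0`, the `merge`-based update, and the `ceil` rounding follow the snippets quoted in the review comments further down.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; not the actual HostTrackImpl code. */
final class AdaptivePingTimeoutSketch {
    static final double EMA_ALPHA = 0.2;          // weight given to the newest sample
    static final double EMA_SAFETY_FACTOR = 3.0;  // headroom multiplier applied to the EMA

    // hostUuid -> EMA of ping response time, in seconds
    private final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();

    /** Folds one observed response time into the host's EMA; the first sample seeds it. */
    void update(String hostUuid, double responseTimeSec) {
        pingResponseEma.merge(hostUuid, responseTimeSec,
                (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
    }

    /** Returns max(configured, ceil(EMA * factor)); the configured value acts as a floor. */
    long timeoutSeconds(String hostUuid, long configuredSec) {
        Double ema = pingResponseEma.get(hostUuid);
        if (ema == null) {
            return configuredSec; // no data yet: fall back to the static value
        }
        long adaptive = (long) Math.ceil(ema * EMA_SAFETY_FACTOR);
        return Math.max(configuredSec, adaptive);
    }
}
```

With a configured timeout of 5 s, a host whose EMA is 4 s gets a 12 s timeout, while a host whose EMA is 1 s stays at 5 s.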

Changes

  • HostTrackImpl.java: adds the EMA calculation logic, per-host response-time tracking, and the static method getAdaptiveTimeout()
  • KVMHost.java: the timeout argument of two asyncJsonPost calls now comes from HostTrackImpl.getAdaptiveTimeout()

Test Plan

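  • Small cluster: timeout = configured value (the configured value serves as the floor until the EMA has accumulated enough data)
  • Large cluster: the EMA adaptively raises the timeout, eliminating the false reports
  • Dynamic adjustment: after the network recovers, the EMA gradually lowers the timeout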
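
The "dynamic adjustment" case can be reasoned about with a small simulation. The sketch below is hypothetical (the class and method names are not from the PR) and assumes the same constants as the review snippets, alpha = 0.2 and safety factor = 3.0. It counts how many consecutive fast pings are needed before the adaptive value drops back to the configured floor.

```java
/** Illustrative sketch only; simulates how quickly the EMA decays after recovery. */
final class EmaRecoverySketch {
    static final double EMA_ALPHA = 0.2;
    static final double EMA_SAFETY_FACTOR = 3.0;

    /**
     * Counts how many consecutive pings at steadySampleSec are needed before
     * ceil(EMA * factor) no longer exceeds the configured timeout.
     */
    static int pingsUntilFloor(double startEmaSec, double steadySampleSec, long configuredSec) {
        // The EMA only approaches the steady value asymptotically, so the steady
        // value itself must sit strictly below the floor for the loop to terminate.
        if (steadySampleSec * EMA_SAFETY_FACTOR >= configuredSec) {
            throw new IllegalArgumentException("EMA would never fall to the configured floor");
        }
        double ema = startEmaSec;
        int pings = 0;
        while (Math.ceil(ema * EMA_SAFETY_FACTOR) > configuredSec) {
            ema = EMA_ALPHA * steadySampleSec + (1 - EMA_ALPHA) * ema;
            pings++;
        }
        return pings;
    }
}
```

For example, starting from an EMA of 10 s with 1 s responses and a 5 s configured timeout, `pingsUntilFloor(10.0, 1.0, 5)` returns 12, since the excess over the steady value shrinks by a factor of 0.8 per ping.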

Resolves: ZSTAC-67534

sync from gitlab !9239

Resolves: ZSTAC-67534

Change-Id: I337a5bf8efa9cad20e39f947d5c06e944003205c
@coderabbitai

coderabbitai bot commented Feb 26, 2026

Walkthrough

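Introduces an adaptive ping timeout based on a per-host EMA: HostTrackImpl maintains the per-host EMA and exposes a static method that computes the adaptive timeout, KVMHost replaces its fixed timeout with that adaptive value, and a new global config switch enables the feature.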

Changes

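| Cohort / File(s) | Summary |
|---|---|
| Adaptive ping timeout core implementation<br>compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java | Adds the EMA constants and the pingResponseEma store; adds public static long getAdaptiveTimeout(String hostUuid), which returns the per-host adaptive timeout; adds updatePingResponseEma(...), which records the observed response time and updates the EMA on the ping success path. Note that the existing decision branches are unchanged; the EMA is used only in logging and in the timeout calculation. |
| Where the timeout parameter is applied<br>plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java | Replaces two fixed uses of HostGlobalConfig.PING_HOST_TIMEOUT.value(Long.class) with HostTrackImpl.getAdaptiveTimeout(self.getUuid()), so that KVM host ping calls use the per-host adaptive timeout. |
| Global config switch<br>compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java | Adds the boolean global config PING_ADAPTIVE_TIMEOUT_ENABLED (default false) to turn the EMA adaptive timeout on or off. |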
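
The global switch described above could gate the computation roughly as in the sketch below. This is an assumption-laden illustration, not the PR's code: `SwitchableTimeoutSketch` is a hypothetical class, and the two suppliers merely stand in for reading `PING_ADAPTIVE_TIMEOUT_ENABLED` and `PING_HOST_TIMEOUT`, whose real lookup API is not reproduced here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Illustrative sketch only; the suppliers stand in for global config lookups. */
final class SwitchableTimeoutSketch {
    private static final double EMA_SAFETY_FACTOR = 3.0;

    private final BooleanSupplier adaptiveEnabled;   // stand-in for PING_ADAPTIVE_TIMEOUT_ENABLED
    private final LongSupplier configuredTimeoutSec; // stand-in for PING_HOST_TIMEOUT
    final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();

    SwitchableTimeoutSketch(BooleanSupplier adaptiveEnabled, LongSupplier configuredTimeoutSec) {
        this.adaptiveEnabled = adaptiveEnabled;
        this.configuredTimeoutSec = configuredTimeoutSec;
    }

    long getAdaptiveTimeout(String hostUuid) {
        long configured = configuredTimeoutSec.getAsLong();
        if (!adaptiveEnabled.getAsBoolean()) {
            return configured; // switch off: keep the historical static behaviour
        }
        Double ema = pingResponseEma.get(hostUuid);
        if (ema == null) {
            return configured; // switch on, but no samples for this host yet
        }
        return Math.max(configured, (long) Math.ceil(ema * EMA_SAFETY_FACTOR));
    }
}
```

Reading the flag on every call, rather than caching it, means that toggling the switch takes effect on the next ping, which is what makes a gradual rollout and a quick rollback practical.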

Sequence Diagram(s)

(No sequence diagram generated)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

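Poem

🐰 I count the seconds of every heartbeat,
EMA whispers softly in my ear,
the timeout slowly adjusts to each host's rhythm,
links grow more patient, retries more gentle,
little hopping footprints, and the network is well 🥕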


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

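| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❌ Error | The title exceeds the 72-character limit (83 characters) and does not meet the format requirement. | Shorten the title to 72 characters or fewer, for example: '[host]: adaptive EMA-based ping timeout [ZSTAC-67534]' |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description fully explains the root cause, the solution, and the code changes, and is highly relevant to the changeset. |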
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)

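45-45: pingResponseEma currently grows without bound, risking long-term memory bloat

Line 45 uses a ConcurrentHashMap to hold each host's EMA indefinitely, with no eviction policy; in a long-running environment where hosts are frequently added, removed, or migrated, historical UUIDs keep accumulating.

Suggested change (introduce expiry-based eviction)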
-    private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();
+    private static final Cache<String, Double> pingResponseEma = CacheBuilder.newBuilder()
+            .expireAfterAccess(24, TimeUnit.HOURS)
+            .build();

@@
-        Double ema = pingResponseEma.get(hostUuid);
+        Double ema = pingResponseEma.getIfPresent(hostUuid);

@@
-        pingResponseEma.merge(hostUuid, responseTimeSec,
+        pingResponseEma.asMap().merge(hostUuid, responseTimeSec,
                 (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` at line 45,
The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows unbounded
and should be converted to a bounded/expiring structure to avoid memory bloat;
replace pingResponseEma with a cache that evicts entries (for example Guava
Cache or Caffeine) configured with a maximumSize and
expireAfterAccess/expireAfterWrite, or implement a
ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that
removes entries older than a threshold; ensure all accesses currently using
pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the
chosen cache API or consult the timestamp before using/removing entries.
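
The second option in the prompt above, a timestamped map plus a periodic cleanup, can be sketched with the standard library alone. Everything named here (`ExpiringEmaStore`, `Stamped`, `evictStale`) is hypothetical and chosen for illustration; the clock is passed in as a parameter only so the behaviour is easy to exercise.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; a timestamped EMA map with explicit stale-entry eviction. */
final class ExpiringEmaStore {
    /** Immutable pair of an EMA value and the time it was last refreshed. */
    private static final class Stamped {
        final double ema;
        final long updatedAtMillis;

        Stamped(double ema, long updatedAtMillis) {
            this.ema = ema;
            this.updatedAtMillis = updatedAtMillis;
        }
    }

    private static final double EMA_ALPHA = 0.2;
    private final ConcurrentHashMap<String, Stamped> entries = new ConcurrentHashMap<>();

    /** Folds a sample into the host's EMA and refreshes its timestamp atomically. */
    void update(String hostUuid, double sampleSec, long nowMillis) {
        entries.merge(hostUuid, new Stamped(sampleSec, nowMillis),
                (old, fresh) -> new Stamped(
                        EMA_ALPHA * fresh.ema + (1 - EMA_ALPHA) * old.ema,
                        fresh.updatedAtMillis));
    }

    /** Returns the host's EMA, or null if nothing is stored. */
    Double get(String hostUuid) {
        Stamped s = entries.get(hostUuid);
        return s == null ? null : s.ema;
    }

    /** Removes entries not refreshed within maxAgeMillis; intended for a periodic task. */
    int evictStale(long nowMillis, long maxAgeMillis) {
        int removed = 0;
        for (Map.Entry<String, Stamped> e : entries.entrySet()) {
            boolean stale = nowMillis - e.getValue().updatedAtMillis > maxAgeMillis;
            // remove(key, value) leaves the entry alone if a concurrent update replaced it
            if (stale && entries.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }
}
```

In a real service `evictStale` would presumably be driven by a scheduled task; a library cache such as the Guava one proposed in the diff achieves the same effect with less code, at the cost of changing the map's access API.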
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 51-59: getAdaptiveTimeout currently flips to EMA-based adaptive
timeout as soon as pingResponseEma contains a value; add a global compatibility
switch (e.g., a new HostGlobalConfig like ENABLE_ADAPTIVE_PING_TIMEOUT
defaulting to false) and update getAdaptiveTimeout to return the historical
configured timeout (HostGlobalConfig.PING_HOST_TIMEOUT.value...) when the switch
is off; when the switch is on keep the existing logic using pingResponseEma and
EMA_SAFETY_FACTOR so teams can gray-release and rollback the adaptive behavior.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5e4f49 and 20160c7.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Resolves: ZSTAC-67534

Change-Id: I3f5bfa3af90a5ee419a9886e9e119551f81455e2
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)

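53-63: Consider requiring a minimum sample count before enabling the EMA timeout

Lines 58-63 currently switch to the adaptive timeout as soon as a single sample exists. One abnormally slow response immediately inflates the timeout, which is not stable enough. Consider enabling the EMA only after the sample count reaches a threshold.

🛠️ Suggested change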
@@
     private static final double EMA_ALPHA = 0.2;
     private static final double EMA_SAFETY_FACTOR = 3.0;
+    private static final int EMA_MIN_SAMPLE_COUNT = 3;
     private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();
+    private static final ConcurrentHashMap<String, AtomicInteger> pingResponseSampleCount = new ConcurrentHashMap<>();
@@
         Double ema = pingResponseEma.get(hostUuid);
-        if (ema == null) {
+        AtomicInteger sampleCount = pingResponseSampleCount.get(hostUuid);
+        if (ema == null || sampleCount == null || sampleCount.get() < EMA_MIN_SAMPLE_COUNT) {
             return configured;
         }
@@
     private static void updatePingResponseEma(String hostUuid, double responseTimeSec) {
         pingResponseEma.merge(hostUuid, responseTimeSec,
                 (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
+        pingResponseSampleCount.computeIfAbsent(hostUuid, k -> new AtomicInteger(0)).incrementAndGet();
     }

Also applies to: 70-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 53 - 63, In getAdaptiveTimeout ensure EMA-based adaptive timeout is only
used after a minimum number of samples: introduce or use a sample-count map
(e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config),
check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using
pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema *
EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same
sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so
transient single slow responses don't immediately inflate timeouts.
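
One way to apply the sample-threshold guard described above is to keep the EMA and its sample count in a single immutable value, so the two cannot drift apart the way two separate maps might. This is a hedged sketch under that assumption; `GatedEmaSketch`, `State`, and `MIN_SAMPLES = 3` are illustrative names and values, not code from the PR.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; EMA timeout that stays inactive until enough samples exist. */
final class GatedEmaSketch {
    static final double EMA_ALPHA = 0.2;
    static final double EMA_SAFETY_FACTOR = 3.0;
    static final int MIN_SAMPLES = 3; // hypothetical threshold

    /** EMA and sample count updated together in one atomic merge. */
    private static final class State {
        final double ema;
        final int samples;

        State(double ema, int samples) {
            this.ema = ema;
            this.samples = samples;
        }
    }

    private final ConcurrentHashMap<String, State> states = new ConcurrentHashMap<>();

    void update(String hostUuid, double sampleSec) {
        states.merge(hostUuid, new State(sampleSec, 1),
                (old, fresh) -> new State(
                        EMA_ALPHA * fresh.ema + (1 - EMA_ALPHA) * old.ema,
                        old.samples + 1));
    }

    long timeoutSeconds(String hostUuid, long configuredSec) {
        State s = states.get(hostUuid);
        if (s == null || s.samples < MIN_SAMPLES) {
            return configuredSec; // too few samples: ignore the EMA for now
        }
        return Math.max(configuredSec, (long) Math.ceil(s.ema * EMA_SAFETY_FACTOR));
    }
}
```

Note that the gate only delays the EMA taking effect. Because the first sample seeds the EMA, a slow first response still dominates the average for a while after the threshold is reached, so the threshold and alpha would need tuning together.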
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
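- Around line 42-45: pingResponseEma is a static ConcurrentHashMap holding each
host's EMA data, but with no cleanup path it leaks memory; in HostTrackImpl, add
cleanup logic on the paths where a host is stopped, deleted, or no longer tracked
so that the corresponding key is removed (for example, call
pingResponseEma.remove(hostUuid) in
stopTrackHost/handleHostDisconnected/handleHostDeleted or any untrack callback),
and optionally add a single helper in HostTrackImpl (e.g.
cleanupPingEmaForHost(String hostUuid)) to centralize removal and logging; if
long-term automatic reclamation is needed, replace or supplement this with a
time-based (TTL) policy or a weak-reference cache, but the primary fix is to
explicitly remove the pingResponseEma entry at the end of the host lifecycle to
avoid unbounded growth.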

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20160c7 and c8365ef.

⛔ Files ignored due to path filters (1)
  • conf/globalConfig/host.xml is excluded by !**/*.xml
📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java

Comment on lines +42 to +45
// EMA adaptive timeout: tracks per-host exponential moving average of ping response times (in seconds)
private static final double EMA_ALPHA = 0.2;
private static final double EMA_SAFETY_FACTOR = 3.0;
private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();
⚠️ Potential issue | 🟠 Major

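pingResponseEma has no reclamation path, risking long-term memory growth

Line 45 adds a static cache of per-host state, but the class currently has no cleanup logic tied to the host lifecycle. Entries linger after a host is deleted or no longer tracked, so the map keeps growing over a long run.

♻️ Suggested fix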
@@
             if (t == null) {
                 logger.debug(String.format("host[uuid:%s] seems to be deleted, stop tracking it", uuid));
+                pingResponseEma.remove(uuid);
                 return;
             }
@@
         @Override
         public void cancel() {
             if (reconnectTask != null) {
                 reconnectTask.cancel();
             }

             super.cancel();

+            pingResponseEma.remove(uuid);
             trackers.remove(uuid);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

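In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 42 - 45, pingResponseEma is a static ConcurrentHashMap holding each host's
EMA data, but with no cleanup path it leaks memory; in HostTrackImpl, add cleanup
logic on the paths where a host is stopped, deleted, or no longer tracked so that
the corresponding key is removed (for example, call
pingResponseEma.remove(hostUuid) in
stopTrackHost/handleHostDisconnected/handleHostDeleted or any untrack callback),
and optionally add a single helper in HostTrackImpl (e.g.
cleanupPingEmaForHost(String hostUuid)) to centralize removal and logging; if
long-term automatic reclamation is needed, replace or supplement this with a
time-based (TTL) policy or a weak-reference cache, but the primary fix is to
explicitly remove the pingResponseEma entry at the end of the host lifecycle to
avoid unbounded growth.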
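
The lifecycle-bound cleanup recommended above amounts to removing the EMA entry wherever tracking ends. The sketch below shows the shape of that idea in isolation; `TrackerLifecycleSketch` and its methods are hypothetical and do not mirror the real tracker classes in HostTrackImpl.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; ties the EMA entry's lifetime to host tracking. */
final class TrackerLifecycleSketch {
    private static final double EMA_ALPHA = 0.2;

    private final Set<String> trackedHosts = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>();

    void startTracking(String hostUuid) {
        trackedHosts.add(hostUuid);
    }

    void recordPing(String hostUuid, double responseTimeSec) {
        if (!trackedHosts.contains(hostUuid)) {
            return; // best-effort guard: skip samples for hosts that are no longer tracked
        }
        pingResponseEma.merge(hostUuid, responseTimeSec,
                (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);
    }

    /** Single cleanup point, called from every path that ends tracking. */
    void stopTracking(String hostUuid) {
        trackedHosts.remove(hostUuid);
        pingResponseEma.remove(hostUuid);
    }

    int emaEntryCount() {
        return pingResponseEma.size();
    }
}
```

The guard in `recordPing` is not atomic with `stopTracking`, so a ping that completes at exactly the wrong moment could still re-create an entry; a time-based eviction as a backstop would cover that residual case.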


3 participants