<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391
<fix>[host]: adaptive EMA-based ping timeout for large-scale clusters [ZSTAC-67534]#3391MatheMatrix wants to merge 2 commits into5.5.12from
Conversation
Resolves: ZSTAC-67534 Change-Id: I337a5bf8efa9cad20e39f947d5c06e944003205c
Walkthrough引入基于单主机 EMA 的自适应 Ping 超时机制:在 HostTrackImpl 中维护 per-host EMA 并提供计算自适应超时的静态方法,同时在 KVMHost 中将固定超时替换为该自适应超时,并新增全局配置开关以启用该功能。 Changes
Sequence Diagram(s)(不生成序列图) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 诗歌
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (1 passed)
✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)
45-45:pingResponseEma目前是无界增长结构,存在长期内存膨胀风险Line 45 使用
ConcurrentHashMap持久保存每个主机 EMA,但没有淘汰策略;在主机频繁增删/迁移的长期运行环境里会持续累积历史 UUID。建议修改(引入过期淘汰)
- private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>(); + private static final Cache<String, Double> pingResponseEma = CacheBuilder.newBuilder() + .expireAfterAccess(24, TimeUnit.HOURS) + .build(); @@ - Double ema = pingResponseEma.get(hostUuid); + Double ema = pingResponseEma.getIfPresent(hostUuid); @@ - pingResponseEma.merge(hostUuid, responseTimeSec, + pingResponseEma.asMap().merge(hostUuid, responseTimeSec, (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` at line 45, The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows unbounded and should be converted to a bounded/expiring structure to avoid memory bloat; replace pingResponseEma with a cache that evicts entries (for example Guava Cache or Caffeine) configured with a maximumSize and expireAfterAccess/expireAfterWrite, or implement a ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that removes entries older than a threshold; ensure all accesses currently using pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the chosen cache API or consult the timestamp before using/removing entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 51-59: getAdaptiveTimeout currently flips to EMA-based adaptive
timeout as soon as pingResponseEma contains a value; add a global compatibility
switch (e.g., a new HostGlobalConfig like ENABLE_ADAPTIVE_PING_TIMEOUT
defaulting to false) and update getAdaptiveTimeout to return the historical
configured timeout (HostGlobalConfig.PING_HOST_TIMEOUT.value...) when the switch
is off; when the switch is on keep the existing logic using pingResponseEma and
EMA_SAFETY_FACTOR so teams can gray-release and rollback the adaptive behavior.
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Line 45: The static pingResponseEma ConcurrentHashMap in HostTrackImpl grows
unbounded and should be converted to a bounded/expiring structure to avoid
memory bloat; replace pingResponseEma with a cache that evicts entries (for
example Guava Cache or Caffeine) configured with a maximumSize and
expireAfterAccess/expireAfterWrite, or implement a
ConcurrentHashMap<Pair<Double, timestamp>> plus a periodic cleanup task that
removes entries older than a threshold; ensure all accesses currently using
pingResponseEma (reads/writes in HostTrackImpl methods) are updated to use the
chosen cache API or consult the timestamp before using/removing entries.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.javaplugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
Resolves: ZSTAC-67534 Change-Id: I3f5bfa3af90a5ee419a9886e9e119551f81455e2
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java (1)
53-63: 建议增加最小样本门槛后再启用 EMA 超时Line 58-63 目前只要有一个样本就会使用自适应超时。单次异常慢包会立即放大 timeout,稳定性不足。建议在样本达到阈值后再启用 EMA。
🛠️ 参考改法
@@ private static final double EMA_ALPHA = 0.2; private static final double EMA_SAFETY_FACTOR = 3.0; + private static final int EMA_MIN_SAMPLE_COUNT = 3; private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>(); + private static final ConcurrentHashMap<String, AtomicInteger> pingResponseSampleCount = new ConcurrentHashMap<>(); @@ Double ema = pingResponseEma.get(hostUuid); - if (ema == null) { + AtomicInteger sampleCount = pingResponseSampleCount.get(hostUuid); + if (ema == null || sampleCount == null || sampleCount.get() < EMA_MIN_SAMPLE_COUNT) { return configured; } @@ private static void updatePingResponseEma(String hostUuid, double responseTimeSec) { pingResponseEma.merge(hostUuid, responseTimeSec, (oldEma, sample) -> EMA_ALPHA * sample + (1 - EMA_ALPHA) * oldEma); + pingResponseSampleCount.computeIfAbsent(hostUuid, k -> new AtomicInteger(0)).incrementAndGet(); }Also applies to: 70-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around lines 53 - 63, In getAdaptiveTimeout ensure EMA-based adaptive timeout is only used after a minimum number of samples: introduce or use a sample-count map (e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config), check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema * EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so transient single slow responses don't immediately inflate timeouts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 42-45: pingResponseEma 是一个静态 ConcurrentHashMap 存放主机的 EMA
数据,但没有清理路径会导致内存泄漏;在 HostTrackImpl 中为主机停止/删除或不再跟踪的路径添加清理逻辑以删除对应 key(例如在
stopTrackHost/handleHostDisconnected/handleHostDeleted 或任何取消跟踪的回调中调用
pingResponseEma.remove(hostUuid)),并且可选地在 HostTrackImpl 中新增一个统一方法 (e.g.
cleanupPingEmaForHost(String hostUuid))
以集中删除/记录;如果需要长期自动回收,可替换或补充为基于时间的回收(TTL)策略或使用弱引用缓存,但首要修复是在主机生命周期结束处显式移除
pingResponseEma 条目以避免无界增长。
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 53-63: In getAdaptiveTimeout ensure EMA-based adaptive timeout is
only used after a minimum number of samples: introduce or use a sample-count map
(e.g., pingResponseCount) and a MIN_SAMPLES threshold (constant or config),
check that pingResponseCount.get(hostUuid) >= MIN_SAMPLES before using
pingResponseEma.get(hostUuid) to compute adaptive = ceil(ema *
EMA_SAFETY_FACTOR); otherwise return the configured timeout; apply the same
sample-threshold guard to the other EMA usage block mentioned (lines 70-73) so
transient single slow responses don't immediately inflate timeouts.
ℹ️ Review info
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
conf/globalConfig/host.xmlis excluded by!**/*.xml
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.javacompute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
| // EMA adaptive timeout: tracks per-host exponential moving average of ping response times (in seconds) | ||
| private static final double EMA_ALPHA = 0.2; | ||
| private static final double EMA_SAFETY_FACTOR = 3.0; | ||
| private static final ConcurrentHashMap<String, Double> pingResponseEma = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
pingResponseEma 缺少回收路径,存在长期内存增长风险
Line 45 新增了静态主机状态缓存,但当前类里没有与主机生命周期绑定的清理逻辑。主机被删除/停止跟踪后条目会残留,长期运行下会持续增长。
♻️ 建议修复
@@
if (t == null) {
logger.debug(String.format("host[uuid:%s] seems to be deleted, stop tracking it", uuid));
+ pingResponseEma.remove(uuid);
return;
}
@@
`@Override`
public void cancel() {
if (reconnectTask != null) {
reconnectTask.cancel();
}
super.cancel();
+ pingResponseEma.remove(uuid);
trackers.remove(uuid);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 42 - 45, pingResponseEma 是一个静态 ConcurrentHashMap 存放主机的 EMA
数据,但没有清理路径会导致内存泄漏;在 HostTrackImpl 中为主机停止/删除或不再跟踪的路径添加清理逻辑以删除对应 key(例如在
stopTrackHost/handleHostDisconnected/handleHostDeleted 或任何取消跟踪的回调中调用
pingResponseEma.remove(hostUuid)),并且可选地在 HostTrackImpl 中新增一个统一方法 (e.g.
cleanupPingEmaForHost(String hostUuid))
以集中删除/记录;如果需要长期自动回收,可替换或补充为基于时间的回收(TTL)策略或使用弱引用缓存,但首要修复是在主机生命周期结束处显式移除
pingResponseEma 条目以避免无界增长。
Summary
Changes
HostTrackImpl.java: 新增 EMA 计算逻辑、per-host 响应时间跟踪、getAdaptiveTimeout()静态方法KVMHost.java: 两处asyncJsonPosttimeout 参数改为调用HostTrackImpl.getAdaptiveTimeout()Test Plan
Resolves: ZSTAC-67534
sync from gitlab !9239