<fix>[host]: add reconnect jitter and concurrency control [ZSTAC-61971]#3392

Open
zstack-robot-1 wants to merge 1 commit into 5.5.12 from sync/ye.zou/fix/ZSTAC-61971

Conversation

@zstack-robot-1
Collaborator

Summary

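  • When 3,000 physical hosts reconnect at the same time, a thundering herd overwhelms the management node (MN), and some hosts pile up in the pending state
  • Root cause: all hosts issue reconnect requests simultaneously, with no concurrency control and no mechanism to spread them out
  • Fix:
    1. Semaphore-based concurrency control (default 100, configurable via connection.reconnectMaxConcurrency)
    2. Random jitter delay (default 0-30s, configurable via connection.reconnectJitterMaxSeconds)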
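
The two measures in the fix list can be sketched in isolation as follows. This is a minimal, synchronous illustration and not the code in HostTrackImpl: the class name ReconnectThrottleSketch and its methods are hypothetical, and the real reconnect path is asynchronous (a message plus a callback), so the permit there is released in the callback rather than in a finally block.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of "jitter + concurrency cap"; names are illustrative only. */
final class ReconnectThrottleSketch {
    private final Semaphore permits;
    private final int jitterMaxSeconds;

    ReconnectThrottleSketch(int maxConcurrency, int jitterMaxSeconds) {
        this.permits = new Semaphore(maxConcurrency);
        this.jitterMaxSeconds = jitterMaxSeconds;
    }

    /** Random delay in [0, jitterMaxSeconds] seconds, in milliseconds; 0 means "start right now". */
    long nextJitterMillis() {
        if (jitterMaxSeconds <= 0) {
            return 0L;
        }
        return ThreadLocalRandom.current().nextLong(jitterMaxSeconds * 1000L + 1);
    }

    /** Runs the reconnect only if a permit is free; returns false when the attempt is deferred. */
    boolean tryReconnect(Runnable reconnect) {
        if (!permits.tryAcquire()) {
            // Over the limit: skip this attempt (assumed to be retried by a later tracking cycle).
            return false;
        }
        try {
            reconnect.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```

tryAcquire() is non-blocking, so a host that finds no free permit is skipped rather than queued, which keeps tracker threads from stalling behind the limit.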

Changes

  • HostGlobalConfig.java: adds the RECONNECT_JITTER_MAX_SECONDS and RECONNECT_MAX_CONCURRENCY config items
  • HostTrackImpl.java:
    • reconnectNow() adds Semaphore acquire/release
    • startTrack() adds a random jitter delay
    • The Semaphore supports dynamic config updates

Test Plan

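  • Small scale: jitter has no effect (with jitterMax=0, startRightNow is called directly)
  • Large scale: reconnect requests are spread over 30s, with at most 100 running concurrently
  • Dynamic config: the Semaphore takes effect immediately after the GlobalConfig is modified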
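
As a rough check on the large-scale case: 3,000 hosts spread over 30 seconds is a mean of 100 reconnect starts per second. The sketch below simulates that spread under two assumptions that are not from the PR: delays are drawn uniformly, and they are whole seconds in [0, 30). The class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical simulation of how uniform jitter spreads reconnect start times. */
final class JitterSpreadSketch {
    /** Counts how many of hostCount reconnects start in each one-second bucket (jitterMaxSeconds > 0). */
    static int[] startsPerSecond(int hostCount, int jitterMaxSeconds, long seed) {
        Random random = new Random(seed);
        int[] buckets = new int[jitterMaxSeconds];
        for (int i = 0; i < hostCount; i++) {
            buckets[random.nextInt(jitterMaxSeconds)]++;
        }
        return buckets;
    }

    public static void main(String[] args) {
        int[] buckets = startsPerSecond(3000, 30, 42L);
        int peak = 0;
        for (int count : buckets) {
            peak = Math.max(peak, count);
        }
        // The mean is 3000 / 30 = 100; the peak varies with the random draw.
        System.out.println("peak starts in one second: " + peak + " (mean 100)");
    }
}
```

Because the draw is random, some seconds will see more than the mean number of starts, which is one reason a concurrency cap is useful alongside the jitter.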

Resolves: ZSTAC-61971

sync from gitlab !9240

…cale clusters

Resolves: ZSTAC-61971

Change-Id: I267818cfc24ad3b358087f47274749c65ad3fc66
@coderabbitai

coderabbitai bot commented Feb 26, 2026

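Overview

Adds two new items to the host global config (reconnect jitter and a concurrency limit), and implements the corresponding semaphore-based concurrency control and random delay in the host tracking module.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Config extension: compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java | Adds two global configs: RECONNECT_JITTER_MAX_SECONDS (maximum jitter for the host reconnect delay, default 30 seconds) and RECONNECT_MAX_CONCURRENCY (maximum number of concurrent reconnect operations, default 100). |
| Concurrency control implementation: compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java | Introduces a semaphore to limit concurrent reconnects; adds a config-driven random jitter delay at start; implements a config change listener to adjust the concurrency limit dynamically. |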

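Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Two new config treasures join the fold,
Jitter and delay work hand in hand,
A semaphore holds back the concurrency tide,
Host tracking now runs steadier than before!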

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

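| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows the [scope]: format, its length of 71 characters is within the 72-character limit, and it clearly describes the main change (reconnect jitter and concurrency control). |
| Description check | ✅ Passed | The PR description details the problem, root cause, solution, and test plan, and is fully relevant to the code changes (the modifications to HostGlobalConfig and HostTrackImpl). |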


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 76-80: The reconnectSemaphore is read twice from the field (once
for tryAcquire and again for release), which can cause acquire/release to happen
on different Semaphore instances after a hot-reload; fix by capturing the field
into a local final variable (e.g., final Semaphore localReconnectSemaphore =
this.reconnectSemaphore) immediately before tryAcquire in the HostTrackImpl
method that does reconnect handling, use that local variable for both tryAcquire
and release (and any early returns) so the same Semaphore instance is used
regardless of concurrent updates.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5e4f49 and 140bcc8.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java

Comment on lines +76 to +80
        if (!reconnectSemaphore.tryAcquire()) {
            logger.debug(String.format("[Host Tracker]: reconnect concurrency limit reached, deferring reconnect for host[uuid:%s]", uuid));
            completion.success();
            return;
        }
⚠️ Potential issue | 🔴 Critical

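After a dynamic config update, a permit may be released to the wrong Semaphore, which skews the concurrency limit.

Line 76 acquires the permit from the "current instance", but Line 89 reads the reconnectSemaphore field again when releasing. If the hot update at Lines 381-385 replaces the instance in between, the release lands on the new instance: the old instance leaks a permit and the new instance is over-issued one.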
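
The effect can be reproduced deterministically outside ZStack. The following is a minimal sketch, not code from the PR: the class SemaphoreSwapLeakDemo and its method are hypothetical, and the three steps that would normally interleave across threads are run in sequence on a single thread.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical demo of releasing a permit to a different Semaphore than the one acquired from. */
final class SemaphoreSwapLeakDemo {
    // Stands in for the field that the config listener replaces on a hot update.
    private volatile Semaphore reconnectSemaphore;

    /** Returns {permits left on the old instance, permits on the new instance}. */
    int[] acquireSwapRelease(int limit) {
        reconnectSemaphore = new Semaphore(limit);
        Semaphore old = reconnectSemaphore;

        reconnectSemaphore.tryAcquire();           // acquire from the old instance: limit -> limit - 1
        reconnectSemaphore = new Semaphore(limit); // config change swaps in a fresh instance
        reconnectSemaphore.release();              // field is re-read: release hits the new instance

        // With limit = 2 this yields {1, 3}: one permit leaked, one permit over-issued.
        return new int[] { old.availablePermits(), reconnectSemaphore.availablePermits() };
    }
}
```

java.util.concurrent.Semaphore does not cap releases at the initial permit count, which is why the new instance ends up above its configured limit.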

🔧 Suggested fix (ensure acquire/release use the same instance)
 private void reconnectNow(String uuid, Completion completion) {
-    if (!reconnectSemaphore.tryAcquire()) {
+    final Semaphore acquiredSemaphore = reconnectSemaphore;
+    if (!acquiredSemaphore.tryAcquire()) {
         logger.debug(String.format("[Host Tracker]: reconnect concurrency limit reached, deferring reconnect for host[uuid:%s]", uuid));
         completion.success();
         return;
     }

     ReconnectHostMsg msg = new ReconnectHostMsg();
     msg.setHostUuid(uuid);
     msg.setSkipIfHostConnected(true);
     bus.makeTargetServiceIdByResourceUuid(msg, HostConstant.SERVICE_ID, uuid);
-    bus.send(msg, new CloudBusCallBack(completion) {
-        @Override
-        public void run(MessageReply reply) {
-            reconnectSemaphore.release();
-            if (reply.isSuccess()) {
-                completion.success();
-            } else {
-                completion.fail(reply.getError());
-            }
-        }
-    });
+    try {
+        bus.send(msg, new CloudBusCallBack(completion) {
+            @Override
+            public void run(MessageReply reply) {
+                acquiredSemaphore.release();
+                if (reply.isSuccess()) {
+                    completion.success();
+                } else {
+                    completion.fail(reply.getError());
+                }
+            }
+        });
+    } catch (Throwable t) {
+        acquiredSemaphore.release();
+        throw t;
+    }
 }

Also applies to: 89-89, 381-385
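
A different way to avoid the mismatch, which neither the PR nor the review proposes, is to keep a single Semaphore instance for the lifetime of the tracker and resize it in place. The sketch below assumes a hypothetical ResizableSemaphore class; it relies only on the standard release(int) method and the protected reducePermits(int) method of java.util.concurrent.Semaphore.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical semaphore whose limit can change without replacing the instance. */
final class ResizableSemaphore extends Semaphore {
    private static final long serialVersionUID = 1L;

    private int limit;

    ResizableSemaphore(int limit) {
        super(limit);
        this.limit = limit;
    }

    /** Adjusts the total number of permits; permits already held stay valid. */
    synchronized void resize(int newLimit) {
        if (newLimit < 0) {
            throw new IllegalArgumentException("newLimit must be >= 0");
        }
        int delta = newLimit - limit;
        if (delta > 0) {
            release(delta);        // grow: hand out extra permits
        } else if (delta < 0) {
            reducePermits(-delta); // shrink: may push the available count below zero for a while
        }
        limit = newLimit;
    }
}
```

With this approach a config listener would call resize(newValue) instead of assigning a new object to the field, so a callback that started before the update still releases to the same instance. When shrinking, in-flight reconnects are not interrupted; new acquisitions simply fail until enough permits have been returned.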
