<fix>[host]: add reconnect jitter and concurrency control [ZSTAC-61971] by zstack-robot-1 · Pull Request #3392 · MatheMatrix/zstack

zstack-robot-1 · 2026-02-26T03:32:34Z

Summary

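When 3000 physical hosts reconnect at the same time, a thundering-herd effect overwhelms the management node (MN), and pending requests pile up for some hosts.
Root cause: all hosts issue reconnect requests simultaneously, with no concurrency control or spreading mechanism.
Fix:
1. Semaphore-based concurrency control (default 100, configurable via connection.reconnectMaxConcurrency)
2. Random jitter delay (default 0-30s, configurable via connection.reconnectJitterMaxSeconds)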
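
The jitter in item 2 can be illustrated with a minimal sketch. It assumes the delay is drawn uniformly from the configured range; the class and method names below are hypothetical and are not taken from the PR.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not the PR's code: picks a random start delay per host.
class ReconnectJitter {
    /**
     * Returns a delay in milliseconds drawn uniformly from [0, maxSeconds * 1000].
     * A non-positive maximum disables jitter and yields 0 (start immediately).
     */
    static long jitterDelayMillis(long maxSeconds) {
        if (maxSeconds <= 0) {
            return 0L;
        }
        // nextLong(bound) is exclusive of bound, so add 1 to include the maximum.
        return ThreadLocalRandom.current().nextLong(maxSeconds * 1000L + 1);
    }

    public static void main(String[] args) {
        // With the default of 30 seconds, each host would wait somewhere in 0..30000 ms.
        System.out.println("delay ms: " + jitterDelayMillis(30));
    }
}
```

Spreading 3000 hosts uniformly over 30 seconds would mean roughly 100 reconnect attempts per second on average, rather than 3000 at once.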

Changes

HostGlobalConfig.java: adds the RECONNECT_JITTER_MAX_SECONDS and RECONNECT_MAX_CONCURRENCY config items
HostTrackImpl.java:
- reconnectNow() now acquires and releases a Semaphore
- startTrack() adds a random jitter delay
- the Semaphore supports dynamic config updates
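
The acquire/release pattern described for reconnectNow() can be sketched in isolation as follows. This is an illustration under assumptions, not the PR's implementation: `ReconnectGate` and `tryReconnect` are hypothetical names, and the asynchronous reply is modeled as a plain `Runnable` callback instead of ZStack's message bus.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical helper, not part of the PR: caps the number of in-flight reconnects.
class ReconnectGate {
    private final Semaphore permits;

    ReconnectGate(int maxConcurrency) {
        this.permits = new Semaphore(maxConcurrency);
    }

    /**
     * Starts an asynchronous reconnect if a permit is free.
     * The operation receives a "done" callback that it should invoke when the reply arrives.
     * Returns false without blocking when the limit is reached, so the caller can defer.
     */
    boolean tryReconnect(Consumer<Runnable> asyncOperation) {
        if (!permits.tryAcquire()) {
            return false;
        }
        // Guard so the permit is returned exactly once, even if the operation
        // invokes the callback and then throws.
        AtomicBoolean released = new AtomicBoolean(false);
        Runnable done = () -> {
            if (released.compareAndSet(false, true)) {
                permits.release();
            }
        };
        try {
            asyncOperation.accept(done);
        } catch (RuntimeException e) {
            done.run();
            throw e;
        }
        return true;
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Using `tryAcquire()` rather than `acquire()` keeps the tracker thread from blocking: a host that cannot get a permit is simply skipped for this round.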

Test Plan

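Small scale: jitter has no effect (when jitterMax=0, startRightNow is called directly)
Large scale: reconnect requests are spread over 30s, with at most 100 running concurrently
Dynamic config: the Semaphore change takes effect immediately after the GlobalConfig is modified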

Resolves: ZSTAC-61971

sync from gitlab !9240

…cale clusters Resolves: ZSTAC-61971 Change-Id: I267818cfc24ad3b358087f47274749c65ad3fc66

coderabbitai · 2026-02-26T03:33:10Z

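Overview

Adds two new config items to the host global configuration (reconnect jitter and a concurrency limit) and implements the corresponding semaphore-based concurrency control and random delay in the host tracking module.

Changes

Cohort / File(s)	Summary
Config extension `compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java`	Adds two global configs: RECONNECT_JITTER_MAX_SECONDS (maximum jitter for the host reconnect delay, default 30 seconds) and RECONNECT_MAX_CONCURRENCY (maximum number of concurrent reconnect operations, default 100).
Concurrency control implementation `compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`	Introduces a semaphore to limit concurrent reconnects; adds a config-driven random jitter to the start delay; implements a config change listener to adjust the concurrency limit dynamically.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Two new treasures join the config,
Jitter delays arrive on cue,
Semaphores hold back the concurrency tide,
Host tracking runs steadier too!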

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
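Title check	✅ Passed	The PR title follows the [scope]: format, is 71 characters long (within the 72-character limit), and clearly describes the main change (reconnect jitter and concurrency control).
Description check	✅ Passed	The PR description details the problem, root cause, solution, and test plan, and is fully relevant to the code changes (modifications to HostGlobalConfig and HostTrackImpl).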


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 76-80: The reconnectSemaphore is read twice from the field (once
for tryAcquire and again for release), which can cause acquire/release to happen
on different Semaphore instances after a hot-reload; fix by capturing the field
into a local final variable (e.g., final Semaphore localReconnectSemaphore =
this.reconnectSemaphore) immediately before tryAcquire in the HostTrackImpl
method that does reconnect handling, use that local variable for both tryAcquire
and release (and any early returns) so the same Semaphore instance is used
regardless of concurrent updates.

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5e4f49 and 140bcc8.

📒 Files selected for processing (2)

compute/src/main/java/org/zstack/compute/host/HostGlobalConfig.java
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java

coderabbitai · 2026-02-26T03:36:58Z

compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java

+        if (!reconnectSemaphore.tryAcquire()) {
+            logger.debug(String.format("[Host Tracker]: reconnect concurrency limit reached, deferring reconnect for host[uuid:%s]", uuid));
+            completion.success();
+            return;
+        }


⚠️ Potential issue | 🔴 Critical

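After a dynamic config update, the permit may be released to the wrong Semaphore, distorting the concurrency limit.

Line 76 acquires the permit from the "current instance", but line 89 reads the reconnectSemaphore field again when releasing. If the hot update at lines 381-385 replaces the instance in between, the release lands on the new instance: the old instance leaks a permit and the new instance is over-issued one.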
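
The effect can be reproduced deterministically in a small standalone sketch. It does not use the PR's classes; `SemaphoreSwapDemo` and its field are stand-ins, and the "hot update" is simulated by reassigning the field between acquire and release.

```java
import java.util.concurrent.Semaphore;

// Standalone illustration (hypothetical names): what happens to permit counts
// when the semaphore field is replaced while a reconnect is in flight.
class SemaphoreSwapDemo {
    static volatile Semaphore reconnectSemaphore;

    /**
     * Returns {available permits on the old instance, available permits on the new instance}
     * after one acquire, one simulated config update, and one release.
     */
    static int[] run(boolean captureLocal) {
        reconnectSemaphore = new Semaphore(2);
        Semaphore oldInstance = reconnectSemaphore;

        Semaphore acquired = reconnectSemaphore; // local snapshot of the field
        acquired.tryAcquire();                   // old instance now has 1 permit left

        // Simulated hot update: the field is swapped to a fresh instance.
        reconnectSemaphore = new Semaphore(2);
        Semaphore newInstance = reconnectSemaphore;

        if (captureLocal) {
            acquired.release();                  // released where it was acquired
        } else {
            reconnectSemaphore.release();        // re-reads the field: wrong instance
        }
        return new int[] { oldInstance.availablePermits(), newInstance.availablePermits() };
    }
}
```

Re-reading the field leaves the old instance one permit short and pushes the new instance to three permits despite a configured limit of two; releasing through the captured local keeps both at two.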

🔧 Suggested fix (ensure acquire/release use the same instance)

 private void reconnectNow(String uuid, Completion completion) {
-    if (!reconnectSemaphore.tryAcquire()) {
+    final Semaphore acquiredSemaphore = reconnectSemaphore;
+    if (!acquiredSemaphore.tryAcquire()) {
         logger.debug(String.format("[Host Tracker]: reconnect concurrency limit reached, deferring reconnect for host[uuid:%s]", uuid));
         completion.success();
         return;
     }

     ReconnectHostMsg msg = new ReconnectHostMsg();
     msg.setHostUuid(uuid);
     msg.setSkipIfHostConnected(true);
     bus.makeTargetServiceIdByResourceUuid(msg, HostConstant.SERVICE_ID, uuid);
-    bus.send(msg, new CloudBusCallBack(completion) {
-        @Override
-        public void run(MessageReply reply) {
-            reconnectSemaphore.release();
-            if (reply.isSuccess()) {
-                completion.success();
-            } else {
-                completion.fail(reply.getError());
-            }
-        }
-    });
+    try {
+        bus.send(msg, new CloudBusCallBack(completion) {
+            @Override
+            public void run(MessageReply reply) {
+                acquiredSemaphore.release();
+                if (reply.isSuccess()) {
+                    completion.success();
+                } else {
+                    completion.fail(reply.getError());
+                }
+            }
+        });
+    } catch (Throwable t) {
+        acquiredSemaphore.release();
+        throw t;
+    }
 }

Also applies to: 89-89, 381-385
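
A different way to sidestep the instance-swap problem, offered here only as a sketch and not something this PR or the review proposes, is to keep a single Semaphore for the lifetime of the tracker and adjust its permit count in place when the config changes. `ResizableSemaphore` is a hypothetical class; it relies on `Semaphore.reducePermits`, which is a protected method and therefore needs a subclass.

```java
import java.util.concurrent.Semaphore;

// Hypothetical alternative: one long-lived semaphore whose limit can be changed in place,
// so acquire and release always hit the same instance.
class ResizableSemaphore extends Semaphore {
    private int limit;

    ResizableSemaphore(int limit) {
        super(limit);
        this.limit = limit;
    }

    /** Changes the concurrency limit without replacing the semaphore instance. */
    synchronized void resize(int newLimit) {
        if (newLimit < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        int delta = newLimit - limit;
        if (delta > 0) {
            release(delta);        // grow: hand out the extra permits
        } else if (delta < 0) {
            reducePermits(-delta); // shrink: does not block; the count may go negative
        }                          // until enough in-flight holders release
        limit = newLimit;
    }
}
```

With this shape, a config listener would call `resize(newValue)` and in-flight reconnects would release to the same object they acquired from. Shrinking takes full effect only as outstanding permits are returned.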


<fix>[host]: add reconnect jitter and concurrency control for large-s…

140bcc8

…cale clusters Resolves: ZSTAC-61971 Change-Id: I267818cfc24ad3b358087f47274749c65ad3fc66

zstack-robot-1 wants to merge 1 commit into 5.5.12 from sync/ye.zou/fix/ZSTAC-61971
