-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix](cloud) Fix warmup inflight count not decremented on early return #60480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix](cloud) Fix warmup inflight count not decremented on early return #60480
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
0491cea to
0c66eac
Compare
During cloud tablet decommission, some tablets take unexpectedly long time (5+ minutes) to migrate because FE keeps waiting for warmup tasks to complete, even though the tasks have already failed. Root cause: In `FileCacheBlockDownloader::download_file_cache_block()`, when early return occurs (e.g., tablet not found, rowset not found, storage resource error), the `_inflight_tablets` count is not decremented. This causes: 1. `check_download_task()` always returns `done=false` for these tablets 2. FE's `checkInflightWarmUpCacheAsync()` waits until timeout (default 300s) 3. Tablet migration is blocked unnecessarily Example log showing the issue: ``` W download_file_cache_block: tablet_id=xxx rowset_id not found, rowset_id=xxx ``` After this warning, the tablet's inflight count remains, causing the 5-minute wait. 1. Extract the inflight count decrement logic into a reusable lambda `decrease_inflight_count` 2. Call `decrease_inflight_count()` in all early return paths: - When `get_tablet()` fails - When `rowset_id` is not found - When `remote_storage_resource()` fails 3. Refactor `download_done` callback to reuse `decrease_inflight_count`, eliminating code duplication 4. Use value capture for `decrease_inflight_count` in `download_done` lambda to ensure lifetime safety if the callback is ever called asynchronously 5. Add unit tests to verify inflight count is correctly decremented on failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
0c66eac to
7df33de
Compare
|
run buildall |
TPC-H: Total hot run time: 32304 ms |
ClickBench: Total hot run time: 28.29 s |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
dataroaring
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
run beut |
|
run external |
|
run nonConcurrent |
|
run p0 |
|
run buildall |
TPC-H: Total hot run time: 30433 ms |
ClickBench: Total hot run time: 28.27 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
Proposed changes
Problem
During cloud tablet decommission, some tablets take unexpectedly long time (5+ minutes) to migrate because FE keeps waiting for warmup tasks to complete, even though the tasks have already failed on BE side.
Root cause: In
FileCacheBlockDownloader::download_file_cache_block(), when early return occurs (e.g., tablet not found, rowset not found, storage resource error), the_inflight_tabletscount is not decremented. This causes:check_download_task()always returnsdone=falsefor these tabletscheckInflightWarmUpCacheAsync()waits until timeout (default 300 seconds)Example log showing the issue:
After this warning, the tablet's inflight count remains in
_inflight_tabletsmap, causing the 5-minute wait before FE times out and proceeds.Solution
Extract the inflight count decrement logic into a reusable lambda
decrease_inflight_countCall
decrease_inflight_count()in all early return paths:get_tablet()failsrowset_idis not foundremote_storage_resource()failsRefactor
download_donecallback to reusedecrease_inflight_count, eliminating code duplicationUse value capture for
decrease_inflight_countindownload_donelambda to ensure lifetime safety if the callback is ever called asynchronously in the futureAdd unit tests to verify inflight count is correctly decremented on failures
Further comments
This bug also causes a minor memory leak: entries in
_inflight_tabletsmap are never cleaned up when warmup fails, slowly accumulating over time (cleared on BE restart).Checklist(Required)