
@deardeng
Contributor

@deardeng deardeng commented Feb 3, 2026

Proposed changes

Problem

During cloud tablet decommission, some tablets take an unexpectedly long time (5+ minutes) to migrate because FE keeps waiting for warmup tasks to complete, even though the tasks have already failed on the BE side.

Root cause: In FileCacheBlockDownloader::download_file_cache_block(), when an early return occurs (e.g., tablet not found, rowset not found, storage resource error), the _inflight_tablets count is not decremented. This causes:

  1. check_download_task() always returns done=false for these tablets
  2. FE's checkInflightWarmUpCacheAsync() waits until timeout (default 300 seconds)
  3. Tablet migration is blocked unnecessarily
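
To make the cost of a task that never reports done concrete, here is a minimal sketch of a poll-until-timeout loop. It is an illustration only: `wait_for_warmup` and its parameters are hypothetical names, simulated time replaces real sleeping, and the actual FE logic in checkInflightWarmUpCacheAsync() may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical polling loop (not the FE implementation).
// Time is simulated so the sketch runs instantly; poll_interval must be > 0.
// Returns how long the caller waited before it could proceed.
std::chrono::seconds wait_for_warmup(const std::function<bool()>& is_done,
                                     std::chrono::seconds timeout,
                                     std::chrono::seconds poll_interval) {
    std::chrono::seconds waited{0};
    while (waited < timeout) {
        if (is_done()) {
            return waited;  // task finished, proceed immediately
        }
        waited += poll_interval;  // a real loop would sleep here
    }
    return timeout;  // never done: the caller pays the full timeout
}
```

With a 300-second timeout, a check that always answers "not done" makes the caller wait the full 300 seconds, while a check that answers "done" lets it proceed at once.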

Example log showing the issue:

W download_file_cache_block: tablet_id=1769675033824 rowset_id not found, rowset_id=020000000010fa85...

After this warning, the tablet's inflight count remains in the _inflight_tablets map, causing the 5-minute wait before FE times out and proceeds.
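
The shape of the leak can be shown in a small standalone sketch. `InflightTracker`, `submit`, `is_done`, and `download_block_buggy` are hypothetical stand-ins, not the Doris classes. The sketch only assumes what the description states: a per-tablet count is incremented when a task is submitted, and an early return skips the decrement.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the downloader's inflight bookkeeping.
struct InflightTracker {
    std::mutex mtx;
    std::unordered_map<int64_t, int> inflight_tablets;

    // Called when a warmup task for the tablet is queued.
    void submit(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        ++inflight_tablets[tablet_id];
    }

    // Mirrors the idea of check_download_task(): done only if no entry remains.
    bool is_done(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        return inflight_tablets.find(tablet_id) == inflight_tablets.end();
    }
};

// Buggy shape: the failure path returns before the count is decremented.
void download_block_buggy(InflightTracker& tracker, int64_t tablet_id, bool rowset_found) {
    if (!rowset_found) {
        std::cerr << "rowset_id not found, tablet_id=" << tablet_id << '\n';
        return;  // the inflight entry is never cleaned up
    }
    // Success path: decrement and erase the entry once it reaches zero.
    std::lock_guard<std::mutex> lock(tracker.mtx);
    auto it = tracker.inflight_tablets.find(tablet_id);
    if (it != tracker.inflight_tablets.end() && --it->second <= 0) {
        tracker.inflight_tablets.erase(it);
    }
}
```

After a failed call, `is_done()` keeps returning false for that tablet, which is the state that makes the caller wait until its timeout.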

Solution

  1. Extract the inflight count decrement logic into a reusable lambda decrease_inflight_count

  2. Call decrease_inflight_count() in all early return paths:

    • When get_tablet() fails
    • When rowset_id is not found
    • When remote_storage_resource() fails
  3. Refactor download_done callback to reuse decrease_inflight_count, eliminating code duplication

  4. Use value capture for decrease_inflight_count in download_done lambda to ensure lifetime safety if the callback is ever called asynchronously in the future

  5. Add unit tests to verify inflight count is correctly decremented on failures
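
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged patch: `InflightTracker`, `Lookup`, and `download_block_fixed` are hypothetical names, and only `decrease_inflight_count` and `download_done` are taken from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the downloader's inflight bookkeeping.
struct InflightTracker {
    std::mutex mtx;
    std::unordered_map<int64_t, int> inflight_tablets;

    void submit(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        ++inflight_tablets[tablet_id];
    }

    bool is_done(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        return inflight_tablets.find(tablet_id) == inflight_tablets.end();
    }
};

// Hypothetical outcome of the lookups that can fail before a download starts.
enum class Lookup { kOk, kTabletNotFound, kRowsetNotFound, kStorageError };

// Returns the completion callback when the download was started,
// or an empty std::function when an early return happened.
std::function<void()> download_block_fixed(InflightTracker& tracker, int64_t tablet_id,
                                           Lookup lookup) {
    // Single place that owns the decrement-and-erase logic.
    // Assumption: the tracker outlives every copy of this lambda.
    auto decrease_inflight_count = [&tracker, tablet_id]() {
        std::lock_guard<std::mutex> lock(tracker.mtx);
        auto it = tracker.inflight_tablets.find(tablet_id);
        if (it == tracker.inflight_tablets.end()) {
            return;
        }
        if (--it->second <= 0) {
            tracker.inflight_tablets.erase(it);
        }
    };

    // Every early return path now releases the inflight count.
    if (lookup != Lookup::kOk) {
        decrease_inflight_count();
        return {};
    }

    // Captured by value: the callback holds its own copy of the lambda,
    // so it stays valid even if it runs after this function has returned.
    auto download_done = [decrease_inflight_count]() { decrease_inflight_count(); };
    return download_done;
}
```

In this sketch a failed lookup leaves no entry behind, and on the success path the entry disappears only once the returned callback is invoked.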

Further comments

This bug also causes a minor memory leak: entries in the _inflight_tablets map are never cleaned up when warmup fails, so they slowly accumulate over time (and are cleared only on BE restart).

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
  2. Have unit tests been added:
    • Yes
    • No
  3. Has documentation been added or modified:
    • Yes
    • No
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any sharding changes:
    • Yes
    • No

@Thearas
Contributor

Thearas commented Feb 3, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

deardeng commented Feb 3, 2026

run buildall

@deardeng deardeng force-pushed the fix-warmup-inflight-count-leak branch from 0491cea to 0c66eac Compare February 3, 2026 13:08
During cloud tablet decommission, some tablets take an unexpectedly long time
(5+ minutes) to migrate because FE keeps waiting for warmup tasks to complete,
even though the tasks have already failed.

Root cause: In `FileCacheBlockDownloader::download_file_cache_block()`, when
an early return occurs (e.g., tablet not found, rowset not found, storage
resource error), the `_inflight_tablets` count is not decremented. This causes:

1. `check_download_task()` always returns `done=false` for these tablets
2. FE's `checkInflightWarmUpCacheAsync()` waits until timeout (default 300s)
3. Tablet migration is blocked unnecessarily

Example log showing the issue:
```
W download_file_cache_block: tablet_id=xxx rowset_id not found, rowset_id=xxx
```
After this warning, the tablet's inflight count remains, causing the 5-minute wait.

1. Extract the inflight count decrement logic into a reusable lambda
   `decrease_inflight_count`

2. Call `decrease_inflight_count()` in all early return paths:
   - When `get_tablet()` fails
   - When `rowset_id` is not found
   - When `remote_storage_resource()` fails

3. Refactor `download_done` callback to reuse `decrease_inflight_count`,
   eliminating code duplication

4. Use value capture for `decrease_inflight_count` in `download_done` lambda
   to ensure lifetime safety if the callback is ever called asynchronously

5. Add unit tests to verify inflight count is correctly decremented on failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@deardeng deardeng force-pushed the fix-warmup-inflight-count-leak branch from 0c66eac to 7df33de Compare February 3, 2026 13:13
@deardeng
Contributor Author

deardeng commented Feb 3, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 32304 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7df33def749fb21237e395f9d5e8161ea4d97cfb, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17620	5236	5079	5079
q2	2029	315	201	201
q3	10219	1335	757	757
q4	10197	845	318	318
q5	7508	2124	1933	1933
q6	202	179	149	149
q7	891	754	619	619
q8	9267	1432	1074	1074
q9	5180	4839	4938	4839
q10	6777	1959	1570	1570
q11	515	295	267	267
q12	342	380	235	235
q13	17762	4059	3266	3266
q14	236	245	222	222
q15	906	837	817	817
q16	682	720	630	630
q17	639	762	548	548
q18	7047	6547	7629	6547
q19	1245	1037	667	667
q20	438	370	238	238
q21	3043	2227	2040	2040
q22	383	330	288	288
Total cold run time: 103128 ms
Total hot run time: 32304 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5632	5532	5625	5532
q2	271	341	271	271
q3	2316	2845	2387	2387
q4	1484	2029	1394	1394
q5	4655	4523	4694	4523
q6	219	176	143	143
q7	2029	1921	1799	1799
q8	2536	2499	2527	2499
q9	7597	7458	7553	7458
q10	2808	3031	2547	2547
q11	549	467	450	450
q12	645	709	603	603
q13	3931	4424	3385	3385
q14	278	293	269	269
q15	840	789	790	789
q16	634	680	639	639
q17	1076	1241	1324	1241
q18	7447	7383	7278	7278
q19	865	843	817	817
q20	1997	2060	1932	1932
q21	4532	4265	4143	4143
q22	573	538	512	512
Total cold run time: 52914 ms
Total hot run time: 50611 ms

@doris-robot

ClickBench: Total hot run time: 28.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7df33def749fb21237e395f9d5e8161ea4d97cfb, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.05	0.05
query2	0.08	0.04	0.04
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.28	0.25	0.24
query6	1.16	0.68	0.67
query7	0.03	0.02	0.02
query8	0.05	0.04	0.04
query9	0.56	0.51	0.49
query10	0.54	0.55	0.56
query11	0.14	0.10	0.09
query12	0.14	0.11	0.12
query13	0.62	0.60	0.62
query14	1.06	1.05	1.05
query15	0.87	0.86	0.87
query16	0.40	0.39	0.40
query17	1.12	1.15	1.15
query18	0.22	0.21	0.21
query19	2.09	1.98	2.04
query20	0.01	0.01	0.02
query21	15.38	0.25	0.15
query22	5.34	0.05	0.05
query23	15.99	0.29	0.10
query24	1.87	0.31	0.18
query25	0.09	0.08	0.05
query26	0.14	0.14	0.13
query27	0.08	0.05	0.08
query28	3.31	1.15	0.98
query29	12.55	3.91	3.14
query30	0.27	0.14	0.12
query31	2.83	0.66	0.41
query32	3.24	0.60	0.50
query33	3.30	3.25	3.29
query34	16.45	5.41	4.75
query35	4.79	4.81	4.87
query36	0.64	0.50	0.49
query37	0.10	0.07	0.07
query38	0.06	0.04	0.03
query39	0.05	0.03	0.03
query40	0.19	0.16	0.16
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 98.12 s
Total hot run time: 28.29 s

gavinchou
gavinchou previously approved these changes Feb 3, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 3, 2026
@github-actions
Contributor

github-actions bot commented Feb 3, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

PR approved by anyone and no changes requested.

dataroaring
dataroaring previously approved these changes Feb 3, 2026
Contributor

@dataroaring dataroaring left a comment

LGTM

@deardeng
Contributor Author

deardeng commented Feb 5, 2026

run beut

@deardeng
Contributor Author

deardeng commented Feb 5, 2026

run external

@deardeng
Contributor Author

deardeng commented Feb 5, 2026

run nonConcurrent

@deardeng
Contributor Author

deardeng commented Feb 5, 2026

run p0

@deardeng deardeng dismissed stale reviews from dataroaring and gavinchou via 31e9277 February 6, 2026 09:25
@deardeng
Contributor Author

deardeng commented Feb 6, 2026

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Feb 6, 2026
@doris-robot

TPC-H: Total hot run time: 30433 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 31e92774e30813a10829ad64a8b3461c95b46faf, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17608	4545	4292	4292
q2	2042	361	240	240
q3	10134	1298	735	735
q4	10200	784	324	324
q5	7536	2177	1945	1945
q6	194	188	152	152
q7	901	733	622	622
q8	9287	1446	1096	1096
q9	4724	4635	4585	4585
q10	6764	1928	1546	1546
q11	520	314	311	311
q12	345	371	223	223
q13	17783	4006	3242	3242
q14	241	242	218	218
q15	911	824	804	804
q16	674	684	630	630
q17	695	786	575	575
q18	6428	5893	5787	5787
q19	1241	986	633	633
q20	507	495	390	390
q21	2610	1884	1798	1798
q22	372	325	285	285
Total cold run time: 101717 ms
Total hot run time: 30433 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4315	4336	4327	4327
q2	260	346	252	252
q3	2094	2680	2225	2225
q4	1341	1741	1281	1281
q5	4337	4134	4301	4134
q6	218	178	141	141
q7	1821	1808	1709	1709
q8	2481	2698	2487	2487
q9	7784	7442	7522	7442
q10	2838	3168	2654	2654
q11	546	502	455	455
q12	692	733	610	610
q13	3918	4491	3672	3672
q14	309	322	279	279
q15	878	830	797	797
q16	678	725	673	673
q17	1144	1307	1362	1307
q18	8408	8093	7974	7974
q19	876	830	828	828
q20	2046	2166	2056	2056
q21	4703	4626	4333	4333
q22	629	529	524	524
Total cold run time: 52316 ms
Total hot run time: 50160 ms

@doris-robot

ClickBench: Total hot run time: 28.27 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 31e92774e30813a10829ad64a8b3461c95b46faf, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.05	0.05
query2	0.10	0.06	0.05
query3	0.26	0.08	0.08
query4	1.61	0.12	0.11
query5	0.27	0.25	0.24
query6	1.18	0.67	0.66
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.58	0.51	0.49
query10	0.56	0.55	0.56
query11	0.14	0.10	0.11
query12	0.14	0.11	0.11
query13	0.64	0.62	0.61
query14	1.08	1.05	1.08
query15	0.89	0.87	0.88
query16	0.40	0.40	0.39
query17	1.10	1.07	1.20
query18	0.23	0.24	0.22
query19	2.02	2.02	1.97
query20	0.01	0.01	0.02
query21	15.40	0.27	0.15
query22	5.18	0.06	0.05
query23	16.02	0.30	0.11
query24	1.52	0.35	0.41
query25	0.09	0.10	0.06
query26	0.14	0.14	0.14
query27	0.06	0.05	0.08
query28	4.12	1.16	0.96
query29	12.54	3.91	3.16
query30	0.27	0.15	0.12
query31	2.83	0.64	0.41
query32	3.24	0.57	0.50
query33	3.30	3.34	3.16
query34	16.20	5.44	4.72
query35	4.84	4.75	4.86
query36	0.65	0.50	0.49
query37	0.11	0.07	0.07
query38	0.08	0.04	0.04
query39	0.05	0.03	0.03
query40	0.20	0.16	0.16
query41	0.08	0.03	0.04
query42	0.05	0.03	0.04
query43	0.04	0.04	0.03
Total cold run time: 98.35 s
Total hot run time: 28.27 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 65.22% (15/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.71% (19446/36895)
Line Coverage 36.20% (181006/499986)
Region Coverage 32.51% (140220/431260)
Branch Coverage 33.53% (60712/181093)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 65.22% (15/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.36% (26524/36156)
Line Coverage 56.46% (281596/498762)
Region Coverage 54.16% (235923/435643)
Branch Coverage 55.82% (101475/181801)

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 11, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@hello-stephen hello-stephen merged commit 7990571 into apache:master Feb 11, 2026
29 of 30 checks passed

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.0.x-conflict reviewed

6 participants