-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29420: Hive ACID: Cleaner mishandles retries of killed compactions #6281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
427c785 to
dea31f2
Compare
e7d2384 to
2313e42
Compare
558170a to
5919458
Compare
5919458 to
8af6c57
Compare
58663a0 to
e2058e1
Compare
e2058e1 to
d328704
Compare
d328704 to
adeac37
Compare
ebcc557 to
b9d4ac1
Compare
kuczoram
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot Denys for this patch! Overall it looks good, the test cases also look good. The second testcase indeed can reproduce the issue with the two bases (if the assert is disabled).
I have one question about the SQL query modification.
| " AND \"cq1\".\"CQ_TABLE\" = \"hwm\".\"MH_TABLE\""; | ||
|
|
||
| whereClause += " AND (\"CQ_HIGHEST_WRITE_ID\" < \"MIN_OPEN_WRITE_ID\" OR \"MIN_OPEN_WRITE_ID\" IS NULL)"; | ||
| whereClause += " AND (\"CQ_HIGHEST_WRITE_ID\" < \"MIN_OPEN_WRITE_ID\"-1 OR \"MIN_OPEN_WRITE_ID\" IS NULL)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the -1 here? Could you please explain this a little?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
Scenario
- Multiple deltas exist with max writeId = 50
- SELECT query (TXN_ID=100) starts:
- Takes snapshot, sees writeId 50 as committed;
- Records minOpenWriteId = 51;
- Begins reading from deltas up to writeId 50;
- Compaction (TXN_ID=101) runs:
- Produces base_50 (includes all data up to writeId 50);
- Does not create new writeIds;
- Commits;
- Cleaner starts and evaluates cleanup eligibility
Without the -1 offset
Condition: CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID
- 50 < 51 = true → cleaner proceeds
- Problem: The SELECT query (TXN_ID=100) is still running and reading from deltas
- The reader cannot switch to base_50 mid-query; its snapshot was taken before compaction committed
- Cleaner throws an exception when evaluating
createValidTxnListForCleaner(newly added guard) with something like (txnid:100 is open and <= hwm: 50). - It may retry several times and potentially mark the cleanup attempt as failed.
With the -1 offset
Condition: CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID - 1
- MIN_OPEN_WRITE_ID - 1 => Highest write ID for a compacted file to be considered by reader;
- 50 < 51 - 1 = 50 < 50 = false → cleaner does not proceed
- This requires a gap of at least 1 writeId between compaction's HWM and the minimum open writeId
The -1 offset protects long-running readers that started before compaction committed by ensuring cleanup only proceeds when there's a writeId gap, indicating it's safe to delete the obsolete deltas.
PS: We could have applied the CQ_NEXT_TXN_ID < MIN_OPEN_TXN_ID condition; however, doing so would require synchronization/locking during transaction commit to obtain an accurate CQ_NEXT_TXN_ID.
b9d4ac1 to
f29a40b
Compare
|



What changes were proposed in this pull request?
TxnUtils#createValidTxnListForCleanerfindReadyToCleaneligibility condition inReadyToCleanHandler(i.e. CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID - 1)
addWriteIdsToMinHistoryfrom AcidCompactionServiceWhy are the changes needed?
Compaction retries triggered by timeouts under an incorrect configuration pose a risk of data loss. If a compaction is re-attempted before the prior attempt has completed or been properly aborted, the Cleaner may observe multiple base directories with the same writeId and erroneously delete all of them.
Does this PR introduce any user-facing change?
No
How was this patch tested?
TestCleanerWithMinHistoryWriteId