
@deniskuzZ (Member) commented on Jan 27, 2026

What changes were proposed in this pull request?

  1. Replaced an assert with an explicit exception in TxnUtils#createValidTxnListForCleaner (see the sketch after this list)
  2. Fixed the findReadyToClean eligibility condition in ReadyToCleanHandler
    (now CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID - 1)
  3. Removed a redundant addWriteIdsToMinHistory call from AcidCompactionService
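
For change (1), a minimal sketch of the new guard, assuming a simplified shape for the check; the actual signature and surrounding logic in TxnUtils#createValidTxnListForCleaner differ:

import java.util.Set;

final class CleanerGuardSketch {
  // Previously an assert, which is a no-op unless the JVM runs with -ea;
  // throwing makes the invariant violation visible in production.
  static void checkNoOpenTxnAtOrBelowHwm(long hwm, Set<Long> openTxnIds) {
    for (long txnId : openTxnIds) {
      if (txnId <= hwm) {
        throw new IllegalStateException(
            "txnid:" + txnId + " is open and <= hwm: " + hwm);
      }
    }
  }
}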

Why are the changes needed?

Compaction retries triggered by timeouts under an incorrect configuration pose a risk of data loss. If a compaction is re-attempted before the prior attempt has completed or been properly aborted, the Cleaner may observe multiple base directories with the same writeId and erroneously delete all of them.
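
For illustration, a hypothetical directory layout after a timed-out attempt is retried (assuming Hive's base_<writeId>_v<visibilityTxnId> naming; the txn ids are made up):

base_50_v0000101   <- first compaction attempt (timed out, never properly aborted)
base_50_v0000102   <- retried attempt

Both bases cover writeId 50; treating the duplicate as obsolete can lead the Cleaner to remove both directories, including the only surviving copy of the compacted data.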

Does this PR introduce any user-facing change?

No

How was this patch tested?

TestCleanerWithMinHistoryWriteId

@deniskuzZ changed the title from "[DRAFT] ACID: Fix Cleaner processing of retried compaction attempts previously killed" to "HIVE-29420: Hive ACID: Cleaner mishandles retries of killed compactions" on Jan 28, 2026
@deniskuzZ force-pushed the cleaner-fix branch 2 times, most recently from 58663a0 to e2058e1, on January 28, 2026 at 22:14
@kuczoram (Contributor) left a comment


Thanks a lot, Denys, for this patch! Overall it looks good, and the test cases look good as well. The second test case can indeed reproduce the issue with the two bases (if the assert is disabled).
I have one question about the SQL query modification:

  " AND \"cq1\".\"CQ_TABLE\" = \"hwm\".\"MH_TABLE\"";

- whereClause += " AND (\"CQ_HIGHEST_WRITE_ID\" < \"MIN_OPEN_WRITE_ID\" OR \"MIN_OPEN_WRITE_ID\" IS NULL)";
+ whereClause += " AND (\"CQ_HIGHEST_WRITE_ID\" < \"MIN_OPEN_WRITE_ID\"-1 OR \"MIN_OPEN_WRITE_ID\" IS NULL)";

Why do you need the -1 here? Could you please explain this a little?

@deniskuzZ (Member, Author) replied on Jan 29, 2026

Sure.

Scenario

  1. Multiple deltas exist with max writeId = 50
  2. SELECT query (TXN_ID=100) starts:
    • Takes snapshot, sees writeId 50 as committed;
    • Records minOpenWriteId = 51;
    • Begins reading from deltas up to writeId 50;
  3. Compaction (TXN_ID=101) runs:
    • Produces base_50 (includes all data up to writeId 50);
    • Does not create new writeIds;
    • Commits;
  4. Cleaner starts and evaluates cleanup eligibility

Without the -1 offset
Condition: CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID

  • 50 < 51 → true → cleaner proceeds
  • Problem: the SELECT query (TXN_ID=100) is still running and reading from the deltas
  • The reader cannot switch to base_50 mid-query; its snapshot was taken before the compaction committed
  • The Cleaner throws an exception from the newly added guard in createValidTxnListForCleaner, with a message like "txnid:100 is open and <= hwm: 50"
  • It may retry several times and potentially mark the cleanup attempt as failed

With the -1 offset
Condition: CQ_HIGHEST_WRITE_ID < MIN_OPEN_WRITE_ID - 1

  • MIN_OPEN_WRITE_ID - 1 is the highest writeId a compacted file can have and still be considered by the reader;
  • 50 < 51 - 1, i.e. 50 < 50 → false → cleaner does not proceed
  • This requires a gap of at least 1 writeId between the compaction's HWM and the minimum open writeId

The -1 offset protects long-running readers that started before the compaction committed by ensuring cleanup only proceeds when there is a writeId gap, indicating it is safe to delete the obsolete deltas.

PS: We could have applied the CQ_NEXT_TXN_ID < MIN_OPEN_TXN_ID condition; however, doing so would require synchronization/locking during transaction commit to obtain an accurate CQ_NEXT_TXN_ID.
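
To make the predicate concrete, here is a minimal Java sketch of the eligibility check, assuming plain longs and a nullable minOpenWriteId; the real check runs as SQL inside ReadyToCleanHandler#findReadyToClean, so the class and method names below are made up:

final class CleanEligibilitySketch {
  // Hypothetical standalone rendering of the fixed predicate.
  static boolean readyToClean(long cqHighestWriteId, Long minOpenWriteId) {
    // A NULL MIN_OPEN_WRITE_ID means there is no open reader snapshot to protect.
    if (minOpenWriteId == null) {
      return true;
    }
    // Require a gap of at least one writeId between the compaction's HWM and
    // the minimum open writeId, so a reader whose snapshot predates the
    // compaction commit still finds the deltas it depends on.
    return cqHighestWriteId < minOpenWriteId - 1;
  }
}

With the scenario's values, readyToClean(50L, 51L) evaluates 50 < 50 and returns false, matching the walkthrough above.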
