HDDS-14652. Handle NoSuchFileException during the bootstrap tarball transfer. #9784
aswinshakil merged 12 commits into apache:master
@SaketaChalamchala Could you please take a look?
@sadanand48 I'm taking another pass at the review, and I think we should not handle NoSuchFileException for snapshots and the OM DB. If an expected file is missing from a snapshot or the OM checkpoint, an error should be thrown and the bootstrap retried, right?
We should handle the exception here only for the SST backup directory, because we know that non-L0 SST files can be pruned from that directory.
Let me know what you think.
Yeah, I was thinking about it too. This should get fixed after HDDS-14651. However, since that is not yet implemented, we can selectively catch it only for this particular case, i.e. sstBackupDir. Makes sense.
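For illustration, the selective handling discussed in this thread could look roughly like the sketch below. It is not the code from this PR: the class `SelectiveCopySketch`, the method `copyAll`, and the `tolerantDir` parameter (standing in for sstBackupDir) are hypothetical names, and a plain `Files.copy` stands in for the tarball transfer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class SelectiveCopySketch {

  /**
   * Copies each source file into destDir and returns the files that were skipped.
   * A missing file is skipped only if it lives under tolerantDir (hypothetical
   * stand-in for the SST backup directory). A missing file anywhere else
   * rethrows NoSuchFileException so the caller can fail and retry.
   */
  static List<Path> copyAll(List<Path> sources, Path destDir, Path tolerantDir)
      throws IOException {
    Path tolerantRoot = tolerantDir.toAbsolutePath().normalize();
    List<Path> skipped = new ArrayList<>();
    for (Path src : sources) {
      try {
        Files.copy(src, destDir.resolve(src.getFileName()),
            StandardCopyOption.REPLACE_EXISTING);
      } catch (NoSuchFileException e) {
        if (src.toAbsolutePath().normalize().startsWith(tolerantRoot)) {
          // Assumed to have been pruned between listing and copying: tolerate it.
          skipped.add(src);
        } else {
          // Snapshot or DB files are expected to exist: propagate the error.
          throw e;
        }
      }
    }
    return skipped;
  }
}
```

Returning the skipped files rather than silently dropping them lets the caller log what was tolerated.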
SaketaChalamchala
left a comment
LGTM. Thanks @sadanand48
aswinshakil
left a comment
I'm good with the changes. I'm wondering whether this can be simplified by checking each file's existence in extractSSTFilesFromCompactionLog.
Since we hold the bootstrap write lock, SST pruning shouldn't happen after that point, which gives us a consistent state of the SST backup dir throughout the bootstrap.
Not really needed for this PR, just a suggestion.
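As a rough sketch of that suggestion: the signature of extractSSTFilesFromCompactionLog is not shown in this conversation, so the helper below (`existingSstFiles` in a hypothetical `ExistenceFilterSketch` class) only illustrates filtering the names read from the compaction log down to files that still exist on disk. It assumes pruning is blocked while it runs, as described above. Without that guarantee the check would still race with the pruner.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class ExistenceFilterSketch {

  /**
   * Resolves each file name taken from the compaction log against the SST
   * backup directory and keeps only the entries that are still regular files.
   * Hypothetical helper; assumes no pruning happens while the list is in use.
   */
  static List<Path> existingSstFiles(Path sstBackupDir, List<String> namesFromLog) {
    return namesFromLog.stream()
        .map(sstBackupDir::resolve)
        .filter(Files::isRegularFile)
        .collect(Collectors.toList());
  }
}
```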
What changes were proposed in this pull request?
While the contents of the SST backup dir are being transferred, the DAG pruner can delete non-leaf nodes (SST files that are present on disk) without removing them from the compaction log.
We iterate over the compaction log to build the list of backup SST files to transfer, so we can end up trying to transfer a file that the pruner has already cleaned up, which throws NoSuchFileException. The bootstrap then fails and is retriggered, only to get stuck in the same loop.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14652
How was this patch tested?
will add