-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Integrate BulkDump/BulkLoad with backup/restore system #12608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
saintstack
commented
Dec 24, 2025
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
|
Failures are in AtomicBackupToDBCorrectness.toml and VersionStampSwitchover.toml. Don't seem related but will dig more. |
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
b47782e to
4a7a00f
Compare
|
Fix compile failure. |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
|
Joshua for this latest PR:
Failure is this (No changes to DD in the patch...)
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
johscheuer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on the design.
| BW -->|Save partitioned logs| ML | ||
| BDTF -->|Save SST snapshots| BD | ||
| style BDTF fill:#FFD700 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the rendered version it's a bit hard to read the text inside the box, perhaps a different fill colour or text colour would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - **`backup_agent`** = Long-running executable process that executes backup-related TaskBucket tasks only | ||
| - **`Backup Agents`** = Instances of `backup_agent` processes executing backup TaskBucket tasks | ||
| - **`fdbbackup`** = Command-line tool that submits backup jobs to TaskBucket (does not execute the backup itself) | ||
| - **`fdbrestore`** = Command-line tool that both submits and executes restore jobs (different from backup_agent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fdbrestore doesn't execute the actual restore right? I believe fdbrestore behaves the same as fdbbackup in a way that it only submits a restore job to the TaskBucket but you still need Backup Agents to do the restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Fixed.
| **Flow:** | ||
| 1. User runs `fdbbackup start` → Submits backup job to TaskBucket | ||
| 2. Running `backup_agent` processes pick up and execute the backup tasks | ||
| 3. User runs `fdbrestore start` → Submits and executes restore job (not via backup_agent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above, would be great to verify that fdbrestore actually executes the restore command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| RA3 -->|Transactions| CP3 | ||
| CP3 --> SS3 | ||
| style BLTF fill:#FFD700 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the rendered version it's a bit hard to read the text inside the box, perhaps a different fill colour or text colour would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No harm. Made all backgrounds lighter.
| [other standard options] | ||
| ``` | ||
|
|
||
| **Automatic Validation:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the available information in the storage server interface, but wouldn't this validation also require to check that the metadata for all shards is in the format? Otherwise this could lead to cases where the knobs are set but the metadata is still in the old format, e.g. because someone forgot the wiggle step or the wiggle step has not completed yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an 'Important' and will need to add code to ensure metadata.
|
|
||
| ### Development Phases and Testing Strategy | ||
|
|
||
| The implementation is divided into six phases with specific testing criteria: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see 4 phases not 6 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahem... Fixed.
fdbbackup/backup.actor.cpp
Outdated
| return SnapshotMode::BULKDUMP; | ||
| if (mode == "both") | ||
| return SnapshotMode::BOTH; | ||
| return SnapshotMode::RANGEFILE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather prefer to exit the command and print something like unknown snapshot mode ... to prevent debugging issues because of a typo, e.g. someone typed bukldump and the command happily moves forward and the user assumes everything is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That would be better... Fixed 'ERROR: Unknown snapshot mode 'bukldump'. Valid modes are: rangefile, bulkdump, both'
fdbbackup/backup.actor.cpp
Outdated
| return RestoreMode::RANGEFILE; | ||
| if (mode == "bulkload") | ||
| return RestoreMode::BULKLOAD; | ||
| return RestoreMode::RANGEFILE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'ERROR: Unknown restore mode ''. Valid modes are: rangefile, bulkload'
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
This commit adds the ability to use BulkDump for creating backup snapshots and BulkLoad for restoring them, providing faster backup/restore operations for large databases. Key changes: - Add BulkDumpTaskFunc to create SST file snapshots during backup - Add BulkLoadRestoreTaskFunc to restore from BulkDump snapshots - Store bulkDumpJobId in snapshot metadata for restore coordination - Add snapshotMode parameter (0=RANGEFILE, 1=BULKDUMP) to control backup type - Add useRangeFileRestore parameter to control restore method - Add CLIENT_KNOBS for configurable job timeouts - Add test assertions to verify BulkDump/BulkLoad execution - Check for existing running jobs to avoid conflicts when multiple agents run - Properly scope state variables for error handling in Flow actors New test: tests/slow/BackupS3BlobBulkLoadRestore.toml
Co-authored-by: Johannes Scheuermann <johscheuer@users.noreply.github.com>
e5649c5 to
f9afcd7
Compare
|
Rebase |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
