-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix DELETE/U{DATE filter extraction when predicates are pushed down into TableScan #19884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Enhance DML filter extraction to incorporate table scan filters, split conjunctions, and ensure expressions are de-duplicated after stripping qualifiers. Implement exact filter pushdown handling in the delete test provider, and add a new DELETE plan test to validate that table scan filters propagate into delete_from.
| filters.extend(split_conjunction(filter).into_iter().cloned()); | ||
| } | ||
| } | ||
| _ => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry slightly that as LogicalPlan and perhaps the optimiser behavior change in future, this default case may lead to similar issues. If the outcome was "nothing gets deleted" rather than "everything gets deleted" I'd worry less.
On the other hand, I can't think of how to improve it, and this change overall fixes the current issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
The DML input we build for DELETE/UPDATE currently only carries predicates in Filter nodes or in TableScan.filters after pushdown, so the match over those two variants is exhaustive for today's plan shapes.
We also have regression tests for pushdown paths--test_delete_filter_pushdown_extracts_table_scan_filters and `test_delete_compound_filters_with_pushdown which should catch any future optimizer change that relocates predicates away from these nodes.
If we ever add a new logical node that can hold filters, we'll extend extract_dml_filters instead of relying on the default case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to handle each unhandled variant explicity instead of via a default case? This is how several of the methods in LogicalPlan work currently, presumably to deliberately cause compilation failure when new variants are added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
datafusion/datafusion/expr/src/logical_plan/plan.rs
Lines 480 to 484 in d90d074
| // plans without inputs | |
| LogicalPlan::TableScan { .. } | |
| | LogicalPlan::EmptyRelation { .. } | |
| | LogicalPlan::Values { .. } | |
| | LogicalPlan::DescribeTable(_) => vec![], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - b3240bf
|
Would it help to have a That's something I've wanted to various reasons. My view is that if we define what a scan is as a table + a set of filter expressions + a set of projection expressions universally across the codebase a lot of use cases will be unlocked and things will be simplified. It gives us a framework under which to define cost based optimizers, etc. That becomes a digression from the issue at hand though. |
ethan-tyler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this @mjgarton and @kosiew for the fix! I'm the original author of this DML hooks and a little embarrassed this slipped through in my original PR. The edge case where pushdown moves predicates entirely into TableScan.filters (leaving the Filter node empty or absent) wasn't something I tested properly. Grateful for the thorough find and fix.
A few thoughts on hardening:
- I would add UPDATE test coverage. extract_dml_filters handles both DELETE and UPDATE, but only DELETE is tested.
- A mixed-location test would have caught my original bug. Some conjuncts in residual Filter, others in TableScan.filters.
- Scope extraction to target scan only. Currently pulls from any scan in subtree and unsafe for UPDATE … FROM.
- Validate before stripping qualifiers. t.id = u.id becoming id = id is dangerous. Worth a clear planning error.
Nits:
- Dedup after stripping is fine for single-table DML; scoping/validation above covers multi-scan cases.
- is_identity_assignment has the same qualifier-stripping hazard.
Happy to help with any of these, least I can do given I introduced the bug.
| // Split AND predicates into individual expressions | ||
| filters.extend(split_conjunction(&filter.predicate).into_iter().cloned()); | ||
| } | ||
| LogicalPlan::TableScan(TableScan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This collects TableScan.filters from any scan in the subtree. Works for single-table DELETE/UPDATE, but unsafe for UPDATE … FROM (extra scans). Should we restrict extraction to the DML target scan (match table_name or provider identity) and fail-closed if multiple candidate scans exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - b66edf4
| async fn test_delete_filter_pushdown_extracts_table_scan_filters() -> Result<()> { | ||
| let provider = Arc::new(CaptureDeleteProvider::new_with_filter_pushdown( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Nice DELETE+Exact pushdown regression. I would add the same coverage for UPDATE … WHERE … + TableProviderFilterPushDown::Exact (since extract_dml_filters is used by both).
-
I would think about adding a "mixed location" case (some conjuncts in residual Filter, others in TableScan.filters) to lock in union+dedup behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // and TableScan.filters when the optimizer pushes some predicates down. | ||
| // We deduplicate by (unqualified) expression to avoid passing the same filter twice. | ||
| let mut seen_filters = HashSet::new(); | ||
| let deduped = filters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dedup is done after stripping qualifiers; that can collapse distinct qualified predicates in multi-scan plans. I would either enforce single-target-scan eligibility or adjust dedup so it can't drop distinct predicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let mut seen_filters = HashSet::new(); | ||
| let deduped = filters | ||
| .into_iter() | ||
| .map(strip_column_qualifiers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Qualifier stripping is dangerous if any predicate references non-target relations (t.id = u.id → id = id). I think we should validate all Column refs belong to the DML target before stripping, and error out (fail closed) otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed via qualifier-validation in cfc945d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - thanks for adding target-scan scoping and the mixed residual/pushdown coverage.
One remaining concern: extract_dml_filters currently drops non-target conjuncts (predicate_is_on_target check in datafusion/core/src/physical_planner.rs:1941), and test_update_from_drops_non_target_predicates (datafusion/core/tests/custom_sources_cases/dml_planning.rs:658) codifies this as expected behavior. That can silently broaden row selection and worst case, if the join predicate is the only constraint, this becomes "update all rows".
Would it make sense to fail closed instead (planning error if any WHERE conjunct references a non-target relation), and update that test to assert the error?
| ]))))) | ||
| } | ||
|
|
||
| fn supports_filters_pushdown( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now pushdown mode is uniform for all filters. To test mixed-location behavior (Inexact vs Unsupported per conjunct), I would consider making this provider return per-filter decisions so we can force residual and pushed-down predicates in one query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added per-filter pushdown - bc5bb4c
…ith explicit variant handling
…ider and corresponding tests
Update DML filter extraction to utilize TableReference matching, omit non-target predicates, and remove qualifiers post-filtering and deduplication. Pass TableReference targets into DML filter extraction for DELETE and UPDATE planning paths.
Add helper to detect table-qualified column references in captured filters for supporting UPDATE ... FROM assertions. Implement tests to verify that TableScan filters are properly extracted into DML update filters and ensure non-target predicates are dropped while retaining target filters during UPDATE ... FROM operations.
…ith single-pass `try_fold`
…tion and improved handling for UPDATE...FROM queries
…d CaptureUpdateProvider
|
Thanks @kosiew - I'll do another pass and look at the new TableScan.filters as well. It's coming together nicely 👌 |
ethan-tyler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
|
|
||
| /// Determine whether a predicate references only columns from the target table. | ||
| fn predicate_is_on_target(expr: &Expr, target: &TableReference) -> Result<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two concerns here:
- alias-qualified columns (
x.idfromUPDATE t AS x) may not matchtargetif it's the base ref -- seeupdate.slt:91for the plan shape - Dropping join predicates silently can broaden scope. Consider failing closed if any conjunct is rejected.
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_update_from_drops_non_target_predicates() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t2 shares column names with t1, the "no t2 references" check can false-negative after qualifier stripping. I would use a t2-only column name to detect leakage reliably.
Which issue does this PR close?
TableProvider::delete_fromproblem with pushed down filters #19840.Rationale for this change
When a
TableProvidersupports filter pushdown (for exampleTableProviderFilterPushDown::Exact), the DataFusion optimizer inlines predicates directly into theTableScannode instead of keeping them in a separateFilternode.The existing DML planning logic only extracted predicates from explicit
Filternodes. As a result, DELETE and UPDATE operations executed against providers with filter pushdown support received no filters indelete_from/update, causing all rows to be modified instead of only the intended subset.This change ensures that DML filter extraction correctly accounts for optimizer-pushed predicates and preserves expected semantics for DELETE and UPDATE statements.
What changes are included in this PR?
Enhanced DML filter extraction
Updated
extract_dml_filtersto collect predicates from both:LogicalPlan::FilternodesLogicalPlan::TableScan.filters(for pushed-down predicates)Split conjunctions (
AND) into individual expressions consistently across both sources.Strip column qualifiers so expressions match the
TableProviderschema.Deduplicate filters to avoid passing the same predicate twice when it appears in both a
Filternode and aTableScan.Improved documentation and comments
Expanded test coverage
TableProvider.TableScanwhen filter pushdown is enabled.Are these changes tested?
Yes. This PR includes new tests that:
LogicalPlan::TableScanwhen pushdown is enabled.delete_fromreceives the correct number and content of filter expressions.AND) without over-deduplication.Existing DELETE and UPDATE tests continue to pass, ensuring backward compatibility when filter pushdown is not used.
Are there any user-facing changes?
Yes — this is a correctness fix.
Users implementing custom
TableProviders with filter pushdown support will now receive the expected filter predicates indelete_fromandupdate. This restores correct behavior for DELETE and UPDATE statements with WHERE clauses and prevents accidental full-table modifications.No API changes are introduced.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.