
Conversation

@kosiew
Contributor

@kosiew kosiew commented Jan 19, 2026

Which issue does this PR close?


Rationale for this change

When a TableProvider supports filter pushdown (for example TableProviderFilterPushDown::Exact), the DataFusion optimizer inlines predicates directly into the TableScan node instead of keeping them in a separate Filter node.

The existing DML planning logic only extracted predicates from explicit Filter nodes. As a result, DELETE and UPDATE operations executed against providers with filter pushdown support received no filters in delete_from / update, causing all rows to be modified instead of only the intended subset.

This change ensures that DML filter extraction correctly accounts for optimizer-pushed predicates and preserves expected semantics for DELETE and UPDATE statements.


What changes are included in this PR?

  • Enhanced DML filter extraction

    • Updated extract_dml_filters to collect predicates from both:

      • LogicalPlan::Filter nodes
      • LogicalPlan::TableScan.filters (for pushed-down predicates)
    • Split conjunctions (AND) into individual expressions consistently across both sources.

    • Strip column qualifiers so expressions match the TableProvider schema.

    • Deduplicate filters to avoid passing the same predicate twice when it appears in both a Filter node and a TableScan.

  • Improved documentation and comments

    • Updated function-level documentation to explain TableScan filter handling and deduplication rationale.
    • Added inline comments clarifying why deduplication is required.
  • Expanded test coverage

    • Added test infrastructure to support configurable filter pushdown behavior in a custom TableProvider.
    • Added a regression test verifying that DELETE operations correctly extract filters from TableScan when filter pushdown is enabled.
    • Added a compound-filter test to ensure deduplication does not suppress distinct predicates.

Are these changes tested?

Yes. This PR includes new tests that:

  • Verify optimizer behavior by asserting filters are present in LogicalPlan::TableScan when pushdown is enabled.
  • Confirm that delete_from receives the correct number and content of filter expressions.
  • Validate correct handling of compound predicates (AND) without over-deduplication.

Existing DELETE and UPDATE tests continue to pass, ensuring backward compatibility when filter pushdown is not used.


Are there any user-facing changes?

Yes — this is a correctness fix.

Users implementing custom TableProviders with filter pushdown support will now receive the expected filter predicates in delete_from and update. This restores correct behavior for DELETE and UPDATE statements with WHERE clauses and prevents accidental full-table modifications.

No API changes are introduced.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Enhance DML filter extraction to incorporate table scan filters,
split conjunctions, and ensure expressions are de-duplicated after
stripping qualifiers. Implement exact filter pushdown handling in
the delete test provider, and add a new DELETE plan test to
validate that table scan filters propagate into delete_from.
```rust
filters.extend(split_conjunction(filter).into_iter().cloned());
        }
    }
    _ => {}
```
Contributor

I worry slightly that as LogicalPlan and perhaps the optimizer behavior change in future, this default case may lead to similar issues. If the outcome were "nothing gets deleted" rather than "everything gets deleted" I'd worry less.

On the other hand, I can't think of how to improve it, and this change overall fixes the current issue.

Contributor Author

Good point!

The DML input we build for DELETE/UPDATE currently only carries predicates in Filter nodes or in TableScan.filters after pushdown, so the match over those two variants is exhaustive for today's plan shapes.

We also have regression tests for the pushdown paths -- `test_delete_filter_pushdown_extracts_table_scan_filters` and `test_delete_compound_filters_with_pushdown` -- which should catch any future optimizer change that relocates predicates away from these nodes.

If we ever add a new logical node that can hold filters, we'll extend extract_dml_filters instead of relying on the default case.

Contributor

@mjgarton mjgarton Jan 19, 2026

Would it make sense to handle each unhandled variant explicitly instead of via a default case? This is how several of the methods in LogicalPlan work currently, presumably to deliberately cause a compilation failure when new variants are added?

Contributor

For example:

```rust
// plans without inputs
LogicalPlan::TableScan { .. }
| LogicalPlan::EmptyRelation { .. }
| LogicalPlan::Values { .. }
| LogicalPlan::DescribeTable(_) => vec![],
```

Contributor Author

Good idea - b3240bf

@kosiew kosiew marked this pull request as ready for review January 19, 2026 11:28
@adriangb
Contributor

Would it help to have a filters field on TableScan similar to projection?

That's something I've wanted for various reasons. My view is that if we universally define a scan as a table + a set of filter expressions + a set of projection expressions across the codebase, a lot of use cases will be unlocked and things will be simplified. It gives us a framework under which to define cost-based optimizers, etc. That becomes a digression from the issue at hand, though.

Contributor

@ethan-tyler ethan-tyler left a comment

Thanks @mjgarton for catching this and @kosiew for the fix! I'm the original author of these DML hooks and a little embarrassed this slipped through in my original PR. The edge case where pushdown moves predicates entirely into TableScan.filters (leaving the Filter node empty or absent) wasn't something I tested properly. Grateful for the thorough find and fix.

A few thoughts on hardening:

  • I would add UPDATE test coverage: extract_dml_filters handles both DELETE and UPDATE, but only DELETE is tested.
  • A mixed-location test would have caught my original bug: some conjuncts in the residual Filter, others in TableScan.filters.
  • Scope extraction to the target scan only. It currently pulls filters from any scan in the subtree, which is unsafe for UPDATE … FROM.
  • Validate before stripping qualifiers: t.id = u.id becoming id = id is dangerous, and deserves a clear planning error.

Nits:

  • Dedup after stripping is fine for single-table DML; scoping/validation above covers multi-scan cases.
  • is_identity_assignment has the same qualifier-stripping hazard.

Happy to help with any of these; it's the least I can do given I introduced the bug.

```rust
// Split AND predicates into individual expressions
filters.extend(split_conjunction(&filter.predicate).into_iter().cloned());
        }
        LogicalPlan::TableScan(TableScan {
```
Contributor

This collects TableScan.filters from any scan in the subtree. Works for single-table DELETE/UPDATE, but unsafe for UPDATE … FROM (extra scans). Should we restrict extraction to the DML target scan (match table_name or provider identity) and fail-closed if multiple candidate scans exist?

Contributor Author

Done - b66edf4

Comment on lines +273 to +274
```rust
async fn test_delete_filter_pushdown_extracts_table_scan_filters() -> Result<()> {
    let provider = Arc::new(CaptureDeleteProvider::new_with_filter_pushdown(
```
Contributor

  1. Nice DELETE+Exact pushdown regression. I would add the same coverage for UPDATE … WHERE … + TableProviderFilterPushDown::Exact (since extract_dml_filters is used by both).

  2. I would think about adding a "mixed location" case (some conjuncts in residual Filter, others in TableScan.filters) to lock in union+dedup behavior.

Contributor Author

@kosiew kosiew Jan 20, 2026

Added in 98be1b5
5fcb927

```rust
// and TableScan.filters when the optimizer pushes some predicates down.
// We deduplicate by (unqualified) expression to avoid passing the same filter twice.
let mut seen_filters = HashSet::new();
let deduped = filters
```
Contributor

Dedup is done after stripping qualifiers; that can collapse distinct qualified predicates in multi-scan plans. I would either enforce single-target-scan eligibility or adjust dedup so it can't drop distinct predicates.

Contributor Author

@kosiew kosiew Jan 20, 2026

Addressed via:

  • Target scan scoping - b66edf4
  • Predicate-level validation - cfc945d

```rust
let mut seen_filters = HashSet::new();
let deduped = filters
    .into_iter()
    .map(strip_column_qualifiers)
```
Contributor

Qualifier stripping is dangerous if any predicate references non-target relations (t.id = u.id → id = id). I think we should validate all Column refs belong to the DML target before stripping, and error out (fail closed) otherwise.

Contributor Author

Addressed via qualifier validation in cfc945d

Contributor

Looks good - thanks for adding target-scan scoping and the mixed residual/pushdown coverage.

One remaining concern: extract_dml_filters currently drops non-target conjuncts (the predicate_is_on_target check in datafusion/core/src/physical_planner.rs:1941), and test_update_from_drops_non_target_predicates (datafusion/core/tests/custom_sources_cases/dml_planning.rs:658) codifies this as expected behavior. That can silently broaden row selection; in the worst case, if the join predicate is the only constraint, this becomes "update all rows".

Would it make sense to fail closed instead (planning error if any WHERE conjunct references a non-target relation), and update that test to assert the error?

```rust
])))))
}

fn supports_filters_pushdown(
```
Contributor

Right now pushdown mode is uniform for all filters. To test mixed-location behavior (Inexact vs Unsupported per conjunct), I would consider making this provider return per-filter decisions so we can force residual and pushed-down predicates in one query.

Contributor Author

added per-filter pushdown - bc5bb4c

kosiew added 17 commits January 20, 2026 13:43
  • Update DML filter extraction to utilize TableReference matching, omit non-target predicates, and remove qualifiers post-filtering and deduplication.
  • Pass TableReference targets into DML filter extraction for DELETE and UPDATE planning paths.
  • Add helper to detect table-qualified column references in captured filters for supporting UPDATE ... FROM assertions.
  • Implement tests to verify that TableScan filters are properly extracted into DML update filters and ensure non-target predicates are dropped while retaining target filters during UPDATE ... FROM operations.
  • …tion and improved handling for UPDATE...FROM queries
@ethan-tyler
Contributor

Thanks @kosiew - I'll do another pass and look at the new TableScan.filters as well. It's coming together nicely 👌

Contributor

@ethan-tyler ethan-tyler left a comment

@kosiew - Good solution for #19840, thanks for taking it on. The approach looks correct and the regression tests are solid.

Requesting changes for a couple of remaining cases that can still produce unexpected results (see inline comments).

```rust
}

/// Determine whether a predicate references only columns from the target table.
fn predicate_is_on_target(expr: &Expr, target: &TableReference) -> Result<bool> {
```
Contributor

Two concerns here:

  • Alias-qualified columns (x.id from UPDATE t AS x) may not match the target if the target is the base reference -- see update.slt:91 for the plan shape.
  • Dropping join predicates silently can broaden scope. Consider failing closed if any conjunct is rejected.

```rust
}

#[tokio::test]
async fn test_update_from_drops_non_target_predicates() -> Result<()> {
```
Contributor

Since t2 shares column names with t1, the "no t2 references" check can false-negative after qualifier stripping. I would use a t2-only column name to detect leakage reliably.


Labels

core Core DataFusion crate


Development

Successfully merging this pull request may close these issues.

TableProvider::delete_from problem with pushed down filters

4 participants