Skip to content

Conversation

@sdf-jkl
Copy link
Contributor

@sdf-jkl sdf-jkl commented Jan 9, 2026

Which issue does this PR close?

Rationale for this change

Splitting the PR to make it more readable.

What changes are included in this PR?

Adding the udf_preimage logic without date_part implementation.

Are these changes tested?

Added unit tests for a test specific function

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Jan 9, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sdf-jkl -- reviewed this PR carefully this morning and it looks great (thank you for splitting up the work), I found it well commented and well designed and a joy to read

I do think we need to add unit tests tests to for this feature, which I know you have lined up in #18789 but I think writing the unit tests in for the rewrite will make it easiest to validate.

I also have some questions about the rewrite for = (aka the boundary conditions)

// NOTE: we only consider immutable UDFs with literal RHS values
Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
use datafusion_expr::Operator::*;
let is_preimage_op = matches!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be nice (as a follow on PR) to mention this list in the docs for preimage -- e.g. that it only applies to predicates =, !=, ...

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate spark labels Jan 18, 2026
@sdf-jkl sdf-jkl force-pushed the smaller-preimage-pr-1 branch from f308662 to 5ffb704 Compare January 18, 2026 18:21
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jan 18, 2026
@alamb alamb changed the title Support "pre-image" for pruning predicate evaluation #1 Support API for "pre-image" for pruning predicate evaluation Jan 19, 2026
@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 19, 2026

I'll take a look at the failing tests

@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 19, 2026

Should be good

I'll create a separate issue for date_part implementation

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @sdf-jkl -- this is looking so close. I have one more comment on the API, but I think then we'll be good

Let me know if you would prefer me to make these changes directly rather than the fedback. I figured you would appreciate the reviews and back and forth.

let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else {
return Ok((None, None));
};
if !is_literal_or_literal_cast(right_expr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a reason to limit this to literal ? It seems like the call to pre_image could handle this (and basically return if it wasn't a literal)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example where we could use a non-literal expr on rhs for a comparison with preimage? I can't come up with one, but if there is, we could move expression matching into preimage impl

return Ok(None);
}
match lit_expr {
Expr::Literal(ScalarValue::Int32(Some(500)), _) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this has to check for Expr::Literal anyways, I think the simplfy expression could just pass whatever argument in here, rather than only doing it with columns and literals

@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 19, 2026

I definitely appreciate the feedback, and the back and forth. Thanks, I'll work on addressing it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also working on some tests -- I'll make a PR to propose adding coverge to this PR

@alamb
Copy link
Contributor

alamb commented Jan 20, 2026

@sdf-jkl here are some tests and other small suggestions

Add tests for additional cases
@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 20, 2026

Wow, this is much cleaner, thanks!

@alamb
Copy link
Contributor

alamb commented Jan 20, 2026

I think this PR needs two more things:

  1. Fi the NULL handling (probably by not calling preimage with null constants)
  2. Update the API to have only a single method

(I am trying to keep my review context under control, so trying to focus on getting stuff through before starting more)

@sdf-jkl sdf-jkl requested a review from alamb January 20, 2026 19:01
@sdf-jkl
Copy link
Contributor Author

sdf-jkl commented Jan 20, 2026

Both done. Re-requested a review.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @sdf-jkl

This looks good to me. I would like to change the signature to use Interval rather than Box<Interval> and there are a few other small comments, but we can also do this as a follow on PR (or I can push some commits to this PR)

Thank you for hanging with this one

FYI @colinmarc -- once we get this in, I think @sdf-jkl plans to implement preimage for date_part. Perhaps you are interested in something similar for date_trunc

Also, FYI @jonahgao and @xudong963 / @zhuqi-lucas in case you are interested in this PR (the primary usecase is improving the handling of date/timestamp predicates)

None,
/// The expression always evaluates to the specified constant
/// given that `expr` is within the interval
Range { expr: Expr, interval: Box<Interval> },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to Box this? I think it might be simpler if it was Interval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy suggested Boxing it because one enum variant is much bigger than the other. (threshold is 200bytes, None is 0, Range is 240bytes minimum)

warning: large size difference between variants
  --> datafusion/expr/src/preimage.rs:22:1
   |
22 | / pub enum PreimageResult {
23 | |     /// No preimage exists for the specified value
24 | |     None,
   | |     ---- the second-largest variant carries no data at all
25 | |     /// The expression always evaluates to the specified constant
26 | |     /// given that `expr` is within the interval
27 | |     Range { expr: Expr, interval: Interval },
   | |     ---------------------------------------- the largest variant contains at least 240 bytes
28 | | }
   | |_^ the entire enum is at least 240 bytes
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.92.0/index.html#large_enum_variant
   = note: `#[warn(clippy::large_enum_variant)]` on by default
help: consider boxing the large fields or introducing indirection in some other way to reduce the total size of the enum
   |
27 -     Range { expr: Expr, interval: Interval },
27 +     Range { expr: Expr, interval: Box<Interval> },
   |

let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else {
return Ok((None, None));
};
if !is_literal_or_literal_cast(right_expr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)

if !is_literal_or_literal_cast(right_expr) {
return Ok(PreimageResult::None);
}
if func.signature().volatility != Volatility::Immutable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for a follow on PR, I think it would be safe to rewrite stable functions (whose values don't change during the statement)

Operator::LtEq => expr.lt(upper),
// <expr> = x ==> (<expr> >= lower) and (<expr> < upper)
//
// <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper))
// <expr> is not distinct from x ==> (<expr> is NULL) or ((<expr> >= lower) and (<expr> < upper))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this IS NOT DISTICNT rewrite is correctas it is rewritten to just the range predicate. If expr is NULL and the literal is non-NULL, the original expression is FALSE, but the rewrite evaluates to NULL (x >= lower AND x < upper), which is not equivalent and violates the “same nullability” expectation for simplified expressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?

If we want to keep false:

Operator::IsNotDistinctFrom => {
    // expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
    // because we know x is NOT NULL.
    expr.clone().is_not_null().and(
        and(expr.clone().gt_eq(lower), expr.lt(upper))
    )
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xudong963 this solves the issue. Thanks!

@xudong963
Copy link
Member

I'll have a look at the PR today

/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
///
pub(super) fn rewrite_with_preimage(
_info: &SimplifyContext,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb mentioned that we should keep it in #18789 (comment), but it was a while ago.

Operator::LtEq => expr.lt(upper),
// <expr> = x ==> (<expr> >= lower) and (<expr> < upper)
//
// <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?

If we want to keep false:

Operator::IsNotDistinctFrom => {
    // expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
    // because we know x is NOT NULL.
    expr.clone().is_not_null().and(
        and(expr.clone().gt_eq(lower), expr.lt(upper))
    )
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support "pre-image" for pruning predicate evaluation

3 participants