-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support API for "pre-image" for pruning predicate evaluation #19722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sdf-jkl -- reviewed this PR carefully this morning and it looks great (thank you for splitting up the work), I found it well commented and well designed and a joy to read
I do think we need to add unit tests tests to for this feature, which I know you have lined up in #18789 but I think writing the unit tests in for the rewrite will make it easiest to validate.
I also have some questions about the rewrite for = (aka the boundary conditions)
| // NOTE: we only consider immutable UDFs with literal RHS values | ||
| Expr::BinaryExpr(BinaryExpr { left, op, right }) => { | ||
| use datafusion_expr::Operator::*; | ||
| let is_preimage_op = matches!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be nice (as a follow on PR) to mention this list in the docs for preimage -- e.g. that it only applies to predicates =, !=, ...
f308662 to
5ffb704
Compare
|
I'll take a look at the failing tests |
|
Should be good I'll create a separate issue for date_part implementation |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @sdf-jkl -- this is looking so close. I have one more comment on the API, but I think then we'll be good
Let me know if you would prefer me to make these changes directly rather than the fedback. I figured you would appreciate the reviews and back and forth.
| let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else { | ||
| return Ok((None, None)); | ||
| }; | ||
| if !is_literal_or_literal_cast(right_expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a reason to limit this to literal ? It seems like the call to pre_image could handle this (and basically return if it wasn't a literal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example where we could use a non-literal expr on rhs for a comparison with preimage? I can't come up with one, but if there is, we could move expression matching into preimage impl
| return Ok(None); | ||
| } | ||
| match lit_expr { | ||
| Expr::Literal(ScalarValue::Int32(Some(500)), _) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this has to check for Expr::Literal anyways, I think the simplfy expression could just pass whatever argument in here, rather than only doing it with columns and literals
|
I definitely appreciate the feedback, and the back and forth. Thanks, I'll work on addressing it. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also working on some tests -- I'll make a PR to propose adding coverge to this PR
|
@sdf-jkl here are some tests and other small suggestions |
Add tests for additional cases
|
Wow, this is much cleaner, thanks! |
|
I think this PR needs two more things:
(I am trying to keep my review context under control, so trying to focus on getting stuff through before starting more) |
|
Both done. Re-requested a review. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @sdf-jkl
This looks good to me. I would like to change the signature to use Interval rather than Box<Interval> and there are a few other small comments, but we can also do this as a follow on PR (or I can push some commits to this PR)
Thank you for hanging with this one
FYI @colinmarc -- once we get this in, I think @sdf-jkl plans to implement preimage for date_part. Perhaps you are interested in something similar for date_trunc
Also, FYI @jonahgao and @xudong963 / @zhuqi-lucas in case you are interested in this PR (the primary usecase is improving the handling of date/timestamp predicates)
| None, | ||
| /// The expression always evaluates to the specified constant | ||
| /// given that `expr` is within the interval | ||
| Range { expr: Expr, interval: Box<Interval> }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to Box this? I think it might be simpler if it was Interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clippy suggested Boxing it because one enum variant is much bigger than the other. (threshold is 200bytes, None is 0, Range is 240bytes minimum)
warning: large size difference between variants
--> datafusion/expr/src/preimage.rs:22:1
|
22 | / pub enum PreimageResult {
23 | | /// No preimage exists for the specified value
24 | | None,
| | ---- the second-largest variant carries no data at all
25 | | /// The expression always evaluates to the specified constant
26 | | /// given that `expr` is within the interval
27 | | Range { expr: Expr, interval: Interval },
| | ---------------------------------------- the largest variant contains at least 240 bytes
28 | | }
| |_^ the entire enum is at least 240 bytes
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.92.0/index.html#large_enum_variant
= note: `#[warn(clippy::large_enum_variant)]` on by default
help: consider boxing the large fields or introducing indirection in some other way to reduce the total size of the enum
|
27 - Range { expr: Expr, interval: Interval },
27 + Range { expr: Expr, interval: Box<Interval> },
|
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
| let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else { | ||
| return Ok((None, None)); | ||
| }; | ||
| if !is_literal_or_literal_cast(right_expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)
| if !is_literal_or_literal_cast(right_expr) { | ||
| return Ok(PreimageResult::None); | ||
| } | ||
| if func.signature().volatility != Volatility::Immutable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for a follow on PR, I think it would be safe to rewrite stable functions (whose values don't change during the statement)
| Operator::LtEq => expr.lt(upper), | ||
| // <expr> = x ==> (<expr> >= lower) and (<expr> < upper) | ||
| // | ||
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) | |
| // <expr> is not distinct from x ==> (<expr> is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this IS NOT DISTICNT rewrite is correctas it is rewritten to just the range predicate. If expr is NULL and the literal is non-NULL, the original expression is FALSE, but the rewrite evaluates to NULL (x >= lower AND x < upper), which is not equivalent and violates the “same nullability” expectation for simplified expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?
If we want to keep false:
Operator::IsNotDistinctFrom => {
// expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
// because we know x is NOT NULL.
expr.clone().is_not_null().and(
and(expr.clone().gt_eq(lower), expr.lt(upper))
)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 this solves the issue. Thanks!
|
I'll have a look at the PR today |
| /// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image | ||
| /// | ||
| pub(super) fn rewrite_with_preimage( | ||
| _info: &SimplifyContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb mentioned that we should keep it in #18789 (comment), but it was a while ago.
| Operator::LtEq => expr.lt(upper), | ||
| // <expr> = x ==> (<expr> >= lower) and (<expr> < upper) | ||
| // | ||
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?
If we want to keep false:
Operator::IsNotDistinctFrom => {
// expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
// because we know x is NOT NULL.
expr.clone().is_not_null().and(
and(expr.clone().gt_eq(lower), expr.lt(upper))
)
}
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Rationale for this change
Splitting the PR to make it more readable.
What changes are included in this PR?
Adding the udf_preimage logic without date_part implementation.
Are these changes tested?
Added unit tests for a test specific function
Are there any user-facing changes?
No