@cbb330 commented Jan 27, 2026

Summary

Part 2/15 of ORC predicate pushdown implementation.

⚠️ Depends on PR #49009 being merged first

Builds on PR #49009 to add statistics-to-expression conversion:

  • Convert ORC stripe statistics to Arrow Expression guarantees
  • Support INT32 and INT64 types
  • Handle NULL values by adding OR is_null(field) when the stripe contains nulls
  • Build min/max range expressions: field >= min AND field <= max

Changes

  • Add StripeStatsAsExpression() function
  • Extract min/max from IntegerColumnStatistics
  • Generate guarantee expressions for SimplifyWithGuarantee()

Example

For a stripe with id column stats min=100, max=500:

  • Generates expression: ((id >= 100) AND (id <= 500)) OR is_null(id), where the is_null term is included only when the stripe contains nulls
  • The predicate id > 1000 cannot be satisfied under this guarantee, so the stripe is skipped
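
To make the skip decision in this example concrete, here is a minimal standalone sketch of the same reasoning using only the C++ standard library. It does not call Arrow's SimplifyWithGuarantee(); `StripeRange` and `CanSkipForGreaterThan` are hypothetical names introduced for illustration and are not part of this PR.

```cpp
#include <cstdint>

// Hypothetical stand-in for the guarantee built from stripe statistics:
// every non-null value v in the stripe satisfies min <= v <= max.
struct StripeRange {
  int64_t min;
  int64_t max;
  bool has_nulls;
};

// Returns true when no row in the stripe can satisfy `column > bound`.
// A comparison against a null value never evaluates to true, so the
// presence of nulls does not prevent skipping for this kind of predicate.
inline bool CanSkipForGreaterThan(const StripeRange& range, int64_t bound) {
  return range.max <= bound;
}
```

With min=100 and max=500, `CanSkipForGreaterThan(range, 1000)` is true, which matches the "skip stripe" outcome above, while a bound of 400 would require reading the stripe.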

Part of stacked PR series. Review after PR #49009.

Add internal utilities for extracting min/max statistics from ORC
stripe metadata. This establishes the foundation for statistics-based
stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)
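
The extraction contract described in this list could look roughly like the sketch below. The field names of `MinMaxStats`, the `RawIntegerStats` input type, and the function name `ExtractMinMax` are assumptions made for illustration; the actual ExtractStripeStatistics() signature is not shown in this PR text.

```cpp
#include <cstdint>
#include <optional>

// Field names are assumptions; the commit message only names the struct.
struct MinMaxStats {
  int64_t min;
  int64_t max;
  bool has_null;
};

// Hypothetical flattened view of a stripe's integer column statistics.
struct RawIntegerStats {
  bool has_minimum;
  bool has_maximum;
  int64_t minimum;
  int64_t maximum;
  bool has_null;
};

// Returns std::nullopt when statistics are missing or inconsistent,
// so callers fall back to reading the stripe instead of filtering on bad data.
inline std::optional<MinMaxStats> ExtractMinMax(const RawIntegerStats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) return std::nullopt;  // missing
  if (raw.minimum > raw.maximum) return std::nullopt;             // invalid: min > max
  return MinMaxStats{raw.minimum, raw.maximum, raw.has_null};
}
```

Returning std::nullopt rather than a default range keeps the failure mode conservative: a stripe without usable statistics is never skipped.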

This is an internal-only change with no public API modifications.
Part of incremental ORC predicate pushdown implementation (PR1/15).

Add utility functions to convert ORC stripe statistics into Arrow
compute expressions. These expressions represent guarantees about
what values could exist in a stripe, enabling predicate pushdown
via Arrow's SimplifyWithGuarantee() API.

Changes:
- Add BuildMinMaxExpression() for creating range expressions
- Support null handling with OR is_null(field) when nulls present
- Add convenience overload accepting MinMaxStats directly
- Expression format: (field >= min AND field <= max) [OR is_null(field)]
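
As an illustration of the expression format and its null handling, the following standalone sketch models the guarantee without depending on Arrow. `RangeGuarantee`, `ToString`, and `Admits` are hypothetical helpers, not the BuildMinMaxExpression() API added by this change.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical model of the guarantee derived from stripe statistics.
struct RangeGuarantee {
  std::string field;
  int64_t min;
  int64_t max;
  bool has_nulls;
};

// Renders the guarantee in the format listed above:
// (field >= min AND field <= max) [OR is_null(field)]
inline std::string ToString(const RangeGuarantee& g) {
  std::string range = "(" + g.field + " >= " + std::to_string(g.min) +
                      " AND " + g.field + " <= " + std::to_string(g.max) + ")";
  return g.has_nulls ? range + " OR is_null(" + g.field + ")" : range;
}

// Returns true when a value (or a null, given as std::nullopt) is allowed
// to appear in the stripe under the guarantee.
inline bool Admits(const RangeGuarantee& g, std::optional<int64_t> value) {
  if (!value.has_value()) return g.has_nulls;  // nulls only via the is_null term
  return *value >= g.min && *value <= g.max;
}
```

The `has_nulls` branch shows why the OR is_null(field) term matters: without it, the guarantee would wrongly claim that the stripe contains no null rows.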

This is an internal-only utility with no public API changes.
Part of incremental ORC predicate pushdown implementation (PR2/15).
@cbb330 changed the title from "GH-48986: [C++][Dataset] Add Arrow expression builder for ORC statistics" to "GH-48986: [C++][Dataset] Add Arrow expression builder for ORC statistics (2/15)" on Jan 27, 2026