@cbb330 cbb330 commented Jan 27, 2026

Summary

Part 4/15 of ORC predicate pushdown implementation.

⚠️ Depends on PRs 1-3 being merged first

Implements core filtering logic:

  • TestStripes(): Simplify predicate with each stripe's statistics
  • FilterStripes(): Return stripe indices that may satisfy predicate
  • Use SimplifyWithGuarantee() and IsSatisfiable() from Arrow Compute

Example

File with 3 stripes:

  • Stripe 0: id in [1, 100]
  • Stripe 1: id in [101, 200]
  • Stripe 2: id in [201, 300]

Query: WHERE id > 150

  • Stripe 0: max=100 <= 150 → Skip (no row can exceed 150)
  • Stripe 1: max=200 > 150 → Keep
  • Stripe 2: min=201 > 150 → Keep
  • Result: [1, 2]
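
The example above can be sketched in plain C++ with no Arrow or ORC dependency. `StripeRange` and `StripesForGreaterThan` are hypothetical names used only for illustration. The PR itself delegates this decision to SimplifyWithGuarantee() and IsSatisfiable() rather than comparing bounds by hand.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one stripe's min/max statistics on `id`.
struct StripeRange {
  int64_t min;
  int64_t max;
};

// Returns the indices of stripes that may contain a row with id > threshold.
// A stripe is skipped only when its max proves that no row can match.
std::vector<int> StripesForGreaterThan(const std::vector<StripeRange>& stripes,
                                       int64_t threshold) {
  std::vector<int> kept;
  for (int i = 0; i < static_cast<int>(stripes.size()); ++i) {
    if (stripes[i].max > threshold) kept.push_back(i);
  }
  return kept;
}
```

With the three stripes above and a threshold of 150, this returns indices 1 and 2.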

Part of stacked PR series. Review after PR 3.

Add internal utilities for extracting min/max statistics from ORC
stripe metadata. This establishes the foundation for statistics-based
stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)
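
A minimal sketch of the extraction contract described in this list, assuming a simplified input type. `RawInt64Stats`, the fields of `MinMaxStats`, and the function name `ExtractStatsSketch` are assumptions for illustration. The real ExtractStripeStatistics() reads ORC stripe metadata, and its signature is not shown here.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical raw view of one INT64 column's stripe statistics.
struct RawInt64Stats {
  bool has_minimum;
  bool has_maximum;
  int64_t minimum;
  int64_t maximum;
  bool has_null;
};

// Field layout is assumed; the PR only names the struct.
struct MinMaxStats {
  int64_t min;
  int64_t max;
  bool has_null;
};

// Returns std::nullopt when statistics are missing or inconsistent,
// so callers can fall back to reading the stripe.
std::optional<MinMaxStats> ExtractStatsSketch(const RawInt64Stats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) return std::nullopt;
  if (raw.minimum > raw.maximum) return std::nullopt;  // integrity check
  return MinMaxStats{raw.minimum, raw.maximum, raw.has_null};
}
```

Returning std::nullopt instead of a default range keeps invalid statistics from being used as a guarantee.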

This is an internal-only change with no public API modifications.
Part of incremental ORC predicate pushdown implementation (PR1/15).

Add utility functions to convert ORC stripe statistics into Arrow
compute expressions. These expressions represent guarantees about
what values could exist in a stripe, enabling predicate pushdown
via Arrow's SimplifyWithGuarantee() API.

Changes:
- Add BuildMinMaxExpression() for creating range expressions
- Support null handling with OR is_null(field) when nulls present
- Add convenience overload accepting MinMaxStats directly
- Expression format: (field >= min AND field <= max) [OR is_null(field)]
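
The semantics of the expression format above can be modelled without Arrow. This is a sketch only: `RangeGuarantee` and `Admits` are hypothetical names, and the actual BuildMinMaxExpression() produces an Arrow compute expression rather than a struct.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model of a stripe guarantee:
//   (field >= min AND field <= max) [OR is_null(field)]
struct RangeGuarantee {
  int64_t min;
  int64_t max;
  bool has_null;  // true when the stripe statistics report nulls
};

// Reports whether the guarantee allows `value` to appear in the stripe.
// std::nullopt represents a null field value.
bool Admits(const RangeGuarantee& g, std::optional<int64_t> value) {
  if (!value.has_value()) return g.has_null;
  return *value >= g.min && *value <= g.max;
}
```

The null branch matters because, without it, a predicate such as is_null(field) would be judged unsatisfiable for a stripe that does contain nulls.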

This is an internal-only utility with no public API changes.
Part of incremental ORC predicate pushdown implementation (PR2/15).

Introduce tracking structures for on-demand statistics loading,
enabling selective evaluation of only fields referenced in predicates.
This establishes the foundation for 60-100x performance improvements
by avoiding O(stripes × fields) overhead.

Changes:
- Add OrcFileFragment class extending FileFragment
- Add statistics_expressions_ vector (per-stripe guarantee tracking)
- Add statistics_expressions_complete_ vector (per-field completion tracking)
- Initialize structures in EnsureMetadataCached() with mutex protection
- Add FoldingAnd() helper for efficient expression accumulation
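
A rough, standard-library-only model of the tracking structures listed above. `LazyStripeGuarantees`, `EnsureField`, and `ConjunctCount` are hypothetical names, and each guarantee is a placeholder string rather than an Arrow expression. An empty conjunct list stands in for the trivial guarantee, so appending to it plays the role that FoldingAnd() is described as playing.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

class LazyStripeGuarantees {
 public:
  LazyStripeGuarantees(std::size_t num_stripes, std::size_t num_fields)
      : guarantees_(num_stripes), complete_(num_fields, false) {}

  // Loads statistics for `field` across all stripes at most once.
  // Returns true if this call did the work, false if already complete.
  bool EnsureField(std::size_t field) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (complete_.at(field)) return false;
    for (std::size_t stripe = 0; stripe < guarantees_.size(); ++stripe) {
      // Placeholder for a real per-stripe min/max guarantee.
      guarantees_[stripe].push_back("field" + std::to_string(field) +
                                    "@stripe" + std::to_string(stripe));
    }
    complete_[field] = true;
    return true;
  }

  // Number of per-field guarantees accumulated for one stripe.
  std::size_t ConjunctCount(std::size_t stripe) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return guarantees_.at(stripe).size();
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::vector<std::string>> guarantees_;  // one entry per stripe
  std::vector<bool> complete_;                        // one entry per field
};
```

Only fields that a predicate references ever reach EnsureField(), which is how the O(stripes × fields) cost is avoided in this model.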

Pattern follows Parquet's proven lazy evaluation approach.
This is infrastructure-only with no public API exposure yet.
Part of incremental ORC predicate pushdown implementation (PR3/15).

Implement first end-to-end working predicate pushdown for ORC files.
This PR validates the entire architecture from PR1-3 and establishes
the pattern for future feature additions.

Scope is limited to proving the concept:
- INT64 columns only
- Greater-than operator (>) only

Changes:
- Add FilterStripes() public API to OrcFileFragment
- Add TestStripes() internal method for stripe evaluation
- Implement lazy statistics evaluation (processes only referenced fields)
- Integrate with Arrow's SimplifyWithGuarantee() for correctness
- Add ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN feature flag
- Cache ORC reader to avoid repeated file opens
- Conservative fallback: include all stripes if statistics unavailable
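
The conservative fallback can be sketched as follows, assuming a greater-than predicate on an INT64 column. `Int64Range` and `FilterForGreaterThan` are hypothetical names, not the FilterStripes() signature. How ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN is read is not stated here, so the sketch takes a plain bool instead.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical min/max range for one stripe.
struct Int64Range {
  int64_t min;
  int64_t max;
};

// Keeps every stripe unless its statistics prove that no row has
// value > threshold. Missing statistics or a disabled pushdown flag
// both result in the stripe being kept.
std::vector<int> FilterForGreaterThan(
    const std::vector<std::optional<Int64Range>>& stripe_stats,
    int64_t threshold, bool pushdown_disabled) {
  std::vector<int> kept;
  for (int i = 0; i < static_cast<int>(stripe_stats.size()); ++i) {
    const auto& stats = stripe_stats[i];
    const bool provably_empty =
        !pushdown_disabled && stats.has_value() && stats->max <= threshold;
    if (!provably_empty) kept.push_back(i);
  }
  return kept;
}
```

Skipping is the only action that can lose rows, so every uncertain case resolves to reading the stripe.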

The implementation achieves significant performance improvements by
skipping stripes that provably cannot contain matching data.

Part of incremental ORC predicate pushdown implementation (PR4/15).

@cbb330 changed the title from "GH-48986: [C++][Dataset] Add basic ORC stripe filtering API with predicate pushdown" to "GH-48986: [C++][Dataset] Add basic ORC stripe filtering API with predicate pushdown (4/15)" on Jan 27, 2026