
@cbb330 cbb330 commented Jan 27, 2026

Summary

Part 3/15 of ORC predicate pushdown implementation.

⚠️ Depends on PR #49009 and PR 2 being merged first

Adds lazy evaluation infrastructure to OrcFileFragment:

  • Only process statistics for fields referenced in predicate
  • Cache statistics expressions to avoid recomputation
  • Track which fields have been processed
  • Cost scales with O(fields_in_predicate) rather than O(all_fields)

Changes

  • Add statistics_expressions_ cache to OrcFileFragment
  • Add statistics_expressions_complete_ tracking
  • Add metadata caching infrastructure
  • Add EnsureMetadataCached() for lazy loading

Performance Impact

For a file with 100 columns and predicate on 2 columns:

  • Without lazy evaluation: process statistics for 100 fields × N stripes
  • With lazy evaluation: process statistics for 2 fields × N stripes (50x reduction)
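
The counting argument above can be sketched with a small model. This is not the code from the PR: `LazyStatsTracker` and `Process()` are hypothetical names, and one "unit of work" is assumed to be one statistics lookup for one field in one stripe.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of lazy statistics processing (not the PR's class).
class LazyStatsTracker {
 public:
  LazyStatsTracker(std::size_t num_fields, std::size_t num_stripes)
      : num_stripes_(num_stripes), complete_(num_fields, false) {}

  // Processes statistics only for predicate fields that have not been
  // processed before. Returns the number of (field, stripe) lookups done.
  std::size_t Process(const std::vector<std::size_t>& predicate_fields) {
    std::size_t touched = 0;
    for (std::size_t field : predicate_fields) {
      if (field >= complete_.size() || complete_[field]) {
        continue;  // unknown field, or already cached by an earlier call
      }
      complete_[field] = true;
      touched += num_stripes_;  // one lookup per stripe for this field
    }
    return touched;
  }

 private:
  std::size_t num_stripes_;
  std::vector<bool> complete_;  // per-field "already processed" flags
};
```

With 100 fields, N stripes, and a predicate on two fields, the first `Process()` call in this model touches 2 × N entries, and a repeated call with the same fields touches none.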

Part of stacked PR series. Review after PR 2.

Add internal utilities for extracting min/max statistics from ORC
stripe metadata. This establishes the foundation for statistics-based
stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)
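
As a rough illustration of the extraction rules listed above, here is a minimal sketch. The fields of `MinMaxStats`, the `RawIntStats` input type, and the signature of `ExtractStripeStatistics()` are all assumptions made for the example; the actual change reads from ORC stripe metadata, which is not modelled here.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the INT64 statistics of one column in one stripe.
struct RawIntStats {
  bool has_minimum = false;
  bool has_maximum = false;
  int64_t minimum = 0;
  int64_t maximum = 0;
  bool has_null = false;
};

// Assumed layout; the commit names the struct but does not list its fields.
struct MinMaxStats {
  int64_t min;
  int64_t max;
  bool has_null;
};

// Returns std::nullopt when statistics are missing or inconsistent.
std::optional<MinMaxStats> ExtractStripeStatistics(const RawIntStats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) {
    return std::nullopt;  // missing data: no guarantee can be derived
  }
  if (raw.minimum > raw.maximum) {
    return std::nullopt;  // integrity check failed (min <= max violated)
  }
  return MinMaxStats{raw.minimum, raw.maximum, raw.has_null};
}
```

Returning `std::nullopt` instead of an error lets a caller treat the stripe as "cannot be skipped", which is the safe choice when statistics are unusable.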

This is an internal-only change with no public API modifications.
Part of incremental ORC predicate pushdown implementation (PR1/15).

Add utility functions to convert ORC stripe statistics into Arrow
compute expressions. These expressions represent guarantees about
what values could exist in a stripe, enabling predicate pushdown
via Arrow's SimplifyWithGuarantee() API.

Changes:
- Add BuildMinMaxExpression() for creating range expressions
- Support null handling with OR is_null(field) when nulls present
- Add convenience overload accepting MinMaxStats directly
- Expression format: (field >= min AND field <= max) [OR is_null(field)]
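
The expression format above can be illustrated with a simplified sketch that renders the guarantee as text. This is an assumption-laden stand-in: the real utility builds Arrow compute expressions, while this version returns a `std::string`, and the `MinMaxStats` fields and both signatures are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Assumed layout; the commit names the struct but does not list its fields.
struct MinMaxStats {
  int64_t min;
  int64_t max;
  bool has_null;
};

// Renders "(field >= min AND field <= max) [OR is_null(field)]" as text.
std::string BuildMinMaxExpression(const std::string& field, int64_t min,
                                  int64_t max, bool has_null) {
  std::string expr = "(" + field + " >= " + std::to_string(min) + " AND " +
                     field + " <= " + std::to_string(max) + ")";
  if (has_null) {
    // The stripe may contain nulls, so the guarantee must allow them.
    expr += " OR is_null(" + field + ")";
  }
  return expr;
}

// Convenience overload taking the extracted statistics directly.
std::string BuildMinMaxExpression(const std::string& field,
                                  const MinMaxStats& stats) {
  return BuildMinMaxExpression(field, stats.min, stats.max, stats.has_null);
}
```

In the actual code path, an expression of this shape would be handed to SimplifyWithGuarantee() together with the scan predicate; a predicate that simplifies to false under the guarantee means the stripe can be skipped.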

This is an internal-only utility with no public API changes.
Part of incremental ORC predicate pushdown implementation (PR2/15).

Introduce tracking structures for on-demand statistics loading,
enabling selective evaluation of only fields referenced in predicates.
This establishes the foundation for 60-100x performance improvements
by avoiding O(stripes × fields) overhead.

Changes:
- Add OrcFileFragment class extending FileFragment
- Add statistics_expressions_ vector (per-stripe guarantee tracking)
- Add statistics_expressions_complete_ vector (per-field completion tracking)
- Initialize structures in EnsureMetadataCached() with mutex protection
- Add FoldingAnd() helper for efficient expression accumulation
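
A minimal sketch of how these pieces might fit together follows. It is not the PR's code: `Expr` is a plain string standing in for an Arrow compute expression, `FragmentStatsSketch`, `AddFieldGuarantee()` and `Guarantee()` are hypothetical names, and the parameters of `EnsureMetadataCached()` are assumed for the example. Only the member names `statistics_expressions_` and `statistics_expressions_complete_` come from the change list.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using Expr = std::string;  // hypothetical stand-in for a compute expression

// AND-accumulates `right` into `*left`, dropping a neutral "true" on the left.
void FoldingAnd(Expr* left, Expr right) {
  if (*left == "true") {
    *left = std::move(right);
  } else {
    *left = "(" + *left + " AND " + right + ")";
  }
}

class FragmentStatsSketch {
 public:
  // Initializes the tracking vectors once; later calls are no-ops.
  void EnsureMetadataCached(std::size_t num_stripes, std::size_t num_fields) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (initialized_) return;
    statistics_expressions_.assign(num_stripes, "true");
    statistics_expressions_complete_.assign(num_fields, false);
    initialized_ = true;
  }

  // Folds one field's per-stripe guarantees in, at most once per field.
  void AddFieldGuarantee(std::size_t field, const std::vector<Expr>& per_stripe) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (field >= statistics_expressions_complete_.size()) return;
    if (statistics_expressions_complete_[field]) return;
    if (per_stripe.size() != statistics_expressions_.size()) return;
    for (std::size_t i = 0; i < per_stripe.size(); ++i) {
      FoldingAnd(&statistics_expressions_[i], per_stripe[i]);
    }
    statistics_expressions_complete_[field] = true;
  }

  // Returns the accumulated guarantee for a stripe ("true" if unknown).
  Expr Guarantee(std::size_t stripe) const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (stripe >= statistics_expressions_.size()) return "true";
    return statistics_expressions_[stripe];
  }

 private:
  mutable std::mutex mutex_;
  bool initialized_ = false;
  std::vector<Expr> statistics_expressions_;           // one entry per stripe
  std::vector<bool> statistics_expressions_complete_;  // one flag per field
};
```

Starting every stripe at "true" and folding with AND keeps the accumulated guarantee small when only a few fields are ever referenced, which is the point of the per-field completion flags.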

Pattern follows Parquet's proven lazy evaluation approach.
This is infrastructure-only with no public API exposure yet.
Part of incremental ORC predicate pushdown implementation (PR3/15).
@cbb330 cbb330 changed the title GH-48986: [C++][Dataset] Add lazy evaluation infrastructure for ORC predicate pushdown GH-48986: [C++][Dataset] Add lazy evaluation infrastructure for ORC predicate pushdown (3/15) Jan 27, 2026