@cbb330 commented Jan 27, 2026

Summary

Part 1/15 of the ORC predicate pushdown implementation.

Adds the foundation for statistics-based predicate pushdown:

  • Add GetStripeStatistics() API to ORCFileReader
  • Add forward declarations for liborc types
  • Add infrastructure for accessing ORC stripe-level statistics

This is the first building block that enables stripe filtering based on column statistics.

Changes

  • adapter.h: Add GetStripeStatistics API
  • adapter.cc: Implement statistics retrieval

Rationale

ORC files store min/max statistics at the stripe level. This API exposes those statistics to enable predicate pushdown at the Dataset API layer, following the same pattern as Parquet row group statistics.
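As a rough illustration of why stripe-level min/max values enable pushdown, the sketch below decides which stripes must be read for a predicate of the form `column > threshold`. It does not use the Arrow or liborc APIs; `StripeRange`, `MayContainGreaterThan`, and `SelectStripes` are hypothetical names chosen for this example, and the actual filtering logic arrives in later PRs of the series.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-stripe summary for one INT64 column.
// A missing bound means the statistic is unavailable for that stripe.
struct StripeRange {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Returns true if the stripe might contain a row with value > threshold.
// Without a known max we cannot rule the stripe out, so it must be read.
bool MayContainGreaterThan(const StripeRange& stripe, int64_t threshold) {
  if (!stripe.max.has_value()) return true;
  return *stripe.max > threshold;
}

// Collects the indices of stripes that cannot be skipped.
std::vector<std::size_t> SelectStripes(const std::vector<StripeRange>& stripes,
                                       int64_t threshold) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < stripes.size(); ++i) {
    if (MayContainGreaterThan(stripes[i], threshold)) selected.push_back(i);
  }
  return selected;
}
```

The check is deliberately conservative: a stripe is skipped only when its statistics prove that no row can match, which is the same guarantee that row group pruning relies on for Parquet.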

Part of a stacked PR series for ORC predicate pushdown.

Add internal utilities for extracting min/max statistics from ORC
stripe metadata. This establishes the foundation for statistics-based
stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)
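A minimal sketch of the extraction behavior listed above, assuming a simplified input type. `RawIntStats` is a hypothetical stand-in for the integer column statistics that liborc provides, and the field layout of `MinMaxStats` and the signature of `ExtractStripeStatistics` are assumptions; the real code in this PR may differ.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the raw INT64 statistics of one column
// in one stripe, as they might be read from ORC metadata.
struct RawIntStats {
  bool has_minimum;
  bool has_maximum;
  int64_t minimum;
  int64_t maximum;
};

// Holds extracted statistics; the field names are assumed for this sketch.
struct MinMaxStats {
  int64_t min;
  int64_t max;
};

// Returns std::nullopt when either bound is missing or when the bounds
// are inconsistent (min > max), so callers never see unusable statistics.
std::optional<MinMaxStats> ExtractStripeStatistics(const RawIntStats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) return std::nullopt;
  if (raw.minimum > raw.maximum) return std::nullopt;
  return MinMaxStats{raw.minimum, raw.maximum};
}
```

Returning `std::nullopt` rather than a sentinel range lets a caller treat "no usable statistics" as "cannot skip this stripe", which keeps filtering correct when the metadata is absent or corrupt.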

This is an internal-only change with no public API modifications.
Part of incremental ORC predicate pushdown implementation (PR1/15).
@cbb330 changed the title from "GH-48986: [C++][Dataset] Add ORC stripe statistics extraction foundation" to "GH-48986: [C++][Dataset] Add ORC stripe statistics extraction foundation (1/15)" on Jan 27, 2026