Skip to content

Conversation

@aviralgarg05
Copy link

Which issue does this PR close?

Rationale for this change

The ParquetOpener was using ArrowReaderOptions::with_page_index(true), which internally sets PageIndexPolicy::Required. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.

Relaxing this policy to PageIndexPolicy::Optional allows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.

What changes are included in this PR?

  • Modified datafusion/datasource-parquet/src/opener.rs to use PageIndexPolicy::Optional instead of Required.
  • Added a new regression test in datafusion/core/tests/parquet/issue_19839.rs that validates reading a Parquet file written without a page index.

Are these changes tested?

Yes. I have added a dedicated regression test case:

  • datafusion/core/tests/parquet/issue_19839.rs

This test writes a Parquet file specifically without page index metadata and verifies that ParquetOpener can read it successfully when parquet_page_index_pruning is enabled.

Are there any user-facing changes?

No. This is a bug fix that improves the robustness of the Parquet reader.

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Jan 20, 2026
@aviralgarg05
Copy link
Author

Resolved the issues! @kumarUjjawal

@aviralgarg05 aviralgarg05 force-pushed the fix/parquet-opener-page-index-policy branch from 91e0832 to faffff0 Compare January 20, 2026 10:25
reader_metadata = load_page_index(
reader_metadata,
&mut async_file_reader,
// Since we're manually loading the page index the option here should not matter but we pass it in for consistency
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment probably needs to be updated.

// Write parquet WITHOUT page index
let props = WriterProperties::builder().build();

let file_fs = std::fs::File::create(&path).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tries to reopen the already opened file (line 979). Maybe you can just reuse it ?!

let path = file.path().to_str().unwrap().to_string();

// Write parquet WITHOUT page index
let props = WriterProperties::builder().build();
Copy link
Member

@martin-g martin-g Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which property exactly controls whether the page index is enabled or not ?
Maybe you can disable it explicitly here with WriterProperties::builder().with_xyz(false).build();, so it will still work as desired if it ever becomes enabled by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the update of the parquet-testing submodule needed for the new test ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ParquetOpener fails on files without PageIndex metadata

3 participants