-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix/parquet opener page index policy #19890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix/parquet opener page index policy #19890
Conversation
|
Resolved the issues! @kumarUjjawal |
91e0832 to
faffff0
Compare
| reader_metadata = load_page_index( | ||
| reader_metadata, | ||
| &mut async_file_reader, | ||
| // Since we're manually loading the page index the option here should not matter but we pass it in for consistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment probably needs to be updated.
| // Write parquet WITHOUT page index | ||
| let props = WriterProperties::builder().build(); | ||
|
|
||
| let file_fs = std::fs::File::create(&path).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tries to reopen the already opened file (line 979). Maybe you can just reuse it ?!
| let path = file.path().to_str().unwrap().to_string(); | ||
|
|
||
| // Write parquet WITHOUT page index | ||
| let props = WriterProperties::builder().build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which property exactly controls whether the page index is enabled or not ?
Maybe you can disable it explicitly here with WriterProperties::builder().with_xyz(false).build();, so it will still work as desired if it ever becomes enabled by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the update of the parquet-testing submodule needed for the new test ?
Which issue does this PR close?
ParquetOpenerfails on files withoutPageIndexmetadata #19839.Rationale for this change
The ParquetOpener was using
ArrowReaderOptions::with_page_index(true), which internally setsPageIndexPolicy::Required. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.Relaxing this policy to
PageIndexPolicy::Optionalallows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.What changes are included in this PR?
PageIndexPolicy::Optionalinstead ofRequired.Are these changes tested?
Yes. I have added a dedicated regression test case:
This test writes a Parquet file specifically without page index metadata and verifies that ParquetOpener can read it successfully when
parquet_page_index_pruningis enabled.Are there any user-facing changes?
No. This is a bug fix that improves the robustness of the Parquet reader.