Open
Description
Feature Request / Improvement
Problem
The write path in pyiceberg/io/pyarrow.py is hardcoded to Parquet. The write.format.default table property exists but is never read. Adding a new format (ORC, Vortex, Lance) requires modifying the monolithic write_file() function. The read path already dispatches multiple formats; the write path should too.
Proposal
Introduce a File Format API aligned with Java Iceberg's File Format API (design doc).
New module pyiceberg/io/fileformat.py:
- FileFormatWriter (ABC)
- FileFormatModel (ABC)
- FormatRegistry
- DataFileStatistics (it's in pyarrow.py currently, but I think this might be good to consolidate for metrics)
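A minimal sketch of how these four pieces could fit together, to make the proposal concrete. Everything beyond the class names listed above is an assumption for illustration: the method names (`write`, `close`, `create_writer`, `register`, `get`), the stand-in `FileFormat` enum, the reduced `DataFileStatistics` (record count only), and the toy writer are hypothetical and not the existing pyiceberg API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class FileFormat(Enum):
    """Stand-in enum for this sketch (assumed shape, not imported from pyiceberg)."""
    PARQUET = "parquet"
    ORC = "orc"


@dataclass
class DataFileStatistics:
    """Simplified stand-in: only carries a record count here."""
    record_count: int


class FileFormatWriter(ABC):
    """Writes rows to one data file and reports statistics on close."""

    @abstractmethod
    def write(self, rows: List[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Describes one file format and creates writers for it."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its registered FileFormatModel."""

    def __init__(self) -> None:
        self._models: Dict[FileFormat, FileFormatModel] = {}

    def register(self, model: FileFormatModel) -> None:
        self._models[model.format] = model

    def get(self, file_format: FileFormat) -> FileFormatModel:
        if file_format not in self._models:
            raise ValueError(f"No format model registered for {file_format.value}")
        return self._models[file_format]


# Toy implementation (hypothetical): counts rows instead of writing bytes.
class _CountingWriter(FileFormatWriter):
    def __init__(self) -> None:
        self._count = 0

    def write(self, rows: List[Dict[str, Any]]) -> None:
        self._count += len(rows)

    def close(self) -> DataFileStatistics:
        return DataFileStatistics(record_count=self._count)


class _ToyParquetModel(FileFormatModel):
    @property
    def format(self) -> FileFormat:
        return FileFormat.PARQUET

    def create_writer(self, path: str) -> FileFormatWriter:
        return _CountingWriter()


registry = FormatRegistry()
registry.register(_ToyParquetModel())
writer = registry.get(FileFormat.PARQUET).create_writer("/tmp/example.parquet")
writer.write([{"id": 1}, {"id": 2}])
stats = writer.close()
```

Keying the registry by format alone mirrors the "no type params" simplification relative to the Java API; the real interfaces would presumably take a schema and table properties as well.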
Changes to pyiceberg/io/pyarrow.py:
- ParquetFormatWriter / ParquetFormatModel using the write_parquet() (inside write_file())
- write_file() refactored to read write.format.default, look up the format model, and dispatch.
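The dispatch step could look roughly like the sketch below. It is deliberately reduced to plain functions and a dict so it runs on its own: the `register_writer` helper, the writer signature, the `"parquet"` fallback, and the stub writer are assumptions for illustration, and this simplified `write_file()` is not the signature of the real function in pyarrow.py.

```python
from typing import Any, Callable, Dict, List, Mapping

WRITE_FORMAT_DEFAULT = "write.format.default"
FALLBACK_FORMAT = "parquet"  # assumed fallback when the property is unset

# Hypothetical writer signature: (path, rows) -> number of rows written.
Writer = Callable[[str, List[Dict[str, Any]]], int]

_WRITERS: Dict[str, Writer] = {}


def register_writer(format_name: str, writer: Writer) -> None:
    """Register a writer under a case-insensitive format name."""
    _WRITERS[format_name.lower()] = writer


def write_file(properties: Mapping[str, str], path: str, rows: List[Dict[str, Any]]) -> int:
    """Read the table property, look up the writer, and dispatch to it."""
    format_name = properties.get(WRITE_FORMAT_DEFAULT, FALLBACK_FORMAT).lower()
    writer = _WRITERS.get(format_name)
    if writer is None:
        raise ValueError(f"Unsupported write format: {format_name}")
    return writer(path, rows)


def _stub_parquet_writer(path: str, rows: List[Dict[str, Any]]) -> int:
    # Stand-in for the extracted Parquet writer; does no I/O in this sketch.
    return len(rows)


register_writer("parquet", _stub_parquet_writer)

written = write_file({}, "/tmp/data.parquet", [{"id": 1}, {"id": 2}, {"id": 3}])
```

With this shape, a table that sets write.format.default to an unregistered format fails with a clear error instead of silently producing Parquet.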
TCK tests/io/test_file_format_tck.py:
- pytest-parameterized round-trip, statistics, type coverage, and null handling tests for every registered format.
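
As a rough illustration of the parameterized shape such a TCK could take, the sketch below runs the same round-trip and null-handling checks against every registered entry. To stay runnable without pyiceberg, it uses two stdlib codecs (`json`, `pickle`) as stand-ins for file formats; the `CODECS` table and test names are hypothetical, and the real suite would iterate over the format registry instead.

```python
import json
import pickle

import pytest

# Stand-in "formats": each maps to an (encode, decode) pair.
CODECS = {
    "json": (lambda rows: json.dumps(rows).encode("utf-8"), lambda data: json.loads(data)),
    "pickle": (pickle.dumps, pickle.loads),
}

SAMPLE_ROWS = [
    {"id": 1, "name": "a", "score": 1.5},
    {"id": 2, "name": None, "score": None},
]


@pytest.mark.parametrize("codec_name", sorted(CODECS))
def test_round_trip(codec_name):
    """Whatever is written must be read back unchanged."""
    encode, decode = CODECS[codec_name]
    assert decode(encode(SAMPLE_ROWS)) == SAMPLE_ROWS


@pytest.mark.parametrize("codec_name", sorted(CODECS))
def test_null_handling(codec_name):
    """Null values must survive the round trip as None."""
    encode, decode = CODECS[codec_name]
    restored = decode(encode(SAMPLE_ROWS))
    assert restored[1]["name"] is None
    assert restored[1]["score"] is None
```

Because the parameter list is derived from the registered entries, a newly registered format would be covered by the same tests without adding new test functions.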
Phased rollout:
- ABCs and registry first, then Parquet extraction with TCK tests, then write_file() dispatch
Java ↔ Python Mapping
| Java | Python |
|---|---|
| FormatModel<D, S> | FileFormatModel (ABC, no type params) |
| FileAppender<D> / ModelWriteBuilder | FileFormatWriter (ABC) |
| FormatModelRegistry | FormatRegistry (keyed by FileFormat only) |
| Metrics | DataFileStatistics (existing) |
| TCK | test_file_format_tck.py |
Scope
This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support (#20) and any future formats (Avro, etc.) would be follow-ups once this lands.
References
- Java File Format API: apache/iceberg#12774
- Design doc: Google Doc
- Format impls: Parquet #15253, ORC #15255, Avro #15254
- TCK: apache/iceberg#15415
- Prior pyiceberg ORC work: #20, #790, #2236