File Format API for PyIceberg #3100

@nssalian

Feature Request / Improvement

Problem

The write path in pyiceberg/io/pyarrow.py is hardcoded to Parquet. The write.format.default table property exists but is never read. Adding a new format (ORC, Vortex, Lance) requires modifying the monolithic write_file() function. The read path already dispatches across multiple formats; the write path should too.

Proposal

Introduce a File Format API aligned with Java Iceberg's File Format API (design doc).

New module pyiceberg/io/fileformat.py:

  • FileFormatWriter (ABC)
  • FileFormatModel (ABC)
  • FormatRegistry
  • DataFileStatistics (currently in pyarrow.py, but I think it would be good to consolidate it here for metrics)
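
To make the shape of these pieces concrete, here is a minimal sketch of how the ABCs and the registry could fit together. It is not the proposed implementation: all method names and signatures (write, close, create_writer, register, get) are assumptions, FileFormat is a local stand-in for the enum in pyiceberg.manifest, DataFileStatistics is reduced to two fields, and ToyWriter / ToyModel are hypothetical in-memory classes that exist only to exercise the registry.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Mapping


class FileFormat(str, Enum):
    """Local stand-in for pyiceberg.manifest.FileFormat (assumption)."""
    PARQUET = "PARQUET"
    ORC = "ORC"


@dataclass
class DataFileStatistics:
    """Simplified stand-in; the class in pyarrow.py tracks more metrics."""
    record_count: int = 0
    null_value_counts: Dict[str, int] = field(default_factory=dict)


class FileFormatWriter(ABC):
    """Writes rows to one data file and reports statistics on close."""

    @abstractmethod
    def write(self, rows: List[Mapping[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Describes one file format and creates writers for it."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str, properties: Mapping[str, str]) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its registered model (keyed by FileFormat only)."""

    def __init__(self) -> None:
        self._models: Dict[FileFormat, FileFormatModel] = {}

    def register(self, model: FileFormatModel) -> None:
        self._models[model.format] = model

    def get(self, file_format: FileFormat) -> FileFormatModel:
        try:
            return self._models[file_format]
        except KeyError:
            raise ValueError(f"No model registered for format: {file_format.value}") from None


class ToyWriter(FileFormatWriter):
    """Hypothetical writer that only counts rows and nulls; writes nothing to disk."""

    def __init__(self) -> None:
        self._stats = DataFileStatistics()

    def write(self, rows: List[Mapping[str, Any]]) -> None:
        for row in rows:
            self._stats.record_count += 1
            for name, value in row.items():
                if value is None:
                    counts = self._stats.null_value_counts
                    counts[name] = counts.get(name, 0) + 1

    def close(self) -> DataFileStatistics:
        return self._stats


class ToyModel(FileFormatModel):
    """Hypothetical model registered under PARQUET purely for the demo."""

    @property
    def format(self) -> FileFormat:
        return FileFormat.PARQUET

    def create_writer(self, path: str, properties: Mapping[str, str]) -> FileFormatWriter:
        return ToyWriter()


registry = FormatRegistry()
registry.register(ToyModel())
writer = registry.get(FileFormat.PARQUET).create_writer("memory://demo", {})
writer.write([{"id": 1, "name": None}, {"id": 2, "name": "b"}])
stats = writer.close()
```

Keying the registry by FileFormat alone keeps lookup trivial; the Java registry's extra type parameters have no direct Python counterpart, which is why the ABCs here carry no type params.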

Changes to pyiceberg/io/pyarrow.py:

  • ParquetFormatWriter / ParquetFormatModel wrapping the existing write_parquet() logic (currently nested inside write_file())
  • write_file() refactored to read write.format.default, look up the format model, and dispatch.
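
The dispatch step could look roughly like the sketch below. It assumes a plain dict as the registry and uses hypothetical names (register_writer, resolve_format, _fake_parquet_writer); the write_file() here is a toy with a simplified signature, not the function in pyarrow.py. The only facts taken from Iceberg are the property key write.format.default and its documented default of parquet.

```python
from typing import Callable, Dict, List, Mapping

WRITE_FORMAT_DEFAULT = "write.format.default"  # Iceberg table property key
FALLBACK_FORMAT = "parquet"                    # Iceberg's documented default

# Hypothetical registry: lower-cased format name -> writer callable.
WriterFn = Callable[[str, List[dict]], str]
_WRITERS: Dict[str, WriterFn] = {}


def register_writer(format_name: str, writer: WriterFn) -> None:
    """Register a writer under a case-insensitive format name."""
    _WRITERS[format_name.lower()] = writer


def resolve_format(properties: Mapping[str, str]) -> str:
    """Read write.format.default, falling back to parquet when unset."""
    return properties.get(WRITE_FORMAT_DEFAULT, FALLBACK_FORMAT).lower()


def write_file(path: str, rows: List[dict], properties: Mapping[str, str]) -> str:
    """Toy write_file(): resolve the format, look up its writer, dispatch."""
    format_name = resolve_format(properties)
    writer = _WRITERS.get(format_name)
    if writer is None:
        raise ValueError(f"Unsupported write format: {format_name}")
    return writer(path, rows)


def _fake_parquet_writer(path: str, rows: List[dict]) -> str:
    # Placeholder: a real writer would produce a file and return statistics.
    return f"parquet:{path}:{len(rows)}"


register_writer("parquet", _fake_parquet_writer)
result = write_file("data/00000.parquet", [{"id": 1}], {})
```

Failing loudly on an unregistered format, rather than silently falling back to Parquet, makes a misconfigured table property visible at write time.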

TCK in tests/io/test_file_format_tck.py:

  • pytest-parameterized round-trip, statistics, type coverage, and null handling tests for every registered format.
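
The idea of running one shared set of checks against every registered format can be sketched as follows. This is an illustration only: json and pickle stand in for real file formats, the check names are hypothetical, and the runner is a plain loop so the example runs without pytest. In the actual TCK each check would presumably be a test function decorated with pytest.mark.parametrize over the registered formats.

```python
import json
import pickle
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Codec = Tuple[Callable[[Rows], bytes], Callable[[bytes], Rows]]

# Hypothetical stand-in "formats": name -> (encode, decode).
FORMATS: Dict[str, Codec] = {
    "json": (
        lambda rows: json.dumps(rows).encode("utf-8"),
        lambda data: json.loads(data.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}

SAMPLE_ROWS: Rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "c"},
]


def check_round_trip(encode, decode) -> None:
    """Rows written and read back must equal the input."""
    assert decode(encode(SAMPLE_ROWS)) == SAMPLE_ROWS


def check_null_handling(encode, decode) -> None:
    """Nulls must survive the round trip as None, not as a sentinel value."""
    rows = decode(encode(SAMPLE_ROWS))
    assert sum(1 for row in rows if row["name"] is None) == 1


CHECKS = [check_round_trip, check_null_handling]


def run_tck() -> List[Tuple[str, str]]:
    """Run every check against every format; return the (format, check) pairs that passed."""
    passed = []
    for name, (encode, decode) in FORMATS.items():
        for check in CHECKS:
            check(encode, decode)  # raises AssertionError on failure
            passed.append((name, check.__name__))
    return passed


passed = run_tck()
```

Because the checks take only an encode/decode pair, a newly registered format gets the full suite without any new test code.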

Phased rollout:

  • ABCs and registry first, then Parquet extraction with TCK tests, then write_file() dispatch

Java ↔ Python Mapping

| Java | Python |
| --- | --- |
| FormatModel<D, S> | FileFormatModel (ABC, no type params) |
| FileAppender<D> / ModelWriteBuilder | FileFormatWriter (ABC) |
| FormatModelRegistry | FormatRegistry (keyed by FileFormat only) |
| Metrics | DataFileStatistics (existing) |
| TCK | test_file_format_tck.py |

Scope

This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support (#20) and any future formats (Avro, etc.) would be follow-ups once this lands.
