Skip to content

Conversation

@rynewang
Copy link
Contributor

@rynewang rynewang commented Jan 3, 2026

Rationale for this change

Fixes #36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice

What changes are included in this PR?

The bug happens because:

  1. Header is written to data_buffer_ and flushed during CSVWriterImpl initialization
  2. The buffer is not cleared after flush
  3. When the next batch is empty, TranslateMinimalBatch returns early without modifying data_buffer_
  4. The write loop then writes data_buffer_ which still contains stale content

The fix introduces a WriteAndClearBuffer() helper that writes the buffer to sink and clears it. This helper is used in all write paths:

  • WriteHeader()
  • WriteRecordBatch()
  • WriteTable()

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

Are these changes tested?

Yes. Added C++ tests in writer_test.cc and Python tests in test_csv.py:

  • Empty batch at start of table
  • Empty batch in middle of table

Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

@github-actions
Copy link

github-actions bot commented Jan 3, 2026

⚠️ GitHub issue #36889 has been automatically assigned in GitHub to PR creator.

@rynewang
Copy link
Contributor Author

rynewang commented Jan 6, 2026

@wgtmac Hi would you mind to also review this? Thanks!

@HuaHuaY
Copy link
Contributor

HuaHuaY commented Jan 9, 2026

I think we can clear data_buffer_ at last line of writeHeader. Relying on the Resize operation in TranslateMinimalBatch to perform implicit clear operation is not very readable.

@rynewang
Copy link
Contributor Author

Clearing after writeHeader is a great idea! However it's not complete and failed my unit tests because there are other Write* methods. Now, I made a helper function and let all Write* to do "write to sink and clear buffer". @HuaHuaY can you help pointing this to a maintainer for approval? Thanks!

@HuaHuaY
Copy link
Contributor

HuaHuaY commented Jan 13, 2026

cc @pitrou @wgtmac . Please take a look.

@github-actions
Copy link

⚠️ GitHub issue #36889 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 13, 2026
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! This looks good to me now.

@pitrou Do you want to take a look at it?

rynewang and others added 5 commits January 27, 2026 15:51
…ch is empty

When writing CSV, if the first record batch was empty, the header would be
written twice. This happened because:

1. Header is written to data_buffer_ and flushed during initialization
2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_
3. The loop then writes data_buffer_ which still contains the header

The fix clears the buffer (resize to 0) when encountering an empty batch,
so the subsequent write outputs nothing.

Added C++ and Python tests for empty batches at start and in middle of tables.

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=1%)
Claude-Steers: 2
Claude-Permission-Prompts: 2
Claude-Escapes: 1
Signed-off-by: Ruiyang Wang <ruiyang@anthropic.com>
Addresses review feedback from HuaHuaY: Instead of clearing the buffer
in TranslateMinimalBatch for empty batches, use a WriteAndClearBuffer()
helper that writes and clears the buffer in all write paths.

This is cleaner because:
- Every write follows the same pattern (write -> clear)
- Easier to reason about for future write stages
- The invariant is explicit: buffer is always clean after flush
Co-authored-by: Gang Wu <ustcwg@gmail.com>
… tests

- Rename WriteAndClearBuffer to FlushToSink (shorter, clearer intent)
- Consolidate Python tests into a single parameterized test with 3 cases:
  empty batch at beginning, middle, and end
@wgtmac wgtmac force-pushed the fix-csv-duplicate-header branch from 902ca15 to 3cc81a2 Compare January 27, 2026 07:52
@wgtmac wgtmac merged commit 0d0e068 into apache:main Jan 27, 2026
65 of 68 checks passed
@wgtmac wgtmac removed the awaiting committer review Awaiting committer review label Jan 27, 2026
@wgtmac
Copy link
Member

wgtmac commented Jan 27, 2026

Thanks @rynewang for fixing this and @HuaHuaY for the review!

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 0d0e068.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Python] Duplicate csv header when table batches start with empty

3 participants