sketch out sync codecs + threadpool #3715
d-v-b wants to merge 26 commits into zarr-developers:main from
Conversation
docs/design/sync-bypass.md
Outdated
```diff
@@ -0,0 +1,228 @@
+# Design: Fully Synchronous Read/Write Bypass
```
performance impact ranges from "good" to "amazing" so I think we want to learn from this PR. IMO this is NOT a merge candidate but rather should function as a proof-of-concept for what we can get if we rethink our current codec API. Some key points:
|
|
the current performance improvements are without any parallelism. I'm adding that now.

the latest commit adds thread-based parallelism to the synchronous codec pipeline. we compute an estimated compute cost based on the chunk size, codecs, and operation (encode / decode), and use that estimate to choose a parallelism strategy, ranging from no threads to full use of a thread pool.

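A minimal sketch of that kind of cost-based strategy selection (all names and constants below are hypothetical placeholders for illustration, not the PR's actual values):

```python
# Hypothetical constants; the real weights would be tuned from benchmarks.
POOL_OVERHEAD_NS = 50_000  # rough cost of dispatching one task to the pool
CODEC_DECODE_NS_PER_BYTE = {"GzipCodec": 8.0, "ZstdCodec": 1.0, "BytesCodec": 0.0}


def estimate_chunk_cost_ns(chunk_nbytes: int, codec_names: list[str]) -> float:
    """Estimate the decode cost of one chunk from per-codec ns/byte weights."""
    return sum(CODEC_DECODE_NS_PER_BYTE.get(n, 1.0) * chunk_nbytes for n in codec_names)


def choose_strategy(
    n_chunks: int, chunk_nbytes: int, codec_names: list[str], max_workers: int = 8
) -> int:
    """Return 0 to run inline, otherwise the number of pool workers to use."""
    per_chunk_ns = estimate_chunk_cost_ns(chunk_nbytes, codec_names)
    if n_chunks < 2 or per_chunk_ns < POOL_OVERHEAD_NS:
        # dispatch overhead would dominate; decode synchronously inline
        return 0
    return min(n_chunks, max_workers)
```

The idea is the same as described above: cheap codecs on small chunks never touch the pool, while expensive codecs on many large chunks fan out to a bounded number of workers.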
marking this as not a draft, because I think we should actually consider merging it.

i added a changelog entry and made a breaking change: removal of the

This is extremely hard to review at the moment. Can we look at a new PR with just one affected codec (Zstd?) please?

the changes here aren't really made at the granularity of a single codec. We have new codec pipeline behavior, which requires new methods on stores AND codecs. When the codec pipeline identifies that all the codecs AND the store support the fast path, then it uses the fast path. So breaking that apart is difficult.
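As a sketch of that gating logic (the attribute and class names here are hypothetical; the PR's actual method names may differ), the pipeline probes every participant and only takes the synchronous path when all of them opt in:

```python
def can_use_fast_path(store: object, codecs: list[object]) -> bool:
    """Take the synchronous fast path only if the store and every codec opt in."""
    return getattr(store, "supports_sync", False) and all(
        getattr(c, "supports_sync", False) for c in codecs
    )


# Minimal stand-ins to show the shape of the check:
class FakeStore:
    supports_sync = True


class FakeCodec:
    def __init__(self, sync: bool) -> None:
        self.supports_sync = sync
```

A single codec (or store) that lacks synchronous support silently routes the whole pipeline back to the existing async path, which is why the change cannot easily be scoped to one codec.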
```diff
-        return await asyncio.to_thread(
-            as_numpy_array_wrapper, GZip(self.level).decode, chunk_bytes, chunk_spec.prototype
-        )
+        return await asyncio.to_thread(self._decode_sync, chunk_bytes, chunk_spec)
```
as an aside, these to_thread calls are extremely annoying; they run on an independent thread pool, not the one Zarr sets up (and are thus unconstrained by any config setting).
instead we need something like this:
https://github.com/earth-mover/xpublish-tiles/blob/1a800e05617d609098bbcd1a1f5ac9bbdcb531aa/src/xpublish_tiles/lib.py#L147-L152
yes to_thread has serious problems: python/cpython#136084. I will drop in your async_run idea!
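A rough sketch of such a helper, assuming the pattern from the linked code: run the work on an explicitly managed executor via `loop.run_in_executor`, so the pool (and its size) is governed by the application's config rather than by `asyncio.to_thread`'s default executor.

```python
import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

# Application-managed pool; in Zarr this would be sized from the config.
_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="codec")


async def async_run(func, /, *args, **kwargs):
    """Run func(*args, **kwargs) on the shared pool and await the result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_POOL, functools.partial(func, *args, **kwargs))
```

One caveat with this sketch: unlike `asyncio.to_thread`, `run_in_executor` does not automatically propagate `contextvars` to the worker thread, so code relying on context-local state would need to copy the context explicitly.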
src/zarr/core/codec_pipeline.py
Outdated
```python
_CODEC_DECODE_NS_PER_BYTE: dict[str, float] = {
    # Near-zero cost — just reshaping/copying/checksumming
    "BytesCodec": 0,
    "Crc32cCodec": 0,
    "TransposeCodec": 0,
    "VLenUTF8Codec": 0,
    "VLenBytesCodec": 0,
    # Medium cost — fast C codecs, GIL released
    "ZstdCodec": 1,
    "BloscCodec": 0.5,
    # High cost — slower C codecs, GIL released
    "GzipCodec": 8,
}

_CODEC_ENCODE_NS_PER_BYTE: dict[str, float] = {
    # Near-zero cost — just reshaping/copying/checksumming
    "BytesCodec": 0,
    "Crc32cCodec": 0,
    "TransposeCodec": 0,
    "VLenUTF8Codec": 0,
    "VLenBytesCodec": 0,
    # Medium cost — fast C codecs, GIL released
    "ZstdCodec": 3,
    "BloscCodec": 2,
    # High cost — slower C codecs, GIL released
    "GzipCodec": 50,
}
```
@dcherian here's the estimated cost of running each codec in the encode and decode path
mkitti left a comment
Could we adjust work estimates based on codec parameters?
src/zarr/core/codec_pipeline.py
Outdated
```python
    "VLenUTF8Codec": 0,
    "VLenBytesCodec": 0,
    # Medium cost — fast C codecs, GIL released
    "ZstdCodec": 3,
```
Can we adjust by compression level? Compression level -1000 is different from compression level 22 in terms of time.
yes we could put this in the model. we would have to take some data first of course
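A hypothetical way to fold the level into the model; the scaling factors below are invented for illustration and would need the measurements discussed above:

```python
_ZSTD_BASE_ENCODE_NS_PER_BYTE = 3.0  # baseline weight at a mid-range level


def zstd_encode_ns_per_byte(level: int) -> float:
    """Scale the per-byte encode estimate by compression level (illustrative)."""
    if level <= 0:
        # negative "fast" levels trade compression ratio for speed
        return _ZSTD_BASE_ENCODE_NS_PER_BYTE * 0.3
    # crude linear growth toward the slow high levels (zstd maxes out at 22)
    return _ZSTD_BASE_ENCODE_NS_PER_BYTE * (1 + level / 4)
```

The real relationship between level and throughput is nonlinear and data-dependent, so a fitted curve per codec would likely replace this linear stand-in.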
```python
    if per_chunk_ns < _POOL_OVERHEAD_NS and min_workers == 0:
        # Only use the pool when at least one codec does real work
        # and the chunks are large enough to offset dispatch overhead.
        has_expensive = any(type(c).__name__ not in _CHEAP_CODECS for c in codecs)
```
uhh... can we not isinstance because of cyclic imports or something?
```python
_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB


def _choose_workers(n_chunks: int, chunk_nbytes: int, codecs: Iterable[Codec]) -> int:
```
Can this be def _use_thread_pool(...)->bool instead?
```diff
-def _get_pool(max_workers: int) -> ThreadPoolExecutor:
-    """Get a thread pool with at most *max_workers* threads."""
+def _get_pool() -> ThreadPoolExecutor:
```
hard to see why this had to change but... i'm not opposed to it.
```python
    """Get the module-level thread pool, creating it lazily."""
    global _pool
    if _pool is None:
        max_workers: int = config.get("threading.codec_workers").get("max") or os.cpu_count() or 4
```
this is duplicated in `_choose_workers`; doesn't donfig have a way to do runtime defaults?
This is a work in progress with all the heavy lifting done by claude. The goal is to improve the performance of our codecs by avoiding overhead in `to_thread` and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.