Feature: support rectilinear chunk grid extension #3534

Open
jhamman wants to merge 35 commits into zarr-developers:main from jhamman:feature/rectilinear-chunk-grid

Conversation

@jhamman
Member

@jhamman jhamman commented Oct 20, 2025

Summary

Adds support for the RectilinearChunkGrid extension (Zarr v3), enabling arrays whose chunk sizes vary along each dimension.

Closes: #1595 | Replaces: #1483 | Related: zarr-extensions#25

Key Features

RectilinearChunkGrid

import zarr

arr = zarr.create_array(
    store={},          # illustrative: in-memory store
    shape=(60, 100),
    dtype="float32",   # illustrative dtype
    chunks=[[10, 20, 30], [25, 25, 25, 25]],
    zarr_format=3,
)
  • Zarr v3 only (not compatible with v2, sharding, or from_array())
  • Supports RLE in JSON metadata: [[10, 6]] = 6 chunks of size 10
  • Stored internally in expanded format for fast indexing
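The RLE note above can be illustrated with a minimal sketch. It assumes that each entry in one dimension's spec is either a bare integer (a single chunk) or a `[size, count]` pair; the helper names `expand_rle` and `compress_rle` are hypothetical and are not taken from this PR.

```python
from itertools import groupby


def expand_rle(dim_spec):
    """Expand one dimension's chunk spec into a flat tuple of chunk sizes.

    Assumption: each entry is a bare int (one chunk) or a [size, count] pair.
    """
    sizes = []
    for entry in dim_spec:
        if isinstance(entry, int):
            sizes.append(entry)
        else:
            size, count = entry
            sizes.extend([size] * count)
    return tuple(sizes)


def compress_rle(sizes):
    """Collapse runs of equal chunk sizes into [size, count] pairs."""
    out = []
    for size, run in groupby(sizes):
        count = len(list(run))
        out.append([size, count] if count > 1 else size)
    return out


print(expand_rle([[10, 6]]))            # (10, 10, 10, 10, 10, 10)
print(compress_rle([10, 20, 20, 30]))   # [10, [20, 2], 30]
```

Keeping the expanded form in memory trades a little space for direct per-chunk lookups, which matches the "expanded format for fast indexing" note.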

Chunk Grid Access

grid = arr.chunk_grid                    # Returns ChunkGrid instance
grid.chunk_shapes                        # ((10, 20, 30), (25, 25, 25, 25))
isinstance(grid, RectilinearChunkGrid)   # Type-safe checking

.chunks Property Behavior

# RegularChunkGrid: returns tuple with FutureWarning (deprecated)
arr.chunks  # (10, 10)

# RectilinearChunkGrid: raises NotImplementedError
arr.chunks  # Use .chunk_grid instead

Design Decisions

| Decision | Rationale |
| --- | --- |
| ChunksLike as TypeAlias | Flexible input types without runtime overhead |
| ResolvedChunkSpec as frozen dataclass | Named access, immutability, IDE support |
| Standalone validation functions | Testability, clear error messages, early validation |
| .chunks raises for rectilinear | No sensible single-tuple representation; guides users to .chunk_grid |

Removed from Earlier Designs

| Item | Reason |
| --- | --- |
| RegularChunks/RectilinearChunks tuple subclasses | Rejected - unnecessary complexity |
| Named dimension access (chunks.lat) | Removed per review feedback |
| ChunksType ABC hierarchy | Not implemented - TypeAlias approach preferred |

Deferred / TODO Items

| Item | Location | Notes |
| --- | --- | --- |
| update_shape() optional chunks parameter | metadata/v3.py:483 | Allow specifying new chunk sizes when resizing instead of default heuristic |
| Validation function placement | chunk_grids.py:1513-1593 | Reviewer suggested moving to metadata module; kept for testability |

Review Focus Areas

High Priority:

  • chunk_grids.py: RectilinearChunkGrid class, ChunksLike type, RLE expansion/compression, resolve_chunk_spec()
  • metadata/v3.py: update_shape() for rectilinear resize behavior
  • indexing.py: Variable chunk indexing with binary search
  • array.py: .chunks property behavior, .chunk_grid property
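For reviewers looking at the indexing.py item, variable chunk indexing with binary search can be pictured roughly as follows. This is a sketch under assumptions, not the code in this PR: `chunk_offsets` and `index_to_chunk` are hypothetical names, and the lookup uses the standard-library `bisect` module over cumulative chunk offsets.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_offsets(chunk_sizes):
    """Start offset of every chunk, with the total extent as the last entry."""
    return [0, *accumulate(chunk_sizes)]


def index_to_chunk(offsets, index):
    """Map one array index to (chunk_number, offset_within_chunk)."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(f"index {index} is outside the extent {offsets[-1]}")
    # bisect_right finds the first offset strictly greater than index;
    # the chunk containing index starts one position earlier.
    chunk = bisect_right(offsets, index) - 1
    return chunk, index - offsets[chunk]


offsets = chunk_offsets((10, 20, 30))   # [0, 10, 30, 60]
print(index_to_chunk(offsets, 0))       # (0, 0)
print(index_to_chunk(offsets, 10))      # (1, 0)
print(index_to_chunk(offsets, 59))      # (2, 29)
```

Each lookup costs O(log n) in the number of chunks along the dimension, whereas a regular grid can use a single integer division.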

Tests:

  • test_chunk_grids/test_rectilinear.py: Comprehensive unit tests
  • test_chunk_grids/test_rectilinear_integration.py: End-to-end scenarios
  • testing/strategies.py: Hypothesis strategies for property-based testing

Breaking Changes

None. Fully backward compatible.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes label Oct 20, 2025
@codecov

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 78.85533% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.65%. Comparing base (b712f96) to head (41db2dc).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/zarr/core/chunk_grids.py | 74.49% | 76 Missing ⚠️ |
| src/zarr/core/array.py | 75.00% | 20 Missing ⚠️ |
| src/zarr/core/indexing.py | 85.07% | 20 Missing ⚠️ |
| src/zarr/testing/strategies.py | 89.28% | 9 Missing ⚠️ |
| src/zarr/core/metadata/v2.py | 72.72% | 6 Missing ⚠️ |
| src/zarr/core/_info.py | 0.00% | 1 Missing ⚠️ |
| src/zarr/core/metadata/v3.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   60.94%   61.65%   +0.71%     
==========================================
  Files          86       86              
  Lines       10268    10769     +501     
==========================================
+ Hits         6258     6640     +382     
- Misses       4010     4129     +119     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/zarr/api/asynchronous.py | 72.20% <ø> (ø) |
| src/zarr/api/synchronous.py | 36.61% <ø> (ø) |
| src/zarr/core/group.py | 70.27% <ø> (ø) |
| src/zarr/core/_info.py | 51.80% <0.00%> (ø) |
| src/zarr/core/metadata/v3.py | 59.91% <90.00%> (+1.90%) ⬆️ |
| src/zarr/core/metadata/v2.py | 60.31% <72.72%> (+2.17%) ⬆️ |
| src/zarr/testing/strategies.py | 94.18% <89.28%> (-3.66%) ⬇️ |
| src/zarr/core/array.py | 67.99% <75.00%> (-0.12%) ⬇️ |
| src/zarr/core/indexing.py | 70.19% <85.07%> (+0.73%) ⬆️ |
| src/zarr/core/chunk_grids.py | 70.70% <74.49%> (+8.40%) ⬆️ |

... and 1 file with indirect coverage changes

@github-actions github-actions bot removed the needs release notes label Oct 20, 2025


@dataclass(frozen=True)
class RectilinearChunkGrid(ChunkGrid):
Contributor

thoughts on just calling this class Rectilinear, and renaming the RegularChunkGrid to Regular? We could keep around a RegularChunkGrid class for compatibility. But I feel like people know these are chunk grids when they import them

Member Author

50/50. I think the more descriptive class name is useful when looking at tracebacks. Plus, this is currently in .core, so it's not meant to be used directly by users.

@given(data=st.data())
async def test_basic_indexing(data: st.DataObject) -> None:
zarray = data.draw(simple_arrays())
@given(data=st.data(), zarray=st.one_of([simple_arrays(), complex_chunked_arrays()]))
Contributor

@dcherian dcherian Oct 27, 2025

Because the search space for the standard arrays strategy is so large, I made a separate strategy, complex_chunked_arrays, that purely exercises different chunk grids.
With simple_arrays() we spend only 10% of our time trying RectilinearChunkGrid, hence this approach. We should boost the number of examples too.

Comment on lines +668 to +669
2. **Not compatible with sharding**: You cannot use variable chunking together with
the sharding feature. Arrays must use either variable chunking or sharding, but not both.
Contributor

I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.

Member Author



@dataclass(frozen=True)
class ResolvedChunkSpec:
Contributor

does this need to be a class? It has no methods. Seems like either a TypedDict or just a tuple would work

shards: tuple[int, ...] | None


def _validate_zarr_format_compatibility(
Contributor

why do we need this here? Can't we check when we construct the array metadata document?

)


def _validate_sharding_compatibility(
Contributor

shouldn't this be defined over in the array v3 metadata module, and used there to check that the array metadata document is valid?

)


def _validate_data_compatibility(
Contributor

does this need to be a stand-alone function? Are we going to use this logic anywhere other than its current usage site (resolve_chunk_spec)?

@@ -436,6 +468,21 @@ def to_dict(self) -> dict[str, JSON]:
return out_dict

def update_shape(self, shape: tuple[int, ...]) -> Self:
Contributor

IMO this needs to take a parameter that can define the new chunks (which can default to None or some other sentinel flagging "default behavior" semantics)

@d-v-b d-v-b mentioned this pull request Jan 28, 2026
@tomwhite
Member

FYI I tested this PR for implementing rechunking with variable-sized intermediate chunks in Cubed - and it worked!

I found one wrinkle in that zarr.create_array supports rectilinear chunk grids, but zarr.open (with mode="w") doesn't. But that could be addressed later.

@jhamman jhamman closed this Feb 3, 2026
@jhamman jhamman reopened this Feb 3, 2026
@d-v-b
Contributor

d-v-b commented Feb 4, 2026

I will give this a spin today, but assuming everything works, my question is: how can we ship this soon (this week?) while making it clear that the feature is experimental?

@tinaok

tinaok commented Feb 4, 2026

I have a variably chunked array:

import xarray as xr
from healpix_geo import nested 
import numpy as np

da = xr.open_zarr(
    "https://data-taos.ifremer.fr/GRID4EARTH/no_chunk_healpix.zarr",
    consolidated=True,   # if metadata is consolidated
).da

depth = da.cell_ids.attrs['level']
new_depth = depth - 6
parents = nested.zoom_to(da.cell_ids, depth=depth, new_depth=new_depth)
_, chunk_sizes = np.unique(parents, return_counts=True)
da.chunk({"cell_ids": tuple(chunk_sizes.tolist())})

Happy to experiment with writing it using this experimental feature (if it works with xarray...)
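The chunk-size computation in the snippet above can be tried without the remote dataset. The sketch below is an assumption-laden stand-in: it replaces `nested.zoom_to` with a bit shift, relying on the fact that in the nested HEALPix scheme a cell's parent one level up is its id divided by 4, and the helper name `chunk_sizes_by_parent` is hypothetical. It also assumes the cell ids are sorted so that each parent's cells are contiguous.

```python
import numpy as np


def chunk_sizes_by_parent(cell_ids, levels_up):
    """Count how many consecutive cells fall under each coarser parent cell.

    Assumption: nested-scheme ids, where moving up one level is id >> 2.
    """
    cell_ids = np.asarray(cell_ids)
    parents = cell_ids >> (2 * levels_up)
    if np.any(np.diff(parents) < 0):
        raise ValueError("cell_ids must be sorted so each parent's cells are contiguous")
    _, counts = np.unique(parents, return_counts=True)
    return tuple(counts.tolist())


cell_ids = np.array([0, 1, 3, 4, 5, 6, 7, 12])
print(chunk_sizes_by_parent(cell_ids, levels_up=1))  # (3, 4, 1)
```

The resulting tuple sums to the number of cells, so it can serve as the per-dimension chunk sizes for a rectilinear grid along `cell_ids`.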

@maxrjones
Member

@jhamman I'm finally reviewing this. I'd like to get you an approval today. The one issue that I think we should address first is a performance regression in indexing, where IntArrayDimIndexer and CoordinateIndexer replaced a vectorized numpy operation (dim_sel // dim_chunk_len) with a per-element for-loop calling array_index_to_chunk_coord(). This affects both RegularChunkGrid and RectilinearChunkGrid, which is why I think this is a blocking regression. I opened a PR against your branch with a potential fix: jhamman#5. This is a plot from https://github.com/maxrjones/zarr-chunk-grid-tests showing the performance impact:
[image: indexing performance plot]
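To make the regression concrete, here is a hedged sketch of how both grid types could compute chunk coordinates for a whole index array at once. It is not the fix proposed in jhamman#5; the function names are hypothetical, and the rectilinear case assumes `np.searchsorted` over cumulative chunk ends as the vectorized counterpart of a per-element lookup.

```python
import numpy as np


def chunk_coords_regular(dim_sel, dim_chunk_len):
    """Regular grid: one vectorized integer division."""
    return dim_sel // dim_chunk_len


def chunk_coords_rectilinear(dim_sel, chunk_sizes):
    """Rectilinear grid: vectorized binary search over cumulative chunk ends."""
    ends = np.cumsum(chunk_sizes)
    # side="right" sends an index equal to a chunk's end into the next chunk.
    return np.searchsorted(ends, dim_sel, side="right")


dim_sel = np.array([0, 9, 10, 29, 30, 59])
print(chunk_coords_regular(dim_sel, 10))                 # [0 0 1 2 3 5]
print(chunk_coords_rectilinear(dim_sel, (10, 20, 30)))   # [0 0 1 1 2 2]
```

Both functions avoid a Python-level loop over the selection, which is the property the original `dim_sel // dim_chunk_len` expression had.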

@d-v-b d-v-b added the benchmark label Feb 16, 2026
Member

@maxrjones maxrjones left a comment

These changes add validation to reject rectilinear chunk specifications where the last chunk would contain no valid data.

For example, chunks=[[5, 5, 5]] with shape=(10,) would previously silently create an extra third chunk at slice(10, 15), entirely outside the array.

This matters because without it, get_nchunks() and all_chunk_coords() report the extra unused chunks as real, which could cause issues downstream. Also, the code would accept what are most likely user errors rather than catching them at construction time.
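The rule described here can be sketched as a standalone check. The function name `validate_chunks_cover_axis` and the wording of the first error are assumptions; the second condition mirrors the `sum(...[:-1]) >= size` test from the suggested changes in this review.

```python
def validate_chunks_cover_axis(axis_chunks, arr_size, axis=0):
    """Reject chunk lists that under-cover the axis or carry unused trailing chunks."""
    total = sum(axis_chunks)
    if total < arr_size:
        raise ValueError(
            f"Chunks along axis {axis} sum to {total} but array shape is {arr_size}."
        )
    # If every chunk except the last already covers the axis, the last
    # chunk would start at or beyond the end of the array.
    if sum(axis_chunks[:-1]) >= arr_size:
        raise ValueError(
            f"Axis {axis} has more chunks than needed. "
            "The last chunk(s) would contain no valid data."
        )


validate_chunks_cover_axis([5, 5], 10)       # exact cover: accepted
validate_chunks_cover_axis([5, 5, 5], 12)    # last chunk partly used: accepted
try:
    validate_chunks_cover_axis([5, 5, 5], 10)
except ValueError as err:
    print(err)
```

Note that a final chunk overhanging the array is still allowed; only chunks that lie entirely outside the array are rejected.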

f"but array shape is {arr_size}. This is invalid for the "
"RectilinearChunkGrid."
)

Member

Suggested change
if sum(axis_chunks[:-1]) >= arr_size:
    raise ValueError(
        f"Axis {axis} has more chunks than needed. "
        "The last chunk(s) would contain no valid data. "
        "Remove the extra chunk(s) or increase the array shape."
    )

f"Variable chunks along dimension {i} sum to {chunk_sum} "
f"but array shape is {dim_size}. Chunks must sum to be greater than or equal to the shape."
)

Member

@maxrjones maxrjones Feb 16, 2026

Suggested change
if sum(dim_chunks[:-1]) >= dim_size:
    raise ValueError(
        f"Dimension {i} has more chunks than needed. "
        "The last chunk(s) would contain no valid data. "
        "Remove the extra chunk(s) or increase the array shape."
    )

assert arr.nchunks == 12 # 3 chunks in dim 0, 4 chunks in dim 1

# Can also get nchunks from the chunk_grid directly
assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12
Member

Suggested change
assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12
assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12

def test_rectilinear_validate_chunk_sum_exceeds_array_size() -> None:
    """Test that chunk sizes summing to more than array size are rejected.

    chunks=[5, 5, 5] (sum=15) for shape=(10,) should fail because the chunks
    describe more data than the array contains.
    """
    grid = RectilinearChunkGrid(chunk_shapes=[[5, 5, 5]])
    with pytest.raises(ValueError, match="more chunks"):
        grid.get_nchunks((10,))

)


# Edge case tests
Member

Suggested change
# Edge case tests
def test_resolve_chunk_spec_error_chunk_sum_exceeds_array_size() -> None:
    """Test that variable chunks summing to more than array size raise error.

    shape=10 with chunks=[5, 5, 5] (sum=15) should be rejected because it
    specifies more data than the array can hold, which could cause confusing
    runtime errors or data corruption.
    """
    with pytest.raises(ValueError, match="more chunks"):
        resolve_chunk_spec(
            chunks=[[5, 5, 5]],  # sum=15 > shape=10
            shards=None,
            shape=(10,),
            dtype_itemsize=4,
            zarr_format=3,
        )

# Edge case tests

@maxrjones
Member

I tried this out in virtual tiff (virtual-zarr/virtual-tiff#69) and virtualizarr (zarr-developers/VirtualiZarr#877) and am super excited about this feature. I expect that it'll unlock a lot of downstream development and absolutely support releasing it this week as experimental. I think that the experimental status is already effectively documented.

My only additional comment beyond my two reviews earlier today is that I'm not convinced of the value of deprecating .chunks for RegularChunkGrid, but wouldn't want that concern to block merging.

Thanks for your work on this @jhamman!

Labels

benchmark Code will be benchmarked in a CI job.

8 participants