Feature: support rectilinear chunk grid extension #3534
Feature: support rectilinear chunk grid extension #3534jhamman wants to merge 35 commits intozarr-developers:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3534 +/- ##
==========================================
+ Coverage 60.94% 61.65% +0.71%
==========================================
Files 86 86
Lines 10268 10769 +501
==========================================
+ Hits 6258 6640 +382
- Misses 4010 4129 +119
🚀 New features to boost your workflow:
|
…ature/rectilinear-chunk-grid
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class RectilinearChunkGrid(ChunkGrid): |
There was a problem hiding this comment.
thoughts on just calling this class Rectilinear, and renaming the RegularChunkGrid to Regular? We could keep around a RegularChunkGrid class for compatibility. But I feel like people know these are chunk grids when they import them
There was a problem hiding this comment.
50/50. I think the more descriptive class is useful when looking at a tracebacks. Plus, this is currently in .core so its not meant to be used directly by users.
| @given(data=st.data()) | ||
| async def test_basic_indexing(data: st.DataObject) -> None: | ||
| zarray = data.draw(simple_arrays()) | ||
| @given(data=st.data(), zarray=st.one_of([simple_arrays(), complex_chunked_arrays()])) |
There was a problem hiding this comment.
Because the search space for the standard arrays strategy is so large, i made a different one complex_chunked_arrays that purely checks different chunk grids
with simple_arrays() we are only spending 10% of our time trying RectilinearChunkGrid so using this approach. We should boost number of examples too.
| 2. **Not compatible with sharding**: You cannot use variable chunking together with | ||
| the sharding feature. Arrays must use either variable chunking or sharding, but not both. |
There was a problem hiding this comment.
I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.
There was a problem hiding this comment.
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ResolvedChunkSpec: |
There was a problem hiding this comment.
does this need to be a class? It has no methods. Seems like either a TypedDict or just a tuple would work
| shards: tuple[int, ...] | None | ||
|
|
||
|
|
||
| def _validate_zarr_format_compatibility( |
There was a problem hiding this comment.
why do we need this here? Can't we check when we construct the array metadata document?
| ) | ||
|
|
||
|
|
||
| def _validate_sharding_compatibility( |
There was a problem hiding this comment.
shouldn't this be defined over in the array v3 metadata module, and used there to check that the array metadata document is valid?
| ) | ||
|
|
||
|
|
||
| def _validate_data_compatibility( |
There was a problem hiding this comment.
does this need to be stand-alone function? are we going to use this logic anywhere other than it's current usage site (resolve_chunk_spec)
| @@ -436,6 +468,21 @@ def to_dict(self) -> dict[str, JSON]: | |||
| return out_dict | |||
|
|
|||
| def update_shape(self, shape: tuple[int, ...]) -> Self: | |||
There was a problem hiding this comment.
IMO this needs to take a parameter that can define the new chunks (which can default to None or some other sentinel flagging "default behavior" semantics)
|
FYI I tested this PR for implementing rechunking with variable-sized intermediate chunks in Cubed - and it worked! I found one wrinkle in that |
…arr-python into feature/rectilinear-chunk-grid
…ature/rectilinear-chunk-grid
|
I will give this a spin today, but assuming everything works, my question is: how can we ship this soon (this week?) while making it clear that the feature is experimental? |
|
I have varient chunked array; Happy to experiment to write it using this experimental feature(if it works with xarray...) |
|
@jhamman I'm finally reviewing this. I'd like to get you an approval today. The one issue that I think we should address first is a performance regression in indexing, where |
maxrjones
left a comment
There was a problem hiding this comment.
These changes add validation to reject rectilinear chunk specifications where the last chunk would contain no valid data.
For example, previously chunks=[[5, 5, 5]] with shape=(10,) would silently create a extra 3rd chunk at slice(10, 15) entirely outside the array.
This matters because without it, get_nchunks() and all_chunk_coords() report the extra unused chunks as real, which could cause issue downstream. Also, the code would accept (most likely) user errors rather than catching the errors at construction time.
| f"but array shape is {arr_size}. This is invalid for the " | ||
| "RectilinearChunkGrid." | ||
| ) | ||
|
|
There was a problem hiding this comment.
| if sum(axis_chunks[:-1]) >= arr_size: | |
| raise ValueError( | |
| f"Axis {axis} has more chunks than needed " | |
| f"The last chunk(s) would contain no valid data." | |
| f"Remove the extra chunk(s) or increase the array shape." | |
| ) | |
| f"Variable chunks along dimension {i} sum to {chunk_sum} " | ||
| f"but array shape is {dim_size}. Chunks must sum to be greater than or equal to the shape." | ||
| ) | ||
|
|
There was a problem hiding this comment.
| if sum(dim_chunks[:-1]) >= dim_size: | |
| raise ValueError( | |
| f"Dimension {i} has more chunks than needed " | |
| f"The last chunk(s) would contain no valid data." | |
| f"Remove the extra chunk(s) or increase the array shape." | |
| ) | |
| assert arr.nchunks == 12 # 3 chunks in dim 0, 4 chunks in dim 1 | ||
|
|
||
| # Can also get nchunks from the chunk_grid directly | ||
| assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12 |
There was a problem hiding this comment.
| assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12 | |
| assert arr.metadata.chunk_grid.get_nchunks((60, 100)) == 12 | |
| def test_rectilinear_validate_chunk_sum_exceeds_array_size() -> None: | |
| """Test that chunk sizes summing to more than array size are rejected. | |
| chunks=[5, 5, 5] (sum=15) for shape=(10,) should fail because the chunks | |
| describe more data than the array contains. | |
| """ | |
| grid = RectilinearChunkGrid(chunk_shapes=[[5, 5, 5]]) | |
| with pytest.raises(ValueError, match="more chunks"): | |
| grid.get_nchunks((10,)) |
| ) | ||
|
|
||
|
|
||
| # Edge case tests |
There was a problem hiding this comment.
| # Edge case tests | |
| def test_resolve_chunk_spec_error_chunk_sum_exceeds_array_size() -> None: | |
| """Test that variable chunks summing to more than array size raise error. | |
| shape=10 with chunks=[5, 5, 5] (sum=15) should be rejected because it | |
| specifies more data than the array can hold, which could cause confusing | |
| runtime errors or data corruption. | |
| """ | |
| with pytest.raises(ValueError, match="more chunks"): | |
| resolve_chunk_spec( | |
| chunks=[[5, 5, 5]], # sum=15 > shape=10 | |
| shards=None, | |
| shape=(10,), | |
| dtype_itemsize=4, | |
| zarr_format=3, | |
| ) | |
| # Edge case tests |
|
I tried this out in virtual tiff (virtual-zarr/virtual-tiff#69) and virtualizarr (zarr-developers/VirtualiZarr#877) and am super excited about this feature. I expect that it'll unlock a lot of downstream development and absolutely support releasing it this week as experimental. I think that the experimental status is already effectively documented. My only additional comment beyond my two reviews earlier today is that I'm not convinced of the value of deprecating Thanks for your work on this @jhamman! |

Summary
Adds support for RectilinearChunkGrid extension (Zarr v3), enabling arrays with variable chunk sizes per dimension.
Closes: #1595 | Replaces: #1483 | Related: zarr-extensions#25
Key Features
RectilinearChunkGrid
from_array())[[10, 6]]= 6 chunks of size 10Chunk Grid Access
.chunksProperty BehaviorDesign Decisions
ChunksLikeas TypeAliasResolvedChunkSpecas frozen dataclass.chunksraises for rectilinear.chunk_gridRemoved from Earlier Designs
RegularChunks/RectilinearChunkstuple subclasseschunks.lat)ChunksTypeABC hierarchyDeferred / TODO Items
update_shape()optional chunks parametermetadata/v3.py:483chunk_grids.py:1513-1593Review Focus Areas
High Priority:
chunk_grids.py:RectilinearChunkGridclass,ChunksLiketype, RLE expansion/compression,resolve_chunk_spec()metadata/v3.py:update_shape()for rectilinear resize behaviorindexing.py: Variable chunk indexing with binary searcharray.py:.chunksproperty behavior,.chunk_gridpropertyTests:
test_chunk_grids/test_rectilinear.py: Comprehensive unit teststest_chunk_grids/test_rectilinear_integration.py: End-to-end scenariostesting/strategies.py: Hypothesis strategies for property-based testingBreaking Changes
None. Fully backward compatible.
TODO:
docs/user-guide/*.mdchanges/