Skip to content

Add support for structured dtypes to zarr3 driver, open structs as void#271

Open
BrianMichell wants to merge 67 commits intogoogle:masterfrom
BrianMichell:v3_structs_and_void
Open

Add support for structured dtypes to zarr3 driver, open structs as void#271
BrianMichell wants to merge 67 commits intogoogle:masterfrom
BrianMichell:v3_structs_and_void

Conversation

@BrianMichell
Copy link
Contributor

Supersedes #264

Resolves comments 1 and 2

Implement shim for `open_as_void` driver level flag
* Begin removing void field shim

* Fully removed void string shim

* Cleanup debug prints

* Remove shimmed validation

* Remove unnecessary comment

* Prefer false over zero for ternary clarity
* Implement a more general and portable example set

* Fix driver cache bug

* Update example for template

* Cleanup example

* Remove testing examples from source
* Use the appropriate fill value for open_as_void structured data

* Cleanup
@laramiel
Copy link
Collaborator

I'll try to get to this in about a week, before I look this one over, please double check that the prior PR works for you. Also look over this one and see if any of the suggestions from the other one applies.

Matches the pattern from zarr v2 driver (PR google#272). When both "field"
and "open_as_void" are specified in the spec, return an error since
these options are mutually exclusive - field selects a specific field
from a structured array, while open_as_void provides raw byte access
to the entire structure.
The zarr3 URL syntax cannot represent field selection or void access
mode. Following the pattern from zarr v2 driver (PR google#272), ToUrl() now
returns an error when either of these options is specified instead of
silently ignoring them.
…trip

Following the pattern from zarr v2 driver (PR google#272), override
GetBoundSpecData in ZarrDataCache to set spec.open_as_void from
ChunkCacheImpl::open_as_void_. This ensures that when you open a
store with open_as_void=true and then call spec(), the resulting
spec correctly has open_as_void=true set.

Without this fix, opening a store with open_as_void=true and then
getting its spec would lose the open_as_void flag, causing incorrect
behavior if the spec is used to re-open the store.
The ZarrShardSubChunkCache template had duplicate member variables
(open_as_void_, original_is_structured_, bytes_per_element_) that
were already present in the base class ChunkCacheImpl (ZarrLeafChunkCache).

Access these through ChunkCacheImpl:: prefix instead to follow DRY
principle and maintain consistency with other TensorStore patterns.
Reviewed the code for potential inconsistencies and fixed some bugs
@laramiel
Copy link
Collaborator

laramiel commented Feb 3, 2026

FWIW I see a new assert failure in this PR:

[ RUN      ] StorageStatisticsTest.FullyLexicographicOrder
F0202 23:48:22.663513    9604 logging.cc:51] assert.h assertion failed at tensorstore/util/span.h:366 in span<element_type, dynamic_extent> tensorstore::span<const long>::subspan(ptrdiff_t, ptrdiff_t) const [T = const long, Extent = -1]: offset >= 0 && (count == dynamic_extent || (count >= 0 && offset + count <= size()))
*** Check failure stack trace: ***
...
    @     0x7fc7ec534374  __assert_fail
    @     0x7fca7366d7c3  tensorstore::span<>::subspan()
    @     0x7fca7366463b  tensorstore::internal_zarr3::(anonymous namespace)::DataCacheBase::FormatKey()
    @     0x7fca72931660  tensorstore::internal::GetChunkKeyRangesForRegularGridWithSemiLexicographicalKeys()::$_1::operator()()
    @     0x7fca72930eb0  absl::functional_internal::InvokeObject<>()
    @     0x7fca6b5f65fa  tensorstore::internal_grid_partition::(anonymous namespace)::GetGridCellRangesIterateHelper::IterateOverStridedSets()
    @     0x7fca6b5f5d57  tensorstore::internal_grid_partition::GetGridCellRanges()
    @     0x7fca72930c79  tensorstore::internal::GetChunkKeyRangesForRegularGridWithSemiLexicographicalKeys()
    @     0x7fca72d3c281  tensorstore::internal::GetStorageStatisticsForRegularGridWithSemiLexicographicalKeys()
    @     0x7fca73177d4c  tensorstore::internal_zarr3::GridStorageStatisticsChunkHandlerBase::Start()
    @     0x7fca7317737d  tensorstore::internal_zarr3::ZarrLeafChunkCache::GetStorageStatistics()

It's best to run all the tests as you're developing. I typically do bazelisk.py test -k //tensorstore/driver/zarr3/...

So far this is mostly build-based issues. I'll look at more of the structure later.

@laramiel
Copy link
Collaborator

laramiel commented Feb 5, 2026

Ok, from an implementation perspective this is somewhat better than before. However there doesn't seem to be a spec. It does look like zarrs has some thoughts: zarrs/zarrs#154. And zarr-python has others: zarr-developers/zarr-python#2874

It would at least be reasonable to get an extension into the spec: https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html

@jbms
Copy link
Collaborator

jbms commented Feb 5, 2026

A spec in the zarr-extensions repo would be sufficient to move forward on this --- getting it as a core data type in the zarr v3 spec would not be required (or likely to occur any time soon).

However, open_as_void does pose additional questions in the context of zarr v3 as far as where exactly it should fit into the codec pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants