-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Remove special mapping of auto to {} in open_zarr
#11010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| Breaking Changes | ||
| ~~~~~~~~~~~~~~~~ | ||
|
|
||
| - Remove special mapping of ``"auto"`` to ``{}`` in ``open_zarr``. This matches the behavior of ``open_dataset(..., engine="zarr")`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone is going to complain but I think this is an improvement.
- it removes one source of confusion.
- in general you do want a multiple of the on-disk chunks.
- users can still pass
{}to get previous behaviour.
^ perhaps we can call that third point out in the release note.
Also, since this could be pretty breaking I'd like us to get a few more thumbs up on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this could be pretty breaking I'd like us to get a few more thumbs up on it.
Yeah absolutely.
I am wondering if it's worth trying to improve the auto logic to make it less willing to split chunks before we merge this. If auto never split chunks then I think this is a much less breaking change. As it currently stands, auto prefers to keep existing chunk boundaries, but it isn't strict about it. I would kind of like something that is logically (max({}, "auto"), but maybe that should be a different name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone is going to complain
😏
Out of laziness I have a lot of code that assumes open_zarr(path) is shorthand foropen_dataset(path, engine="zarr", chunks={}).
Fine, that can be changed, although a deprecation cycle would be appreciated.
But in my opinion auto-magic chunking as defined by some hidden logic in the chunkmanager is not very predictable and therefore not great default behavior. We should trust that the user had intent in storing the data with the chunk structure they did, and explicitly opt in to modifying that rather than the other way around. Even more so if "auto" doesn't even respect chunk boundaries.
I never use "auto" so I don't know the extent to which it does or does not preserve the on-disk chunking. If it usually leaves them alone, then maybe not a big issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be easy to resolve by setting chunks=None as a default instead of "auto", interpreting that as {} for the period of the deprecation cycle, and switching that out to mean the same as chunks=None for open_dataset (which is no dask/cubed chunks) once the deprecation cycle is done..
Anyone that specifies "auto" today is asking for auto-magical behaviour, and should be mostly unaffected I hope. Or at least, I am OK with that breakage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems good, and could pave the way to remove open_zarr entirely. Although that's been rejected before (#7496), and there's the asymmetry of specialized to_ methods for each file type and a single generic open_ method.
xr.open_zarrandxr.open_dataset(..., engine='zarr')withchunks="auto"#11002whats-new.rstThis PR makes the handling of
chunks="auto"consistent betweenopen_zarrandopen_dataset(..., engine="zarr").The handling of chunks still differs in
open_zarrvsopen_dataset(..., engine="zarr")in that the default inopen_zarris to usechunks="auto"and a chunk manager (aka dask) when available in your env. And inopen_datasetthe default is to usechunks=None(aka no chunks).