Commit ec41d65

delgadom, pre-commit-ci[bot], max-sixty, andersy005, and Illviljan authored
docs on specifying chunks in to_zarr encoding arg (#6542)
* docs on specifying chunks in to_zarr encoding arg

  The structure of the to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed) and it took a bit of trial and error to figure out the rules. Hoping this docs block is helpful to others!

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* clean-up prior write
* accept changes from @andersy005
* drop cleanup in io.zarr.writing_chunks

Co-authored-by: Anderson Banihirwe <axbanihirwe@ualr.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
1 parent ea07233 commit ec41d65

1 file changed: +69 -0 lines changed

doc/user-guide/io.rst

Lines changed: 69 additions & 0 deletions
@@ -776,6 +776,75 @@ dimensions included in ``region``. Other variables (typically coordinates)
need to be explicitly dropped and/or written in separate calls to ``to_zarr``
with ``mode='a'``.

.. _io.zarr.writing_chunks:

Specifying chunks in a zarr store
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Chunk sizes may be specified in one of three ways when writing to a zarr store:

1. Manual chunk sizing through the ``encoding`` argument of :py:meth:`Dataset.to_zarr`
2. Automatic chunking based on chunks in dask arrays
3. Default chunk behavior determined by the zarr library

The resulting chunks are determined by the order of the list above: manually
specified chunks in the ``encoding`` argument override dask chunks, and the
presence of either dask chunks or chunks in the ``encoding`` argument
supersedes the default chunking heuristics in zarr.

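The precedence described above can be sketched as a small pure-Python function. This is only an illustration of the ordering, not the logic xarray actually runs inside ``to_zarr``; the name ``resolve_chunks``, its parameters, and the chunk sizes are hypothetical.

```python
from typing import Optional, Tuple

Chunks = Tuple[int, ...]


def resolve_chunks(
    encoding_chunks: Optional[Chunks],
    dask_chunks: Optional[Chunks],
    zarr_default: Chunks,
) -> Chunks:
    """Pick chunks in the order: encoding, then dask, then the zarr default."""
    if encoding_chunks is not None:
        return encoding_chunks
    if dask_chunks is not None:
        return dask_chunks
    return zarr_default


# Hypothetical chunk sizes for a single 3-D array.
print(resolve_chunks((1, 50, 50), (36, 100, 100), (9, 52, 69)))  # encoding wins
print(resolve_chunks(None, (36, 100, 100), (9, 52, 69)))  # dask chunks win
print(resolve_chunks(None, None, (9, 52, 69)))  # zarr default is used
```

Because the rule is applied per array, each variable and coordinate would go through this decision separately.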
794+
795+
Importantly, this logic applies to every array in the zarr store individually,
796+
including coordinate arrays. Therefore, if a dataset contains one or more dask
797+
arrays, it may still be desirable to specify a chunk size for the coordinate arrays
798+
(for example, with a chunk size of `-1` to include the full coordinate).
799+
800+
To specify chunks manually using the ``encoding`` argument, provide a nested
801+
dictionary with the structure ``{'variable_or_coord_name': {'chunks': chunks_tuple}}``.
802+
803+
.. note::
804+
805+
The positional ordering of the chunks in the encoding argument must match the
806+
positional ordering of the dimensions in each array. Watch out for arrays with
807+
differently-ordered dimensions within a single Dataset.
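One way to avoid ordering mistakes is to think of chunk sizes per dimension name and derive each tuple from the array's own dimension order. The sketch below assumes that approach; ``build_chunk_encoding`` is a hypothetical helper (not an xarray function), and the variable names and sizes are made up for illustration.

```python
from typing import Dict, Mapping, Sequence, Tuple


def build_chunk_encoding(
    dims_by_variable: Mapping[str, Sequence[str]],
    chunk_by_dim: Mapping[str, int],
) -> Dict[str, Dict[str, Tuple[int, ...]]]:
    """Build an ``encoding`` dict whose chunk tuples follow each array's dimension order.

    Dimensions missing from ``chunk_by_dim`` get ``-1`` (one chunk spanning the dimension).
    """
    return {
        name: {"chunks": tuple(chunk_by_dim.get(dim, -1) for dim in dims)}
        for name, dims in dims_by_variable.items()
    }


# Two hypothetical variables that share dimensions but order them differently.
dims_by_variable = {
    "a": ("time", "y", "x"),
    "b": ("x", "y", "time"),
}
encoding = build_chunk_encoding(dims_by_variable, {"time": 12, "x": 100, "y": 50})
print(encoding)
```

With a real dataset ``ds``, the first argument could plausibly be built as ``{name: ds[name].dims for name in ds.variables}`` and the result passed as ``encoding=`` to ``to_zarr``.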
808+
809+
For example, let's say we're working with a dataset with dimensions
810+
``('time', 'x', 'y')``, a variable ``Tair`` which is chunked in ``x`` and ``y``,
811+
and two multi-dimensional coordinates ``xc`` and ``yc``:
812+
813+
.. ipython:: python
814+
815+
ds = xr.tutorial.open_dataset("rasm")
816+
817+
ds["Tair"] = ds["Tair"].chunk({"x": 100, "y": 100})
818+
819+
ds
820+
821+
These multi-dimensional coordinates are only two-dimensional and take up very little
822+
space on disk or in memory, yet when writing to disk the default zarr behavior is to
823+
split them into chunks:
824+
825+
.. ipython:: python
826+
827+
ds.to_zarr("path/to/directory.zarr", mode="w")
828+
! ls -R path/to/directory.zarr
829+
830+
831+
This may cause unwanted overhead on some systems, such as when reading from a cloud
832+
storage provider. To disable this chunking, we can specify a chunk size equal to the
833+
length of each dimension by using the shorthand chunk size ``-1``:
834+
835+
.. ipython:: python
836+
837+
ds.to_zarr(
838+
"path/to/directory.zarr",
839+
encoding={"xc": {"chunks": (-1, -1)}, "yc": {"chunks": (-1, -1)}},
840+
mode="w",
841+
)
842+
! ls -R path/to/directory.zarr
843+
844+
845+
The number of chunks on Tair matches our dask chunks, while there is now only a single
846+
chunk in the directory stores of each coordinate.
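To see why ``-1`` yields a single chunk, it can help to work out the chunk count by hand. The following is a minimal sketch under the assumption that ``-1`` simply stands for the full length of a dimension; ``expand_chunks`` and ``count_chunks`` are hypothetical helpers, and the shape is invented rather than taken from the ``rasm`` dataset.

```python
import math
from typing import Sequence, Tuple


def expand_chunks(shape: Sequence[int], chunks: Sequence[int]) -> Tuple[int, ...]:
    """Replace each ``-1`` with the full length of the corresponding dimension."""
    if len(shape) != len(chunks):
        raise ValueError("chunks must have one entry per dimension")
    return tuple(size if chunk == -1 else chunk for size, chunk in zip(shape, chunks))


def count_chunks(shape: Sequence[int], chunks: Sequence[int]) -> int:
    """Number of chunks needed to tile an array of the given shape."""
    expanded = expand_chunks(shape, chunks)
    return math.prod(math.ceil(size / chunk) for size, chunk in zip(shape, expanded))


shape = (200, 300)  # hypothetical 2-D coordinate
print(expand_chunks(shape, (-1, -1)))  # (200, 300)
print(count_chunks(shape, (-1, -1)))  # 1
print(count_chunks(shape, (100, 100)))  # 2 * 3 = 6
```

Fewer, larger chunks mean fewer objects to fetch, which is the overhead the cloud-storage remark above is concerned with.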
847+
779848
.. _io.iris:
780849

781850
Iris
