docs on specifying chunks in to_zarr encoding arg (#6542)

delgadom · pre-commit-ci[bot] · max-sixty · web-flow · commit ec41d651a7fd · 2022-06-23T15:31:36.000-06:00
* docs on specifying chunks in to_zarr encoding arg

  The structure of the to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed) and it took a bit of trial and error to figure out the rules. Hoping this docs block is helpful to others!

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* clean-up prior write
* accept changes from @andersy005
* drop cleanup in io.zarr.writing_chunks

Co-authored-by: Anderson Banihirwe <axbanihirwe@ualr.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Co-authored-by: Anderson Banihirwe <axbanihirwe@ualr.edu>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
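The docs added in the diff below note that each chunks tuple in ``encoding`` must follow the positional order of that array's own dimensions. As a rough illustration of that rule, here is a minimal sketch of a helper that builds such a nested dictionary from per-dimension chunk sizes. The helper name ``build_chunk_encoding``, the variable names, and the dimension orders are all hypothetical and not part of xarray; only the ``{'name': {'chunks': tuple}}`` structure and the ``-1`` shorthand come from the docs themselves.

```python
from typing import Dict, Mapping, Sequence, Tuple


def build_chunk_encoding(
    dims_by_var: Mapping[str, Sequence[str]],
    chunk_sizes: Mapping[str, int],
) -> Dict[str, Dict[str, Tuple[int, ...]]]:
    """Hypothetical helper (not an xarray API).

    Builds {'name': {'chunks': tuple}} with each tuple ordered to match
    that variable's own dimension order.
    """
    encoding = {}
    for name, dims in dims_by_var.items():
        if not dims:
            # Skip scalar variables: there is nothing to chunk.
            continue
        # Dimensions without an explicit size fall back to -1, the
        # shorthand for "one chunk spanning the full dimension".
        encoding[name] = {
            "chunks": tuple(chunk_sizes.get(dim, -1) for dim in dims)
        }
    return encoding


# Illustrative dimension orders (assumed, not read from a real dataset).
# Note that "mask" orders its dimensions differently from "xc".
dims_by_var = {
    "Tair": ("time", "y", "x"),
    "xc": ("y", "x"),
    "mask": ("x", "y"),
}
encoding = build_chunk_encoding(dims_by_var, {"x": 100, "y": 50})
print(encoding)
```

With a real dataset, the dimension orders could presumably be collected with something like ``{name: var.dims for name, var in ds.variables.items()}`` and the result passed as ``ds.to_zarr(..., encoding=encoding)``. That usage is an assumption layered on the documented structure, not something this commit adds.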
diff --git a/doc/user-guide/io.rst b/doc/user-guide/io.rst
@@ -776,6 +776,75 @@ dimensions included in ``region``. Other variables (typically coordinates)
 need to be explicitly dropped and/or written in a separate calls to ``to_zarr``
 with ``mode='a'``.
 
+.. _io.zarr.writing_chunks:
+
+Specifying chunks in a zarr store
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Chunk sizes may be specified in one of three ways when writing to a zarr store:
+
+1. Manual chunk sizing through the ``encoding`` argument of :py:meth:`Dataset.to_zarr`
+2. Automatic chunking based on chunks in dask arrays
+3. Default chunk behavior determined by the zarr library
+
+The resulting chunks are determined by the order of the above list: chunks specified
+manually in the ``encoding`` argument override dask chunks, and the presence of
+either dask chunks or chunks in the ``encoding`` argument supersedes the default
+chunking heuristics in zarr.
+
+Importantly, this logic applies to every array in the zarr store individually,
+including coordinate arrays. Therefore, if a dataset contains one or more dask
+arrays, it may still be desirable to specify a chunk size for the coordinate arrays
+(for example, with a chunk size of ``-1`` to include the full coordinate).
+
+To specify chunks manually using the ``encoding`` argument, provide a nested
+dictionary with the structure ``{'variable_or_coord_name': {'chunks': chunks_tuple}}``.
+
+.. note::
+
+    The positional ordering of the chunks in the encoding argument must match the
+    positional ordering of the dimensions in each array. Watch out for arrays with
+    differently-ordered dimensions within a single Dataset.
+
+For example, let's say we're working with a dataset with dimensions
+``('time', 'x', 'y')``, a variable ``Tair`` which is chunked in ``x`` and ``y``,
+and two multi-dimensional coordinates ``xc`` and ``yc``:
+
+.. ipython:: python
+
+    ds = xr.tutorial.open_dataset("rasm")
+
+    ds["Tair"] = ds["Tair"].chunk({"x": 100, "y": 100})
+
+    ds
+
+These multi-dimensional coordinates are only two-dimensional and take up very little
+space on disk or in memory, yet when writing to disk the default zarr behavior is to
+split them into chunks:
+
+.. ipython:: python
+
+    ds.to_zarr("path/to/directory.zarr", mode="w")
+    ! ls -R path/to/directory.zarr
+
+
+This may cause unwanted overhead on some systems, such as when reading from a cloud
+storage provider. To disable this chunking, we can specify a chunk size equal to the
+length of each dimension by using the shorthand chunk size ``-1``:
+
+.. ipython:: python
+
+    ds.to_zarr(
+        "path/to/directory.zarr",
+        encoding={"xc": {"chunks": (-1, -1)}, "yc": {"chunks": (-1, -1)}},
+        mode="w",
+    )
+    ! ls -R path/to/directory.zarr
+
+
+The number of chunks on ``Tair`` matches our dask chunks, while the directory store
+of each coordinate now contains only a single chunk.
+
 .. _io.iris:
 
 Iris