Commit 4e4b648

Update to zarr v3 (#735)
* Tests all passing? Deps must have updated since I last tried
* Missing environment bits
* Apparently the bugfix was in zarr v3.0.10
* Fix requirements in ci/env*.yml
* Allow zarr <3.0 or >=3.0.10
* Change async in test based on zarr version - we can make this default behaviour too I think
* Move `_zarr_async` to utils module
* Add asynchronous/synchronous note to docs under Assets object section. I think we can do better & take care of this for the user
* @mgrover1 comment, add some extra tests, allow zarr version specification in the data format
* Misformatted esm-catalog-spec
* Re-pin minimum zarr version
* Formatting
1 parent ab18065 commit 4e4b648

14 files changed, +457 -13 lines changed

ci/environment-docs.yml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ dependencies:
   - watermark
   - xarray-datatree >=0.0.9
   - xarray >=2024.10
-  - zarr >=2.12,<3.0
+  - zarr <3.0|>=3.0.10
   - furo >=2022.09.15
   - pip:
     - git+https://github.com/ncar-xdev/ecgtools

ci/environment-upstream-dev.yml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ dependencies:
   - scipy
   - xarray-datatree
   - xgcm
-  - zarr >=2.10,<3.0
+  - zarr <3.0|>=3.0.10
   - pip:
     - git+https://github.com/intake/intake.git
     - git+https://github.com/pydata/xarray.git

ci/environment.yml

Lines changed: 1 addition & 1 deletion
@@ -34,5 +34,5 @@ dependencies:
   - scipy
   - xarray >=2024.10
   - xarray-datatree
-  - zarr >=2.12,<3.0
+  - zarr <3.0|>=3.0.10
   # - pytest-icdiff

docs/source/reference/esm-catalog-spec.md

Lines changed: 38 additions & 5 deletions
@@ -85,11 +85,44 @@ The column names can optionally be associated with a controlled vocabulary, such
 
 An assets object describes the columns in the CSV file relevant for opening the actual data files.
 
-| Element | Type | Description |
-| ------------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| column_name | string | **REQUIRED.** The name of the column containing the path to the asset. Must be in the header of the CSV file. |
-| format | string | The data format. Valid values are `netcdf`, `zarr`, `opendap` or `reference` ([`kerchunk`](https://github.com/fsspec/kerchunk) reference files). If specified, it means that all data in the catalog is the same type. |
-| format_column_name | string | The column name which contains the data format, allowing for variable data types in one catalog. Mutually exclusive with `format`. |
+| Element | Type | Description |
+| ------------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| column_name | string | **REQUIRED.** The name of the column containing the path to the asset. Must be in the header of the CSV file. |
+| format | string | The data format. Valid values are `netcdf`, `zarr`, `zarr2`, `zarr3`, `opendap` or `reference` ([`kerchunk`](https://github.com/fsspec/kerchunk) reference files). If specified, it means that all data in the catalog is the same type. |
+| format_column_name | string | The column name which contains the data format, allowing for variable data types in one catalog. Mutually exclusive with `format`. |
+
+````{note}
+Zarr v3 is built on asynchronous operations, and requires `xarray_open_kwargs` to contain the following dictionary fragment:
+```python
+xarray_open_kwargs = {
+    "storage_options" : {
+        "remote_options" : {
+            "async": True,
+            ...
+        },
+        ...
+    },
+    ...
+}
+```
+
+In contrast, Zarr v2 is synchronous and instead requires:
+
+```python
+xarray_open_kwargs = {
+    "storage_options" : {
+        "remote_options" : {
+            "async": False,
+            ...
+        },
+        ...
+    },
+    ...
+}
+```
+````
+
+If `zarr2` or `zarr3` is specified in the `format` field, the `async` flag will be set automatically. If you specify `zarr` as the format, you must set the `async` flag manually in the `xarray_open_kwargs`.
 
 ### Aggregation Control Object
 
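Reviewer sketch for the manual case (plain `zarr` format), not part of the commit. `manual_async_kwargs` is a hypothetical helper. It assumes the nesting that `_set_async_flag` in `intake_esm/utils.py` writes to (`backend_kwargs`, then `storage_options`, then `remote_options`, then `asynchronous`) is the one a caller would mirror by hand, and it takes the zarr version as a string so it runs without zarr installed.

```python
def manual_async_kwargs(zarr_version: str) -> dict:
    """Build xarray_open_kwargs with the async flag chosen from a zarr version string.

    Hypothetical helper for illustration; it repeats the major-version
    check used by `_zarr_async` in this commit.
    """
    # Major version 3 and above is asynchronous, 2.x is synchronous.
    is_async = int(zarr_version.split('.')[0]) > 2
    return {
        'backend_kwargs': {
            'storage_options': {'remote_options': {'asynchronous': is_async}}
        }
    }


kwargs_v2 = manual_async_kwargs('2.18.3')
kwargs_v3 = manual_async_kwargs('3.1.0')
```

The resulting dictionary would be passed as `xarray_open_kwargs` when the catalog format is `zarr` rather than `zarr2` or `zarr3`.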

intake_esm/cat.py

Lines changed: 2 additions & 0 deletions
@@ -54,6 +54,8 @@ class AggregationType(str, enum.Enum):
 class DataFormat(str, enum.Enum):
     netcdf = 'netcdf'
     zarr = 'zarr'
+    zarr2 = 'zarr2'
+    zarr3 = 'zarr3'
     reference = 'reference'
     opendap = 'opendap'
 

intake_esm/source.py

Lines changed: 4 additions & 2 deletions
@@ -9,7 +9,7 @@
 from intake.source.base import DataSource, Schema
 
 from .cat import Aggregation, DataFormat
-from .utils import OPTIONS
+from .utils import OPTIONS, _set_async_flag
 
 
 class ConcatenationWarning(UserWarning):
@@ -23,7 +23,7 @@ class ESMDataSourceError(Exception):
 def _get_xarray_open_kwargs(data_format, xarray_open_kwargs=None, storage_options=None):
     xarray_open_kwargs = (xarray_open_kwargs or {}).copy()
     _default_open_kwargs = {
-        'engine': 'zarr' if data_format in {'zarr', 'reference'} else 'netcdf4',
+        'engine': 'zarr' if data_format in {'zarr', 'zarr2', 'zarr3', 'reference'} else 'netcdf4',
         'chunks': {},
         'backend_kwargs': {},
         'decode_timedelta': False,
@@ -40,6 +40,8 @@ def _get_xarray_open_kwargs(data_format, xarray_open_kwargs=None, storage_option
     ):
         xarray_open_kwargs['backend_kwargs']['storage_options'] = {} or storage_options
 
+    xarray_open_kwargs = _set_async_flag(data_format, xarray_open_kwargs)
+
     return xarray_open_kwargs
 
 
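Reviewer sketch: the engine default in `_get_xarray_open_kwargs`, together with the new `DataFormat` members, can be exercised in isolation. The block re-declares the enum locally and uses a hypothetical `pick_engine` helper instead of importing `intake_esm`, so it only approximates the behaviour in the diff.

```python
import enum


class DataFormat(str, enum.Enum):
    # Local copy of the enum from intake_esm/cat.py after this commit.
    netcdf = 'netcdf'
    zarr = 'zarr'
    zarr2 = 'zarr2'
    zarr3 = 'zarr3'
    reference = 'reference'
    opendap = 'opendap'


def pick_engine(data_format: str) -> str:
    """Hypothetical stand-in for the 'engine' default chosen in the diff."""
    zarr_like = {'zarr', 'zarr2', 'zarr3', 'reference'}
    return 'zarr' if data_format in zarr_like else 'netcdf4'


# Map every declared format to the engine it would select.
engines = {fmt.value: pick_engine(fmt.value) for fmt in DataFormat}
```

Under these assumptions, every zarr flavour and `reference` resolve to the `zarr` engine, while `netcdf` and `opendap` fall through to `netcdf4`.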

intake_esm/utils.py

Lines changed: 60 additions & 0 deletions
@@ -5,6 +5,13 @@
 from collections import defaultdict
 
 import polars as pl
+import zarr
+
+__all__ = [
+    'OPTIONS',
+    'set_options',
+    '_set_async_flag',
+]
 
 
 def show_versions(file=sys.stdout):  # pragma: no cover
@@ -57,6 +64,59 @@ def show_versions(file=sys.stdout):  # pragma: no cover
         print(f'{k}: {stat}', file=file)
 
 
+def _zarr_async() -> bool:
+    """
+    Zarr went all async in version 3.0.0. This sets the async flag based on
+    the zarr version in storage options
+    """
+
+    return int(zarr.__version__.split('.')[0]) > 2
+
+
+def _set_async_flag(data_format: str, xarray_open_kwargs: dict) -> dict:
+    """
+    If we have the data format set to either zarr2 or zarr3, the async flag in
+    `xarray_open_kwargs['storage_options']['remote_options']` is constrained to
+    be either False or True, respectively.
+
+    Parameters
+    ----------
+    data_format : str
+
+    xarray_open_kwargs : dict
+        The xarray open kwargs dictionary that may contain storage options.
+    Returns
+    -------
+    dict
+        The updated xarray open kwargs with the async flag set appropriately.
+    """
+    if data_format not in {'zarr2', 'zarr3'}:
+        return xarray_open_kwargs
+
+    storage_opts_template = {
+        'backend_kwargs': {'storage_options': {'remote_options': {'asynchronous': _zarr_async()}}}
+    }
+    if (
+        xarray_open_kwargs.get('backend_kwargs', {})
+        .get('storage_options', {})
+        .get('remote_options', None)
+        is not None
+    ):
+        xarray_open_kwargs['backend_kwargs']['storage_options']['remote_options'][
+            'asynchronous'
+        ] = _zarr_async()
+    elif xarray_open_kwargs.get('backend_kwargs', {}).get('storage_options', None) is not None:
+        xarray_open_kwargs['backend_kwargs']['storage_options'] = storage_opts_template[
+            'backend_kwargs'
+        ]['storage_options']
+    elif xarray_open_kwargs.get('backend_kwargs', None) is not None:
+        xarray_open_kwargs['backend_kwargs'] = storage_opts_template['backend_kwargs']
+    else:
+        xarray_open_kwargs = storage_opts_template
+
+    return xarray_open_kwargs
+
+
 OPTIONS = {
     'attrs_prefix': 'intake_esm_attrs',
     'dataset_key': 'intake_esm_dataset_key',
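Reviewer sketch: the four-way branch in `_set_async_flag` could also be written as a walk down the nested keys, which keeps any sibling keys the caller already supplied at each level. This is an alternative for discussion, not the code in this commit. `set_async_flag_merged` is a hypothetical name, and it takes the async value as an argument instead of reading `zarr.__version__` so that it runs without zarr installed.

```python
import copy


def set_async_flag_merged(data_format: str, xarray_open_kwargs: dict, is_async: bool) -> dict:
    """Return a copy of xarray_open_kwargs with remote_options['asynchronous'] set.

    Hypothetical alternative to `_set_async_flag`; keys that already exist
    alongside the nested path are left untouched.
    """
    if data_format not in {'zarr2', 'zarr3'}:
        return xarray_open_kwargs

    merged = copy.deepcopy(xarray_open_kwargs)  # do not mutate the caller's dict
    node = merged
    for key in ('backend_kwargs', 'storage_options', 'remote_options'):
        if node.get(key) is None:  # key missing or explicitly set to None
            node[key] = {}
        node = node[key]
    node['asynchronous'] = is_async
    return merged


kwargs = {'chunks': {}, 'backend_kwargs': {'storage_options': {'anon': True}}}
result = set_async_flag_merged('zarr3', kwargs, is_async=True)
```

In this usage the `chunks` entry and the `anon` storage option both survive, and the input dictionary is left unmodified.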

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ pydantic>=2.0
 pydap!=3.5.5
 requests>=2.24.0
 xarray>=2024.10
-zarr>=2.12
+# Allow zarr >=2.12,<3.0 or zarr 3.1.0+
+zarr>=2.12,!=3.0.*
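Reviewer sketch: the specifier `zarr>=2.12,!=3.0.*` admits any release from 2.12 onward except the 3.0.x series. A rough stdlib-only check of that rule follows. `zarr_version_allowed` is a hypothetical helper that handles plain numeric `X.Y.Z` strings only and is not a substitute for a full PEP 440 parser.

```python
def zarr_version_allowed(version: str) -> bool:
    """Approximate `>=2.12,!=3.0.*` for simple numeric version strings."""
    parts = [int(p) for p in version.split('.')[:2]]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if (major, minor) < (2, 12):
        return False  # older than the 2.12 floor
    return (major, minor) != (3, 0)  # every 3.0.x release is excluded


checks = {v: zarr_version_allowed(v) for v in ('2.10.0', '2.18.3', '3.0.10', '3.1.0')}
```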
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+{
+  "esmcat_version": "0.1.0",
+  "id": "sample-cesm1-lens-zarr2",
+  "description": "This is a sample ESM catalog for CESM1-LENS data in zarr v2 format",
+  "catalog_file": "./tests/sample-catalogs/cesm1-lens-aws-zarr.csv",
+  "attributes": [
+    {
+      "column_name": "experiment",
+      "vocabulary": ""
+    },
+    {
+      "column_name": "component",
+      "vocabulary": ""
+    },
+    {
+      "column_name": "frequency",
+      "vocabulary": ""
+    },
+    { "column_name": "variable", "vocabulary": "" }
+  ],
+  "assets": {
+    "column_name": "path",
+    "format": "zarr2"
+  },
+  "aggregation_control": {
+    "variable_column_name": "variable",
+    "groupby_attrs": ["component", "experiment", "frequency"],
+    "aggregations": [
+      {
+        "type": "union",
+        "attribute_name": "variable"
+      }
+    ]
+  }
+}
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+{
+  "esmcat_version": "0.1.0",
+  "id": "sample-cesm1-lens-zarr3",
+  "description": "This is a sample ESM catalog for CESM1-LENS data in zarr v3 format",
+  "catalog_file": "./tests/sample-catalogs/cesm1-lens-aws-zarr.csv",
+  "attributes": [
+    {
+      "column_name": "experiment",
+      "vocabulary": ""
+    },
+    {
+      "column_name": "component",
+      "vocabulary": ""
+    },
+    {
+      "column_name": "frequency",
+      "vocabulary": ""
+    },
+    { "column_name": "variable", "vocabulary": "" }
+  ],
+  "assets": {
+    "column_name": "path",
+    "format": "zarr3"
+  },
+  "aggregation_control": {
+    "variable_column_name": "variable",
+    "groupby_attrs": ["component", "experiment", "frequency"],
+    "aggregations": [
+      {
+        "type": "union",
+        "attribute_name": "variable"
+      }
+    ]
+  }
+}
