
New defaults for concat, merge, combine_* #10062

Open
wants to merge 39 commits into base: main
Changes from 25 commits

39 commits
5c56acf
Remove default values in private functions
jsignell Feb 14, 2025
5461a9f
Use sentinel value to change default with warnings
jsignell Feb 24, 2025
e16834f
Remove unnecessary warnings
jsignell Feb 24, 2025
9c50125
Use old kwarg values within map_blocks, concat dataarray
jsignell Feb 25, 2025
b0cf17a
Merge branch 'main' into concat_default_kwargs
jsignell Feb 25, 2025
0026ee8
Switch options back to old defaults
jsignell Feb 26, 2025
4d4deda
Update tests and add new ones to exercise options
jsignell Feb 26, 2025
5a4036b
Merge branch 'main' into concat_default_kwargs
jsignell Mar 4, 2025
912638b
Use `emit_user_level_warning` rather than `warnings.warn`
jsignell Mar 4, 2025
67fd4ff
Change hardcoded defaults
jsignell Mar 4, 2025
4f38292
Fix up test_concat
jsignell Mar 4, 2025
51ccc89
Add comment about why we allow data_vars='minimial' for concat over d…
jsignell Mar 4, 2025
aa3180e
Tidy up tests based on review
jsignell Mar 4, 2025
93d2abc
Merge branch 'main' into concat_default_kwargs
jsignell Mar 7, 2025
e517dcc
Trying to resolve mypy issues
jsignell Mar 10, 2025
0e678e5
Fix mypy in tests
jsignell Mar 10, 2025
37f0147
Fix doctests
jsignell Mar 10, 2025
dac337c
Ignore warnings on error tests
jsignell Mar 10, 2025
a0c16c3
Merge branch 'main' into concat_default_kwargs
jsignell Mar 13, 2025
4eb275c
Use typing.get_args when possible
jsignell Mar 13, 2025
03f1502
Allow `minimal` in concat options at the type level
jsignell Mar 13, 2025
f1649b8
Merge branch 'main' into concat_default_kwargs
dcherian Mar 13, 2025
7dbdd4a
Minimal docs update
jsignell Mar 13, 2025
c6a557b
Tighten up language
jsignell Mar 13, 2025
9667857
Merge branch 'main' into concat_default_kwargs
jsignell Mar 13, 2025
42cf522
Merge branch 'main' into concat_default_kwargs
jsignell Mar 17, 2025
8d0d390
Merge branch 'main' into concat_default_kwargs
jsignell Apr 18, 2025
ba45599
Add to deprecated section of whats new
jsignell Apr 18, 2025
90bd629
Merge branch 'main' into concat_default_kwargs
Illviljan May 9, 2025
d3b484f
Merge branch 'main' into concat_default_kwargs
jsignell May 27, 2025
f233294
Update doc/whats-new.rst
jsignell May 28, 2025
20a3dbd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 28, 2025
324714a
Add a mypy tuple[Any, ...] type
jsignell May 28, 2025
c4d9f74
Merge branch 'main' into concat_default_kwargs
jsignell May 28, 2025
38ef42d
Merge branch 'main' into concat_default_kwargs
jsignell May 30, 2025
eb14402
Apply suggestions from code review
jsignell Jun 2, 2025
729b8ba
Simplify combining docs slightly
jsignell Jun 2, 2025
aca67b9
Don't change concat_dims
jsignell Jun 2, 2025
63c5905
Fix formatting
jsignell Jun 2, 2025
44 changes: 37 additions & 7 deletions doc/user-guide/combining.rst
@@ -43,7 +43,6 @@ new dimension by stacking lower dimensional arrays together:

.. ipython:: python

da.sel(x="a")
xr.concat([da.isel(x=0), da.isel(x=1)], "x")

If the second argument to ``concat`` is a new dimension name, the arrays will
@@ -52,15 +51,18 @@ dimension:

.. ipython:: python

xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
da0 = da.isel(x=0).drop_vars("x")
da1 = da.isel(x=1).drop_vars("x")

xr.concat([da0, da1], "new_dim")
Contributor Author:

Dropping the overlapping "x" means that you don't get a future warning anymore and the outcome won't change with the new defaults. It seemed to me like it was maintaining the spirit of the docs.

Contributor:

I'd change to xr.concat([da.isel(x=[0]), da.isel(x=[1])], dim="new_dim"). I think that preserves the spirit, and gets users closer to what we'd like them to type and understand.

Contributor Author:

That one will give a FutureWarning about how join is going to change:

In [3]:  xr.concat([da.isel(x=[0]), da.isel(x=[1])], "new_dim")
<ipython-input-3-8d3fee24c8e4>:1: FutureWarning: In a future version of xarray the default value for join will change from join='outer' to join='exact'. This change will result in the following ValueError:cannot be aligned with join='exact' because index/labels/sizes are not equal along these coordinates (dimensions): 'x' ('x',) The recommendation is to set join explicitly for this case.
  xr.concat([da.isel(x=[0]), da.isel(x=[1])], "new_dim")
Out[3]: 
<xarray.DataArray (new_dim: 2, x: 2, y: 3)> Size: 96B
array([[[ 0.,  1.,  2.],
        [nan, nan, nan]],

       [[nan, nan, nan],
        [ 3.,  4.,  5.]]])
Coordinates:
  * x        (x) <U1 8B 'a' 'b'
  * y        (y) int64 24B 10 20 30
Dimensions without coordinates: new_dim

We can add an explicit join value to get rid of the warning, or we can let the docs build with the warning (I think that is not a good idea, since warnings in the docs might scare people).

Contributor Author:

Compare that with the example as it is on main:

In [3]:  xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
<ipython-input-8-5e17a4052d18>:1: FutureWarning: In a future version of xarray the default value for coords will change from coords='different' to coords='minimal'. This is likely to lead to different results when multiple datasets have matching variables with overlapping values. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set coords explicitly.
  xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
Out[3]: 
<xarray.DataArray (new_dim: 2, y: 3)> Size: 48B
array([[0, 1, 2],
       [3, 4, 5]])
Coordinates:
    x        (new_dim) <U1 8B 'a' 'b'
  * y        (y) int64 24B 10 20 30
Dimensions without coordinates: new_dim

Collaborator:

If we keep this as suggested in the PR, I'd go with

    da0 = da.isel(x=0, drop=True)
    da1 = da.isel(x=1, drop=True)
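
For what it's worth, the two spellings discussed in this thread are interchangeable; a quick sketch checking that (reconstructing a ``da`` shaped like the one in the docs):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords=[("x", ["a", "b"]), ("y", [10, 20, 30])],
)

# Both drop the scalar "x" coordinate that isel leaves behind.
da0 = da.isel(x=0).drop_vars("x")
da0_alt = da.isel(x=0, drop=True)
assert da0.identical(da0_alt)
```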


The second argument to ``concat`` can also be an :py:class:`~pandas.Index` or
:py:class:`~xarray.DataArray` object as well as a string, in which case it is
used to label the values along the new dimension:

.. ipython:: python

xr.concat([da.isel(x=0), da.isel(x=1)], pd.Index([-90, -100], name="new_dim"))
xr.concat([da0, da1], pd.Index([-90, -100], name="new_dim"))
Contributor:

Same here.


Of course, ``concat`` also works on ``Dataset`` objects:

@@ -75,6 +77,12 @@ between datasets. With the default parameters, xarray will load some coordinate
variables into memory to compare them between datasets. This may be prohibitively
expensive if you are manipulating your dataset lazily using :ref:`dask`.

.. note::

In a future version of xarray the default values for many of these options
will change. You can opt into the new default values early using
``xr.set_options(use_new_combine_kwarg_defaults=True)``.
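
The option named in the note only exists once this PR lands; in the meantime (and afterwards) the same stability can be had by passing the changing keywords explicitly, which behaves identically before and after the default switch. A sketch with made-up datasets:

```python
import xarray as xr

ds1 = xr.Dataset({"a": ("x", [1, 2])}, coords={"x": [0, 1]})
ds2 = xr.Dataset({"a": ("x", [3, 4])}, coords={"x": [2, 3]})

# Passing join and compat explicitly silences the FutureWarning and keeps
# the result stable across the default change.
merged = xr.merge([ds1, ds2], join="outer", compat="no_conflicts")
print(merged.sizes["x"])  # 4
```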

.. _merge:

Merge
@@ -94,10 +102,18 @@ If you merge another dataset (or a dictionary including data array objects), by
default the resulting dataset will be aligned on the **union** of all index
coordinates:

.. note::

In a future version of xarray the default values for ``join`` and ``compat``
will change. After this change, xarray will no longer attempt
to align the indices of the merged dataset. You can opt into the new default
values early using ``xr.set_options(use_new_combine_kwarg_defaults=True)``,
or explicitly set ``join='outer'`` to preserve the old behavior.

.. ipython:: python

other = xr.Dataset({"bar": ("x", [1, 2, 3, 4]), "x": list("abcd")})
xr.merge([ds, other])
xr.merge([ds, other], join="outer")

This ensures that ``merge`` is non-destructive. ``xarray.MergeError`` is raised
if you attempt to merge two variables with the same name but different values:
@@ -114,6 +130,16 @@ if you attempt to merge two variables with the same name but different values:
array([[ 1.4691123 , 0.71713666, -0.5090585 ],
[-0.13563237, 2.21211203, 0.82678535]])

.. note::

In a future version of xarray the default value for ``compat`` will change
from ``compat='no_conflicts'`` to ``compat='override'``. In this scenario
the values in the first object override all the values in other objects.

.. ipython:: python

xr.merge([ds, ds + 1], compat="override")
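
Run as a standalone sketch (with a stand-in ``ds``, not the docs' dataset), the override semantics look like this:

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3])})

# With compat="override", the first object's values win wholesale; the
# conflicting values in ds + 1 are ignored rather than checked.
result = xr.merge([ds, ds + 1], compat="override")
print(result["a"].values)  # [1 2 3]
```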

The same non-destructive merging between ``DataArray`` index coordinates is
used in the :py:class:`~xarray.Dataset` constructor:

@@ -144,6 +170,11 @@ For datasets, ``ds0.combine_first(ds1)`` works similarly to
there are conflicting values in variables to be merged, whereas
``.combine_first`` defaults to the calling object's values.

.. note::

In a future version of xarray the default options for ``xr.merge`` will change
such that the behavior matches ``combine_first``.
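
A minimal sketch of the ``combine_first`` precedence described above (invented data):

```python
import numpy as np
import xarray as xr

ds0 = xr.Dataset({"a": ("x", [1.0, np.nan])}, coords={"x": [0, 1]})
ds1 = xr.Dataset({"a": ("x", [9.0, 9.0])}, coords={"x": [0, 1]})

# combine_first keeps the calling object's values and only fills its
# missing entries from the other object.
res = ds0.combine_first(ds1)
print(res["a"].values)  # [1. 9.]
```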

.. _update:

Update
@@ -236,7 +267,7 @@ coordinates as long as any non-missing values agree or are disjoint:

ds1 = xr.Dataset({"a": ("x", [10, 20, 30, np.nan])}, {"x": [1, 2, 3, 4]})
ds2 = xr.Dataset({"a": ("x", [np.nan, 30, 40, 50])}, {"x": [2, 3, 4, 5]})
xr.merge([ds1, ds2], compat="no_conflicts")
xr.merge([ds1, ds2], join="outer", compat="no_conflicts")

Note that due to the underlying representation of missing values as floating
point numbers (``NaN``), variable data type is not always preserved when merging
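
That caveat can be seen directly in a small sketch:

```python
import xarray as xr

ds1 = xr.Dataset({"a": ("x", [10, 20])}, coords={"x": [1, 2]})
ds2 = xr.Dataset({"a": ("x", [30, 40])}, coords={"x": [3, 4]})

# The outer join introduces NaN during alignment, so the integer data
# comes back as float64 even though no NaN survives in the result.
out = xr.merge([ds1, ds2], join="outer", compat="no_conflicts")
print(out["a"].dtype)  # float64
```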
@@ -295,13 +326,12 @@ they are concatenated in order based on the values in their dimension
coordinates, not on their position in the list passed to ``combine_by_coords``.

.. ipython:: python
:okwarning:

x1 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [0, 1, 2])])
x2 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [3, 4, 5])])
xr.combine_by_coords([x2, x1])
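
As a standalone check of that ordering behavior (kwargs passed explicitly to sidestep the deprecation warnings this PR introduces):

```python
import xarray as xr

x1 = xr.Dataset({"foo": ("x", [0, 1])}, coords={"x": [0, 1]})
x2 = xr.Dataset({"foo": ("x", [2, 3])}, coords={"x": [2, 3]})

# Order in the input list does not matter; the dimension coordinate does.
combined = xr.combine_by_coords([x2, x1], join="outer", combine_attrs="override")
print(combined["foo"].values)  # [0 1 2 3]
```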

These functions can be used by :py:func:`~xarray.open_mfdataset` to open many
These functions are used by :py:func:`~xarray.open_mfdataset` to open many
files as one dataset. The particular function used is specified by setting the
argument ``'combine'`` to ``'by_coords'`` or ``'nested'``. This is useful for
situations where your data is split across many files in multiple locations,
2 changes: 1 addition & 1 deletion doc/user-guide/terminology.rst
@@ -217,7 +217,7 @@ complete examples, please consult the relevant documentation.*
)

# combine the datasets
combined_ds = xr.combine_by_coords([ds1, ds2])
combined_ds = xr.combine_by_coords([ds1, ds2], join="outer")
combined_ds

lazy
14 changes: 9 additions & 5 deletions doc/whats-new.rst
@@ -7925,13 +7925,17 @@ Backwards incompatible changes
Now, the default always concatenates data variables:

.. ipython:: python
:suppress:

ds = xray.Dataset({"x": 0})
:verbatim:

.. ipython:: python
In [1]: ds = xray.Dataset({"x": 0})

xray.concat([ds, ds], dim="y")
In [2]: xray.concat([ds, ds], dim="y")
Out[2]:
<xarray.Dataset> Size: 16B
Dimensions: (y: 2)
Dimensions without coordinates: y
Data variables:
x (y) int64 16B 0 0

To obtain the old behavior, supply the argument ``concat_over=[]``.

25 changes: 16 additions & 9 deletions xarray/backends/api.py
@@ -34,7 +34,7 @@
)
from xarray.backends.locks import _get_scheduler
from xarray.coders import CFDatetimeCoder, CFTimedeltaCoder
from xarray.core import indexing
from xarray.core import dtypes, indexing
from xarray.core.chunk import _get_chunk, _maybe_chunk
from xarray.core.combine import (
_infer_concat_order_from_positions,
@@ -50,6 +50,13 @@
from xarray.core.utils import is_remote_uri
from xarray.namedarray.daskmanager import DaskManager
from xarray.namedarray.parallelcompat import guess_chunkmanager
from xarray.util.deprecation_helpers import (
_COMPAT_DEFAULT,
_COORDS_DEFAULT,
_DATA_VARS_DEFAULT,
_JOIN_DEFAULT,
CombineKwargDefault,
)

if TYPE_CHECKING:
try:
@@ -1404,14 +1411,16 @@ def open_mfdataset(
| Sequence[Index]
| None
) = None,
compat: CompatOptions = "no_conflicts",
compat: CompatOptions | CombineKwargDefault = _COMPAT_DEFAULT,
preprocess: Callable[[Dataset], Dataset] | None = None,
engine: T_Engine | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
data_vars: Literal["all", "minimal", "different"]
| list[str]
| CombineKwargDefault = _DATA_VARS_DEFAULT,
coords=_COORDS_DEFAULT,
Collaborator:

I don't know anything about the context and I'm really bad at typing (so feel free to disregard / punt to a different PR), but shouldn't coords have the same type hints as data_vars?

Contributor Author:

Probably? I was trying to limit the scope of this PR as much as possible, since it's already pretty big. So I would prefer to punt this. When you add types there is always the possibility of breaking a bunch of stuff...

combine: Literal["by_coords", "nested"] = "by_coords",
parallel: bool = False,
join: JoinOptions = "outer",
join: JoinOptions | CombineKwargDefault = _JOIN_DEFAULT,
attrs_file: str | os.PathLike | None = None,
combine_attrs: CombineAttrsOptions = "override",
**kwargs,
@@ -1598,9 +1607,6 @@ def open_mfdataset(

paths1d: list[str | ReadBuffer]
if combine == "nested":
if isinstance(concat_dim, str | DataArray) or concat_dim is None:
concat_dim = [concat_dim] # type: ignore[assignment]

# This creates a flat list which is easier to iterate over, whilst
# encoding the originally-supplied structure as "ids".
# The "ids" are not used at all if combine='by_coords`.
@@ -1649,13 +1655,14 @@
# along each dimension, using structure given by "ids"
combined = _nested_combine(
datasets,
concat_dims=concat_dim,
concat_dim=concat_dim,
compat=compat,
data_vars=data_vars,
coords=coords,
ids=ids,
join=join,
combine_attrs=combine_attrs,
fill_value=dtypes.NA,
)
elif combine == "by_coords":
# Redo ordering from coordinates, ignoring how they were ordered
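
The ``CombineKwargDefault`` sentinel and the ``_*_DEFAULT`` values imported in ``api.py`` come from ``xarray.util.deprecation_helpers``, which this diff does not show. A rough sketch of the sentinel-default pattern they implement (names and details are illustrative, not xarray's actual code):

```python
import warnings


class CombineKwargDefault:
    """Sentinel for a keyword default that is scheduled to change.

    Illustrative sketch only: it compares equal to the *old* default so
    existing code paths behave as before, while remembering the *new*
    default so callers relying on the default can be warned.
    """

    def __init__(self, name, old, new):
        self._name, self._old, self._new = name, old, new

    def __eq__(self, other):
        # Behave like the old default in equality checks.
        return other == self._old

    def warning_message(self):
        return (
            f"In a future version of xarray the default value for {self._name} "
            f"will change from {self._name}={self._old!r} to "
            f"{self._name}={self._new!r}."
        )


_JOIN_DEFAULT = CombineKwargDefault("join", old="outer", new="exact")


def merge_like(objects, join=_JOIN_DEFAULT):
    # Warn only when the caller actually relied on the changing default.
    if isinstance(join, CombineKwargDefault):
        warnings.warn(join.warning_message(), FutureWarning)
        join = join._old
    return join  # stand-in for the real merge logic
```

The point of the sentinel (rather than changing the default outright) is that explicit callers are never warned, while default-relying callers get one actionable ``FutureWarning``.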
53 changes: 39 additions & 14 deletions xarray/core/alignment.py
@@ -5,7 +5,7 @@
from collections import defaultdict
from collections.abc import Callable, Hashable, Iterable, Mapping
from contextlib import suppress
from typing import TYPE_CHECKING, Any, Final, Generic, TypeVar, cast, overload
from typing import TYPE_CHECKING, Any, Final, Generic, TypeVar, cast, get_args, overload

import numpy as np
import pandas as pd
@@ -19,9 +19,10 @@
indexes_all_equal,
safe_cast_to_index,
)
from xarray.core.types import T_Alignable
from xarray.core.utils import is_dict_like, is_full_slice
from xarray.core.types import JoinOptions, T_Alignable
from xarray.core.utils import emit_user_level_warning, is_dict_like, is_full_slice
from xarray.core.variable import Variable, as_compatible_data, calculate_dimensions
from xarray.util.deprecation_helpers import CombineKwargDefault

if TYPE_CHECKING:
from xarray.core.dataarray import DataArray
@@ -112,7 +113,7 @@ class Aligner(Generic[T_Alignable]):
objects: tuple[T_Alignable, ...]
results: tuple[T_Alignable, ...]
objects_matching_indexes: tuple[dict[MatchingIndexKey, Index], ...]
join: str
join: JoinOptions | CombineKwargDefault
exclude_dims: frozenset[Hashable]
exclude_vars: frozenset[Hashable]
copy: bool
@@ -132,7 +133,7 @@
def __init__(
self,
objects: Iterable[T_Alignable],
join: str = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
indexes: Mapping[Any, Any] | None = None,
exclude_dims: str | Iterable[Hashable] = frozenset(),
exclude_vars: Iterable[Hashable] = frozenset(),
@@ -145,7 +146,9 @@ def __init__(
self.objects = tuple(objects)
self.objects_matching_indexes = ()

if join not in ["inner", "outer", "override", "exact", "left", "right"]:
if not isinstance(join, CombineKwargDefault) and join not in get_args(
JoinOptions
):
raise ValueError(f"invalid value for join: {join}")
self.join = join
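
The membership check above keys off ``typing.get_args`` applied to the ``JoinOptions`` ``Literal`` alias; as a self-contained sketch of that pattern:

```python
from typing import Literal, get_args

# Mirrors xarray.core.types.JoinOptions
JoinOptions = Literal["outer", "inner", "left", "right", "exact", "override"]


def validate_join(join: str) -> str:
    # get_args unpacks the Literal members, so the set of valid values
    # stays in sync with the type alias instead of a hand-written list.
    if join not in get_args(JoinOptions):
        raise ValueError(f"invalid value for join: {join}")
    return join
```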

@@ -418,12 +421,34 @@ def align_indexes(self) -> None:
else:
need_reindex = False
if need_reindex:
if (
isinstance(self.join, CombineKwargDefault)
and self.join != "exact"
):
emit_user_level_warning(
self.join.warning_message(
"This change will result in the following ValueError:"
"cannot be aligned with join='exact' because "
"index/labels/sizes are not equal along "
"these coordinates (dimensions): "
+ ", ".join(
f"{name!r} {dims!r}" for name, dims in key[0]
),
recommend_set_options=False,
),
FutureWarning,
)
if self.join == "exact":
raise ValueError(
"cannot align objects with join='exact' where "
"index/labels/sizes are not equal along "
"these coordinates (dimensions): "
+ ", ".join(f"{name!r} {dims!r}" for name, dims in key[0])
+ (
self.join.error_message()
if isinstance(self.join, CombineKwargDefault)
else ""
)
)
joiner = self._get_index_joiner(index_cls)
joined_index = joiner(matching_indexes)
@@ -595,7 +620,7 @@ def align(
obj1: T_Obj1,
/,
*,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -609,7 +634,7 @@
obj2: T_Obj2,
/,
*,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -624,7 +649,7 @@
obj3: T_Obj3,
/,
*,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -640,7 +665,7 @@
obj4: T_Obj4,
/,
*,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -657,7 +682,7 @@
obj5: T_Obj5,
/,
*,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -668,7 +693,7 @@
@overload
def align(
*objects: T_Alignable,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -678,7 +703,7 @@

def align(
*objects: T_Alignable,
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),
@@ -886,7 +911,7 @@

def deep_align(
objects: Iterable[Any],
join: JoinOptions = "inner",
join: JoinOptions | CombineKwargDefault = "inner",
copy: bool = True,
indexes=None,
exclude: str | Iterable[Hashable] = frozenset(),