Cannot round-trip DataArray with string coordinates using numcodecs.Categorize filter #9863

Open
5 tasks done
y4n9squared opened this issue Dec 7, 2024 · 2 comments
Labels: bug, needs triage


y4n9squared commented Dec 7, 2024

What happened?

Writing an array containing string coordinate values using the numcodecs.Categorize filter succeeds, but the resulting Zarr group cannot be read back into a DataArray.

What did you expect to happen?

To get back the same array that I wrote.

Minimal Complete Verifiable Example

import numcodecs
import numpy as np
import xarray as xr

da = xr.DataArray(coords={"x": ("x", np.array(["a", "b"], dtype=object))}, dims=("x",))
codec = numcodecs.Categorize(labels=["a", "b"], dtype=object)
encoding = {
    "x": {
        "filters": [codec],
    },
}

da.to_zarr("/tmp/foo.zarr", mode="w", encoding=encoding)
da = xr.open_dataarray("/tmp/foo.zarr")  # raises TypeError: object arrays are not supported

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Traceback (most recent call last):
  File "/home/yang.yang/foo.py", line 58, in <module>
    da = xr.open_dataarray("/tmp/foo.zarr")
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/api.py", line 851, in open_dataarray
    dataset = open_dataset(
        filename_or_obj,
    ...<15 lines>...
        **kwargs,
    )
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/api.py", line 670, in open_dataset
    backend_ds = backend.open_dataset(
        filename_or_obj,
    ...<2 lines>...
        **kwargs,
    )
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/zarr.py", line 1524, in open_dataset
    ds = store_entrypoint.open_dataset(
        store,
    ...<6 lines>...
        decode_timedelta=decode_timedelta,
    )
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/store.py", line 59, in open_dataset
    ds = Dataset(vars, attrs=attrs)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/dataset.py", line 746, in __init__
    variables, coord_names, dims, indexes, _ = merge_data_and_coords(
                                               ~~~~~~~~~~~~~~~~~~~~~^
        data_vars, coords
        ^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/dataset.py", line 459, in merge_data_and_coords
    return merge_core(
        [data_vars, coords],
    ...<5 lines>...
        skip_align_args=[1],
    )
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/merge.py", line 699, in merge_core
    collected = collect_variables_and_indexes(aligned, indexes=indexes)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/merge.py", line 362, in collect_variables_and_indexes
    idx, idx_vars = create_default_index_implicit(variable)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexes.py", line 1425, in create_default_index_implicit
    index = PandasIndex.from_variables(dim_var, options={})
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexes.py", line 654, in from_variables
    obj = cls(data, dim, coord_dtype=var.dtype)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexes.py", line 589, in __init__
    index = safe_cast_to_index(array)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexes.py", line 469, in safe_cast_to_index
    index = pd.Index(np.asarray(array), **kwargs)
                     ~~~~~~~~~~^^^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexing.py", line 514, in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype, copy=copy)
                      ~~~~~~~~~~~~~~~~~~~^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/common.py", line 268, in get_duck_array
    return self[key]  # type: ignore[index]
           ~~~~^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/zarr.py", line 226, in __getitem__
    return indexing.explicit_indexing_adapter(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        key, array.shape, indexing.IndexingSupport.VECTORIZED, method
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/core/indexing.py", line 1018, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/xarray/backends/zarr.py", line 216, in _getitem
    return self._array[key]
           ~~~~~~~~~~~^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 797, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 923, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 965, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 1340, in _get_selection
    self._chunk_getitems(
    ~~~~~~~~~~~~~~~~~~~~^
        lchunk_coords,
        ^^^^^^^^^^^^^^
    ...<4 lines>...
        fields=fields,
        ^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 2185, in _chunk_getitems
    self._process_chunk(
    ~~~~~~~~~~~~~~~~~~~^
        out,
        ^^^^
    ...<6 lines>...
        partial_read_decode=partial_read_decode,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 2098, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/zarr/core.py", line 2361, in _decode_chunk
    chunk = f.decode(chunk)
  File "numcodecs/vlen.pyx", line 141, in numcodecs.vlen.VLenUTF8.decode
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/numcodecs/compat.py", line 149, in ensure_contiguous_ndarray
    ensure_contiguous_ndarray_like(buf, max_buffer_size=max_buffer_size, flatten=flatten)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yang.yang/.venv/lib/python3.13/site-packages/numcodecs/compat.py", line 99, in ensure_contiguous_ndarray_like
    raise TypeError("object arrays are not supported")
TypeError: object arrays are not supported

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.13.0 (main, Oct 16 2024, 03:23:02) [Clang 18.1.8 ]
python-bits: 64
OS: Linux
OS-release: 6.8.0-1017-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.11.0
pandas: 2.2.3
numpy: 2.1.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.12.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

y4n9squared added the bug and needs triage labels on Dec 7, 2024

welcome bot commented Dec 7, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

y4n9squared (Author) commented

After doing a little digging, I believe the issue is that when the string coordinate has dtype object (as opposed to a fixed-width dtype such as <U3), the VLenUTF8 filter is automatically added during the write. The variable-length encoding produces a single byte array, and when that byte array is passed as input to the Categorize codec, every value "misses" because no byte matches a label. On read, the Categorize filter decodes the data to an object array of empty strings, which is not what the VLenUTF8 filter expects.

import numcodecs
import numpy as np
import xarray as xr


da = xr.DataArray(coords={"x": ("x", np.array(["a", "b", "c"], dtype="object"))}, dims=("x",))

c1 = numcodecs.VLenUTF8()
x = c1.encode(da.x.data)
print(repr(x))    # bytearray(b'\x03\x00\x00\x00\x01\x00\x00\x00a\x01\x00\x00\x00b\x01\x00\x00\x00c')

c2 = numcodecs.Categorize(labels=da.x.data, dtype=object)
y = c2.encode(x)  # No matches, since x is a byte array
print(repr(y))  # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)

z = c2.decode(y)
print(repr(z))  # array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], dtype=object)

c1.decode(z)  # boom

Is there anything we can do about this? Being able to encode variable-length data has been pretty useful, but it doesn't seem to play well with cascading filters.
