
Pick a movement-native file format #341

Open
Tracked by #13
niksirbi opened this issue Nov 6, 2024 · 12 comments

Comments

@niksirbi (Member) commented Nov 6, 2024

This write-up was prompted by this zulip topic.

The problem

We have so far taken a pluralistic approach to file formats, i.e. we load and write to multiple formats (as interoperability is at the core of our mission). That said, our existing saving functions are essentially limited to DeepLabCut and SLEAP files, which means we can only save pose tracks + associated confidence scores. We are also in the process of adding support for ndx-pose, but the scope of ndx-pose is also limited to pose estimates (and their associated training data).

I think it's high time to decide on a movement-native format for saving our datasets to file. Requirements for this file format:

  • allow users to save any type of movement dataset - including poses, bounding boxes, their associated metadata, as well as any variables/metrics derived from them (e.g. speed, head direction, etc.).
  • enable developers/users of other tools to exchange data with movement
  • ideally support compression, chunking, parallelisation (important for big datasets).

Note

This is not an issue about deciding on a "field standard for animal tracking data" - we'll continue looking out for and supporting community efforts on that front (like ndx-pose). It's more about saving the current state of movement datasets - i.e. including tracking data, metadata, as well as derived variables and metrics.

Candidate formats

These are the ones I've thought of so far, feel free to add to this list.

xarray-supported formats: netCDF-4 and zarr

netCDF files are essentially HDF5 with a specific data model (see this paper about the HDF5-netCDF relationship). This format is popular in the geosciences, especially for atmospheric and oceanographic data, but should support any grouping of scientific arrays with metadata. This would be the easiest and most natural format for us, given that xarray was explicitly built around the netCDF data model and offers a built-in to_netcdf() method.

Pros:

  • no need for data wrangling, we can (probably) just save/load the xarray datasets to/from disk as they are.
  • support for hierarchical data organisation, compression, chunking.

Cons:

  • Unfamiliarity with the format in neuroscience, ethology and related fields. Our users are unlikely to know what a .nc file is.
  • All HDF5-based formats are a bit opaque - i.e. you can't just double-click a file to quickly inspect and edit its contents.

xarray also offers methods to save data to zarr, see existing issue. I won't discuss zarr separately, because its pros and cons are similar to netCDF.

In summary, if we go for an HDF5-like format, it should be netCDF-4, and we might as well offer the zarr option.

Parquet

See existing issue, and the related discussion in the idtracker GitLab.

Apache Parquet is an open-source, columnar storage file format designed for efficient data storage and retrieval. It's supported in Python via the pyarrow library.

It supports compression, metadata storage (probably), and allows for efficient read access. It's favoured by @roaldarbol who develops animovement. The cons are the same as for netCDF and zarr, plus we'd have to implement functions for going back-and-forth between the xarray and the "tidy" representations.

csv

This is the most 'transparent' option: almost everyone (in research) is familiar with it, and users can easily inspect and edit the files without installing any software (a text editor is enough).

Its cons should be obvious from the above discussion: no compression, chunking, or metadata support.

We already sort-of support csv, since save_poses.to_dlc_file() can write DLC-style csv files. But as discussed in the Parquet issue, we'd ideally want a "tidy" dataframe (in pandas) which we can then export to either Parquet or csv formats (which can in turn be read by animovement).
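To make the "tidy" idea concrete, here is a minimal sketch (using a toy dataset with illustrative dimension and variable names, not movement's actual API) of flattening an xarray dataset into a tidy dataframe and exporting it:

```python
import numpy as np
import xarray as xr

# Toy dataset mimicking a movement poses dataset (names are illustrative)
ds = xr.Dataset(
    {
        "position": (
            ("time", "individuals", "keypoints", "space"),
            np.random.rand(3, 2, 2, 2),
        )
    },
    coords={
        "time": np.arange(3),
        "individuals": ["mouse0", "mouse1"],
        "keypoints": ["snout", "tail"],
        "space": ["x", "y"],
    },
)

# Flatten to a "tidy" dataframe: one row per (time, individual, keypoint, space)
tidy = ds.to_dataframe().reset_index()

# csv export works out of the box
tidy.to_csv("poses_tidy.csv", index=False)

# Parquet export would need pyarrow (or fastparquet) installed:
# tidy.to_parquet("poses_tidy.parquet", index=False)
```

Going back the other way (tidy dataframe to xarray) is roughly `df.set_index([...]).to_xarray()`, though attrs/metadata would need separate handling - which is the extra work mentioned above.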

My current take

  • I think netCDF should be the default movement-native format, as it would allow us to seamlessly read/write all the info contained within xarray objects, with little to no extra work (I think - this remains to be tried). It should be best for "internal" uses, i.e. writing intermediate derivatives (e.g. filtered data) and loading them later for other analysis steps.
  • That said, we should also support a "tidy" dataframe format (with Parquet as the representation on disk), which will be useful for exchanging data with animovement (+ maybe idtracker), and is perhaps more intuitive to some people (compared to multi-dimensional labelled arrays). There should also be an option to export the tidy dataframe to csv. Despite csv's inefficiency (and clunkiness when it comes to storing metadata), I expect many users will be eager to just open the file in Excel / Google Sheets and the like.
@adamltyson (Member)

Is the aim to pick:

  • An existing standard that becomes the default (e.g. NWB) OR
  • A minimal way of saving the current state of the dataset (e.g. parquet, zarr etc etc)

@niksirbi (Member, Author) commented Nov 6, 2024

Sorry, accidentally opened this issue without a description. I'm in the process of editing it now (it's going to be extensive).

EDIT: Issue description has been updated now.

@adamltyson (Member)

FWIW @niksirbi I agree with your summary. Although I fully expect to agree with the next person who makes a reasoned argument!

I expect this issue to be less important over time as we gradually support more formats.

@luiztauffer

Just putting myself here to follow the thread. This will be very relevant to VAME, as we might adopt movement's standard for VAME's intermediate data steps as well as for data ingestion.

@niksirbi (Member, Author) commented Nov 25, 2024

I did a little experiment to test saving a movement dataset to a netCDF file, and it works as expected, apart from the fact that the attributes have to be made serialisable first. The attrs could be sanitised with a thin wrapper, or alternatively we could take care to only define attrs in serialisable formats to begin with.

import tempfile
from pathlib import Path

import xarray as xr

from movement import sample_data

ds = sample_data.fetch_dataset("SLEAP_three-mice_Aeon_proofread.analysis.h5")
print(ds)

# A temporary path to save the data
temp_dir = tempfile.TemporaryDirectory()
temp_dir_path = Path(temp_dir.name)
save_path = temp_dir_path / "saved_data.nc"

# Make all attrs serializable (for netCDF)
for key, value in ds.attrs.items():
    # Convert Path objects to strings
    if isinstance(value, Path):
        ds.attrs[key] = str(value)
    # Convert None to empty string
    elif value is None:
        ds.attrs[key] = ""

# Save the data to a netCDF file
ds.to_netcdf(save_path)

# Load the saved data from the netCDF file
loaded_ds = xr.load_dataset(save_path)
print(loaded_ds)

# Check that the loaded dataset is identical to the one we saved
assert ds.identical(loaded_ds)

# Clean up
temp_dir.cleanup()
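The "thin wrapper" mentioned above could look something like this - a hypothetical sketch, not part of movement's API, which only handles the two cases from the experiment (Path objects and None):

```python
from pathlib import Path

import xarray as xr


def sanitise_attrs(ds: xr.Dataset) -> xr.Dataset:
    """Return a copy of ds whose attrs are netCDF-serialisable.

    Hypothetical helper: only converts Path -> str and None -> "";
    other non-serialisable attr types would need their own rules.
    """
    ds = ds.copy()
    # Make sure we mutate a fresh attrs dict, not the original's
    ds.attrs = dict(ds.attrs)
    for key, value in ds.attrs.items():
        if isinstance(value, Path):
            ds.attrs[key] = str(value)
        elif value is None:
            ds.attrs[key] = ""
    return ds
```

A `save` function could then call `sanitise_attrs(ds).to_netcdf(path)`, leaving the in-memory dataset untouched.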

@vigji (Collaborator) commented Dec 6, 2024

Jumping in, since movement is getting increasingly integrated into my pipelines and this is becoming important. I am not sure this will become less and less relevant as you support an increasing number of formats @adamltyson -- I/O to any of them always comes at the risk of sacrificing some of the (especially meta-)data to adhere to the target data format. IMHO, a pipeline that wants to be movement-powered and input-agnostic should have a way of storing datasets in a movement-native fashion, without striving for universality.

I think it makes sense to keep up dual support: for netCDF, which feels the most natural and would require very little wrapping and sanitising around a movement-independent I/O library (an important feature); and for csv, whose tabular representation could be useful anyway, as you noted in #307, and whose universality will clearly appeal to anyone planning long-term, Python-agnostic storage.

@adamltyson (Member)

@vigji agree, this is my view, and I think @niksirbi agrees (?).

In my mind, netCDF should be the default "Save". Loading this file should get you back to the exact same state you were in before. All the other formats (and I hope there will be plenty) should be analogous to "Save as" or "Export".

@niksirbi (Member, Author) commented Dec 6, 2024

Agreed!

From my readings and experiment so far, I'd say netCDF should be the movement-native format, as in it should be how movement objects reside on disk, and there's a natural 1-to-1 mapping between netCDF and xarray (because they were made for each other).

All other formats, including some kind of tidy csv, are formats we can interoperate with (import/export), but each of them will only be able to express a subset of the information that movement can hold/produce.

@luiztauffer

+1 for netCDF. I found it easy to operate with netCDF for xarrays.

I've had, however, problems with the netCDF4 backend in different environments, the only backend that worked consistently for me was scipy. Something to consider.

@niksirbi (Member, Author) commented Dec 6, 2024

I've had, however, problems with the netCDF4 backend in different environments, the only backend that worked consistently for me was scipy. Something to consider.

By backend you mean the "engine"?

engine ({"netcdf4", "scipy", "h5netcdf"}, optional) – Engine to use when writing netCDF files. If not provided, the default engine is chosen based on available dependencies, with a preference for ‘netcdf4’ if writing to a file on disk.

We could choose to default to the scipy engine in our "save" function.
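For reference, pinning the engine is a one-argument change. A minimal sketch on a toy dataset follows; one caveat worth weighing is that xarray's scipy engine writes the older netCDF-3 (classic) format, which lacks netCDF-4 features such as groups and compression, so defaulting to it would trade features for portability:

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Toy dataset standing in for a movement dataset
ds = xr.Dataset(
    {"position": (("time", "space"), np.arange(10.0).reshape(5, 2))},
    coords={"time": np.arange(5), "space": ["x", "y"]},
)

with tempfile.TemporaryDirectory() as tmp:
    save_path = Path(tmp) / "example.nc"
    # Force the scipy engine; this produces netCDF-3 (classic) output
    ds.to_netcdf(save_path, engine="scipy")
    # load_dataset reads eagerly, so the file can be cleaned up afterwards
    loaded = xr.load_dataset(save_path)
```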

What do you mean by "different environments"? I'm curious to hear about it if you don't mind sharing.

@luiztauffer

By backend you mean the "engine"?

yes

What do you mean by "different environments"? I

Different Python environments. I didn't dig much into the issue, since the scipy engine solved it for me. But basically I was getting errors when writing (and sometimes reading) in different Python environments when trying to use netcdf4 (with netcdf4 pip-installed).

@niksirbi (Member, Author) commented Dec 6, 2024

I see, we'll try to cover the movement-to-netcdf saving with tests, to see if we get any such issues.
