plateau

flat files, flat land

plateau is a Python library for managing (creating, reading, updating, deleting) large amounts of tabular data in a blob store. It stores data as datasets and presents them to the user as pandas DataFrames. A dataset is a collection of files with the same schema that reside in a blob store. plateau uses a metadata definition to handle these datasets efficiently. For distributed access to and manipulation of datasets, plateau offers a Dask interface.

Storing data distributed over multiple files in a blob store (S3, ABS, GCS, etc.) allows for a fast, cost-efficient and highly scalable data infrastructure. A downside of storing data solely in a blob store is that the store itself gives little to no guarantees beyond the consistency of a single file. In particular, it cannot guarantee the consistency of a dataset that spans multiple files. If we demand a consistent state of our dataset at all times, we need to track that state ourselves. plateau frees us from having to do this manually.
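To make the consistency problem concrete, here is a minimal sketch of tracking dataset state by hand. It is not plateau's implementation: the dict-based `store`, the `manifest.json` key and the helper names `write_dataset` and `read_dataset` are all hypothetical, chosen only for illustration. The idea is that data files are written first and become visible only through a single manifest write, which is the one operation a blob store can keep consistent.

```python
import json

# Hypothetical in-memory stand-in for a blob store: key -> bytes.
store = {}


def write_dataset(store, name, partitions):
    """Write all partition files, then publish them with one manifest write."""
    keys = []
    for i, payload in enumerate(partitions):
        key = f"{name}/part-{i}.csv"
        store[key] = payload.encode()
        keys.append(key)
    # Writing a single object is the only consistent step we can rely on,
    # so the manifest write acts as the commit point.
    store[f"{name}/manifest.json"] = json.dumps({"files": keys}).encode()


def read_dataset(store, name):
    """Return the payloads of exactly those files listed in the manifest."""
    manifest = store.get(f"{name}/manifest.json")
    if manifest is None:
        return []  # nothing has been committed yet
    return [store[key].decode() for key in json.loads(manifest)["files"]]


write_dataset(store, "sales", ["a,1", "b,2"])
# A stray file from an interrupted writer is never listed, so readers ignore it.
store["sales/part-99.csv"] = b"half-written"
print(read_dataset(store, "sales"))
```

The sketch only covers creating a dataset. Updates would additionally need file names that never collide with committed ones, which is part of the bookkeeping a library like plateau takes over.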

The plateau.io module provides building blocks to create and modify these datasets in data pipelines. plateau handles I/O, tracks dataset partitions and selects subsets of data transparently.
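The following sketch illustrates what selecting a subset of data via tracked partitions can mean in practice. It does not use plateau's API, and the `column=value` path layout as well as the helpers `partition_values` and `select_files` are assumptions made for this example. The point is that partition information kept alongside the file list lets a reader skip files without opening them.

```python
def partition_values(key):
    """Parse 'sales/country=DE/year=2024/part.parquet' into a dict of partition values."""
    return dict(
        segment.split("=", 1) for segment in key.split("/") if "=" in segment
    )


def select_files(keys, **wanted):
    """Keep only the keys whose partition values match every requested value."""
    return [
        key
        for key in keys
        if all(partition_values(key).get(col) == val for col, val in wanted.items())
    ]


# Hypothetical file listing for one dataset.
keys = [
    "sales/country=DE/year=2023/part.parquet",
    "sales/country=DE/year=2024/part.parquet",
    "sales/country=FR/year=2024/part.parquet",
]
print(select_files(keys, country="DE", year="2024"))
```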

Installation

This project is managed by pixi. You can install the package in development mode using:

git clone https://github.com/data-engineering-collective/plateau
cd plateau

pixi run pre-commit-install
pixi run postinstall
pixi run test

plateau is also available on PyPI and can be installed with pip:

pip install plateau

Contributing

Find details on how to contribute here.