I have one feature request: use Polars for loading and dumping data. Polars is a blazingly fast DataFrame library implemented in Rust, using the Apache Arrow columnar format as its memory model.
If this library supported it, it would speed up the machine learning cycle even more.
Implementation idea
I have tried a very simple implementation for parquet files here.
The changes are as follows.
Add a config module as gokart/config, together with an `__init__.py` for this module.
Create config.py in gokart/config. This file contains a `_global_config` variable and the `register_option`, `get_option`, and `set_option` functions. `_global_config` holds the global settings as a dictionary and is manipulated through those functions. (Currently, only the `use_polars` option is registered in `_global_config` by config_init.py.)
Create config_init.py in gokart/config. This file initializes `_global_config`.
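For concreteness, here is a minimal sketch of what gokart/config/config.py could contain. The function bodies below are my assumption about a plausible implementation, not actual gokart code:

```python
# gokart/config/config.py -- minimal sketch (assumed implementation)
_global_config = {}        # option name -> current value
_registered_options = {}   # option name -> (default value, doc string)


def register_option(key, default, doc=""):
    # Register a new option with its default value and documentation.
    _registered_options[key] = (default, doc)
    _global_config[key] = default


def get_option(key):
    # Return the current value of a registered option.
    if key not in _global_config:
        raise KeyError(f"No such option: {key!r}")
    return _global_config[key]


def set_option(key, value):
    # Update the value of a registered option.
    if key not in _registered_options:
        raise KeyError(f"No such option: {key!r}")
    _global_config[key] = value
```

This mirrors the register/get/set pattern pandas uses for its own options.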
```python
# gokart/config/config_init.py
import gokart.config.config as cf

use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
```
Modify the ParquetFileProcessor class in gokart/file_processor.py to load and dump data with Polars when the "use_polars" option is True.
```python
class ParquetFileProcessor(FileProcessor):
    ...

    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        else:
            return pd.read_parquet(file.name)

    def dump(self, obj, file):
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')
```
I am not very familiar with best practices for such an option, but if you comment on what needs to be fixed, I can work on it and open a pull request.
@takeyama0 Thanks for your suggestion and implementation idea!
I'm positive about supporting Polars for the good performance you describe.
IMO, I would move pandas and polars to Python extras and raise an ImportError when users use pandas/polars features without having installed the corresponding package.
This is because I think no application uses both pandas and polars.
@takeyama0 Thanks for your suggestion! I think it’s great to support Polars too.
And I basically agree with @hirosassa 's idea to minimize dependencies, but I'm a little concerned about moving pandas to extras because some common methods (like TaskOnKart.load_data_frame) already use pandas.