I have one feature request: use Polars for loading and dumping data. Polars is a blazingly fast DataFrame library implemented in Rust, using the Apache Arrow columnar format as its memory model.
If this library supported it, it would speed up the machine learning cycle even more.
Implementation idea
I have tried a very simple implementation for parquet files here.
The changes are as follows.
Add a config module as gokart/config, together with an `__init__.py` for this module.
Create config.py in gokart/config. This file contains a `_global_config` variable and the `register_option`, `get_option`, and `set_option` functions. `_global_config` holds the global settings as a dictionary and is manipulated through those functions. (Currently, only the `use_polars` option is registered in `_global_config` by config_init.py.)
Create config_init.py in gokart/config. This file initializes `_global_config`.
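For concreteness, here is a minimal sketch of what gokart/config/config.py could contain. The function bodies below are my assumption about a plausible implementation, not actual gokart code:

```python
# gokart/config/config.py -- minimal sketch (assumed implementation)
_global_config = {}        # option name -> current value
_registered_options = {}   # option name -> (default value, doc string)


def register_option(key, default, doc=""):
    # Register a new option with its default value and documentation.
    _registered_options[key] = (default, doc)
    _global_config[key] = default


def get_option(key):
    # Return the current value of a registered option.
    if key not in _global_config:
        raise KeyError(f"No such option: {key!r}")
    return _global_config[key]


def set_option(key, value):
    # Update the value of a registered option.
    if key not in _registered_options:
        raise KeyError(f"No such option: {key!r}")
    _global_config[key] = value
```

This mirrors the register/get/set pattern pandas uses for its own options.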
```python
# gokart/config/config_init.py
import gokart.config.config as cf

use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
```
Modify the ParquetFileProcessor class in gokart/file_processor.py to load and dump data with Polars when the "use_polars" option is True.
```python
class ParquetFileProcessor(FileProcessor):
    ...

    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        else:
            return pd.read_parquet(file.name)

    def dump(self, obj, file):
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')
```
I am not very familiar with best practices for such an option, but if you comment on what needs to be fixed, I can work on it and open a pull request.
@takeyama0 Thanks for your suggestion and implementation idea!
I'm positive about supporting Polars for the good performance you describe.
IMO, I would move pandas and polars to Python extras and raise an ImportError when users use pandas/polars features without having installed the corresponding package.
This is because I think no application uses both pandas and polars.
@takeyama0 Thanks for your suggestion! I think it’s great to support Polars too.
And I basically agree with @hirosassa 's idea to minimize dependencies, but I'm a little concerned about moving pandas to extras because some common methods (like TaskOnKart.load_data_frame) already use pandas.