Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] add support for .lmdb #261

Open
S-aiueo32 opened this issue Dec 17, 2021 · 4 comments
Open

[Feature Request] add support for .lmdb #261

S-aiueo32 opened this issue Dec 17, 2021 · 4 comments
Labels

Comments

@S-aiueo32
Copy link

I would like to add .lmdb to the file formats supported by TaskOnKart.make_target(). .lmdb is the format used by several popular datasets, and is actually suitable for handling large datasets (especially images).

For those who are not familiar with .lmdb, here is a brief explanation of its usage. .lmdb is a Key-Value store whose values can be retrieved via the lmdb.Environment object:

import lmdb

# open environment
env = lmdb.open('foo')

# put item
key: bytes = ...
value: bytes = ...
with env.begin(write=True) as txn:
    txn.put(key, value)

# get item
with env.begin() as txn:
    _value = txn.put(key)
    assert value == _value

The following changes are necessary to support .lmdb:

  1. Add LmdbFileProcessor:
class LmdbFileProcessor(FileProcessor):

    def __init__(self, file_path: str) -> None:
        self.__file_path = file_path
        super(LmdbFileProcessor, self).__init__()

    @property
    def env(self) -> lmdb.Environment:
        # To avoid creating `env` in the constructor for paths other than .lmdb. 
        if not hasattr(self, '__env'):
            self.__env = lmdb.open(self.__file_path, subdir=False, lock=True)
        return self.__env

    def load(self, file):
        with self.env.begin(write=False) as txn:
            for k, v in txn.cursor():
                yield k, v

    def dump(self, obj):
        key, value = obj
        with self.env.begin(write=True) as txn:
            txn.put(key, value)

...

def make_file_processor(file_path: str, store_index_in_feather: bool) -> FileProcessor:
    extension2processor = {
        ...
        '.lmdb': LmdbFileProcessor(file_path)
    }
  1. Prevent SingleFileTarget from opening files before dumping.
    def _dump(self, obj) -> None:
        if self._processor.__class__.__name__ == 'LmdbFileProcessor':
            self._processor.dump(obj)
        else:
            with self._target.open('w') as f:
                self._processor.dump(obj, f)
  1. Write tasks!
class DownloadImages(GokartTask):
    def run(self):
        image_ids = self.load()
        for image_id in image_ids:
            image: bytes = imread(image_id)
            self.dump((image_id.encode(), image))

    def output(self):
        return self.make_target(f'{self.__class__.__name__}.lmdb')

class LoadImages(GokartTask):
    def requires(self):
        return DownloadImages()

    def run(self):
        for key, value in self.load():  # load keys and values of previous db
            # do something with the images
            ...

They work perfectly in my environment, but there are a few untested concerns:

  1. It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().
  2. It may cause problems with parallelization.

I would like to commit what I have sorted out here as a Pull Request, and I would like to discuss it in this issue.

@Hi-king
Copy link
Member

Hi-king commented Dec 17, 2021

Cool feature :)

Mostly I agree to this proposal.

It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().

I think this is not a problem. With current implementation, we can dump multiple times with named targets as follows

class Task(GokartTask):
    def requires(self):
        return dict(
           a=self.make_target("a.pkl"),
           b=self.make_target("b.pkl"),
       )

    def run(self):
      self.dump(1, "a.pkl")
      self.dump(2, "b.pkl")

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

@S-aiueo32
Copy link
Author

S-aiueo32 commented Dec 21, 2021

Thank you for your reply.

For now, I have implemented a way to use SingleFileTarget to support lmdb, and passed the test locally.
https://github.com/S-aiueo32/gokart/tree/4d65e5f359a44d4113945dd82fc2c3171fa973da
In LocalTargetTest, there was an error around the locking of lmdb, so I made sure not to lock it when opening lmdb.environment (probably not a problem since it is locked in other parts of gokart).

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

I see. It is indeed awkward to have a conditional branch depending on the processor type in SingleFileTarget. It would be better to create a new LmdbTarget that inherits from SingleFileTarget, and rewrite load and dump.

If we create a new LmdbTarget, do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target?

@mski-iksm
Copy link
Contributor

@S-aiueo32
Thank you for proposing great feature!

It would be better to create a new LmdbTarget that inherits from SingleFileTarget

I think LmdbTarget should inherite TargetOnKart just like SingleFileTarget or ModelTarget, not to make overcomplicating inheritance dependencies.

do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target

I think creating new function make_lmdb_target is better, just like existing make_model_target.

@hirosassa
Copy link
Collaborator

@S-aiueo32 Hi,
This is just a friendly reminder (not urgent). What's the status of this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants