Hello, thank you for developing such a great tool!
Summary
I have one feature request: validate the value passed to TaskInstanceParameter() at runtime against a class bound, like the following:
import gokart

class UpstreamBase(gokart.TaskOnKart): ...
class UpstreamA(UpstreamBase): ...
class UpstreamB(UpstreamBase): ...

class Example(gokart.TaskOnKart):
    upstream_task = gokart.TaskInstanceParameter(bound=UpstreamBase)
    ...
    def requires(self):
        return self.upstream_task  # guaranteed to be an instance of UpstreamBase (or a subclass) at runtime
    ...
Detail
A more concrete motivating example is as follows. Suppose we want to perform some feature-embedding pre-processing followed by an actual data analysis task on top of it. Since we would like to empirically compare which embedding method is better, consider making this pre-processing task (the upstream task) abstract and injecting the actual choice as a TaskInstanceParameter(). Illustrative pipelines look like this:
In such a situation, we may want to restrict the task instance injected as the upstream task to tasks with a specific output format, rather than allowing every task that is defined. The current default behavior of TaskInstanceParameter() can cause bugs because it accepts an instance of any subclass of luigi.Task without raising an exception.
With a suitable upstream task such as PCAEmbed or IsomapEmbed, the pipeline works fine. However, if we inject the wrong upstream task (intentionally, in this example), the failure only surfaces when the downstream task makes an invalid API access on its input data, because of duck typing. If that access happens only after a very long process in the downstream task (e.g., NN training), we won't notice the problem until the error is finally raised.
>>> gokart.build(DownstreamTask(upstream_task=SayHello()))
start downstream task  # DownstreamTask().run() started without error!
ERROR: [pid 2305701] Worker Worker(...) failed    DownstreamTask(upstream_task=SayHello(251d2defb17d8f40d3dfb3128ef72945))
Traceback (most recent call last):
  ...
    print(f'data shape: {self.load().shape}')
AttributeError: 'str' object has no attribute 'shape'
...
It would be better if the problem could be detected at the initialization of the task.
>>> DownstreamTask(upstream_task=SayHello())
DownstreamTask(upstream_task=SayHello(251d2defb17d8f40d3dfb3128ef72945))  # we want to raise an exception here!
Implementation idea
luigi.Parameter() provides normalize(v) to normalize and validate injected parameters at runtime (spotify/luigi#1273). This method is executed when a task object is instantiated. Therefore, it seems that the subclass check can be done by adding the following implementation to TaskInstanceParameter().
from typing import Optional

import luigi


class TaskInstanceParameter(luigi.Parameter):
    def __init__(self, *args, bound: Optional[type] = None, **kwargs):
        super().__init__(*args, **kwargs)
        # Every injected value must be a luigi.Task; `bound` narrows this further.
        self._bound = [luigi.Task]
        if bound is not None:
            if isinstance(bound, type):
                self._bound.append(bound)
            else:
                raise ValueError(f'bound must be a type, not {type(bound)}')

    def normalize(self, v):
        # Called when the task is instantiated, so a wrong type fails early.
        for t in self._bound:
            if not isinstance(v, t):
                raise ValueError(f'{v} is not an instance of {t}')
        return v

    ...
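To make the intended behavior easy to try without installing luigi or gokart, here is a dependency-free sketch of the same check. Task, BoundedInstanceParameter, and the hand-written Example.__init__ are hypothetical stand-ins, not the real luigi/gokart classes; the constructor only imitates the step where normalize() is applied to an injected value. It is meant to illustrate the idea, not to serve as the final implementation.

```python
from typing import Optional


class Task:
    """Hypothetical stand-in for luigi.Task (not the real class)."""


class BoundedInstanceParameter:
    """Hypothetical stand-in for the proposed TaskInstanceParameter(bound=...)."""

    def __init__(self, bound: Optional[type] = None):
        if bound is not None and not isinstance(bound, type):
            raise ValueError(f'bound must be a type, not {type(bound)}')
        # Every value must be a Task; `bound` narrows that further.
        self._bound = [Task] if bound is None else [Task, bound]

    def normalize(self, v):
        # Reject the value as soon as one of the required types does not match.
        for t in self._bound:
            if not isinstance(v, t):
                raise ValueError(f'{v!r} is not an instance of {t.__name__}')
        return v


class UpstreamBase(Task): ...
class UpstreamA(UpstreamBase): ...
class SayHello(Task): ...  # a Task, but outside the UpstreamBase hierarchy


class Example(Task):
    upstream_task = BoundedInstanceParameter(bound=UpstreamBase)

    def __init__(self, upstream_task):
        # Assumption: this imitates by hand the point at which the parameter's
        # normalize() would be applied during task instantiation.
        self.upstream_task = type(self).upstream_task.normalize(upstream_task)


ok = Example(UpstreamA())  # accepted: UpstreamA is a subclass of UpstreamBase

try:
    Example(SayHello())  # rejected at construction time, before any run()
except ValueError as e:
    print(f'rejected early: {e}')
```

With this shape, the wrong upstream task is rejected when the downstream task object is built, rather than deep inside run().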
I am happy to create a PR for this if the proposal and implementation idea seem reasonable to you.
Thank you.