time machine strategy for validation dataset #101
Replies: 2 comments 5 replies
-
Time Machine Strategy -- it even sounds cool 😄 PheKnowLator is part of a continuous evaluation-based challenge (CERLIB) that is doing something like this.
-
Hi @fmellomascarenhas @sbs87, I am a fellow PheKnowLator enthusiast working on generating mechanistic pathways for natural product-drug interactions, and your time machine strategy sounds really interesting! I am trying to do something similar with my graph (PheKnowLator + machine reading triples) to keep track of the timestamp, and I am curious whether your strategy above worked as you hoped. My initial attempts have been geared more toward approach #2 (mentioned above), trying to add a timestamp as metadata in the graph. Thanks for starting this discussion!
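For what it's worth, here is a minimal sketch of what I mean by "timestamp as metadata": each triple carries a publication date, and the graph at any point in time is just a filter over that field. The node IDs, predicate names, and dates below are made up for illustration, not real PheKnowLator data.

```python
from datetime import date

# Hypothetical triples with a publication-date field attached as metadata.
# The identifiers and dates are illustrative only.
triples = [
    {"s": "CHEBI:27732", "p": "interacts_with", "o": "PR:000001234", "date": date(2012, 6, 1)},
    {"s": "CHEBI:27732", "p": "associated_with", "o": "HP:0002315", "date": date(2018, 3, 15)},
]

def triples_up_to(triples, cutoff):
    """Return only the triples published on or before the cutoff date."""
    return [t for t in triples if t["date"] <= cutoff]

# The "graph as of end of 2013" is just a filtered view of the full graph.
pre_2014 = triples_up_to(triples, date(2013, 12, 31))
```

The appeal of this shape is that one graph build serves every cutoff date, instead of rebuilding a separate archived graph per snapshot.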
-
Hi,
When applying ML models to graphs, it is hard to create validation datasets, since nodes can be so densely interconnected that data leakage feels more like a feature than a bug.
After thinking through different strategies for creating a decent validation dataset, the only one that made sense to me was a time machine concept. To do it, I imagine there are two approaches:
1. Build two graphs: one using current data, and one using old data (let's say up to 2014). The validation set then becomes all the new edges from 2014 to the current date. The main challenges would be pairing node IDs that might have changed, and finding archived datasets.
2. Store some sort of timestamp, and just filter the validation set based on it. This approach seems much simpler if a timestamp is available (for example, the date of publication).
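The second approach can be sketched in a few lines: given edges annotated with a date, a single cutoff splits them into a training graph ("the world as of 2014") and a validation set (everything discovered afterwards). The edges and dates here are invented for illustration; they are not real PheKnowLator content.

```python
from datetime import date

# Hypothetical (subject, object, discovery-date) edges -- illustrative only.
edges = [
    ("gene_A", "disease_X", date(2010, 5, 2)),
    ("gene_B", "disease_X", date(2013, 11, 20)),
    ("gene_A", "disease_Y", date(2016, 7, 9)),
]

cutoff = date(2014, 1, 1)

# Training graph: everything known before the cutoff ("time machine" snapshot).
train = [(s, o) for s, o, d in edges if d < cutoff]
# Validation set: edges discovered after the cutoff, so the model cannot
# have seen them -- this is what avoids the leakage problem above.
validation = [(s, o) for s, o, d in edges if d >= cutoff]
```

This is why approach 2 hinges entirely on a usable timestamp being available per edge; without it, you are back to diffing two separately built graphs as in approach 1.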
In your opinion, would this be hard to do with PheKnowLator?
Thank you so much!