time machine strategy for validation dataset #101
Replies: 2 comments 5 replies
-
Time Machine Strategy -- it even sounds cool 😄 PheKnowLator is part of a continuous evaluation-based challenge (CERLIB) that is doing something like this.
-
Hi @fmellomascarenhas @sbs87, I am a fellow PheKnowLator enthusiast working on generating mechanistic pathways for natural product-drug interactions, and your time machine strategy sounds really interesting! I am trying to do something similar with my graph (PheKnowLator + machine reading triples) to keep track of the timestamp, and I am curious whether your strategy above worked as you hoped. My initial attempts have been geared more toward approach #2 (mentioned above), trying to add a timestamp as metadata in the graph. Thanks for starting this discussion!
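For what it's worth, here is a minimal sketch of what I mean by "timestamp as metadata": each triple carries a publication date, and the graph at any point in time is just a filter over that field. The node IDs, predicate names, and dates below are made up for illustration, not real PheKnowLator data.

```python
from datetime import date

# Hypothetical triples with a publication-date field attached as metadata.
# The identifiers and dates are illustrative only.
triples = [
    {"s": "CHEBI:27732", "p": "interacts_with", "o": "PR:000001234", "date": date(2012, 6, 1)},
    {"s": "CHEBI:27732", "p": "associated_with", "o": "HP:0002315", "date": date(2018, 3, 15)},
]

def triples_up_to(triples, cutoff):
    """Return only the triples published on or before the cutoff date."""
    return [t for t in triples if t["date"] <= cutoff]

# The "graph as of end of 2013" is just a filtered view of the full graph.
pre_2014 = triples_up_to(triples, date(2013, 12, 31))
```

The appeal of this shape is that one graph build serves every cutoff date, instead of rebuilding a separate archived graph per snapshot.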
-
Hi,
When applying ML models to graphs, it is hard to create validation datasets, since nodes can be so densely interconnected that data leakage feels more like a feature than a bug.
After thinking through different strategies for creating a decent validation dataset, the only one that made sense to me was a time machine concept. To do it, I imagine there are two approaches:
1. Build two graphs: one using current data, and one using old data (let's say up to 2014). The validation set then becomes all the new edges from 2014 to the current date. The main challenges would be pairing node IDs that might have changed, and finding archived datasets.
2. Store some sort of timestamp, and just filter the validation set based on it. This approach seems much simpler if a timestamp is available (for example, the date of publication).
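The second approach can be sketched in a few lines: given edges annotated with a date, a single cutoff splits them into a training graph ("the world as of 2014") and a validation set (everything discovered afterwards). The edges and dates here are invented for illustration; they are not real PheKnowLator content.

```python
from datetime import date

# Hypothetical (subject, object, discovery-date) edges -- illustrative only.
edges = [
    ("gene_A", "disease_X", date(2010, 5, 2)),
    ("gene_B", "disease_X", date(2013, 11, 20)),
    ("gene_A", "disease_Y", date(2016, 7, 9)),
]

cutoff = date(2014, 1, 1)

# Training graph: everything known before the cutoff ("time machine" snapshot).
train = [(s, o) for s, o, d in edges if d < cutoff]
# Validation set: edges discovered after the cutoff, so the model cannot
# have seen them -- this is what avoids the leakage problem above.
validation = [(s, o) for s, o, d in edges if d >= cutoff]
```

This is why approach 2 hinges entirely on a usable timestamp being available per edge; without it, you are back to diffing two separately built graphs as in approach 1.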
In your opinion, would this be hard to do with PheKnowLator?
Thank you so much!