Create Edge-Type Attribute File #99

Open
callahantiff opened this issue May 11, 2021 · 15 comments · May be fixed by #122

@callahantiff
Owner

callahantiff commented May 11, 2021

Task

Add an output file to accompany the node metadata created for each build that provides users with an easy way to identify what type each triple is.

Design

The file should at a minimum contain the following information:

  • Triple Identifier: This file could use the subject, predicate, and object of each triple as an identifier or, preferably, index each triple with an integer identifier that aligns with the integer-based edge lists (for n-triple files we can create this as a named graph).
  • Edge Type: String label recording edge type, current types are shown here
  • Pattern: Relation type information like symmetry, transitivity, composition, inversion
  • Category: Relation type information like N:1, 1:N, 1:1
  • Weight1: Values that can be used as an edge weight. Will consider including multiple types of weight, if available, some of which could be derived from the evidence used in the original source when the edge was created
  • Weight1_Type: String that specifies the type for Weight1
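
As a rough illustration of the Category column, the sketch below derives a cardinality label for each relation directly from a list of triples. The function name `relation_category`, the toy triples, the convention that 1:N means one subject pointing to many objects, and the extra N:M label are all assumptions made for this example; they are not part of the proposal above.

```python
from collections import defaultdict


def relation_category(triples):
    """Label each relation 1:1, 1:N, N:1, or N:M from (subject, relation, object) triples."""
    objs_per_subj = defaultdict(lambda: defaultdict(set))
    subjs_per_obj = defaultdict(lambda: defaultdict(set))
    for subj, rel, obj in triples:
        objs_per_subj[rel][subj].add(obj)
        subjs_per_obj[rel][obj].add(subj)

    categories = {}
    for rel in objs_per_subj:
        # A relation is "many" on a side as soon as one node has more than one partner.
        many_objs = any(len(objs) > 1 for objs in objs_per_subj[rel].values())
        many_subjs = any(len(subjs) > 1 for subjs in subjs_per_obj[rel].values())
        if many_objs and many_subjs:
            categories[rel] = "N:M"
        elif many_objs:
            categories[rel] = "1:N"
        elif many_subjs:
            categories[rel] = "N:1"
        else:
            categories[rel] = "1:1"
    return categories


# Hypothetical toy triples, not taken from a real build.
toy_triples = [
    ("gene_1", "causes", "disease_1"),
    ("gene_1", "causes", "disease_2"),
    ("disease_1", "located_in", "tissue_1"),
    ("disease_2", "located_in", "tissue_1"),
    ("gene_2", "encodes", "protein_1"),
]
categories = relation_category(toy_triples)
```

A single outlier flips a relation to "many" under this rule, so a real build might prefer an average-degree threshold instead.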

Additional information related to the edge could also be added, but it's unclear at this time what would be useful.

@callahantiff callahantiff added enhancement New feature or request other labels May 11, 2021
@callahantiff callahantiff self-assigned this May 11, 2021
@callahantiff
Owner Author

Per discussion with @LucaCappelletti94 - create two separate tsv files:

  • Node Data: contains node identifiers and node type. For nodes with multiple types, separate each type label with a |
  • Edge Data: contains columns for source and destination node identifiers, edge weight, and edge type. For edges with multiple types, separate each type label with a |. Example: relation:edge type | relation:edge type
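
A minimal sketch of how a consumer might read the Edge Data file described above, splitting multi-type edges on the | separator. The column names (`source`, `destination`, `weight`, `edge_type`) and the sample rows are assumptions for illustration only; the thread does not fix a header.

```python
import csv
import io


def parse_edge_tsv(text):
    """Parse tab-separated edge rows; multiple edge types are separated by '|'."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    edges = []
    for row in reader:
        edges.append({
            "source": row["source"],
            "destination": row["destination"],
            "weight": float(row["weight"]),
            # Strip whitespace so both "a|b" and "a | b" are accepted.
            "edge_types": [label.strip() for label in row["edge_type"].split("|")],
        })
    return edges


# Hypothetical rows that follow the "relation:edge type | relation:edge type" pattern.
edge_tsv = (
    "source\tdestination\tweight\tedge_type\n"
    "gene_1\tdisease_1\t1.0\tcauses:gene-disease | biomarker_of:gene-disease\n"
    "gene_2\tprotein_1\t1.0\thas_gene_product:gene-protein\n"
)
edges = parse_edge_tsv(edge_tsv)
```

The same splitting logic would apply to the node type column in the Node Data file.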

Will let you know as soon as this is ready @LucaCappelletti94!

@LucaCappelletti94
Contributor

One more thing: if you have node features and edge features in general, whether categorical or metric (even things like BED coordinates, if some nodes are genomic regions), they can be useful when running GNN and GCN models on the graph.

@callahantiff
Owner Author

Great suggestion @LucaCappelletti94! One thought that immediately comes to mind is gene expression values (for specific tissues); we can definitely add that. I'll think through what else might make a good edge type.

Sometimes the distinction between what should be used as a weight or a type gets blurred. However, if we were to first mark everything that may be interesting/useful as an edge type (i.e., all categorical and metric-based attributes), then we would also give the user the ability to select from those what they want to use as an edge weight or type based on their use case. I like this idea a lot!

@felipemello1

felipemello1 commented Sep 23, 2021

Hi, if you don't mind me sharing my two cents: in the file PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_Triples_Identifiers.txt there are 291 unique edge_types. Are all of these edge_types in the dataset?

Also, in DisGeNet, for example, the edges between DIS-GENE can have multiple labels, which completely changes the meaning between their interaction. Would it be possible to add the edge subtypes? https://www.disgenet.org/dbinfo paragraph The DisGeNET Association Type Ontology

In PheKnowLator's data sources description, it doesn't seem that any particular type of edge was filtered/selected: https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet

For Protein-Protein edges from STRING, there are multiple scores. Some of them, for example, are lab-based, others are from literature, and others are from predictions. I would imagine that some stakeholders would feel better being able to keep only the scores coming from lab-based experiments. So maybe all the provided scores could be made available for filtering/enhancement? The same logic applies to DisGeNet, where the number of papers seems to be a good score metric too.
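
To make the filtering idea concrete, here is a small sketch that keeps only edges whose score from a chosen evidence channel meets a threshold. The record layout, the `scores` key, the channel names, and the numeric values are hypothetical; they are not the actual fields of any PheKnowLator output.

```python
def filter_edges(edges, score_key, threshold):
    """Keep edges whose score for the given evidence channel is at least `threshold`.

    Edges that lack the channel entirely are dropped.
    """
    kept = []
    for edge in edges:
        score = edge.get("scores", {}).get(score_key)
        if score is not None and score >= threshold:
            kept.append(edge)
    return kept


# Hypothetical protein-protein edges with per-channel scores.
ppi_edges = [
    {"source": "protein_a", "destination": "protein_b",
     "scores": {"experimental": 0.9, "textmining": 0.2}},
    {"source": "protein_a", "destination": "protein_c",
     "scores": {"textmining": 0.8}},
    {"source": "protein_b", "destination": "protein_c",
     "scores": {"experimental": 0.4, "predicted": 0.95}},
]
lab_supported = filter_edges(ppi_edges, "experimental", 0.7)
```

Keeping every score on the edge, rather than a single combined value, is what lets each user pick their own channel and threshold.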

@callahantiff
Owner Author

Hi @fmellomascarenhas - I always appreciate your feedback!

You are right that I have not yet included specific edge typing from the resources that we import, like DisGeNet and STRING. I agree it's time that we do!

I will spend some time working on and thinking through this tomorrow and will create a spec for what we can add from each source we import. I will also include a brief plan/overview of how I might approach integrating them (there will always be some solutions that are easier or better than others 😄). I can post both here so you guys can take a look. I will set aside time next week to make the changes as part of a new major release.

How does that sound?

@felipemello1

Sounds great! :)

From a Machine Learning point of view, I see four main uses for that:

  1. Filter data that my stakeholders might not want there, for example, edges created from protein-protein predictions;
  2. Predict specific edgetypes, for example, drug-approved_treatment-disease;
  3. Predict multiple edgetypes at once, as a multitask learning problem;
  4. Use as an edge feature;

@callahantiff
Owner Author

Excellent points and even more motivation for me to make these changes! 👍

@callahantiff
Owner Author

Just an update -- I was not able to get to this last week, but plan on coming back to it next week. Sorry for the delay!

@callahantiff
Owner Author

callahantiff commented Oct 9, 2021

Sorry for the delay, I think we are close to being able to make the updates we have been discussing in this thread. I have been reviewing the different resources that we bring in and thinking through some of the challenges with @bill-baumgartner, who has been involved with me from the beginning in building pkt.

I think we came up with the best possible solution in terms of being able to incorporate the greatest amount of edge and node metadata from all input data sources. This should also allow us to easily incorporate other attributes or data types like timestamps and multiple edge weights. Note that this approach is specifically not designed to extend the base OWL knowledge representation, as that can quickly get complicated. Instead, this approach is meant to supplement the existing output files in the least complicated way. Ideally, it provides each user with the most flexibility and options, enabling full downstream customization without our having to know the details of each use case ahead of time. A brief description of what I am proposing is included below. Hope you like it!


Current Approach and Output

Currently, we are only producing metadata output for nodes and relations, not for triples or edges.

Node Metadata

Filename: XXXX_OWL_NodeLabels.txt
Output: tab-delimited txt file containing six columns. This file is somewhat misleading as it also contains information for each relation.

entity_type   integer_id   entity_uri                                      label      description/definition   synonym
NODES         375312       <http://www.ncbi.nlm.nih.gov/gene/58155>        PTBP2 (human)         A protein coding gene PTBP2 in human.   None
NODES         6297907      <https://www.ncbi.nlm.nih.gov/snp/rs10902762>   NM_000203.5(IDUA):c.60G>A (p.Ala20=)   This variant is a germline/unknown single nucleotide variant located on chromosome 4 (NC_000004.12, start:987144/stop:987144 positions, cytogenetic location:4p16.3) and has clinical significance 'Benign'. This entry is for the GRCh38 and was last reviewed on Nov 26, 2020 with review status 'criteria provided, multiple submitters, no conflicts'    None
RELATIONS     2057563      <http://purl.obolibrary.org/obo/RO_0002002>     has boundary   a relation between a material entity and a 2D immaterial entity (the boundary), in which the boundary delimits the material entity   None
RELATIONS     958453       <http://purl.obolibrary.org/obo/RO_0002444>     parasite of   None   direct parasite of
...


Proposed Representation

All data for nodes and edges will be output to a JSON Lines (jsonl) file. This format stores each element as a separate JSON object on its own line. A more detailed description of the benefits of using this type of file can be found here. An example output is shown below (taken from https://jsonlines.org).

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
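
For readers unfamiliar with the format, the following sketch shows one way to write and read JSON Lines with Python's standard `json` module. The helper names `write_jsonl` and `read_jsonl` are made up for this example, and an in-memory buffer stands in for a real file.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Yield one parsed object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "May", "wins": []},
]

buffer = io.StringIO()  # stands in for an open file handle
write_jsonl(records, buffer)
buffer.seek(0)
round_trip = list(read_jsonl(buffer))
```

Because each line is parsed independently, a large file can be streamed record by record instead of being loaded into memory at once.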

Node Metadata

Output Filename: XXXXX_node_metadata.jsonl

The node metadata file will be keyed by URI, and each node will contain, at a minimum, the following metadata:

  • integer_id: the numeric identifier generated for the integer-based edge lists
  • primary_type: a string stating the type of node, taken from the source ontology or input data type. Additional node types can be added, but will appear like <<src>>_entity_type, with src being the name of the data source that the type was obtained from (more details below)
  • primary_label: a string label for the node
  • primary_description/definition: a string containing the definition for the node
  • primary_synonym: a list of synonyms for the node

Additional types of metadata at the node level will be added and the general format will be: <<src>>_<<type>>, where src is the name of the data source and type is the metadata type for that node (e.g., type, weight).
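
A sketch of what a single line of the node metadata file could look like under this proposal. It assumes each line is a one-entry object mapping the URI to its metadata, which is only one possible reading of "keyed by URI". The `primary_type` value, the `example_source` name, and the helper `make_node_record` are hypothetical; the URI, integer_id, label, and definition are taken from the NodeLabels example above.

```python
import json


def make_node_record(uri, integer_id, primary_type, label, definition,
                     synonyms, extra=None):
    """Build one node entry; `extra` maps (src, type) pairs to values."""
    metadata = {
        "integer_id": integer_id,
        "primary_type": primary_type,
        "primary_label": label,
        "primary_description/definition": definition,
        "primary_synonym": list(synonyms),
    }
    # Source-specific metadata follows the <<src>>_<<type>> naming pattern.
    for (src, kind), value in (extra or {}).items():
        metadata[f"{src}_{kind}"] = value
    return {uri: metadata}


gene_uri = "<http://www.ncbi.nlm.nih.gov/gene/58155>"
node_record = make_node_record(
    uri=gene_uri,
    integer_id=375312,
    primary_type="gene",  # hypothetical type label
    label="PTBP2 (human)",
    definition="A protein coding gene PTBP2 in human.",
    synonyms=[],
    extra={("example_source", "entity_type"): "protein-coding gene"},  # hypothetical
)
node_line = json.dumps(node_record, ensure_ascii=False)
```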

Edge Metadata

Output Filename: XXXXX_edge_metadata.jsonl

The edge metadata file will be keyed by a triple or edge identifier created as the MD5 hash of the identifiers in the edge (i.e., MD5(subj_uri, relation_uri, obj_uri)), and each edge will contain, at a minimum, the following metadata:

  • primary_relation_type: a string containing the primary type for the edge
  • weight: all edges will be initialized to 0.0

Additional types of metadata at the edge level will be added and the general format will be: <<src>>_<<type>>, where src is the name of the data source and type is the metadata type for that edge (e.g., weight).
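
The edge identifier and record described above might be generated along these lines. How the three URIs are combined before hashing is not specified in this thread; joining them with a tab character is an assumption, as are the helper names, the example.org URIs, and the `example_source` weight.

```python
import hashlib


def edge_identifier(subj_uri, relation_uri, obj_uri):
    """Return the MD5 hex digest of the three URIs (tab-joined: an assumption)."""
    joined = "\t".join((subj_uri, relation_uri, obj_uri))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def make_edge_record(subj_uri, relation_uri, obj_uri, primary_relation_type,
                     extra=None):
    """Build one edge entry keyed by its identifier; weight starts at 0.0."""
    metadata = {"primary_relation_type": primary_relation_type, "weight": 0.0}
    # Source-specific metadata follows the <<src>>_<<type>> naming pattern.
    for (src, kind), value in (extra or {}).items():
        metadata[f"{src}_{kind}"] = value
    return {edge_identifier(subj_uri, relation_uri, obj_uri): metadata}


# Hypothetical URIs used purely for illustration.
edge_record = make_edge_record(
    "<http://example.org/gene/1>",
    "<http://example.org/relation/causes>",
    "<http://example.org/disease/1>",
    primary_relation_type="gene-disease",
    extra={("example_source", "weight"): 0.42},
)
```

Using a separator that cannot appear inside a URI avoids two different triples concatenating to the same string, and because the hash depends only on the triple, the same identifier can be regenerated for the flat-file outputs.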

As a result of including this file, I will also update the two flat-file outputs (XXXX_Triples_Identifiers.txt and XXXX_Triples_Integers.txt) to include the triple identifiers, even though these can easily be generated on the fly.


Feedback/Questions

@bill-baumgartner - does that seem correct and cover everything we talked about?

@LucaCappelletti94 - I realize that the proposed output would not readily work as input to Embiggen, so I am still very happy to produce a file in the input format we originally discussed.

@fmellomascarenhas and @sanyabt - Please let me know if you have any comments/feedback or if you have any issues with this approach. I think it will be the best overall and hopefully, be flexible enough to be useful for most use cases.

@felipemello1

Hi @callahantiff, I am on vacation this week, but I will get back to you soon! Thanks :)

@callahantiff
Owner Author

That sounds great! Have a great vacation! 😄

@sanyabt
Collaborator

sanyabt commented Oct 13, 2021

Hi @callahantiff, just caught up with the discussion here and I agree that this would be a great solution! I can envision adding timestamps to the edge metadata and other metrics (e.g., node centrality) to the node metadata if needed. Thank you for figuring out a solution so quickly 😄

@callahantiff
Owner Author

Absolutely, that's what I was envisioning too: we would have a baseline amount of metadata that we provide, which users can choose from and/or extend -- with things like timestamps -- as needed.

I will likely make the updates the week after next and will let you know when it's ready. Thanks again for your feedback!

@felipemello1

felipemello1 commented Oct 18, 2021

Hi @callahantiff, I had the time today to read everything. I think it sounds good! I don't think I am in a position to propose a better way of organizing the files, but if that helps, I thought of some additional ideas of features/metadata. Maybe this can help with the brainstorming process :) :

Edge related:
1. Timestamp: So users can split train/test/validation by time, which is much more robust;
2. Edge subtypes: The meaning of gene-biomarker-disease is completely different than gene-causalMutation-disease;
3. Edge scores: Many edgetypes have multiple scores. DisGeNet, for example, has multiple ways of scoring the edges. Number of papers is completely different than number of different sources, which is completely different than number of positive papers / number of total papers. You can have a DIS-GENE edge with 500+ papers that is reported by only one source and thus will have a low score. OpenBioLink, for example, provides two datasets: one with all scores, and one containing only high-confidence edges. This allows users to create their own threshold using their favorite score type.

4. Edge features: Examples

  • Knowing if a drug-dis is FDA approved;
  • Gene expression;

5. Source information: If paper ID is available, possibly add it. This can help with generating a timestamp. Also, I remember once checking one edge that had 3 sources, but when checking the paper ID, they were 2 different versions of the same manuscript and a third paper of the same group citing themselves. So there weren't 3 sources, just one. This information can help stakeholders validate why the edge exists.
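
The time-based split mentioned in item 1 could be sketched as follows. The `timestamp` field, the cutoff dates, and the sample edges are hypothetical; no PheKnowLator output currently carries such a field.

```python
from datetime import date


def split_by_time(edges, train_end, valid_end):
    """Split edges into train/validation/test by timestamp.

    train: timestamp <= train_end
    valid: train_end < timestamp <= valid_end
    test:  timestamp > valid_end
    """
    train = [e for e in edges if e["timestamp"] <= train_end]
    valid = [e for e in edges if train_end < e["timestamp"] <= valid_end]
    test = [e for e in edges if e["timestamp"] > valid_end]
    return train, valid, test


# Hypothetical edges with the date of the earliest supporting publication.
timed_edges = [
    {"edge": ("gene_1", "disease_1"), "timestamp": date(2015, 3, 1)},
    {"edge": ("gene_2", "disease_1"), "timestamp": date(2018, 7, 15)},
    {"edge": ("gene_3", "disease_2"), "timestamp": date(2019, 1, 10)},
    {"edge": ("gene_4", "disease_2"), "timestamp": date(2021, 5, 20)},
]
train, valid, test = split_by_time(
    timed_edges, train_end=date(2018, 12, 31), valid_end=date(2019, 12, 31)
)
```

A model evaluated this way only ever predicts edges reported after everything it was trained on, which is closer to how it would be used in practice than a random split.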

Node related:
6. Node features: Examples

  • Is the DRUG a small molecule or an antibody?
  • Drug chemical properties;

7. Parent/Children: Some biological entities can be described as a tree structure. Diseases, for example, branch into multiple disease subtypes. This information can be very useful to:

  • Find missing IDs (if a child ID is missing, replace with parent);
  • Group diseases: In OpenBioLink, Alzheimer's disease has about 20 subtypes (AZ1, AZ2, etc);
  • Avoid data leakage: If you mask a connection DIS-GENE, should you also mask the parent and children of that DIS-GENE?
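
If parent/child information were available, one possible way to act on the data leakage point is to mask a held-out DIS-GENE edge together with the same gene's edges to the disease's ancestors and descendants. The sketch below assumes a hypothetical `parent_of` mapping from each term to its direct parents and made-up disease and gene identifiers; whether masking relatives is the right policy is left open, as in the question above.

```python
def ancestors(node, parent_of):
    """Return every term reachable by following parent links from `node`."""
    found = set()
    stack = list(parent_of.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parent_of.get(current, ()))
    return found


def related_terms(node, parent_of):
    """Return the node itself plus all of its ancestors and descendants."""
    descendants = {n for n in parent_of if node in ancestors(n, parent_of)}
    return {node} | ancestors(node, parent_of) | descendants


def edges_to_mask(edges, held_out, parent_of):
    """List (disease, gene) edges sharing the gene and a related disease term."""
    disease, gene = held_out
    family = related_terms(disease, parent_of)
    return [(d, g) for d, g in edges if g == gene and d in family]


# Hypothetical hierarchy: two subtypes under one disease under a broader class.
parent_of = {
    "disease_subtype_1": ["disease_x"],
    "disease_subtype_2": ["disease_x"],
    "disease_x": ["disease_class"],
}
dis_gene_edges = [
    ("disease_subtype_1", "gene_a"),
    ("disease_x", "gene_a"),
    ("disease_subtype_2", "gene_a"),
    ("disease_subtype_1", "gene_b"),
    ("unrelated_disease", "gene_a"),
]
masked = edges_to_mask(dis_gene_edges, ("disease_subtype_1", "gene_a"), parent_of)
```

Sibling subtypes are left unmasked here; including them would be a stricter policy and a one-line change to `related_terms`.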

One thing I haven't had the bandwidth to think about is edge properties that are true only when other conditions are also true. For example, the gene expression in a cell type is X1 when disease D is present; otherwise the expression is X2. Or features that differ by gender/race/age. But this is probably way too complex for this stage.

Thanks for all of your great work!

@callahantiff
Owner Author

@fmellomascarenhas this is fantastic feedback, thank you very much! I also really enjoy the examples. I am not sure we can accommodate everything in the first pass, but this format will allow easy integration of the types of metadata you suggest (and likely things neither of us has thought of yet [I think 🤔 and hope 😄 ])! OK, will keep you posted as I begin working on this over the next few weeks.

Thanks so much for the feedback and suggestions!

@callahantiff callahantiff linked a pull request Dec 8, 2021 that will close this issue