Find obo file to calculate semantic similarity between MPO and HPO terms #25
Hey @frequena; we are currently working on improving our own pipelines for similarity mappings between HP and MP. If you are happy to try an experimental version of uPheno, I can get you one of those, but I have to say that you are losing too much important information when using OBO files for the semantic similarity calculation (especially the logical definitions, which are integral to linking HP and MP classes together). Can I ask, which semantic similarity algorithms do you use? And what use case do you want the mappings for? We have quite a few requests at the moment for good MP-HP mappings, and are keen on understanding what they are used for out there.
Thank you @matentzn for your fast answer! Currently, I'm working on the interpretation of variants in patients with rare diseases. Since the number of genes that have HP terms is limited, I thought about following the same approach that is used by PHIVE (integrated into Exomiser): taking MP terms associated with genes from mouse models that carry knock-out mutations in those specific genes and calculating the phenotypic similarity with the list of HPO terms from the patient. Until now, I have been using the Resnik method to calculate the distance between HPO term lists: I simply load the .obo file and calculate the Information Content (IC) of the most informative common ancestor. Now, my goal is to calculate MP-HP similarity scores. To do that, I found the original paper that makes the alignment of HP and MP terms possible, via the following chain: Exomiser -> PHIVE -> PhenoDigm -> OWLSim. I'm not familiar with OWLSim, but I think you can calculate the IC of the most informative common ancestor (Resnik method) through an API query. Maybe this solution is better than my original idea (load a .obo file and calculate similarity) since, as you said, you are losing information using .obo files. Thank you for any help! ;)
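For reference, a minimal sketch of the Resnik computation described in the comment above, assuming you have already built an `ic` dictionary (term -> information content) and an `ancestors` function from the ontology file; the helper names and the best-match-average aggregation are just one common way to do this, not the exact PHIVE/Exomiser implementation.

```python
# Sketch only: `ic` maps term IDs to information content, `ancestors(term)`
# returns the set of ancestor IDs of a term (including the term itself).

def resnik_pairwise(term_a, term_b, ic, ancestors):
    """IC of the most informative common ancestor (MICA) of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max((ic.get(t, 0.0) for t in common), default=0.0)

def resnik_list_similarity(query_terms, target_terms, ic, ancestors):
    """Best-match average: for each query term take its best match in the
    target list, then average the best scores."""
    if not query_terms or not target_terms:
        return 0.0
    best = [
        max(resnik_pairwise(q, t, ic, ancestors) for t in target_terms)
        for q in query_terms
    ]
    return sum(best) / len(best)
```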
Great, this sounds amazing! If you are not in a super rush, I am happy to work with you on your problem (actually quite interested). As a first step, could you try and see how far these two get you:
I will keep you updated here as we make progress on creating more reliable mappings.
Perfect @matentzn! I find it a very interesting problem as well!
Hey @frequena I am currently in the process of calculating similarity using various axiomatic representations (I care less about the semantic similarity algorithm, and more about the shape of the ontology that is fed to the algorithm). If you ever have the time, can you check the similarity scores and tell me if this is more or less what you were looking for? https://www.dropbox.com/s/h63iahh3xqoztgd/mp-hp-mappings.zip?dl=0 The mappings were generated by a very simple semantic similarity algorithm called Jaccard; the formula (see the sketch after this comment) is:

Sim(A, B) = (number of shared superclasses) / (number of all superclasses, i.e. the union)

There are three different ontology inputs:
Anyway, if you ever have time to experiment, I would be happy to work on this together.
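To make the Jaccard formula from the previous comment concrete, here is a minimal sketch; `superclasses` is assumed to return the (reflexive) set of inferred superclasses of a term, however you choose to compute it, and the toy term IDs are purely illustrative.

```python
def jaccard_similarity(term_a, term_b, superclasses):
    """Sim(A, B) = |shared superclasses| / |union of all superclasses|."""
    sup_a = superclasses(term_a)
    sup_b = superclasses(term_b)
    union = sup_a | sup_b
    return len(sup_a & sup_b) / len(union) if union else 0.0

# Toy example with hand-made superclass sets (IDs are placeholders):
toy = {
    "HP:0000001": {"HP:0000001", "UPHENO:ROOT"},
    "MP:0000001": {"MP:0000001", "UPHENO:ROOT"},
}
print(jaccard_similarity("HP:0000001", "MP:0000001", toy.__getitem__))
```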
Hello @matentzn There is a paper: What do you think?
Although this paper is written by members of my project whom I respect a lot, I personally do not believe in assessing the performance of semantic similarity algorithms for anything but a particular use case and set of inputs. I may be wrong, but my sense is that the choice of algorithm, and which works best, depends on the axiomatisation of your source ontology (which means what is best can even differ from branch to branch) and, most importantly, on the use case query - even cross-species inference, which sounds like some kind of special case, is actually way too general to allow identification of a single measure. My suggestion for any paper is to always evaluate a wide range of metrics and pick the one which works best for your case... Using Jaccard is basically convenience - it's simple and known to work in a good number of cases, but it is FAR from perfect. In fact, it's probably quite crude, and any results from Jaccard should be taken with a grain of salt, as it is super sensitive to the "degree of axiomatisation" - imagine you define 150 intermediate classes in one branch and only 1 in another: the Jaccard score would differ a lot between the two branches, while the biology may not differ at all. I think Jaccard is reasonable if the axiomatisation is relatively balanced and homogeneous... What do you think? Happy to be convinced otherwise...
Hello @matentzn That's a very interesting comment. Let me explain: I find it extremely valuable to be able to measure the similarity between lists of terms from an ontology tree, e.g. gene (list of HPO terms) vs patient (list of HPO terms) for disease-gene prioritization. Searching the literature on calculating similarity between lists of ontology terms, I find:
There are of course some papers which propose new approaches, but currently the tendency is to use methods based on Resnik. I want to illustrate with the examples above that the most recently published techniques haven't changed much with respect to the simple but effective Resnik method. That is why I find the Jaccard measure interesting: thanks to its simplicity, you can get a robust measure compared with more sophisticated techniques that are sensitive to noisy data (the common scenario with clinical feature annotation). The only thing I miss in Jaccard, compared with the Resnik method, is the representation of a term's depth. The IC of an HPO term increases with its location in the tree: deeper -> fewer descendant terms -> lower ratio (descendant terms / total terms in the tree) -> higher IC. Am I missing something with respect to Jaccard? This part is very relevant for the biomedical problem, especially when aggregating multiple similarity scores, where you have to separate the relevant scores from the noisy ones and provide a single value that defines the similarity of two lists of HPO terms.
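A minimal sketch of the IC intuition described above, using a purely structural IC (based on descendant counts rather than annotation frequencies); `descendants` is an assumed helper returning the set of descendants of a term.

```python
import math

def information_content(term, descendants, total_terms):
    """IC(term) = -log( (|descendants of term| + 1) / total_terms ).
    Deeper terms have fewer descendants, hence a higher IC."""
    p = (len(descendants(term)) + 1) / total_terms  # +1 counts the term itself
    return -math.log(p)
```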
I think you are right! I would rather not speculate about the pros and cons of Resnik (I don't remember the exact way IC scores are computed), but in the end you again have the same bias if the axiomatisation is unbalanced (different branches are axiomatised more than others). At this point I won't be able to implement Resnik for uPheno myself, but I would be super interested in the outcome of your experiments. I will probably return to the question of semantic similarity and phenotypes around August. If you want to implement Resnik in Python yourself, I would recommend using ROBOT convert to dump uPheno into JSON (or OBO), then parsing it with a normal JSON parser and working out how to extract the relevant data for Resnik. Keep me posted!
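As a rough illustration of that workflow: after running something like `robot convert --input upheno.owl --output upheno.json`, the JSON can be parsed directly. The sketch below assumes the output follows the obographs layout (`graphs` -> `nodes`/`edges`, with `is_a` edges for subclass relations) and that the file names are placeholders.

```python
import json
from collections import defaultdict

# Assumed obographs-style JSON produced by ROBOT; node IDs may be full IRIs.
with open("upheno.json") as fh:
    graph = json.load(fh)["graphs"][0]

labels = {n["id"]: n.get("lbl", "") for n in graph["nodes"]}

# Build a child -> parents map from is_a edges.
parents = defaultdict(set)
for edge in graph["edges"]:
    if edge.get("pred") == "is_a":
        parents[edge["sub"]].add(edge["obj"])

def ancestors(term, seen=None):
    """Transitive is_a ancestors of a term (excluding the term itself)."""
    seen = set() if seen is None else seen
    for parent in parents.get(term, ()):
        if parent not in seen:
            seen.add(parent)
            ancestors(parent, seen)
    return seen
```

From these ancestor sets and descendant counts you can plug the data into the Resnik and Jaccard sketches earlier in this thread.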
Hello @matentzn, yes! I wanted to test the option of making queries directly to the API to calculate the distance between an MP and an HP term, but I didn't know the API endpoint. I have checked the link you sent and I think it would be possible using "GET /sim/score" ...
Try it! @kshefchek Wasn't there an OWLSim endpoint somewhere as well? Or should everyone use biolink only?
We're working on providing these files as part of our release - e.g. HP-MP-jaccard.tsv, HP-MP-resnik.tsv, HP-MP-phenodigm.tsv, and the inverse - but I think this is a few months out. I also have some Python code that's not quite ready, but it could be used for this task eventually (a month or so?). We have a similarity endpoint, but it is meant for analyzing phenotype profiles/lists rather than term-term similarity. Since you would have to query so many term-term combinations, you'd end up with a large I/O hit, and it also overwhelms our servers. This works better as a command line utility.
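Once such precomputed files exist, consuming them offline avoids the I/O problem entirely. A small sketch with pandas, assuming a simple three-column layout (HP term, MP term, score); the actual column names and layout of the released files aren't specified here, so treat them as placeholders.

```python
import pandas as pd

# Hypothetical layout: hp_id <tab> mp_id <tab> score (actual columns may differ).
scores = pd.read_csv("HP-MP-resnik.tsv", sep="\t",
                     names=["hp_id", "mp_id", "score"])

# Look up all MP matches for one HP term, best first.
hp_term = "HP:0001250"  # placeholder ID
matches = (scores[scores["hp_id"] == hp_term]
           .sort_values("score", ascending=False))
print(matches.head())
```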
Hello @kshefchek, thank you for your message! Yes, I would be interested in the HP-MP file, since that way I don't have to make queries constantly. Some time ago, @matentzn posted a link (in this issue) with these files, which I found useful. Is there any difference? Maybe a different "axiomatic representation"?
The files will be a bit different, because different algorithms were used to generate the similarity scores (at least different implementations). Hopefully by the end of the year, all this will be consolidated. :) Thx for your continuing interest!
Hey guys,
This is amazing @JohannKaspar - hopefully by the end of this year we will have something very good in place. In the meantime, if you all have time to experiment a bit, here are some files from an early alpha that you may be able to play with. Note that the files include all species, not just HP/MP, so some filtering could be necessary... We will keep you updated.
Thank you so much @matentzn, this is some great stuff! I will definitely play around with it.
Also check out https://monarchinitiative.org/analyze/phenotypes - the instructions ask for HPO terms, but you can put in MP terms as well.
@frequena @JohannKaspar are you still interested in the issue? We are planning a community working group to work on the matter in a more formal and proper fashion as we speak! Let me know if you would like to join (message me on LinkedIn or here, so I get your emails).
@matentzn I will check in from time to time to see what's new, but I think my coding skills are way too limited for contributing, especially regarding ontologies.
No problem! :) We will keep you updated in this issue. :)
Of course, @matentzn, count on me!!
Could you email me a short description of why you are interested in MP-HP mappings in particular? Or are you mostly interested in semantic similarity in general?
I have been calculating the phenotypic similarity of HP lists using the obo file (http://purl.obolibrary.org/obo/hp.obo).
Now, my goal is to calculate the semantic similarity of lists of HP and MP terms. To that end, I found this paper: https://f1000research.com/articles/2-30/v2 - "Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research" - but the links in it are broken. Checking the repository (and the previous version of upheno), I am not sure which .obo file to use, or whether I should just take a .owl file directly and convert it to a .obo file.
Thank you for any help!
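For context, a minimal sketch of one way to load hp.obo and collect ancestor sets for such calculations, using the obonet package (this is just one possible approach, not the pipeline actually used in this thread; the HP IDs are placeholders).

```python
import networkx as nx
import obonet

# Load the HPO .obo file into a networkx graph. In obonet graphs, edges point
# from child to parent, and include all relationship types, not only is_a.
graph = obonet.read_obo("http://purl.obolibrary.org/obo/hp.obo")

def ancestors(term):
    """All ancestors of `term` plus the term itself."""
    return nx.descendants(graph, term) | {term}

# Example: shared ancestors of two (placeholder) HP terms.
shared = ancestors("HP:0001250") & ancestors("HP:0001251")
print(len(shared))
```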