Matching Mondo with OAK #5037

matentzn · 2022-06-13T09:56:46Z

matentzn
Jun 13, 2022
Maintainer

Here I am trying to figure out the best way to match Mondo with OAK.

I am currently using a naive approach:

1. Merge Mondo with the external ontology. If the external ontology is a table, convert it to OWL first using ROBOT (potentially as a flat list).
2. Run: runoak -i merged.owl lexmatch -o merged.sssom.tsv

Here you can see an example:

https://docs.google.com/spreadsheets/d/1m8dr1jtHaiLLpGpavla6CQKD7pFhZeREGf-0JupwsP0/edit?usp=sharing

The first sheet is a vocabulary fed to us by one of our stakeholders, and we wanted to explore automated approaches to map it to Mondo.

The second sheet is the result. I am not going to comment on it here, but it's obvious that, by default, lexmatch does not do as well as it could.

@cmungall do you have a vision here on how this kind of process should work with OAK? Is the idea that we define preprocessing steps using manual labour (renamings, normalisations etc.), put these in a "rules" file and then run OAK over that?
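To make the "rules file" idea concrete, here is a minimal sketch of what such a preprocessing step might look like. It is plain Python and does not use any OAK API; the `RULES` list, the `normalize` function, and both replacement strings are assumptions made for illustration (the expansion of the abbreviation and the corrected spelling are guesses, not curated values).

```python
import re

# Hypothetical curated rules as (regex pattern, replacement) pairs.
# In practice these would live in a reviewed "rules" file; they are
# inlined here to keep the sketch self-contained.
RULES = [
    (r"\bAutoInflam\b", "autoinflammatory"),  # assumed expansion of an abbreviation
    (r"\bperiodotal\b", "periodontal"),       # assumed correction of a typo
    (r"\s+", " "),                            # collapse runs of whitespace
]

def normalize(label: str, rules=RULES) -> str:
    """Apply each rule in order (case-insensitively), then trim and lowercase."""
    for pattern, replacement in rules:
        label = re.sub(pattern, replacement, label, flags=re.IGNORECASE)
    return label.strip().lower()

print(normalize("AutoInflam  periodotal disease"))
```

The normalised labels could then be written back to the table before it is converted to OWL and passed to lexmatch.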

cmungall · 2022-06-13T15:33:15Z

cmungall
Jun 13, 2022
Maintainer

I think we should distinguish two tasks:

1. matching between Mondo and another database/controlled vocabulary
2. matching between Mondo and highly idiosyncratic spreadsheets

This falls into 2 and necessarily requires bespoke analysis and pre-processing.

I spent three minutes looking at this spreadsheet:

- there is an implicit grammar here, roughly disease name followed by gene name
- there needs to be more documentation of the grammar. Why is the gene name sometimes repeated? Why is a gene alias sometimes listed, and sometimes a different gene?
- there are a lot of non-English words that will probably have to be translated, unless they are stop words
- there is a small vocabulary of abbreviations that are worth curating rather than attempting anything automated, e.g. AutoInflam
- there are a lot of typos that are probably easier to spend 5 mins fixing manually, e.g. periodotal

My vision is that the input is at least cleaned up before attempting any kind of matching. Database normalization principles apply. Put genes in one column, disease in another.
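As a rough illustration of "genes in one column, disease in another", the sketch below splits a free-text entry on the assumed grammar (disease name followed by one or more gene names). The `GENE_LIKE` pattern and the `split_row` helper are hypothetical, and the example row is made up rather than taken from the spreadsheet. A real cleanup would check tokens against an actual gene symbol list instead of a regex, since disease acronyms can look like gene symbols.

```python
import re

# Heuristic, assumed for illustration: a gene symbol is a short run of
# uppercase letters, digits or hyphens that starts with a letter.
GENE_LIKE = re.compile(r"^[A-Z][A-Z0-9-]{1,9}$")

def split_row(raw: str) -> tuple[str, list[str]]:
    """Split 'disease name GENE [GENE ...]' into (disease, unique genes)."""
    tokens = raw.split()
    genes = []
    # Peel gene-like tokens off the end, always leaving at least one token
    # for the disease name.
    while len(tokens) > 1 and GENE_LIKE.match(tokens[-1]):
        genes.insert(0, tokens.pop())
    # Repeated gene names collapse to one entry, keeping first-seen order.
    unique_genes = list(dict.fromkeys(genes))
    return " ".join(tokens), unique_genes

disease, genes = split_row("Familial Mediterranean fever MEFV MEFV")
print(disease, genes)
```

With the two columns separated, the disease column can be matched against Mondo on its own and the gene column checked independently.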

If you really want to try with this input, this isn't a matching problem. Run annotate, not match. We could perhaps have a wrapper around annotate such that the user can say "annotate such that there is a text span with a disease and a text span with a gene, both of which together cover all non-stop words", and then have a bespoke procedure where we align to a Mondo term if both match.
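The coverage condition could be sketched roughly as follows. This is not the OAK annotate API: the lexicons are toy dictionaries, the identifiers are placeholders rather than real Mondo or gene IDs, and `find_spans` and `align` are hypothetical helpers. In a real wrapper the spans would come from the annotator's output.

```python
# Toy inputs, assumed for illustration only.
STOP_WORDS = {"of", "the", "and", "with", "due", "to"}
DISEASE_TERMS = {"familial mediterranean fever": "EX:D1"}  # placeholder ID
GENE_TERMS = {"mefv": "EX:G1"}                             # placeholder ID

def find_spans(tokens, lexicon):
    """Return (start, end, id) for every token run that exactly matches a label."""
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            term_id = lexicon.get(" ".join(tokens[i:j]))
            if term_id:
                spans.append((i, j, term_id))
    return spans

def align(text):
    """Return (disease_id, gene_id) if one disease span plus one gene span
    together cover every non-stop-word token; otherwise None."""
    tokens = text.lower().split()
    content = {i for i, tok in enumerate(tokens) if tok not in STOP_WORDS}
    for d_start, d_end, disease_id in find_spans(tokens, DISEASE_TERMS):
        for g_start, g_end, gene_id in find_spans(tokens, GENE_TERMS):
            covered = set(range(d_start, d_end)) | set(range(g_start, g_end))
            if content <= covered:
                return disease_id, gene_id
    return None

print(align("Familial Mediterranean fever MEFV"))
print(align("Familial Mediterranean fever MEFV unknown"))
```

An entry with any leftover content word is rejected rather than partially matched, which keeps the bespoke alignment conservative.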

Or we could try using ROBOT or OAK to pregenerate synonyms following this grammar in Mondo itself, in a supplemental synonyms file.
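A minimal sketch of the synonym pregeneration idea, assuming the grammar is simply "disease label followed by gene symbol" and that a disease-to-gene association is already available from somewhere. The term ID, the input structure, and the TSV column names are all placeholders, and neither ROBOT nor OAK is invoked here.

```python
import csv
import io

def generate_synonyms(terms):
    """Yield (term_id, synonym) pairs of the form '<label> <gene>'."""
    for term_id, label, genes in terms:
        for gene in genes:
            yield term_id, f"{label} {gene}"

def to_tsv(rows) -> str:
    """Serialise (term_id, synonym) pairs as a two-column TSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["term_id", "synonym"])
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder input: (term ID, primary label, associated gene symbols).
terms = [("EX:0001", "familial Mediterranean fever", ["MEFV"])]
print(to_tsv(generate_synonyms(terms)))
```

The resulting table could be converted into a supplemental synonyms file and merged before running lexmatch, so that the stakeholder's composite labels have something to match against.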
