Replies: 1 comment
-
I think we should distinguish two tasks
This falls into 2 and necessarily requires bespoke analysis and pre-processing. I spent three minutes looking at this spreadsheet
My vision is that the input is at least cleaned up before attempting any kind of matching. Database normalization principles apply: put genes in one column and diseases in another.
If you really want to try with this input, then this isn't a matching problem. Run annotate, not match. We could perhaps have a wrapper around annotate such that the user can say "annotate such that there is a text span with a disease and a text span with a gene, both of which together cover all non-stop words", and then have a bespoke procedure where we align to a Mondo term if both match. Or we could try using ROBOT or OAK to pregenerate synonyms following this grammar in Mondo itself, in a supplemental synonyms file.
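To make the "disease span plus gene span covering all non-stop words" idea concrete, here is a minimal sketch in plain Python. It does not call OAK: the `Span` tuple, the stop-word list, the placeholder term IDs and the character offsets are all assumptions made for illustration, standing in for whatever an annotate run would actually return.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical stop-word list; a real one would need curation.
STOP_WORDS = {"of", "the", "and", "with", "due", "to", "related", "associated"}


class Span(NamedTuple):
    """A text annotation (assumed shape, not OAK's own data model)."""
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    category: str   # "gene" or "disease"
    term_id: str    # placeholder identifier


def uncovered_tokens(text: str, spans: list[Span]) -> list[str]:
    """Return non-stop-word tokens that are not fully inside any span."""
    covered = set()
    for span in spans:
        covered.update(range(span.start, span.end))
    missing = []
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in STOP_WORDS:
            continue
        if not all(i in covered for i in range(match.start(), match.end())):
            missing.append(match.group())
    return missing


def gene_disease_pair(text: str, spans: list[Span]) -> Optional[tuple[Span, Span]]:
    """Find one gene span and one disease span that together cover the text."""
    genes = [s for s in spans if s.category == "gene"]
    diseases = [s for s in spans if s.category == "disease"]
    for gene in genes:
        for disease in diseases:
            if not uncovered_tokens(text, [gene, disease]):
                return gene, disease
    return None


# Made-up input label with hand-written offsets and placeholder IDs.
text = "BRCA1-related breast cancer"
gene = Span(0, 5, "gene", "GENE:0000001")
disease = Span(14, 27, "disease", "DISEASE:0000001")
pair = gene_disease_pair(text, [gene, disease])
```

Only when `pair` is not `None` would the bespoke step go on to look for a Mondo term that matches both the gene and the disease; labels with leftover tokens would be routed to manual review instead.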
-
Here I am trying to figure out the best way to match Mondo with OAK.
I am currently using a naive approach:
runoak -i merged.owl lexmatch -o merged.sssom.tsv
Here you can see an example:
https://docs.google.com/spreadsheets/d/1m8dr1jtHaiLLpGpavla6CQKD7pFhZeREGf-0JupwsP0/edit?usp=sharing
The first sheet is a vocabulary given to us by one of our stakeholders, and we wanted to explore automated approaches for mapping it to Mondo.
The second sheet is the result. I am not going to comment on it here, but it's obvious that, by default, lexmatch does not do as well as it could.
@cmungall do you have a vision here for how this kind of process should work with OAK? Is the idea that we define preprocessing steps by hand (renamings, normalisations, etc.), put these in a "rules" file, and then run OAK over that?
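As a rough illustration of what such a hand-written "rules" file could look like, here is a small Python sketch that applies regex renamings to labels before any matching step. The `pattern => replacement` format, the example rules and the `normalize` helper are all hypothetical; this is not OAK's own rules file format, just one possible shape for the preprocessing step.

```python
import re

# Hypothetical rules format: one "pattern => replacement" per line.
# An empty replacement deletes whatever the pattern matched.
RULES_TEXT = r"""
# expand a common abbreviation
\bdz\b => disease
# drop parenthetical abbreviations such as " (CMT)"
\s*\([^)]*\) =>
"""


def load_rules(text: str) -> list[tuple[re.Pattern, str]]:
    """Parse the rules text into (compiled pattern, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, _, replacement = line.partition("=>")
        rules.append((re.compile(pattern.strip(), re.IGNORECASE), replacement.strip()))
    return rules


def normalize(label: str, rules: list[tuple[re.Pattern, str]]) -> str:
    """Apply every rule in order, then collapse runs of whitespace."""
    for pattern, replacement in rules:
        label = pattern.sub(replacement, label)
    return re.sub(r"\s+", " ", label).strip()


rules = load_rules(RULES_TEXT)
cleaned = normalize("Charcot-Marie-Tooth dz (CMT)", rules)
```

The cleaned labels could then be written back into the input vocabulary before running lexmatch, which keeps the manual decisions in one reviewable file rather than scattered through ad hoc edits.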