# Dealing with huge ontologies in your import chain

Dealing with very large ontologies, such as the Protein Ontology (PR), NCBI Taxonomy (NCBITaxon), Gene Ontology (GO) and CHEBI, is a big challenge when developing ontologies, especially if we want to import and re-use terms from them. There are two major problems:

1. It currently takes about 12-16 GB of memory to process PR and NCBITaxon - memory that many of us do not have available.
2. The files are so large that pulling them over the internet can lead to failures, timeouts and other problems.

There are a few strategies we can employ to deal with the problem of memory consumption:

1. We reduce the memory footprint of the import as much as possible. In other words: we skip the fancy stuff ODK does by default when extracting a module, and keep it simple.
2. We manage the import manually ourselves (no automated import).

To deal with file size, we:

1. Import curated subsets instead of the whole ontology.
2. Use gzipped (compressed) versions where available.

All four strategies are discussed below.

## Overwrite ODK default: less fancy, custom modules

The default recipe for creating a module looks something like this:

```
imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt
	if [ $(IMP) = true ]; then $(ROBOT) query -i $< --update ../sparql/preprocess-module.ru \
		extract -T imports/$*_terms_combined.txt --force true --copy-ontology-annotations true --individuals exclude --method BOT \
		query --update ../sparql/inject-subset-declaration.ru --update ../sparql/postprocess-module.ru \
		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/%_import.owl
```
(Note: this snippet was copied here on 10 February 2021 and may be out of date by the time you read this.)

As you can see, a lot is going on here: first we run some preprocessing (which is really costly in ROBOT, as we need to load the ontology into Jena and then back into the OWL API - so the ontology is loaded three times in total), then we extract a module, then we run more SPARQL queries, and so on. Costly. For small ontologies, this is fine. All of these steps are important for mitigating some of the shortcomings of module extraction techniques, but even if they were optimised away in ROBOT, it may still not be enough.

So what we can do is this. In your `ont.Makefile` (for example, `go.Makefile`, NOT `Makefile`), located in `src/ontology`, you can add a snippet like this:

```
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/pr_import.owl
```

Note that all the `%` variables and uses of `$*` are replaced by the ontology ID in question. Adding this to your `ont.Makefile` will override the default ODK behaviour in favour of this new recipe.

_The ODK supports this reduced module out of the box. To activate it, do this:_

```
import_group:
  products:
    - id: pr
      use_gzipped: TRUE
      is_large: TRUE
```

This will (a) ensure that PR is pulled from a gzipped location (you _have_ to check whether it exists, though: it must correspond to the PURL, followed by the extension `.gz`, for example `http://purl.obolibrary.org/obo/pr.owl.gz`) and (b) ensure that it is considered large, so that the default handling of large imports is activated for `pr` and you don't need to paste anything into `ont.Makefile`.

If you prefer to do it yourself, below are a few snippets that work for three large ontologies. Just copy them, drop them into `ont.Makefile`, and adjust them however you wish.

### Protein Ontology (PR)

```
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/pr_import.owl
```

### NCBI Taxonomy (NCBITaxon)

```
imports/ncbitaxon_import.owl: mirror/ncbitaxon.owl imports/ncbitaxon_terms_combined.txt
	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/ncbitaxon_terms_combined.txt --force true --method BOT \
		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/ncbitaxon_import.owl
```

### CHEBI

```
imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt
	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \
		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/chebi_import.owl
```

Feel free to use an even cheaper approach - even one that does not use ROBOT - as long as it produces the target of the goal (e.g. `imports/chebi_import.owl`); see the sketch below.
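
For example, here is a minimal sketch that skips ROBOT entirely and just downloads a ready-made subset. The subset URL is purely illustrative - it assumes someone already publishes a pre-extracted file somewhere:

```
imports/chebi_import.owl: imports/chebi_terms_combined.txt
	# Hypothetical URL: assumes a ready-made subset is published somewhere
	if [ $(IMP) = true ]; then curl -L --fail -o $@ \
		https://example.org/subsets/chebi_subset.owl; fi
.PRECIOUS: imports/chebi_import.owl
```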

## Use slims when they are available

For some ontologies, you can find slims that are _much_ smaller than the full ontology. For example, NCBITaxon maintains a slim for OBO here: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo, which is just 3 MB(!) compared to the 1-2 GB of the full version. Many ontologies maintain such slims, and if not, they probably should (I would really like to see an OBO slim for the Protein Ontology!).

You can also add your favourite taxa to that slim by making a pull request here: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt

You can use those slims simply like this:

```
import_group:
  products:
    - id: ncbitaxon
      mirror_from: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo
```

## Manage imports manually

This is a real hack, and we want to discourage it in the strongest terms. But sometimes, importing an ontology just to use a single term is total overkill. What we do in these cases is maintain a simple ROBOT template to "import" minimal information. I can't stress enough that we want to avoid this, as such information will necessarily get out of date, but here is a pattern you can use to handle it in a reasonable way:

Add this to your `src/ontology/ont-odk.yaml`:

```
import_group:
  products:
    - id: my_ncbitaxon
```

Then add this to `src/ontology/ont.Makefile`:

```
mirror/my_ncbitaxon.owl:
	echo "No mirror for $@"

imports/my_ncbitaxon_import.owl: imports/my_ncbitaxon_import.tsv
	if [ $(IMP) = true ]; then $(ROBOT) template --template $< \
		--ontology-iri "$(ONTBASE)/$@" --output $@.tmp.owl && mv $@.tmp.owl $@; fi
.PRECIOUS: imports/my_ncbitaxon_import.owl
```
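
The recipe above builds the import from `imports/my_ncbitaxon_import.tsv`, a ROBOT template (see http://robot.obolibrary.org/template). A minimal sketch of what that file could contain - the term is illustrative, columns are tab-separated, and without a `TYPE` column ROBOT should default to declaring a class:

```
ID	Label
ID	LABEL
NCBITaxon:9606	Homo sapiens
```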

Now you can manage your import manually in the template, and the ODK will know not to include your manually curated import in your base release. But again: avoid this pattern for anything but the most trivial cases (e.g. you need one term from a huge ontology).
|
## File is too large: network timeouts and long runtimes

Remember that ontologies are text files. While this makes them easy to read in your browser, it also makes them huge - from 500 MB (CHEBI) to 2 GB (NCBITaxon) - which is an enormous amount of data to pull over the internet.

Thankfully, ROBOT can read gzipped ontologies automatically, without the need to unpack them. To avoid long runtimes and network timeouts, we can do the following two things (with the new ODK 1.2.26):

```
import_group:
  products:
    - id: pr
      use_gzipped: TRUE
```
This will try to append `.gz` to the default download location (http://purl.obolibrary.org/obo/pr.owl becomes http://purl.obolibrary.org/obo/pr.owl.gz). Note that you must make sure that this file actually exists. It does for CHEBI and the Protein Ontology, but not for many others.
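
One quick way to check is to request just the HTTP status for the gzipped PURL (a sketch using plain `curl`; any HTTP client will do). A final status of `200` means the file exists:

```
curl -sIL -o /dev/null -w '%{http_code}\n' http://purl.obolibrary.org/obo/pr.owl.gz
```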

If the file exists but is located elsewhere, you can do this:

```
import_group:
  products:
    - id: pr
      mirror_from: http://purl.obolibrary.org/obo/pr.owl.gz
```

You can put any URL in `mirror_from` (including non-OBO ones!); see the sketch below.
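
For instance (the host below is hypothetical, shown only to illustrate that the URL does not need to be an OBO PURL):

```
import_group:
  products:
    - id: pr
      mirror_from: https://my-institute.example.org/mirrors/pr.owl.gz
```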
|