HPC pipeline to aggregate knowledge graphs from EMBL-EBI resources, the MONARCH Initiative KG, ROBOKOP, Ubergraph, and other sources into giant (multi-terabyte) transient Neo4j+Solr databases for querying.
The resulting transient databases can be downloaded from https://ftp.ebi.ac.uk/pub/databases/spot/kg/ebi/

Name | Description | # Nodes | # Edges | Neo4j DB size |
---|---|---|---|---|
ebi_monarch_xspecies | All datasources with cross-species phenotype matches merged | ~130m | ~850m | ~900 GB |
ebi_monarch | All datasources with cross-species phenotype matches separated | | | |
impc_x_gwas | Limited to data from IMPC, GWAS Catalog, and related ontologies and mappings | ~30m | ~184m | |

Note that the purpose of this pipeline is not to supply another knowledge graph, but to facilitate querying and analysis across existing ones. Consequently, the above databases should be considered temporary and may be removed or replaced with new ones without warning.
The following mapping tables are loaded:
- https://data.monarchinitiative.org/mappings/latest/gene_mappings.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/hp_mesh.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/mesh_chebi_biomappings.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/mondo.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/umls_hp.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/upheno_custom.sssom.tsv
- https://raw.githubusercontent.com/mapping-commons/mh_mapping_initiative/master/mappings/mp_hp_mgi_all.sssom.tsv
- https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-efo.sssom.tsv
- https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-vt.sssom.tsv
- https://github.com/biopragmatics/biomappings/raw/refs/heads/master/src/biomappings/resources/mappings.tsv
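
As a rough illustration of what reading one of the `.sssom.tsv` tables involves, the sketch below parses SSSOM/TSV text into (subject, predicate, object) rows. It assumes the usual SSSOM layout: `#`-prefixed metadata lines followed by a tab-separated table with `subject_id`, `predicate_id`, and `object_id` columns. The `Mapping` struct and `parse_sssom` function are hypothetical names for this example and are not taken from the pipeline's code.

```rust
/// One mapping row, reduced to the three columns used in this sketch.
#[derive(Debug, PartialEq)]
struct Mapping {
    subject: String,
    predicate: String,
    object: String,
}

/// Parse SSSOM/TSV text: skip `#` metadata lines and blank lines, read the
/// header row, then pick out the three columns by name.
/// Returns an empty list if a required column is missing.
fn parse_sssom(text: &str) -> Vec<Mapping> {
    let mut rows = text
        .lines()
        .filter(|l| !l.starts_with('#') && !l.trim().is_empty());
    let header: Vec<&str> = match rows.next() {
        Some(h) => h.split('\t').collect(),
        None => return Vec::new(),
    };
    let col = |name: &str| header.iter().position(|h| h.trim() == name);
    let (s, p, o) = match (col("subject_id"), col("predicate_id"), col("object_id")) {
        (Some(s), Some(p), Some(o)) => (s, p, o),
        _ => return Vec::new(),
    };
    rows.filter_map(|line| {
        let f: Vec<&str> = line.split('\t').collect();
        // Rows that are too short to hold all three columns are skipped.
        Some(Mapping {
            subject: f.get(s)?.trim().to_string(),
            predicate: f.get(p)?.trim().to_string(),
            object: f.get(o)?.trim().to_string(),
        })
    })
    .collect()
}
```

Looking columns up by header name rather than by position keeps the sketch tolerant of tables that order or extend their columns differently.
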
In all of the currently configured outputs, `skos:exactMatch` mappings cause clique merging. In `ebi_monarch_xspecies`, `semapv:crossSpeciesExactMatch` also causes clique merging (so e.g. corresponding HP and MP terms will share a graph node). As this is not always desirable, a separate graph `ebi_monarch` is also provided where `semapv:crossSpeciesExactMatch` mappings are represented as edges.
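
The difference between the two graphs can be sketched with a small union-find: mappings whose predicate is in a configurable "merging" set join their endpoints into one clique, and all other mappings are kept as edges. This is a minimal sketch under assumptions, not the pipeline's implementation. The `find` and `integrate` functions are hypothetical, and the identifiers used with it (`HP:x`, `MP:x`, and so on) are placeholders rather than real mappings.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// Follow parent links to the representative of `id`'s clique,
/// registering `id` as its own clique if it has not been seen before.
fn find(parent: &mut HashMap<String, String>, id: &str) -> String {
    let mut cur = id.to_string();
    loop {
        let next = parent
            .entry(cur.clone())
            .or_insert_with(|| cur.clone())
            .clone();
        if next == cur {
            return cur;
        }
        cur = next;
    }
}

/// Split mappings into merged cliques and leftover edges.
/// `merging` is the set of predicates that cause clique merging.
fn integrate<'a>(
    mappings: &[(&'a str, &'a str, &'a str)],
    merging: &HashSet<&str>,
) -> (Vec<Vec<String>>, Vec<(&'a str, &'a str, &'a str)>) {
    let mut parent: HashMap<String, String> = HashMap::new();
    let mut edges = Vec::new();
    for &(s, p, o) in mappings {
        let (rs, ro) = (find(&mut parent, s), find(&mut parent, o));
        if merging.contains(p) {
            if rs != ro {
                parent.insert(rs, ro); // union the two cliques
            }
        } else {
            edges.push((s, p, o)); // keep the mapping as an edge
        }
    }
    // Group every identifier under its clique representative.
    let ids: Vec<String> = parent.keys().cloned().collect();
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for id in ids {
        let root = find(&mut parent, &id);
        groups.entry(root).or_default().push(id);
    }
    let mut cliques: Vec<Vec<String>> = groups
        .into_values()
        .map(|mut g| {
            g.sort();
            g
        })
        .collect();
    cliques.sort();
    (cliques, edges)
}
```

Calling `integrate` with both `skos:exactMatch` and `semapv:crossSpeciesExactMatch` in `merging` mimics the `ebi_monarch_xspecies` behaviour, while passing only `skos:exactMatch` leaves cross-species matches as edges, as in `ebi_monarch`.
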
The pipeline is implemented as Rust programs with simple CLIs, orchestrated with Nextflow. Input KGs are represented in a variety of formats including KGX, RDF, and JSONL files. After loading, a simple "bruteforce" integration strategy is applied:
- All strings that begin with any IRI or CURIE prefix from the Bioregistry are canonicalised to the standard CURIE form
- All property values that are the identifier of another node in the graph become edges
- Cliques of equivalent nodes are merged into single nodes
- Cliques of equivalent properties are merged into single properties (and for ontology-defined properties, the qualified safe labels are used)
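
The first two steps above can be sketched as follows. This is an illustrative sketch only: the prefix list is hand-written here (a real run would draw it from the Bioregistry), and the `Node` type, `canonicalise`, and `extract_edges` are hypothetical names that do not come from the pipeline's code.

```rust
use std::collections::HashMap;

/// Rewrite a string to CURIE form if it starts with a known prefix.
/// Each entry pairs an IRI prefix or variant CURIE prefix with the
/// canonical CURIE prefix. Strings with no known prefix pass through.
fn canonicalise(value: &str, prefixes: &[(&str, &str)]) -> String {
    for &(known, curie) in prefixes {
        if let Some(local) = value.strip_prefix(known) {
            return format!("{}:{}", curie, local);
        }
    }
    value.to_string()
}

/// A node's properties: property name -> list of string values.
type Node = HashMap<String, Vec<String>>;

/// Turn every property value that is the identifier of another node
/// into a (subject, property, object) edge.
fn extract_edges(nodes: &HashMap<String, Node>) -> Vec<(String, String, String)> {
    let mut edges = Vec::new();
    for (id, props) in nodes {
        for (prop, values) in props {
            for v in values {
                // Assumption of this sketch: self-references are not edges.
                if v != id && nodes.contains_key(v) {
                    edges.push((id.clone(), prop.clone(), v.clone()));
                }
            }
        }
    }
    edges.sort(); // deterministic order, since HashMap iteration is unordered
    edges
}
```

Running `canonicalise` over every string before `extract_edges` matters in this scheme: a property value only matches a node identifier once both are in the same CURIE form.
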
The primary output of the pipeline is a property graph for Neo4j. The nodes and edges are also loaded into Solr for full-text search and into SQLite for resolving IDs to compressed objects.