ncbo / bioportal-to-kgx
Assemble a BioPortal Knowledge Graph
License: BSD 3-Clause "New" or "Revised" License
A few SSSOM maps are OK, but they are overkill when the entirety of a large ontology needs to be mapped to the same category.
There's likely a more type-entity-based strategy to be had here, or the type entity could be defined in its own map (I don't really like that latter option in terms of source-of-truth management, but it would work).
Parsing ABD results in the edges including the following:
urn:uuid:c95fa7fd-b1f0-4f35-83d4-4fce72d8bbc0 http://brd.bsvgateway.org/api/organism/?id=620 biolink:subclass_of http://brd.bsvgateway.org/api/organism/ rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
urn:uuid:7a7d352b-0ebb-42a4-a950-fcf914d72fd4 http://brd.bsvgateway.org/api ransmission/ biolink:subclass_of owl:Thing rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
urn:uuid:b9b23088-74d8-45be-b1d8-9ef0085b1a8d http://brd.bsvgateway.org/api/organism/?id=628 biolink:subclass_of http://brd.bsvgateway.org/api/organism/ rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
The middle line is the issue - this breaks any merge in which these edges are included, raising a pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 1017, saw 8
It looks like there's some truncation or extra whitespace in the id. This ontology has a class named Transmission, so this is likely something with the prefix http://brd.bsvgateway.org/api/transmission/
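As a stopgap, a pre-check could flag rows whose field count doesn't match the header - i.e., ids with embedded tabs or stray whitespace - before they ever reach a merge. A sketch (the function name and return format are made up, not part of this repo):

```python
def find_malformed_rows(tsv_path: str) -> list:
    """Report (line number, field count) for rows whose tab-separated field
    count differs from the header's, which is what an id with an embedded
    tab produces."""
    bad = []
    with open(tsv_path) as f:
        header_len = len(f.readline().rstrip("\n").split("\t"))
        for lineno, line in enumerate(f, start=2):  # line 1 is the header
            n = len(line.rstrip("\n").split("\t"))
            if n != header_len:
                bad.append((lineno, n))
    return bad
```

Running this over each edge file would name the offending ontology and line before pandas does.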
ABD has some other weirdness too, with lots of Error classes. See on BioPortal at https://bioportal.bioontology.org/ontologies/ABD/?p=classes&conceptid=root
The biolink:OntologyClass category is too abstract to be useful when another node category is available, so when re-mapping, remove that category.
Transform for RDL fails:
Traceback (most recent call last):
File "run.py", line 81, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 69, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 170, in do_transforms
metadata = (header.split(NAMESPACE))[1]
IndexError: list index out of range
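A defensive rewrite of the failing line would sidestep the IndexError when a header doesn't contain the namespace at all. The names here are illustrative, not the actual functions.py code:

```python
def split_header(header: str, namespace: str):
    """Guarded version of `(header.split(NAMESPACE))[1]`: return None
    instead of raising IndexError when the namespace isn't present."""
    parts = header.split(namespace)
    return parts[1] if len(parts) > 1 else None
```

The caller can then log and skip headers that come back as None rather than crashing the whole run.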
A frustrating issue, because I momentarily had a solution, but then it slipped through my fingers like fine sand...
anyway, calling kgx.cli.validate in functions.py as follows:
tx_filename = os.path.basename(tx_filepaths[0])
tx_name = "_".join(tx_filename.split("_", 2)[:2])
parent_dir = os.path.dirname(tx_filepaths[0])
log_path = os.path.join(parent_dir, f'kgx_validate_{tx_name}.log')
try:
    errors = kgx.cli.validate(inputs=tx_filepaths,
                              input_format="tsv",
                              input_compression=None,
                              stream=True,
                              output=log_path)
    if len(errors) > 0:  # i.e. there are any real errors
        print(f"KGX found errors in graph files. See {log_path}")
    else:
        print(f"KGX found no errors in {tx_name}.")
except TypeError as e:
    print(f"Error while validating {tx_name}: {e}")
will only output validation errors to STDOUT. A file is created at log_path but not modified in any way - it remains empty. Strangely enough, this behavior is consistent no matter what value is passed to the output parameter.
Likely related to Knowledge-Graph-Hub/knowledge-graph-hub.github.io#13 and also mentioned in biolink/kgx#309.
Looking at kgx, the validate function certainly does open a file for writing:
https://github.com/biolink/kgx/blob/496934e0427bad695d3fe8ee05111140418fb200/kgx/cli/cli_utils.py#L249-L252
but then it looks like that defaults to stderr in write_report:
https://github.com/biolink/kgx/blob/496934e0427bad695d3fe8ee05111140418fb200/kgx/error_detection.py#L157-L178
unless a level is provided, which isn't an option through the CLI as far as I can tell.
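Until the CLI exposes level, one workaround is to write whatever validate returns into the log file ourselves. A sketch - the structure of the returned errors isn't guaranteed by kgx, so each entry is just stringified:

```python
def write_validation_log(errors, log_path: str) -> None:
    """Dump validation errors to log_path manually, since kgx's
    write_report defaults to stderr unless a `level` is passed."""
    with open(log_path, "w") as log:
        for err in errors:
            log.write(f"{err}\n")
```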
The ROBOT relax fails because of the following error when parsing DOID_617:
ROBOT encountered an error:
RAN: /home/harry/BioPortal-to-KGX/robot relax --input /tmp/tmpq3t8cyv0 --output transformed/ontologies/DOID/DOID_617_relaxed.json --vvv
STDOUT:
2022-03-07 14:01:21,081 DEBUG org.obolibrary.robot.IOHelper - Loading ontology /tmp/tmpq3t8cyv0 with catalog file null
2022-03-07 14:01:21,082 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading file META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi... (24347 more, please see e.stdout)
STDERR:
java.lang.NullPointerException: Cannot invoke "String.toString()" because "lv" is null
at org.geneontology.obographs.owlapi.FromOwl.generateGraph(FromOwl.java:409)
at org.geneontology.obographs.owlapi.FromOwl.generateGraphDocument(FromOwl.java:63)
at org.obolibrary.robot.IOHelper.saveOntologyFile(IOHelper.java:1684)
at org.obolibrary.robot.IOHelper.saveOntology(IOHelper.java:838)
at org.obolibrary.robot.CommandLineHelper.maybeSaveOutput(CommandLineHelper.java:671)
at org.obolibrary.robot.RelaxCommand.execute(RelaxCommand.java:113)
at org.obolibrary.robot.CommandManager.executeCommand(CommandManager.java:248)
at org.obolibrary.robot.CommandManager.execute(CommandManager.java:192)
at org.obolibrary.robot.CommandManager.main(CommandMa... (97 more, please see e.stderr)
ROBOT relax of DOID_617 failed - skipping.
The next transform continues as expected, but DOID doesn't get transformed.
ROBOT is used to pre-process each ontology, but it should also produce a report, specifically the output of a measure command. Each report should then be saved to the ontology's transformation output directory.
Errors in URIs are caught during KGX transform and throw a WARNING, but may not be added to the output TSV.
Example:
WARNING:rdflib.term:https://w3id.org/biolink/vocab/Dicty Phenotypes does not look like a valid URI, trying to serialize this will break.
The issue here is the space in the URI.
Despite the presence of attributes like hasSTY, nodes with corresponding semantic types are not being assigned the corresponding Biolink category. The STY ontology itself is being correctly mapped:
id category name description provided_by
STY:T058 biolink:Activity Semantic Types Ontology
STY:T057 biolink:Activity Semantic Types Ontology
STY:T056 biolink:Activity Semantic Types Ontology
STY:T055 biolink:Behavior Semantic Types Ontology
STY:T054 biolink:Behavior Semantic Types Ontology
STY:T053 biolink:Behavior Semantic Types Ontology
STY:T052 biolink:Activity Semantic Types Ontology
Yet these don't end up in the UMLS ontologies:
$ more MEDDRA_20_nodes.tsv
id category name description provided_by
MEDDRA:10007469 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007467 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007468 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007465 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007466 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007463 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
This is more of an issue for kg-bioportal, but it can be addressed here.
Some CURIEs contain pound signs (no, I will not call them hash signs, etc etc), e.g. in PR:
OBO:EnsemblBacteria#_BruAb1_2100
or
OBO:dictyBase#_DDB_G0276875
This only becomes an issue once we hit the cat-merge step in kg-bioportal, at which point everything after the # is truncated, leaving redundant-looking CURIEs like OBO:dictyBase. That's not really a bug, and it can be fixed by assigning specific prefixes for these cases.
Generating mappings between BioPortal ontologies and OBO ontologies will provide additional strategies for integration. LOOM mappings capture some of this space, but require lexical similarity between labels. We would like to recognize potential logical mappings between ontologies as well.
A few options:
A small but non-zero number of the ontology transforms can't be parsed properly by pandas. This is probably caught by one or another of the existing validations, but by the kg-bioportal merge step it becomes an issue like the following:
15:44:08 Traceback (most recent call last):
15:44:08 File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
15:44:08 result = (True, func(*args, **kwds))
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 809, in parse_source
15:44:08 transformer.transform(input_args)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 303, in transform
15:44:08 self.process(source_generator, sink)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
15:44:08 for rec in source:
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 184, in parse
15:44:08 for chunk in file_iter:
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1187, in __next__
15:44:08 return self.get_chunk()
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1284, in get_chunk
15:44:08 return self.read(nrows=size)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
15:44:08 index, columns, col_dict = self._engine.read(nrows)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
15:44:08 data = self._reader.read(nrows)
15:44:08 File "pandas/_libs/parsers.pyx", line 787, in pandas._libs.parsers.TextReader.read
15:44:08 File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
15:44:08 File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
15:44:08 File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
15:44:08 pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 6, saw 9
Most of these errors are due to #32, so the solution is to re-transform, but the pandas error does not specify which ontology led to it. A pre-screening in this repo would be helpful: just load each graph file into pandas and warn loudly if it doesn't parse.
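Such a pre-screen could be as small as this (function name and message format are illustrative):

```python
import pandas as pd

def prescreen_graph_file(path: str) -> bool:
    """Load a KGX TSV with pandas and warn loudly if it fails to parse,
    so the offending ontology is named before the merge step runs."""
    try:
        pd.read_csv(path, sep="\t", dtype=str)
        return True
    except pd.errors.ParserError as e:
        print(f"WARNING: {path} does not parse: {e}")
        return False
```

Note that pandas only raises ParserError on *inconsistent* field counts (as in the traceback above); a uniformly wider file is silently read with an inferred index column, so this catches exactly the merge-breaking case.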
And not just a few ontologies, either - see how many can be combined (omitting larger ontologies such as NCBITAXON, GAZ, and DRON) as a merged graph.
Essentially anticipating #30 (Aim 2.3.b)
The process of transforming CCO (Cell Cycle Ontology) goes like this:
Starting on ../Bioportal/4store-export-2022-07-20/data/f0/ff/fbb225ff0f97d7737e854fd2d48d
BioPortal metadata not found for CCO_6 - will retrieve.
Accessing https://data.bioontology.org/ontologies/CCO/...
<Response [200]>
Accessing https://data.bioontology.org/ontologies/CCO/latest_submission...
<Response [200]>
Retrieved metadata for CCO (Cell Cycle Ontology)
ROBOT: relax CCO_6
Relaxing /tmp/tmpxpybbnaf to transformed/ontologies/CCO/CCO_6_relaxed.json...
Traceback (most recent call last):
File "/home/harry/BioPortal-to-KGX/run.py", line 138, in <module>
run()
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/harry/BioPortal-to-KGX/run.py", line 113, in run
transform_status = do_transforms(
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 233, in do_transforms
if relax_ontology(robot_path,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/robot_utils.py", line 65, in relax_ontology
robot_command(
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 1524, in __call__
return RunningCommand(cmd, call_args, stdin, stdout, stderr)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 788, in __init__
self.wait()
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 845, in wait
self.handle_command_exit_code(exit_code)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 869, in handle_command_exit_code
raise exc
sh.SignalException_SIGKILL:
RAN: /home/harry/BioPortal-to-KGX/robot relax --input /tmp/tmpxpybbnaf --output transformed/ontologies/CCO/CCO_6_relaxed.json --vvv
STDOUT:
2022-10-25 19:26:23,373 DEBUG org.obolibrary.robot.IOHelper - Loading ontology /tmp/tmpxpybbnaf with catalog file null
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading file META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi... (257202 more, please see e.stdout)
STDERR:
An unrelated but still present issue: robot's verbosity flag is -vvv, not --vvv.
Run the robot command on its own, and this happens:
[many debug lines later]
DEBUG Saving ontology as OboGraphs JSON Syntax with to IRI file:/home/harry/BioPortal-to-KGX/transformed/ontologies/CCO/CCO_6_relaxed.json
OBO GRAPH ERROR Could not convert ontology to OBO Graph (see https://github.com/geneontology/obographs)
For details see: http://robot.obolibrary.org/errors#obo-graph-error
java.io.IOException: errors#OBO GRAPH ERROR Could not convert ontology to OBO Graph (see https://github.com/geneontology/obographs)
at org.obolibrary.robot.IOHelper.saveOntologyFile(IOHelper.java:1722)
at org.obolibrary.robot.IOHelper.saveOntology(IOHelper.java:846)
at org.obolibrary.robot.CommandLineHelper.maybeSaveOutput(CommandLineHelper.java:667)
at org.obolibrary.robot.RelaxCommand.execute(RelaxCommand.java:113)
at org.obolibrary.robot.CommandManager.executeCommand(CommandManager.java:244)
at org.obolibrary.robot.CommandManager.execute(CommandManager.java:188)
at org.obolibrary.robot.CommandManager.main(CommandManager.java:135)
at org.obolibrary.robot.CommandLineInterface.main(CommandLineInterface.java:61)
This ontology hasn't been updated in 8 years so it may be skippable.
Metadata retrieval for some ontologies encounters the following:
Starting on /home/harry/Bioportal/4store-export-2022-02-02/data/8e/be/8e3726e465b55e80c205dad4f43e
KGX validation log present: kgx_validate_VODANA-UG-OPD_1.log
ROBOT report(s) present: robot.report
Transform already present for VODANA-UG-OPD_1
Validating graph files can be parsed...
Graph file transformed/ontologies/VODANA-UG-OPD/VODANA-UG-OPD_1_edges.tsv parses OK.
BioPortal metadata not found for VODANA-UG-OPD_1 - will retrieve.
Accessing https://data.bioontology.org/ontologies/VODANA-UG-OPD/...
<Response [403]>
Accessing https://data.bioontology.org/ontologies/VODANA-UG-OPD/latest_submission...
<Response [403]>
Traceback (most recent call last):
File "run.py", line 73, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 61, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 158, in do_transforms
onto_md = bioportal_metadata(dataname, ncbo_key)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/bioportal_utils.py", line 53, in bioportal_metadata
md[md_type] = content[md_type]
KeyError: 'submissionId'
The content is empty, so we shouldn't assign anything to md.
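A guard along these lines would leave md untouched when the API returns an empty or 403 response (names mirror the traceback, not the actual bioportal_utils.py code):

```python
def set_metadata_field(md: dict, content: dict, md_type: str) -> None:
    """Copy a metadata field only when the API response actually
    contains it; an empty response then leaves `md` as-is instead
    of raising KeyError."""
    if content and md_type in content:
        md[md_type] = content[md_type]
```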
This is an issue left over from a recent PR (#78): nodecount is referenced in pandas_validate_transform() but is only assigned a value when the input follows the nodes/edges naming convention, there's a parsing error, or the data is empty.
The following prefixes need to be updated:
Click options to select or exclude one or more ontologies, by name, would be helpful for avoiding the really large ones (like NCBITAXON) when they don't require further work or should be handled on their own.
The goal stated in the title is more or less complete, so the remaining tasks here are:
When running a fresh transform like the following:
python run.py --input ../Bioportal/4store-export-2022-07-20/data/ --get_bioportal_metadata --ncbo_key [key] --remap_types --write_curies
the process fails upon trying to load the SSSOM maps:
Looking for records in ../Bioportal/4store-export-2022-07-20/data/
976 files found.
Setting up ROBOT...
ROBOT path: /home/harry/BioPortal-to-KGX/robot
ROBOT evironment variables: -Xmx12g -XX:+UseG1GC
Loading type maps from mappings/
WARNING:root:No prefix map provided (not recommended), trying to use defaults..
Traceback (most recent call last):
File "run.py", line 81, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 69, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 121, in do_transforms
this_table = read_sssom_table(os.path.join(MAPPING_DIR,filepath))
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/deprecation.py", line 260, in _inner
return function(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 84, in read_sssom_table
return parse_sssom_table(
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 183, in parse_sssom_table
msdf = from_sssom_dataframe(
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 382, in from_sssom_dataframe
mlist.append(_prepare_mapping(Mapping(**mdict)))
File "<string>", line 42, in __init__
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom_schema/datamodel/sssom_schema.py", line 258, in __post_init__
self.MissingRequiredField("mapping_justification")
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/linkml_runtime-1.2.0rc3-py3.8.egg/linkml_runtime/utils/yamlutils.py", line 246, in MissingRequiredField
raise ValueError(f"{field_name} must be supplied")
ValueError: mapping_justification must be supplied
As the stack trace indicates, this is due to a missing required field in the maps, per the sssom_schema: they need to include mapping_justification.
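For illustration, a minimal SSSOM table with the required column might look like the following. The row reuses the STY mapping shown earlier in this document; skos:exactMatch and semapv:ManualMappingCuration are standard SSSOM/semapv values, and real files also need the usual YAML metadata block with a CURIE map:

```
subject_id	predicate_id	object_id	mapping_justification
STY:T058	skos:exactMatch	biolink:Activity	semapv:ManualMappingCuration
```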
See ncbo/kg-bioportal#38 (comment)
and example output in
https://kg-hub.berkeleybop.io/kg-bioportal/onto_status.yaml
When using the kgx_validate flag, this error appears:
Error while validating CCO_6: validate() got an unexpected keyword argument 'stream'
This causes KGX-based validation to fail. The problem is the call to kgx.cli.validate in functions.py - it does, in fact, try to pass stream=True to the validate function, but the function doesn't have that arg.
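One version-tolerant fix is to pass only the keyword arguments the installed validate() actually accepts. This is a generic sketch, not the kgx API itself (and it won't help for functions that swallow **kwargs):

```python
import inspect

def call_with_supported_kwargs(func, **kwargs):
    """Drop any keyword arguments the target function doesn't declare,
    so e.g. `stream=True` is omitted for kgx versions without it."""
    accepted = inspect.signature(func).parameters
    return func(**{k: v for k, v in kwargs.items() if k in accepted})
```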
We have many nodes assigned the biolink:NamedThing category by default, but these could be assigned other categories if a mapping is available, e.g. UMLS semantic types.
See API call: https://data.bioontology.org/mappings?ontologies=SNOMEDCT,BIOLINK
(but there is a bug limiting this for now)
As of the Jul 20 2022 BP dump, the transform for ISSVA fails after pandas validation. Will try to reproduce.
The type mapping ensuring that HGNC nodes are assigned biolink:Gene is seemingly not working. This could be due to HGNC CURIEs taking the form HGNC:HGNC_XXXX rather than HGNC:XXXX. It should be the latter - see https://bioregistry.io/registry/hgnc.
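A normalization pass along these lines would bring the CURIEs in line with the Bioregistry form (helper name is made up):

```python
def normalize_hgnc(curie: str) -> str:
    """Rewrite HGNC:HGNC_XXXX to the Bioregistry-preferred HGNC:XXXX;
    anything else passes through unchanged."""
    doubled = "HGNC:HGNC_"
    if curie.startswith(doubled):
        return "HGNC:" + curie[len(doubled):]
    return curie
```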
This aim is related to 2.1.b (#27) and 2.2.a (#22) in that we have already observed ways in which BioPortal ontologies may belong to one or more of the following groups:
This Aim includes the following tasks:
For each ontology, determine how many:
Of those ontologies with non-root Biolink classes,
Many transforms appear to be missing descriptions, IRIs, and possibly other fields populated in the previous set of transforms. Will need to verify the JSON -> TSV step is populating fields as expected, particularly name and description.
As of fc140ce, the transform for CANONT_1 proceeds like this:
ROBOT: relax CANONT_1
Relaxing /tmp/tmpln0gf78b to transformed/ontologies/CANONT/CANONT_1_relaxed.json...
Complete.
KGX transform CANONT_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'CANONT_1_relaxed.json'
Traceback (most recent call last):
File "run.py", line 33, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 28, in run
do_transforms(data_filepaths)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 119, in do_transforms
kgx.cli.transform(inputs=[relaxed_outpath],
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 586, in transform
transform_source(
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 895, in transform_source
transformer.transform(input_args, output_args)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 247, in transform
self.process(source_generator, intermediate_sink)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
for rec in source:
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 62, in parse
yield from chain(n, e)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 86, in read_nodes
yield self.read_node(n)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 124, in read_node
category = self.get_category(curie, node)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 256, in get_category
element = self.toolkit.get_element_by_mapping(category)
File "/home/harry/kg-env/lib/python3.8/site-packages/bmt/toolkit.py", line 1057, in get_element_by_mapping
mappings = self.get_all_elements_by_mapping(identifier)
File "/home/harry/kg-env/lib/python3.8/site-packages/bmt/toolkit.py", line 1274, in get_all_elements_by_mapping
self.generator.namespaces.uri_for(identifier), set()
File "/home/harry/kg-env/lib/python3.8/site-packages/linkml_runtime/utils/namespaces.py", line 183, in uri_for
raise ValueError(f"{TypedNode.yaml_loc(uri_or_curie)}Unknown CURIE prefix: {prefix}")
ValueError: Unknown CURIE prefix: file
This is unhandled and causes the set of transforms to fail.
The issue is local file references (mistakenly) included in the ontology. In the obojson they look like this:
...
"id" : "http://purl.obolibrary.org/obo/ID_0000052",
"meta" : {
"basicPropertyValues" : [ {
"pred" : "http://www.geneontology.org/formats/oboInOwl#hasOBONamespace",
"val" : "file:C:/Documents and Settings/Andre/My Documents/OBO/Ontology_Scholastic.obo"
}
...
Easy solution is to catch the ValueError and skip this transform.
Assigning a namespace could fix the issue but seems to be outside the scope of a transform.
See also https://bioportal.bioontology.org/ontologies/CANONT/?p=classes&conceptid=root
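The catch-and-skip approach could be sketched like this, where transform_fn stands in for kgx.cli.transform (names and the error-string match are assumptions, not the repo's code):

```python
def transform_or_skip(transform_fn, name: str, **kwargs) -> bool:
    """Run one transform and skip it (rather than killing the whole
    batch) when the 'Unknown CURIE prefix' ValueError appears; any
    other ValueError still propagates."""
    try:
        transform_fn(**kwargs)
        return True
    except ValueError as e:
        if "Unknown CURIE prefix" in str(e):
            print(f"Skipping {name}: {e}")
            return False
        raise
```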
And set up GH Actions to run tests.
In the BioPortal 4store dump, some graphs (e.g., BTO_ONTOLOGY) appear to be completely empty. This is not the case for the corresponding ontology pages on the BioPortal site (e.g., https://bioportal.bioontology.org/ontologies/BTO_ONTOLOGY/?p=summary). So where are those classes in the data dump?
Seven ontologies do not transform to KGX TSV due to the presence of a value resembling a CURIE with a file prefix. These are otherwise valid, if somewhat less-than-descriptive, values, and may be modified to permit the transform to continue.
Passing each ontology through ROBOT would offer the immediate benefit of yielding a report, plus it will enable more consistent handling of situations more complex than the average set of is_a relationships. In KG-OBO the process involves a robot relax for all ontologies (some also got a merge & convert, but primarily to resolve imports - those shouldn't be present here), done as follows:
https://github.com/Knowledge-Graph-Hub/kg-obo/blob/9768432876deda512af088586bf1b1289251e85d/kg_obo/transform.py#L726-L745
kg_obo_logger.info(f"ROBOT preprocessing: relax {ontology_name}")
print(f"ROBOT preprocessing: relax {ontology_name}")
temp_suffix = f"_{ontology_name}_relaxed.owl"
tfile_relaxed = tempfile.NamedTemporaryFile(delete=False, suffix=temp_suffix)
if not relax_owl(robot_path, tfile.name, tfile_relaxed.name, robot_env):
    kg_obo_logger.error(f"ROBOT relaxing of {ontology_name} failed - skipping.")
    print(f"ROBOT relaxing of {ontology_name} failed - skipping.")
    tfile_relaxed.close()
    continue
tfile_relaxed.close()
before_count = get_file_length(tfile.name)
after_count = get_file_length(tfile_relaxed.name)
kg_obo_logger.info(f"Before relax: {before_count} lines. After relax: {after_count} lines.")
print(f"Before relax: {before_count} lines. After relax: {after_count} lines.")
if after_count == 0:
    kg_obo_logger.error(f"ROBOT relaxing of {ontology_name} yielded an empty result!")
    print(f"ROBOT relaxing of {ontology_name} yielded an empty result!")
    continue  # Need to skip this one or we will upload empty results
A similar strategy, plus a report as seen here (https://github.com/Knowledge-Graph-Hub/kg-obo/blob/9768432876deda512af088586bf1b1289251e85d/kg_obo/robot_utils.py#L112) would be useful.
Don't want to waste time re-transforming the same thing repeatedly.
Check the output directory first to see if it already contains non-empty node and edgelists.
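That check could look like this; the `{name}_nodes.tsv` / `{name}_edges.tsv` layout is an assumption about the output directory:

```python
import os

def transform_already_done(outdir: str, name: str) -> bool:
    """True when non-empty node and edge TSVs already exist for this
    ontology, so the transform can be skipped."""
    for suffix in ("_nodes.tsv", "_edges.tsv"):
        path = os.path.join(outdir, name + suffix)
        if not (os.path.isfile(path) and os.path.getsize(path) > 0):
            return False
    return True
```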
Not all node and edge properties used in these transforms explicitly map to Biolink property slots, but they can be mapped - probably a good use case for SSSOM. An example - nodes and edges from the Asthma Ontology:
$ head AO_2_nodes.tsv
id category name provided_by :http://data.bioontology.org/metadata/def/mappingLoom :http://data.bioontology.org/metadata/def/prefLabel :http://data.bioontology.org/metadata/prefixIRI comment type
http://childhealthservicemodels.eu/asthma#MOCHA_0300 biolink:NamedThing national asthma program tmps_d74ivz nationalasthmaprogram national asthma program asthma:MOCHA_0300 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 biolink:NamedThing management tmps_d74ivz management management asthma:MOCHA-Asthma_000082 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 biolink:NamedThing extrinsic asthma with status asthmaticus tmps_d74ivz extrinsicasthmawithstatusasthmaticus extrinsic asthma with status asthmaticus asthma:MOCHA-ADHD_000056 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 biolink:NamedThing general tmps_d74ivz general general asthma:MOCHA-ADHD_000000 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 biolink:NamedThing passive smoking exposure tmps_d74ivz passivesmokingexposure passive smoking exposure asthma:MOCHA-Asthma_000162 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000063 biolink:NamedThing environmental tmps_d74ivz environmental environmental asthma:MOCHA-ADHD_000063 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000052 biolink:NamedThing intrinsic asthma with status asthmaticus tmps_d74ivz intrinsicasthmawithstatusasthmaticus intrinsic asthma with status asthmaticus asthma:MOCHA-ADHD_000052 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000283 biolink:NamedThing primary care asthma register tmps_d74ivz primarycareasthmaregister primary care asthma register asthma:MOCHA-Asthma_000283 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000077 biolink:NamedThing other tmps_d74ivz other other asthma:MOCHA-Asthma_000077 owl:Class
For one, the provided_by isn't really informative, since it's just the name of the temp file used during transformation. The slots prefixed with :http://data.bioontology.org/metadata/ aren't explicitly Biolink-compatible.
And the edges:
subject predicate object relation knowledge_source
http://childhealthservicemodels.eu/asthma#MOCHA_0300 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA_0300 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA_0300 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-EPI_0000037 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-EPI_0000033 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
This is more a matter of mapping predicates: biolink:subclass_of is fine, but :http://data.bioontology.org/metadata/def/mappingSameURI can be replaced (and in these cases, it isn't even doing anything).
Rename to BioPortal-to-KGX to reflect the project relationship.
Tasks for this Aim are as follows.
During the transformation step, assign Biolink categories to nodes and edges based on what kgx provides already. Also correct errors violating the kgx format and Biolink:
The transform for SNOMEDCT only makes it as far as the JSON stage.
At 821M it's pretty big, so this may be a case of ROBOT hitting a memory limit (though NCBITAXON is larger on disk and completed under the same conditions, so the cause isn't clear).
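If ROBOT is indeed running out of heap, its memory cap can be raised via the ROBOT_JAVA_ARGS environment variable before rerunning the transform. The 16G value below is an assumption; size it to the machine doing the transform.

```shell
# ROBOT passes the contents of ROBOT_JAVA_ARGS to the JVM; raising -Xmx may
# let the 821M SNOMEDCT dump finish the JSON conversion step.
export ROBOT_JAVA_ARGS=-Xmx16G
```

If the transform still stalls with a generous heap, the cause is probably something other than memory.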
In a run of the following:
python run.py --input [path to data] --write_curies --remap_types --get_bioportal_metadata --ncbo_key [key]
some transforms appear to complete just fine until the final step, at which point they are not written to TSV. Here's an example with BCO:
Starting on /home/harry/Bioportal/4store-export-2022-02-02/data/46/1e/11adde7246a87e69de4d22d0b112
ROBOT report(s) present: robot.report
KGX validation log present: kgx_validate_BCO_11.log
BioPortal metadata not found for BCO_11 - will retrieve.
Accessing https://data.bioontology.org/ontologies/BCO/...
<Response [200]>
Accessing https://data.bioontology.org/ontologies/BCO/latest_submission...
<Response [200]>
Retrieved metadata for BCO (Biological Collections Ontology)
File for BCO_11 is empty! Writing placeholder.
In this case, the only contents of the output directory for BCO
are:
BCO_11
BCO_11_relaxed.json
kgx_validate_BCO_11.log
robot.measure
robot.report
Edge and node files are not present.
I've also seen this happen with DISDRIV and COGAT and a few others. Should try a fresh (completely from scratch) transform on these.
Transforming VODANA-UG_1
completes, but validating it raises an error:
ROBOT: relax VODANA-UG_1
Relaxing /tmp/tmp2k1_xre6 to transformed/ontologies/VODANA-UG/VODANA-UG_1_relaxed.json...
Complete.
KGX transform VODANA-UG_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'VODANA-UG_1_relaxed.json'
Validating...
Traceback (most recent call last):
File "run.py", line 36, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 28, in run
transform_status = do_transforms(data_filepaths, kgx_validate)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 163, in do_transforms
validate_transform(outdir)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 194, in validate_transform
json_dump((kgx.cli.validate(inputs=tx_filepaths,
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 218, in validate
transformer.transform(
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 240, in transform
self.process(source_generator, sink)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
for rec in source:
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 171, in parse
file_iter = pd.read_csv(
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1231, in _make_engine
return mapping[engine](f, **self.options)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
There's a simple reason - the edge file is empty.
The validator function should check if a file is empty before passing it to KGX.
Re: Bioportal meeting on May 9, 2022
We would like to know the mappings between ontology IDs and the IRI prefixes used in BioPortal.
What are the IRI prefixes for each ontology?
Index them and assemble the result in a TSV.
Specifically:
Validating OPDT_1 produces an IndexError - seemingly due to some issue parsing a filename, though both OPDT_1_edges.tsv and OPDT_1_nodes.tsv appear to be present.
ROBOT: relax OPDT_1
Relaxing /tmp/tmphip66s_b to transformed/ontologies/OPDT/OPDT_1_relaxed.json...
Complete.
KGX transform OPDT_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'OPDT_1_relaxed.json'
Validating...
Traceback (most recent call last):
File "run.py", line 44, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 36, in run
transform_status = do_transforms(data_filepaths, kgx_validate)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 179, in do_transforms
validate_transform(outdir)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 204, in validate_transform
tx_filename = os.path.basename(tx_filepaths[0])
IndexError: list index out of range
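Since the crash comes from indexing tx_filepaths[0] when the list is empty, a guard before the index would avoid it. A sketch (hypothetical wrapper; validate_fn stands in for the actual KGX validation call in functions.py):

```python
import os

def validate_transform_safely(tx_filepaths, validate_fn):
    """Skip validation entirely when no transform files were found,
    instead of crashing on tx_filepaths[0]."""
    if not tx_filepaths:
        print("No transform files found - skipping validation.")
        return None
    tx_filename = os.path.basename(tx_filepaths[0])
    print(f"Validating {tx_filename}...")
    return validate_fn(tx_filepaths)
```

This would also make the OPDT_1 failure easier to diagnose, since the log would say which directory produced no files rather than dying in os.path.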
Enrich graph-format BioPortal ontologies with the following:
I noticed that the most recent merged graph merges entries from ATO and ATOL:
ATO:0000368 biolink:NamedThing startle response|Hydromantes (Gistel 1848) None ATO_2_nodes.tsv|ATOL_2_nodes.tsv
This is unexpected as ATOL should have the CURIE prefix ATOL, but sure enough, it got assigned ATO instead:
$ head ATOL_2_nodes.tsv
id category name description provided_by
ATO:0000261 biolink:NamedThing milk fatty acid iso C18:0 concentration Animal Trait Ontology for Livestock
ATO:0001593 biolink:NamedThing efficiency of phenylalanine utilization Animal Trait Ontology for Livestock
ATO:0001592 biolink:NamedThing efficiency of methionine utilization Animal Trait Ontology for Livestock
I suspect this is overzealous prefix mapping. There may be similar instances among similarly named BP ontos, such as MA and MAT.
The curies package c/o @cthoyt should be able to more elegantly convert IRI prefixes to CURIEs.
See https://github.com/cthoyt/curies
Example:
from curies import Converter

converter = Converter.from_reverse_prefix_map({
    "http://purl.obolibrary.org/obo/CHEBI_": "CHEBI",
    "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=": "CHEBI",
    "http://purl.obolibrary.org/obo/MONDO_": "MONDO",
})

>>> converter.expand("CHEBI:138488")
'http://purl.obolibrary.org/obo/CHEBI_138488'
>>> converter.compress("http://purl.obolibrary.org/obo/CHEBI_138488")
'CHEBI:138488'
>>> converter.compress("https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488")
'CHEBI:138488'
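The key property for the ATO/ATOL collision is longest-prefix-wins matching. A plain-Python sketch of that behavior (the IRI prefixes in the test are hypothetical, chosen only to show how a short prefix could otherwise shadow a longer one):

```python
def compress(iri: str, reverse_prefix_map: dict) -> str:
    """Compress an IRI to a CURIE using the LONGEST matching IRI prefix,
    so a short prefix (e.g. ATO) can't shadow a longer one (e.g. ATOL)."""
    best = None
    for uri_prefix, curie_prefix in reverse_prefix_map.items():
        if iri.startswith(uri_prefix):
            if best is None or len(uri_prefix) > len(best[0]):
                best = (uri_prefix, curie_prefix)
    if best is None:
        return iri  # no known prefix; leave the IRI untouched
    uri_prefix, curie_prefix = best
    return f"{curie_prefix}:{iri[len(uri_prefix):]}"
```

A naive first-match or substring-based approach is exactly what produces ATO:... CURIEs for ATOL terms; the curies Converter handles this correctly out of the box.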
The following ontologies encountered some kind of resolvable error during transformation of the Jul 20 2022 BP dump.
This means they may be missing nodes or edges; it usually indicates some issue with transforming to JSON or parsing the source.
Ontology | Issue |
---|---|
ADMF | Empty, metadata only |
ARO | Empty, but seemingly valid version on BP |
BFLC | Empty, with RDF errors on BP |
BPFORMS | Empty, renders incorrectly on BP |
CEDARPC | Empty, metadata only |
CHEMBIO | Empty, metadata only |
CST | Empty, with RDF errors on BP |
CS | Empty, with RDF errors on BP |
CTP | Empty, metadata only |
DCO | Empty, metadata only |
DMTO | Empty, with RDF errors on BP |
ECTO | Empty, renders incorrectly on BP |
FASTO | Empty, with RDF errors on BP |
FB-BT | ❓ Unknown - seems OK on BP |
HIV | Empty, with RDF errors on BP |
HMADO | Empty, metadata only |
INFRA-EN | Empty, metadata only |
LC-CARRIERS | Empty, with RDF errors on BP |
MARC-RELATORS | Empty, with RDF errors on BP |
MIRNAO | Empty, metadata only |
OCDM | Empty, with RDF errors on BP |
PPLC | Empty, with RDF errors on BP |
PRANAYTC | Empty, metadata only - is this spam? |
QUDT2-1 | Empty, with RDF errors on BP |
QUDT | A whole bunch of parsing issues on BP |
SOIL-PROF | Empty, with RDF errors on BP |
TDT | Empty, metadata only - is this spam? |
TEST_IDKWHATIMDO | Test only |
VODANETHIOPIA_OR | Empty, with RDF errors on BP |
Develop methods for enriching the KGX-format ontologies with BioPortal metadata (i.e., from the http://data.bioontology.org/ API).
In general, this could apply to any ontology - there are simply some properties we'd like to have (e.g., name, date last updated, or authors) but don't retrieve from the 4store. The metadata may be available from another location, so at minimum we'd like some sort of mapping containing these values.
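A sketch of the retrieval side, using the latest_submission endpoint that appears in the transform log above. The helper names are hypothetical, and passing the key as an apikey query parameter is an assumption drawn from the NCBO API's auth options.

```python
import json
from urllib.request import urlopen

API_BASE = "https://data.bioontology.org"

def latest_submission_url(acronym: str, ncbo_key: str) -> str:
    """Build the metadata URL for one ontology's latest submission."""
    return f"{API_BASE}/ontologies/{acronym}/latest_submission?apikey={ncbo_key}"

def fetch_metadata(acronym: str, ncbo_key: str) -> dict:
    """Fetch submission metadata (name, dates, contacts, ...) as a dict.
    Network call; expects a JSON response from the API."""
    with urlopen(latest_submission_url(acronym, ncbo_key)) as resp:
        return json.load(resp)
```

The returned dict could then be flattened into the mapping file described above, keyed by ontology acronym.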
The Bioportal site provides access to historical ontology versions, but these are not present in the 4store.
How can we access them?