ncbo / bioportal-to-kgx
Assemble a BioPortal Knowledge Graph
License: BSD 3-Clause "New" or "Revised" License
A few SSSOM maps are OK, but they are overkill when the entirety of a large ontology needs to be mapped to the same category.
There's likely a more type-entity-based strategy to be had here, or the type entity could be defined in its own map (I don't really like that latter option in terms of source-of-truth management, but it would work).
Parsing ABD results in the edges including the following:
urn:uuid:c95fa7fd-b1f0-4f35-83d4-4fce72d8bbc0 http://brd.bsvgateway.org/api/organism/?id=620 biolink:subclass_of http://brd.bsvgateway.org/api/organism/ rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
urn:uuid:7a7d352b-0ebb-42a4-a950-fcf914d72fd4 http://brd.bsvgateway.org/api ransmission/ biolink:subclass_of owl:Thing rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
urn:uuid:b9b23088-74d8-45be-b1d8-9ef0085b1a8d http://brd.bsvgateway.org/api/organism/?id=628 biolink:subclass_of http://brd.bsvgateway.org/api/organism/ rdfs:subClassOf BioPortal Anthology of Biosurveillance Diseases
The middle line is the issue - this breaks any merge in which these edges are included, raising a pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 1017, saw 8
It looks like there's some truncation or extra whitespace in the id. This ontology has a class named Transmission, so this is likely something with the prefix http://brd.bsvgateway.org/api/transmission/
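As a stopgap, a pre-check could flag rows whose field count doesn't match the header - i.e., ids with embedded tabs or stray whitespace - before they ever reach a merge. A sketch (the function name and return format are made up, not part of this repo):

```python
def find_malformed_rows(tsv_path: str) -> list:
    """Report (line number, field count) for rows whose tab-separated field
    count differs from the header's, which is what an id with an embedded
    tab produces."""
    bad = []
    with open(tsv_path) as f:
        header_len = len(f.readline().rstrip("\n").split("\t"))
        for lineno, line in enumerate(f, start=2):  # line 1 is the header
            n = len(line.rstrip("\n").split("\t"))
            if n != header_len:
                bad.append((lineno, n))
    return bad
```

Running this over each edge file would name the offending ontology and line before pandas does.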
ABD has some other weirdness too, with lots of Error classes. See on BioPortal at https://bioportal.bioontology.org/ontologies/ABD/?p=classes&conceptid=root
The biolink:OntologyClass category is too abstract to be useful when another node category is available, so when re-mapping, remove that category.
Transform for RDL fails:
Traceback (most recent call last):
File "run.py", line 81, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 69, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 170, in do_transforms
metadata = (header.split(NAMESPACE))[1]
IndexError: list index out of range
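A defensive rewrite of the failing line would sidestep the IndexError when a header doesn't contain the namespace at all. The names here are illustrative, not the actual functions.py code:

```python
def split_header(header: str, namespace: str):
    """Guarded version of `(header.split(NAMESPACE))[1]`: return None
    instead of raising IndexError when the namespace isn't present."""
    parts = header.split(namespace)
    return parts[1] if len(parts) > 1 else None
```

The caller can then log and skip headers that come back as None rather than crashing the whole run.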
A frustrating issue, because I momentarily had a solution, but then it slipped through my fingers like fine sand...
anyway, calling kgx.cli.validate in functions.py as follows:
tx_filename = os.path.basename(tx_filepaths[0])
tx_name = "_".join(tx_filename.split("_", 2)[:2])
parent_dir = os.path.dirname(tx_filepaths[0])
log_path = os.path.join(parent_dir, f'kgx_validate_{tx_name}.log')
try:
    errors = kgx.cli.validate(inputs=tx_filepaths,
                              input_format="tsv",
                              input_compression=None,
                              stream=True,
                              output=log_path)
    if len(errors) > 0:  # i.e. there are any real errors
        print(f"KGX found errors in graph files. See {log_path}")
    else:
        print(f"KGX found no errors in {tx_name}.")
except TypeError as e:
    print(f"Error while validating {tx_name}: {e}")
will only output validation errors to STDOUT. A file is created at log_path but not modified in any way - it remains empty. Strangely enough, this behavior is consistent no matter what value is passed to the output parameter.
Likely related to Knowledge-Graph-Hub/knowledge-graph-hub.github.io#13 and also mentioned in biolink/kgx#309.
Looking at kgx, the validate function certainly does open a file for writing:
https://github.com/biolink/kgx/blob/496934e0427bad695d3fe8ee05111140418fb200/kgx/cli/cli_utils.py#L249-L252
but then it looks like that defaults to stderr in write_report:
https://github.com/biolink/kgx/blob/496934e0427bad695d3fe8ee05111140418fb200/kgx/error_detection.py#L157-L178
unless a level is provided, which isn't an option through the CLI as far as I can tell.
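Until the CLI exposes level, one workaround is to write whatever validate returns into the log file ourselves. A sketch - the structure of the returned errors isn't guaranteed by kgx, so each entry is just stringified:

```python
def write_validation_log(errors, log_path: str) -> None:
    """Dump validation errors to log_path manually, since kgx's
    write_report defaults to stderr unless a `level` is passed."""
    with open(log_path, "w") as log:
        for err in errors:
            log.write(f"{err}\n")
```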
The ROBOT relax fails because of the following error when parsing DOID_617:
ROBOT encountered an error:
RAN: /home/harry/BioPortal-to-KGX/robot relax --input /tmp/tmpq3t8cyv0 --output transformed/ontologies/DOID/DOID_617_relaxed.json --vvv
STDOUT:
2022-03-07 14:01:21,081 DEBUG org.obolibrary.robot.IOHelper - Loading ontology /tmp/tmpq3t8cyv0 with catalog file null
2022-03-07 14:01:21,082 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading file META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-03-07 14:01:21,083 DEBUG org.semanticweb.owlapi... (24347 more, please see e.stdout)
STDERR:
java.lang.NullPointerException: Cannot invoke "String.toString()" because "lv" is null
at org.geneontology.obographs.owlapi.FromOwl.generateGraph(FromOwl.java:409)
at org.geneontology.obographs.owlapi.FromOwl.generateGraphDocument(FromOwl.java:63)
at org.obolibrary.robot.IOHelper.saveOntologyFile(IOHelper.java:1684)
at org.obolibrary.robot.IOHelper.saveOntology(IOHelper.java:838)
at org.obolibrary.robot.CommandLineHelper.maybeSaveOutput(CommandLineHelper.java:671)
at org.obolibrary.robot.RelaxCommand.execute(RelaxCommand.java:113)
at org.obolibrary.robot.CommandManager.executeCommand(CommandManager.java:248)
at org.obolibrary.robot.CommandManager.execute(CommandManager.java:192)
at org.obolibrary.robot.CommandManager.main(CommandMa... (97 more, please see e.stderr)
ROBOT relax of DOID_617 failed - skipping.
The next transform continues as expected, but DOID doesn't get transformed.
ROBOT is used to pre-process each ontology, but it should also produce a report, specifically the output of a measure command. Each report should then be saved to the ontology's transformation output directory.
Errors in URIs are caught during KGX transform and throw a WARNING, but may not be added to the output TSV.
Example:
WARNING:rdflib.term:https://w3id.org/biolink/vocab/Dicty Phenotypes does not look like a valid URI, trying to serialize this will break.
The issue here is the space in the URI.
Despite the presence of attributes like hasSTY, nodes with corresponding semantic types are not being assigned the corresponding Biolink category. The STY ontology itself is being correctly mapped:
id category name description provided_by
STY:T058 biolink:Activity Semantic Types Ontology
STY:T057 biolink:Activity Semantic Types Ontology
STY:T056 biolink:Activity Semantic Types Ontology
STY:T055 biolink:Behavior Semantic Types Ontology
STY:T054 biolink:Behavior Semantic Types Ontology
STY:T053 biolink:Behavior Semantic Types Ontology
STY:T052 biolink:Activity Semantic Types Ontology
Yet these don't end up in the UMLS ontologies:
$ more MEDDRA_20_nodes.tsv
id category name description provided_by
MEDDRA:10007469 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007467 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007468 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007465 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007466 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
MEDDRA:10007463 biolink:OntologyClass Medical Dictionary for Regulatory Activities Terminology (MedDRA)
This is more of an issue for kg-bioportal, but it can be addressed here.
Some CURIEs contain pound signs (no, I will not call them hash signs, etc etc), e.g. in PR:
OBO:EnsemblBacteria#_BruAb1_2100
or
OBO:dictyBase#_DDB_G0276875
This only becomes an issue once we hit the cat-merge step in kg-bioportal, at which point everything after the # is truncated, leaving redundant-looking CURIEs like OBO:dictyBase. That's not really a bug, and it can be fixed by assigning specific prefixes for these cases.
Generating mappings between BioPortal ontologies and OBO ontologies will provide additional strategies for integration. LOOM mappings capture some of this space, but require lexical similarity between labels. We would like to recognize potential logical mappings between ontologies as well.
A few options:
A small but non-zero number of the ontology transforms can't be parsed properly by pandas. This is probably caught by one or another of the existing validations, but by the kg-bioportal merge step it becomes an issue like the following:
15:44:08 Traceback (most recent call last):
15:44:08 File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
15:44:08 result = (True, func(*args, **kwds))
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 809, in parse_source
15:44:08 transformer.transform(input_args)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 303, in transform
15:44:08 self.process(source_generator, sink)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
15:44:08 for rec in source:
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 184, in parse
15:44:08 for chunk in file_iter:
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1187, in __next__
15:44:08 return self.get_chunk()
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1284, in get_chunk
15:44:08 return self.read(nrows=size)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
15:44:08 index, columns, col_dict = self._engine.read(nrows)
15:44:08 File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
15:44:08 data = self._reader.read(nrows)
15:44:08 File "pandas/_libs/parsers.pyx", line 787, in pandas._libs.parsers.TextReader.read
15:44:08 File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
15:44:08 File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
15:44:08 File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
15:44:08 pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 6, saw 9
Most of these errors are due to #32, so the solution is to re-transform, but the pandas error does not specify which ontology led to it. A pre-screening in this repo would be helpful: just load each graph file into pandas and warn loudly if it doesn't parse.
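Such a pre-screen could be as small as this (function name and message format are illustrative):

```python
import pandas as pd

def prescreen_graph_file(path: str) -> bool:
    """Load a KGX TSV with pandas and warn loudly if it fails to parse,
    so the offending ontology is named before the merge step runs."""
    try:
        pd.read_csv(path, sep="\t", dtype=str)
        return True
    except pd.errors.ParserError as e:
        print(f"WARNING: {path} does not parse: {e}")
        return False
```

Note that pandas only raises ParserError on *inconsistent* field counts (as in the traceback above); a uniformly wider file is silently read with an inferred index column, so this catches exactly the merge-breaking case.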
And not just a few ontologies, either - see how many can be combined (omitting larger ontologies such as NCBITAXON, GAZ, and DRON) as a merged graph.
Essentially anticipating #30 (Aim 2.3.b)
The process of transforming CCO (Cell Cycle Ontology) goes like this:
Starting on ../Bioportal/4store-export-2022-07-20/data/f0/ff/fbb225ff0f97d7737e854fd2d48d
BioPortal metadata not found for CCO_6 - will retrieve.
Accessing https://data.bioontology.org/ontologies/CCO/...
<Response [200]>
Accessing https://data.bioontology.org/ontologies/CCO/latest_submission...
<Response [200]>
Retrieved metadata for CCO (Cell Cycle Ontology)
ROBOT: relax CCO_6
Relaxing /tmp/tmpxpybbnaf to transformed/ontologies/CCO/CCO_6_relaxed.json...
Traceback (most recent call last):
File "/home/harry/BioPortal-to-KGX/run.py", line 138, in <module>
run()
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/harry/BioPortal-to-KGX/run.py", line 113, in run
transform_status = do_transforms(
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 233, in do_transforms
if relax_ontology(robot_path,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/robot_utils.py", line 65, in relax_ontology
robot_command(
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 1524, in __call__
return RunningCommand(cmd, call_args, stdin, stdout, stderr)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 788, in __init__
self.wait()
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 845, in wait
self.handle_command_exit_code(exit_code)
File "/home/harry/.cache/pypoetry/virtualenvs/bioportal-to-kgx-BG6p1jeu-py3.9/lib/python3.9/site-packages/sh.py", line 869, in handle_command_exit_code
raise exc
sh.SignalException_SIGKILL:
RAN: /home/harry/BioPortal-to-KGX/robot relax --input /tmp/tmpxpybbnaf --output transformed/ontologies/CCO/CCO_6_relaxed.json --vvv
STDOUT:
2022-10-25 19:26:23,373 DEBUG org.obolibrary.robot.IOHelper - Loading ontology /tmp/tmpxpybbnaf with catalog file null
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading file META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi.utilities.Injector - Loading URL for service jar:file:/home/harry/BioPortal-to-KGX/robot.jar!/META-INF/services/org.semanticweb.owlapi.model.OWLOntologyManager
2022-10-25 19:26:23,374 DEBUG org.semanticweb.owlapi... (257202 more, please see e.stdout)
STDERR:
An unrelated but still present issue: robot's verbosity flag is -vvv, not --vvv.
Run the robot command on its own, and this happens:
[many debug lines later]
DEBUG Saving ontology as OboGraphs JSON Syntax with to IRI file:/home/harry/BioPortal-to-KGX/transformed/ontologies/CCO/CCO_6_relaxed.json
OBO GRAPH ERROR Could not convert ontology to OBO Graph (see https://github.com/geneontology/obographs)
For details see: http://robot.obolibrary.org/errors#obo-graph-error
java.io.IOException: errors#OBO GRAPH ERROR Could not convert ontology to OBO Graph (see https://github.com/geneontology/obographs)
at org.obolibrary.robot.IOHelper.saveOntologyFile(IOHelper.java:1722)
at org.obolibrary.robot.IOHelper.saveOntology(IOHelper.java:846)
at org.obolibrary.robot.CommandLineHelper.maybeSaveOutput(CommandLineHelper.java:667)
at org.obolibrary.robot.RelaxCommand.execute(RelaxCommand.java:113)
at org.obolibrary.robot.CommandManager.executeCommand(CommandManager.java:244)
at org.obolibrary.robot.CommandManager.execute(CommandManager.java:188)
at org.obolibrary.robot.CommandManager.main(CommandManager.java:135)
at org.obolibrary.robot.CommandLineInterface.main(CommandLineInterface.java:61)
This ontology hasn't been updated in 8 years so it may be skippable.
Metadata retrieval for some ontologies encounters the following:
Starting on /home/harry/Bioportal/4store-export-2022-02-02/data/8e/be/8e3726e465b55e80c205dad4f43e
KGX validation log present: kgx_validate_VODANA-UG-OPD_1.log
ROBOT report(s) present: robot.report
Transform already present for VODANA-UG-OPD_1
Validating graph files can be parsed...
Graph file transformed/ontologies/VODANA-UG-OPD/VODANA-UG-OPD_1_edges.tsv parses OK.
BioPortal metadata not found for VODANA-UG-OPD_1 - will retrieve.
Accessing https://data.bioontology.org/ontologies/VODANA-UG-OPD/...
<Response [403]>
Accessing https://data.bioontology.org/ontologies/VODANA-UG-OPD/latest_submission...
<Response [403]>
Traceback (most recent call last):
File "run.py", line 73, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 61, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 158, in do_transforms
onto_md = bioportal_metadata(dataname, ncbo_key)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/bioportal_utils.py", line 53, in bioportal_metadata
md[md_type] = content[md_type]
KeyError: 'submissionId'
The content is empty, so we shouldn't assign anything to md.
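A guard along these lines would leave md untouched when the API returns an empty or 403 response (names mirror the traceback, not the actual bioportal_utils.py code):

```python
def set_metadata_field(md: dict, content: dict, md_type: str) -> None:
    """Copy a metadata field only when the API response actually
    contains it; an empty response then leaves `md` as-is instead
    of raising KeyError."""
    if content and md_type in content:
        md[md_type] = content[md_type]
```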
This is an issue left over from a recent PR (#78): nodecount is referenced in pandas_validate_transform() but is only assigned a value when the input follows the nodes/edges naming convention, there's a parsing error, or the data is empty.
The following prefixes need to be updated:
Click options to select or exclude one or more ontologies, by name, would be helpful for avoiding the really large ones (like NCBITAXON) when they don't require further work or should be handled on their own.
The goal stated in the title is more or less complete, so the remaining tasks here are:
When running a fresh transform like the following:
python run.py --input ../Bioportal/4store-export-2022-07-20/data/ --get_bioportal_metadata --ncbo_key [key] --remap_types --write_curies
the process fails upon trying to load the SSSOM maps:
Looking for records in ../Bioportal/4store-export-2022-07-20/data/
976 files found.
Setting up ROBOT...
ROBOT path: /home/harry/BioPortal-to-KGX/robot
ROBOT evironment variables: -Xmx12g -XX:+UseG1GC
Loading type maps from mappings/
WARNING:root:No prefix map provided (not recommended), trying to use defaults..
Traceback (most recent call last):
File "run.py", line 81, in <module>
run()
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 69, in run
transform_status = do_transforms(data_filepaths, kgx_validate, robot_validate, pandas_validate,
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 121, in do_transforms
this_table = read_sssom_table(os.path.join(MAPPING_DIR,filepath))
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/deprecation.py", line 260, in _inner
return function(*args, **kwargs)
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 84, in read_sssom_table
return parse_sssom_table(
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 183, in parse_sssom_table
msdf = from_sssom_dataframe(
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom/parsers.py", line 382, in from_sssom_dataframe
mlist.append(_prepare_mapping(Mapping(**mdict)))
File "<string>", line 42, in __init__
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/sssom_schema/datamodel/sssom_schema.py", line 258, in __post_init__
self.MissingRequiredField("mapping_justification")
File "/home/harry/bioportal-to-kgx-env/lib/python3.8/site-packages/linkml_runtime-1.2.0rc3-py3.8.egg/linkml_runtime/utils/yamlutils.py", line 246, in MissingRequiredField
raise ValueError(f"{field_name} must be supplied")
ValueError: mapping_justification must be supplied
As the stack trace indicates, this is due to a missing required field in the maps, per the sssom_schema: they need to include mapping_justification.
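For illustration, a minimal SSSOM table with the required column might look like the following. The row reuses the STY mapping shown earlier in this document; skos:exactMatch and semapv:ManualMappingCuration are standard SSSOM/semapv values, and real files also need the usual YAML metadata block with a CURIE map:

```
subject_id	predicate_id	object_id	mapping_justification
STY:T058	skos:exactMatch	biolink:Activity	semapv:ManualMappingCuration
```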
See ncbo/kg-bioportal#38 (comment)
and example output in
https://kg-hub.berkeleybop.io/kg-bioportal/onto_status.yaml
When using the kgx_validate flag, this error appears:
Error while validating CCO_6: validate() got an unexpected keyword argument 'stream'
This causes KGX-based validation to fail. The problem is the call to kgx.cli.validate in functions.py - it does, in fact, try to pass stream=True to the validate function, but the function doesn't have that arg.
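One version-tolerant fix is to pass only the keyword arguments the installed validate() actually accepts. This is a generic sketch, not the kgx API itself (and it won't help for functions that swallow **kwargs):

```python
import inspect

def call_with_supported_kwargs(func, **kwargs):
    """Drop any keyword arguments the target function doesn't declare,
    so e.g. `stream=True` is omitted for kgx versions without it."""
    accepted = inspect.signature(func).parameters
    return func(**{k: v for k, v in kwargs.items() if k in accepted})
```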
We have many nodes assigned the biolink:NamedThing category by default, but these could be assigned other categories if a mapping is available, e.g. UMLS semantic types.
See API call: https://data.bioontology.org/mappings?ontologies=SNOMEDCT,BIOLINK
(but there is a bug limiting this for now)
As of the Jul 20 2022 BP dump, the transform for ISSVA fails after pandas validation. Will try to reproduce.
The type mapping ensuring that HGNC nodes are assigned biolink:Gene is seemingly not working. This could be due to HGNC CURIEs taking the form HGNC:HGNC_XXXX rather than HGNC:XXXX. It should be the latter - see https://bioregistry.io/registry/hgnc.
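A normalization pass along these lines would bring the CURIEs in line with the Bioregistry form (helper name is made up):

```python
def normalize_hgnc(curie: str) -> str:
    """Rewrite HGNC:HGNC_XXXX to the Bioregistry-preferred HGNC:XXXX;
    anything else passes through unchanged."""
    doubled = "HGNC:HGNC_"
    if curie.startswith(doubled):
        return "HGNC:" + curie[len(doubled):]
    return curie
```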
This aim is related to 2.1.b (#27) and 2.2.a (#22) in that we have already observed ways in which BioPortal ontologies may belong to one or more of the following groups:
This Aim includes the following tasks:
For each ontology, determine how many:
Of those ontologies with non-root Biolink classes,
Many transforms appear to be missing descriptions, IRIs, and possibly other fields populated in the previous set of transforms. Will need to verify the JSON -> TSV step is populating fields as expected, particularly name and description.
As of fc140ce, the transform for CANONT_1 proceeds like this:
ROBOT: relax CANONT_1
Relaxing /tmp/tmpln0gf78b to transformed/ontologies/CANONT/CANONT_1_relaxed.json...
Complete.
KGX transform CANONT_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'CANONT_1_relaxed.json'
Traceback (most recent call last):
File "run.py", line 33, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 28, in run
do_transforms(data_filepaths)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 119, in do_transforms
kgx.cli.transform(inputs=[relaxed_outpath],
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 586, in transform
transform_source(
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 895, in transform_source
transformer.transform(input_args, output_args)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 247, in transform
self.process(source_generator, intermediate_sink)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
for rec in source:
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 62, in parse
yield from chain(n, e)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 86, in read_nodes
yield self.read_node(n)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 124, in read_node
category = self.get_category(curie, node)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/obograph_source.py", line 256, in get_category
element = self.toolkit.get_element_by_mapping(category)
File "/home/harry/kg-env/lib/python3.8/site-packages/bmt/toolkit.py", line 1057, in get_element_by_mapping
mappings = self.get_all_elements_by_mapping(identifier)
File "/home/harry/kg-env/lib/python3.8/site-packages/bmt/toolkit.py", line 1274, in get_all_elements_by_mapping
self.generator.namespaces.uri_for(identifier), set()
File "/home/harry/kg-env/lib/python3.8/site-packages/linkml_runtime/utils/namespaces.py", line 183, in uri_for
raise ValueError(f"{TypedNode.yaml_loc(uri_or_curie)}Unknown CURIE prefix: {prefix}")
ValueError: Unknown CURIE prefix: file
This is unhandled and causes the set of transforms to fail.
The issue is local file references (mistakenly) included in the ontology. In the obojson they look like this:
...
"id" : "http://purl.obolibrary.org/obo/ID_0000052",
"meta" : {
"basicPropertyValues" : [ {
"pred" : "http://www.geneontology.org/formats/oboInOwl#hasOBONamespace",
"val" : "file:C:/Documents and Settings/Andre/My Documents/OBO/Ontology_Scholastic.obo"
}
...
Easy solution is to catch the ValueError and skip this transform.
Assigning a namespace could fix the issue but seems to be outside the scope of a transform.
See also https://bioportal.bioontology.org/ontologies/CANONT/?p=classes&conceptid=root
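The catch-and-skip approach could be sketched like this, where transform_fn stands in for kgx.cli.transform (names and the error-string match are assumptions, not the repo's code):

```python
def transform_or_skip(transform_fn, name: str, **kwargs) -> bool:
    """Run one transform and skip it (rather than killing the whole
    batch) when the 'Unknown CURIE prefix' ValueError appears; any
    other ValueError still propagates."""
    try:
        transform_fn(**kwargs)
        return True
    except ValueError as e:
        if "Unknown CURIE prefix" in str(e):
            print(f"Skipping {name}: {e}")
            return False
        raise
```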
And set up GH Actions to run tests.
In the BioPortal 4store dump, some graphs (e.g., BTO_ONTOLOGY) appear to be completely empty. This is not the case for the corresponding ontology pages on the BioPortal site (e.g., https://bioportal.bioontology.org/ontologies/BTO_ONTOLOGY/?p=summary). So where are those classes in the data dump?
Seven ontologies do not transform to KGX TSV due to the presence of a value resembling a CURIE with a file prefix. These are otherwise valid, if somewhat less-than-descriptive, values, and may be modified to permit the transform to continue.
Passing each ontology through ROBOT would offer the immediate benefit of yielding a report, plus it will enable more consistent handling of situations more complex than the average set of is_a relationships. In KG-OBO the process involves a robot relax for all ontologies (some also got a merge & convert, but primarily to resolve imports - those shouldn't be present here), done as follows:
https://github.com/Knowledge-Graph-Hub/kg-obo/blob/9768432876deda512af088586bf1b1289251e85d/kg_obo/transform.py#L726-L745
kg_obo_logger.info(f"ROBOT preprocessing: relax {ontology_name}")
print(f"ROBOT preprocessing: relax {ontology_name}")
temp_suffix = f"_{ontology_name}_relaxed.owl"
tfile_relaxed = tempfile.NamedTemporaryFile(delete=False, suffix=temp_suffix)
if not relax_owl(robot_path, tfile.name, tfile_relaxed.name, robot_env):
    kg_obo_logger.error(f"ROBOT relaxing of {ontology_name} failed - skipping.")
    print(f"ROBOT relaxing of {ontology_name} failed - skipping.")
    tfile_relaxed.close()
    continue
tfile_relaxed.close()
before_count = get_file_length(tfile.name)
after_count = get_file_length(tfile_relaxed.name)
kg_obo_logger.info(f"Before relax: {before_count} lines. After relax: {after_count} lines.")
print(f"Before relax: {before_count} lines. After relax: {after_count} lines.")
if after_count == 0:
    kg_obo_logger.error(f"ROBOT relaxing of {ontology_name} yielded an empty result!")
    print(f"ROBOT relaxing of {ontology_name} yielded an empty result!")
    continue  # Need to skip this one or we will upload empty results
A similar strategy, plus a report as seen here (https://github.com/Knowledge-Graph-Hub/kg-obo/blob/9768432876deda512af088586bf1b1289251e85d/kg_obo/robot_utils.py#L112) would be useful.
Don't want to waste time re-transforming the same thing repeatedly.
Check the output directory first to see if it already contains non-empty node and edgelists.
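That check could look like this; the `{name}_nodes.tsv` / `{name}_edges.tsv` layout is an assumption about the output directory:

```python
import os

def transform_already_done(outdir: str, name: str) -> bool:
    """True when non-empty node and edge TSVs already exist for this
    ontology, so the transform can be skipped."""
    for suffix in ("_nodes.tsv", "_edges.tsv"):
        path = os.path.join(outdir, name + suffix)
        if not (os.path.isfile(path) and os.path.getsize(path) > 0):
            return False
    return True
```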
Not all node and edge properties used in these transforms explicitly map to Biolink property slots, but they can be mapped - probably a good use case for SSSOM. An example - nodes and edges from the Asthma Ontology:
$ head AO_2_nodes.tsv
id category name provided_by :http://data.bioontology.org/metadata/def/mappingLoom :http://data.bioontology.org/metadata/def/prefLabel :http://data.bioontology.org/metadata/prefixIRI comment type
http://childhealthservicemodels.eu/asthma#MOCHA_0300 biolink:NamedThing national asthma program tmps_d74ivz nationalasthmaprogram national asthma program asthma:MOCHA_0300 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 biolink:NamedThing management tmps_d74ivz management management asthma:MOCHA-Asthma_000082 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 biolink:NamedThing extrinsic asthma with status asthmaticus tmps_d74ivz extrinsicasthmawithstatusasthmaticus extrinsic asthma with status asthmaticus asthma:MOCHA-ADHD_000056 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 biolink:NamedThing general tmps_d74ivz general general asthma:MOCHA-ADHD_000000 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 biolink:NamedThing passive smoking exposure tmps_d74ivz passivesmokingexposure passive smoking exposure asthma:MOCHA-Asthma_000162 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000063 biolink:NamedThing environmental tmps_d74ivz environmental environmental asthma:MOCHA-ADHD_000063 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000052 biolink:NamedThing intrinsic asthma with status asthmaticus tmps_d74ivz intrinsicasthmawithstatusasthmaticus intrinsic asthma with status asthmaticus asthma:MOCHA-ADHD_000052 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000283 biolink:NamedThing primary care asthma register tmps_d74ivz primarycareasthmaregister primary care asthma register asthma:MOCHA-Asthma_000283 owl:Class
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000077 biolink:NamedThing other tmps_d74ivz other other asthma:MOCHA-Asthma_000077 owl:Class
For one, the provided_by isn't really informative, since it's just the name of the temp file used during transformation. The slots prefixed with :http://data.bioontology.org/metadata/ aren't explicitly Biolink-compatible.
And the edges:
subject predicate object relation knowledge_source
http://childhealthservicemodels.eu/asthma#MOCHA_0300 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA_0300 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA_0300 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000082 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-EPI_0000037 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000056 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-ADHD_000000 biolink:subclass_of http://childhealthservicemodels.eu/asthma#MOCHA-EPI_0000033 rdfs:subClassOf tmps_d74ivz
http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 :http://data.bioontology.org/metadata/def/mappingSameURI http://childhealthservicemodels.eu/asthma#MOCHA-Asthma_000162 :http://data.bioontology.org/metadata/def/mappingSameURI tmps_d74ivz
This is more a matter of mapping predicates: biolink:subclass_of is fine, but :http://data.bioontology.org/metadata/def/mappingSameURI can be replaced (and in these cases, it isn't even doing anything).
Rename to BioPortal-to-KGX to reflect the project relationship.
Tasks for this Aim are as follows.
During the transformation step, assign Biolink categories to nodes and edges based on what kgx provides already. Also correct errors violating the kgx format and Biolink:
The transform for SNOMEDCT only makes it as far as the JSON stage.
At 821M it's pretty big, so this may be a case of ROBOT hitting a memory limit (though NCBITAXON is larger on disk and completed under the same conditions, so the cause isn't clear).
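If ROBOT is indeed running out of heap, its memory cap can be raised via the ROBOT_JAVA_ARGS environment variable before rerunning the transform. The 16G value below is an assumption; size it to the machine doing the transform.

```shell
# ROBOT passes the contents of ROBOT_JAVA_ARGS to the JVM; raising -Xmx may
# let the 821M SNOMEDCT dump finish the JSON conversion step.
export ROBOT_JAVA_ARGS=-Xmx16G
```

If the transform still stalls with a generous heap, the cause is probably something other than memory.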
In a run of the following:
python run.py --input [path to data] --write_curies --remap_types --get_bioportal_metadata --ncbo_key [key]
some transforms appear to complete just fine until the final step, at which point they are not written to TSV. Here's an example with BCO:
Starting on /home/harry/Bioportal/4store-export-2022-02-02/data/46/1e/11adde7246a87e69de4d22d0b112
ROBOT report(s) present: robot.report
KGX validation log present: kgx_validate_BCO_11.log
BioPortal metadata not found for BCO_11 - will retrieve.
Accessing https://data.bioontology.org/ontologies/BCO/...
<Response [200]>
Accessing https://data.bioontology.org/ontologies/BCO/latest_submission...
<Response [200]>
Retrieved metadata for BCO (Biological Collections Ontology)
File for BCO_11 is empty! Writing placeholder.
In this case, the only contents of the output directory for BCO
are:
BCO_11
BCO_11_relaxed.json
kgx_validate_BCO_11.log
robot.measure
robot.report
Edge and node files are not present.
I've also seen this happen with DISDRIV and COGAT and a few others. Should try a fresh (completely from scratch) transform on these.
Transforming VODANA-UG_1
completes, but validating it raises an error:
ROBOT: relax VODANA-UG_1
Relaxing /tmp/tmp2k1_xre6 to transformed/ontologies/VODANA-UG/VODANA-UG_1_relaxed.json...
Complete.
KGX transform VODANA-UG_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'VODANA-UG_1_relaxed.json'
Validating...
Traceback (most recent call last):
File "run.py", line 36, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 28, in run
transform_status = do_transforms(data_filepaths, kgx_validate)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 163, in do_transforms
validate_transform(outdir)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 194, in validate_transform
json_dump((kgx.cli.validate(inputs=tx_filepaths,
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 218, in validate
transformer.transform(
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 240, in transform
self.process(source_generator, sink)
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
for rec in source:
File "/home/harry/kg-env/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 171, in parse
file_iter = pd.read_csv(
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1231, in _make_engine
return mapping[engine](f, **self.options)
File "/home/harry/kg-env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
There's a simple reason - the edge file is empty.
The validator function should check if a file is empty before passing it to KGX.
Re: Bioportal meeting on May 9, 2022
We would like to know the mappings between ontology IDs and the IRI prefixes used in BioPortal.
What are the IRI prefixes for each ontology?
Index them and assemble the result in a TSV.
Specifically:
Validating OPDT_1 produces an IndexError - seemingly due to some issue parsing a filename, though both OPDT_1_edges.tsv and OPDT_1_nodes.tsv appear to be present.
ROBOT: relax OPDT_1
Relaxing /tmp/tmphip66s_b to transformed/ontologies/OPDT/OPDT_1_relaxed.json...
Complete.
KGX transform OPDT_1
[KGX][cli_utils.py][ transform_source] INFO: Processing source 'OPDT_1_relaxed.json'
Validating...
Traceback (most recent call last):
File "run.py", line 44, in <module>
run()
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "run.py", line 36, in run
transform_status = do_transforms(data_filepaths, kgx_validate)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 179, in do_transforms
validate_transform(outdir)
File "/home/harry/BioPortal-to-KGX/bioportal_to_kgx/functions.py", line 204, in validate_transform
tx_filename = os.path.basename(tx_filepaths[0])
IndexError: list index out of range
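Since the crash comes from indexing tx_filepaths[0] when the list is empty, a guard before the index would avoid it. A sketch (hypothetical wrapper; validate_fn stands in for the actual KGX validation call in functions.py):

```python
import os

def validate_transform_safely(tx_filepaths, validate_fn):
    """Skip validation entirely when no transform files were found,
    instead of crashing on tx_filepaths[0]."""
    if not tx_filepaths:
        print("No transform files found - skipping validation.")
        return None
    tx_filename = os.path.basename(tx_filepaths[0])
    print(f"Validating {tx_filename}...")
    return validate_fn(tx_filepaths)
```

This would also make the OPDT_1 failure easier to diagnose, since the log would say which directory produced no files rather than dying in os.path.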
Enrich graph-format BioPortal ontologies with the following:
I noticed that the most recent merged graph merges entries from ATO and ATOL:
ATO:0000368 biolink:NamedThing startle response|Hydromantes (Gistel 1848) None ATO_2_nodes.tsv|ATOL_2_nodes.tsv
This is unexpected as ATOL should have the CURIE prefix ATOL, but sure enough, it got assigned ATO instead:
$ head ATOL_2_nodes.tsv
id category name description provided_by
ATO:0000261 biolink:NamedThing milk fatty acid iso C18:0 concentration Animal Trait Ontology for Livestock
ATO:0001593 biolink:NamedThing efficiency of phenylalanine utilization Animal Trait Ontology for Livestock
ATO:0001592 biolink:NamedThing efficiency of methionine utilization Animal Trait Ontology for Livestock
I suspect this is overzealous prefix mapping. There may be similar instances among similarly named BP ontos, such as MA and MAT.
The curies package c/o @cthoyt should be able to more elegantly convert IRI prefixes to CURIEs.
See https://github.com/cthoyt/curies
Example:
from curies import Converter

converter = Converter.from_reverse_prefix_map({
    "http://purl.obolibrary.org/obo/CHEBI_": "CHEBI",
    "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=": "CHEBI",
    "http://purl.obolibrary.org/obo/MONDO_": "MONDO",
})

>>> converter.expand("CHEBI:138488")
'http://purl.obolibrary.org/obo/CHEBI_138488'
>>> converter.compress("http://purl.obolibrary.org/obo/CHEBI_138488")
'CHEBI:138488'
>>> converter.compress("https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488")
'CHEBI:138488'
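The key property for the ATO/ATOL collision is longest-prefix-wins matching. A plain-Python sketch of that behavior (the IRI prefixes in the test are hypothetical, chosen only to show how a short prefix could otherwise shadow a longer one):

```python
def compress(iri: str, reverse_prefix_map: dict) -> str:
    """Compress an IRI to a CURIE using the LONGEST matching IRI prefix,
    so a short prefix (e.g. ATO) can't shadow a longer one (e.g. ATOL)."""
    best = None
    for uri_prefix, curie_prefix in reverse_prefix_map.items():
        if iri.startswith(uri_prefix):
            if best is None or len(uri_prefix) > len(best[0]):
                best = (uri_prefix, curie_prefix)
    if best is None:
        return iri  # no known prefix; leave the IRI untouched
    uri_prefix, curie_prefix = best
    return f"{curie_prefix}:{iri[len(uri_prefix):]}"
```

A naive first-match or substring-based approach is exactly what produces ATO:... CURIEs for ATOL terms; the curies Converter handles this correctly out of the box.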
The following ontologies encountered some kind of resolvable error during transformation of the Jul 20 2022 BP dump.
This means they may be missing nodes or edges; it usually indicates some issue with transforming to JSON or parsing the source.
Ontology | Issue |
---|---|
ADMF | Empty, metadata only |
ARO | Empty, but seemingly valid version on BP |
BFLC | Empty, with RDF errors on BP |
BPFORMS | Empty, renders incorrectly on BP |
CEDARPC | Empty, metadata only |
CHEMBIO | Empty, metadata only |
CST | Empty, with RDF errors on BP |
CS | Empty, with RDF errors on BP |
CTP | Empty, metadata only |
DCO | Empty, metadata only |
DMTO | Empty, with RDF errors on BP |
ECTO | Empty, renders incorrectly on BP |
FASTO | Empty, with RDF errors on BP |
FB-BT | ❓ Unknown - seems OK on BP |
HIV | Empty, with RDF errors on BP |
HMADO | Empty, metadata only |
INFRA-EN | Empty, metadata only |
LC-CARRIERS | Empty, with RDF errors on BP |
MARC-RELATORS | Empty, with RDF errors on BP |
MIRNAO | Empty, metadata only |
OCDM | Empty, with RDF errors on BP |
PPLC | Empty, with RDF errors on BP |
PRANAYTC | Empty, metadata only - is this spam? |
QUDT2-1 | Empty, with RDF errors on BP |
QUDT | A whole bunch of parsing issues on BP |
SOIL-PROF | Empty, with RDF errors on BP |
TDT | Empty, metadata only - is this spam? |
TEST_IDKWHATIMDO | Test only |
VODANETHIOPIA_OR | Empty, with RDF errors on BP |
Develop methods for enriching the KGX-format ontologies with BioPortal metadata (i.e., from the http://data.bioontology.org/ API).
In general, this could apply to any ontology - there are simply some properties we'd like to have (e.g., name, date last updated, or authors) but don't retrieve from the 4store. The metadata may be available from another location, so at minimum we'd like some sort of mapping containing these values.
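A sketch of the retrieval side, using the latest_submission endpoint that appears in the transform log above. The helper names are hypothetical, and passing the key as an apikey query parameter is an assumption drawn from the NCBO API's auth options.

```python
import json
from urllib.request import urlopen

API_BASE = "https://data.bioontology.org"

def latest_submission_url(acronym: str, ncbo_key: str) -> str:
    """Build the metadata URL for one ontology's latest submission."""
    return f"{API_BASE}/ontologies/{acronym}/latest_submission?apikey={ncbo_key}"

def fetch_metadata(acronym: str, ncbo_key: str) -> dict:
    """Fetch submission metadata (name, dates, contacts, ...) as a dict.
    Network call; expects a JSON response from the API."""
    with urlopen(latest_submission_url(acronym, ncbo_key)) as resp:
        return json.load(resp)
```

The returned dict could then be flattened into the mapping file described above, keyed by ontology acronym.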
The Bioportal site provides access to historical ontology versions, but these are not present in the 4store.
How can we access them?