ccdhmodel's People

Contributors

balhoff, bfurner, cmungall, gaurav, hsolbrig, joeflack4, jooho-lee-kim, mbrush, monicacecilia, sujaypatil96, turbomam

ccdhmodel's Issues

problem with generated OWL file

The OWL API parser complains when trying to read ccdhmodel.owl.ttl:

jim (main)$ robot convert -i owl/ccdhmodel.owl.ttl -o ccdhmodel.ofn
2021-07-09 22:17:39,644 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error1 for type Class
2021-07-09 22:17:39,652 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error2 for type Class
2021-07-09 22:17:39,652 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error3 for type Class
2021-07-09 22:17:39,652 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error4 for type Class
2021-07-09 22:17:39,654 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error5 for type Class
2021-07-09 22:17:39,654 ERROR org.semanticweb.owlapi.rdf.rdfxml.parser.OWLRDFConsumer - Entity not properly recognized, missing triples in input? http://org.semanticweb.owlapi/error#Error6 for type Class
# many more lines...

I thought this might be a linkml issue, but I tried the biolink-model OWL file and didn't see this problem.

Incorporate DST Recommended Metadata Elements into CRDCH

  • Gender: Subject.gender (enum)
    Implement genotypic sex, phenotypic sex, and gender attributes directly on Subject, to capture the most recent observation/identification of these characteristics. If we end up needing to capture longitudinal sex/gender data, we may need to use Observations for this. We can also consider capturing it on both Subject and Research Subject, where the latter is the sex/gender that applies during a Study.
  • Principal Investigator Name: ResearchProject.principal_investigator
  • Subject age at specimen collection
    Shahim will explore the possibility of making use of the TimePoint entity defined in the CDM
  • File format and File type
    Shahim to review CRDCH.Document and FHIR approaches

Add support for models in Excel sheets as well

We currently support generating the CRDC-H model in LinkML from Google Sheets, since that is our primary development environment. @fragosog has requested that we add support for the model in Excel: cancerDHC/tools#32 (comment)

This might not actually be necessary, since it looks likely that we'll transition from Google Sheets to developing directly in YAML sooner rather than later. If development remains in Google Sheets for several months, however, we should definitely consider adding Excel support as well. This will also be useful if we ever want to separate the sheet2linkml component into its own software product.

We need to enable search within the CRDC-H documentation site

In addition to the baked-in search, which appears to not be full-text search, it would probably be good to have a single page that includes all of the attribute mappings so that nodes can search for their attributes in the CCDH documentation.

Add support for displaying multiple versions of the documentation on Github Pages

While all versions of the CRDC-H Model will be available from the GitHub repository, we should also take steps to ensure that they can all be accessed from the GitHub Pages associated with this account. One way of doing this would be to use Mike -- this is specifically designed for maintaining multiple versions of a MkDocs directory on gh-pages.

Once we have this implemented, we could potentially look into having a dev version of the CRDC-H model that is continuously updated (by a GitHub action that runs hourly, say), so that DMH and others can edit their Google Sheets and see pretty quickly how that updates the model description here. In the future, they could edit the model in YAML in a PR and have a branch-specific version generated alongside the PR for testing.

Figure out where to put the Google-Sheets-to-LinkML translation script

The Google-Sheets-to-LinkML script, cdm_biolinkml_loader.py, is currently in the HOT-Ecosystem/crdc-node-models repository. I think eventually we will want to settle on one of three possible workflows:

  1. We can move the cdm_biolinkml_loader.py script to this repository, where it can be used by a Github Action that automatically regenerates the LinkML model schema (YAML) from the Google Sheets as needed. Issues relating to this transformation will be filed in this repository. Eventually, we will need to modify the crdc-node-models repository so that it uses the LinkML model files rather than the cdm_biolinkml_loader.py code.
  2. We can leave the cdm_biolinkml_loader.py script in the crdc-node-models repository. A Github Action in this repository will pull the script from there and use it to regenerate the LinkML model schema (YAML) from the Google Sheets as needed. Issues relating to this transformation will be filed in the crdc-node-models repository, and will be labeled to distinguish them from issues relating to the other CRDC node model generation code.
  3. We can move cdm_biolinkml_loader.py into its own repository, with a future plan of eventually turning it into a generic Google-Sheets-to-LinkML tool.

In all cases, further Github Actions stored in this repository can be used to generate the model documentation, JSON Schema, Python DataClasses and other entities from the LinkML model schema when that is updated. Note that we plan to reorganize this repository to follow the scheme set out in the LinkML template (#7).

I like the idea of moving that script here (i.e. workflow 1 above), so that all the code relating to the CRDC-H model is kept in one place. I don't think workflow 3 makes sense for now, since this tool is probably not going to be close to a generic solution any time soon. Workflow 2 has the advantage of being the least amount of work for now, but I think it might be confusing to people who are unsure where they should be filing issues, i.e. whether the problem is in the LinkML generation code (which would be filed in the crdc-node-models repository) or in the documentation being generated from the LinkML model (which would be filed here). Therefore, I think we should go with workflow 1.

@balhoff @jiaola What do you think?

Get the ccdhmodel python code into PyPI as a package

Possibly by taking advantage of the new linkml/linkml-model-template

It shouldn't be necessary to manually copy ccdhmodel.py and ccdhmodel.yaml into cancerDHC/example-data

From @gaurav:
The main things we want from the new template are:

  • standardization: the goal is that any other LinkML user should be able to look at our repository and figure out pretty quickly where the artifact they want is located, or how to modify the schema -> artifact build process.
  • pick up @hsolbrig's publish-to-pypi code, so that the generated Python module can be published to PyPi.

Ubuntu versioning syntax over semantic versioning

Ubuntu Linux follows time-based releases rather than feature-driven ones; feature-driven releases are the ones that typically use the semantic versioning syntax.

The Ubuntu versioning convention is the YY.MM format, where YY is the two-digit year and MM is the two-digit month in which the release was created. You can see the release process documented for a reference tool here. The proposal in this issue is to follow time-based releases: for example, if we were to create a release today, it would be 21.08.
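
As a rough illustration of the YY.MM convention described above, here is a minimal Python sketch. The function name calver is hypothetical and not part of this repository.

```python
from datetime import date


def calver(release_date: date) -> str:
    """Return a YY.MM version string for the given release date."""
    # Two-digit year and zero-padded month, e.g. August 2021 -> "21.08".
    return f"{release_date.year % 100:02d}.{release_date.month:02d}"


print(calver(date(2021, 8, 13)))  # prints 21.08
```

One caveat that may be worth checking when syncing with PyPI: Python packaging tools generally parse release segments as integers, so a version such as 21.08 may be normalized to 21.8.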

As for releases we would have to coordinate the sync between Github tags and the version on PyPI but that's a topic for another issue.

Feel free to discuss the pros and cons of both versioning approaches on this issue.

Finalize IRI identifiers for CRDC-H model as well as nodes

We currently use a number of dummy prefixes in the CCDH model:

prefixes:
  linkml: https://w3id.org/linkml/
  ccdh: https://example.org/ccdh/
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  GDC: http://example.org/gdc/
  PDC: http://example.org/pdc/
  ICDC: http://example.org/icdc/
  HTAN: http://example.org/htan/

We should replace these with actual IRIs.

For the CCDH IRIs, we should probably register a ccdh or crdc-h or crdch prefix at w3id.org and use that.

As per the Identifier Recommendations, the CRDC prefix will be at https://w3id.org/crdc/ and this will be used as e.g. subject crdc:su0000001 (for a subject), crdc:st000002 (for a study), and so on. So it might make sense to reserve a two-letter code for the model (dm?) and make properties based on that, but I think we'd prefer e.g. ccdh:BodySite__site rather than crdc:dm0000431.
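
To make the CURIE examples above concrete, here is a minimal sketch of how a prefix map turns a CURIE into a full IRI. The helper expand_curie is hypothetical. The crdc prefix follows the Identifier Recommendations quoted above, and the ccdh prefix is still the placeholder from the current model.

```python
# Prefix map: "crdc" follows the Identifier Recommendations quoted above;
# "ccdh" is the dummy prefix currently used in the model.
PREFIXES = {
    "crdc": "https://w3id.org/crdc/",
    "ccdh": "https://example.org/ccdh/",
}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Expand a CURIE such as 'crdc:su0000001' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"Unknown or missing prefix in CURIE: {curie!r}")
    return prefixes[prefix] + local_id


print(expand_curie("crdc:su0000001", PREFIXES))
print(expand_curie("ccdh:BodySite__site", PREFIXES))
```

Replacing a dummy prefix then only requires changing one entry in the map; the CURIEs used throughout the model stay the same.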

For the node IRIs, this is primarily a convenience tool so we can use LinkML mapping fields, which use CURIE mappings. We could ask LinkML for non-CURIE mapping fields and use those instead, or we could try to find actual IRIs that make sense (e.g. for Sample.sample_type, we can construct the pretty odd IRI https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample&anchor=sample_type to look up its documentation). We could also ask GDC to mint identifiers for their properties.

We actually have another odd possibility for node properties: many of them are present in the caDSR and the NCI Thesaurus, so instead of GDC:sample.sample_type we could say caDSR:3111302v2.0 or NCIT:C70713.

Rendering issues in the CCDH model documentation

Tasks noted by @mbrush from the May 27 phone meeting:

  • All of the referring entities in the 'Referenced by class' section are "None" - e.g. see Specimen page
  • Similarly, the Domain in the 'Domain and Range' section of slot pages is always set to "None" - e.g. see the analyte_type slot page.
  • Question - why is '0..1' cardinality rendered "OPT" . . . can we make this "0..1" to be consistent with how 0..m cardinality is rendered?
  • Source mappings: use 'exact match' for direct mappings in the "Source Mappings" column. Use 'close match' for indirect mappings in the "Indirect Mappings" column.
  • Enumeration tables will have only two columns populated - 'Text' will hold the free-text string of the value, and 'Description' will hold a definition of the value (if one exists in NCIt or in the spreadsheet of CCDH-defined enums)
  • Include content from the 'Comments' column in the spreadsheet.
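
For the cardinality question in the list above, one consistent rendering would derive the label from a slot's required and multivalued flags, which LinkML slots carry. This is only a sketch: render_cardinality is a hypothetical helper, not the code the documentation generator uses, and the "*" upper bound is assumed from the listings shown elsewhere in these issues.

```python
def render_cardinality(required: bool, multivalued: bool) -> str:
    """Render a slot's cardinality as 'lower..upper' instead of a label like OPT."""
    lower = "1" if required else "0"
    upper = "*" if multivalued else "1"
    return f"{lower}..{upper}"


# An optional single-valued slot is rendered as 0..1 rather than OPT.
print(render_cardinality(required=False, multivalued=False))
# An optional multivalued slot is rendered as 0..*.
print(render_cardinality(required=False, multivalued=True))
```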

Clean up tags generated during PR #75

PR #75 generated a number of tags, which I don't think we plan to publish as actual CRDCH model versions. Therefore, this issue covers:

  • Adding a -dev suffix to version numbers used in development so it's clear that these are not intended to be actual releases of the CRDCH model.
  • Deleting all the extraneous tags generated during development of PR #75, i.e. tags v0.2.2 to 0.3.2.

@turbomam I've assigned you since you're working on PR #75 for now, but feel free to assign it back to me if you'd like!

Allow users to access the Python Data Classes directly

The Python Data Classes are currently generated in a single file at python/ccdhmodel.py. This cannot be accessed directly via pip.

I believe that the new LinkML model template currently being tested (#7) will provide this automatically, so we should probably try to implement that first and see if it meets our needs. If not, we will need to include a setup.py file that provides information on how to install the Python Data Classes. This will not only allow users to install the Data Classes via GitHub, but will also allow us to publish the Data Classes to PyPI, making it even easier for users to use them.

Rename Diagnosis.dimensional_measure

Diagnosis.dimensional_measure subsumes a few source fields from GDC and PDC that capture the largest dimension of the primary tumor. The CRDC-H attribute is meant to capture zero or more dimensional observations about the primary tumor, but the attribute is poorly named. The attribute will be renamed Diagnosis.primary_tumor_dimensional_measures.

Smita Hastak raised this issue in an email exchange with Matt Brush and Brian Furner on July 7 2021.

Add a LICENSE file to this repository

Under what license should the model be released?

The code can probably be licensed under the MIT license without any issue.

We should check this with the Oversight Team before we make a final decision.

Finalize short name for the CRDC-H model

It would be great to refer to it as the "CRDC-H" model, but unfortunately this is not a valid name in Python. We could use CRDC_H or CRDCH as the Python name, but it might be easiest to refer to it as the "CCDH" model everywhere. Would that be okay?
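
The naming constraint can be checked directly with Python's built-in str.isidentifier(). The candidate names below are the ones mentioned in this issue.

```python
candidates = ["CRDC-H", "CRDC_H", "CRDCH", "CCDH"]

# Keep only the names that Python accepts as identifiers
# (usable as module, class, or variable names).
valid = [name for name in candidates if name.isidentifier()]

print(valid)  # prints ['CRDC_H', 'CRDCH', 'CCDH']
```

The hyphen in "CRDC-H" is what rules it out, since Python would read it as a subtraction.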

Misc documentation site questions and requests

Creating a list of minor documentation UI issues that are not critical for the June 2 release - @gaurav / others can partition into separate tickets as needed at a later time.

  1. Why is 'range' not capitalized, but other field names describing attributes on Entity pages are capitalized ('Description', 'Examples')?
  2. I noted that in the Referenced by Class section of entity pages, for 0..* cardinalities, the asterisk appears after the target class name. e.g. "Diagnosis ➞metastatic_site 0.. BodySite*" on the page here.

Python version

There are inconsistent minimum Python versions in various files. This is kinda in flux as I prepare the PRs for branches issue_84_reintegrate and issue-69-pypi. As part of that, or immediately after it, should we change everything to 3.x? 3.7? 3.8?

reintegrate regenerate-from-googlesheets with template-based PyPI publishing

Reintegrate #75 and the regenerate-from-googlesheets functionality. Probably take the regenerate-from-googlesheets functionality from main.

Right now, #75 is broken because I just moved generators/ in from rename-to-crdch

Two of the issues from rename-to-crdch will have to be reintegrated some other way

  1. When the model needs to be included in a class or slot name, it should be crdch (or something like that), not ccdh.
  2. Simpler PyYAML methods may be preferable over LinkML methods.

Also, this could be an opportunity to just consolidate all Pipfiles.

Figure out the best way to display "Referenced by class" listings

Every class has a "Referenced by class" section in the Markdown documentation that lists the other classes that refer to this class. For example, Specimen indicates that Diagnosis and Specimen have properties with a range of Specimen.

The display of this class doesn't make me happy. Currently, we display it as e.g.:

  • Diagnosis ➞related_specimen 0..* Specimen

While LinkML displays it as (see e.g. LinkML.ClassDefinition):

  • ClassDefinition class_definition➞apply_to 0..* ClassDefinition
  • SchemaDefinition classes 0..* ClassDefinition

I think a better display would be:

  • Diagnosis.related_specimen → Specimen 0..*
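
The proposed display can be expressed as a small formatting function. This is a sketch only: format_reference and its parameters are hypothetical and are not part of the LinkML Markdown generator.

```python
def format_reference(source_class: str, slot: str, target_class: str, cardinality: str) -> str:
    """Format one 'Referenced by class' entry as 'Source.slot → Target cardinality'."""
    return f"{source_class}.{slot} → {target_class} {cardinality}"


print(format_reference("Diagnosis", "related_specimen", "Specimen", "0..*"))
```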

What do you all think?

Add content to 'home' page at /ccdhmodel/v0.2/home/

The page https://cancerdhc.github.io/ccdhmodel/v0.2/home/ is empty. A request has been made to populate it with introductory information about CRDC-H. I would like to add the information shown below so that it is displayed on the 'home' page.

Is the information in the file https://github.com/cancerDHC/ccdhmodel/blob/main/src/docs/home.md automatically generated? Or will manually editing it and creating a PR be sufficient? And if so, will the change also persist in other versions?


The Harmonized CRDC Data Model (CRDC-H)

The goal of the Center for Cancer Data Harmonization (CCDH) is to support the harmonization of equivalent data elements in disparate models across NCI’s Cancer Research Data Commons (CRDC) Repositories (nodes) to enable cross-node querying and multi-modal analytics. Individual nodes’ data models have been developed largely independently to fit specific data types and/or use cases. The CCDH is tasked with defining a shared data model for use across the CRDC, leveraging existing standards where possible to support interoperability with external data.

The CCDH Harmonized Data Model (CRDC-H) and its terminological infrastructure are being designed to meet the needs of systems like the Cancer Data Aggregator (CDA) that support integrated search and metadata-based analyses across datasets in the CRDC. We view the CRDC-H as a continuously-evolving artifact. To become and remain useful, the CRDC-H must be able to evolve and extend to meet new needs, while at the same time representing a constant semantic anchor for existing content.

The version 1.0 release of the CRDC-H is a point in time along that model evolution, covering administrative, biospecimen, and clinical data entities from multiple data commons; namely, PDC, GDC, ICDC, and HTAN. The CRDC-H is natively expressed in the LinkML modeling language, allowing us to leverage the existing LinkML tool ecosystem, which includes tools for generating a number of useful artifacts, including model documentation, representations of the model in CSV and OWL, representations used for validating data such as JSON Schema and ShEx, and artifacts for interfacing with other technologies such as GraphQL and JSON-LD. The CRDC-H model repository contains tools for converting the spreadsheets where CRDC-H content is developed into formal LinkML, and holds the resulting LinkML model and its downstream artifacts for public use. By locating the CRDC-H LinkML model here, we can also leverage GitHub tools such as issue tracking and pull requests to provide versioning and maintain a history of changes to the model over time.

Make configuration clearer in the generators/google-sheets/README.md file

Is this asking for a Google Sheet sheet identifier?

Is it looking for my google_api_credentials? Are those stored in some un-synced config file? Or an environment variable?

% make regen-google-sheets
cd generators/google-sheets && pipenv install --dev && pipenv run python sheet2linkml.py && cp output/CDM_Dictionary_v1_Active.yaml ../../src/schema/ccdhmodel.yaml && cd -
Installing dependencies from Pipfile.lock (739554)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1/1 — 00:00:06
Traceback (most recent call last):
  File "/Users/MAM/Documents/gitrepos/ccdhmodel/generators/google-sheets/sheet2linkml.py", line 8, in <module>
    cli.main()
  File "/Users/MAM/.local/share/virtualenvs/google-sheets-KnQ9vl1c/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/MAM/.local/share/virtualenvs/google-sheets-KnQ9vl1c/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/MAM/.local/share/virtualenvs/google-sheets-KnQ9vl1c/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/MAM/.local/share/virtualenvs/google-sheets-KnQ9vl1c/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/MAM/Documents/gitrepos/ccdhmodel/generators/google-sheets/sheet2linkml/cli.py", line 47, in main
    google_api_credentials = config.Settings().google_api_credentials
  File "pydantic/env_settings.py", line 36, in pydantic.env_settings.BaseSettings.__init__
  File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Settings
cdm_google_sheet_id
  field required (type=value_error.missing)
make: *** [regen-google-sheets] Error 1
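
The traceback shows a pydantic BaseSettings class rejecting a missing cdm_google_sheet_id field, and such classes can read their fields from environment variables. The following pre-flight check is a sketch under that assumption. The variable names are guesses derived from the field names in the traceback, and the real Settings class may use a prefix or a .env file instead. Because the traceback reports only one validation error, google_api_credentials may have been set already or may have a default.

```python
import os

# Assumed names, derived from the field names seen in the traceback.
REQUIRED_SETTINGS = ("CDM_GOOGLE_SHEET_ID", "GOOGLE_API_CREDENTIALS")


def missing_settings(environ, required=REQUIRED_SETTINGS):
    """Return the required setting names that are unset or empty in environ."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings: " + ", ".join(missing))
```

Running a check like this before make regen-google-sheets would turn the long traceback into a one-line message naming the missing setting.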

Fix links to LinkML types

We currently have CCDH types (such as ccdh_Decimal) in the model, which are child types of LinkML types (in this case, linkml:Decimal). Unfortunately, the link in the Markdown documentation from the CCDH type to the LinkML type is broken: instead of linking to https://cancerdhc.github.io/ccdhmodel/dev/types/Decimal/, it links to https://cancerdhc.github.io/ccdhmodel/dev/types/CcdhDecimal/types/Decimal.md, which does not work.

(The .md looks incorrect there, but if it wasn't for the duplicated /types/.../types in the URL, it would be resolved correctly and turned into a link to an index.html file.)
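
The duplicated path segment is what one would expect if the generated link is relative to the page it appears on. The sketch below reproduces the broken URL with urllib.parse.urljoin. The relative link text "types/Decimal.md" is an assumption about what the generator emits, not something confirmed from its source.

```python
from urllib.parse import urljoin

# The page on which the link appears.
page = "https://cancerdhc.github.io/ccdhmodel/dev/types/CcdhDecimal/"

# Assumed form of the generated link: resolved against the page, it nests
# a second "types/" directory under CcdhDecimal/.
broken = urljoin(page, "types/Decimal.md")

# One relative link that would resolve to the intended sibling page.
working = urljoin(page, "../Decimal/")

print(broken)
print(working)
```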

This is probably a bug in LinkML's Markdown generation, but it needs to be isolated and reported upstream.

Add example data to the CCDH Model repository

It would be nice to have some example data in this repository, both as part of the documentation (we could actually convert it to Markdown and display it with the docs!) and as a continuous integration test. As per cancerDHC/tools#28, we should have example data in several formats and ensure that they can all be validated.

There is code for doing this included as part of the new LinkML Template Repository, so we should get the code behind this for free once we've implemented #7.

make regen-google-sheets in rename-to-crdch: dct[prop] "'function' object is not subscriptable"

(ccdhmodel) MAM@MAM-M74 ccdhmodel % git status 
On branch rename-to-crdch
Your branch is up to date with 'origin/rename-to-crdch'.

nothing to commit, working tree clean

(ccdhmodel) MAM@MAM-M74 ccdhmodel % make regen-google-sheets
...
✔ Successfully created virtual environment! 
Virtualenv location: /Users/MAM/.local/share/virtualenvs/ccdhmodel-tQ2ssRf-
Installing dependencies from Pipfile.lock (179083)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 92/92 — 00:00:48
cd generators/google-sheets && pipenv run python sheet2linkml.py && cp output/CDM_Dictionary_v1_Active.yaml ../../src/schema/crdch.yaml && cd -
Fri Aug 13 08:40:13 2021 [cli.py] INFO: Google Sheet loaded: GSheetModel with an underlying Google Sheet titled "CDM_Dictionary_v1 (Active)" containing 76 worksheets
...
Fri Aug 13 08:42:45 2021 [enum.py] INFO: Generating LinkML for Enum named TimePoint.eventType from worksheet "O_CCDH Enums" containing 3 values
Fri Aug 13 08:42:45 2021 [enum.py] INFO: Generating LinkML for Enum named TobbaccoExposureObservation.observation_type from worksheet "O_CCDH Enums" containing 10 values
fix_type_name(AlcoholExposureObservation, ClassDefinition(name='AlcoholExposureObservation', id_prefixes=[], definition_uri=None, aliases=[], local_names={}, mappings=[], exact_mappings=[], close_mappings=[], rel...
...
fix_type_name(AlcoholExposureObservation._if_missing, <function JsonObj._if_missing at 0x106034d30>, range)
Traceback (most recent call last):
  File "/Users/MAM/Documents/gitrepos/cleanroom/ccdhmodel/generators/google-sheets/sheet2linkml.py", line 8, in <module>
    cli.main()
  File "/Users/MAM/.local/share/virtualenvs/ccdhmodel-tQ2ssRf-/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/MAM/.local/share/virtualenvs/ccdhmodel-tQ2ssRf-/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/MAM/.local/share/virtualenvs/ccdhmodel-tQ2ssRf-/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/MAM/.local/share/virtualenvs/ccdhmodel-tQ2ssRf-/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/MAM/Documents/gitrepos/cleanroom/ccdhmodel/generators/google-sheets/sheet2linkml/cli.py", line 105, in main
    yaml_dumper.dump(model.as_linkml(crdch_root), f)
  File "/Users/MAM/Documents/gitrepos/cleanroom/ccdhmodel/generators/google-sheets/sheet2linkml/source/gsheetmodel/gsheetmodel.py", line 323, in as_linkml
    fix_type_name(f"{entity.name}.{attrName}", attr, "range")
  File "/Users/MAM/Documents/gitrepos/cleanroom/ccdhmodel/generators/google-sheets/sheet2linkml/source/gsheetmodel/gsheetmodel.py", line 312, in fix_type_name
    value = dct[prop]
TypeError: 'function' object is not subscriptable
make: *** [regen-google-sheets] Error 1
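
The last fix_type_name call in the log receives a function (JsonObj._if_missing) where an attribute definition was expected, which is what makes dct[prop] fail. The sketch below reproduces that situation with plain dictionaries and shows one possible guard. All names other than _if_missing are hypothetical, and the real code works with LinkML objects rather than dicts.

```python
def _if_missing(item):
    """Stand-in for the method that appears in the traceback."""
    return None


# Hypothetical mix of values, as might be seen when iterating over all
# members of an object instead of only its attribute definitions.
members = {
    "id": {"range": "ccdh_Decimal"},
    "_if_missing": _if_missing,
}


def subscriptable_members(members):
    """Keep only members that are mappings, dropping functions and methods."""
    return {name: value for name, value in members.items() if isinstance(value, dict)}


def read_ranges(members):
    """Read the 'range' of every member that can actually be subscripted."""
    return {name: value["range"] for name, value in subscriptable_members(members).items()}


print(read_ranges(members))
```

Without the filter, evaluating members["_if_missing"]["range"] raises the same kind of TypeError as in the log.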
