phenopackets / phenopacket-schema Goto Github PK

View Code? Open in Web Editor NEW

73.0 12.0 28.0 1.48 MB

Repository for the GA4GH phenopacket schema

Home Page: https://phenopacket-schema.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Java 96.84% Python 1.78% Shell 1.38%

phenopackets schema protobuf monarchinitiative phenotypes ehr-phenotyping ga4gh

phenopacket-schema's People

Contributors

Stargazers

Watchers

phenopacket-schema's Issues

Variant.position element comment is probably obsolete?

// Sequence position at which the change is located, in the specified coordinate system.
    int32 position = 4; //start, end...

Why start and end? this is not a repeated element?

evidence elelement -- can we discuss & document the intended usage?

For this element,

message Evidence {
    OntologyClass evidence_code = 1;
    ExternalReference reference = 2;
  }

I have written this

For example, in order to describe the evidence for a phenotypic observation that is derived from a publication, one might use
the ECO term `author statement from published clinical study used in manual assertion` <https://www.ebi.ac.uk/ols/ontologies/eco/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FECO_0006017>`_ and record a PubMed id in the reference field
(See :ref:`External Reference element`).

Is this correct? Are there any other examples we can provide here?

diseases

There is a typo: suspect conditions should be suspected conditions.

Add surgical procedure type for biosample

For instance, prostatocystectomy is used to remove bladder cancer, and a biopsy is taken from the resulting specimen. On the other hand, a smaller biopsy could be taken by cystsocopy.
There are NCI thesaurus terms to represent these procedures.

@cmungall @mellybelly @julesjacobsen @mbaudis @drseb

Decide whether generated python code should be distributed on pypi

The readthedocs is an excellent intro to how to use the java code, what is the status of generated python? Is the java canonical? Sounds like some groups (cc @drseb ) are using generated python. Should we generate python and release on pypi with each release, or is that taking on too much?

Remove Analysis

I propose we remove the Analysis message. This is only used in AttributeValue and I think is out of scope.

// An analysis contains an interpretation of one or several experiments. (e.g.
// SNVs, copy number variations, methylation status) together with information
// about the methodology used.
message Analysis {
    // Formats of id | name | description | accessions are described in the
    // documentation on general attributes and formats.
    string id = 1;

    string name = 2;

    string description = 3;

    // The time at which this record was created, in ISO 8601 format.
    string created = 4;

    // The time at which this record was last updated, in ISO 8601 format.
    string updated = 5;

    // The type of analysis.
    string type = 6;

    // The software run to generate this analysis.
    repeated string software = 7;

    // A map of additional analysis information.
    Attributes attributes = 9;
}

Date of birth as Timestamp is too precise

message Individual {
    string id = 1;
    
    // is this not too precise a field?
    google.protobuf.Timestamp date_of_birth = 2;
    Sex sex = 3;
    repeated Phenotype phenotypes = 4;
    repeated Disease diagnoses = 5;
}

Yes! However it has temporality baked-in, which is good. If we were to be less precise I guess year_of_birth and month_of_birth fields could be used, but this becomes a pain to model.

bug in documentation

This is not the way to include the package in a different project

<dependency>
    <groupId>org.phenopackets</groupId>
    <artifactId>phenopacket-schema</artifactId>
    <version>${project.version}</version>
</dependency>

This is because ${project.version} is defined for the project and not for the phenopackets schema.

Suggest revising the documentation to say

add the following line to the properties section of the pom file.

<phenopackets.version>0.1.0</phenopackets.version>

and then add the following to the dependencies section.

<dependency>
   <groupId>org.phenopackets</groupId>
   <artifactId>phenopacket-schema</artifactId>
   <version>${phenopackets.version}</version>
</dependency>

Python

Add documentation about how to use a Python library to read a phenopacket.

abstracting external references (from `Publication`)

phenopacket-schema/src/main/proto/org/phenopackets/schema/v1/core/base.proto

Line 91 in 64857fc

message Publication {

I would recommend abstracting this into a general object for external references. For ga4gh-metadata, we prototyped ExternalIdentifier:

// Identifier from a public database, describing a representation of
// the record, and the relation between record and external reference
// TODO: relation as OntologyTerm object (e.g. as IAO property).
// Example:
//  {
//    "relation" : "reported_in",
//    "identifier" : "pubmed:25752754"
//  }
//
message ExternalIdentifier {
  // The CURIE of the identifier.
  string identifier = 1;

  // The relation between identifier and the object ist was reported from.
  string relation = 2;
}

This could serve as a blueprint here, instead of the specific "Publication".

Add Treatment History

Could be important for future computational cancer cases

Cohort?

We had one before, we cold have one again.

Could be defined as an aggregated individual:

message Cohort {
  string id = 1;
  string description = 2;
  int size = 3;
  repeated Phenotype phenotypes = 4;
  repeated Cohort sub_groups = 5;
}

or maybe an actual set of individuals:

message Cohort {
  string id = 1;
  string description = 2;
  int size = 3;
  repeated Individual members = 4
  repeated Phenotype phenotypes = 5;
  repeated Cohort sub_groups = 6;
}

or something else.

Upgrade to protobuf 3.7.0

Consider update from version 3.5.1 to 3.6.1 of protobuf.

I believe this might fix the following error warning (see the change log https://github.com/protocolbuffers/protobuf/blob/master/CHANGES.txt).

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/home/robinp/.m2/repository/com/google/protobuf/protobuf-java/3.6.1/protobuf-java-3.6.1.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

ExternalReference unclear?

The External reference is not 100% clear. Here is the current definition

 message ExternalReference {
    // e.g. ISBN, PMID:123456, DOI:...
    string id = 1;
    string description = 2;
    }

Do we intend the id to be something like "PMID:12345" and the description to be something like "Patient B" from the same article? Can we provide a few examples of the intended use cases?

Remove Experiment?

I propose we remove the Experiment message. It is only used as part of AttributeValue. It seems a bit out of scope, however it might also have some utility for the Biosample message. In the latter case it might be better to re-model some of the fields as OntologyClass instead of string.

Thoughts @mcourtot @pnrobinson?

// An experimental preparation of a sample.
message Experiment {
    // The experiment ID
    string id = 1;

    // The name of the experiment.
    string name = 2;

    // A description of the experiment.
    string description = 3;

    // The time at which the record of this experiment was last updated.
    // Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
    string updated = 5;

    // The time at which this experiment was performed.
    // Granularity here is variable (e.g. date only).
    // Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42Z)
    string run_time = 6;

    // The molecule examined in this experiment. (e.g. genomics DNA, total RNA)
    string molecule = 7;

    // The experiment technique or strategy applied to the sample.
    // (e.g. whole genome sequencing, RNA-seq, RIP-seq)
    string strategy = 8;

    // The method used to enrich the target. (e.g. immunoprecipitation, size
    // fractionation, MNase digestion)
    string selection = 9;

    // The name of the library used as part of this experiment.
    string library = 10;

    // The configuration of sequenced reads. (e.g. Single or Paired).
    string library_layout = 11;

    // The instrument model used as part of this experiment.
    // This maps to sequencing technology in BAM.
    string instrument_model = 12;

    // The data file generated by the instrument.
    // Should this be `instrumentData` instead?
    File instrument_data_file = 13;

    // Name of the laboratory where this experiment was performed.
    string sequencing_center = 14;

    // The address coded as geolocation object of the laboratory.
    GeoLocation location = 18;

    // The platform unit used as part of this experiment. This is a
    // flowcell-barcode
    // or slide unique identifier.
    string platform_unit = 15;

    // A map of additional experiment information.
    Attributes attributes = 17;
}

biosamples

The current comment needs some attention, methinks.

// Biosample(s) derived from the patient or a collection of biosamples in isolation

With the recent revisions, we have decided that a PhenoPacket describes one individual. It seems to me that if we allow a Phenopacket to describe a collection of isolated biosamples, that it is scope creep, and many of the other fields do not make sense (e.g., patient describes one individual and not a cohort).

We would simplify things enormously if we make another element called biosample-collection that has the required metadata. Both the phenopackets and this object would then be interoperable, and we would greatly simplify the use of the PhenoPacket.

Suggested revision of the comment

// Biosample(s) derived from the patient

Medium-term: also consider creation of biosample-collection message.

age_at_collection

I am wondering if we can comment here or elsewhere that a PhenoPacket is meant to represent one moment in time (one particular examination, which may include past medical history), and that if we want to describe multiple visits at different time points, then different phenopackets should be used.
@julesjacobsen @mellybelly @cmungall @drseb

Consider changes to Phenopacket

Remove the individuals
put the Individual (subject) phenotypes at the phenopacket level
The entire description of families for Mendelian diagnostics case is done via the pedigree
This would go hand in hand with the proposed changes in Cohort #5 in which multiple individuals would be represented by repeated phenopacket

Age ontology class confusion

In the Age element we suggest that HPO terms can be used to represent age.

 //  age_class : { term_id : "HP:0003596", term : "Middle age onset" }
    OntologyClass age_class = 2;

But of course they do not represent age, they represent age of onset. Is there some other ontology we can recommend here? If not, we should remove this option for now because it will lead to errors in data entry.

Environment and EnvironmentalExposure

This was part of the original phenopacket idea, but has not made it into the new schema yet.

Difference between disease and phenotype messages?

https://github.com/phenopackets/phenopacket-schema/blob/master/src/main/proto/org/phenopackets/schema/v1/core/base.proto#L201

seem to indicate they are similar.

Disease requires onset?

Disease onset is used a lot by GEL clinicians as a means of filtering out potential diagnoses based on the proband's age.

message Disease {
    // The identifier of this disease e.g. MONDO:0007043, OMIM:101600, Orphanet:710, DOID:14705 (note these are all equivalent)
    string id = 1;
    // Textual name of the disease e.g. Pfeiffer syndrome
    string label = 2;
    // Free-text description
    string description = 3;
    // The mode of inheritance for this disease, in preference a sub-class of http://purl.obolibrary.org/obo/HP_0000005
    OntologyClass modeOfInheritance = 4;
    // The disease phenotype
    repeated Phenotype phenotypes = 5;
    //TODO: onset here?
    OntologyClass onset = 6;
}

created / updated

Shouldn't it be possible to store multiple updates? If I look here https://phenopackets-schema.readthedocs.io/en/latest/individual.html#updated it seem that only one update is possible.

External references vs external identifiers

Should the naming be consistent between individual and biosample?

Allow SPDI, HGVS and VCF for encoding of Variant

// Deleted bases on the forward strand - equivalent to the VCF REF
    string deletion = 5;
    // Inserted bases on the forward strand - equivalent to the VCF ALT
    string insertion = 6;

Can we go over exactly what we want people to do with these fields and document them?
I am not a SPDI expert, but I do not think that deletions require a string. According to the documentation, When a nucleotide deletion occurs, nothing is inserted. In SPDI syntax, InsertedSequence is simply an empty string. Thus, the deletion of the same A nucleotide as above is represented as:

Seq1:4:1:

Thus, is this a mistake, or do we want people to add EITHER SPDI OR VCF (which would be a recipe for disaster!!!)

Age

Can we document the format conventions for Age, as in:

Age.newBuilder().setAge("P48Y3M").build()

Does this just mean the proband (P) is 48 years and 3 months old? Where does the P come from? It seems the code is treating it as a string

Geolocation questions

Can people use curies from GeoNames or ENVO?

typo in taxonomic identifier comment

Currently

 // NCBI taxonomic identifier NCBITaxon e.g. NCBITaxon:9606 or NCBITaxon:1337
    // For resources where there may be more than one organism being studied it is advisable to indicate the taxonomic
    // identifier of that organism, to it's most specific level
    OntologyClass taxonomy = 7;

it's is incorrect, please remove the apostrophe.

Variant.sequence

 Variant {
    // genome assembly isn't strictly necessary if using the sequence accession as that will be unique for the assembly/sequence
    GenomeAssembly genome_assembly = 1;
    // Accession of the sequence e.g. NC_000010.10 (GRCh37 chromosome 10)
    string sequence =

I think it would be better to rename "sequence" to sequence_accession or accession_number, because I thought sequence would be a sequence of nucleotides -- the label is confusing.

HtsFile not showing genome assembly

If I create a phenopacket that includes this,

 HtsFile htsFile = HtsFile.newBuilder()
                .setHtsFormat(HtsFile.HtsFormat.VCF)
                .setGenomeAssembly(org.phenopackets.schema.v1.core.GenomeAssembly.GRCH_37)
                // in this case the proband identifier is different to the sample
                // identifier used in the VCF file
                .putIndividualToSampleIdentifiers(proband.getId(), "manuel")
                .setFile(File.newBuilder().setPath("/home/robinp/data/exomiser-cli-9.0.1/examples/Pfeiffer.vcf").build())
                .build();

then the output is missing the part about "GRCH_37"

    "htsFormat": "VCF",
    "individualToSampleIdentifiers": {
      "manuel": "manuel"
    },
    "file": {
      "path": "/home/robinp/data/exomiser-cli-9.0.1/examples/Pfeiffer.vcf"
    }
  }],

Is my code wrong or what is the way of indicating the genome build?

Use open location code instead of current location

Open Location Code aka PlusCodes if you want the glossy overview, looks to be a nicely engineered solution to the issue of location.

From their README:

Open Location Code is a technology that gives a way of encoding location into a form that is easier to use than latitude and longitude. The codes generated are called plus codes, as their distinguishing attribute is that they include a "+" character.

The technology is designed to produce codes that can be used as a replacement for street addresses, especially in places where buildings aren't numbered or streets aren't named.

Plus codes represent an area, not a point. As digits are added to a code, the area shrinks, so a long code is more precise than a short code.

Codes that are similar are located closer together than codes that are different.

A location can be converted into a code, and a code can be converted back to a location completely offline.

There are no data tables to lookup or online services required. The algorithm is publicly available and can be used without restriction.

Currently location is defined like this:

// A GeoLocation object provides information about a geographic position
// related to a record. Examples could be:
//  - an address, e.g. of a lab performing an analysis
//  - provenance of an individual, obfuscated to a larger order administrative
//    entity (Suffolk, U.K.)
//  - a lat/long/alt position where an environmental sample was collected
//   The geographic point object uses the default units from the DCMI point scheme
//  http://dublincore.org/documents/dcmi-point/
//  and avoids optional representation in non-standard units.
message GeoLocation {
    // a text representation, preferably using standard geographic identification
    // elements, of the corresponding latitude,longitude(,altitude)
    // This representation serves the purposes to
    //  - capture standard data entry parameters
    //  - provide a sanity check for latitude,longitude values
    // Example:
    //  - 34 Washington Blvd, Marina del Rey, CA  90292, United States
    //  - Str Marasesti 5, 300077 Timisoara, Romania
    //  - Heidelberg, Deutschland
    string label = 1;

    // an optional indication of the maximum precision to be derived from the
    // latitude,longitude values
    // Example:
    // Given a street address "Winterthurerstrasse 190, 8057 Zürich, Switzerland",
    // a privacy driven (destructive) obfuscation approach could recode this
    // to
    //  "latitude": 47.37, "longitude": 8.54
    //  while providing
    //  "precision":"city", "label": "Zürich, Switzerland"
    // ... indicating that the original location could correspond to any
    // latitude,longitude point value inside the administrative boundaries of
    // the city of Zürich, Switzerland
    string precision = 2;

    // signed decimal degrees (North, relative to Equator)
    double latitude = 3;

    // signed decimal degrees (East, relative to IERS Reference Meridian)
    double longitude = 4;

    // optional, e.g. for environmental samples
    double altitude = 5;
}

It could, using the plus code, be represented like this for the Zurich example:

{
  "geoLocation": {
    "plusCode": "8FVC9G00+"
  }
}

Which resolves to this area: https://plus.codes/8FVC9G00+

The only snag with this is that if the thing you're trying to fit into a box doesn't fully fit, then what? In this example Zurick straddles approximately four boxes at this scale:

https://plus.codes/8FVCCG00+
https://plus.codes/8FVCCH00+
https://plus.codes/8FVC9G00+
https://plus.codes/8FVC9H00+

but the next box up is too big. Even so, these are computable compared to the existing 'precision' string, which is completely undefined. If the goal is purely approximation and obfuscation, then the pluscode will work better then the existing setup.

<dependency>
  <groupId>com.google.openlocationcode</groupId>
  <artifactId>openlocationcode</artifactId>
  <version>1.0</version>
  <type>pom</type>
</dependency>

Add variant list to biosample

These would represent somatic variants in a cancer sample, for instance @cmungall @mellybelly @julesjacobsen @mbaudis

Examples

Time definitions

IMO for some attributes it is dangerous to have them defined with strong dependency on the specific serialisation.

The typical case here are the time definitions, which refer to the internally handled timestamp:

phenopacket-schema/src/main/proto/org/phenopackets/schema/v1/core/base.proto

Line 12 in 2626158

google.protobuf.Timestamp created = 1;

This does not give any indication to the user of how to I/O time.

Time formats are something where a single, clean definition exists, and which has been well documented in the metadata schema. I am strongly for defining time (poits, spans e.g. ages ...) as string/ISO8601, and leave the deparsing to the I/O handlers.

This is just one example where we easy implementation NE clear definition.

update libraries in POM file

For instance, com.google.guava:guava:19.0 could be updated to 27.1-jre.
We should use the maven versions plugin to bring all of the dependencies up to date.

Remove AttributeValue

This element is complicated and we should provide a number of examples of the intended use cases (currently unclear). Also, item 9 is commented out and should either be removed or repaired prior to the official release.

sex of individual -- comment is perhaps too narrow

Currently the comment for the sex element in individual states the following

A PATO term for the biological sex of the individual
    // http://purl.obolibrary.org/obo/PATO_0000047
OntologyClass sex = 5;

I think that biological sex and phenotypic sex are nearly synonymous, but it is confusing that we state biological sex in the comment and the PATO term is phenotypic sex.

Also, there are situations where one might want to indicate the genetic sex of an individual and this would be possible by using a different ontology term. I have written the following description in the documentation, do we agree on this?

Phenopackets make use of ontology terms to denote the sex of an individual. We recommend using
`PATO terms <https://www.ebi.ac.uk/ols/ontologies/pato/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000047>`_ for this purpose. There are various ways to define sex including phenotypically and genetically, and Phenopackets allows
any ontology term that represents sex in any way to be used. The PATO terms `female <https://www.ebi.ac.uk/ols/ontologies/pato/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000383>`_ and `male <https://www.ebi.ac.uk/ols/ontologies/pato/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000384>`_ represent phenotypic sex. There may be situations when it is more appropriate to use a term such as `male genotypic sex <https://www.ebi.ac.uk/ols/ontologies/pato/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0020001>`_.

FHIR

The rare disease FHIR example seems to use a lot of FHIR code to construct the FHIR message. Can we work on creating the FHIR export function directly from the Phenopacket of the main rare diseas example?

Build currently fails

I just cloned the repo and started a ./mvnw clean install, which leads to two errors:

[INFO] Changes detected - recompiling the module!
[INFO] Compiling 24 source files to /Users/koehlers/git/phenopacket-schema/target/test-classes
[ERROR] error reading /Users/koehlers/.m2/repository/org/apache/commons/commons-lang3/3.6/commons-lang3-3.6.jar; invalid LOC header (bad signature)

The other one is in Tests:

[INFO] Running org.phenopackets.schema.v1.FhirTest
[ERROR] Tests run: 2, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.055 s <<< FAILURE! - in org.phenopackets.schema.v1.FhirTest
[ERROR] org.phenopackets.schema.v1.FhirTest.testCancerExample()  Time elapsed: 0.051 s  <<< FAILURE!
java.lang.NoClassDefFoundError: org/apache/commons/lang3/StringUtils
	at org.hl7.fhir.r4.model.IdType.setValue(IdType.java:420)
	at org.hl7.fhir.r4.model.Resource.setId(Resource.java:133)
	at org.phenopackets.schema.v1.examples.CancerFhirExample.cancerBundle(CancerFhirExample.java:25)
	at org.phenopackets.schema.v1.examples.TestExamples.cancerBundle(TestExamples.java:42)
	at org.phenopackets.schema.v1.FhirTest.testCancerExample(FhirTest.java:19)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang3.StringUtils
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 18 more

[ERROR] org.phenopackets.schema.v1.FhirTest.testRareDiseaseExample()  Time elapsed: 0.001 s  <<< FAILURE!
java.lang.NoClassDefFoundError: org/apache/commons/lang3/StringUtils
	at org.hl7.fhir.r4.model.IdType.setValue(IdType.java:420)
	at org.hl7.fhir.r4.model.Resource.setId(Resource.java:133)
	at org.phenopackets.schema.v1.examples.RareDiseaseFhirExample.rareDiseaseBundle(RareDiseaseFhirExample.java:24)
	at org.phenopackets.schema.v1.examples.TestExamples.rareDiseaseBundle(TestExamples.java:28)
	at org.phenopackets.schema.v1.FhirTest.testRareDiseaseExample(FhirTest.java:25)

MmeNodeTest broken

I had to comment out the test case to get it to work

Ask PDX group about most appropriate terms for tumor progression

Currently we have

   // Is the specimen tissue from the primary tumor, a metastasis or a recurrence?
    // Most likely a child term of NCIT:C7062 (Neoplasm by Special Category)
    // NCIT:C3677 (Benign Neoplasm)
    // NCIT:C84509 (Primary Malignant Neoplasm)
    // NCIT:C95606 (Second Primary Malignant Neoplasm)
    // NCIT:C3261 (Metastatic Neoplasm)
    // NCIT:C4813 (Recurrent Malignant Neoplasm)
    OntologyClass tumor_progression = 18;

'Placeholder' doc for genes is confusing

I think the word Placeholder is confusing in the comment for the genes element:

// Placeholder for gene concepts - could be used for either describing candidate genes or causative genes. The
// resources using these fields should define what this represents in their context.

suggested revision

This field defines a list of one or more genes that can be used to denote a collection of 
candidate genes or causative genes. The resources using these fields should define what this
represents in their context.

Discuss separation of positive from negative phentoypes

@cmungall we had discussions on that in the context of GO and HPO. We decided that it makes sense to clearly separate positive from negative annotations, because it is otherwise likely that users will just iterate the list of phenotypes and forget to check if the phenotype is actually negated. This is why we have separate annotation files.
It seems to me that this danger exists in the current phenopacket schema as well. https://phenopackets-schema.readthedocs.io/en/latest/phenotype.html#phenotype-element
I.e. users will iterate through the list of phenotypes and would have to check for each phenotype if negated is true. I would favour a solution where users have to iterate the positive and the negative phenotypes separately. Opinions? Is this doable?

Rename Phenotype to PhenotypicFeature

What are we trying to convey? Originally Phenotype was defined as a collection of positive and negative OntologyTerms.

// Rename to PhenotypicFeature?
message Phenotype {
    string description = 1;
    
    // perhaps we can simplify this and have a single class and a single boolean flag?
    repeated OntologyClass types = 2;
    repeated OntologyClass negated_types = 3;
    
    // the values of this will come from the HPO onset hierarchy
    // i.e. subclasses of HP:0003674
    OntologyClass onset = 4;
    
    // subclasses of HP:0012823 ! Clinical modifier 
    repeated OntologyClass modifiers = 5;
    OntologyClass assay = 6;
    Evidence evidence = 7;
    //Measurement
    //Environment
    //Frequency? Do we want to be able to encode and HPO annotation?
}

This was used like this in an Individual

message Individual {
    string id = 1;
    google.protobuf.Timestamp date_of_birth = 2;
    Sex sex = 3;
    repeated Phenotype phenotypes = 4;
    repeated Disease diagnoses = 5;
}

@cmungall what alternative are you proposing?

Should Gene become GeneticFeature?

//perhaps this would be better as a GeneticFeature which would wrap 1-N variants. This would make it easier to group comp-het alleles.
message Gene {
    // a CURIE with a prefix that is NCBIGene, HGNC, ENSEMBL, UCSC e.g. HGNC:347, ENSG00000120705, uc003ldc.6
    // https://www.genenames.org/cgi-bin/gene_symbol_report?q=data/hgnc_data.php&hgnc_id=HGNC:3477
    string id = 1;
    // The official gene symbol as designated by the organism gene nomenclature committee e.g. ETF1 or Etf1
    string symbol = 2;
    // Taxonomy id from the NCBI e.g. NCBI:txid9606
    // https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606
    int32 ncbi_taxon_id = 3;
}

//SPDI format https://www.ncbi.nlm.nih.gov/variation/notation/
message Variant {
    //do we want to consider explicitly defining strand and zero/one-based coordinates?
    string sequence = 1;
    int32 position = 2;
    string deletion = 3;
    string insertion = 4;
    //genotype of this variant using the GENO ontology
    OntologyClass genotype = 5;
    //do we want to capture variant consequence with SO terms? This would also require the transcript.
}

Becomes...

//perhaps this would be better as a GeneticFeature which would wrap 1-N variants. This would make it easier to group comp-het alleles.
message GeneticFeature {
    // a CURIE with a prefix that is NCBIGene, HGNC, ENSEMBL, UCSC e.g. HGNC:347, ENSG00000120705, uc003ldc.6
    // https://www.genenames.org/cgi-bin/gene_symbol_report?q=data/hgnc_data.php&hgnc_id=HGNC:3477
    string gene_id = 1;
    // The official gene symbol as designated by the organism gene nomenclature committee e.g. ETF1 or Etf1
    string gene_symbol = 2;
    // Taxonomy id from the NCBI e.g. NCBI:txid9606
    // https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606
    int32 ncbi_taxon_id = 3;
    repeated Variant variants = 4;
    //genotype of this variant using the GENO ontology
    OntologyClass genotype = 5;
}

in this case I've moved the genotype into the GeneticFeature. Perhaps better would be to use both the Gene and the Variant messages and simply create a new Genotype message which could be attached to an individual?

message Genotype {
    Gene gene = 1;
    repeated Variant alleles = 2;
    //genotype of this variant using the GENO ontology
    OntologyClass genotype = 3;
}

Add links/documentation on README for browsing schema

This is the main pinned repo for phenopackets, the README is quite developer centric. Should have links to proto schema, examples and ideally some kind of visual display

Remove AttributeValues and Attributes

We seem to be defining both a list and a map

message AttributeValues {
    repeated AttributeValue values = 1;
}

message Attributes {
map<string, AttributeValues>
}

This seems confusing and will make client code more complicated. It seems it might be better to demand a map whose key represents what the attribute is about? Otherwise can we specify when the list should be used and when the map should be used?

Individuals genetically related to the patient - e.g. mother, father, siblings or members of a cohort.
It is expected that a pedigree be included if this and the patient fields are included.

In general, members of a cohort are not genetically related to one another, and so I would propose the following revision

Individuals related in some way to the patient. For instance, the individuals may 
be genetically related or may be members of a cohort. If this field is used, then 
it is expected that a pedigree will be included for genetically related individuals 
for use cases such as genomic diagnostics. If a phenopacket is being used to 
describe one member of a cohort, then in general one phenopacket will be 
created for each of the individuals in the cohort. In general, applications that 
process such files are expected to perform Q/C to ensure data consistency, 
but this is not and cannot be enforced on the level of the protobuf message itself.

phenopackets / phenopacket-schema Goto Github PK

phenopacket-schema's People

Contributors

Stargazers

Watchers

Forkers

phenopacket-schema's Issues

Recommend Projects

Recommend Topics

Recommend Org