tripal_eutils's Introduction

Tripal Eutils

This module connects to the NCBI EUtils API to load in accessions for the Assembly, BioProject, and BioSample databases. Primary, as well as linked, records are loaded into Chado.

Please see the Documentation website for more information, as well as installation and usage instructions.

Requirements

This module requires Chado version 1.3 or greater.

Copyright notice

The Tripal EUtilities "logo" is derived from the collectible card game Hearthstone, copyright © Blizzard Entertainment, Inc. Hearthstone® is a registered trademark of Blizzard Entertainment, Inc. Tripal is not affiliated or associated with or endorsed by Hearthstone® or Blizzard Entertainment, Inc. Art is a public domain image courtesy of Karen Hatzigeorgiou.

tripal_eutils's People

Contributors

almasaeed2010, bradfordcondon, dependabot[bot], dsenalik, ferrisx4, mpoelchau

tripal_eutils's Issues

tagmapper - cache cvterm lookups

#27 (review)

@almasaeed2010 noted that we look up the terms for each tag and proposed caching the lookups, which sounds like a great idea to me.

As an alternative, the mapper could instead just send the accession (local:sex for example) so the lookup doesn't happen unless we encounter that term... but if we encounter most terms in most lookups (which we do) it would be smarter to pre-cache like you suggest.
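The caching idea could be sketched roughly as below. This is only an illustration, not the module's code: `CvtermCache` and the injected `$lookup` callable are hypothetical names, and the callable stands in for whatever Chado query the tag mapper actually runs.

```php
<?php

/**
 * Hypothetical in-memory cache for cvterm lookups, keyed by accession.
 */
class CvtermCache {

  /** @var callable Performs the real lookup; receives the accession. */
  private $lookup;

  /** @var array Results already fetched, keyed by accession. */
  private $cache = [];

  public function __construct(callable $lookup) {
    $this->lookup = $lookup;
  }

  /**
   * Returns the term, running the real lookup only on the first request.
   */
  public function get($accession) {
    if (!array_key_exists($accession, $this->cache)) {
      $this->cache[$accession] = call_user_func($this->lookup, $accession);
    }
    return $this->cache[$accession];
  }

  /**
   * Pre-caches accessions we expect to meet in most records.
   */
  public function warm(array $accessions) {
    foreach ($accessions as $accession) {
      $this->get($accession);
    }
  }

}
```

Calling `warm()` once per import gives the pre-cache behaviour; calling only `get()` gives the lazy alternative, where a term is looked up the first time it is encountered.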

need formatters and repositories to inherit some of the same stuff

I just created a PR that adds the below function to formatters:

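  /**
   * Looks up an NCBI database record in chado.db.
   *
   * Tries "NCBI <db_name>" first, then falls back to the bare name.
   *
   * @param string $db_name
   *   The database name.
   *
   * @return object|false
   *   The chado.db record, or FALSE if neither name matches.
   */
  public function getNCBIDB(string $db_name) {
    $name = "NCBI {$db_name}";
    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($name)]
    )->fetchObject();
    if ($db) {
      return $db;
    }
    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($db_name)]
    )->fetchObject();
    if ($db) {
      return $db;
    }
    return FALSE;
  }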

It's redundant with a function for repositories. We need to make a new class for shared stuff.

Idea: should they both actually extend the same class?
Or perhaps each parent class extends a base class with the shared stuff?
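One way to share the lookup without forcing formatters and repositories under a single parent is a trait. The sketch below is an assumption, not the module's design: `NcbiDbLookupTrait`, `findDbByName()` and `ExampleFormatter` are hypothetical names, and the array-backed lookup replaces the real chado.db query so the example runs outside Drupal.

```php
<?php

/**
 * Hypothetical trait holding lookups shared by formatters and repositories.
 */
trait NcbiDbLookupTrait {

  /**
   * Finds a db record by case-insensitive name, or returns FALSE.
   * Each using class supplies this (for example with a chado.db query).
   */
  abstract protected function findDbByName($name);

  /**
   * Tries "NCBI <name>" first, then the bare name.
   */
  public function getNCBIDB($db_name) {
    foreach (["NCBI {$db_name}", $db_name] as $candidate) {
      $db = $this->findDbByName($candidate);
      if ($db) {
        return $db;
      }
    }
    return FALSE;
  }

}

/**
 * Stand-in class backed by an array so the sketch runs without Drupal.
 */
class ExampleFormatter {
  use NcbiDbLookupTrait;

  private $rows;

  public function __construct(array $rows) {
    $this->rows = $rows;
  }

  protected function findDbByName($name) {
    foreach ($this->rows as $row) {
      if (strtoupper($row->name) === strtoupper($name)) {
        return $row;
      }
    }
    return FALSE;
  }

}
```

A shared abstract base class would work just as well; the trait only avoids deciding up front whether formatters and repositories belong in one hierarchy.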

linking organism to analysis

My analysis repository is going to link organism to analysis, BUT what happens if the site doesn't have the organism_analysis table!?

Should we require manage_analyses? Or install the table ourselves? In that case we won't have the field for displaying the link...

  /**
   * Insert into organism_analysis, or return existing link.
   *
   * @param $organism
   * Full chado.organism record.
   *
   * @return mixed
   * @throws \Exception
   */
  public function linkOrganism($organism) {

    if (!chado_table_exists('organism_analysis')) {
      throw new Exception('The organism_analysis linker table does not exist. No way to link this organism.');
    }

    // fields() is chainable; addField() returns the field alias, not the query.
    $result = db_select('chado.organism_analysis', 't')
      ->fields('t', ['organism_analysis_id'])
      ->condition('t.organism_id', $organism->organism_id)
      ->condition('t.analysis_id', $this->base_record_id)
      ->execute()
      ->fetchField();

    if (!$result) {

      $result = db_insert('chado.organism_analysis')
        ->fields([
          'organism_id' => $organism->organism_id,
          'analysis_id' => $this->base_record_id,
        ])
        ->execute();
    }

    if (!$result) {
      throw new Exception('Could not link organism to analysis.');
    }

    return $result;
  }

where should the reporting of what would be created happen?

Right now we do some basic key-value parsing to build some tables, but we need to:

  • ensure that the records are only created upon submission, not when previewing
  • properly handle whatever parsed XML comes out of the XML parser.

My inclination is:

Add a public method to set the EUtils class to "preview" mode.

Then, instead of spawning a repository class to insert records, it instantiates a new class whose responsibility is to convert the output of the XML parser into a Drupal table.

The main highlights should be

  • what will the base record look like?
  • what records will get linked? contacts, pubs, etc
  • what properties will be added?
  • what secondary records will get created?
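The preview switch described above could look something like the following. Every class name here is hypothetical and the handlers are stubs: the real repository would write to Chado, and the real formatter would build a proper Drupal table rather than a plain array.

```php
<?php

/** Hypothetical handler that would write records (stubbed here). */
class ExampleRepository {

  public function handle(array $data) {
    return ['created' => TRUE, 'name' => $data['name']];
  }

}

/** Hypothetical handler that only builds table rows for display. */
class ExamplePreviewFormatter {

  public function handle(array $data) {
    $rows = [];
    foreach ($data as $key => $value) {
      $rows[] = [$key, $value];
    }
    return ['header' => ['Field', 'Value'], 'rows' => $rows];
  }

}

class ExampleEUtils {

  private $preview = FALSE;

  /** Switches between previewing and actually creating records. */
  public function setPreview($preview = TRUE) {
    $this->preview = (bool) $preview;
    return $this;
  }

  /** Hands parsed data to the formatter in preview mode, else the repository. */
  public function process(array $parsed) {
    $handler = $this->preview ? new ExamplePreviewFormatter() : new ExampleRepository();
    return $handler->handle($parsed);
  }

}
```

Because both handlers take the same parsed array, nothing is created during a preview, and the form submit handler can run the same call with preview mode off.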

reference XML sets

GOAL:

@bradfordcondon will supply...

provide a set of seed XML responses to base our parsing logic on. We want at a minimum 2 XML files per type (biosample, assembly... bioproject (subject to change)) per use case domain (trees, crops, bacteria, bugs, animals)...

8 XML files per NCBI type; as proposed, this is 24 XML files....

@almasaeed2010 & @bradfordcondon will both preview. Compare with the tagset listed at https://docs.google.com/spreadsheets/d/10G9IgmD8Yn5ZG0LY8AMeLMyoJAvi7ipgC8ZQHrUMi2U/edit#gid=1768962945

NOTE: ATTRIBUTES we can plan on dealing with as we did for biomaterials in the tripal analysis expression module, except we might want to further standardize CVterm usage, i.e. we don't necessarily want an interface of term mapping for each upload..... (double check each content type is using the attributes tag for these things, which are going to cleanly map to chado properties).

From this, we're going to figure out...

a) what goes into chado
b) most open-ended/flexible way to parse XML to do this...

input field

Why provide a field? Because without it, we can't hook into HQ very easily. Maybe we'll think of another way and not provide a field, I dunno.

the field should -

base field

simply the value of the corresponding NCBI accession

formatter

not necessary since the accession will show up in dbxref.

widget

input text field for the accession, with a "search" button.

Pressing search should display a summary of the record that will be imported if provided.

The form should then rebuild either a) with the fields populated (problematic for linked fields such as contacts, dbxrefs, properties that don't exist yet, etc), or b) simply displaying the summary in the widget. On submit, we then deal with the other fields, either filling them out by hand, or creating and publishing the chado record...

Option b) sounds better, but really makes me wonder if we shouldn't be trying to go the field route at all.

biosample name is problematic

We didn't have this issue previously because name was provided by the user.

look at this created biomaterial:

  ["biomaterial_id"]=>
  string(3) "657"
  ["taxon_id"]=>
  NULL
  ["biosourceprovider_id"]=>
  NULL
  ["dbxref_id"]=>
  NULL
  ["name"]=>
  string(7) " Leaves"
  ["description"]=>
  string(0) ""
}

the problem is, it derives from this biosample:

<?xml version="1.0"?>
<BioSampleSet>
  <BioSample access="public" publication_date="2015-05-21T17:45:06.623" last_update="2015-10-05T16:33:37.567" submission_date="2015-05-21T17:45:06.703" id="3704235" accession="SAMN03704235">
    <Ids>
      <Id db="BioSample" is_primary="1">SAMN03704235</Id>
      <Id db_label="Sample name"> Leaves</Id>
      <Id db="SRA">SRS957431</Id>
    </Ids>
    <Description>
      <Title>Plant sample from Juglans regia</Title>
      <Organism taxonomy_id="51240" taxonomy_name="Juglans regia">
        <OrganismName>Juglans regia</OrganismName>
      </Organism>
    </Description>
    <Owner>
      <Name>UCDavis</Name>
      <Contacts>
        <Contact email="[email protected]">
          <Name>
            <First>Pedro Jose</First>
            <Last>Martinez-Garcia</Last>
          </Name>
        </Contact>
      </Contacts>
    </Owner>
    <Models>
      <Model>Plant</Model>
    </Models>
    <Package display_name="Plant; version 1.0">Plant.1.0</Package>
    <Attributes>
      <Attribute attribute_name="cultivar" harmonized_name="cultivar" display_name="cultivar">Chandler</Attribute>
      <Attribute attribute_name="age" harmonized_name="age" display_name="age">18</Attribute>
      <Attribute attribute_name="isolation source" harmonized_name="isolation_source" display_name="isolation source">plant at flowering time</Attribute>
      <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">USA: California, Davis</Attribute>
      <Attribute attribute_name="tissue" harmonized_name="tissue" display_name="tissue">leaves</Attribute>
    </Attributes>
    <Links/>
    <Status status="live" when="2015-05-21T17:45:06.702"/>
  </BioSample>
</BioSampleSet>

improve testing on XML parsers

a) all xml parsers are tested in one file. let's split the specific testing off into class-specific tests.

b) the biomaterial test only checks that keys exist. the keys can easily exist and be null. however, not all keys will be set for all files, so need to take an approach more like the assembly tests and in the provider include what the output of that file should look like.

proposal for handling xml -> chado mappings

child of #8

Here's the idea.

The XML parser is split into creating the base record, dbxrefs, linked records, and props. (and whatever other stuff we need).

The base record stuff is hard coded. We look for a hardcoded attribute for each column of the base record, with advanced logic to check for all possible attributes and use the "best" one.

dbxrefs are also hard-coded.

Linked records.... I'm not there yet. Let's ignore them for now.

For everything else: it looks up the tag in an API. The API returns whether the tag should be ignored, added as a prop, or something else.

We have a schema that stores:

ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term or not. If it's assigned, it's just the cvterm_id for easy lookup. We also have a list of all the possible matching cvterms that aren't necessarily assigned (probably a separate, mview type table).

how does the schema get populated? read on...

schema population

We have a job that reads an XML file and compiles all the attribute tags: each tag is stored in the schema as unassigned. It then looks each one up in your chado.cvterm. All exact and "close enough" matches go in the possible matches schema. The admin then goes to an admin area and sees a list of all XML terms with matches. From there they can "assign" the attribute, which means when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.

Furthermore, on install, we can hardcode some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.

When someone imports a new XML, it can be configured to ignore new attribute tags (but add them to its schema as an unmatched, ignored attribute) OR to abort the load -> the admin can then assign a term and re-attempt the load.
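The tag lookup described in this proposal might be sketched as below. All names are hypothetical (`ExampleTagDictionary`, the action constants, the `$strict` flag), and the tag table is a plain array here rather than the custom schema; the cvterm ID in the usage is taken from the "breed" property in the functional test output elsewhere on this page.

```php
<?php

/**
 * Hypothetical tag dictionary: decides what to do with an attribute tag.
 */
class ExampleTagDictionary {

  const ADD_PROPERTY = 'property';
  const IGNORE = 'ignore';

  /** @var array Map of tag name => cvterm_id, or NULL when unassigned. */
  private $tags;

  /** @var bool Whether a previously unseen tag aborts the load. */
  private $strict;

  public function __construct(array $tags, $strict = FALSE) {
    $this->tags = $tags;
    $this->strict = $strict;
  }

  /**
   * Returns the action for a tag, recording unseen tags as unassigned.
   */
  public function lookup($tag) {
    if (!array_key_exists($tag, $this->tags)) {
      // Remember the new tag so an admin can assign a term later.
      $this->tags[$tag] = NULL;
      if ($this->strict) {
        throw new RuntimeException("Unmapped tag: {$tag}");
      }
    }
    $cvterm_id = $this->tags[$tag];
    if ($cvterm_id === NULL) {
      return ['action' => self::IGNORE, 'cvterm_id' => NULL];
    }
    return ['action' => self::ADD_PROPERTY, 'cvterm_id' => $cvterm_id];
  }

  /**
   * Lists tags that still need a term assigned.
   */
  public function unassigned() {
    return array_keys(array_filter($this->tags, 'is_null'));
  }

}
```

With `$strict` left off, new tags are ignored but remembered; with it on, the load aborts so the admin can assign a term and retry, matching the two configurations described above.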

inconsistent data returned from different services

esearch, fetch, docsum.... be careful! The tags from one method won't match the tags from another, even so far as the same info might have different labels...

Do a writeup/plan so we know what to expect.

assembly: which xml tags should be added as properties/_cvterms?

looking at https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/assembly for examples.

Below are example tags from 751381 that aren't dealt with via dbxrefs/linked records:

      <AssemblyType>haploid</AssemblyType>
      <AssemblyClass>haploid</AssemblyClass>
      <AssemblyStatus>Scaffold</AssemblyStatus>
      <WGS>LVXX01</WGS>
      <Coverage>99</Coverage>
      <PartialGenomeRepresentation>false</PartialGenomeRepresentation>
      <Primary>4681358</Primary>
      <AssemblyDescription/>
      <ReleaseLevel>Major</ReleaseLevel>
      <ReleaseType>Major</ReleaseType>
      <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
      <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
      <SeqReleaseDate>2016/06/01 00:00</SeqReleaseDate>
      <AsmUpdateDate>2017/07/19 00:00</AsmUpdateDate>
      <SubmissionDate>2016/06/01 00:00</SubmissionDate>
      <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>
      <SubmitterOrganization>Rubber Research Institute</SubmitterOrganization>
      <RefSeq_category>representative genome</RefSeq_category>
      <AnomalousList>
      </AnomalousList>
      <ExclFromRefSeq>
      </ExclFromRefSeq>
      <PropertyList>
        <string>full-genome-representation</string>
        <string>has-chloroplast</string>
        <string>has_annotation</string>
        <string>latest</string>
        <string>latest_genbank</string>
        <string>latest_refseq</string>
        <string>refseq_has_annotation</string>
        <string>representative</string>
        <string>wgs</string>
      </PropertyList>

additionally, we have all of the STATS tags.

<Stats>
  <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
  <Stat category="chromosome_count" sequence_tag="all">0</Stat>
  <Stat category="contig_count" sequence_tag="all">48315</Stat>
  <Stat category="contig_l50" sequence_tag="all">6073</Stat>
  <Stat category="contig_n50" sequence_tag="all">60046</Stat>
  <Stat category="non_chromosome_replicon_count" sequence_tag="all">1</Stat>
  <Stat category="replicon_count" sequence_tag="all">1</Stat>
  <Stat category="scaffold_count" sequence_tag="all">7453</Stat>
  <Stat category="scaffold_count" sequence_tag="placed">1</Stat>
  <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat>
  <Stat category="scaffold_count" sequence_tag="unplaced">7452</Stat>
  <Stat category="scaffold_l50" sequence_tag="all">320</Stat>
  <Stat category="scaffold_n50" sequence_tag="all">1281786</Stat>
  <Stat category="total_length" sequence_tag="all">1373527118</Stat>
  <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>

Right now I collect each one by combining the category and tag, so for example scaffold_count_all, scaffold_count_placed, etc. Would we want ALL of these as properties?
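For reference, combining category and sequence_tag into one key can be done with SimpleXML roughly like this. The function name `example_collect_stats` is made up for the sketch; the sample values come from the Stats block above.

```php
<?php

/**
 * Flattens <Stat> elements into keys like "scaffold_count_placed".
 */
function example_collect_stats(SimpleXMLElement $stats) {
  $out = [];
  foreach ($stats->Stat as $stat) {
    // Attributes are read with array syntax and cast to plain strings.
    $key = (string) $stat['category'] . '_' . (string) $stat['sequence_tag'];
    $out[$key] = (string) $stat;
  }
  return $out;
}

$xml = simplexml_load_string(
  '<Stats>'
  . '<Stat category="scaffold_count" sequence_tag="all">7453</Stat>'
  . '<Stat category="scaffold_count" sequence_tag="placed">1</Stat>'
  . '<Stat category="contig_n50" sequence_tag="all">60046</Stat>'
  . '</Stats>'
);
$stats = example_collect_stats($xml);
```

If only some stats should become properties, the resulting array could be filtered against a list of wanted keys before insertion.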

creating linked records

If a biosample is crosslinked in an assembly, we really want to create the biosample.

@mestato thinks this is pretty definite. And then the biomaterial module provides the biomaterial value we attached to the analysis....

linking logic and master linking status

Goal: lay out the linking logic for all cases so it's obvious what can be generalized.

linking organism

biosample: base field (taxon_id)
project: same as analysis, we need a new custom linker, because project_feature exists.
assembly (analysis): custom linker table organism_analysis

project biosample

need new linker table?
One is coming for Chado 1.4. See #80.

assembly biosample

assembly - we have a choice. we can link as prescribed by MAGE (biomaterial ->assay -> acquisition -> quantification -> analysis), or we can link directly with a new linker table.

project assembly

project analysis linker table exists :)

linking contact

  • project: project_contact

  • biomaterial: base field (biosourceprovider_id)

  • assembly: no direct linker. Maybe the project contact is enough? Otherwise quantification has an operator_id, which is a contact (so analysis -> quantification -> operator_id)

linking pub

  • analysis_pub
  • project_pub

  • biomaterial: who knows. Do biosamples HAVE publications typically? If not, we're off the hook.

bioproject: which XML tags would be added as additional properties

look in https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/bioprojects for examples.

project description

Some of these things go in the base table: the description, for one. I combine name and title for project.name because the name and title are not reliably named.

<ProjectDescr>
            <Name>Juglans regia</Name>
            <Title>Juglans regia Genome sequencing</Title>
            <Description>Juglans regia cultivar:Chandler</Description>
            <ExternalLink label="Dendrome">
                <URL>http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/Reju/</URL>
            </ExternalLink>
            <ExternalLink label="JHU CCB ftp">
                <URL>ftp://ftp.ccb.jhu.edu/pub/dpuiu/Walnut/English_walnut/</URL>
            </ExternalLink>
            <Publication id="9023104" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <Publication id="27145194" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <ProjectReleaseDate>2015-10-22T00:00:00Z</ProjectReleaseDate>
            <Relevance>
                <Other>yes</Other>
            </Relevance>
        </ProjectDescr>

publication

Publications (i.e. <Publication id="9023104" status="ePublished">) would be added via the _pub linker in Chado.

project type

The project type area has lots of information: the organism itself will be added via linker but some of the other info will not...

<ProjectType>
            <ProjectTypeSubmission>
                <Target capture="eWhole" material="eGenome" sample_scope="eMonoisolate">
                    <Organism species="51240" taxID="51240">
                        <OrganismName>Juglans regia</OrganismName>
                        <Supergroup>eEukaryotes</Supergroup>
                        <BiologicalProperties>
                            <Environment>
                                <OptimumTemperature>C</OptimumTemperature>
                            </Environment>
                        </BiologicalProperties>
                        <Organization>eMulticellular</Organization>
                        <Reproduction>eSexual</Reproduction>
                    </Organism>
                </Target>
                <Method method_type="eSequencing"/>
                <Objectives>
                    <Data data_type="eSequence"/>
                </Objectives>
                <IntendedDataTypeSet>
                    <DataType>genome sequencing</DataType>
                </IntendedDataTypeSet>
                <ProjectDataTypeSet>
                    <DataType>Genome sequencing and assembly</DataType>
                </ProjectDataTypeSet>
            </ProjectTypeSubmission>
        </ProjectType>

submission info

<Submission last_update="2015-07-27" submission_id="SUB972265" submitted="2015-07-27">
        <Description>
            <!-- Submitter information has been removed -->
            <Organization role="owner" type="institute" url="http://ccb.jhu.edu/">
                <Name>Johns Hopkins University</Name>
                <!-- Contact information has been removed -->
            </Organization>
            <Access>public</Access>
        </Description>
        <Action action_id="SUB972265-1"/>
        <Action action_id="SUB972265-3"/>
    </Submission>

standardizing controlled vocabulary mapping vs allowing flexibility for unknown tags and sites changing mapping

to discuss further with @mpoelchau and @childers

Problem:

NCBI doesn't provide ontology mappings for attributes. Monica has done lots of work going through all the attributes we are interested in. Now we need to assign them to terms. Our broad options are to create a custom NCBI ontology or to map terms to existing ontologies. I'm always a fan of using existing terms if possible, as that's Tripal's approach.... although maybe since we're talking about NCBI we should be communicating with them.

Assuming we go ahead mapping terms, we then have to consider how this module will associate the XML attributes with cvterms for properties.

Possible implementation: tag terms as associated with ncbi xml tag?

We could use cvtermprop, or just a custom table, to associate XML tags with cvterms. We then let users update that themselves and/or provide an interface to do so.

integrate major classes into a single end user command

we agree querying and returning a record should look something like this:

  public function getProjects($projects) {
    $return = [];
    foreach ($projects as $project) {

      $db = 'bioproject';
      $project = (new EUtils())->get($db, $project);
      $return[] = $project;
    }

    return $return;
  }

That would mean the EUtils get method can look something like this:

class EUtils {

  public function get($db, $accession) {

    $provider = $this->getResourceProvider($db);

    $xml = $provider->xml();

    $parser = new EUtilsXMLParser($db);
    $data = $parser->parse($xml);
    $repository = new EutilsRepository($db);

    $record = $repository->create($data);

    return $record;

  }

}

but the EutilsRepository is an abstract class...

we need a repository factory.
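A repository factory could be as small as a map from NCBI db name to concrete class. In this sketch every class name is hypothetical and `create()` is a stub; the target tables (project, biomaterial, analysis) follow the mapping used in the other issues on this page.

```php
<?php

abstract class ExampleRepositoryBase {

  /** Creates the base record from parsed data (stubbed in this sketch). */
  abstract public function create(array $data);

}

class ExampleBioProjectRepository extends ExampleRepositoryBase {

  public function create(array $data) {
    return ['table' => 'project'] + $data;
  }

}

class ExampleBioSampleRepository extends ExampleRepositoryBase {

  public function create(array $data) {
    return ['table' => 'biomaterial'] + $data;
  }

}

class ExampleAssemblyRepository extends ExampleRepositoryBase {

  public function create(array $data) {
    return ['table' => 'analysis'] + $data;
  }

}

class ExampleRepositoryFactory {

  /** @var array Map of lowercase NCBI db name => repository class. */
  private $map = [
    'bioproject' => 'ExampleBioProjectRepository',
    'biosample' => 'ExampleBioSampleRepository',
    'assembly' => 'ExampleAssemblyRepository',
  ];

  /** Returns the repository for an NCBI db, or throws for unknown names. */
  public function get($db) {
    $key = strtolower($db);
    if (!isset($this->map[$key])) {
      throw new InvalidArgumentException("No repository for NCBI db: {$db}");
    }
    $class = $this->map[$key];
    return new $class();
  }

}
```

The get method shown earlier would then call something like `(new ExampleRepositoryFactory())->get($db)` instead of instantiating the abstract class directly.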

Also, I note that if the record already exists in the db, you make a query when you don't really need to. Maybe before we query we should check to see if the record already exists.

move generic methods to eutilsrepositoryinterface?

The biosample repository has a lot of methods that I want to copy wholesale for bioproject: getAccessionByName and getDB, for example. Let's move them into the parent class. A good indicator: if a method doesn't explicitly touch the base table (i.e. the biomaterial table), we can move it.

functional testing: biosamples

test accession: 744358 https://www.ncbi.nlm.nih.gov/biosample/744358

Test using the form input.

created objects

organism

select * from chado.organism;
2574	B. mutus	Bos	mutus

biomaterial

select * from chado.biomaterial;
2349	2574	455		SAMN00744358

analysis

none created.

biomaterial properties

select * from chado.biomaterialprop bp INNER JOIN chado.cvterm cvt ON  cvt.cvterm_id = bp.type_id ;
biomaterialprop_id	biomaterial_id	type_id	value	rank	cvterm_id	cv_id	name	definition	dbxref_id	is_obsolete	is_relationshiptype
6187	2349	9724	yakQH1	0	9724	1748	breed		23510	0	0
6188	2349	11997	BGI-yakQH1	0	11997	1748	submitter_provided_accession		27861	00
6189	2349	10277	<BioSample submission_date="2011-10-26T05:31:04.493" last_update="2013-10-31T11:18:50.160" publication_date="2012-04-12T15:08:48.567" access="public" id="744358" accession="SAMN00744358">   <Ids>     <Id db="BioSample" is_primary="1">SAMN00744358</Id>     <Id db="BGI" db_label="Sample name">BGI-yakQH1</Id>     <Id db="SRA">SRS269061</Id>   </Ids>   <Description>     <Title>Bos mutus</Title>     <Organism taxonomy_id="72004" taxonomy_name="Bos mutus"/>     <Comment>       <Paragraph>Bos mutus yakQH1</Paragraph>     </Comment>   </Description>   <Owner>     <Name abbreviation="BGI">Beijing Genome Institute</Name>   </Owner>   <Models>     <Model>Generic</Model>   </Models>   <Package display_name="Generic">Generic.1.0</Package>   <Attributes>     <Attribute attribute_name="breed" harmonized_name="breed" display_name="breed">yakQH1</Attribute>   </Attributes>   <Status status="live" when="2012-05-14T08:37:47.960"/> </BioSample>	0	10277	2	full_ncbi_xml		24507	00
(3 rows)

dbxrefs

select * from chado.biomaterial_dbxref bdx INNER JOIN chado.dbxref dx ON dx.dbxref_id = bdx.dbxref_id;
biomaterial_dbxref_id	biomaterial_id	dbxref_id	dbxref_id	db_id	accession	version	description
1838	2349	30446	30446	1733	SAMN00744358
1839	2349	34712	34712	4263	SRS269061

contacts

455		Beijing Genome Institute

projects

none created

warnings

Undefined variable: site_name in TaxonomyImporter->initTree() (line 275 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Undefined variable: num_handled in TaxonomyImporter->run() (line 239 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Use of undefined constant NCBITaxon - assumed 'NCBITaxon' in TaxonomyImporter->findOrganism() (line 517 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Trying to get property of non-object in TaxonomyImporter->findOrganism() (line 554 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Undefined index: contacts in EUtilsRepository->createContact() (line 294 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/repositories/EUtilsRepository.inc).

fatal error with the ncbiDB

When previewing 2261463

Notice: Undefined index: db in EUtilsBioSampleFormatter->format() (line 82 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsBioSampleFormatter.inc).
TypeError: Argument 1 passed to EUtilsFormatter::getNCBIDB() must be of the type string, null given, called in /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsBioSampleFormatter.inc on line 82 in EUtilsFormatter->getNCBIDB() (line 29 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsFormatter.inc).

documentation time

for dev...
Write a parser that extends X. Write a formatter that extends Y. We hardcode the selectbox in the admin form, so add it there. The db name MUST MATCH the NCBI name.... you need to install a new db into chado.db....

for user...
not much for now. when we add fields, we'll need docs for that.

should ftp linkouts be added as properties? if so which ones?

see the key/tags to expect below. You can have genbank/refseq for the whole file, as well as the assembly report and the stats report.

 ["Assembly_rpt"]=>
    string(120) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_report.txt"
    ["GenBank"]=>
    string(77) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/411/555/GCA_001411555.1_wgs.5d"
    ["RefSeq"]=>
    string(77) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d"
    ["Stats_rpt"]=>
    string(119) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_stats.txt"

Keep in mind these ftp links are in the record's raw XML dump too, so users will find them there if they are enterprising.

RefSeq and GenBank ftp links seem like no-brainers to me for convenient data download.

My first thought is to add them as dbxrefs somehow so they show up as linkouts, but I don't know that it could build the query properly/consistently.

Maybe there's another Chado table that's better suited that I'm not thinking of.

flexibly parsing attributes, child values, child attributes, while dealing with key name overlap

Major challenges:

  • Data keys may not be consistent.
  • Location of data keys may not be consistent.

Consider the below parser. It's way too specific. It wants certain keys to exist in certain places.

I need to get a broad set of input XMLs and devise a strategy from what I see.

I think that the best approach may be to thoroughly go through every child and attribute, looking for keys that match a set of triggers... then, once we've built that, we try to figure out which one the "best" value is, and deal with overlapping values etc...

  private function bioproject_project($xml) {
    $info = [];

    // Don't expect parent attributes to matter.
    // $attributes = $xml->attributes();
    $children = $xml->children();

    foreach ($children as $key => $child) {
      switch ($key) {
        case 'ProjectDescr':
          // Information about the project itself. Includes title, description.
          break;

        case 'ProjectType':
          // Includes organism, metadata for project.
          $target = $child->ProjectTypeSubmission->Target;
          if (!$target) {
            break;
          }

          $organism = $target->Organism;
          if (!$organism) {
            break;
          }

          $attributes = $organism->attributes();

          // What about other children and their attributes?
          // Cast so we store a plain string rather than a SimpleXMLElement.
          $info['type']['organism']['taxID'] = (string) $attributes['taxID'];
          break;

        case 'ProjectID':
          // Accession info for the project. Should match what was submitted, that's about it.
          break;

        default:
          // Unexpected tag. Log it and stop.
          tripal_log(t("Unexpected tag: !key", ['!key' => $key]));
          return FALSE;
      }
    }

    return $info;
  }
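The "go through every child and attribute" approach described above could start from a recursive walk such as the one below. This is a sketch under assumptions: `example_collect_triggers` and the shape of the returned hits are invented for illustration, and choosing the "best" value among overlapping hits is left out.

```php
<?php

/**
 * Recursively collects every attribute and leaf element whose name matches
 * a trigger, remembering the path where each one was found.
 *
 * @param SimpleXMLElement $node
 *   The element to walk.
 * @param array $triggers
 *   Lowercase names to look for.
 * @param string $path
 *   Path of the parent element (used internally during recursion).
 */
function example_collect_triggers(SimpleXMLElement $node, array $triggers, $path = '') {
  $path = $path . '/' . $node->getName();
  $found = [];

  // Attributes on this element.
  foreach ($node->attributes() as $name => $value) {
    if (in_array(strtolower($name), $triggers, TRUE)) {
      $found[] = ['key' => $name, 'value' => (string) $value, 'path' => $path . '/@' . $name];
    }
  }

  // Text content, only for elements without children.
  if ($node->count() === 0) {
    $value = trim((string) $node);
    if ($value !== '' && in_array(strtolower($node->getName()), $triggers, TRUE)) {
      $found[] = ['key' => $node->getName(), 'value' => $value, 'path' => $path];
    }
  }

  // Recurse into children, wherever they happen to sit.
  foreach ($node->children() as $child) {
    $found = array_merge($found, example_collect_triggers($child, $triggers, $path));
  }

  return $found;
}
```

Because each hit keeps its path, a later step can rank duplicates (for example preferring an attribute over a nested element) without the walker needing to know where a key is supposed to live.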

overloading ncbi and api keys

carefully read
https://www.ncbi.nlm.nih.gov/books/NBK25497/

On December 1, 2018, NCBI will begin enforcing the use of API keys that will offer enhanced levels of supported access to the E-utilities. After that date, any site (IP address) posting more than 3 requests per second to the E-utilities without an API key will receive an error message. By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request (eutilities@ncbi.nlm.nih.gov). Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter.

Example request including an API key:
esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345

Example error message if rates are exceeded:
{"error":"API rate limit exceeded","count":"11"}

So we could add admin support for providing the user's API key (I think the key belongs to the user and not to 3rd party software, i.e. ours).
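A rough sketch of how an optional key could feed into request building and pacing follows. The function names are hypothetical, the key would come from an admin setting that does not exist yet, and the rate figures are the ones quoted above (3 requests per second without a key, 10 with one).

```php
<?php

/**
 * Builds an E-utilities URL, appending api_key only when one is configured.
 */
function example_eutils_url($endpoint, array $params, $api_key = NULL) {
  if ($api_key !== NULL && $api_key !== '') {
    $params['api_key'] = $api_key;
  }
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/' . $endpoint . '?' . http_build_query($params);
}

/**
 * Minimum pause between requests in microseconds for the quoted limits.
 */
function example_request_interval($api_key = NULL) {
  $per_second = empty($api_key) ? 3 : 10;
  return (int) ceil(1000000 / $per_second);
}
```

A loader could call `usleep(example_request_interval($key))` between requests so that bulk imports stay under whichever limit applies.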

handling linked organisms

organisms are linked in the db most reliably via the NCBITAXON ID:

biomaterial

 <Organism taxonomy_id="3981" taxonomy_name="Hevea brasiliensis">
        <OrganismName>Hevea brasiliensis</OrganismName>
      </Organism>

assembly

<Organism>Canis lupus familiaris (dog)</Organism>
	<SpeciesTaxid>9612</SpeciesTaxid>
	<SpeciesName>Canis lupus</SpeciesName>

project:

<Organism species="72004" taxID="72004">
                        <OrganismName>Bos mutus</OrganismName>
                        <Strain>yakQH1</Strain>
                        <Supergroup>eEukaryotes</Supergroup>
                    </Organism>
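Since the taxonomy ID sits in a different place for each record type, one small function could hide the difference. This is a sketch only: `example_taxon_id` is a made-up name, and it assumes the caller passes the element that directly contains the organism information shown in the three snippets above.

```php
<?php

/**
 * Returns the NCBI taxonomy ID from an organism-bearing element, or NULL.
 *
 * @param SimpleXMLElement $record
 *   Element that directly contains the organism information.
 * @param string $type
 *   One of 'biosample', 'assembly' or 'bioproject'.
 */
function example_taxon_id(SimpleXMLElement $record, $type) {
  switch ($type) {
    case 'biosample':
      // <Organism taxonomy_id="...">
      $id = (string) $record->Organism['taxonomy_id'];
      break;

    case 'assembly':
      // <SpeciesTaxid>...</SpeciesTaxid>
      $id = (string) $record->SpeciesTaxid;
      break;

    case 'bioproject':
      // <Organism taxID="...">
      $id = (string) $record->Organism['taxID'];
      break;

    default:
      $id = '';
  }
  return $id === '' ? NULL : $id;
}
```

Returning NULL when nothing is found lets the caller decide between asking the user for an organism and importing one, which is the choice laid out in the options for this issue.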

Furthermore, remember that organism is a required field for biosample.

Options:

  • require organism to be provided by the user. Not unreasonable.
  • automatically pull the linked organism. In this case, we have to ask a) what should we do with organisms not found in the database? Import them? We tend to use organisms as intentional categories, so adding one without the admin being very cognizant of it sounds like a poor idea.