nal-i5k / tripal_eutils

ncbi loader via the eutils interface

License: GNU General Public License v3.0

PHP 100.00%

tripal tripal3 tripal3-compatible ncbi ncbi-assembly ncbi-biosamples eutils hacktoberfest2022

tripal_eutils's Introduction

Tripal Eutils

This module connects to the NCBI EUtils API to load in accessions for the Assembly, BioProject, and BioSample databases. Primary, as well as linked, records are loaded into Chado.

Please see the Documentation website for more information, as well as installation and usage instructions.

Requirements

This module requires Chado version 1.3 or greater.

Copyright notice

The Tripal EUtilities "logo" is derived from the collectible card game Hearthstone, copyright © Blizzard Entertainment, Inc. Hearthstone® is a registered trademark of Blizzard Entertainment, Inc. Tripal is not affiliated or associated with or endorsed by Hearthstone® or Blizzard Entertainment, Inc. Art is a public domain image courtesy of Karen Hatzigeorgiou.

tripal_eutils's Issues

organism creation: which organism should be used when multiple taxIDs are linked?

take this example from Fragaria vesca: <Organism species="57918" taxID="101020">

57918 is Fragaria vesca, 101020 is subspecies vesca

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=57918
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=101020

I would think the correct approach is to link the taxID only.
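
A minimal sketch of the "link the taxID only" choice, written in Python for brevity rather than the module's PHP. The function name `pick_taxon_id` is hypothetical, and falling back to `species` when `taxID` is absent is an assumption, not something decided in this issue.

```python
import xml.etree.ElementTree as ET


def pick_taxon_id(organism_xml: str) -> str:
    """Prefer the taxID attribute; fall back to species (assumed behavior)."""
    organism = ET.fromstring(organism_xml)
    tax_id = organism.get("taxID")
    if tax_id:
        return tax_id
    species = organism.get("species")
    if species:
        return species
    raise ValueError("Organism element has neither taxID nor species")


# The Fragaria vesca example from above: the subspecies taxID wins.
fragment = '<Organism species="57918" taxID="101020"/>'
print(pick_taxon_id(fragment))
```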

mapping compatibility with other community members: ethy

Ethy has provided us with a map of her bioproject and biosample import:

https://github.com/GMOD/Chado/files/2672426/genome_schema.pdf

Do we see any problems or incompatibilities with this?

Are we going to have to create a separate project to house assembly? I'll try to keep us abreast of what's going on in the issue here: GMOD/Chado#76

offering full XML as browseable field

@mestato suggested offering the full XML as a browseable field instead of just a download
https://codebeautify.org/xmlviewer

https://files.slack.com/files-pri/T3PCSESF2-FEMEPL75G/screen_shot_2018-12-06_at_1.53.10_pm.png

build generic php-based query to eutils

camel instead of snake

EVERYTHING in the classes should be camel case, except for variables.

tagmapper - cache cvterm lookups

#27 (review)

@almasaeed2010 noted that we look up the terms for each and proposed caching the lookups, which sounds like a great idea to me.

As an alternative, the mapper could instead just send the accession (local:sex for example) so the lookup doesn't happen unless we encounter that term... but if we encounter most terms in most lookups (which we do), it would be smarter to pre-cache like you suggest.
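
A rough sketch of the caching idea, in Python rather than the module's PHP. `CachedTermLookup` and `fake_db_lookup` are hypothetical names, and the dictionary stands in for the real cvterm query. This version caches lazily on first use; pre-caching would simply call `get` for every known accession up front.

```python
from typing import Callable, Dict, Optional


class CachedTermLookup:
    """Wrap a slow term lookup so each accession is resolved at most once."""

    def __init__(self, lookup: Callable[[str], Optional[int]]):
        self._lookup = lookup
        self._cache: Dict[str, Optional[int]] = {}

    def get(self, accession: str) -> Optional[int]:
        # Misses (None) are cached too, so unknown terms are not re-queried.
        if accession not in self._cache:
            self._cache[accession] = self._lookup(accession)
        return self._cache[accession]


calls = []


def fake_db_lookup(accession: str) -> Optional[int]:
    """Stand-in for the database query; records every call it receives."""
    calls.append(accession)
    return {"local:sex": 101}.get(accession)


terms = CachedTermLookup(fake_db_lookup)
terms.get("local:sex")
terms.get("local:sex")
print(len(calls))  # number of times the underlying lookup actually ran
```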

need formatters and repositories to inherit some of the same stuff

I just created a PR that adds the function below to formatters:

tripal_eutils/includes/formatters/EUtilsFormatter.inc

Lines 29 to 49 in 85f9308

    
  public function getNCBIDB(string $db_name) {
    $name = "NCBI {$db_name}";

    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($name)]
    )->fetchObject();

    if ($db) {
      return $db;
    }

    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($db_name)]
    )->fetchObject();

    if ($db) {
      return $db;
    }

    return FALSE;
  }

It's redundant with a function for repositories. We need to make a new class for shared stuff.

Idea: should they both actually extend the same class?
Or perhaps each parent class extends a base class with the shared stuff?

idea: what if eutils created DRAFT content for connected content using tripal_hq?

what if an analysis cites a biosample, so this module fetches the biosample and proposes creating a new one via HQ so it needs to be approved first?

ignored tags

I proposed ignoring the tags that don't add anything, etc.

https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/618

In the i5k issue, I see that @mpoelchau proposed storing everything but only displaying a subset. We can do that, but I'm not a huge fan because we're going to bloat our content with many, many prop fields.

Need to come up with a strategy.

linking organism to analysis

My analysis repo is going to link organism to analysis, BUT what happens if they don't have the organism_analysis table!?

Should we require manage_analyses? Or install the table ourselves? In that case we won't have the field for displaying the link...

  /**
   * Insert into organism_analysis, or return existing link.
   *
   * @param $organism
   * Full chado.organism record.
   *
   * @return mixed
   * @throws \Exception
   */
  public function linkOrganism($organism) {

    if (!chado_table_exists('organism_analysis')) {
      throw new Exception('The organism_analysis linker table does not exist. No way to link this organism.');
    }

    $result = db_select('chado.organism_analysis', 't')
      ->fields('t', ['organism_analysis_id'])
      ->condition('t.organism_id', $organism->organism_id)
      ->condition('t.analysis_id', $this->base_record_id)
      ->execute()
      ->fetchField();

    if (!$result) {

      $result = db_insert('chado.organism_analysis')
        ->fields([
          'organism_id' => $organism->organism_id,
          'analysis_id' => $this->base_record_id,
        ])
        ->execute();
    }

    if (!$result) {
      throw new Exception('Could not link organism to analysis.');
    }

    return $result;
  }

add test coverage to code climate reporting

https://docs.codeclimate.com/docs/configuring-test-coverage

configure phpunit to generate clover.xml when run on Travis only
add CI integration to before_script and after_script of the .travis file (see here)

Note that parallel builds make test coverage reporting harder, something to consider for core. It would be great if we could ignore all but one build.

where should the reporting of what would be created happen?

right now we do some basic key value parsing to build some tables. but we need to

ensure that the records are only created upon submission, not when previewing
properly handle whatever parsed XML comes out of the XML parser.

My inclination is -

add a public method to set the EUtils class to "preview" mode.

Then, instead of spawning a repository class to insert records, it instantiates a new class whose responsibility is to convert the output of the XML parser into a Drupal table.

The main highlights should be

what will the base record look like?
what records will get linked? contacts, pubs, etc
what properties will be added?
what secondary records will get created?
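
The preview idea above could be sketched roughly as follows, in Python rather than the module's PHP. All class and method names here (`PreviewTable`, `Repository`, `set_preview`, `handle`) and the shape of the parsed data are hypothetical; the point is only that preview mode routes parsed data to a table builder and never touches the repository.

```python
class PreviewTable:
    """Turns parsed data into display rows; never writes anything."""

    def build(self, data: dict) -> list:
        rows = [("base record", data.get("name", ""))]
        rows += [("linked record", item) for item in data.get("linked", [])]
        rows += [("property", f"{key} = {value}")
                 for key, value in data.get("props", {}).items()]
        return rows


class Repository:
    """Stand-in for the class that inserts records."""

    def __init__(self):
        self.inserted = []

    def create(self, data: dict) -> dict:
        self.inserted.append(data)
        return data


class EUtils:
    def __init__(self, repository: Repository):
        self.repository = repository
        self.preview = False

    def set_preview(self, preview: bool = True) -> "EUtils":
        self.preview = preview
        return self

    def handle(self, parsed: dict):
        # Preview: describe what would be created. Otherwise: create it.
        if self.preview:
            return PreviewTable().build(parsed)
        return self.repository.create(parsed)


repo = Repository()
parsed = {"name": "SAMN00744358", "linked": ["contact: BGI"], "props": {"breed": "yakQH1"}}
for row in EUtils(repo).set_preview().handle(parsed):
    print(row)
```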

EUtilsBioSampleRepository::createAccession is not compatible with base class

Warning: Declaration of EUtilsBioSampleRepository::createAccession($bio_sample, $accession) should 
be compatible with EUtilsRepository::createAccession($accession) in require_once() (line 24 of 
/Users/Almsaeed/Work/DevSites/Tripal/sites/all/modules/tripal_eutils/tripal_eutils.module).

reference XML sets

GOAL:

@bradfordcondon will supply...

provide a set of seed XML responses to base our parsing logic. We want at a minimum 2 XML files per type (biosample, assembly... bioproject (subject to change)) per use case domain (trees, crops, bacteria, bugs, animals)...

8 XML files per NCBI type; as proposed, that is 24 XML files...

@almasaeed2010 & @bradfordcondon will both preview. Compare with the tagset listed at https://docs.google.com/spreadsheets/d/10G9IgmD8Yn5ZG0LY8AMeLMyoJAvi7ipgC8ZQHrUMi2U/edit#gid=1768962945

NOTE: ATTRIBUTES we can plan on dealing with as we did for biomaterials in the tripal analysis expression module, except we might want to further standardize CVterm usage, i.e. we don't necessarily want an interface of term mapping for each upload (double check that each content type is using the attributes tag for these things, which are going to map cleanly to Chado properties).

From this, we're going to figure ...

a) what goes into chado
b) most open-ended/flexible way to parse XML to do this...

input field

Why provide a field? Because without it, we can't hook into HQ very easily. Maybe we'll think of another way and not provide a field, i dunno.

the field should -

base field

simply the value of the corresponding NCBI accession

formatter

not necessary, since the accession will show up in dbxref.

widget

input text field for the accession, with a "search" button.

Pressing search should display a summary of the record that will be imported if provided.

The form should then rebuild either a) with the fields populated (problematic for linked fields such as contacts, dbxrefs, properties that don't exist yet, etc.), or b) simply displaying the summary in the widget. On submit, we then deal with the other fields, either filling them out by hand, or creating and publishing the Chado record...

Option b) sounds better, but it really makes me wonder if we should be going the field route at all.

biosample name is problematic

We didn't have this issue previously because the name was provided by the user.

look at this created biomaterial:

  ["biomaterial_id"]=>
  string(3) "657"
  ["taxon_id"]=>
  NULL
  ["biosourceprovider_id"]=>
  NULL
  ["dbxref_id"]=>
  NULL
  ["name"]=>
  string(7) " Leaves"
  ["description"]=>
  string(0) ""
}

the problem is, it derives from this biosample:

<?xml version="1.0"?>
<BioSampleSet>
  <BioSample access="public" publication_date="2015-05-21T17:45:06.623" last_update="2015-10-05T16:33:37.567" submission_date="2015-05-21T17:45:06.703" id="3704235" accession="SAMN03704235">
    <Ids>
      <Id db="BioSample" is_primary="1">SAMN03704235</Id>
      <Id db_label="Sample name"> Leaves</Id>
      <Id db="SRA">SRS957431</Id>
    </Ids>
    <Description>
      <Title>Plant sample from Juglans regia</Title>
      <Organism taxonomy_id="51240" taxonomy_name="Juglans regia">
        <OrganismName>Juglans regia</OrganismName>
      </Organism>
    </Description>
    <Owner>
      <Name>UCDavis</Name>
      <Contacts>
        <Contact email="[email protected]">
          <Name>
            <First>Pedro Jose</First>
            <Last>Martinez-Garcia</Last>
          </Name>
        </Contact>
      </Contacts>
    </Owner>
    <Models>
      <Model>Plant</Model>
    </Models>
    <Package display_name="Plant; version 1.0">Plant.1.0</Package>
    <Attributes>
      <Attribute attribute_name="cultivar" harmonized_name="cultivar" display_name="cultivar">Chandler</Attribute>
      <Attribute attribute_name="age" harmonized_name="age" display_name="age">18</Attribute>
      <Attribute attribute_name="isolation source" harmonized_name="isolation_source" display_name="isolation source">plant at flowering time</Attribute>
      <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">USA: California, Davis</Attribute>
      <Attribute attribute_name="tissue" harmonized_name="tissue" display_name="tissue">leaves</Attribute>
    </Attributes>
    <Links/>
    <Status status="live" when="2015-05-21T17:45:06.702"/>
  </BioSample>
</BioSampleSet>
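
One possible way to derive a name, sketched in Python rather than the module's PHP: trim the "Sample name" Id and fall back to the primary accession when it is empty. The function `biosample_name` and the fallback order are assumptions for illustration, not a decision made in this issue.

```python
import xml.etree.ElementTree as ET


def biosample_name(biosample: ET.Element) -> str:
    """Trimmed sample-name Id, else the primary accession (assumed policy)."""
    primary = ""
    label = ""
    for id_el in biosample.findall("./Ids/Id"):
        text = (id_el.text or "").strip()
        if id_el.get("is_primary") == "1":
            primary = text
        elif id_el.get("db_label") == "Sample name":
            label = text
    return label or primary or biosample.get("accession", "")


# Trimmed-down copy of the Ids block from the record above.
sample = ET.fromstring(
    '<BioSample accession="SAMN03704235"><Ids>'
    '<Id db="BioSample" is_primary="1">SAMN03704235</Id>'
    '<Id db_label="Sample name"> Leaves</Id>'
    '<Id db="SRA">SRS957431</Id>'
    '</Ids></BioSample>'
)
print(repr(biosample_name(sample)))
```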

linked records only formatted to accessions

Instead of just the accession ID, we would probably expect the formatter to fetch the new record (without inserting it) and display its base fields, for example.

supported databases

efetch doesn't support all databases.
https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.T._entrez_unique_identifiers_ui
Notably, assembly is missing.

For such types, I think we are forced to use, say, esummary instead of efetch.
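
A small sketch of that fallback, in Python rather than the module's PHP. It only builds the request URL and makes no network call. `SUMMARY_ONLY_DBS` and `build_url` are hypothetical names, and which databases belong in the set is an assumption taken from the note above.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Assumption: databases listed here are queried with esummary, not efetch.
SUMMARY_ONLY_DBS = {"assembly"}


def build_url(db: str, uid: str) -> str:
    """Pick the E-utilities endpoint for a database and build the query URL."""
    db = db.lower()
    endpoint = "esummary.fcgi" if db in SUMMARY_ONLY_DBS else "efetch.fcgi"
    query = urlencode({"db": db, "id": uid, "retmode": "xml"})
    return f"{EUTILS_BASE}/{endpoint}?{query}"


print(build_url("assembly", "751381"))
print(build_url("biosample", "744358"))
```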

improve testing on XML parsers

a) all xml parsers are tested in one file. let's split the specific testing off into class-specific tests.

b) the biomaterial test only checks that keys exist. the keys can easily exist and be null. however, not all keys will be set for all files, so need to take an approach more like the assembly tests and in the provider include what the output of that file should look like.

get analysis source URI for creating assembly -> analysis

we have source URI via the FTP field for the assembly

assembly -> chado analysis mapping: we need to download FTP stuff...

child of #15

Assembly has core field information for analysis (namely the algorithm) in the FTP file. So, we need to

write an FTP class
fetch the files, search for the algorithm info.
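
A sketch of the "search for the algorithm info" step only, in Python rather than the module's PHP; downloading over FTP is left out. It assumes the report begins with `# Key: value` comment lines, one of which is an "Assembly method" line. That layout is an assumption to verify against real report files, and the sample text below is invented for illustration.

```python
def parse_report_header(text: str) -> dict:
    """Collect '# Key: value' pairs from the leading comment block."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line.lstrip("#").strip()
        if ":" in body:
            key, value = body.split(":", 1)
            header[key.strip()] = value.strip()
    return header


# Invented sample, not a real NCBI report.
sample_report = (
    "# Assembly name:  example_v1\n"
    "# Assembly method: ExampleAssembler v. 1.0\n"
    "#\n"
    "seq1\tassembled-molecule\n"
)
print(parse_report_header(sample_report).get("Assembly method"))
```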

proposal for handling xml -> chado mappings

child of #8

Here's the idea.

The XML parser is split into creating the base record, dbxrefs, linked records, and props. (and whatever other stuff we need).

The base record stuff is hard coded. We look for a hardcoded attribute for each column of the base record, with advanced logic to check for all possible attributes and use the "best" one.

dbxrefs are also hard-coded.

Linked records... I'm not there yet. Let's ignore them for now.

for everything else: it looks up the tag in an API. the API returns if the tag should be ignored, added as a prop, or something else.

We have a schema that stores:

ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term or not. If it's assigned, it's just the cvterm_id for easy lookup. We also have a list of all the matching possible cvterms that aren't necessarily assigned (probably a separate, mview-type table).

how does the schema get populated? read on...

schema population

We have a job that reads an XML file and compiles all the attribute tags: each tag is stored in the schema as unassigned. It then looks each one up in your chado.cvterm. All exact and "close enough" matches go in the possible matches schema. The admin then goes to an admin area and sees a list of all XML terms with matches. From there they can "assign" the attribute, which means that when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.

Furthermore, on install, we can hardcode some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.

When someone imports a new XML, it can be configured to ignore new attribute tags (but add them to its schema as an unmatched, ignored attribute) OR to abort the load -> the admin can then assign a term and re-attempt the load.
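
The lookup API described in this proposal might look roughly like the sketch below, in Python rather than the module's PHP. `TagRegistry`, `action_for` and `UnknownTagError` are hypothetical names, and an in-memory dict stands in for the schema tables. One design choice is assumed here: in abort mode the new tag is still recorded before the load stops, so the admin can find and assign it.

```python
from typing import Dict, Optional, Tuple


class UnknownTagError(Exception):
    """Raised in abort mode when a tag has never been seen before."""


class TagRegistry:
    def __init__(self, abort_on_new: bool = False):
        self.abort_on_new = abort_on_new
        # (ncbi db, tag) -> cvterm_id, or None while the tag is unassigned.
        self.tags: Dict[Tuple[str, str], Optional[int]] = {}

    def assign(self, db: str, tag: str, cvterm_id: int) -> None:
        self.tags[(db, tag)] = cvterm_id

    def action_for(self, db: str, tag: str) -> str:
        """Return 'prop' for assigned tags and 'ignore' for unassigned ones."""
        key = (db, tag)
        if key not in self.tags:
            self.tags[key] = None  # remember the tag as unassigned
            if self.abort_on_new:
                raise UnknownTagError(f"New {db} tag '{tag}' needs a term")
        return "prop" if self.tags[key] is not None else "ignore"


registry = TagRegistry()
print(registry.action_for("biosample", "cultivar"))
registry.assign("biosample", "cultivar", 42)
print(registry.action_for("biosample", "cultivar"))
```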

inconsistent data returned from different services

esearch, fetch, docsum... be careful! The tags from one method won't match the tags from another; the same info might even have different labels...

Do a writeup/plan so we know what to expect.

migs standard?

I like the idea of this module setting MIGS standards (or perhaps a child module).
https://www.ncbi.nlm.nih.gov/pubmed/18464787

http://rd-alliance.github.io/metadata-directory/standards/mibbi-minimum-information-biological-and-biomedical-investigations.html

http://wiki.gensc.org/index.php?title=MIGS/MIMS

assembly: which xml tags should be added as properties/_cvterms?

looking at https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/assembly for examples.

Below are example tags from 751381 that aren't dealt with via dbxrefs/linked records:

      <AssemblyType>haploid</AssemblyType>
      <AssemblyClass>haploid</AssemblyClass>
      <AssemblyStatus>Scaffold</AssemblyStatus>
      <WGS>LVXX01</WGS>
      <Coverage>99</Coverage>
      <PartialGenomeRepresentation>false</PartialGenomeRepresentation>
      <Primary>4681358</Primary>
      <AssemblyDescription/>
      <ReleaseLevel>Major</ReleaseLevel>
      <ReleaseType>Major</ReleaseType>
      <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
      <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
      <SeqReleaseDate>2016/06/01 00:00</SeqReleaseDate>
      <AsmUpdateDate>2017/07/19 00:00</AsmUpdateDate>
      <SubmissionDate>2016/06/01 00:00</SubmissionDate>
      <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>
      <SubmitterOrganization>Rubber Research Institute</SubmitterOrganization>
      <RefSeq_category>representative genome</RefSeq_category>
      <AnomalousList>
      </AnomalousList>
      <ExclFromRefSeq>
      </ExclFromRefSeq>
      <PropertyList>
        <string>full-genome-representation</string>
        <string>has-chloroplast</string>
        <string>has_annotation</string>
        <string>latest</string>
        <string>latest_genbank</string>
        <string>latest_refseq</string>
        <string>refseq_has_annotation</string>
        <string>representative</string>
        <string>wgs</string>
      </PropertyList>

additionally, we have all of the STATS tags.

<Stats>
  <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
  <Stat category="chromosome_count" sequence_tag="all">0</Stat>
  <Stat category="contig_count" sequence_tag="all">48315</Stat>
  <Stat category="contig_l50" sequence_tag="all">6073</Stat>
  <Stat category="contig_n50" sequence_tag="all">60046</Stat>
  <Stat category="non_chromosome_replicon_count" sequence_tag="all">1</Stat>
  <Stat category="replicon_count" sequence_tag="all">1</Stat>
  <Stat category="scaffold_count" sequence_tag="all">7453</Stat>
  <Stat category="scaffold_count" sequence_tag="placed">1</Stat>
  <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat>
  <Stat category="scaffold_count" sequence_tag="unplaced">7452</Stat>
  <Stat category="scaffold_l50" sequence_tag="all">320</Stat>
  <Stat category="scaffold_n50" sequence_tag="all">1281786</Stat>
  <Stat category="total_length" sequence_tag="all">1373527118</Stat>
  <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>

Right now I collect each one by combining the category and sequence tag, so for example scaffold_count_all, scaffold_count_placed, etc. Would we want ALL of these as properties?
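
The category-plus-tag key scheme can be sketched as follows, in Python rather than the module's PHP. `flatten_stats` is a hypothetical name, and converting values to integers is an assumption based on the sample above, where every Stat is a whole number.

```python
import xml.etree.ElementTree as ET


def flatten_stats(stats_xml: str) -> dict:
    """Map each Stat to a '<category>_<sequence_tag>' key."""
    stats = {}
    for stat in ET.fromstring(stats_xml).iter("Stat"):
        key = f"{stat.get('category')}_{stat.get('sequence_tag')}"
        stats[key] = int(stat.text)
    return stats


# A few Stat elements copied from the example above.
stats_fragment = (
    '<Stats>'
    '<Stat category="scaffold_count" sequence_tag="all">7453</Stat>'
    '<Stat category="scaffold_count" sequence_tag="placed">1</Stat>'
    '<Stat category="scaffold_count" sequence_tag="unplaced">7452</Stat>'
    '<Stat category="contig_n50" sequence_tag="all">60046</Stat>'
    '</Stats>'
)
for key, value in flatten_stats(stats_fragment).items():
    print(key, value)
```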

contact undefined when importing some biosamples

Notice: Undefined index: contacts in EUtilsRepository->createContact() (line 294 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/repositories/EUtilsRepository.inc).

child of #99

creating linked records

If a biosample is crosslinked in an assembly, we really want to create the biosample.

@mestato thinks this is pretty definite. And then the biomaterial module provides the biomaterial value we attached to the analysis...

linking logic and master linking status

goal: lay out the linking logic for all cases so it's obvious what can be genericized

linking organism

biosample: base field (taxon_id)
project: same as analysis, we need a new custom linker, because project_feature exists.
assembly (analysis): custom linker table organism_analysis

project biosample

need new linker table?
one is coming for Chado 1.4. See #80

assembly biosample

assembly - we have a choice. we can link as prescribed by MAGE (biomaterial ->assay -> acquisition -> quantification -> analysis), or we can link directly with a new linker table.

project assembly

project analysis linker table exists :)

linking contact

project: project_contact
biomaterial: base field (biosourceprovider_id)

assembly: no direct linker. Maybe the project contact is enough? Otherwise quantification has an operator_id, which is a contact (so analysis -> quantification -> operator_id).

linking pub

analysis_pub
project_pub

biomaterial: who knows. Do biosamples HAVE publications typically? If not, we're off the hook.

previous work

branch of workspace with code (spread between the datasets and cvterm modules?)

https://github.com/isdapps/i5k-tripal/compare/new_theme...metadata_retrieval

https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/533
https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/618

bioproject: which XML tags would be added as additional properties

look in https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/bioprojects for examples.

project description

Some of these things go in the base table (description; I combine name and title for project.name because the name/title are not reliably named).

<ProjectDescr>
            <Name>Juglans regia</Name>
            <Title>Juglans regia Genome sequencing</Title>
            <Description>Juglans regia cultivar:Chandler</Description>
            <ExternalLink label="Dendrome">
                <URL>http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/Reju/</URL>
            </ExternalLink>
            <ExternalLink label="JHU CCB ftp">
                <URL>ftp://ftp.ccb.jhu.edu/pub/dpuiu/Walnut/English_walnut/</URL>
            </ExternalLink>
            <Publication id="9023104" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <Publication id="27145194" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <ProjectReleaseDate>2015-10-22T00:00:00Z</ProjectReleaseDate>
            <Relevance>
                <Other>yes</Other>
            </Relevance>
        </ProjectDescr>

publication

Publications? i.e. <Publication id="9023104" status="ePublished">
would be added via the _pub linker in Chado.

project type

The project type area has lots of information: the organism itself will be added via linker, but some of the other info will not...

<ProjectType>
            <ProjectTypeSubmission>
                <Target capture="eWhole" material="eGenome" sample_scope="eMonoisolate">
                    <Organism species="51240" taxID="51240">
                        <OrganismName>Juglans regia</OrganismName>
                        <Supergroup>eEukaryotes</Supergroup>
                        <BiologicalProperties>
                            <Environment>
                                <OptimumTemperature>C</OptimumTemperature>
                            </Environment>
                        </BiologicalProperties>
                        <Organization>eMulticellular</Organization>
                        <Reproduction>eSexual</Reproduction>
                    </Organism>
                </Target>
                <Method method_type="eSequencing"/>
                <Objectives>
                    <Data data_type="eSequence"/>
                </Objectives>
                <IntendedDataTypeSet>
                    <DataType>genome sequencing</DataType>
                </IntendedDataTypeSet>
                <ProjectDataTypeSet>
                    <DataType>Genome sequencing and assembly</DataType>
                </ProjectDataTypeSet>
            </ProjectTypeSubmission>
        </ProjectType>

submission info

<Submission last_update="2015-07-27" submission_id="SUB972265" submitted="2015-07-27">
        <Description>
            <!-- Submitter information has been removed -->
            <Organization role="owner" type="institute" url="http://ccb.jhu.edu/">
                <Name>Johns Hopkins University</Name>
                <!-- Contact information has been removed -->
            </Organization>
            <Access>public</Access>
        </Description>
        <Action action_id="SUB972265-1"/>
        <Action action_id="SUB972265-3"/>
    </Submission>

standardizing controlled vocabulary mapping vs allowing flexibility for unknown tags and sites changing mapping

to discuss further with @mpoelchau and @childers

Problem:

NCBI doesn't provide ontology mappings for attributes. Monica has done lots of work going through all the attributes we are interested in. Now we need to assign them to terms. Our broad options are to create an NCBI custom ontology or to map terms to existing ontologies. I'm always a fan of using existing terms if possible, as that's Tripal's approach... although maybe, since we're talking about NCBI, we should be communicating with them.

Assuming we go ahead mapping terms, we then have to consider how this module will associate the XML attributes with cvterms for properties.

Possible implementation: tag terms as associated with ncbi xml tag?

We could use cvtermprop, or just a custom table, to associate xml tags with cvterms. we then let users update that themselves and/or provide an interface to do so.

integrate major classes into a single end user command

we agree querying and returning a record should look something like this:

  public function getProjects($projects) {
    $return = [];
    foreach ($projects as $project) {

      $db = 'bioproject';
      $project = (new EUtils())->get($db, $project);
      $return[] = $project;
    }

    return $return;
  }

that would mean Eutils's get method can look something like this:

class EUtils {

  public function get($db, $accession) {

    $provider = $this->getResourceProvider($db);

    $xml = $provider->xml();

    $parser = new EUtilsXMLParser($db);
    $data = $parser->parse($xml);
    $repository = new EutilsRepository($db);

    $record = $repository->create($data);

    return $record;

  }

}

but the EutilsRepository is an abstract class...

we need a repository factory.

Also, I note that if the record already exists in the db, you make a query when you don't really need to. Maybe before we query we should check to see if the record already exists.
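
A repository factory could be as small as a lookup table from NCBI database name to concrete class. The sketch below is in Python rather than the module's PHP; the three subclass names and `repository_for` are hypothetical stand-ins, not the module's actual classes.

```python
class EUtilsRepository:
    """Abstract parent: concrete repositories implement create()."""

    def create(self, data: dict):
        raise NotImplementedError


class BioSampleRepository(EUtilsRepository):
    def create(self, data: dict):
        return ("biomaterial", data)


class BioProjectRepository(EUtilsRepository):
    def create(self, data: dict):
        return ("project", data)


class AssemblyRepository(EUtilsRepository):
    def create(self, data: dict):
        return ("analysis", data)


REPOSITORIES = {
    "biosample": BioSampleRepository,
    "bioproject": BioProjectRepository,
    "assembly": AssemblyRepository,
}


def repository_for(db: str) -> EUtilsRepository:
    """Instantiate the repository registered for an NCBI database name."""
    try:
        return REPOSITORIES[db.lower()]()
    except KeyError:
        raise ValueError(f"No repository registered for NCBI database '{db}'") from None


print(type(repository_for("BioProject")).__name__)
```

With something like this in place, the `get` method above would ask the factory for a repository instead of instantiating the abstract class directly.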

move generic methods to eutilisrepositoryinterface?

The biosample repository has a lot of methods that I want to copy wholesale for bioproject: getAccessionByName and getDB, for example. Let's move them into the parent class. A good indicator is that if a method doesn't explicitly touch the base table (i.e. the biomaterial table), we can move it.

WGS is a database

the <WGS> tag is actually a dbxref

functional testing: biosamples

test accession: 744358 https://www.ncbi.nlm.nih.gov/biosample/744358

Test using the form input.

created objects

organism

select * from chado.organism;
2574	B. mutus	Bos	mutus

biomaterial

select * from chado.biomaterial;
2349	2574	455		SAMN00744358

analysis

none created.

biomaterial properties

select * from chado.biomaterialprop bp INNER JOIN chado.cvterm cvt ON  cvt.cvterm_id = bp.type_id ;
biomaterialprop_id	biomaterial_id	type_id	value	rank	cvterm_id	cv_id	name	definition	dbxref_id	is_obsolete	is_relationshiptype
6187	2349	9724	yakQH1	0	9724	1748	breed		23510	0	0
6188	2349	11997	BGI-yakQH1	0	11997	1748	submitter_provided_accession		27861	00
6189	2349	10277	<BioSample submission_date="2011-10-26T05:31:04.493" last_update="2013-10-31T11:18:50.160" publication_date="2012-04-12T15:08:48.567" access="public" id="744358" accession="SAMN00744358">   <Ids>     <Id db="BioSample" is_primary="1">SAMN00744358</Id>     <Id db="BGI" db_label="Sample name">BGI-yakQH1</Id>     <Id db="SRA">SRS269061</Id>   </Ids>   <Description>     <Title>Bos mutus</Title>     <Organism taxonomy_id="72004" taxonomy_name="Bos mutus"/>     <Comment>       <Paragraph>Bos mutus yakQH1</Paragraph>     </Comment>   </Description>   <Owner>     <Name abbreviation="BGI">Beijing Genome Institute</Name>   </Owner>   <Models>     <Model>Generic</Model>   </Models>   <Package display_name="Generic">Generic.1.0</Package>   <Attributes>     <Attribute attribute_name="breed" harmonized_name="breed" display_name="breed">yakQH1</Attribute>   </Attributes>   <Status status="live" when="2012-05-14T08:37:47.960"/> </BioSample>	0	10277	2	full_ncbi_xml		24507	00
(3 rows)

dbxrefs

select * from chado.biomaterial_dbxref bdx INNER JOIN chado.dbxref dx ON dx.dbxref_id = bdx.dbxref_id;
biomaterial_dbxref_id	biomaterial_id	dbxref_id	dbxref_id	db_id	accession	version	description
1838	2349	30446	30446	1733	SAMN00744358
1839	2349	34712	34712	4263	SRS269061

contacts

455		Beijing Genome Institute

projects

none created

warnings

Undefined variable: site_name in TaxonomyImporter->initTree() (line 275 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Undefined variable: num_handled in TaxonomyImporter->run() (line 239 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Use of undefined constant NCBITaxon - assumed 'NCBITaxon' in TaxonomyImporter->findOrganism() (line 517 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Trying to get property of non-object in TaxonomyImporter->findOrganism() (line 554 of /Users/bc/tripal/sites/all/modules/tripal/tripal_chado/includes/TripalImporter/TaxonomyImporter.inc).
Notice: Undefined index: contacts in EUtilsRepository->createContact() (line 294 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/repositories/EUtilsRepository.inc).

biosample: what additional attributes should be loaded in for a biomaterial?

please see
https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/biosamples
for examples

contacts? multiple contacts for owner etc?

attributes

these are all loaded in as props.

can we automate pulling the biosample property xml and version it / detect changes?

https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml

harmonized name not found in all biomaterials

importing https://www.ncbi.nlm.nih.gov/biosample/8097325

Notice: Undefined index: harmonized_name in EUtilsBioSampleFormatter->format() (line 40 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsBioSampleFormatter.inc).
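
One way to avoid the notice is to treat `harmonized_name` as optional and fall back to another attribute. The sketch below is in Python rather than the module's PHP; `attribute_key` is a hypothetical name and the fallback order is an assumption.

```python
from typing import Optional


def attribute_key(attrs: dict) -> Optional[str]:
    """Return the first non-empty name, preferring harmonized_name."""
    for name in ("harmonized_name", "attribute_name", "display_name"):
        value = attrs.get(name)
        if value:
            return value
    return None


print(attribute_key({"attribute_name": "isolation source"}))
print(attribute_key({"attribute_name": "isolation source",
                     "harmonized_name": "isolation_source"}))
```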

Genericize EUtilsRequest to handle all HTTP related requests

see #36

We need to remove the required $db var from the constructor.

fatal error with the ncbiDB

When previewing 2261463

Notice: Undefined index: db in EUtilsBioSampleFormatter->format() (line 82 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsBioSampleFormatter.inc).
TypeError: Argument 1 passed to EUtilsFormatter::getNCBIDB() must be of the type string, null given, called in /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsBioSampleFormatter.inc on line 82 in EUtilsFormatter->getNCBIDB() (line 29 of /Users/bc/tripal/sites/all/modules/custom/tripal_eutils/includes/formatters/EUtilsFormatter.inc).

create and utilize linker tables that are in chado 1.4?

we should create the biomaterial_project table, and any other tables we need.

GMOD/Chado#55

the PR exists for chado 1.4 so very likely to end up in Chado 1.4

documentation time

for dev...
write a parser that extends X. write a formatter that extends Y. we hardcode the selectbox in the admin form, add it there. the db name MUST MATCH the ncbi name.... you need to install a new db into chado.db....

for user...
not much for now. when we add fields, we'll need docs for that.

biosample attribute parser bug: value missing for all of record 2981385

should ftp linkouts be added as properties? if so which ones?

see the key/tags to expect below. You can have genbank/refseq for the whole file, as well as the assembly report and the stats report.

 ["Assembly_rpt"]=>
    string(120) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_report.txt"
    ["GenBank"]=>
    string(77) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/411/555/GCA_001411555.1_wgs.5d"
    ["RefSeq"]=>
    string(77) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d"
    ["Stats_rpt"]=>
    string(119) "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_stats.txt"

Keep in mind these FTP links are in the record's raw XML dump too, so enterprising users will find them there.

The RefSeq and GenBank FTP links seem like no-brainers to me for convenient data download.

My first thought is to add them as dbxrefs somehow so they show up as linkouts, but I don't know that the linkout URL could be built properly and consistently.

Maybe there's another Chado table that's better suited that I'm not thinking of.
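If we went the property route, picking out the links might look something like the sketch below. The function name example_select_ftp_links and the default choice of GenBank and RefSeq are assumptions for illustration. It only filters the parsed array and does not write anything to Chado.

```php
<?php
// Hypothetical sketch: the function name and default key list are assumptions.

/**
 * Picks the FTP links worth exposing and returns them as key => URL pairs.
 */
function example_select_ftp_links(array $ftp_links, array $wanted = ['GenBank', 'RefSeq']): array {
  $selected = [];
  foreach ($wanted as $key) {
    // Skip absent or empty entries, and anything that is not an FTP URL.
    if (!empty($ftp_links[$key]) && strpos($ftp_links[$key], 'ftp://') === 0) {
      $selected[$key] = $ftp_links[$key];
    }
  }
  return $selected;
}

// The parsed links from the example record above.
$links = [
  'Assembly_rpt' => 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_report.txt',
  'GenBank' => 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/411/555/GCA_001411555.1_wgs.5d',
  'RefSeq' => 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d',
  'Stats_rpt' => 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/411/555/GCF_001411555.1_wgs.5d/GCF_001411555.1_wgs.5d_assembly_stats.txt',
];

$props = example_select_ftp_links($links);
foreach ($props as $name => $url) {
  echo $name . ': ' . $url . PHP_EOL;
}
```

Passing a different $wanted list (for example adding 'Stats_rpt') would change which links become properties without touching the parser.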

term for full raw XML property?

so you'd have a prop that says "click to download FULL XML".

two flavors of accession that don't always match up

assembly:
uid is 557018 vs accession is GCF_000184155.1

For other types, the accession is the uid with a prefix, which is easier to support.

Unanswered question: does a query with GCF_000184155.1 work?
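To help test that, here is a small sketch that tells the two flavors apart and builds an ESearch URL for the accession case. The function names are hypothetical. The accession pattern assumes the usual GCA_/GCF_ form with nine digits and a version suffix. The sketch only builds the URL; it does not tell us whether NCBI returns the expected record for it.

```php
<?php
// Hypothetical sketch: function names are assumptions, not module code.

/**
 * Classifies an identifier as a numeric UID or an assembly accession.
 */
function example_identifier_type(string $id): string {
  if (preg_match('/^\d+$/', $id)) {
    return 'uid';
  }
  // GCA_ (GenBank) or GCF_ (RefSeq), nine digits, then a version number.
  if (preg_match('/^GC[AF]_\d{9}\.\d+$/', $id)) {
    return 'assembly_accession';
  }
  return 'unknown';
}

/**
 * Builds an ESearch URL that looks a term up in the given database.
 */
function example_esearch_url(string $db, string $term): string {
  $query = http_build_query(['db' => $db, 'term' => $term], '', '&');
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' . $query;
}

echo example_identifier_type('557018') . PHP_EOL;
echo example_identifier_type('GCF_000184155.1') . PHP_EOL;
echo example_esearch_url('assembly', 'GCF_000184155.1') . PHP_EOL;
```

If the search does resolve the accession to a UID, the loader could accept either flavor and normalize to the UID before fetching.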

too many property widgets!

solutions:

Put widget in a fieldset.

use feature_cvterm instead

flexibly parsing attributes, child values, child attributes, while dealing with key name overlap

Major challenges:

Data keys may not be consistent.
Location of data keys may not be consistent.

Consider the parser below. It's way too specific: it wants certain keys to exist in certain places.

I need to get a broad set of input XMLs and devise a strategy from what I see.

I think the best approach may be to go thoroughly through every child and attribute, looking for keys that match a set of triggers. Then, once we've built that, we try to figure out which one the "best" value is, deal with overlapping values, etc.

  private function bioproject_project($xml) {

    $info = [];

    // Don't expect parent attributes to matter.
    // $attributes = $xml->attributes();

    $children = $xml->children();

    foreach ($children as $key => $child) {

      switch ($key) {

        case 'ProjectDescr':
          // Information about the project itself. Includes title, description.
          break;

        case 'ProjectType':
          // Includes organism, metadata for project.
          $target = $child->ProjectTypeSubmission->Target;

          if (!$target) {
            break;
          }

          $organism = $target->Organism;

          if (!$organism) {
            break;
          }

          $attributes = $organism->attributes();

          // What about other children and their attributes?
          // Cast to string so we store the value, not a SimpleXMLElement.
          $info['type']['organism']['taxID'] = (string) $attributes['taxID'];
          break;

        case 'ProjectID':
          // Accession info for the project. Should match what was submitted,
          // that's about it.
          break;

        // Note: this must be the default label, not case 'default', or
        // unexpected tags are silently ignored.
        default:
          // Unexpected tag. Log it and bail out.
          tripal_log(t("Unexpected tag: !key", ['!key' => $key]));
          return FALSE;
      }
    }

    return $info;

  }
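The broader strategy described before the parser (visit every child and attribute, collect anything matching a set of triggers, decide on the "best" value later) could be sketched roughly like this. The function name example_collect_matches and the trigger list are assumptions. Recording the path of each match is one possible way to later tell overlapping key names apart.

```php
<?php
// Hypothetical sketch: function name and trigger list are assumptions.

/**
 * Recursively collects attributes and leaf elements whose names match a
 * trigger. $triggers must be lowercase; matching is case-insensitive.
 */
function example_collect_matches(SimpleXMLElement $node, array $triggers, string $path = ''): array {
  $found = [];
  $here = $path . '/' . $node->getName();

  // Attributes on this element.
  foreach ($node->attributes() as $name => $value) {
    if (in_array(strtolower($name), $triggers, TRUE)) {
      $found[] = [
        'key' => $name,
        'value' => (string) $value,
        'path' => $here . '/@' . $name,
      ];
    }
  }

  // The element itself, when it is a leaf holding text.
  if ($node->count() === 0 && in_array(strtolower($node->getName()), $triggers, TRUE)) {
    $text = trim((string) $node);
    if ($text !== '') {
      $found[] = ['key' => $node->getName(), 'value' => $text, 'path' => $here];
    }
  }

  // Recurse into children, wherever they happen to be located.
  foreach ($node->children() as $child) {
    $found = array_merge($found, example_collect_matches($child, $triggers, $here));
  }

  return $found;
}

$xml = simplexml_load_string(
  '<Project><ProjectType><ProjectTypeSubmission><Target>'
  . '<Organism species="72004" taxID="72004">'
  . '<OrganismName>Bos mutus</OrganismName>'
  . '</Organism>'
  . '</Target></ProjectTypeSubmission></ProjectType></Project>'
);

foreach (example_collect_matches($xml, ['taxid', 'organismname']) as $match) {
  echo $match['path'] . ' = ' . $match['value'] . PHP_EOL;
}
```

Nothing here depends on ProjectType or Target being present, so a record that nests Organism somewhere else would still be picked up. Choosing between duplicate matches is left open.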

overloading ncbi and api keys

carefully read
https://www.ncbi.nlm.nih.gov/books/NBK25497/

On December 1, 2018, NCBI will begin enforcing the use of API keys that will offer enhanced levels of supported access to the E-utilities. After that date, any site (IP address) posting more than 3 requests per second to the E-utilities without an API key will receive an error message. By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request (eutilities@ncbi.nlm.nih.gov). Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter.

Example request including an API key:
esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345

Example error message if rates are exceeded:
{"error":"API rate limit exceeded","count":"11"}

So we could add admin support for providing the user's API key (I think the key belongs to the user and not to third-party software, i.e. ours).
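A minimal sketch of what that could look like on the request side, assuming the key comes from an admin setting and is passed in as a plain string. The function names are hypothetical. The 3 and 10 requests-per-second figures are taken from the NCBI notice quoted above.

```php
<?php
// Hypothetical sketch: function names are assumptions, not module code.

/**
 * Builds an E-utilities URL, appending api_key only when one is configured.
 */
function example_eutils_url(string $endpoint, array $params, ?string $api_key = NULL): string {
  if ($api_key !== NULL && $api_key !== '') {
    $params['api_key'] = $api_key;
  }
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/' . $endpoint . '?'
    . http_build_query($params, '', '&');
}

/**
 * Minimum delay between requests, in microseconds: 10 requests per second
 * with a key, 3 per second without one.
 */
function example_min_interval_us(?string $api_key = NULL): int {
  $per_second = ($api_key !== NULL && $api_key !== '') ? 10 : 3;
  return (int) ceil(1000000 / $per_second);
}

echo example_eutils_url('esummary.fcgi', ['db' => 'pubmed', 'id' => 123456], 'ABCDE12345') . PHP_EOL;
echo example_min_interval_us('ABCDE12345') . PHP_EOL;
echo example_min_interval_us() . PHP_EOL;
```

Sleeping for example_min_interval_us() microseconds between requests (for instance with usleep()) would keep a single process under the limit. It does not coordinate several processes sharing one IP address.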

handling linked organisms

organisms are linked in the db most reliably via the NCBITAXON ID:

biomaterial

<Organism taxonomy_id="3981" taxonomy_name="Hevea brasiliensis">
  <OrganismName>Hevea brasiliensis</OrganismName>
</Organism>

assembly

<Organism>Canis lupus familiaris (dog)</Organism>
<SpeciesTaxid>9612</SpeciesTaxid>
<SpeciesName>Canis lupus</SpeciesName>

project:

<Organism species="72004" taxID="72004">
  <OrganismName>Bos mutus</OrganismName>
  <Strain>yakQH1</Strain>
  <Supergroup>eEukaryotes</Supergroup>
</Organism>

Furthermore, remember that organism is a required field for biosample.

Options:

Require the organism to be provided by the user. Not unreasonable.
Automatically pull the linked organism. In this case, we have to ask what we should do with organisms not found in the database. Import them? We tend to use organisms as intentional categories, so adding one without the admin being very cognizant of it sounds like a poor idea.
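For the second option, pulling the taxon ID out of the three record shapes shown above might look like the sketch below. The function name is hypothetical, and the wrapper elements in the examples (BioSample, DocumentSummary, Project) are only there to give each snippet a single root. It returns NULL when nothing is found, leaving the decision about missing organisms to the caller.

```php
<?php
// Hypothetical sketch: function name and wrapper elements are assumptions.

/**
 * Returns the NCBI taxon ID from a biosample, bioproject, or assembly
 * record, or NULL if none of the known locations is present.
 */
function example_extract_taxon_id(SimpleXMLElement $xml): ?string {
  // BioSample uses taxonomy_id, BioProject uses taxID, both on <Organism>.
  foreach ($xml->xpath('//Organism') ?: [] as $organism) {
    foreach (['taxonomy_id', 'taxID'] as $attribute) {
      if (isset($organism[$attribute])) {
        return (string) $organism[$attribute];
      }
    }
  }

  // Assembly keeps the ID in a separate <SpeciesTaxid> element.
  $taxids = $xml->xpath('//SpeciesTaxid') ?: [];
  if (count($taxids) > 0) {
    return trim((string) $taxids[0]);
  }

  return NULL;
}

$biosample = simplexml_load_string(
  '<BioSample><Organism taxonomy_id="3981" taxonomy_name="Hevea brasiliensis">'
  . '<OrganismName>Hevea brasiliensis</OrganismName></Organism></BioSample>'
);
$assembly = simplexml_load_string(
  '<DocumentSummary><Organism>Canis lupus familiaris (dog)</Organism>'
  . '<SpeciesTaxid>9612</SpeciesTaxid><SpeciesName>Canis lupus</SpeciesName>'
  . '</DocumentSummary>'
);
$project = simplexml_load_string(
  '<Project><Organism species="72004" taxID="72004">'
  . '<OrganismName>Bos mutus</OrganismName></Organism></Project>'
);

echo example_extract_taxon_id($biosample) . PHP_EOL;
echo example_extract_taxon_id($assembly) . PHP_EOL;
echo example_extract_taxon_id($project) . PHP_EOL;
```

For project records this reads taxID rather than species, in line with the earlier suggestion to link the taxID only.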

  public function getNCBIDB(string $db_name) {
    // Prefer the "NCBI <name>" form, compared case-insensitively.
    $name = "NCBI {$db_name}";

    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($name)]
    )->fetchObject();
    if ($db) {
      return $db;
    }

    // Fall back to the bare database name.
    $db = db_query(
      'SELECT * FROM chado.db WHERE UPPER(name) = :name',
      [':name' => strtoupper($db_name)]
    )->fetchObject();
    if ($db) {
      return $db;
    }

    // No matching record in chado.db.
    return FALSE;
  }


