
biosample-basex's People

Contributors

turbomam

biosample-basex's Issues

samples with missing Bioproject ids

For example, this sample:
https://www.ncbi.nlm.nih.gov/biosample/SAMN00000005/

has no Bioproject id in the harmonized attributes from the NCBI Biosample xml.

NCBI Bioproject does list links to Bioprojects for this sample, five of them in fact. These links are not necessarily the actual Bioproject id for that Biosample id, though; they may be something more like 'related information' or 'same organism'.

What is odd about this example is that the sample is actually:
Generic sample from Biomphalaria glabrata

Which sounds more like a generic placeholder sample for that organism?

This is still a little unclear and other examples may reveal more/different information.

Rename SQLite tables?

all_attribs -> long_all_attribs
catted_wide_harmonized_attributes -> wide_harmonized_attribs
biosample_basex_merged -> merged_wide
non_harmonized_attributes -> non_attrib_metadata
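
If the renames go ahead, SQLite's `ALTER TABLE ... RENAME TO` can apply them in a single transaction. The following is a minimal sketch, assuming the four old table names exist exactly as listed above; the `rename_tables` helper and the one-column demo tables are hypothetical and only there to make the example runnable.

```python
import sqlite3

# Proposed renames from the list above (old name -> new name).
RENAMES = {
    "all_attribs": "long_all_attribs",
    "catted_wide_harmonized_attributes": "wide_harmonized_attribs",
    "biosample_basex_merged": "merged_wide",
    "non_harmonized_attributes": "non_attrib_metadata",
}


def rename_tables(conn, renames):
    """Rename each table that exists; return the old names that were renamed."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    renamed = []
    with conn:  # commit on success, roll back on error
        for old, new in renames.items():
            if old in existing:
                conn.execute(f'ALTER TABLE "{old}" RENAME TO "{new}"')
                renamed.append(old)
    return renamed


# Demo on a throwaway in-memory database (hypothetical one-column tables).
conn = sqlite3.connect(":memory:")
for name in RENAMES:
    conn.execute(f'CREATE TABLE "{name}" (raw_id INTEGER)')
renamed = rename_tables(conn, RENAMES)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Skipping tables that do not exist makes the helper safe to rerun after a partial rename.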

Elephant in room: biosample_set.xml too big for one BaseX database now

biosample_set.xml

https://docs.basex.org/wiki/Statistics

FileSize               #Files               #Nodes                 #Attr     #ENames       #ANames       #URIs
512 GiB (2^39 Bytes)   536'870'912 (2^29)   2'147'483'648 (2^31)   no limit  32768 (2^15)  32768 (2^15)  256 (2^8)

No limits on 'DbSize' (total space on disk?) or 'Height' (?)

Total biosample_set.xml nodes: 2'345'750'066, which exceeds the 2'147'483'648 node limit

Name                         Resources  Size         Input Path                                                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
biosample_set                1          33669367047  /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml  
biosample_set_from_12960137  1          27995245425  /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml   

Database 'biosample_set_from_12960137' was opened in 368.67 ms.
Database Properties
 NAME: biosample_set_from_12960137
 SIZE: 27 GB
 NODES: 1054141679
 DOCUMENTS: 1
 BINARIES: 0
 TIMESTAMP: 2021-12-20T14:02:03.341Z
 UPTODATE: true

Resource Properties
 INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml
 INPUTSIZE: 26 GB
 INPUTDATE: 2021-12-20T13:19:44.317Z


Database 'biosample_set' was opened in 255.13 ms.
Database Properties
 NAME: biosample_set
 SIZE: 32 GB
 NODES: 1291608387
 DOCUMENTS: 1
 BINARIES: 0
 TIMESTAMP: 2021-12-20T14:57:57.172Z
 UPTODATE: true

Resource Properties
 INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml
 INPUTSIZE: 31 GB
 INPUTDATE: 2021-12-20T13:17:26.604Z
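
As a quick sanity check on the split, the arithmetic can be written out. This sketch only uses the NODES values from the two property listings above and the 2^31 node limit from the BaseX statistics page; nothing else is assumed.

```python
# Node counts copied from the two "Database Properties" listings above.
NODE_LIMIT = 2**31  # BaseX per-database node limit (2'147'483'648)
nodes = {
    "biosample_set": 1_291_608_387,
    "biosample_set_from_12960137": 1_054_141_679,
}

total = sum(nodes.values())
print(f"total nodes: {total:,}")
print(f"over the limit by: {total - NODE_LIMIT:,}")
for name, count in nodes.items():
    print(f"{name}: {count / NODE_LIMIT:.0%} of the limit")
```

The two halves sum to the 2'345'750'066 total quoted earlier, and each half on its own stays under the limit.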

Biosample fields to summarize for Montana

Bolded fields are not in the harmonized_wide (or harmonized_wide_repaired) table. Italicized fields are close matches to the requested field.

  • bacteria_carb_prod
  • collection_date
  • experimental_factor
  • growth_facil
  • investigation_type
  • link_addit_analys
  • microbial_biomass
  • samp_collect_device
  • samp_collec_method
  • samp_mat_process
  • samp_size
  • sample_name
  • sieving
  • size_frac
  • source_material_id
  • store_cond

From our previous discussion, here are the terms we’d like to pull so we can evaluate how they’re being used.

  • sample_name: I want to know if people are using a unique ID or a human-readable name
  • store_cond: DNA or soil? It’s in the “soil” section, not nucleic acid
  • growth_facil: only in plants, I want to know if it’s been used for other sample types & how
  • collection_date: how often do people include time
  • sieving: yes or no? or how do people detail it
  • bacteria_carb_prod: is this supposed to be respiration? What units do people provide
  • link_addit_analys vs investigation type
  • microbial_biomass: carbon? Units?
  • experimental_factor: how is this used
  • samp_size: total? Or amount used in extraction
  • size_frac: sample? Or DNA?
  • Source_mat_id: to link biosample to ID? Or parent-child relationships?
  • Samp_collec_device: DNA or biosample?
  • Samp_collec_method: DNA or biosample?
  • Samp_mat_process: DNA or biosample?
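
For the collection_date question, one way to estimate how often a time component is present is to classify each value by its shape. This is a rough sketch, assuming values loosely follow ISO 8601 and that a `T` or a space followed by `HH:MM` signals a time; the `classify_collection_date` helper and the example values are made up for illustration and are not taken from the Biosample data.

```python
import re
from collections import Counter

# A full date followed by "T" or a space and HH:MM counts as "has time".
DATE_TIME = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}")
# Year, year-month, or year-month-day with nothing after it.
DATE_ONLY = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def classify_collection_date(value):
    """Return 'date_time', 'date_only', or 'other' for one collection_date value."""
    value = value.strip()
    if DATE_TIME.match(value):
        return "date_time"
    if DATE_ONLY.match(value):
        return "date_only"
    return "other"


# Made-up example values.
examples = ["2015-06-01", "2015-06-01T14:30:00Z", "2015", "June 2015", "2015-06-01 09:15"]
counts = Counter(classify_collection_date(v) for v in examples)
print(counts)
```

The 'other' bucket is the interesting one to inspect by hand, since it collects free-text dates and anything the two patterns miss.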

Changes to biosample_non_harmonized_attributes_wide.xq query

Remove from

  • Query: biosample_non_harmonized_attributes_wide.xq
  • File: biosample_non_harmonized_attributes_wide.tsv
  • Table: non_harmonized_attributes
    • emp500_principal_investigator
    • emp500_study_id
    • emp500_title
      (these will be in the long all-attributes docs/table instead)

Remove for new minimal external ids/links policy:

  • dna_source
  • entrez_links
  • xref

Keep

  • sra_id
    • <Id db="SRA">SRS003378</Id>
  • doi
    • <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>

Add bioproject id xref

  • <Link type="entrez" target="bioproject">40793</Link>
    most don't have a PRJNA prefix... join to bioproject XML?

SRRs important for Bin Hu/EMP500 whole metagenomes (not amplicon studies)

Link types and targets

Labels and texts also available/common

4404408 entrez bioproject 
1679738 url  
  35866 entrez pubmed 
   9901 entrez omim 
    519 entrez biosample 
     62 entrez gds 
     57 entrez nuccore 
     16 entrez PUBMED 
      4 entrez PubMed 
      1 url publication 
      1 url GoldStamp Id 
      1 type target 
      1 entrez genome 
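
A tabulation like the one above can also be produced outside BaseX with a streaming parse. The following is a minimal sketch, assuming `Link` elements carry `type` and `target` attributes as in the snippets quoted earlier and sit somewhere inside `BioSample` elements; the `count_link_types` helper and the tiny inline document are illustrative only.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_link_types(source):
    """Count (type, target) pairs over all Link elements, streaming the input."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Link":
            counts[(elem.get("type", ""), elem.get("target", ""))] += 1
        elif elem.tag == "BioSample":
            elem.clear()  # drop the sample's children once it has been processed
    return counts


# Tiny stand-in for biosample_set.xml (assumed structure, not real records).
toy = b"""<BioSampleSet>
  <BioSample id="1"><Links>
    <Link type="entrez" target="bioproject">40793</Link>
    <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
  </Links></BioSample>
  <BioSample id="2"><Links>
    <Link type="entrez" target="bioproject">40793</Link>
  </Links></BioSample>
</BioSampleSet>"""
counts = count_link_types(io.BytesIO(toy))
for (link_type, target), n in counts.most_common():
    print(n, link_type, target)
```

Swapping the tag and attribute names would give the analogous count for `Id` elements and their `db` attribute.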

Add analogous tabulation of id@dbs?

  • long
  • could truncate to most common values

set up for production on large memory machines like NERSC cori

Include reporting of emp500 fields

am I backing myself into a corner wrt not running on computers with more modest RAM (32 GB?) in the future?

Be very clear about differences from the existing Perl approach. The current approach creates two TSVs: a long EAV of attributes and a wide table of non-attribute data. Casting the long EAV to wide is chunked but still needs lots of RAM. Merging the two wide tables is then tricky. Currently doing it in SQLite, but not happy with the duplicated index column and some other annoyances. Try in Python?!
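
To make the long-to-wide cast concrete, here is a small pure-Python sketch that pivots a long EAV stream one sample at a time, so only one sample's attributes are held in memory at once. It is not the logic of make_wide_ha_chunks.py. It assumes the long TSV is sorted by `raw_id` and has the columns `raw_id`, `attribute`, and `value`; those column names, the `pivot_long_to_wide` helper, and the example rows are all assumptions.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def pivot_long_to_wide(long_rows, columns):
    """Yield one wide dict per raw_id from long rows already sorted by raw_id."""
    wanted = set(columns)
    for raw_id, group in groupby(long_rows, key=itemgetter("raw_id")):
        wide = {"raw_id": raw_id, **{col: "" for col in columns}}
        for row in group:
            if row["attribute"] in wanted:
                wide[row["attribute"]] = row["value"]
        yield wide


# Made-up long EAV input.
long_tsv = io.StringIO(
    "raw_id\tattribute\tvalue\n"
    "5\tcollection_date\t2015-06-01\n"
    "5\tenv_package\tsoil\n"
    "6\tcollection_date\t2016\n"
)
columns = ["collection_date", "env_package"]
wide_rows = list(pivot_long_to_wide(csv.DictReader(long_tsv, delimiter="\t"), columns))
print(wide_rows)
```

The trade-off is that the full list of wide columns has to be known before the pivot starts, which means one extra pass over the long file to collect the distinct attribute names.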

add intermediate basex etc files for 'each' step in Makefile

We are currently clobbering the same .basex file for a few of the later steps in the Makefile. To improve the pipeline and keep better track of progress and issues we can think of which intermediate files would be useful to store along the way. Some of these could be tmp files that are deleted after the step completes.

log of xqueries that are really part of make process

  • attribute_plus_emp500_wide.xq
    • in comment
  • count_biosamples.xq
  • biosample_non_harmonized_attributes_wide.xq
  • biosample_harmonized_attributes_long.xq

  • chunk_harmonized_attributes_long.sh

  • make_wide_ha_chunks.py
  • cat_wide_ha_chunks.py

Create 'assumptions & limitations' doc for Biosample project

Incomplete list:

  • Only attributes with a harmonized name from the NCBI Biosample XML are included in the final data product.
  • About 50% of the sample records link to an NCBI Bioproject id.
  • Env package repair is done against MIxS packages on a best-effort basis and includes a confidence value.

add a units table like the wide repaired?

Chris suggested populating the repaired table with value unit strings

Marcin concerned about finding preferred values and applicability of value unit strings to ML

subset whole database on a few repaired env packages for Data Good

Subsetting rows based on environmental package

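-- keep only samples whose repaired env_package is one of the selected packages
create table if not exists non_attribute_metadata_sel_envs
as
select
	nam.*
from
	non_attribute_metadata nam
join harmonized_wide_repaired hwr
	on
	nam.raw_id = hwr.raw_id
-- hw contributes no columns; this inner join only drops rows absent from harmonized_wide
join harmonized_wide hw
	on
	nam.raw_id = hw.raw_id
where
	hwr.env_package in ('plant-associated', 'soil', 'sediment', 'water')
;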
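
The effect of the subsetting query can be checked on toy data with Python's `sqlite3` module. This is a sketch under stated assumptions: the two-column table layouts and the inserted rows are made up, and the join to harmonized_wide is left out because the query above uses it only as a row filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schemas and rows, for illustration only.
conn.executescript(
    """
    create table non_attribute_metadata (raw_id integer primary key, sra_id text);
    create table harmonized_wide_repaired (raw_id integer primary key, env_package text);
    insert into non_attribute_metadata values (1, 'SRS1'), (2, 'SRS2'), (3, 'SRS3');
    insert into harmonized_wide_repaired values (1, 'soil'), (2, 'human-gut'), (3, 'water');
    """
)

SELECTED = ("plant-associated", "soil", "sediment", "water")
placeholders = ", ".join("?" for _ in SELECTED)
rows = conn.execute(
    f"""
    select nam.*
    from non_attribute_metadata nam
    join harmonized_wide_repaired hwr on nam.raw_id = hwr.raw_id
    where hwr.env_package in ({placeholders})
    order by nam.raw_id
    """,
    SELECTED,
).fetchall()
print(rows)
```

Passing the package names as bound parameters keeps the list of selected packages in one place if it changes later.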
