The biosample-basex from turbomam

Special handling for Keywords portion of paragraph element?

sample:

Keywords: GSC:MIxS;MIGS:5.0
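
One possible handling, sketched under the assumption that the keywords always follow a literal `Keywords:` label and are separated by semicolons, as in the sample above. `parse_keywords` is a hypothetical helper, not something in the repo.

```python
def parse_keywords(paragraph_text):
    """Return the keywords from a 'Keywords: a;b' paragraph, or [] if the label is absent."""
    label = "Keywords:"
    text = paragraph_text.strip()
    if not text.startswith(label):
        return []
    # Split only on ';' so colons inside a keyword (e.g. 'GSC:MIxS') survive.
    return [kw.strip() for kw in text[len(label):].split(";") if kw.strip()]


keywords = parse_keywords("Keywords: GSC:MIxS;MIGS:5.0")
print(keywords)
```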

use .env for Makefile parameters

like the project location, the basex binaries, and the basex data directory

Are triple pipe ||| delimiters justified?

I intended these to differentiate between

single pipes already present in the XML
XML records concatenated when converting to SQLite
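
A minimal sketch of the idea, assuming the concatenation happens in Python before the values reach SQLite. `join_values` and `split_values` are hypothetical names used only for illustration.

```python
DELIMITER = "|||"


def join_values(values):
    """Concatenate repeated values into one cell, refusing values that already contain the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError(f"value already contains {DELIMITER!r}: {value!r}")
    return DELIMITER.join(values)


def split_values(cell):
    """Recover the individual values from a concatenated cell."""
    return cell.split(DELIMITER)


# A single pipe inside a value is preserved across the round trip.
cell = join_values(["marine|coastal", "soil"])
print(cell)
print(split_values(cell))
```

A value that begins or ends with a pipe can still blur the boundary (for example `a|` followed by `b`), so this does not settle the question by itself.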

samples with missing Bioproject ids

For example, this sample:
https://www.ncbi.nlm.nih.gov/biosample/SAMN00000005/

has no Bioproject id in the harmonized attributes from the NCBI Biosample xml.

NCBI Bioproject does have links to Bioprojects for this sample, five of them in fact. However, these links may not represent an actual Bioproject id for that Biosample id; they may be something like 'related information' or 'same organism'.

What is odd about this example is that the sample is actually:
Generic sample from Biomphalaria glabrata

Which sounds more like a generic placeholder sample for that organism?

This is still a little unclear and other examples may reveal more/different information.

switch to $d deletion for biosample_set_under

as opposed to /q followed by XXX

include ENVO semantic SQL tables in biosample SQLite

for example joining INSDC EnvO strings/IDs

normalize whitespace in INSDC input
lowercase
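
A rough sketch of that normalization, assuming it is applied to each INSDC string before any join against EnvO labels. `normalize_insdc_string` is a hypothetical helper and the input string is a made-up example.

```python
import re


def normalize_insdc_string(raw):
    """Collapse runs of whitespace to single spaces, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", raw).strip().lower()


normalized = normalize_insdc_string("  Marine   Biome\t[ENVO:00000447] ")
print(normalized)
```

Lowercasing also lowercases any embedded ID prefix (`ENVO:` becomes `envo:`), so IDs may need to be matched case-insensitively or extracted before this step.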

future: extract knowledge from non-ASCII chars like mu

Use basex command 'info index' to display paths summary

Are there enough DOIs to justify a column in SQLite?

2021-10-11 biosample_set.xml
20 415 925 BioSamples
395 with at least one DOI
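
For scale, the share can be computed directly from the two counts above:

```python
total_biosamples = 20_415_925  # BioSamples in the 2021-10-11 biosample_set.xml
with_doi = 395                 # BioSamples with at least one DOI

fraction = with_doi / total_biosamples
print(f"{fraction:.6%} of BioSamples have at least one DOI")
```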

do we have minimal metadata rules for inclusion in NMDC

especially for EMP500?

Montana: not strictly following MIxS

zip reports from ha_highlights.py

distinguish raw id from "BIOSAMPLE:" + id
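
A small sketch of keeping the two forms apart, assuming the raw id is an integer and the prefixed form is a plain string. The function names and the example id are illustrative only.

```python
PREFIX = "BIOSAMPLE:"


def to_prefixed_id(raw_id):
    """Build the prefixed identifier from an integer raw id."""
    return f"{PREFIX}{int(raw_id)}"


def to_raw_id(identifier):
    """Return the integer raw id, accepting either the prefixed or the bare form."""
    text = str(identifier)
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return int(text)


prefixed = to_prefixed_id(12960136)
print(prefixed, to_raw_id(prefixed))
```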

I don't have permission to move this to INCATools

We talked about moving this to INCATools, but I didn't see the green New button at https://github.com/orgs/INCATools/repositories

That makes me think I won't be able to move it

rename the python script that claims to concatenate but currently writes the wides to SQLite

make clean will nuke target/env_package_repair_curated.tsv

per Chris call repaired table `biosample` and the raw table `biosample_raw`

get exp types and descriptions from BioProject

env_package repair only considers env_package, not anything else like model

Rename SQLite tables?

all_attribs -> long_all_attribs
catted_wide_harmonized_attributes -> wide_harmonized_attribs
biosample_basex_merged -> merged_wide
non_harmonized_attributes -> non_attrib_metadata

Elephant in room: biosample_set.xml too big for one BaseX database now

biosample_set.xml

https://docs.basex.org/wiki/Statistics

FileSize	#Files	#Nodes	#Attr	#ENames	#ANames	#URIs
512 GiB(2^39 Bytes)	536'870'912(2^29)	2'147'483'648(2^31)	no limit	32768(2^15)	32768(2^15)	256(2^8)

No limits on 'DbSize' (total space on disk?) or 'Height' (?)

Total biosample_set.xml nodes 2'345'750'066 vs 2'147'483'648 limit

Name                         Resources  Size         Input Path                                                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
biosample_set                1          33669367047  /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml  
biosample_set_from_12960137  1          27995245425  /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml   

Database 'biosample_set_from_12960137' was opened in 368.67 ms.
Database Properties
 NAME: biosample_set_from_12960137
 SIZE: 27 GB
 NODES: 1054141679
 DOCUMENTS: 1
 BINARIES: 0
 TIMESTAMP: 2021-12-20T14:02:03.341Z
 UPTODATE: true

Resource Properties
 INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml
 INPUTSIZE: 26 GB
 INPUTDATE: 2021-12-20T13:19:44.317Z

Database 'biosample_set_from_12960137' was opened in 255.13 ms.
Database Properties
 NAME: biosample_set
 SIZE: 32 GB
 NODES: 1291608387
 DOCUMENTS: 1
 BINARIES: 0
 TIMESTAMP: 2021-12-20T14:57:57.172Z
 UPTODATE: true

Resource Properties
 INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml
 INPUTSIZE: 31 GB
 INPUTDATE: 2021-12-20T13:17:26.604Z
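
The node counts reported above can be checked against the documented limit with a few lines of arithmetic. The numbers are copied from the `NODES` properties of the two databases; nothing else is assumed.

```python
NODE_LIMIT = 2**31  # per-database node limit from the BaseX statistics page

node_counts = {
    "biosample_set": 1_291_608_387,
    "biosample_set_from_12960137": 1_054_141_679,
}

total_nodes = sum(node_counts.values())
excess = total_nodes - NODE_LIMIT

print(f"total nodes: {total_nodes:,} (over the limit by {excess:,})")
for name, nodes in node_counts.items():
    status = "under" if nodes < NODE_LIMIT else "OVER"
    print(f"{name}: {nodes:,} nodes, {status} the limit")
```

The two halves sum to the 2'345'750'066 total quoted earlier, and each half fits under the limit with room to spare.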

Biosample fields to summarize for Montana

Bolded fields are not in the harmonized_wide (or harmonized_wide_repaired) table. Italicized fields are close matches to the requested field.

bacteria_carb_prod
collection_date
experimental_factor
growth_facil
investigation_type
link_addit_analys
microbial_biomass
samp_collect_device
samp_collec_method
samp_mat_process
samp_size
sample_name
sieving
size_frac
source_material_id
store_cond

From our previous discussion, here are the terms we’d like to pull so we can evaluate how they’re being used.

sample_name: I want to know if people are using a unique ID or a human-readable name
store_cond: DNA or soil? It’s in the “soil” section, not nucleic acid
growth_facil: only in plants, I want to know if it’s been used for other sample types & how
collection_date: how often do people include time
sieving: yes or no? or how do people detail it
bacteria_carb_prod: is this supposed to be respiration? What units do people provide?
link_addit_analys vs investigation type
microbial_biomass: carbon? Units?
experimental_factor: how is this used
samp_size: total? Or amount used in extraction
size_frac: sample? Or DNA?
Source_mat_id: to link biosample to ID? Or parent-child relationships?
Samp_collec_device: DNA or biosample?
Samp_collec_method: DNA or biosample?
Samp_mat_process: DNA or biosample?

write directly to SQLite

https://docs.basex.org/wiki/SQL_Module

Changes to biosample_non_harmonized_attributes_wide.xq query

Remove from

Query: biosample_non_harmonized_attributes_wide.xq
File: biosample_non_harmonized_attributes_wide.tsv
Table: non_harmonized_attributes
- emp500_principal_investigator
- emp500_study_id
- emp500_title
  will be in long all attributes docs/table

Remove for new minimal external ids/links policy:

dna_source
entrez_links
xref

Keep

sra_id
- <Id db="SRA">SRS003378</Id>
doi
- <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
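
A sketch of pulling just those two kept fields with the standard library, using the two elements quoted above. The wrapping `BioSample`, `Ids`, and `Links` elements are an assumed minimal context for the example, not a full record.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<BioSample>
  <Ids>
    <Id db="SRA">SRS003378</Id>
  </Ids>
  <Links>
    <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
  </Links>
</BioSample>
"""

root = ET.fromstring(SAMPLE_XML)

# sra_id: Id elements whose db attribute is "SRA"
sra_ids = [el.text for el in root.iter("Id") if el.get("db") == "SRA"]

# doi: Link elements labelled "DOI"
dois = [el.text for el in root.iter("Link") if el.get("label") == "DOI"]

print(sra_ids, dois)
```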

Add bioproject id xref

<Link type="entrez" target="bioproject">40793</Link>
most don't have a PRJNA prefix... join to bioproject XML?

SRRs important for Bin Hu/EMP500 whole metagenomes (not amplicon studies)

Link types and targets

Labels and texts also available/common

4404408 entrez bioproject 
1679738 url  
  35866 entrez pubmed 
   9901 entrez omim 
    519 entrez biosample 
     62 entrez gds 
     57 entrez nuccore 
     16 entrez PUBMED 
      4 entrez PubMed 
      1 url publication 
      1 url GoldStamp Id 
      1 type target 
      1 entrez genome
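
The tabulation above could be reproduced in Python roughly as follows. This is a sketch on a toy document: the bioproject link is the one quoted earlier, and the pubmed links carry placeholder text. `tabulate_links` is a hypothetical helper. The `fold_case` option illustrates merging variants such as `pubmed`, `PUBMED`, and `PubMed`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE_XML = """
<BioSampleSet>
  <BioSample><Links>
    <Link type="entrez" target="bioproject">40793</Link>
    <Link type="entrez" target="pubmed">placeholder</Link>
  </Links></BioSample>
  <BioSample><Links>
    <Link type="entrez" target="PubMed">placeholder</Link>
    <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
  </Links></BioSample>
</BioSampleSet>
"""


def tabulate_links(root, fold_case=False):
    """Count Link elements by (type, target); a missing target is counted as ''."""
    counts = Counter()
    for link in root.iter("Link"):
        link_type = link.get("type", "")
        target = link.get("target", "")
        if fold_case:
            target = target.lower()
        counts[(link_type, target)] += 1
    return counts


root = ET.fromstring(SAMPLE_XML)
raw_counts = tabulate_links(root)
folded_counts = tabulate_links(root, fold_case=True)

for (link_type, target), n in folded_counts.most_common():
    print(f"{n:8d} {link_type} {target}")
```

The same function could be pointed at `Id` elements and their `db` attribute for the id tabulation.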

Add analogous tabulation of id@dbs?

long
could truncate to most common values

are there any columns that are really numerical?

all columns except raw_id are TEXT now
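
One empirical check, sketched under the assumption that a column counts as "really numerical" when most of its non-empty values parse as a float. `fraction_numeric` and the sample values are illustrative only.

```python
def fraction_numeric(values):
    """Share of non-empty values that parse as a float (0.0 if there are none)."""
    non_empty = [v for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return 0.0

    def is_number(text):
        try:
            float(text)
        except ValueError:
            return False
        return True

    return sum(is_number(v) for v in non_empty) / len(non_empty)


# Two of the three non-empty values parse; "5 g" carries a unit and does not.
share = fraction_numeric(["10", "2.5", "", "5 g", None])
print(f"{share:.2f}")
```

Note that `float` also accepts strings such as `nan` and `inf`, so a stricter pattern may be needed for free-text columns.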

consult MIxS vs empirical observations

cori requires module load python

create a table with percent non-empty per column from the wide view
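
A possible shape for that report, using the standard `sqlite3` module on a toy in-memory table. The table name `harmonized_wide` comes from these notes; the rows and the `percent_non_empty` helper are made up for illustration, and "empty" is assumed to mean NULL or blank after trimming.

```python
import sqlite3


def percent_non_empty(conn, table):
    """Return {column: percent of rows where the value is neither NULL nor blank}."""
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    result = {}
    for col in columns:
        filled = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" '
            f'WHERE "{col}" IS NOT NULL AND TRIM("{col}") != \'\''
        ).fetchone()[0]
        result[col] = 100.0 * filled / total if total else 0.0
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE harmonized_wide (raw_id INTEGER, env_package TEXT, sieving TEXT)")
conn.executemany(
    "INSERT INTO harmonized_wide VALUES (?, ?, ?)",
    [(1, "soil", "yes"), (2, "water", None), (3, "", None), (4, "soil", "")],
)

report = percent_non_empty(conn, "harmonized_wide")
for column, pct in report.items():
    print(f"{column}: {pct:.1f}%")
```

This runs one query per column, which is simple but may be slow on a very wide table; a single query with one aggregate per column would be an alternative.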

set up for production on large memory machines like NERSC cori

Include reporting of emp500 fields

am I backing myself into a corner wrt not running on computers with more modest RAM (32 GB?) in the future?

Be very clear about differences from the existing Perl approach. Creates two TSVs: a long EAV of attributes and a wide table of non-attribute data. Casting the long EAV to wide is chunked but still needs lots of RAM. Merging the two wide tables is then tricky. Currently doing it in SQLite, but not happy with the duplicated index column and some other annoyances. Try in Python?!
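
A small-scale sketch of what the Python route might look like, using only the standard library. The rows, the `sample_name` column, and both helper names are invented for illustration; this does not address the RAM question, it only shows the pivot and a merge keyed on `raw_id` with a single index column.

```python
from collections import defaultdict

# Long EAV rows: (raw_id, attribute, value)
long_eav = [
    (1, "env_package", "soil"),
    (1, "sieving", "2 mm"),
    (2, "env_package", "water"),
]

# Wide non-attribute data keyed by raw_id
non_attribute = {
    1: {"sample_name": "A"},
    2: {"sample_name": "B"},
    3: {"sample_name": "C"},
}


def eav_to_wide(rows):
    """Pivot long EAV rows into {raw_id: {attribute: value}}; a repeated attribute overwrites."""
    wide = defaultdict(dict)
    for raw_id, attribute, value in rows:
        wide[raw_id][attribute] = value
    return wide


def merge_wide(non_attr, attr_wide):
    """Left-join the attribute columns onto the non-attribute rows, keeping one raw_id column."""
    columns = sorted({a for attrs in attr_wide.values() for a in attrs})
    merged = []
    for raw_id, meta in sorted(non_attr.items()):
        row = {"raw_id": raw_id, **meta}
        attrs = attr_wide.get(raw_id, {})
        for col in columns:
            row[col] = attrs.get(col)  # None when the sample lacks the attribute
        merged.append(row)
    return merged


merged = merge_wide(non_attribute, eav_to_wide(long_eav))
for row in merged:
    print(row)
```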

util/extract_harmonizeds.py: TypeError: 'NoneType' object cannot be interpreted as an integer

On my Ubuntu 20 desktop:

determining attributes
2022-01-03 11:21:20.011939
Traceback (most recent call last):
  File "util/extract_harmonizeds.py", line 58, in <module>
    starts = list(range(0, aa_max_id_res, chunk_size))
TypeError: 'NoneType' object cannot be interpreted as an integer
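
The error itself means `aa_max_id_res` was `None` when passed to `range`. One way that can happen, offered here only as an assumption about the cause, is a `MAX(...)` query over an empty table, which returns NULL in SQLite. The sketch below reproduces that situation and adds a guard; the `all_attribs` table name is taken from these notes, while the `raw_id` column and the `chunk_starts` helper are hypothetical.

```python
import sqlite3


def chunk_starts(max_id, chunk_size):
    """Return chunk start offsets, or an empty list when no max id is available."""
    if max_id is None:
        return []
    return list(range(0, max_id, chunk_size))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_attribs (raw_id INTEGER)")

# MAX over an empty table yields NULL, which sqlite3 returns as None.
aa_max_id_res = conn.execute("SELECT MAX(raw_id) FROM all_attribs").fetchone()[0]

starts = chunk_starts(aa_max_id_res, 1000)
print(aa_max_id_res, starts)
```

In the real script it may be more useful to fail with a clear message than to return an empty list, since an empty attribute table probably signals an earlier step that did not run.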

recreate non-attribute + harmonized attribute view

env_package repair: examine various taxon fields too?

add intermediate basex etc files for 'each' step in Makefile

We are currently clobbering the same .basex file for a few of the later steps in the Makefile. To improve the pipeline and keep better track of progress and issues we can think of which intermediate files would be useful to store along the way. Some of these could be tmp files that are deleted after the step completes.

fencepost errors in chunk_harmonized_attributes_long.xq

one has to be gte or lte

where xs:integer(
  $bs_id_val
) > $min_bs_id_val
and xs:integer(
  $bs_id_val
) < $max_bs_id_val
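
The effect of two strict comparisons can be illustrated outside XQuery. In this Python sketch, which uses made-up ids and a hypothetical chunk size of 1000, ids that fall exactly on a chunk boundary are matched by no chunk when both comparisons are strict. Making the lower bound inclusive assigns every id to exactly one chunk.

```python
def chunk_bounds(max_id, chunk_size):
    """Return (lower, upper) pairs covering 0..max_id."""
    return [(lo, lo + chunk_size) for lo in range(0, max_id + 1, chunk_size)]


def in_chunk(bs_id, lo, hi):
    """Inclusive lower bound, exclusive upper bound."""
    return lo <= bs_id < hi


ids = [0, 999, 1000, 1001, 2000]
bounds = chunk_bounds(max(ids), 1000)

# Number of chunks each id falls into under each comparison style.
half_open_hits = {i: sum(in_chunk(i, lo, hi) for lo, hi in bounds) for i in ids}
strict_hits = {i: sum(lo < i < hi for lo, hi in bounds) for i in ids}

print("half-open:", half_open_hits)
print("strict:   ", strict_hits)
```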

Add normalized 'env_package's to SQLite?

I have tried to infer the normalization in the past, but the low number of targets and high diversity of patterns has led me to do hand curation recently

log of xqueries that are really part of make process

attribute_plus_emp500_wide.xq
- in comment
count_biosamples.xq
biosample_non_harmonized_attributes_wide.xq
biosample_harmonized_attributes_long.xq

chunk_harmonized_attributes_long.sh

make_wide_ha_chunks.py
cat_wide_ha_chunks.py

Create 'assumptions & limitations' doc for Biosample project

Incomplete list:

Only attributes with a harmonized name from the NCBI Biosample XML are included in the final data product.
About 50% of the sample records link to an NCBI Bioproject id.
Env package repair is against MIxS packages, best effort and includes confidence.

Subsetting rows based on environmental package

create table if not exists non_attribute_metadata_sel_envs
as
select
	nam.*
from
	non_attribute_metadata nam
join harmonized_wide_repaired hwr 
	on
	nam.raw_id = hwr.raw_id
join harmonized_wide hw
	on
	nam.raw_id = hw.raw_id
where
	hwr.env_package in ('plant-associated', 'soil', 'sediment', 'water')
;
