turbomam / biosample-basex
Using the Base-X XML database to discover structure in NCBI's Biosample database
sample:
like the project location, the basex binaries, and the basex data directory
I intended these to differentiate between
For example, this sample:
https://www.ncbi.nlm.nih.gov/biosample/SAMN00000005/
has no Bioproject id in the harmonized attributes from the NCBI Biosample xml.
However, NCBI Bioproject does have links to Bioprojects for this sample, five of them in fact. These links are potentially not the same as an actual Bioproject id for that Biosample id, but rather something like 'related information' or 'same organism'.
What is odd about this example is that the sample is actually:
Generic sample from Biomphalaria glabrata
Which sounds more like a generic placeholder sample for that organism?
This is still a little unclear and other examples may reveal more/different information.
as opposed to /q
followed by XXX
for example joining INSDC EnvO strings/IDs
future: extract knowledge from non ASCII chars like mu
biosample_set.xml
BioSample
especially for EMP500?
Montana: not strictly following MIxS
We talked about moving this to INCATools, but I didn't see the green New button at https://github.com/orgs/INCATools/repositories
That makes me think I won't be able to move it.
Per Chris: call the repaired table biosample and the raw table biosample_raw
all_attribs -> long_all_attribs
catted_wide_harmonized_attributes -> wide_harmonized_attribs
biosample_basex_merged -> merged_wide
non_harmonized_attributes -> non_attrib_metadata
https://docs.basex.org/wiki/Statistics
FileSize | #Files | #Nodes | #Attr | #ENames | #ANames | #URIs |
---|---|---|---|---|---|---|
512 GiB(2^39 Bytes) | 536'870'912(2^29) | 2'147'483'648(2^31) | no limit | 32768(2^15) | 32768(2^15) | 256(2^8) |
No limits on 'DbSize' (total space on disk?) or 'Height' (?)
Total biosample_set.xml nodes 2'345'750'066 vs 2'147'483'648 limit
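The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two figures quoted above; the variable names are illustrative, and the "minimum number of databases" is just a ceiling division, not a statement about how the file should actually be split.

```python
import math

# Per-database node limit quoted from the BaseX statistics table (2^31).
BASEX_MAX_NODES = 2 ** 31

# Total node count for biosample_set.xml quoted above.
total_nodes = 2_345_750_066

# How far over the limit the single-file load would be.
excess = total_nodes - BASEX_MAX_NODES

# Smallest number of databases that could hold all nodes if split evenly.
min_databases = math.ceil(total_nodes / BASEX_MAX_NODES)

print(f"over the limit by {excess:,} nodes")
print(f"needs at least {min_databases} databases")
```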
Name Resources Size Input Path
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
biosample_set 1 33669367047 /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml
biosample_set_from_12960137 1 27995245425 /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml
Database 'biosample_set_from_12960137' was opened in 368.67 ms.
Database Properties
NAME: biosample_set_from_12960137
SIZE: 27 GB
NODES: 1054141679
DOCUMENTS: 1
BINARIES: 0
TIMESTAMP: 2021-12-20T14:02:03.341Z
UPTODATE: true
Resource Properties
INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_from_12960137_started.xml
INPUTSIZE: 26 GB
INPUTDATE: 2021-12-20T13:19:44.317Z
Database 'biosample_set_from_12960137' was opened in 255.13 ms.
Database Properties
NAME: biosample_set
SIZE: 32 GB
NODES: 1291608387
DOCUMENTS: 1
BINARIES: 0
TIMESTAMP: 2021-12-20T14:57:57.172Z
UPTODATE: true
Resource Properties
INPUTPATH: /global/cfs/cdirs/m3513/endurable/biosample/mam/biosample-basex/target/biosample_set_through_12960136_ended.xml
INPUTSIZE: 31 GB
INPUTDATE: 2021-12-20T13:17:26.604Z
Bolded fields are not in the harmonized_wide (or harmonized_wide_repaired) table. Italicized fields are close matches to the requested field.
From our previous discussion, here are the terms we’d like to pull so we can evaluate how they’re being used.
Remove from
biosample_non_harmonized_attributes_wide.xq
biosample_non_harmonized_attributes_wide.tsv
non_harmonized_attributes
emp500_principal_investigator
emp500_study_id
emp500_title
Remove for new minimal external ids/links policy:
dna_source
entrez_links
xref
Keep
sra_id
<Id db="SRA">SRS003378</Id>
doi
<Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
Add bioproject id xref
<Link type="entrez" target="bioproject">40793</Link>
SRRs important for Bin Hu/EMP500 whole metagenomes (not amplicon studies)
Labels and texts also available/common
4404408 entrez bioproject
1679738 url
35866 entrez pubmed
9901 entrez omim
519 entrez biosample
62 entrez gds
57 entrez nuccore
16 entrez PUBMED
4 entrez PubMed
1 url publication
1 url GoldStamp Id
1 type target
1 entrez genome
Add an analogous tabulation of id@db values?
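A tabulation like the one above, plus the analogous one for Id db values, could be produced by streaming the XML. The following is a minimal sketch, not the repo's XQuery: the function name is hypothetical, and the wrapper elements in the sample (BioSampleSet, Ids, Links) are assumptions about the layout; only the Id and Link lines are taken from the notes above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_links_and_ids(xml_source):
    """Stream an XML source and count Link (type, target) pairs and Id db values."""
    link_counts = Counter()
    id_db_counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Link":
            # target is None when the attribute is absent (e.g. plain url links)
            link_counts[(elem.get("type"), elem.get("target"))] += 1
        elif elem.tag == "Id":
            id_db_counts[elem.get("db")] += 1
        elif elem.tag == "BioSample":
            # children were already counted; drop them to limit memory use
            elem.clear()
    return link_counts, id_db_counts


# Tiny sample; the nesting is assumed, the Id and Link lines come from the notes.
SAMPLE = b"""<BioSampleSet>
  <BioSample>
    <Ids><Id db="SRA">SRS003378</Id></Ids>
    <Links>
      <Link type="url" label="DOI">http://dx.doi.org/10.1016/0967-0645(96)00005-7</Link>
      <Link type="entrez" target="bioproject">40793</Link>
    </Links>
  </BioSample>
</BioSampleSet>"""

links, id_dbs = tally_links_and_ids(io.BytesIO(SAMPLE))
for (link_type, target), count in links.most_common():
    print(count, link_type, target or "")
for db, count in id_dbs.most_common():
    print(count, db)
```

Because attribute values are counted exactly as written, case variants such as PUBMED and PubMed would stay separate, as they do in the counts above.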
All columns except raw_id are TEXT now
consult MIxS vs empirical observations
Include reporting of emp500 fields
Am I backing myself into a corner with respect to running on computers with more modest RAM (32 GB?) in the future?
Be very clear about differences from the existing Perl approach. This pipeline creates two TSVs: a long EAV table of attributes and a wide table of non-attribute data. Casting the long EAV to wide is chunked but still needs lots of RAM. Merging the two wide tables is then tricky. Currently doing it in SQLite, but not happy with the duplicated index column and some other annoyances. Try in Python?!
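As a starting point for the "try in Python" idea, here is a minimal sketch of the two steps: pivoting long EAV rows to wide, then merging with the non-attribute table on raw_id so that only one id column survives. The function names and sample rows are hypothetical, and everything is held in memory, so this does not address the RAM concern by itself; it would need to be applied per chunk of raw_id values.

```python
from collections import defaultdict


def eav_to_wide(eav_rows):
    """Pivot (raw_id, attribute, value) triples into one dict per raw_id."""
    wide = defaultdict(dict)
    for raw_id, attribute, value in eav_rows:
        if attribute in wide[raw_id]:
            # repeated attributes for one sample are joined, not overwritten
            wide[raw_id][attribute] += "|" + value
        else:
            wide[raw_id][attribute] = value
    return wide


def merge_wide(non_attribute_rows, wide_attributes):
    """Left-join non-attribute rows (dicts with 'raw_id') to pivoted attributes."""
    merged = []
    for row in non_attribute_rows:
        combined = dict(row)  # raw_id appears once, no duplicated index column
        combined.update(wide_attributes.get(row["raw_id"], {}))
        merged.append(combined)
    return merged


# Hypothetical rows for illustration only.
eav = [
    (5, "env_package", "soil"),
    (5, "depth", "10 cm"),
    (6, "env_package", "water"),
]
non_attr = [
    {"raw_id": 5, "sra_id": "SRS003378"},
    {"raw_id": 7, "sra_id": ""},
]

merged = merge_wide(non_attr, eav_to_wide(eav))
for row in merged:
    print(row)
```

Note that an attribute sharing a name with a non-attribute column would overwrite that column in this sketch; a prefix on one side would avoid the collision.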
On my Ubuntu 20 desktop:
determining attributes
2022-01-03 11:21:20.011939
Traceback (most recent call last):
File "util/extract_harmonizeds.py", line 58, in <module>
starts = list(range(0, aa_max_id_res, chunk_size))
TypeError: 'NoneType' object cannot be interpreted as an integer
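The traceback shows that aa_max_id_res was None when passed to range(). One plausible cause, offered here as an assumption rather than a diagnosis, is a max() query against an empty table, which returns NULL in SQLite. The sketch below reproduces that situation and adds a guard with a clearer error; the helper name, table name, and column name are hypothetical.

```python
import sqlite3


def chunk_starts(max_id, chunk_size):
    """Return chunk start offsets, failing loudly if the max id is missing."""
    if max_id is None:
        raise ValueError(
            "max id query returned nothing; is the attribute table empty or missing rows?"
        )
    # Same expression as the failing line, now with a validated integer.
    return list(range(0, int(max_id), chunk_size))


# max() over an empty table yields NULL, which Python sees as None.
conn = sqlite3.connect(":memory:")
conn.execute("create table all_attribs (id integer)")  # hypothetical schema
empty_max = conn.execute("select max(id) from all_attribs").fetchone()[0]
print(empty_max)  # None

try:
    chunk_starts(empty_max, 1000)
except ValueError as err:
    print(f"caught: {err}")
```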
We are currently clobbering the same .basex file for a few of the later steps in the Makefile. To improve the pipeline and keep better track of progress and issues we can think of which intermediate files would be useful to store along the way. Some of these could be tmp files that are deleted after the step completes.
one has to be gte or lte
where xs:integer($bs_id_val) > $min_bs_id_val
and xs:integer($bs_id_val) < $max_bs_id_val
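The reason one bound has to be inclusive: with strict comparisons on both sides, any id that equals a chunk boundary matches no chunk and is silently dropped. A minimal sketch of the half-open convention (low <= id < high) follows; the function names and the id range are illustrative. In XQuery the analogous change would be using >= on the lower bound.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield (low, high) pairs meant to be tested as low <= id < high."""
    low = min_id
    while low <= max_id:
        yield low, low + chunk_size
        low += chunk_size


def in_half_open(bs_id, low, high):
    """Inclusive lower bound, exclusive upper bound."""
    return low <= bs_id < high


def in_strict(bs_id, low, high):
    """Both bounds exclusive, as in the where clause above."""
    return low < bs_id < high


ranges = list(id_ranges(1, 10, 5))
ids = range(1, 11)

# Ids that no chunk claims under each convention.
dropped_half_open = [i for i in ids if not any(in_half_open(i, lo, hi) for lo, hi in ranges)]
dropped_strict = [i for i in ids if not any(in_strict(i, lo, hi) for lo, hi in ranges)]

print(ranges)
print(dropped_half_open)
print(dropped_strict)
```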
I have tried to infer the normalization in the past, but the low number of targets and the high diversity of patterns have led me to curate by hand recently
Incomplete list:
Chris suggested populating the repaired table with value unit strings
Marcin concerned about finding preferred values and applicability of value unit strings to ML
biosample_basex.db -> biosample_from_xq.db
create table if not exists non_attribute_metadata_sel_envs
as
select
    nam.*
from
    non_attribute_metadata nam
join harmonized_wide_repaired hwr
    on
    nam.raw_id = hwr.raw_id
join harmonized_wide hw
    on
    nam.raw_id = hw.raw_id
where
    hwr.env_package in ('plant-associated', 'soil', 'sediment', 'water')
;