gelog / 1kg-metadata
1000 Genome Project Metadata
License: Apache License 2.0
Write a script that downloads, cleans, and normalizes the metadata of all the sequences on the FTP (e.g. sequence.index).
- Input: the URL of the index file (e.g. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index)
- Output: CSV with the columns of the FASTQ.Sequence tab in https://docs.google.com/spreadsheets/d/1EXe7HnYFFnJvKN9IIARDg2flwtzVTftTpHwt7f64aaI/edit#gid=0
- Download sequence.index to a local file. The local file's modified date should match the FTP date (for debugging).
- If sequence.index has not changed since the last download (file size and modified date are the same), the script should stop executing gracefully (nothing to do).
Example of how this script could be called:
./sequences-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index > sequences.csv
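The "nothing to do" check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's implementation: the function names `is_unchanged` and `stamp_mtime` are hypothetical, the remote size and modification time are assumed to have been read already (for example from the FTP listing), and GNU coreutils (`stat -c`, `touch -d`) are assumed to be available.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical; GNU stat and touch are assumed.

# is_unchanged LOCAL_FILE REMOTE_SIZE_BYTES REMOTE_MTIME_EPOCH
# Succeeds (exit status 0) only when the local copy exists and both its size
# and its modification time equal the remote values.
is_unchanged() {
  local file="$1" remote_size="$2" remote_mtime="$3"
  [ -f "$file" ] || return 1
  local local_size local_mtime
  local_size=$(stat -c %s "$file")    # size in bytes
  local_mtime=$(stat -c %Y "$file")   # mtime as seconds since the epoch
  [ "$local_size" = "$remote_size" ] && [ "$local_mtime" = "$remote_mtime" ]
}

# stamp_mtime LOCAL_FILE REMOTE_MTIME_EPOCH
# Sets the local file's modified date to the remote one, so that the next
# run can compare the two.
stamp_mtime() {
  touch -d "@$2" "$1"
}
```

A calling script could run `is_unchanged` before downloading and exit with status 0 when it succeeds, then call `stamp_mtime` after each fresh download.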
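Write a script that recursively crawls an FTP server to collect the metadata of its files. For each entry, the output should include:
- the FTP server (e.g. ftp://ftp.ncbi.nlm.nih.gov/)
- the directory path (e.g. /snp/organisms/human_9606/VCF)
- the file name (e.g. 00-All.vcf.gz)
- the link target (e.g. All_20150603.vcf.gz), for when the entry is a symbolic link
- the size (e.g. 19) in bytes
- the modified date (e.g. 2015-06-08T15:16:20Z) in ISO 8601 format in the UTC timezone
Example of how this script could be called: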
./files-crawler.sh ftp://ftp.ncbi.nlm.nih.gov/ > files.csv
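As an illustration of the output format, the sketch below turns one crawled entry into a CSV row with the date in ISO 8601 UTC. The function names `to_iso8601_utc` and `file_row`, the column order, and the use of an epoch timestamp as input are all assumptions made for this example; GNU `date` is assumed as well.

```shell
#!/bin/bash
# Sketch only: names and column order are hypothetical; GNU date is assumed.

# to_iso8601_utc EPOCH_SECONDS
# Prints the timestamp in ISO 8601 format in the UTC timezone.
to_iso8601_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# file_row SERVER DIRECTORY NAME LINK_TARGET SIZE_BYTES MTIME_EPOCH
# Prints one CSV row; LINK_TARGET is empty unless the entry is a symbolic link.
# Fields are not quoted here, so values containing commas would need escaping.
file_row() {
  printf '%s,%s,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5" "$(to_iso8601_utc "$6")"
}
```

For example, `to_iso8601_utc 1433776580` prints `2015-06-08T15:16:20Z`.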
There should be a database configuration file that is NOT versioned in source control, for security reasons. Instead, a sample config with standard defaults should be provided. Each developer can then copy the sample to a local file, which is ignored by git, and modify it locally.
The db.env.sample is a BASH script provided as an example.
Modify the script as needed (e.g. .ini file, python, etc.). If the file is modified, the .gitignore file may need to be modified as well.
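One possible shape for this setup is sketched below. The variable names (`DB_HOST`, `DB_PORT`, and so on), their default values, and the helper functions `write_sample_config` and `load_db_config` are hypothetical and only illustrate the sample-plus-local-copy pattern; the real db.env.sample may differ.

```shell
#!/bin/bash
# Sketch only: variable names, defaults, and function names are assumptions.

# write_sample_config PATH
# Writes an illustrative db.env.sample with standard defaults.
write_sample_config() {
  cat > "$1" <<'EOF'
# Copy this file to db.env and adjust the values locally.
export DB_HOST="localhost"
export DB_PORT="3306"
export DB_NAME="1kg"
export DB_USER="root"
export DB_PASSWORD=""
EOF
}

# load_db_config [PATH]
# Sources the local, git-ignored config (db.env by default) and fails with a
# hint when the developer has not created it yet.
load_db_config() {
  local config="${1:-db.env}"
  if [ ! -f "$config" ]; then
    echo "Missing $config: copy db.env.sample to db.env and edit it." >&2
    return 1
  fi
  . "$config"
}
```

With this layout, only db.env.sample is committed, while db.env is listed in .gitignore.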
Write a script that downloads, cleans, and normalizes the metadata of all the sample information on the FTP (e.g. sample_info.txt).
- Input: the URL of the sample information file (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt)
- Output: CSV with the columns of the Samples tab in https://docs.google.com/spreadsheets/d/1EXe7HnYFFnJvKN9IIARDg2flwtzVTftTpHwt7f64aaI/edit#gid=0 and description in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/README_20130606_sample_info
- Download sample_info.txt to a local file. The local file's modified date should match the FTP date (for debugging).
- If sample_info.txt has not changed since the last download (file size and modified date are the same), the script should stop executing gracefully (nothing to do).
Example of how this script could be called:
./sample-metadata.sh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt > samples.csv
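The normalization step could be sketched as a small filter that converts delimited text to CSV. This assumes the downloaded file is tab-delimited, which is not stated above; the function name `tsv_to_csv` is hypothetical, and the sketch only trims spaces and quotes fields, without any column-specific cleaning.

```shell
#!/bin/bash
# Sketch only: assumes tab-delimited input; the function name is hypothetical.

# tsv_to_csv
# Reads tab-separated lines on stdin and writes CSV on stdout. Surrounding
# spaces are trimmed, and fields containing a comma or a double quote are
# quoted, with embedded double quotes doubled.
tsv_to_csv() {
  awk -F'\t' '{
    line = ""
    for (i = 1; i <= NF; i++) {
      field = $i
      gsub(/^ +| +$/, "", field)          # trim leading and trailing spaces
      if (field ~ /[",]/) {
        gsub(/"/, "\"\"", field)          # escape embedded double quotes
        field = "\"" field "\""
      }
      line = line (i > 1 ? "," : "") field
    }
    print line
  }'
}
```

The filter can sit at the end of a pipeline, for example `tsv_to_csv < local_copy.txt > samples.csv`, where `local_copy.txt` stands for the downloaded file.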
Write a script that downloads, cleans, and normalizes the metadata of all the alignments on the FTP (e.g. alignment.index).
- Input: the URL of the index file (e.g. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index)
- Output: CSV with the columns of the BAM.Alignment tab in https://docs.google.com/spreadsheets/d/1EXe7HnYFFnJvKN9IIARDg2flwtzVTftTpHwt7f64aaI/edit#gid=0
- Download alignment.index to a local file. The local file's modified date should match the FTP date (for debugging).
- If alignment.index has not changed since the last download (file size and modified date are the same), the script should stop executing gracefully (nothing to do).
Example of how this script could be called:
./alignment-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index > alignments.csv
Write a script that runs all the other scripts so that the 1000 Genome Project can be synchronized with one single command.
Example of how this script could be called:
./sync1kg.sh
If all other files adhere to the specified examples, this file should have content similar to:
#!/bin/bash
echo "Creating directory for holding temporary files"
mkdir -p etl/
./src/files-crawler.sh ftp://ftp.ncbi.nlm.nih.gov/ > etl/files.csv
./src/sequences-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index > etl/sequences.csv
./src/alignment-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index > etl/alignments.csv
./src/sample-metadata.sh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt > etl/samples.csv
# TODO: download populations and superpopulations CSV files
./src/load-mysql.sh etl/samples.csv etl/populations.csv etl/superpopulations.csv etl/sequences.csv etl/alignments.csv etl/files.csv
Write a script that loads all the data fetched by the previous scripts.
- sample_sequences: join table between the sequences table (for in memory join, use sample_name + fastq_filename) and the files table
- sample_alignments: join table between the alignments table (for in memory join, use bam_filename) and the files table
Example of how this script could be called:
./load-mysql.sh samples.csv populations.csv superpopulations.csv sequences.csv alignments.csv files.csv
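The in-memory join on the BAM file name could be sketched with awk as below. The column layouts are assumptions made for this example only (files.csv as `file_id,filename` and alignments.csv as `alignment_id,bam_path`), the function name `join_alignments_to_files` is hypothetical, and the naive comma splitting does not handle quoted fields.

```shell
#!/bin/bash
# Sketch only: column layouts and the function name are assumptions.

# join_alignments_to_files FILES_CSV ALIGNMENTS_CSV
# Prints "alignment_id,file_id" for every alignment whose BAM file name
# matches a file name in FILES_CSV.
join_alignments_to_files() {
  local files_csv="$1" alignments_csv="$2"
  awk -F',' '
    NR == FNR { file_id[$2] = $1; next }  # first file: file name -> file id
    {
      name = $2
      sub(/.*\//, "", name)               # keep only the BAM file name
      if (name in file_id) print $1 "," file_id[name]
    }
  ' "$files_csv" "$alignments_csv"
}
```

The resulting rows could then be loaded into a join table such as sample_alignments. Matching sequences would follow the same pattern with sample_name + fastq_filename as the key.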