1kg-metadata's People

Contributors

davidonlaptop, kmoukui

1kg-metadata's Issues

Sequence Metadata script

Write a script that downloads, cleans, and normalizes the metadata of all the sequences on the FTP site (e.g. sequence.index).

Script input parameters:

  • indexURL (e.g. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index)

Script output (TSV Format):

Particular behavior

  • The script should store the original sequence.index on local disk. The local file's modified date should match the FTP date (for debugging).
  • If the sequence.index has not changed since the last download (file size and modified date are the same), the script should exit gracefully (nothing to do).
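
The two behaviors above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the function names index_unchanged and store_with_mtime are hypothetical, the remote size and modification time (as a Unix epoch) are assumed to have been obtained from the FTP server already, and GNU stat and touch are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume GNU coreutils (stat -c, touch -d @EPOCH).

# Succeeds (exit 0) when the local copy has the same size and the same
# modification time as the remote file, i.e. there is nothing to do.
# Usage: index_unchanged LOCAL_FILE REMOTE_SIZE_BYTES REMOTE_MTIME_EPOCH
index_unchanged() {
  local file="$1" remote_size="$2" remote_mtime="$3"
  [ -f "$file" ] || return 1
  [ "$(stat -c %s "$file")" = "$remote_size" ] &&
    [ "$(stat -c %Y "$file")" = "$remote_mtime" ]
}

# Sets the local file's modified date to the remote one.
# Usage: store_with_mtime LOCAL_FILE REMOTE_MTIME_EPOCH
store_with_mtime() {
  touch -d "@$2" "$1"
}
```

A caller might run index_unchanged before downloading and exit with status 0 when it succeeds. If the download is done with curl, its --remote-time option is one way to keep the server's timestamp on the local file instead of calling touch.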

Example

Example of how this script could be called:

./sequences-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index   >  sequences.csv 

FTP Metadata Crawler script

Write a script that recursively crawls an FTP site to collect the metadata of its files.

Script input parameters:

  • FTP_Site (e.g. ftp://ftp.ncbi.nlm.nih.gov/)

Script output (TSV Format):

  • rows:
    • one row per file
  • columns:
    • directory (/snp/organisms/human_9606/VCF)
    • filename (00-All.vcf.gz)
    • linktarget (All_20150603.vcf.gz), for when the entry is a symbolic link.
    • filesize (19) in bytes
    • DateModified (2015-06-08T15:16:20Z) in ISO 8601 format in the UTC timezone.
    • other metadata available?
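
As an illustration of the columns above, here is a minimal sketch that turns directory listing lines into TSV rows. It assumes the server returns Unix "ls -l" style listings, which is not guaranteed, and both function names are hypothetical. The listing date is passed through as-is because such listings usually lack seconds; the second helper only shows the target ISO 8601 format for a timestamp that is already known as a Unix epoch (GNU date assumed).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume a Unix-style listing and GNU date.

# Reads "ls -l"-style lines on stdin and prints one TSV row per file or
# symbolic link: directory, filename, linktarget, filesize, listing date.
# Directories are skipped; names containing spaces are not handled.
# Usage: listing_to_tsv DIRECTORY < listing.txt
listing_to_tsv() {
  awk -v dir="$1" 'BEGIN { OFS = "\t" }
    /^[-l]/ {
      target = ($1 ~ /^l/ && $10 == "->") ? $11 : ""
      print dir, $9, target, $5, $6 " " $7 " " $8
    }'
}

# Formats a Unix epoch as ISO 8601 in the UTC timezone.
# Usage: epoch_to_iso8601 EPOCH_SECONDS
epoch_to_iso8601() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}
```

For example, epoch_to_iso8601 1433776580 prints 2015-06-08T15:16:20Z, the DateModified value used in the column list.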

Particular behavior

  • The script should throttle its speed so as not to flood the FTP server with too many requests. A good rule of thumb is no more than 4 requests per second. This limit should be a variable in the script.
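
One simple way to honor this limit is to pause for a fixed interval before every request. The sketch below assumes a sleep command that accepts fractional seconds (GNU sleep does); the variable MAX_REQUESTS_PER_SECOND and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical setting: upper bound on requests sent per second.
MAX_REQUESTS_PER_SECOND="${MAX_REQUESTS_PER_SECOND:-4}"

# Prints the minimum pause between two requests, in seconds.
# Usage: request_interval REQUESTS_PER_SECOND
request_interval() {
  LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.3f\n", 1 / n }'
}

# Call once before each request sent to the FTP server.
throttle() {
  sleep "$(request_interval "$MAX_REQUESTS_PER_SECOND")"
}
```

Because the pause is added on top of the time each request takes, the actual rate stays at or below the configured limit.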

Example

Example of how this script could be called:

./files-crawler.sh ftp://ftp.ncbi.nlm.nih.gov/   >  files.csv 

Database configuration file

There should be a database configuration file that is NOT versioned in source control, for security reasons. Instead, a sample config with standard defaults should be provided. Each developer can then copy the sample to a local file, which is ignored by git, and modify that copy.

The db.env.sample is a BASH script provided as an example.

Modify the script as needed (e.g. switch to an .ini file, Python, etc.). If the file is changed, the .gitignore file may need to be updated as well.
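
A script that needs the database settings could load them along these lines. This is only a sketch under assumptions: the local file name db.env, the function name load_db_config, and the variable names shown in the comment are all hypothetical and are not taken from the actual db.env.sample.

```shell
#!/usr/bin/env bash
# Hypothetical loader for the local, git-ignored configuration file.
# A db.env.sample could contain illustrative defaults such as:
#   DB_HOST=localhost
#   DB_PORT=3306
#   DB_USER=root
#   DB_PASSWORD=
# Usage: load_db_config [CONFIG_FILE]
load_db_config() {
  local config="${1:-./db.env}"
  if [ ! -f "$config" ]; then
    echo "Missing $config: copy db.env.sample to $config and edit it." >&2
    return 1
  fi
  # The config is itself a shell script, so it is sourced, not parsed.
  . "$config"
}
```

With this layout, the .gitignore would list db.env while db.env.sample stays under version control.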

Sample script metadata

Write a script that downloads, cleans, and normalizes the metadata of all the sample information on the FTP site (e.g. sample_info.txt).

Script input parameters:

  • indexURL (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt)

Script output (TSV Format):

Particular behavior

  • The script should store the original sample_info.txt on local disk. The local file's modified date should match the FTP date (for debugging).
  • If the sample_info.txt has not changed since the last download (file size and modified date are the same), the script should exit gracefully (nothing to do).
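
This issue does not spell out what cleaning and normalizing involve. As one possible reading, the sketch below removes carriage returns, drops blank lines, and trims spaces around each tab-separated field. The function name normalize_tsv and the choice of these three rules are assumptions, not requirements from the issue.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads tab-separated text on stdin, writes it cleaned.
normalize_tsv() {
  awk 'BEGIN { FS = OFS = "\t" }
    { sub(/\r$/, "") }            # strip a Windows line ending
    /^[ \t]*$/ { next }           # drop lines with no content
    {
      for (i = 1; i <= NF; i++)   # trim spaces around every field
        gsub(/^[ ]+|[ ]+$/, "", $i)
      print
    }'
}
```

It could sit between the download step and stdout, for instance as normalize_tsv < sample_info.txt.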

Example

Example of how this script could be called:

./sample-metadata.sh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt   >  samples.csv 

Alignment Metadata script

Write a script that downloads, cleans, and normalizes the metadata of all the alignments on the FTP site (e.g. alignment.index).

Script input parameters:

  • indexURL (e.g. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index)

Script output (TSV Format):

Particular behavior

  • The script should store the original alignment.index on local disk. The local file's modified date should match the FTP date (for debugging).
  • If the alignment.index has not changed since the last download (file size and modified date are the same), the script should exit gracefully (nothing to do).

Example

Example of how this script could be called:

./alignment-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index   >  alignments.csv 

Do all script

Write a script that runs all the other scripts so that the 1000 Genomes Project can be synchronized with a single command.

Particular behavior

  • The script should print the time (HH:mm:ss) at each step and log its progress to stdout.
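
A small logging helper is enough to cover this requirement. The sketch below is one way to do it; the function name log_step is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefixes a progress message with the current time
# (hours:minutes:seconds) and writes it to stdout.
log_step() {
  printf '%s %s\n' "$(date +%H:%M:%S)" "$*"
}

log_step "Creating directory for holding temporary files"
```

Each step of the sync script would then start with a log_step call in place of a bare echo.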

Execution Example

Example of how this script could be called:

./sync1kg.sh

Contents Example

If all other files adhere to the specified examples, this file should have contents similar to:

#!/bin/bash

echo "Creating directory for holding temporary files"
mkdir -p etl/

./src/files-crawler.sh ftp://ftp.ncbi.nlm.nih.gov/   >  etl/files.csv
./src/sequences-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index   >  etl/sequences.csv
./src/alignment-metadata.sh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index   >  etl/alignments.csv
./src/sample-metadata.sh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt   >  etl/samples.csv
# TODO: download the populations and superpopulations CSV files

./src/load-mysql.sh etl/samples.csv etl/populations.csv etl/superpopulations.csv etl/sequences.csv etl/alignments.csv etl/files.csv

Loading script

Write a script that loads all the data fetched by the previous scripts into the database.

Script input parameters:

  • samples
  • populations
  • superpopulations
  • sequences
  • alignments
  • files

Script output (TSV Format):

  • none

Particular behavior

  • The script should use the database configuration file (see #5).
  • The script should create the database if it does not exist
  • The script should create the tables only if they do not exist
  • If data exists, the tables should be TRUNCATED.
  • One table is created for each of the 6 CSV files
  • Additionally, these 2 tables are created (in-memory join mapping required):
    • sample_sequences : join table between the sequences table (for in memory join, use sample_name + fastq_filename) and the files table
    • sample_alignments : join table between the alignments table (for in memory join, use bam_filename) and the files table
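
The create-if-missing and truncate rules above map onto standard MySQL statements. The sketch below only prints those statements so they can be reviewed or piped to the mysql client. It rests on several assumptions that the issue does not confirm: each file starts with a header row, fields are tab-separated, and every column is stored as TEXT. The function names are hypothetical, and the join tables are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that print MySQL statements; nothing is executed.
# Assumption: names contain no backticks or quotes.

# Usage: sql_for_database DATABASE
sql_for_database() {
  printf 'CREATE DATABASE IF NOT EXISTS `%s`;\nUSE `%s`;\n' "$1" "$1"
}

# Creates the table from the file's header row, empties it, reloads it.
# Usage: sql_for_table TABLE FILE
sql_for_table() {
  local table="$1" file="$2" columns
  columns="$(head -n 1 "$file" | awk -F '\t' '{
    for (i = 1; i <= NF; i++)
      printf "%s`%s` TEXT", (i > 1 ? ", " : ""), $i
  }')"
  printf 'CREATE TABLE IF NOT EXISTS `%s` (%s);\n' "$table" "$columns"
  printf 'TRUNCATE TABLE `%s`;\n' "$table"
  printf "LOAD DATA LOCAL INFILE '%s' INTO TABLE \`%s\`\n" "$file" "$table"
  printf "  FIELDS TERMINATED BY '%s' IGNORE 1 LINES;\n" '\t'
}
```

Truncating unconditionally has the same effect as truncating only when data exists, which keeps the generated SQL simple. The output of both functions could be piped to the mysql client with local infile loading enabled.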

Example

Example of how this script could be called:

./load-mysql.sh samples.csv populations.csv superpopulations.csv sequences.csv alignments.csv files.csv 
