
Introduction

Getting started

Create a directory named "MNGS2" in your home directory. This is where the analysis will be carried out. As we proceed, we will create sub-directories so that intermediate outputs can be referred to easily whenever they are needed for downstream analyses.

mkdir $HOME/MNGS2

Make a sub-directory data in MNGS2 and download the data.

mkdir $HOME/MNGS2/data
cd $HOME/MNGS2/data
wget https://mgexamples.s3.climb.ac.uk/HMP_GUT_SRS052697.25M.1.fastq.gz
wget https://mgexamples.s3.climb.ac.uk/HMP_GUT_SRS052697.25M.2.fastq.gz

To confirm that the files have been downloaded successfully, list the contents of the directory using the ls command. If the data is compressed (i.e. .gz files), uncompress it with gunzip. Check the result using ls -lh. What is the size of these files?
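For example (assuming you are still inside $HOME/MNGS2/data):

ls
gunzip *.gz
ls -lh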

Quality control check

Here we use fastqc. It is useful to store the quality control reports in one place for easy reference, so in MNGS2 create a sub-directory qcresults where the fastqc results will be stored. Then run the quality checks:

mkdir $HOME/MNGS2/qcresults
fastqc $HOME/MNGS2/data/*.fastq -o $HOME/MNGS2/qcresults

Once this is done, navigate to qcresults and view the ".html" files. The report can be used to assess the quality distribution, read length distribution, GC content and per-base nucleotide distribution. This informs downstream quality and adapter trimming.
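One way to open the reports (a sketch; this assumes a desktop session with the firefox browser available, otherwise copy the .html files to your local machine and open them there):

cd $HOME/MNGS2/qcresults
firefox *.html &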

Data quality trimming

We use trim_galore for quality and adapter trimming. Depending on the QC results, it may be necessary to adjust the parameters used here. This command clips the first 16 bases from the 5' end of each read and removes bases with Phred quality below 25 from the 3' end. We need somewhere to store the quality-trimmed reads, so make a sub-directory trimmed inside MNGS2.

mkdir $HOME/MNGS2/trimmed
cd $HOME/MNGS2/trimmed
trim_galore -q 25 --length 75 --dont_gzip --clip_R1 16 --clip_R2 16 --paired ../data/HMP_GUT_SRS052697.25M.1.fastq ../data/HMP_GUT_SRS052697.25M.2.fastq -o .

If all goes well, the trimmed reads will be available in trimmed. Consider running fastqc on the trimmed reads to check the improvement made by trim_galore.
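For example (a sketch; the exact trimmed file names depend on your trim_galore version, e.g. *_val_1.fq, so adjust the glob to match what was produced):

fastqc $HOME/MNGS2/trimmed/*.fastq -o $HOME/MNGS2/qcresults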

Remove host sequences

We need to get rid of host DNA sequences that may have contaminated the data during the sampling and extraction stages. This is done by mapping the reads to the host reference genome and keeping only the unmapped reads.

There is a copy of the human genome on this system, so we shall copy it into everyone's working directory. Create a directory hostgenome to hold the host genome, copy the genome file into it, and index it. Note: change the genome file name accordingly.

mkdir hostgenome
cd hostgenome
# copy the shared human genome file (here called Gh37.fa) into this directory, then index it
bwa index Gh37.fa

At this point, we create a directory to store the host-free sequence data. samtools and bedtools are the key tools used for this step.

cd $HOME/MNGS2
mkdir hostfree
cd hostfree
bwa aln ../hostgenome/Gh37.fa ../trimmed/HMP_GUT_SRS052697.25M.1_trim.fastq > HMP_GUT_SRS052697.25M.1.sai
bwa aln ../hostgenome/Gh37.fa ../trimmed/HMP_GUT_SRS052697.25M.2_trim.fastq > HMP_GUT_SRS052697.25M.2.sai
bwa sampe ../hostgenome/Gh37.fa HMP_GUT_SRS052697.25M.1.sai HMP_GUT_SRS052697.25M.2.sai ../trimmed/HMP_GUT_SRS052697.25M.1_trim.fastq ../trimmed/HMP_GUT_SRS052697.25M.2_trim.fastq > reads12_alignment.sam
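Optionally, sanity-check the alignment before filtering (a quick check; recent samtools versions read SAM directly, and flagstat summarizes how many reads mapped to the host):

samtools flagstat reads12_alignment.sam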

Extract the read pairs where both mates are unmapped, sort them by name and write a corresponding BAM file.

# -f 12 keeps pairs with both mates unmapped; -F 256 drops secondary alignments
samtools view -bS -f 12 -F 256 reads12_alignment.sam | samtools sort -n -o reads_unmapped.bam -

Convert the name-sorted BAM file back to paired fastq files.

bedtools bamtofastq -i reads_unmapped.bam -fq read_1.fastq -fq2 read_2.fastq

At this point, the data should be free of host sequences.
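A quick check on the result (standard Unix; each fastq record is four lines, so dividing the line count by four gives the number of reads in each file, which should be equal for the pair):

echo $(( $(wc -l < read_1.fastq) / 4 ))
echo $(( $(wc -l < read_2.fastq) / 4 ))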

De novo assembly

Here we use IDBA-UD to assemble the host-free reads into consensus contigs. The program takes a single input file, so we need to merge the forward and reverse reads into one interleaved FASTA file before assembly.

mkdir $HOME/MNGS2/denovo
cd $HOME/MNGS2/denovo
fq2fa --merge ../hostfree/read_1.fastq ../hostfree/read_2.fastq ../hostfree/reads_12.fa
idba_ud -r ../hostfree/reads_12.fa --num_threads 20 -o .

To evaluate the assembly, we use quast. This provides a summary of the metagenome assembly, including (but not limited to) N50, N75, L50, L75, GC percentage and the number of contigs longer than 500 bp. (This only assesses the consensus contigs; the same procedure can be used to assess the other outputs.)

quast.py -t 4 -f --meta contig.fa -o .
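quast writes its summary into the output directory as report.txt and report.html; for a quick look:

less report.txt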

Gene prediction

Gene prediction proceeds using the prodigal software. In this example, the output includes the predicted gene sequences, the encoded protein sequences and an annotation file. The annotation file lists the features with their lengths and locations on the contigs, among other details. See the full description of the GFF format.

mkdir $HOME/MNGS2/genes
cd $HOME/MNGS2/genes
# -a writes the protein translations, -d writes the nucleotide gene sequences
prodigal -a proteins.fa -d genes.fa -i ../denovo/contig.fa -f gff -p meta > annotation.gff
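To get a quick count of the predicted genes (each FASTA header line starts with '>'):

grep -c ">" genes.fa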

Taxonomic classification

We use kaiju to classify the host-free reads taxonomically. First, build the kaiju reference database (older kaiju releases ship makeDB.sh; newer releases use kaiju-makedb):

mkdir $HOME/MNGS2/kaijudb
cd $HOME/MNGS2/kaijudb
makeDB.sh -v

# do the classification using the kaiju metagenome classifier; run this from a
# sub-directory of MNGS2 so that the ../kaijudb and ../mapping paths resolve.
# The loop assumes host-removed reads named *_host_removed_R1/_R2.fastq in a
# mapping directory; adjust the path and names (e.g. ../hostfree/read_1.fastq)
# to match your files.
for sample in $(ls ../mapping/*R1*.fastq); do
        bname=$(basename $sample "_host_removed_R1.fastq")
        kaiju -v -x -z 20 \
        -t ../kaijudb/nodes.dmp \
        -f ../kaijudb/kaiju_db.fmi \
        -i ../mapping/${bname}_host_removed_R1.fastq \
        -j ../mapping/${bname}_host_removed_R2.fastq \
        -o ${bname}_kaiju.out ;
done

#add taxon names to each file (this is stored in addtaxon.sh)
for sample in $(ls *.out); do
        addTaxonNames -t ../kaijudb/nodes.dmp \
        -n ../kaijudb/names.dmp \
        -i ${sample} \
        -o ${sample}_names.out
done

#add sample names to each file
for f in $(ls *names.out); do
        #kaiju2krona -t ../kaijudb/nodes.dmp -n ../kaijudb/names.dmp -i $f -o ${f}.krona
        sample=$(echo $f | cut -f 1 -d 'k' | sed 's/_/ /g' | sed 's/ /_/')
        sed -i "s/$/\t$sample/" $f;
done

#pick only the classified records (identified by C in the first column);
#column 8 is the taxon name appended by addTaxonNames, column 9 the sample name
for i in $(ls *_names.out); do awk -F"\t" '$1=="C" {print $8"\t"$9}' $i > ${i}.classified; done

#merge all this into a single file
cat *.classified > merged.classified
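To get a quick overview of the most abundant taxon/sample combinations (standard Unix; this counts duplicate lines and sorts by frequency):

sort merged.classified | uniq -c | sort -rn | head -20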
