Code Monkey home page Code Monkey logo

variant-calling-in-rna-seq-data-na12878's Introduction

Variant Calling in NA12878 RNA data

Variant Calling in NA12878 was performed using the GATK best practices workflow.

Fastq files and reference genome

Fastq files were obtained from the following link https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5665260 To download the files I installed SRAtoolkit on the server and used the fasterq dump command:

fasterq-dump SRR5665260

Reference genome was downloaded from the following link: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false The fasta file, .fai file, .dict, dbsnp file, known indel sites were also downloaded.

GTF file was downloaded from Gencode: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.annotation.gtf.gz

Workflow

Workflow

Options used

Running fastp for quality check and to trim adapters

Input

SRR5665260_1.fastq,
SRR5665260_2.fastq

Command

fastp -i SRR5665260_1.fastq -I SRR5665260_2.fastq -o SRR5665260_1.fastp.fastq -O SRR5665260_2.fastp.fastq --detect_adapter_for_pe 

Output

SRR5665260_1.fastp.fastq,
SRR5665260_2.fastp.fastq,
fastp.html,
fastp.json

Generating a index for the reference genome using the gtf file

Input

resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta,# reference genome
gencode_v39_annotation.gtf #gtf file

Command

STAR --runThreadN 6 --runMode genomeGenerate --genomeDir reference_genome_hg38_index/ --genomeFastaFiles reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --sjdbGTFfile gtf_files/gencode_v39_annotation.gtf --sjdbOverhang 143

Output

genome Dir

Mapping the fastq files using STAR two pass

Input

genomeDir,
SRR5665260_1.fastp.fastq,
SRR5665260_2.fastp.fastq

Command

nohup STAR --runThreadN 6 --genomeDir reference_genome_hg38_index/ --readFilesIn SRR5665260_1.fastp.fastq SRR5665260_2.fastp.fastq --sjdbOverhang 143 --twopassMode Basic --outSAMtype BAM SortedByCoordinate --outFileNamePrefix NA12878_RNA & 

Output

NA12878_RNAAligned.sortedByCoord.out.bam #bam file sorted by coordinate

Mark duplicates

Input

NA12878_RNAAligned.sortedByCoord.out.bam

Command

gatk MarkDuplicates --INPUT NA12878_RNAAligned.sortedByCoord.out.bam --OUTPUT NA12878_RNAAligned.sortedByCoord.out_deduped.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE NA12878_RNA_duplicates.metrics &

Output

NA12878_RNAAligned.sortedByCoord.out_deduped.bam, \ #deduplicated bam file NA12878_RNAAligned.sortedByCoord.out_deduped.bai, \ #deduplicated bam file index NA12878_RNA_duplicates.metrics #metrics

SplitNCigar

Input

reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta, #reference genome
NA12878_RNAAligned.sortedByCoord.out_deduped.bam #deduplicated bam file

Command

nohup gatk SplitNCigarReads -R reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -I NA12878_RNAAligned.sortedByCoord.out_deduped.bam -O NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar.bam &

Output

NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar.bam,
NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar.bai

Base Recalibration

Input

reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta,
/workspace/RNAseq_data/NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp.bam,
/workspace/RNAseq_data/gatk_resource_bundle/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -known-sites,
/workspace/RNAseq_data/gatk_resource_bundle/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz

Command

gatk BaseRecalibrator -R /workspace/RNAseq_data/reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -I /workspace/RNAseq_data/NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp.bam --use-original-qualities -O NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal.bam -known-sites /workspace/RNAseq_data/gatk_resource_bundle/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -known-sites /workspace/RNAseq_data/gatk_resource_bundle/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz &

Output

NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal.bam

Apply BQSR

Input

/workspace/RNAseq_data/reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta,
/workspace/RNAseq_data/NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp.bam,
NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal.bam

Command

nohup gatk ApplyBQSR --add-output-sam-program-record -R /workspace/RNAseq_data/reference_genome_hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -I /workspace/RNAseq_data/NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp.bam --use-original-qualities -O NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal_2.bam --bqsr-recal-file NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal.bam &

Output

NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal_2.bam,
NA12878_RNAAligned.sortedByCoord.out_deduped_postSplitNCigar_withReadGrp_recal_2.bai

Variant calling

variant-calling-in-rna-seq-data-na12878's People

Contributors

shaikhfarheen03 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.