Code Monkey home page Code Monkey logo

bliss_test's Introduction

BLISS test

Analysis of BLISS test runs performed in June 2018.

The commands used can be found in cmd.sh.

Basic questions

  1. How many reads start with the barcode ('AGACTCT', or 'AGACTC')?
  2. How many reads can be mapped to a reference human genome?
  3. How many reads can be mapped to telomeres?

Input data

The input fastq files were downloaded from here to the fastq directory, but are also accessible on the server (bigram:/ngs/Fastq/new_BCHLT/). They correspond to sequencing results from two samples.

This is how sequences were designed:

alt text

Adapter trimming

The steps below were done knowing that the overrepresented sequences found by FastQC were reverse complement sequences of the opposite adapters (R1 <> R2). This was due to formation of "dimers", without prior ligation of the barcode between adapters.

All commands used are in cmd_3.sh.

  • In the beginning there are:

    • 5,253,177 C1_S1 paired reads
    • 3,793,031 NB_S2 paired reads
  • Reads with the native adapters in R1 at the 5' end were discarded with cutadapt: cutadapt -g AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC --discard-trimmed --minimum-length 6 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_native_adapters.fastq.gz -p trim_adapters/B_SC-BLESS_C1_S1_L001_R2_BCHLT_cutadapt_native_adapters.fastq.gz fastq/B_SC-BLESS_C1_S1_L001_R1_BCHLT.fastq.gz fastq/B_SC-BLESS_C1_S1_L001_R2_BCHLT.fastq.gz > analysis/cutadapt_native_adapters.out

    • 7,074 (0.1%) R1 reads discarded in C1_S1
    • 4,686 (0.1%) R1 reads discarded in NB_S2
    • So, in general, R1 reads were free from R1 adapters.
  • We also want to trim the R2 reverse complement (R2_RC) adapter from R1 reads at the 3' end: cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG --minimum-length 6 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_native_adapters_2.fastq.gz -p trim_adapters/B_SC-BLESS_C1_S1_L001_R2_BCHLT_cutadapt_native_adapters_2.fastq.gz trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_native_adapters.fastq.gz trim_adapters/B_SC-BLESS_C1_S1_L001_R2_BCHLT_cutadapt_native_adapters.fastq.gz > analysis/cutadapt_native_adapters_2.out

    • 5,032,988 (95.9%) R1 reads with R2_RC adapter in C1_S1
    • 3,229,386 (85.2%) R1 reads with R2_RC adapter in NB_S2
    • In general, around 90% of R1 reads contain R2_RC adapter.
  • The above steps were repeated for R2 reads.

  • The resulting reads of minimum 6 nt are at trim_adapters/*cutadapt_native_adapters_4.fastq.gz:

    • 4,017,871 reads in C1_S1
    • 3,040,363 reads in NB_S2

Barcode presence

This paragraph corresponds to commands from cmd_4.sh.

  • Histogram of read lengths (with barcodes left) is shown in a plot.

  • The first and last 10 nucleotides in R1 and R2 reads, respectively, plotted in WebLogo: alt text

  • Therefore, we assume barcode as 'AGACTC' (and 'GAGTCT'/'CAGTCT' at 3' end for R2 reads as C is very common at the R2 5' ends). Numbers of R1 reads starting with the exact barcode:

    • 2,463,583 (46.9% of total) in C1_S1:
      • among which 2,349,656 (44.7% of total) have the barcode (max. one mismatch) in R2
    • 2,061,719 (54.0% of total) in NB_S1
      • among which 1,902,007 (50.1% of total) have the barcode (max. one mismatch) in R2
  • R2 reads start usually with C, followed by GG (logo for C1 and NB).

  • Filtered and trimmed reads (min 1 nt, trimmed from both sides) were written to barcodes and analyzed with FastQC (C1_R1, NB_R1).

Mapping to human genome

This paragraph corresponds to commands from cmd_5.sh.

  • Only reads of length at least 30 nt (R1) were left for mapping (barcodes/*barcodes_2_30.fastq.gz):

    • 277,829 R1 (5.3% of the initial reads) in C1_S1
      • among which 238,841 R2 have at least 1 nt
        • among which 72,801 R2 have at least 30 nt
    • 692,904 R1 (18.3% of the initial reads) in NB_S1
      • among which 587,517 R2 have at least 1 nt
        • among which 168,203 R2 have at least 30 nt
  • Then, first nucleotides from R2 reads were trimmed with fastx_trimmer: fastx_trimmer -Q33 -z -f 2 < barcodes/B_SC-BLESS_C1_S1_L001_R2_BCHLT_barcodes_2_30.fastq > barcodes/B_SC-BLESS_C1_S1_L001_R2_BCHLT_barcodes_2_30_trim1.fastq.gz

  • The paired-end reads (R1 at least 30 nt, R2 at least 30 nt, in "mixed mode") were mapped to hg38 using bowtie2 (stdout in bowtie.out): bowtie2 -5 1 --fr -N 1 -x hg_mapping/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index -1 barcodes/B_SC-BLESS_C1_S1_L001_R1_BCHLT_barcodes_2_30.fastq.gz -2 barcodes/B_SC-BLESS_C1_S1_L001_R2_BCHLT_barcodes_2_30.fastq.gz --threads 6 -S hg_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie_N1.sam 2>> analysis/bowtie.out:

    • C1_S1:
     238841 reads; of these:
       238841 (100.00%) were paired; of these:
         234495 (98.18%) aligned concordantly 0 times
         2620 (1.10%) aligned concordantly exactly 1 time
         1726 (0.72%) aligned concordantly >1 times
         ----
         234495 pairs aligned concordantly 0 times; of these:
           194 (0.08%) aligned discordantly 1 time
         ----
         234301 pairs aligned 0 times concordantly or discordantly; of these:
           468602 mates make up the pairs; of these:
             370608 (79.09%) aligned 0 times
             636 (0.14%) aligned exactly 1 time
             97358 (20.78%) aligned >1 times
     22.42% overall alignment rate
    
    • NB_S2:
     613650 reads; of these:
       613650 (100.00%) were paired; of these:
         574468 (93.61%) aligned concordantly 0 times
         21680 (3.53%) aligned concordantly exactly 1 time
         17502 (2.85%) aligned concordantly >1 times
         ----
         574468 pairs aligned concordantly 0 times; of these:
           673 (0.12%) aligned discordantly 1 time
         ----
         573795 pairs aligned 0 times concordantly or discordantly; of these:
           1147590 mates make up the pairs; of these:
             911382 (79.42%) aligned 0 times
             7231 (0.63%) aligned exactly 1 time
             228977 (19.95%) aligned >1 times
     25.74% overall alignment rate
    
  • In general, only a small fraction of reads map to the human genome with high precision:

    • 5,754 (0.1% of total reads) = 4575 + 191 + 988 "aligned exactly 1 time" in C1_S1
    • 29,584 (0.8% of total reads) = 21,680 + 673 + 7,231 "aligned exactly 1 time" in NB_S2
  • Samples of the trimmed sequences are at mapping (*head1000.fa).

  • SAM files were converted to BAM using samtools: samtools view -Sb hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.sam > hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.bam and then to BED using bedtools: bedtools bamtobed -i hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.bam > hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.bed, and sorted: bedtools sort -i hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.bed > hg38_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie_sorted.bed.

  • Get the read length: samtools stats hg_mapping/B_SC-BLESS_C1_S1_L001_BCHLT_bowtie.sam | grep ^RL | cut -f 2- > hg_mapping/C1_S1_bowtie_RL

Where do the reads map?

Script length_distribution.py performs very basic analysis of the mapped reads, including printing histogram of length of mapped reads, as well as statistics on numbers of regions to which the reads map. Standard output is written to 10+reads. These are regions with highest (paired and unique) read coverage:

Obsolete analysis

Reads statistics

Script get_cutadapt_stats.py writes a table with trimming statistics: python scripts/get_cutadapt_stats.py. A plot below (from analysis/cutadapt_stats.numbers) shows numbers of reads after trimming and with barcodes for the respective fastq files.

alt text

  • Large contamination with adapters (over 90% of reads are affected).
  • Small fraction of reads with perfect match to the barcode at 5' end.
File Total reads Without adapters (min 6 nt) With 'AGACTCT' barcode (min 1 nt, max 1 mismatch) With 'AGACTCT' barcode (min 1 nt)
C1_S1_L001_R1 5,253,177 214,533 172,057 44,853
C1_S1_L001_R2 5,253,177 853,167 4,141 88
NB_S2_L001_R1 3,793,031 560,407 474,298 126,418
NB_S2_L001_R2 3,793,031 958,800 3,527 55

What happens to the barcode?

Analysis of the barcoded reads (commands cmd_2.sh):

  • First, only reads without adaptors (R1 of min. 6 nt) were left, together with their R2 counterparts, e.g.: cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG --discard-trimmed --minimum-length 6 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_paired_1.fastq.gz -p trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_paired_2.fastq.gz fastq/B_SC-BLESS_C1_S1_L001_R1_BCHLT.fastq.gz fastq/B_SC-BLESS_C1_S1_L001_R2_BCHLT.fastq.gz > analysis/cutadapt_2.out

    • 214,533 reads of C1_S1
    • 927,052 reads of NB_S2
  • To check overrepresented sequences at 5':

    • Fastq files were converted to fasta (at mapping) with: fastq_to_fasta -Q33 -i B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_paired_1.fastq -o C1_S1_R1.fa
    • The first 10 nt cut with: python ../scripts/cut_10.py C1_S1_R1.fa C1_S1_R1_first10.fa
    • Visualized with weblogo using: weblogo -f C1_S1_R1_first10.fa -D fasta -o C1_S1_R1_first10.pdf -F pdf -A dna --errorbars no -c classic

    alt text

    • Therefore, in the next steps, we would assume barcode = 'AGACTC'.
  • Barcoded reads (with 'AGACTC') were written together with their R2 counterparts to paired fasta files: cutadapt -g ^AGACTC -e 0 --overlap 6 --minimum-length 1 --discard-untrimmed -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_barcodes_1.fastq.gz -p trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_barcodes_2.fastq.gz trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_paired_1.fastq.gz trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_paired_2.fastq.gz:

    • 166,254 reads in C1_S1 (79.7%)
    • 739,074 reads in NB_S2 (77.5%)

How many of the barcoded reads can be mapped to human genome?

Mapping of the barcoded reads to the human genome was done as follows (cmd_2.sh):

  • The reference indexed GRCh38/hg38 genome was downloaded from NCBI ftp (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz), as linked in the Bowtie 2 website.

Quality control (with guessed sequences of adapters)

  • FastQC was run to check quality of the reads (outputs written to FastQC). Overrepresented sequences suggest contamination with 'RNA PCR Primer, Index 6' (R1 C1_S1), 'RNA PCR Primer, Index 12' (R1 NB_S2), and 'Illumina RNA PCR Primer' (R2). Sequences of these adapters (their reverse complement) were found here and written to primers.fa. This step was done before knowing sequences of the adapters. With this knowledge, the results suggest that majority of reads are R1+R2 dimers, with no barcodes.

  • The adapters were trimmed with cutadapt (allowing at minimum three bases match between adapter and match, and discarding proccesed reads that are shorter than 6 nt), and only the reads without matching adapters were left, e.g.: cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG --discard-trimmed --minimum-length 6 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_1.fastq.gz fastq/B_SC-BLESS_C1_S1_L001_R1_BCHLT.fastq.gz > analysis/cutadapt.out:

      === Summary ===
    
      Total reads processed:               5,253,177
      Reads with adapters:                 5,038,644 (95.9%)
      Reads that were too short:             737,275 (14.0%)
      Reads written (passing filters):       214,533 (4.1%)
      
      Total basepairs processed:   372,975,567 bp
      Total written (filtered):     15,231,843 bp (4.1%)
    
  • Then, only reads with AGACTCT at 5' end were left, e.g.: cutadapt -g ^AGACTCT -e 0 --overlap 7 --minimum-length 1 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_2.fastq.gz --discard-untrimmed trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt.fastq.gz >> analysis/cutadapt.out:

      === Summary ===
    
      Total reads processed:                 214,533
      Reads with adapters:                    44,853 (20.9%)
      Reads that were too short:                   0 (0.0%)
      Reads written (passing filters):        44,853 (20.9%)
      
      Total basepairs processed:    15,231,843 bp
      Total written (filtered):      2,870,592 bp (18.8%)
    
  • Reads with barcodes with max. one mismatch were also filtered, e.g.: cutadapt -g ^AGACTCT -e 0.15 --overlap 7 --minimum-length 1 -o trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt_3.fastq.gz --discard-untrimmed trim_adapters/B_SC-BLESS_C1_S1_L001_R1_BCHLT_cutadapt.fastq.gz >> analysis/cutadapt.out

  • Similarly for the remaining files (or run bash cmd.sh &), write standard output to cutadapt.out.

bliss_test's People

Watchers

Dorota Mat avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.