Code Monkey home page Code Monkey logo

bamslam's Introduction

BamSlam

This script was written for long-read Oxford Nanopore Technologies direct RNA/cDNA sequencing data. It uses BAM files produced after mapping with minimap2 to the reference transcriptome. It will output a summary file and plots from your aligned reads. This script was used in: https://doi.org/10.1093/nar/gkab1129.

Inputs:

To obtain a BAM file align your FASTQ/FASTA files to the transcriptome with minimap2.

minimap2 -ax map-ont --sam-hit-only transcriptome.fasta reads.fastq | samtools view -bh > aligned_reads.bam

If minimap2 is not run with --sam-hit-only you should remove unmapped reads prior to running BamSlam to avoid slowing it down. You can also input a BAM file output from NanoCount: https://github.com/a-slide/NanoCount.

Requirements:

R (tested with v4.2.2) R packages:

  • GenomicAlignments (Bioconductor)
  • dplyr
  • tidyr
  • tibble
  • data.table
  • ggplot2
  • viridis
  • hexbin

Usage:

Download/copy the Rscript from this repository and run it from the terminal as follows:

Rscript BamSlam.R [DATA_TYPE] [BAM_FILE] [OUT_PREFIX]
Rscript BamSlam.R rna undiff1_5Y.bam undiff1

DATA_TYPE, enter either: cdna, rna
BAM_FILE, a BAM file of alignments to the transcriptome
OUT_PREF, output file prefix

The script takes approximately 5 minutes per million reads.

Outputs:

Data

  • A summary CSV file of your alignments.
  • A CSV file of all input alignments (primary and secondary) for downstream analysis.
  • A CSV file of each unique transcript identified in the data with its corresponding median read coverage.

Plots

  • Histogram of read coverages (full-length reads cutoff/dashed line = 0.95):

  • Histogram of known transcript lengths (per distinct transcript in the data):

  • Histogram of known transcript lengths (per read):

  • Histogram density plot of known transcript length vs coverage fractions:

  • Bar chart of the secondary alignments:

  • Density plot of the read accuracies:

Details of summary file metrics

  • Total number of reads
  • Number of reads representing full-length transcripts (reads with coverage fractions > 0.95)
  • Percentage of reads representing full-length transcripts
  • Median read coverage fraction (primary alignments)
  • Median alignment length (primary alignments)
  • Median accuracy (primary alignments) (calculated from CIGAR strings as: (nbrM+nbrI+nbrD-NM)/(nbrM+nbrI+nbrD))
  • Number of reads with no secondary alignments
  • Percentage of reads with no secondary alignments
  • Total number of distinct transcripts identified in the data
  • Median coverage fraction of all unique transcripts (median value of the median read coverages per transcript)
  • Median length of all unique transcripts identified (median length in nt of the annotated length of transcripts identified)

bamslam's People

Contributors

josiegleeson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

bamslam's Issues

Error appears when running the BamSlam.R

Hi, I am working with Nanopore direct-RNA sequencing and found your tools really useful. However, while I was trying to use your BamSlam script to generate some summaries and figures for the alignments, I came across the error below (I generated the bam file using the exact Minimap2 commands suggested by you).

Do you have any ideas what caused the error below, and how I should fix it? Thank you!

A summary CSV file of your alignments.

Rscript BamSlam.R rna align_reads/Min01_aligned_reads.bam Min01

Error:
Imported bam file
Error in [<-(*tmp*, paste0("nbr", col_names_extract), value = c(514L, :
3661425 rows in value to replace 732285 rows
Execution halted

Rscript BamSlam.R rna align_reads/Min02_aligned_reads.bam Min02

Error:
Imported bam file
Error in [<-(*tmp*, paste0("nbr", col_names_extract), value = c(840L, :
943080 rows in value to replace 188616 rows
Execution halted

I am looking forwards to your respond. Thank you!

Some guidance on usage

Hey! nice paper. I wanted to use this script to provide summary statistics of my mapped reads as well and cite your paper. Looking through the Rscript I can see we need to supply the file-path to the BAM file and output directory but there is also an argument flag for type. I was wondering what this was for.

Can we run simply by:

Rscript in.bam out.directory as in arguments are positional or are there flags?

metric definitions request

Hi folks,
I was looking for a way to test some metrics for ONT RNA reads and found your tool. Bamslam is super useful, and I enjoy how easy it is to run.

I am reviewing results I got for my samples and before I share them with anyone I want to make sure I understand the meanings for some metrics.

In particular, I'm looking for the difference between "Median coverage fraction of transcripts (primary alignments)" and "Median coverage fraction of all unique transcripts". I suspect the former is just the median fraction any primary alignment captures of its matching cDNA. Across the protocols we are testing we see that some have higher values this than others (values from 0.2 to 0.6), and this aligns with our expectation. The "Median coverage fraction of all unique transcripts" however, seems to yield a similar value for each of our samples (values from 0.2 to 0.24) . Can you help me understand this latter metric?

thanks
Richard

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.