josiegleeson / bamslam Goto Github PK

Summarise and plot data from long-read ONT (direct RNA/cDNA) BAM files

License: MIT License

R 100.00%

bioinformatics direct-rna-sequencing long-read-sequencing

bamslam's Introduction

BamSlam

This script was written for long-read Oxford Nanopore Technologies direct RNA/cDNA sequencing data. It uses BAM files produced after mapping with minimap2 to the reference transcriptome. It will output a summary file and plots from your aligned reads. This script was used in: https://doi.org/10.1093/nar/gkab1129.

Inputs:

To obtain a BAM file align your FASTQ/FASTA files to the transcriptome with minimap2.

minimap2 -ax map-ont --sam-hit-only transcriptome.fasta reads.fastq | samtools view -bh > aligned_reads.bam

If minimap2 is not run with --sam-hit-only you should remove unmapped reads prior to running BamSlam to avoid slowing it down. You can also input a BAM file output from NanoCount: https://github.com/a-slide/NanoCount.

Requirements:

R (tested with v4.2.2) R packages:

GenomicAlignments (Bioconductor)
dplyr
tidyr
tibble
data.table
ggplot2
viridis
hexbin

Usage:

Download/copy the Rscript from this repository and run it from the terminal as follows:

Rscript BamSlam.R [DATA_TYPE] [BAM_FILE] [OUT_PREFIX]
Rscript BamSlam.R rna undiff1_5Y.bam undiff1

DATA_TYPE, enter either: cdna, rna
BAM_FILE, a BAM file of alignments to the transcriptome
OUT_PREF, output file prefix

The script takes approximately 5 minutes per million reads.

Outputs:

Data

A summary CSV file of your alignments.
A CSV file of all input alignments (primary and secondary) for downstream analysis.
A CSV file of each unique transcript identified in the data with its corresponding median read coverage.

Plots

Histogram of read coverages (full-length reads cutoff/dashed line = 0.95):
Histogram of known transcript lengths (per distinct transcript in the data):
Histogram of known transcript lengths (per read):
Histogram density plot of known transcript length vs coverage fractions:
Bar chart of the secondary alignments:
Density plot of the read accuracies:

Details of summary file metrics

Total number of reads
Number of reads representing full-length transcripts (reads with coverage fractions > 0.95)
Percentage of reads representing full-length transcripts
Median read coverage fraction (primary alignments)
Median alignment length (primary alignments)
Median accuracy (primary alignments) (calculated from CIGAR strings as: (nbrM+nbrI+nbrD-NM)/(nbrM+nbrI+nbrD))
Number of reads with no secondary alignments
Percentage of reads with no secondary alignments
Total number of distinct transcripts identified in the data
Median coverage fraction of all unique transcripts (median value of the median read coverages per transcript)
Median length of all unique transcripts identified (median length in nt of the annotated length of transcripts identified)

bamslam's People

Contributors

Stargazers

Watchers

Forkers

pyoelii wenmm lkwhite parthomayo

bamslam's Issues

Error appears when running the BamSlam.R

Hi, I am working with Nanopore direct-RNA sequencing and found your tools really useful. However, while I was trying to use your BamSlam script to generate some summaries and figures for the alignments, I came across the error below (I generated the bam file using the exact Minimap2 commands suggested by you).

Do you have any ideas what caused the error below, and how I should fix it? Thank you!

A summary CSV file of your alignments.

Rscript BamSlam.R rna align_reads/Min01_aligned_reads.bam Min01

Error:
Imported bam file
Error in [<-(*tmp*, paste0("nbr", col_names_extract), value = c(514L, :
3661425 rows in value to replace 732285 rows
Execution halted

Rscript BamSlam.R rna align_reads/Min02_aligned_reads.bam Min02

Error:
Imported bam file
Error in [<-(*tmp*, paste0("nbr", col_names_extract), value = c(840L, :
943080 rows in value to replace 188616 rows
Execution halted

I am looking forwards to your respond. Thank you!

Some guidance on usage

Hey! nice paper. I wanted to use this script to provide summary statistics of my mapped reads as well and cite your paper. Looking through the Rscript I can see we need to supply the file-path to the BAM file and output directory but there is also an argument flag for type. I was wondering what this was for.

Can we run simply by:

Rscript in.bam out.directory as in arguments are positional or are there flags?

metric definitions request

Hi folks,
I was looking for a way to test some metrics for ONT RNA reads and found your tool. Bamslam is super useful, and I enjoy how easy it is to run.

I am reviewing results I got for my samples and before I share them with anyone I want to make sure I understand the meanings for some metrics.

In particular, I'm looking for the difference between "Median coverage fraction of transcripts (primary alignments)" and "Median coverage fraction of all unique transcripts". I suspect the former is just the median fraction any primary alignment captures of its matching cDNA. Across the protocols we are testing we see that some have higher values this than others (values from 0.2 to 0.6), and this aligns with our expectation. The "Median coverage fraction of all unique transcripts" however, seems to yield a similar value for each of our samples (values from 0.2 to 0.24) . Can you help me understand this latter metric?

thanks
Richard