
icetea's Introduction

icetea


An R package for analysis of data produced by transcript 5' profiling methods like RAMPAGE and MAPCap.


The icetea R package for analysis of TSS profiling data allows users to process data from multiplexed 5'-profiling techniques such as RAMPAGE and the recently developed MAPCap protocol. TSS detection and differential TSS analysis can be performed using replicates on any 5'-profiling dataset. Left panel: typical analysis steps for MAPCap data that can be performed using icetea. Right panel: some of the quality control and visualization outputs from the package: the proportion of sequencing reads used at each step (top), a comparison of TSS accuracy (w.r.t. annotated TSS) between samples (middle), and MA-plots from differential TSS analysis (bottom).

Additionally, analysis of RNA-binding protein locations via the FLASH protocol can also be performed with icetea.

Installing icetea

Stable version

The stable release of icetea is available via Bioconductor.

## first install BiocManager
install.packages("BiocManager")

## then install icetea
BiocManager::install("icetea")

Latest version

The latest (development) version of icetea can be installed using devtools, which can be installed from CRAN.

## first install devtools
install.packages("devtools")

## then install icetea
devtools::install_github("vivekbhr/icetea")

Documentation

Please visit the icetea website for the package documentation and vignette.

Citation

If you use icetea in your analysis, please cite:

Bhardwaj, V., Semplicio, G., Erdogdu, N. U., Manke, T. & Akhtar, A. MAPCap allows high-resolution detection and differential expression analysis of transcription start sites. Nat. Commun. 10, 3219 (2019)

icetea's People

Contributors

kayla-morrell, nturaga, vivekbhr, vobencha


icetea's Issues

Use summarizeOverlaps

Instead of using regionCounts and windowCounts from csaw, use summarizeOverlaps so that strand can be considered.

selective input of reads into duplicate removal

This would be a big change from the beginning of the workflow. In principle, it's possible to do the PCR duplicate removal and demultiplexing at the same time. The idea is:

  • During mapping, add the readGroupID and readGroup tags to the bam file, based on the fastq sample index (I don't know how).
  • Directly use the filterDuplicates function, but input the reads from the bam file selectively by tag, then output one filtered file per tag, using the tagFilter arg in ScanBamParam.
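
The second bullet could look roughly like the sketch below, assuming each alignment carries an RG tag holding the sample index. The tag name and sample names here are assumptions, and the filterDuplicates call is left as a comment since its exact interface may differ.

```r
## Sketch: select reads per sample via the RG tag using Rsamtools'
## tagFilter, then deduplicate each subset separately.
## Tag name, sample names and the filterDuplicates call are assumptions.
library(Rsamtools)

samples <- c("sample1", "sample2")
for (s in samples) {
    ## read only the alignments carrying this sample's tag
    param <- ScanBamParam(tagFilter = list(RG = s))
    aln <- scanBam("multiplexed.bam", param = param)
    ## ...then run duplicate filtering on this subset and write one
    ## filtered BAM per tag (pseudo-call):
    # filterDuplicates(aln, outfile = paste0(s, ".filtered.bam"))
}
```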

Travis

Following changes are pending for unit testing.

  • Activate travis builds
  • Write testthat checks.
  • Make all examples work (if possible).

filterDuplicates with `UMI` option

filterDuplicates assumes that a UMI is present and throws an error for CAGE data (no UMI). Add an option called UMI (default = TRUE) so that users can set it to FALSE to remove duplicates using only the 5' end.
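
A possible shape for that option (hypothetical signature; the real function's arguments may differ):

```r
## Hypothetical sketch of the proposed UMI argument.
filterDuplicates <- function(CSobject, outdir, UMI = TRUE) {
    if (UMI) {
        ## current behaviour: collapse reads sharing the same
        ## 5' end AND the same UMI
    } else {
        ## proposed behaviour for CAGE data: collapse reads sharing
        ## only the same 5' position and strand
    }
}
```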

taking into consideration strandedness

Looking into the detectTSS.R file, and specifically the strandBinCounts function (line 28), reads are counted as single-end regardless of the paired-end info stored in the object (CSobject@paired_end).

fdata <-
        GenomicAlignments::summarizeOverlaps(
            features = windows$gr.plus,
            reads = bam.files,
            mode = "IntersectionStrict",
            ignore.strand = FALSE,
            inter.feature = ignoreMultiMap,
            singleEnd = TRUE,
            fragments = FALSE,
            preprocess.reads = func,
            param = bam_param,
            BPPARAM = bp_param)

So I think you don't want to count reads twice across the plus (fdata) and minus (rdata) strands, but what if we have a stranded paired-end library? For example:

fdata <-
        GenomicAlignments::summarizeOverlaps(
            features = windows$gr.plus,
            reads = BamFileList(bam.files, asMates = TRUE),
            mode = "IntersectionStrict",
            ignore.strand = FALSE,
            inter.feature = ignoreMultiMap,
            singleEnd = FALSE,
            fragments = FALSE,
            strandMode = 2,
            preprocess.reads = func,
            param = bam_param,
            BPPARAM = bp_param)

And same thing in line 203:

wider <-
                suppressWarnings({
                GenomicAlignments::summarizeOverlaps(
                    features = neighbors,
                    reads = BamFileList(bam.files, asMates = TRUE),
                    mode = "IntersectionStrict",
                    ignore.strand = FALSE,
                    inter.feature = FALSE,
                    singleEnd = FALSE,
                    fragments = FALSE,
                    strandMode = 2,
                    preprocess.reads = ppfunc,
                    param = bamParams,
                    BPPARAM = bpParams)
                  })

Could you tell me if this is OK, or whether counting reads as paired-end will have an effect on the analysis?
Thanks in advance.

reduce deps

Try to reduce dependencies: plyr and reshape2 can go away.

count paired-end

In order to avoid filtering the files for single-end reads, paired-end reads should be counted.

  • Find a way to consider both reads of a fragment during PCR duplicate filtering.
  • During TSS calling, the first base of read 1 (for MAPCap/RAMPAGE) or read 2 (for TruSeq) should be overlapped and counted within the windows; the second read should be ignored.
  • Since bamFiltering would not be needed after implementing this, we can turn on the counting of fragments for gene-level summarization of counts later (for DE, correlation with RNA-seq etc.).
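
One possible sketch for the second bullet: a preprocess.reads function that reduces each pair to the 5' base of read 1 before counting. This assumes the reads arrive as GAlignmentPairs (i.e. singleEnd = FALSE) and is untested.

```r
## Sketch (assumption): reduce each fragment to the 5' base of read 1
## so only that base is overlapped with the TSS windows.
library(GenomicAlignments)

firstBaseOfRead1 <- function(reads, ...) {
    r1 <- GenomicAlignments::first(reads)            # read 1 of each pair
    GenomicRanges::resize(granges(r1), width = 1, fix = "start")
}
## could then be passed as preprocess.reads to summarizeOverlaps()
```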

Move to subjunc for alignment

Move to subjunc for alignment, since subread performs poorly at splice anchors, leading to false mapping of spliced reads.
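
The switch itself would presumably be a small change in the mapping step, along these lines (the index basename and file paths are placeholders):

```r
## Sketch: align with Rsubread's splice-aware subjunc aligner.
library(Rsubread)
subjunc(index = "genome_index",          # placeholder index basename
        readfile1 = "sample_R1.fastq.gz",
        readfile2 = "sample_R2.fastq.gz",
        output_file = "sample.bam")
```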

multicore improvements

  • Initiate the parallel backend early when multiple calls to bplapply are made within a function
  • Add BPPARAM to fitdiffTSS
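
A sketch of the intended pattern, assuming a BiocParallel backend created once and passed down (the BPPARAM argument to fitdiffTSS is the proposed addition, not the current API):

```r
## Sketch: create the backend once, reuse it across bplapply calls
## instead of letting each call spin up its own workers.
library(BiocParallel)

bp <- MulticoreParam(workers = 4)
register(bp)
## hypothetical calls illustrating the intent:
# cs  <- detectTSS(cs, ..., BPPARAM = bp)
# fit <- fitdiffTSS(cs, ..., BPPARAM = bp)   # proposed new argument
```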

plot read metrics

Provide a function to plot the read processing metrics from the CapSet object.

S4 implementation

For ease of use, and to go along with the Bioconductor infrastructure, I am implementing S4 classes to handle the analysis.

  • S4 class CapSet constructed with raw fastq files and barcode info.

  • function trimfastqIndex should use this and trim the index, and should attach the path to the trimmed fastq to the object.

  • function demultiplex_fastq can use the trimmed output to demultiplex the fastq and add the demult_reads info + path to the demultiplexed files into the sampleInfo slot.

  • function mapCaps should be able to execute in a multiplexed and a demultiplexed mode, depending upon whether or not the sampleInfo slot has multiplexed info. It should then attach the proportion of mapped reads to the object (either in sampleInfo or somewhere else).

  • function splitBAM_byIndex may or may not be needed, depending on pre-/post-mapping demultiplexing. In case of post-mapping demultiplexing, it can still use the CapSet object and fill the demult_reads info in the sampleInfo slot.

  • function filterDuplicates should be able to use the CapSet object. In this case the function would run on all the bam files and append the stats of duplicate removal to the sampleInfo slot as dedup_reads. If not given the CapSet object, it would work as a single-sample function (current status).

  • function detectTSS should also be able to use the CapSet object; in this case it would append the design matrix to the object, and the sample names should match the existing sample names in the sampleInfo slot.

  • To push it further, the function fit_diffTSS can also use the CapSet object, replacing the args bam.files and design and taking them from the object. detect_diffTSS would stay the same.

New plotting functions can then be added: taking info from the sampleInfo slot, one can plot the mapped reads, demult_reads, dedup_reads and reads_inTSS (i.e. the number of reads falling in the detected TSSs) for each sample.
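
A minimal sketch of what such a CapSet class could look like (slot names taken from the issue text above; the actual implementation may differ):

```r
## Sketch only: slots follow the issue text, not the released package.
library(methods)

setClass("CapSet",
         slots = c(
             fastq_R1   = "character",   # raw fastq file(s)
             fastq_R2   = "character",
             sampleInfo = "data.frame"   # barcodes + per-step read stats
         ))
```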

calculate FRIT score

During TSS detection, also get the number of reads per sample within peaks and save it within the S4 object.
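
The FRIT score itself is just the per-sample fraction of reads falling within detected TSSs; e.g. with hypothetical counts:

```r
## Sketch with made-up numbers: FRIT = reads within detected TSSs
## divided by total usable reads, per sample.
reads_inTSS <- c(ctrl = 8.0e5, treat = 9.5e5)   # hypothetical counts
total_reads <- c(ctrl = 1.0e6, treat = 1.2e6)   # hypothetical counts
frit <- reads_inTSS / total_reads
```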
