kassambara / fastqcr

fastqcr: Quality Control of Sequencing Data

Home Page: http://www.sthda.com/english/rpkgs/fastqcr

R 100.00%

fastqcr's Issues

sh: 1: fastqc: not found

Hi,

I successfully installed the fastqcr package on R/3.3.3, both locally and on a cluster.
I load the library and then when I try to use the first command:
fastqc(fq.dir = "/home/user/test_fastq", qc.dir = "/home/user/test_fastq/fastqc", threads = 4)
it outputs this error immediately:
sh: 1: fastqc: not found

What am I doing wrong?
I tried running fastqc_install(), but that didn't change anything.

Thank you for your help.

sh: 1: fastqc: not found

Hi there,

I installed the fastqcr package locally on R 3.4.4.
I ran fastqc_install() successfully and then tried:
fastqc(fq.dir = "/home/user/test_fastq", qc.dir = "/home/user/fastqc", threads = 1)
but I got the following error:
sh: 1: fastqc: not found

What should I do? Thanks!

Simnon
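
The message "sh: 1: fastqc: not found" comes from the shell, which suggests that R is calling a fastqc executable that is not on the PATH. Below is a minimal diagnostic sketch in base R. find_fastqc() is a hypothetical helper, and both the ~/bin/FastQC/fastqc location and the fastqc.path argument in the commented call are assumptions to check against ?fastqc and ?fastqc_install for your installed version.

```r
# Hypothetical helper: return the first FastQC executable that exists,
# or NA if none of the candidate locations does.
find_fastqc <- function(candidates = c(Sys.which("fastqc"),
                                       "~/bin/FastQC/fastqc")) {
  candidates <- path.expand(unname(candidates))
  hits <- candidates[nzchar(candidates) & file.exists(candidates)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

fastqc_bin <- find_fastqc()
if (is.na(fastqc_bin)) {
  message("No FastQC executable found on the PATH or in ~/bin/FastQC")
} else {
  message("FastQC found at: ", fastqc_bin)
  # Assumed usage: pass the location explicitly instead of relying on PATH.
  # fastqcr::fastqc(fq.dir = "/home/user/test_fastq",
  #                 qc.dir = "/home/user/test_fastq/fastqc",
  #                 threads = 4, fastqc.path = fastqc_bin)
}
```

If the helper reports nothing, FastQC itself is probably not installed where R can see it, whatever the state of the R package.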

Running FastQC in R from a Windows Machine

Hi!

I installed the package in R and tried to run it with the fastqc() function, but the function checked whether my system is Unix-based and raised an error.

I was wondering whether fastqcr does not support running on a Windows machine, or whether I accidentally installed a version that is only for Unix machines (I did not use the function fastqc_install).

Thanks!

add custom intercept line to qc_plot

Hi, thanks for the great tool!

I have a small question: I am trimming my reads at a certain base position based on the sequence quality. Could you show me how to add a vertical line to my unfiltered qc_plot to indicate where I am trimming?
I tried this, but the plot does not show the line:

p <- qc_plot(qc_file, modules = "Per base sequence quality")
p + geom_vline(xintercept = 240, color = "red")

Cheers,
Alex
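
One possible explanation, assuming the single-sample plot treats Base as a factor the way the collection plot quoted elsewhere on this page does: on a discrete axis, xintercept is an index into the factor levels, not a base position, so 240 may fall outside the plotted range. The sketch below maps a base position to the index of the FastQC bin that contains it. bin_index() is a hypothetical helper, and the labels are toy data.

```r
# Hypothetical helper: index of the bin label ("7" or "10-14") containing `position`.
bin_index <- function(base_labels, position) {
  parts <- strsplit(as.character(base_labels), "-", fixed = TRUE)
  lower <- as.numeric(vapply(parts, function(p) p[1], character(1)))
  upper <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))
  hit <- which(lower <= position & position <= upper)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Toy labels in the style of FastQC's "Base" column
labels <- c(as.character(1:9), "10-14", "15-19", "20-24")
idx <- bin_index(labels, 17)
print(idx)

# Assumed usage with a real report (column name taken from the plot code on this page):
# qc  <- fastqcr::qc_read(qc_file)
# idx <- bin_index(unique(qc$per_base_sequence_quality$Base), 240)
# p + ggplot2::geom_vline(xintercept = idx, color = "red")
```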

qc_read_collection() error: Can't combine <double> and <character>

Dear Alboukadel,

Many thanks for this and other handy R packages!

I've been using qc_read_collection() on many "*_fastqc.zip" files and noticed that this function runs into dplyr issue #5358 when binding data.frames.

Here is a reprex leading to the error in lapply(res, dplyr::bind_rows, .id = "sample") inside qc_read_collection():

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# create example data.frames to be bound using dplyr::bind_rows()
dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = c("150-155"), Count = 4)
de <- data.frame(array(NA, dim = c(0,0)))
res <- list(module = list(Sample1=dn, Sample2=ds, Sample3=de))

str(res)
#> List of 1
#>  $ module:List of 3
#>   ..$ Sample1:'data.frame':  1 obs. of  2 variables:
#>   .. ..$ Length: num 150
#>   .. ..$ Count : num 2
#>   ..$ Sample2:'data.frame':  1 obs. of  2 variables:
#>   .. ..$ Length: chr "150-155"
#>   .. ..$ Count : num 4
#>   ..$ Sample3:'data.frame':  0 obs. of  0 variables

# reproduce the error
res <- lapply(res, dplyr::bind_rows, .id = "sample")
#> Error: Can't combine `Sample1$Length` <double> and `Sample2$Length` <character>.

The error above will occur when calling qc_read_collection(files, sample_names, modules = "all") on a collection of "*_fastqc.zip" files, if there is a sample in files that has a different class for any variable in the data.frame to be bound.

In my case, this happened mostly with the modules $sequence_length_distribution (variable "Length") or $kmer_content (variable "Max Obs/Exp Position").

Here is a possible fix I came up with:

# convert <double> to <character> if a column should be <character>
res <- lapply(res, function(x) {
  # tibble with classes for each non-empty data.frame column
  dcl <- dplyr::bind_rows(lapply(x, function(y) {
    if (nrow(y) > 0) sapply(y, class)
  }))
  # define classes to assign
  cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
  # assign classes
  lapply(x, function(w) {
    if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
  })
})

# reproduce the fix
res <- lapply(res, dplyr::bind_rows, .id = "sample")
str(res)
#> List of 1
#>  $ module:'data.frame':  2 obs. of  3 variables:
#>   ..$ sample: chr [1:2] "Sample1" "Sample2"
#>   ..$ Length: chr [1:2] "150" "150-155"
#>   ..$ Count : num [1:2] 2 4

Created on 2021-12-21 by the reprex package (v2.0.1)

Perhaps a patch for qc_read_collection() similar to the one below (enclosed by ##<##<##) could be useful generally, given that dplyr is not going to fix this because it is a "deliberate design decision" (see #5358)?

qc_read_collection <- function(files, sample_names, modules = "all", verbose = TRUE)
{
  module_data <- lapply(files, qc_read, modules = modules, 
                        verbose = verbose)
  if (missing(sample_names) || length(sample_names) != length(files)) {
    sample_names <- lapply(module_data, function(x) unique(x$summary))
    sample_names <- unlist(sample_names)
  }
  names(module_data) <- sample_names
  module_names <- unique(unlist(lapply(module_data, names)))
  res <- list()
  for (i in seq_along(module_names)) {
    res[[i]] <- lapply(module_data, function(x) as.data.frame(x[[module_names[i]]]))
  }
  names(res) <- module_names
  
  ##<##<## begin patch
  res <- lapply(res, function(x) {
    dcl <- dplyr::bind_rows(lapply(x, function(y) {
      if (nrow(y) > 0) sapply(y, class)
      }))
    cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
    lapply(x, function(w) {
      if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
    })
  })
  ##<##<## end patch
  
  res <- lapply(res, dplyr::bind_rows, .id = "sample")
  res <- structure(res, class = c("list", "qc_read_collection"))
  res
}

Perhaps you'd like to look into this yourself, and maybe come up with an easier and prettier solution? :)

Cheers,
Simon
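
As one possibly simpler variant of the idea above, the sketch below coerces a column to character only when its class differs between samples, then binds the frames. It uses base R only. harmonise_types() is a hypothetical helper and not part of fastqcr, and the toy frames mirror the reprex.

```r
# Hypothetical helper: make column types agree across a list of data frames.
harmonise_types <- function(dfs) {
  dfs <- Filter(function(d) nrow(d) > 0, dfs)  # drop empty modules
  cols <- unique(unlist(lapply(dfs, names)))
  for (col in cols) {
    # First class of this column in every frame that has it
    classes <- unlist(lapply(dfs, function(d) {
      if (col %in% names(d)) class(d[[col]])[1]
    }))
    if (length(unique(classes)) > 1) {
      # Mixed types: character is the only lossless common type
      dfs <- lapply(dfs, function(d) {
        if (col %in% names(d)) d[[col]] <- as.character(d[[col]])
        d
      })
    }
  }
  dfs
}

dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = "150-155", Count = 4)
de <- data.frame()
fixed <- harmonise_types(list(Sample1 = dn, Sample2 = ds, Sample3 = de))

# Base R equivalent of bind_rows(.id = "sample")
bound <- data.frame(
  sample = rep(names(fixed), vapply(fixed, nrow, integer(1))),
  do.call(rbind, fixed),
  row.names = NULL,
  stringsAsFactors = FALSE
)
str(bound)
```

After harmonising, dplyr::bind_rows(fixed, .id = "sample") should also work, since the conflicting column is character in every frame.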

Wondering if possible to modify the color/ggplot parameters of qc_plot_collection

Hello,
I was wondering if it is possible to edit the ggplot parameters of the qc_plot_collection function? I can see the code I would like to edit in the GitHub repo (R/qc_plot_collection.R), but the R folder of my installed package only contains fastqcr, fastqcr.rdb and fastqcr.rdx.

I would like to change this section:

# Per base sequence quality
.plot_base_quality_collection <- function(qc, ggtheme = theme_minimal(), ...){
  
  .names <- names(qc)
  if(!("per_base_sequence_quality" %in% .names))
    return(NULL)
  . <- NULL
  
  d <- qc$per_base_sequence_quality
  if(nrow(d) == 0) return(NULL)
  
  colnames(d) <- make.names(colnames(d))
  d$Base <- factor(d$Base, levels = unique(d$Base))
  # Select some breaks
  nlev <- nlevels(d$Base)
  breaks <- scales::extended_breaks()(1:nlev)[-1] %>% # index
    c(1, ., nlev) %>% # Add the minimum & the max
    d$Base[.] %>% # Values
    as.vector()
  
  
  ggplot()+
    geom_line(data = d, aes_string(x = "Base", y = "Median", group = 'sample', color = 'sample')) +
    expand_limits(x = 0, y = 0)+
    geom_rect(aes(xmin = 0, ymin = 0, ymax = 20, xmax = Inf),
              fill = "red", alpha = 0.2)+
    geom_rect(aes(xmin = 0, ymin = 20, ymax = 28, xmax = Inf),
              fill = "yellow", alpha = 0.2)+
    geom_rect(aes(xmin = 0, ymin = 28, ymax = Inf, xmax = Inf),
              fill = "#00AFBB", alpha = 0.2)+
    scale_x_discrete(breaks = breaks)+
    labs(title = "Per base sequence quality", x = "Position in read (pb)",
         y = "Median quality scores",
         subtitle = "Red: low quality zone")+
    theme_minimal()
}

Specifically, I would like to change the part where color = 'sample', so that all the lines are drawn with color = "grey40", alpha = 0.35.

Thank you for making such an awesome package!
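
Installed packages store their code in the .rdb/.rdx files, so there is no R source to edit in place. One workaround that avoids touching the package is to draw the plot yourself from the collection data. The sketch below assumes ggplot2 is installed and uses a toy data frame as a stand-in for qc$per_base_sequence_quality, with column names taken from the function quoted above. It is an illustration, not the package's own plotting code.

```r
library(ggplot2)

# Toy stand-in for qc$per_base_sequence_quality from qc_read_collection()
d <- data.frame(
  sample = rep(c("S1", "S2"), each = 4),
  Base   = rep(c("1", "2", "3-4", "5-6"), times = 2),
  Median = c(32, 34, 36, 35, 30, 33, 34, 31)
)
# Keep the bins in their original order on the x axis
d$Base <- factor(d$Base, levels = unique(d$Base))

# One grey, semi-transparent line per sample instead of color = sample
p <- ggplot(d, aes(x = Base, y = Median, group = sample)) +
  geom_line(color = "grey40", alpha = 0.35) +
  expand_limits(y = 0) +
  labs(title = "Per base sequence quality",
       x = "Position in read",
       y = "Median quality scores") +
  theme_minimal()

print(p)
```

Setting color and alpha outside aes() applies the same fixed values to every line, while group = sample still keeps one line per sample.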

Match name error

Hi!

Thanks for the package! I installed and started to use with 200+ samples. I could run fastqc(), I can see and browse the outputs from cmd, but qc_aggregate() gives the following error:

fastqcr::qc_aggregate("fastqc/", progressbar = FALSE)

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

Same with the example data. Since I cannot get a qc_aggregate class object, no downstream function works.

qc_report() throws back the following:

fastqcr::qc_report("fastqc/", result.file = "multi-qc-report")

Quitting from lines 51-53 (multi-qc-report.Rmd) 
Error in qc_aggregate(qc.path, progressbar = FALSE) : 
  Can't find any *fastqc.zip files in the specified qc.dir

But there are more than 200 files named *fastqc.zip in that directory.

So I just can't use any of the cool functions of the package. Do you have any idea how to fix this?

Thanks in advance!

>devtools::session_info()
Session info ------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Asia/Tokyo                  
 date     2018-08-23

SSL certificate expired

I am getting the following error when running fastqc_install()

> fastqc_install()
Error in open.connection(x, "rb") : 
  SSL certificate problem: certificate has expired

Overlaying plots ?

Nice package! I was wondering if it is possible to overlay the QC plots, e.g. before and after the reads are "cleaned", for a straightforward comparative analysis.
Another issue is that I have to use the development branch of ggplot2 just to use this package, which can be a little troublesome.

`qc_plot(modules='all')` does not check whether the FastQC report has data for that module.

After FastQC version 0.11.6, kmer content is no longer reported; see https://github.com/s-andrews/FastQC/releases/tag/v0.11.6. However, when we run qc_plot with modules='all', the result of .valid_fastqc_modules(modules='all') still contains "Kmer Content", even though qc$summary no longer has an entry for it. As a result, the plot.func(qc, status = status) step throws an error when the modules reach "kmer_content", because status = NA.

qc_plot should check whether qc has data for each module returned by .valid_fastqc_modules(modules='all'). A PR is nearly ready and will be submitted today.

The attached 5k_pbmc_protein_v3_nextgem_gex_S1_L001_R1_001_fastqc.zip is an example of this; it was generated by running FastQC version 0.12.1 with default parameters on one of the FASTQ files from the 10X Genomics 5k_pbmc_protein_v3_nextgem data set.
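
The kind of check described above can be sketched in base R as follows. keep_available_modules() is a hypothetical helper, not the code of the announced PR, and it assumes that qc$summary has a module column holding the module names. The qc object here is a toy stand-in.

```r
# Hypothetical helper: keep only the requested modules the report has data for.
keep_available_modules <- function(requested, qc) {
  available <- tolower(qc$summary$module)
  keep <- tolower(requested) %in% available
  if (any(!keep)) {
    message("Skipping modules without data: ",
            paste(requested[!keep], collapse = ", "))
  }
  requested[keep]
}

# Toy report summary without a "Kmer Content" entry
qc <- list(summary = data.frame(
  status = c("PASS", "WARN"),
  module = c("Basic Statistics", "Per base sequence quality")
))

to_plot <- keep_available_modules(c("Basic Statistics", "Kmer Content"), qc)
print(to_plot)
```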

qc_report cannot produce rmd due to permission issue

I am using fastqcr with R 4.1.0.

I tried to run the qc_report function on a single zip archive within my user's directory,
but I got the error message below:

List of 3
 $ fig.width : num 4
 $ fig.height: num 3.5
 $ fig.align : chr "center"

  |..................................................................    |  94%
  ordinary text without R code

  |....................................................................  |  97%
label: kmer-content (with options) 
List of 3
 $ fig.width : num 4
 $ fig.height: num 3.5
 $ fig.align : chr "center"

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
Calls: qc_report ... element_text -> .handleSimpleError -> h -> cat -> file
In addition: Warning messages:
1: Missing column names filled in: 'X1' [1] 
2: In file(file, ifelse(append, "a", "w")) :
  cannot open file 'sample-report.knit.md': Permission denied

Execution halted

The message points to a permission problem.
I have not yet found where these messages come from,
but when I run as root, the "Permission denied" message disappears.

My guess is that there is an invalid setting related to knitr.
How can we resolve this problem?
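
The intermediate file sample-report.knit.md is opened with a relative path, so one possible cause (an assumption, not confirmed in this thread) is that knitting happens in a directory the current user cannot write to. A minimal base R check of which directories are writable is shown below. is_writable() is a hypothetical helper.

```r
# Hypothetical helper: TRUE if `path` exists and the current user can write to it.
is_writable <- function(path) {
  file.exists(path) && unname(file.access(path, mode = 2)) == 0
}

dirs <- c(working_dir = getwd(), temp_dir = tempdir())
writable <- vapply(dirs, is_writable, logical(1))
print(writable)

# If the working directory is not writable, switching to one that is before
# calling qc_report() may be worth trying:
# setwd(tempdir())
```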

Submitting fastqcr 0.1.3 to CRAN

qc_report will only take full path to zipfiles

I'm running fastqcr from the folder that contains all the zip files.

qc.path <- "."

Running list.files(qc.path) gives files as expected, and qc <- qc_aggregate(qc.path) gives a valid report, however qc_report(qc.path, result.file = "test") gives:

Quitting from lines 51-53 (multi-qc-report.Rmd)
Error in qc_aggregate(qc.path, progressbar = FALSE) :
Can't find any *fastqc.zip files in the specified qc.dir

Replacing the relative path with a full path to the same directory allows both qc_aggregate and qc_report to run, but requiring full paths limits the program's usability for scripting. This is on an HPC, running R/3.2.0 and pandoc/1.17.3
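
For scripting, the relative path can be expanded at run time instead of being hard-coded. This is only a workaround sketch in base R: it rests on the observation above that full paths work, and the commented qc_report() call repeats the arguments used in this issue.

```r
# Resolve "." to an absolute path once, then use it everywhere
qc.path <- normalizePath(".", winslash = "/", mustWork = TRUE)
result.file <- file.path(qc.path, "test")

print(qc.path)
# fastqcr::qc_report(qc.path, result.file = result.file)
```

The script stays portable because the absolute path is computed from wherever it is launched.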

`select()` instead of `select_()` in qc_aggregate.R

Hi,

While running qc_aggregate, I get a warning message saying:

1: select_() was deprecated in dplyr 0.7.0.
ℹ Please use select() instead.
ℹ The deprecated feature was likely used in the fastqcr package.
Please report the issue at https://github.com/kassambara/fastqcr/issues.

It would be great if you can update this in line 85 of qc_aggregate.R or any other line that is generating the warning message.

Thank you.

Sincerely,
Eliza Dhungel
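
For reference, the usual replacement for select_() with column names held as strings is select() combined with all_of(). The sketch below assumes dplyr 1.0.0 or later and uses a toy data frame. The actual call in qc_aggregate.R may look different.

```r
library(dplyr)

# Toy stand-in for the aggregated QC table
df <- data.frame(
  sample = c("S1", "S2"),
  module = c("Basic Statistics", "Adapter Content"),
  status = c("PASS", "FAIL")
)

cols <- c("sample", "status")

# Deprecated form: select_(df, .dots = cols)
selected <- df %>% select(all_of(cols))
print(selected)
```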

Per tile sequence quality

Is there a reason why this is left out of the plot code? I can see that there is a function for it in the code, but .plot_func never references that function. Is it not possible to create this plot?

select()

select_() was deprecated in dplyr 0.7.0.
Please use select() instead.

Column `Length` can't be converted from numeric to character

Hi,
I would like to plot FastQC data of multiple samples, but I encounter a problem with the qc_read_collection function (while the qc_read and qc_report functions work fine).
Do you have any idea what is wrong?

Thanks for your help!
Annabelle

qc.files <- list.files(qc.dir, full.names = TRUE)
samples <- c("ERR158720_1_700kreads_HiSeq2k", "SRR5909287_1_849kreads_MiSeq", "SRR7748059_1_800kreads_HiSeq4k")
qc <- qc_read_collection(qc.files, sample_names = samples)

Reading: /path-to/ERR158720_1_700kreads_HiSeq2k_fastqc.zip
Reading: /path-to/SRR5909287_1_849kreads_MiSeq_fastqc.zip
Reading: /path-to/SRR7748059_1_800kreads_HiSeq4k_fastqc.zip
Error: Column Length can't be converted from numeric to character

Cannot open zip file including colon ':' in the filename

Hi,

I ran into an issue with a bunch of FastQC files that include a colon (:) in their filenames. Renaming the files is not an option. While this character is not allowed in Windows filenames, Linux allows any character other than NUL and /. See the session below, which uses example FastQC reports / zip files (obtained from test[:err].fastq.bz2 input files):

test_fastqc.zip ✔️
test:err_fastqc.zip ❌

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
...
> packageVersion("fastqcr")
[1] ‘0.1.2’
> library(fastqcr)
> qc <- qc_read("test_fastqc.zip")
Reading: test_fastqc.zip
> names(qc)
 [1] "summary"                       "basic_statistics"             
 [3] "per_base_sequence_quality"     "per_tile_sequence_quality"    
 [5] "per_sequence_quality_scores"   "per_base_sequence_content"    
 [7] "per_sequence_gc_content"       "per_base_n_content"           
 [9] "sequence_length_distribution"  "sequence_duplication_levels"  
[11] "overrepresented_sequences"     "adapter_content"              
[13] "kmer_content"                  "total_deduplicated_percentage"
> qc <- qc_read("test:err_fastqc.zip")
Reading: test:err_fastqc.zip
Error in open.connection(con, "rb") : cannot open the connection
In addition: Warning message:
In open.connection(con, "rb") :
  cannot open zip file 'test:err_fastqc.zip:test'
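
Until the reader handles such names, one workaround that leaves the original files untouched is to read from a temporary copy with a sanitised name. The base R sketch below assumes a file system that allows colons, such as Linux. safe_copy() is a hypothetical helper. Note that any sample name derived from the file name would then be test_err rather than test:err.

```r
# Hypothetical helper: copy `path` into tempdir() under a name with only
# letters, digits, dots, underscores and hyphens; return the new path.
safe_copy <- function(path) {
  safe_name <- gsub("[^A-Za-z0-9._-]", "_", basename(path))
  dest <- file.path(tempdir(), safe_name)
  if (!file.copy(path, dest, overwrite = TRUE)) {
    stop("Could not copy ", path)
  }
  dest
}

# Toy demonstration with a placeholder file instead of a real FastQC archive
src <- file.path(tempdir(), "test:err_fastqc.zip")
writeLines("placeholder", src)
dest <- safe_copy(src)
print(basename(dest))

# Assumed usage with a real archive:
# qc <- fastqcr::qc_read(safe_copy("test:err_fastqc.zip"))
```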

reference genome species

Hello developers of fastqcr: do you use the mouse or the human genome as the reference for the theoretical distribution (e.g. the theoretical distribution for per sequence GC content)? I wasn't able to find this information in the vignette. Thank you very much!

shiny app for linked one sample plots in the multiple QC report

Hello,

thanks again for all the packages you are developing.
I am using fastqcr for the first time and it works well, no problems so far.
It would be great to have a Shiny app that displays the Multiple QC Report and lets us click on a problematic sample to obtain its One Sample Report with plots. I know several biologists who would love this feature.

cheers

extend `qc_read_collection` to take list of `qc_read` not only raw files

Hi @kassambara

I previously suggested qc_read_collection and qc_plot_collection to handle multiple files of FastQC output. The read function, however, is limited to the raw FastQC files. I want to suggest adding a data argument to qc_read_collection for the case where the raw files are not available and a list of qc_read objects is distributed as an R object instead.

The behavior of qc_read_collection can be modified to something like this

# extract paths to the demo files
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.files <- list.files(qc.dir, full.names = TRUE)

# read list of files
qc_list <- lapply(qc.files, qc_read)

# make a collection object
qc_collection <- qc_read_collection(data = qc_list, sample_names = paste('S', 1:5, sep = ''))
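
To make the proposal concrete, here is a base R sketch of what binding an already-loaded list could look like. collect_qc() is a hypothetical function, not an existing fastqcr API. The toy objects stand in for qc_read results, and the class vector follows the qc_read_collection source quoted elsewhere on this page.

```r
# Hypothetical function: combine a list of qc_read-like objects, module by module.
collect_qc <- function(qc_list, sample_names) {
  stopifnot(length(qc_list) == length(sample_names))
  names(qc_list) <- sample_names
  module_names <- unique(unlist(lapply(qc_list, names)))
  res <- lapply(module_names, function(m) {
    parts <- lapply(qc_list, function(x) as.data.frame(x[[m]]))
    parts <- Filter(function(d) nrow(d) > 0, parts)  # skip missing modules
    if (length(parts) == 0) return(data.frame())
    data.frame(
      sample = rep(names(parts), vapply(parts, nrow, integer(1))),
      do.call(rbind, parts),
      row.names = NULL,
      stringsAsFactors = FALSE
    )
  })
  names(res) <- module_names
  structure(res, class = c("list", "qc_read_collection"))
}

# Toy stand-ins for two qc_read() results
qc_list <- list(
  list(basic_statistics = data.frame(Measure = "Total Sequences", Value = "100")),
  list(basic_statistics = data.frame(Measure = "Total Sequences", Value = "250"))
)
collection <- collect_qc(qc_list, sample_names = c("S1", "S2"))
print(collection$basic_statistics)
```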

getting NA in some modules of qc.aggregation function in new version (0.1.1)

Hello,

I had the same issue as #10. Based on your suggestion I updated to the newer version (0.1.1); that error is gone and I was able to generate the QC metrics, but some of the modules are NA, while they worked with the previous version (0.1.0) on the same data. Do you have any thoughts?
Thank you.
qc$tot.seq
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[93] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
qcstat<- qc_stats(qc)

qcstat
   sample    pct.dup pct.gc tot.seq seq.length
 1 C_rep1_R1    72.4     NA      NA         NA
 2 C_rep1_R2    72.8     NA      NA         NA
 3 C_rep2_R1    72.8     NA      NA         NA
 4 C_rep2_R2    72.8     NA      NA         NA
 5 C_rep3_R1    71.6     NA      NA         NA
 6 C_rep3_R2    70.8     NA      NA         NA
 7 P_rep1_R1    69.1     NA      NA         NA
 8 P_rep1_R2    67.9     NA      NA         NA
 9 P_rep2_R1    68.6     NA      NA         NA
10 P_rep2_R2    67.4     NA      NA         NA
11 P_rep3_R1    72.3     NA      NA         NA
12 P_rep3_R2    71.3     NA      NA         NA