kassambara / fastqcr
fastqcr: Quality Control of Sequencing Data
Home Page: http://www.sthda.com/english/rpkgs/fastqcr
Hi,
I would like to plot FastQC data of multiple samples, but I encounter a problem with the qc_read_collection function (while the qc_read and qc_report functions work fine).
Do you have any idea what is wrong?
Thanks for your help!
Annabelle
qc.files <- list.files(qc.dir, full.names = TRUE)
samples <- c("ERR158720_1_700kreads_HiSeq2k", "SRR5909287_1_849kreads_MiSeq", "SRR7748059_1_800kreads_HiSeq4k")
qc <- qc_read_collection(qc.files, sample_names = samples)
Reading: /path-to/ERR158720_1_700kreads_HiSeq2k_fastqc.zip
Reading: /path-to/SRR5909287_1_849kreads_MiSeq_fastqc.zip
Reading: /path-to/SRR7748059_1_800kreads_HiSeq4k_fastqc.zip
Error: Column `Length` can't be converted from numeric to character
Hi!
I installed the package in R and tried to run it (using the fastqc() function), but it checked whether my system was Unix-based and gave an error.
I was wondering whether fastqcr does not support running on a Windows machine, or whether I accidentally installed a version that is only for Unix machines (I did not use the fastqc_install function).
Thanks!
Hi,
While running qc_aggregate, I get a warning message saying:
1: select_() was deprecated in dplyr 0.7.0.
ℹ Please use select() instead.
ℹ The deprecated feature was likely used in the fastqcr package.
Please report the issue at https://github.com/kassambara/fastqcr/issues.
It would be great if you could update this in line 85 of qc_aggregate.R, or in any other line that generates the warning message.
Thank you.
Sincerely,
Eliza Dhungel
I'm running fastqcr from the folder that contains all the zip files.
qc.path <- "."
Running list.files(qc.path) gives files as expected, and qc <- qc_aggregate(qc.path) gives a valid report; however, qc_report(qc.path, result.file = "test") gives:
Quitting from lines 51-53 (multi-qc-report.Rmd)
Error in qc_aggregate(qc.path, progressbar = FALSE) :
Can't find any *fastqc.zip files in the specified qc.dir
Replacing the relative path with a full path to the same directory allows both qc_aggregate and qc_report to run, but requiring full paths limits the program's usability for scripting. This is on an HPC, running R/3.2.0 and pandoc/1.17.3
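A possible workaround for scripts, sketched under the assumption that the report template is rendered from a different working directory (which would explain why "." stops pointing at the zip files): expand the relative path to an absolute one before calling the package functions. normalizePath() is base R; the commented calls are the ones from the report above.

```r
# Expand a relative directory to an absolute path before handing it to
# functions that may change the working directory internally.
qc.path <- normalizePath(".", mustWork = TRUE)

# Then, as in the report above (requires fastqcr):
# qc <- qc_aggregate(qc.path)
# qc_report(qc.path, result.file = "test")
```

This keeps the script itself free of hard-coded full paths, since the expansion happens at run time.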
Hello,
I was wondering if it is possible to edit the ggplot parameters of the qc_plot_collection function? I can see the source code I would like to edit in the GitHub repo (R/qc_plot_collection.R), but when I check the R folder of the installed package in my directory, it only contains fastqcr, fastqcr.rdb and fastqcr.rdx.
I would like to change this section:
# Per base sequence quality
.plot_base_quality_collection <- function(qc, ggtheme = theme_minimal(), ...){
  .names <- names(qc)
  if(!("per_base_sequence_quality" %in% .names))
    return(NULL)
  . <- NULL
  d <- qc$per_base_sequence_quality
  if(nrow(d) == 0) return(NULL)
  colnames(d) <- make.names(colnames(d))
  d$Base <- factor(d$Base, levels = unique(d$Base))
  # Select some breaks
  nlev <- nlevels(d$Base)
  breaks <- scales::extended_breaks()(1:nlev)[-1] %>% # index
    c(1, ., nlev) %>% # Add the minimum & the max
    d$Base[.] %>% # Values
    as.vector()
  ggplot()+
    geom_line(data = d, aes_string(x = "Base", y = "Median", group = 'sample', color = 'sample')) +
    expand_limits(x = 0, y = 0)+
    geom_rect(aes(xmin = 0, ymin = 0, ymax = 20, xmax = Inf),
              fill = "red", alpha = 0.2)+
    geom_rect(aes(xmin = 0, ymin = 20, ymax = 28, xmax = Inf),
              fill = "yellow", alpha = 0.2)+
    geom_rect(aes(xmin = 0, ymin = 28, ymax = Inf, xmax = Inf),
              fill = "#00AFBB", alpha = 0.2)+
    scale_x_discrete(breaks = breaks)+
    labs(title = "Per base sequence quality", x = "Position in read (pb)",
         y = "Median quality scores",
         subtitle = "Red: low quality zone")+
    theme_minimal()
}
Specifically, I would like to change the part where color = 'sample', so that all the lines are drawn with color = "grey40", alpha = 0.35 instead.
Thank you for making such an awesome package!
Nice package! I was wondering if it is possible to overlay the QC plots, e.g. before and after the reads are "cleaned", for a straightforward comparative analysis.
Another issue is that I have to use the development branch of ggplot2 just to use this package, which can be a little troublesome.
Is there a reason as to why this is left out in the plot code? I can see that there is a function for this in the code, but there is no reference to this function in the .plot_func. Is it not possible to create this plot?
Hello,
I had the same issue as #10 and, based on your suggestion, updated to the newer version (0.1.1). That error is gone and I was able to generate the QC metrics, but some of the modules are NA, while they worked with the previous version (0.1.0) on the same data! Do you have any thoughts?
Thank you.
qc$tot.seq
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[93] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
qcstat <- qc_stats(qc)
qcstat
      sample pct.dup pct.gc tot.seq seq.length
1  C_rep1_R1    72.4     NA      NA         NA
2  C_rep1_R2    72.8     NA      NA         NA
3  C_rep2_R1    72.8     NA      NA         NA
4  C_rep2_R2    72.8     NA      NA         NA
5  C_rep3_R1    71.6     NA      NA         NA
6  C_rep3_R2    70.8     NA      NA         NA
7  P_rep1_R1    69.1     NA      NA         NA
8  P_rep1_R2    67.9     NA      NA         NA
9  P_rep2_R1    68.6     NA      NA         NA
10 P_rep2_R2    67.4     NA      NA         NA
11 P_rep3_R1    72.3     NA      NA         NA
12 P_rep3_R2    71.3     NA      NA         NA
Hi,
I ran into an issue with a bunch of FASTQC files which include a colon (:) in their filenames. Renaming the files is not an option. While this character is not allowed in Windows filenames, Linux allows any character other than nul and /. See the session using example FASTQC reports / zip files (obtained from test[:err].fastq.bz2 input files):
test_fastqc.zip ✔️
test:err_fastqc.zip ❌
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
...
> packageVersion("fastqcr")
[1] ‘0.1.2’
> library(fastqcr)
> qc <- qc_read("test_fastqc.zip")
Reading: test_fastqc.zip
> names(qc)
[1] "summary" "basic_statistics"
[3] "per_base_sequence_quality" "per_tile_sequence_quality"
[5] "per_sequence_quality_scores" "per_base_sequence_content"
[7] "per_sequence_gc_content" "per_base_n_content"
[9] "sequence_length_distribution" "sequence_duplication_levels"
[11] "overrepresented_sequences" "adapter_content"
[13] "kmer_content" "total_deduplicated_percentage"
> qc <- qc_read("test:err_fastqc.zip")
Reading: test:err_fastqc.zip
Error in open.connection(con, "rb") : cannot open the connection
In addition: Warning message:
In open.connection(con, "rb") :
cannot open zip file 'test:err_fastqc.zip:test'
Hi!
Thanks for the package! I installed it and started to use it with 200+ samples. I could run fastqc(), and I can see and browse the outputs from cmd, but qc_aggregate() gives the following error:
fastqcr::qc_aggregate("fastqc/", progressbar = FALSE)
Error in match.names(clabs, names(xi)) :
names do not match previous names
Same with the example data. Since I cannot get a qc_aggregate class object, no downstream function works.
qc_report() throws back the following:
fastqcr::qc_report("fastqc/", result.file = "multi-qc-report")
Quitting from lines 51-53 (multi-qc-report.Rmd)
Error in qc_aggregate(qc.path, progressbar = FALSE) :
Can't find any *fastqc.zip files in the specified qc.dir
But there are more than 200 files named *fastqc.zip in that directory.
So I just can't use any of the cool functions of the package. Do you have any idea how to fix this?
Thanks in advance!
>devtools::session_info()
Session info ------------------------------------------------------------------
setting value
version R version 3.5.0 (2018-04-23)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz Asia/Tokyo
date 2018-08-23
I am getting the following error when running fastqc_install():
> fastqc_install()
Error in open.connection(x, "rb") :
SSL certificate problem: certificate has expired
Hi, thanks for the great tool!
I have a small question: I am trimming my reads at a certain base position based on the sequence quality. Could you help me add a vertical line to my unfiltered qc_plot to indicate where I am trimming?
I tried this, but the plot does not show the line:
p <- qc_plot(qc_file, modules = "Per base sequence quality")
p + geom_vline(xintercept = 240, color = "red")
Cheers,
Alex
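One plausible explanation, assuming the single-sample plot uses the same discrete x-axis as the .plot_base_quality_collection source quoted in another issue on this page (Base as a factor with scale_x_discrete()): on a discrete axis, xintercept refers to the position of a level, not to the read position, and FastQC may group positions into bins such as "240-244". The helper below is a hypothetical sketch (base_bin_index is not part of fastqcr) that looks up which axis level contains a given read position; the labels in the example are made up.

```r
# Return the index of the axis level (bin) that contains read position `pos`.
# `levels` are FastQC "Base" labels such as "1", "2", "10-14", "240-244".
base_bin_index <- function(levels, pos) {
  parts <- strsplit(as.character(levels), "-", fixed = TRUE)
  lower <- as.numeric(vapply(parts, function(p) p[1], character(1)))
  upper <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))
  idx <- which(lower <= pos & pos <= upper)
  if (length(idx) == 0) NA_integer_ else idx[1]
}

# Hypothetical labels: nine single bases, then bins of five positions
labels <- c(as.character(1:9),
            paste(seq(10, 245, by = 5), seq(14, 249, by = 5), sep = "-"))
idx <- base_bin_index(labels, 240)

# Possible usage (requires fastqcr and ggplot2; assumes qc_read() returns
# a per_base_sequence_quality table with a Base column):
# qc <- qc_read(qc_file)
# idx <- base_bin_index(unique(qc$per_base_sequence_quality$Base), 240)
# p + geom_vline(xintercept = idx, color = "red")
```

If the Base labels in your data are not binned, the index simply equals the read position and the original call should already be equivalent, so it is worth printing the labels first.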
I am using fastqcr with R 4.1.0.
I tried to run the qc_report function on a single zip archive within my user directory, but I got an error message like the one below:
List of 3
$ fig.width : num 4
$ fig.height: num 3.5
$ fig.align : chr "center"
|.................................................................. | 94%
ordinary text without R code
|.................................................................... | 97%
label: kmer-content (with options)
List of 3
$ fig.width : num 4
$ fig.height: num 3.5
$ fig.align : chr "center"
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Calls: qc_report ... element_text -> .handleSimpleError -> h -> cat -> file
In addition: Warning messages:
1: Missing column names filled in: 'X1' [1]
2: In file(file, ifelse(append, "a", "w")) :
cannot open file 'sample-report.knit.md': Permission denied
Execution halted
It reports a permission problem.
I have not yet found where the above messages are produced, but when I run it as root, the "Permission denied" message disappears.
I guess there is an invalid setting related to knitr.
How can we resolve this problem?
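The fact that the error disappears as root suggests that some directory involved in rendering is not writable by the regular user; which one cannot be read off the log. A small diagnostic in base R, offered as a sketch: is_writable_dir is a hypothetical helper, and the idea that the package installation directory is involved is only an assumption.

```r
# TRUE if `dir` exists and the current user may write into it.
is_writable_dir <- function(dir) {
  dir.exists(dir) && unname(file.access(dir, mode = 2)) == 0
}

# Directories worth checking first
dirs <- c(working = getwd(), temp = tempdir())

# Assumption: the intermediate .knit.md may be written next to the report
# template inside the installed package (uncomment if fastqcr is installed):
# dirs <- c(dirs, package = find.package("fastqcr"))

writable <- vapply(dirs, is_writable_dir, logical(1))
print(writable)
```

If the package directory turns out to be the non-writable one, installing fastqcr into a personal library instead of a system-wide one might avoid the need for root.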
Dear Alboukadel,
Many thanks for this and other handy R packages!
I've been using qc_read_collection() on many "*_fastqc.zip" files, and noticed that this function suffers from dplyr issue #5358 when binding data.frames.
Here is a reprex leading to the error in lapply(res, dplyr::bind_rows, .id = "sample") inside qc_read_collection():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# create example data.frames to be bound using dplyr::bind_rows()
dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = c("150-155"), Count = 4)
de <- data.frame(array(NA, dim = c(0,0)))
res <- list(module = list(Sample1=dn, Sample2=ds, Sample3=de))
str(res)
#> List of 1
#> $ module:List of 3
#> ..$ Sample1:'data.frame': 1 obs. of 2 variables:
#> .. ..$ Length: num 150
#> .. ..$ Count : num 2
#> ..$ Sample2:'data.frame': 1 obs. of 2 variables:
#> .. ..$ Length: chr "150-155"
#> .. ..$ Count : num 4
#> ..$ Sample3:'data.frame': 0 obs. of 0 variables
# reproduce the error
res <- lapply(res, dplyr::bind_rows, .id = "sample")
#> Error: Can't combine `Sample1$Length` <double> and `Sample2$Length` <character>.
The error above will occur when calling qc_read_collection(files, sample_names, modules = "all") on a collection of "*_fastqc.zip" files, if there is a sample in files that has a different class for any variable in the data.frame to be bound.
In my case, this happened mostly with the modules $sequence_length_distribution (variable "Length") or $kmer_content (variable "Max Obs/Exp Position").
Here is a possible fix I came up with:
# convert <double> to <character> if a column should be <character>
res <- lapply(res, function(x) {
  # tibble with classes for each non-empty data.frame column
  dcl <- dplyr::bind_rows(lapply(x, function(y) {
    if (nrow(y) > 0) sapply(y, class)
  }))
  # define classes to assign
  cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
  # assign classes
  lapply(x, function(w) {
    if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
  })
})
# reproduce the fix
res <- lapply(res, dplyr::bind_rows, .id = "sample")
str(res)
#> List of 1
#> $ module:'data.frame': 2 obs. of 3 variables:
#> ..$ sample: chr [1:2] "Sample1" "Sample2"
#> ..$ Length: chr [1:2] "150" "150-155"
#> ..$ Count : num [1:2] 2 4
Created on 2021-12-21 by the reprex package (v2.0.1)
Perhaps a patch for qc_read_collection() similar to the one below (enclosed by ##<##<##) could be useful generally, given that dplyr is not going to fix this because it is a "deliberate design decision" (see #5358)?
qc_read_collection <- function(files, sample_names, modules = "all", verbose=T)
{
  module_data <- lapply(files, qc_read, modules = modules,
                        verbose = verbose)
  if (missing(sample_names) || length(sample_names) != length(files)) {
    sample_names <- lapply(module_data, function(x) unique(x$summary))
    sample_names <- unlist(sample_names)
  }
  names(module_data) <- sample_names
  module_names <- unique(unlist(lapply(module_data, names)))
  res <- list()
  for (i in seq_along(module_names)) {
    res[[i]] <- lapply(module_data, function(x) as.data.frame(x[[module_names[i]]]))
  }
  names(res) <- module_names
  ##<##<## begin patch
  res <- lapply(res, function(x) {
    dcl <- dplyr::bind_rows(lapply(x, function(y) {
      if (nrow(y) > 0) sapply(y, class)
    }))
    cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
    lapply(x, function(w) {
      if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
    })
  })
  ##<##<## end patch
  res <- lapply(res, dplyr::bind_rows, .id = "sample")
  res <- structure(res, class = c("list", "qc_read_collection"))
  res
}
Perhaps you'd like to look into this yourself, and maybe come up with an easier and prettier solution? :)
Cheers,
Simon
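As one possible simplification of the idea above, here is a base R sketch that converts a column to character in every data frame as soon as it is character in at least one of them. harmonise_types is a hypothetical helper name, not part of fastqcr or dplyr, and this is not the maintainer's fix; it only reuses the example data from the reprex.

```r
# Convert a column to character in every data frame if it is character in
# at least one of them; empty data frames are dropped.
harmonise_types <- function(dfs) {
  dfs <- Filter(function(d) nrow(d) > 0, dfs)
  cols <- unique(unlist(lapply(dfs, names)))
  for (col in cols) {
    is_chr <- vapply(dfs, function(d) {
      col %in% names(d) && is.character(d[[col]])
    }, logical(1))
    if (any(is_chr)) {
      dfs <- lapply(dfs, function(d) {
        if (col %in% names(d)) d[[col]] <- as.character(d[[col]])
        d
      })
    }
  }
  dfs
}

# Example data from the reprex above
dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = "150-155", Count = 4, stringsAsFactors = FALSE)
de <- data.frame()
fixed <- harmonise_types(list(Sample1 = dn, Sample2 = ds, Sample3 = de))

# The harmonised list can then be bound as before (requires dplyr):
# dplyr::bind_rows(fixed, .id = "sample")
```

Only columns that are character somewhere are touched, so purely numeric columns such as Count keep their type.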
Hi,
I successfully installed the fastqcr package on R/3.3.3, both locally and on a cluster.
I load the library and then when I try to use the first command:
fastqc(fq.dir = "/home/user/test_fastq", qc.dir = "/home/user/test_fastq/fastqc", threads = 4)
it outputs this error immediately:
sh: 1: fastqc: not found
What am I doing wrong?
I tried running fastqc_install(), but that didn't change anything.
Thank you for your help.
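The message "sh: 1: fastqc: not found" comes from the shell, which means the fastqc executable is not on the PATH that R sees. A base R sketch for locating it is below. find_fastqc is a hypothetical helper; the default location ~/bin/FastQC/fastqc and the fastqc.path argument mentioned in the usage comment are assumptions about the package, so please check ?fastqc and ?fastqc_install before relying on them.

```r
# Return a usable FastQC executable: the one on the PATH if there is one,
# otherwise the first existing file among `candidates`; NA if none is found.
find_fastqc <- function(candidates = "~/bin/FastQC/fastqc") {
  on_path <- unname(Sys.which("fastqc"))
  if (nzchar(on_path)) return(on_path)
  candidates <- path.expand(candidates)
  found <- candidates[file.exists(candidates)]
  if (length(found) > 0) found[1] else NA_character_
}

fastqc_bin <- find_fastqc()

# Possible usage (requires fastqcr; `fastqc.path` is assumed to exist):
# fastqc(fq.dir = "/home/user/test_fastq",
#        qc.dir = "/home/user/test_fastq/fastqc",
#        threads = 4, fastqc.path = fastqc_bin)
```

If the result is NA, FastQC is most likely not installed where R can find it, and the installation itself is the thing to check.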
After FastQC version 0.11.6, kmer content is no longer reported; see https://github.com/s-andrews/FastQC/releases/tag/v0.11.6. However, when we run qc_plot with modules='all', the results from .valid_fastqc_modules(modules='all') will still contain "Kmer Content", despite qc$summary no longer having an entry for it. This means the plot.func(qc, status = status) step will throw an error when modules reaches "kmer_content", owing to status = NA.
qc_plot should check the output of .valid_fastqc_modules(modules='all') to see whether qc has data for the module(s) selected. A PR is nearly ready and will be submitted today.
The attached 5k_pbmc_protein_v3_nextgem_gex_S1_L001_R1_001_fastqc.zip is an example of this; it was generated by running fastqc version 0.12.1 with default parameters on one of the FASTQ files from the 10X Genomics 5k_pbmc_protein_v3_nextgem data set.
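The check described above could look roughly like the following base R sketch. It is not the PR itself; module_key and modules_with_data are hypothetical names, and the module lists in the example are made up for illustration.

```r
# Normalise a module name to the form used for list names in a qc object,
# e.g. "Kmer Content" -> "kmer_content".
module_key <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# Keep only the requested modules that the FastQC summary actually reports.
modules_with_data <- function(requested, reported) {
  requested[module_key(requested) %in% module_key(reported)]
}

requested <- c("Per base sequence quality", "Kmer Content", "Adapter Content")
reported  <- c("Per base sequence quality", "Adapter Content")  # no k-mer entry
kept <- modules_with_data(requested, reported)
print(kept)
```

Filtering before the plotting loop means a module without data is skipped instead of reaching the plotting function with status = NA.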
Hello developers of fastqcr: do you use the mouse or the human genome as the reference for the theoretical distributions (e.g. the theoretical distribution for per sequence GC content)? I wasn't able to find this information in the vignette. Thank you very much!
Hello,
thanks again for all the packages you are developing.
I am using fastqcr and it works well for a first use, no problems so far.
I was wondering: it would be great to have a Shiny app that displays the Multiple QC Report, where we could click on a problematic sample and obtain the One Sample Report with plots. I know several biologists who would love this feature.
Cheers
Hi @kassambara
I previously suggested qc_read_collection and qc_plot_collection to handle multiple files of the fastqc output. The read function is, however, limited to using the raw fastqc files. I want to suggest adding a data argument to qc_read_collection to deal with the case where the raw files are not available, but rather a list of qc_read objects distributed as an R object.
The behavior of qc_read_collection could be modified to something like this:
# extract paths to the demo files
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.files <- list.files(qc.dir, full.names = TRUE)
# read list of files
qc_list <- lapply(qc.files, qc_read)
# make a collection object
qc_collection <- qc_read_collection(data = qc_list, sample_names = paste('S', 1:5, sep = ''))
Hi there,
I installed the fastqcr package on R 3.4.4 locally.
I ran the command fastqc_install() successfully and tried:
fastqc(fq.dir = "/home/user/test_fastq", qc.dir = "/home/user/fastqc", threads = 1)
but I got the following error:
sh: 1: fastqc: not found
What should I do? Thanks!
Simnon