zhxiaokang / rasflow Goto Github PK
View Code? Open in Web Editor NEWRNA-Seq analysis workflow
License: MIT License
RNA-Seq analysis workflow
License: MIT License
Hi @zhxiaokang
It seems I am having a problem with the BAM step in my workflow.
I get the following output
`[Wed Jan 6 23:51:27 2021]
rule featureCount:
input: data/output/pva027/genome/bamFileSort/20-0357.sort.bam, data/example/ref/annotation/027-annot.gff
output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary
jobid: 2
wildcards: sample=20-0357
Job counts:
count jobs
1 featureCount
1
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.6.4
//========================== featureCounts setting ===========================\
|| ||
|| Input files : 1 BAM file ||
|| P 20-0357.sort.bam ||
|| ||
|| Output file : 20-0357_count.tsv ||
|| Summary : 20-0357_count.tsv.summary ||
|| Annotation : 027-annot.gff (GTF) ||
|| Dir for temp files : output-pva/pva027/genome/countFile ||
|| ||
|| Threads : 8 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\============================================================================//
//================================= Running ==================================\
|| ||
|| Load annotation file 027-annot.gff ... ||
ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The porgram has to terminate.
[Wed Jan 6 23:51:28 2021]
Error in rule featureCount:
jobid: 0
output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary
RuleException:
CalledProcessError in line 109 of /home/sam/Downloads/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail; featureCounts -p -T 8 -t exon -g ID -a data/example/ref/annotation/027-annot.gff -o output-pva/pva027/genome/countFile/20-0357_count.tsv data/output/pva027/genome/bamFileSort/20-0357.sort.bam && tail -n +3 output-pva/pva027/genome/countFile/20-0357_count.tsv | cut -f1,7 > temp.20-0357 && mv temp.20-0357 output-pva/pva027/genome/countFile/20-0357_count.tsv ' returned non-zero exit status 255.
File "/home/sam/Downloads/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
`
Do you have any idea how I could fix it?
Cheers,
Pablo
As part of my pipeline, I want to merge files based on groups in the config file prior to run on salmon. Is there an easy way to do this. My understanding of the wildcards of snakemake is very rudimentary. Thanks so much in advance!
Dear @zhxiaokang,
I am having problems now with the BAM sorting in the pipeline. More specifically, the error I get is this one:
`
[Wed Jun 9 15:27:30 2021]
rule sortBAM:
input: data/output/test/genome/bamFile/200357.bam
output: data/output/test/genome/bamFileSort/200357.sort.bam
jobid: 6
wildcards: sample=200357
/bin/bash: line 1: 3230 Killed samtools sort -@ 8 data/output/test/genome/bamFile/200357.bam -o data/output/test/genome/bamFileSort/200357.sort.bam
[Wed Jun 9 15:27:37 2021]
Error in rule sortBAM:
jobid: 6
output: data/output/test/genome/bamFileSort/200357.sort.bam
RuleException:
CalledProcessError in line 97 of /home/sam/Downloads/pablo2021/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail; samtools sort -@ 8 data/output/test/genome/bamFile/200357.bam -o data/output/test/genome/bamFileSort/200357.sort.bam ' returned non-zero exit status 137.
File "/home/sam/Downloads/pablo2021/RASflow/workflow/align_count_genome.rules", line 97, in __rule_sortBAM
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/sam/Downloads/pablo2021/RASflow/.snakemake/log/2021-06-09T152538.774946.snakemake.log
`
Could you please help me solve it?
Cheers,
Pablo
Hi, thank you for your work. I'm just installed RASflow by the Conda environment with the yaml file env.yaml like the tutorial and tried to run the example changing the project name and even another sample, but keeping get the error in the first QC to stop that is nice: ‘snakemake’: No such file or directory
and the outputs files don't appear.
This is how is my config_main.yaml characteristics was:
PROJECT: teste
QC: yes # "yes" or "no"
TRIMMED: no # "yes" or "no"?
REFERENCE: genome # "genome" or "transcriptome"
DEA: yes # "yes" or "no"
VISUALIZE: no # "yes" or "no"
READSPATH: data/example/fastq_pair
METAFILE: configs/metadata.tsv
END: pair
NCORE: 40
OUTPUTPATH: data/output/teste # intermediate output. do not upload to github
FINALOUTPUT: output/teste
GENOME: data/example/ref/genome/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
ANNOTATION: data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf
ALIGNER: hisat2
COUNTER: featureCounts
alignmentQC: no # "yes" or "no" to specify whether you want to do alignment QC
DEATOOL: DESeq2
PAIR: FALSE # Is this a pair test or not? ("TRUE" or "FALSE")
CONTROL: ["Untreated"]
TREAT: ["Dexamethasone"]
obs.: a let false just for testing if goes.
FILTER:
yesOrNo: FALSE # Filter out low expressed transcripts/genes or not? (TRUE or FALSE) It's better to be set to TRUE. FALSE is set as default only for testing fake toy data
I ran the RASflow in a docker container and got all the results. The RASflow is helpful to do RNAseq analysis automatically.
I need to find the raw counts in integer numbers. Nor gene_abundance.tsv
nor gene_norm.tsv
has integer numbers. After searching all the files in the output
dir, in the subfolder, I found aux_info/ambig_info.tsv
, which contains two columns: UniqueCount
and AmbigCount
. Are they the source of all the values in the tsv files? what's the actual meaning of UniqueCount
and AmbigCount
?
Hello,
I am facing this error for 2 weeks and could not find any solution. I am running 3 group analyses (Control, Treat1, and Treat2). and during the DEA I face this error.
Start mapping using genome as reference!
Building DAG of jobs...
Nothing to be done.
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160020.335751.snakemake.log
Start doing DEA!
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DEA
1 all
2
[Mon Jul 12 16:00:21 2021]
rule DEA:
input: output/Arap11/genome/dea/countGroup/Emoy2_gene_count.tsv, output/Arap11/genome/dea/countGroup/H2O_gene_count.tsv, output/Arap11/genome/dea/countGroup/Waco9_gene_count.tsv
output: output/Arap11/genome/dea/countGroup/H2O_gene_norm.tsv, output/Arap11/genome/dea/countGroup/Emoy2_gene_norm.tsv, output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv, output/Arap11/genome/dea/DEA/deg_H2O_Emoy2.tsv
jobid: 1
Loading required package: limma
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from ‘package:limma’:
plotMA
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colMeans, colSums, colnames,
dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
Loading required package: BiocParallel
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following objects are masked from ‘package:base’:
aperm, apply
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
Calls: DEA ... DESeqDataSetFromMatrix -> DESeqDataSet -> checkFullRank
Execution halted
[Mon Jul 12 16:00:27 2021]
Error in rule DEA:
jobid: 1
output: output/Arap11/genome/dea/countGroup/H2O_gene_norm.tsv, output/Arap11/genome/dea/countGroup/Emoy2_gene_norm.tsv, output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv, output/Arap11/genome/dea/DEA/deg_H2O_Emoy2.tsv
RuleException:
CalledProcessError in line 38 of /ebio/abt6/asaadeldin/RASflow/RASflow/workflow/dea_genome.rules:
Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1.
File "/ebio/abt6/asaadeldin/RASflow/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA
File "/ebio/abt6/asaadeldin/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160021.273441.snakemake.log
DEA is done!
Start visualization of DEA results!
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
1 plot
2
[Mon Jul 12 16:00:28 2021]
rule plot:
input: output/Arap11/genome/dea/countGroup, output/Arap11/genome/dea/DEA
output: output/Arap11/genome/dea/visualization/volcano_plot_H2O_Emoy2.pdf, output/Arap11/genome/dea/visualization/heatmap_H2O_Emoy2.pdf
jobid: 1
Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns
Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colMeans, colSums, colnames,
dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:hash’:
values, values<-
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘AnnotationDbi’
The following objects are masked from ‘package:hash’:
keys, keys<-
Loading required package: ggplot2
Loading required package: ggrepel
Error in file(file, "rt") : cannot open the connection
Calls: plot.volcano.heatmap -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv': No such file or directory
Execution halted
[Mon Jul 12 16:00:34 2021]
Error in rule plot:
jobid: 1
output: output/Arap11/genome/dea/visualization/volcano_plot_H2O_Emoy2.pdf, output/Arap11/genome/dea/visualization/heatmap_H2O_Emoy2.pdf
RuleException:
CalledProcessError in line 53 of /ebio/abt6/asaadeldin/RASflow/RASflow/workflow/visualize.rules:
Command ' set -euo pipefail; Rscript scripts/visualize.R output/Arap11/genome/dea/countGroup output/Arap11/genome/dea/DEA output/Arap11/genome/dea/visualization ' returned non-zero exit status 1.
File "/ebio/abt6/asaadeldin/RASflow/RASflow/workflow/visualize.rules", line 53, in __rule_plot
File "/ebio/abt6/asaadeldin/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160028.139519.snakemake.log
Visualization is done!
RASflow is done!
Hi,
Thanks for your work, now we're trying to use Rasflow to process our RNAseq data.
In Rasflow, a transcriptome file (line 55 in config_main.yaml, TRANS: data/example/ref/transcriptome/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz) is used to quantify transcripts. However, featurecounts have a parameter "-g transcript_id", which can be used to quantify transcripts too using the genome file (not transcriptome file). I think this is more convinence, because we only need to provide genome file and can get both genes and transcripts quantification. Are there any difference between these two methods ?
Hi, while trying to run the dea portion of the script, I get this error message
Waiting at most 5 seconds for missing files.
MissingOutputException in line 29 of /ufrc/kawahara/yashsondhi/rnaseq/test_analysis_ver1_2018-10-17/Workflow/snakemake_rnaseq/RASflow/workflow/dea_genome.rules:
Missing files after 5 seconds:
output/test_23_march/moth/genome/dea/countGroup/Night_gene_norm.tsv
output/test_23_march/moth/genome/dea/countGroup/Day_gene_norm.tsv
output/test_23_march/moth/genome/dea/DEA/dea_Night_Day.tsv
output/test_23_march/moth/genome/dea/DEA/deg_Night_Day.tsv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ufrc/kawahara/yashsondhi/rnaseq/test_analysis_ver1_2018-10-17/Workflow/snakemake_rnaseq/RASflow/.snakemake/log/2020-03-26T000012.982733.snakemake.log
Hi,
After investigating an error to launch the rule quantify_trans.rules in single end mode,
I identified an error in getReads: output is marked as "forward" when it should be "read".
Thanks
Hello,
I keeping get this error with the name of the files(???). I file names are like this: 1_R1.fastq.gz 1_R2.fastq.gz 2_R1.fastq.gz 2_R2.fastq.gz and in metadata file I wrote like this:: sample group subject 1 C 2 T
(rasflow)[xxxxxxx RASflow-master]$ python main.py
main.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(yamlfile)
Is quality control required?
True
Is trimming required?
False
Which mapping reference will be used?
genome
Is DEA required?
True
Is visualization required?
True
Please double check the information above
Do you want to continue? (y/n)
y
Start RASflow on project: augusto
Are you sure that you want to do Quality Control?
If yes, type 'y'; if not, type 'n' and set 'QC' to 'no' in the config file
y
Start Quality Control!
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
2 getReads
2 qualityControl
1 summaryReport
6
[Tue Mar 28 12:55:39 2023]
rule getReads:
output: output/augusto/augusto/reads/1_forward.fastq.gz, output/augusto/augusto/reads/1_reverse.fastq.gz
jobid: 5
wildcards: sample=1
Job counts:
count jobs
1 getReads
1
cp: cannot stat 'home/xxxxxx/teste/1_*R1*.f*q.gz': No such file or directory
[Tue Mar 28 12:55:39 2023]
Error in rule getReads:
jobid: 0
output: output/augusto/augusto/reads/1_forward.fastq.gz, output/augusto/augusto/reads/1_reverse.fastq.gz
RuleException:
CalledProcessError in line 24 of /home/xxxxxx/RASflow-master/workflow/quality_control.rules:
Command ' set -euo pipefail; scp -i None home/xxxxx/teste/1_*R1*.f*q.gz output/augusto/augusto/reads/1_forward.fastq.gz ' returned non-zero exit status 1.
File "/home/xxxxxxx/RASflow-master/workflow/quality_control.rules", line 24, in __rule_getReads
File "/home/xxxxxx/.conda/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/xxxxxxxx/RASflow-master/.snakemake/log/2023-03-28T125538.626392.snakemake.log
Quality control is done!
Please check the report and decide whether trimming is needed
Please remember to turn off the QC in the config file!
What I can be doing wrong? Thankyou in advance!!!
In quantify_trans_rules,
I think that input_path is misspecified if there is a trimming step in the pipeline:
if trimmed:
input_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/trim"
instead of :
if trimmed:
input_path = config["OUTPUTPATH"] + "/trim"
Best regards,
Hi, I got the volcano plot working, but the heatmap portion is not showing up. I am mapping RNA seq reads to a genome.
This is the output of the script when I run it only using R.
(rasflow) [yashsondhi@login4 RASflow]$ Rscript scripts/visualize.R output/test_25_march/moth/genome/dea/countGroup output/test_25_march/moth/genome/dea/DEA output/test_25_march/moth/genome/dea/visualization
Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns
Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:hash’:
values, values<-
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘AnnotationDbi’
The following objects are masked from ‘package:hash’:
keys, keys<-
Loading required package: ggplot2
Loading required package: ggrepel
Querying chunk 1
Querying chunk 2
Querying chunk 3
Querying chunk 4
Querying chunk 5
Querying chunk 6
Querying chunk 7
Querying chunk 8
Querying chunk 9
Querying chunk 10
Querying chunk 11
Querying chunk 12
Querying chunk 13
Querying chunk 14
Querying chunk 15
Querying chunk 16
Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
Error in .request.get(mygene, paste("/query/", sep = ""), params) :
Request returned unexpected status code:
Response [http://mygene.info/v3/query/?fields=symbol]
Date: 2020-04-01 15:38
Status: 400
Content-Type: application/json; charset=UTF-8
Size: 65 B
{
"success": false,
"error": "Missing required parameters."
Calls: plot.volcano.heatmap ... query -> query -> query -> .request.get -> .request.get
Execution halted
Hi Zhang,
I am having issues with the gtf file format, I assume this is something I could fix by changing the index? I have attached the output of the cluster run, but I am not sure where I should edit this parameter?
ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'transcript_id "evm.model.chr1.34";'
The program has to terminate.
First few lines of the gtf file
GWHABGR00000001 EVM transcript 15372 30018 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1"
GWHABGR00000001 EVM exon 15372 15520 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 16212 16351 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 17501 17758 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 18192 18405 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 18529 18690 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 20641 20838 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 22769 22861 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 23546 23685 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1"
serial_test_9441360.log
;
Hello,
I notice the heatmap only shows the top 20 differentially expressed genes. Is there any way to change this to top 50 or 100 DEGs?
Thank you.
Hi,
I'm trying to use the RASflow and want to use the complete genome and annotation files(human). I tried to download files from https://uswest.ensembl.org/info/data/ftp/index.html, but don't know which files to download and fill the section for "Configuration for alignment to genome and feature count " in config_main.yaml.
The original setting in the config_main.yaml is:
GENOME: data/example/ref/genome/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
ANNOTATION: data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf
I want to replace these files with complete human genome files. Where can I download them and which files?
Thanks,
Hi,
I was wondering wether the workflow can handle multiple comparisons.
[https://github.com/zhxiaokang/RASflow/blob/master/configs/metadata.tsv]
I have three groups:
Thanks in advanced
where is code haha?!
I just cloned this repository, created the conda environment and activated it.
The first issue I ran into is that snakemake requires the number of jobs to be set:
Start Quality Control!
Error: you need to specify the maximum number of CPU cores to be used at the same time. If you want to use N cores, say --cores N or -jN. For all cores on your system (be sure that this is appropriate) use --cores
all. For no parallelization use --cores 1 or -j1.
Quality control is done!
Please check the report and decide whether trimming is needed
Please remember to turn off the QC in the config file!
This is even though I have the number of cores specified to 1 or 4 on line 38 of the config:
## number of cores you want to allocate to this workflow
NCORE: 4 # Use command "getconf _NPROCESSORS_ONLN" to check the number of cores/CPU on your machine
Adding --jobs 1
to the main.py
file on line 65 as follows luckily fixed that issue:
os.system("nice -5 snakemake --jobs 1 -s workflow/quality_control.rules 2>&1 | tee logs/log_quality_control.txt")
However, now I'm getting the following error:
Start Quality Control!
AttributeError in line 23 of /home/campbell/mgeuenic/RASflow/workflow/quality_control.rules:
invalid name for input, output, wildcard, params or log: reverse is reserved for internal use
File "/home/campbell/mgeuenic/RASflow/workflow/quality_control.rules", line 23, in <module>
Quality control is done!
The snakemake version inside the environment is as it should be (I believe)
$ snakemake -v
5.32.0
I've also checked the output folder and the logs but there are no changes in either. Any ideas on what the issue is/how to fix it?
Hi! I realised that the volcano plot function not always plots the most significant genes. The reasons of this limitation are the xlim and ylim parameters from EnhancedVolcano() [lines 82 and 85 visualization.R script]. I modified the xlim and ylim parameters with something like:
extract x/y limits based on the log2FC and padj values from results DESeq2 table
log2FC_lim <- max(abs(dea.table$log2FoldChange))
padj_lim <- -log10(min(dea.table$padj)) # NAs already removed from dea.table
redefine xlim and ylim graph coordinates
EnhancedVolcano(..., xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, padj_lim), ...)
This worked for me using DESeq2. In this way, the user will always get the most significant genes annotated in the volcano plot.
Thank you for your very useful tool.
All the best,
Raquel
Hi,
I was wondering how could I construct metadata.ts for paired_end data? For example, if I have Untreated sample 1,2 and treated sample 3,4. from 4 mice. Could I write metadata.tsv in this way:
sample group subject
S1_R1 untreated 1
S1_R2 untreated 1
S2_R1 untreated 2
S2_R2 untreated 2
S3_R1 treated 3
S3_R2 treated 3
S4_R1 treated 4
S4_R2 treated 4
Thanks!
Kun
hi@zhxiaokang,
please help, my ram is not enough, what solutions for this problem . 8g ram only, so bad, hopefully i own a new powerful notebook,...
sincerely,
luoying
Hello,
For some reason, I am getting an error in the DEA steps and subsequent visualization. It says there is an error in the data frame, but I am not sure how to solve this error. Any help is appreciated. Thank you.
rule DEA:
input: output/analysis/genome/dea/countGroup/Treatment_1_gene_count.tsv, output/analysis/genome/dea/countGroup/Untreated_gene_count.tsv
output: output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv, output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv, output/analysis/genome/dea/DEA/deg_Untreated_Treatment_1.tsv
jobid: 1
Loading required package: limma
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from ‘package:limma’:
plotMA
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, [pmax.int](http://pmax.int/),
pmin, [pmin.int](http://pmin.int/), Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
Loading required package: BiocParallel
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following objects are masked from ‘package:base’:
aperm, apply
Error in $<-.data.frame
(*tmp*
, "subject", value = integer(0)) :
replacement has 0 rows, data has 4
Calls: DEA ->
Execution halted
[Sun Jul 3 02:54:56 2022]
Error in rule DEA:
jobid: 1
output: output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv, output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv, output/analysis/genome/dea/DEA/deg_Untreated_Treatment_1.tsv
RuleException:
CalledProcessError in line 38 of /root/RASflow/workflow/dea_genome.rules:
Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1.
File "/root/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA
File "/root/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job DEA since they might be corrupted:
output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /root/RASflow/.snakemake/log/2022-07-03T025450.360606.snakemake.log
DEA is done!
Start visualization of DEA results!
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
1 plot
2
[Sun Jul 3 02:54:56 2022]
rule plot:
input: output/analysis/genome/dea/countGroup, output/analysis/genome/dea/DEA
output: output/analysis/genome/dea/visualization/volcano_plot_Untreated_Treatment_1.pdf, output/analysis/genome/dea/visualization/heatmap_Untreated_Treatment_1.pdf
jobid: 1
Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns
Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, [pmax.int](http://pmax.int/),
pmin, [pmin.int](http://pmin.int/), Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:hash’:
values, values<-
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘AnnotationDbi’
The following objects are masked from ‘package:hash’:
keys, keys<-
Loading required package: ggplot2
Loading required package: ggrepel
Error in file(file, "rt") : cannot open the connection
Calls: plot.volcano.heatmap -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv': No such file or directory
Execution halted
[Sun Jul 3 02:55:01 2022]
Error in rule plot:
jobid: 1
output: output/analysis/genome/dea/visualization/volcano_plot_Untreated_Treatment_1.pdf, output/analysis/genome/dea/visualization/heatmap_Untreated_Treatment_1.pdf
RuleException:
CalledProcessError in line 53 of /root/RASflow/workflow/visualize.rules:
Command ' set -euo pipefail; Rscript scripts/visualize.R output/analysis/genome/dea/countGroup output/analysis/genome/dea/DEA output/analysis/genome/dea/visualization ' returned non-zero exit status 1.
File "/root/RASflow/workflow/visualize.rules", line 53, in __rule_plot
File "/root/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /root/RASflow/.snakemake/log/2022-07-03T025456.705370.snakemake.log
Visualization is done!
RASflow is done!
Hi, when I try to create the rasflow environment with the given command "conda env create -n rasflow -f env.yaml" it gets stuck in the resolving environment step. I have successfully installed and used rasflow in Ubuntu LTS 20.4 and in a WSL vm with the same OS but can't really install it on my work computer. Is there any workaround?
Thanks in advance.
Hello Mr.Zhang,
For some reason, I am getting an error in the DEA steps.
Error in tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no") :
all(file.exists(files)) is not TRUE
Calls: DEA -> tximport -> stopifnot
Execution halted
[Mon Sep 12 11:16:57 2022]
Error in rule DEA:
jobid: 1
output: output/WT/trans/dea/DEA/transcript-level/dea_wt_ko.tsv, output/WT/trans/dea/DEA/transcript-level/deg_wt_ko.tsv, output/WT/trans/dea/DEA/gene-level/dea_wt_ko.tsv, output/WT/trans/dea/DEA/gene-level/deg_wt_ko.tsv
RuleException:
CalledProcessError in line 40 of /mnt/d/RNAseq/qhd2/workflow/dea_trans.rules:
Command ' set -euo pipefail; Rscript scripts/dea_trans.R ' returned non-zero exit status 1.
File "/mnt/d/RNAseq/qhd2/workflow/dea_trans.rules", line 40, in __rule_DEA
File "/home/wbc/miniconda3/envs/ranseqSnakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/d/RNAseq/qhd2/.snakemake/log/2022-09-12T110512.827029.snakemake.log
DEA is done!
Visualization is not required and RASflow is done!
Building DAG of jobs...
Nothing to be done.
Complete log: /mnt/d/RNAseq/qhd2/.snakemake/log/2022-09-12T110510.915007.snakemake.log
Any help is appreciated. Thank you.
Hi,
I want to run it on a cluster using a bash script, is there some way to silence the checkpoints once the trimming is done. It cannot be run non-interactively.
Dear Dr Kang,
First of all, well done, and thank You for RASflow - You made sequencing pipeline truly easy to understand and expand.
Still, I found a wierd behavior in rule getReads of align_count_genome.rules, which possibly generalizes to other extglob-based rules. When running this rule with TRIMMED: yes
, it returns...
Command ' set -euo pipefail;
shopt -s extglob
scp -i None /tmp/data/output/test/trim/MSG3701204_small?(.*).f*q.gz /tmp/data/output/test/genome/reads/MSG3701204_small.fastq.gz ' returned non-zero exit status 1.
... despite the file (MSG3701204_small_trimmed.fq.gz) being present in the directory.
The curious thing is, the same rule (getReads) in trim.rules works perfectly well. Similarly, when running align_count_genome.rules with TRIMMED: no
, the pipeline finishes fine.
I'm running RASflow inside Singularity container (see: https://github.com/AdrianS85/rnaseq/blob/main/rasflow_sing). My config file: https://github.com/AdrianS85/rnaseq/blob/main/config_main.yaml.
Please let me know if You are interested in any other information.
Best regards,
Adrian
MultiQC doesn't appear to run in the current state of the conda environment from the yaml file. Attempting to run on a minimal dataset (2 samples) throws 'SyntaxError: future feature annotations is not defined', which seems like it may be related to use of Python version <3.7 . This is overcome by using Python 3.7, but then copytree requires 3.8, which is incompatible with other dependencies.
This is occurring in Linux on both a HPC and a local WSL install.
Hi!
I'm trying to use RASFlow for first time with ONT Minion (with master branch), after align step with HISAT2 no reads are aligned. I also tried bwa branch but there are some tokens errors so the pipeline fails (im using macOs Catalina). Any suggestion? maybe there are some hisat2 configuration for Nanopore??
Thank you in advance
AG
Alignment ressults:
38280 reads; of these:
38280 (100.00%) were unpaired; of these:
38280 (100.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.00% overall alignment rate
File path built with wildcards do not recognize "sample1.fastq.gz" as a single file if "sample10.fastq.gz" is in same directory, instead it returns a list with the 2 files.
My way to fix it:
IN align_count_genome.rules AND quantify_trans.rules
CHANGE
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}*.f*q.gz {output.read}")
TO
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}.f*q.gz {output.read}")
Hi,
Thank you for the nice pipeline development.
I first tried QC for the fastq samples without further downstream application. And the feedback from the pipeline is:
main.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(yamlfile)
Is quality control required?
True
Is trimming required?
False
Which mapping reference will be used?
genome
Is DEA required?
False
Is visualization required?
False
Please double check the information above
Do you want to continue? (y/n)
y
Start RASflow on project: calderon_et_al
Are you sure that you want to do Quality Control?
If yes, type 'y'; if not, type 'n' and set 'QC' to 'no' in the config file
y
Start Quality Control!
Building DAG of jobs...
Nothing to be done.
Complete log: /home/RASflow/.snakemake/log/2020-06-08T212542.413870.snakemake.log
Quality control is done!
Please check the report and decide whether trimming is needed
Please remember to turn off the QC in the config file!
And I found nothing in my pre-defined output directory.
Can you help sort out the issue?
Many thanks!
Jessie
I am trying to figure out a fix for this error. Can you give any advice?
Error in rename(df, c(X_id = "_id")) :
Some 'from' names in value not found on 'x': X_id
Calls: plot.volcano.heatmap ... queryMany -> queryMany -> .return.as -> rename -> rename
Execution halted
Hi!
I am wondering if it is possible to change the default FDR value set for DEA which is 0.05. I would like to check if setting a higher threshold gives interesting results. I checked the corresponding rules file but it doesn't seem to include the value.
Please help! Many thanks :)
Hi @zhxiaokang
i run into a problem when i execute the pipeline with the paired-end test data using htseq-count. There appears the following warning and finally the Error caused by missing file:
count jobs
1 featureCount
1
17933 GFF lines processed.
Warning: Read SRR1039509.3494278 claims to have an aligned mate which could not be found in an adjacent line.
Warning: 81048 reads with missing mate encountered.
89918 SAM alignment pairs processed.
MissingOutputException in line 99 of /data/tools/RASflow/workflow/align_count_genome.rules:
Missing files after 5 seconds:
output/test/genome/countFile/SRR1039509_count.tsv.summary
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /data/tools/RASflow/.snakemake/log/2022-11-09T132244.600008.snakemake.log
DEA is not required and RASflow is done!
The problem is, that the file SRR1039509_count.tsv.summary
is missing but I don't know if the warning of htseq-count is related to this problem.
When I execute the isolated command
htseq-count -f bam -i gene_id -s no -t exon output/test/test/genome/bamFileSort/SRR1039509.sort.bam data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf | sed '/^__/ d' > output/test/genome/countFile/SRR1039509_count.tsv
only the SRR1039509_count.tsv file is created.
With kind regards
Arsenij
I am getting an error with the visualization step.
`Querying chunk 56
Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
Error in '$<-.data.frame'('tmp', "lab", value = c("Lcn2", "Cd24a", "Sprr1a", :
replacement has 55422 rows, data has 55421
Calls: plot.volcano.heatmap -> EnhancedVolcano ->
Execution halted
`
I am also getting the warning
`Loading required package: ggrepel
Warning message:
In readLines(con) :
incomplete final line found on 'configs/config_main.yaml'
Querying chunk 1
Querying chunk 2
`
Any help with these? Am I missing a package in R?
I also noticed it's not possible to use the main.py to run just the visualization without redoing the DEA, running it with DEA set to returns
Which mapping reference will be used? genome Is DEA requred? False Is visualization requred? True Please double check the information above Do you want to continue? (y/n) y Start RASflow on project: project Trimming is not required Start mapping using genome as reference! Building DAG of jobs... Nothing to be done.
Hi,
The visualisation fails for DESeq2, for EdgeR it runs, but comes up with this error message. I did not do it by gene since my species bombyx mori while on ensembl was not on the ensembl lookup file.
Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
Error in .request.get(mygene, paste("/query/", sep = ""), params) :
Request returned unexpected status code:
Response [http://mygene.info/v3/query/?fields=symbol]
Date: 2020-03-28 04:33
Status: 400
Content-Type: application/json; charset=UTF-8
Size: 65 B
{
"success": false,
"error": "Missing required parameters."
Calls: plot.volcano.heatmap ... query -> query -> query -> .request.get -> .request.get
Execution halted
I am not able to see the tutorial for RASflow. The link points to a pdf here:
https://github.com/zhxiaokang/RASflow/blob/66c9f378e10af7d36dc08fdef46ac97c5e48f713/Tutorial.pdf
But I get a "Unable to render code block" error
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.