zhxiaokang / rasflow Goto Github PK

View Code? Open in Web Editor NEW

102.0 102.0 58.0 181.31 MB

RNA-Seq analysis workflow

License: MIT License

R 0.27% Shell 0.02% Java 0.02% Python 0.07% HTML 97.03% CSS 0.77% JavaScript 1.83%

rasflow's People

Contributors

Stargazers

Watchers

Forkers

pythseq yuanzehu88 bbyun28 wenliangz peiwenliu18 nifaste moelzek jfk24 antoniogps clairejj20 zengtian2680 shi-yiji yashsondhi trissim naveen584 ryanjoel keyong-bio xiaoqiwang19 jiahuaqu yixf-education monicambabazi haroon123 sherkinglee guosongjia najlabioinfo standardgalactic adrians85 victorfica unique379r swrosati joanexee muedi bioming zzygyx9119 joyce-karol wangjie07070910 danilotat shicheng-guo justintzeji harshameghadri sunnyev cuervo88 suresh2014 rnaimehaom ishak-yusuf richard1iu sogada azkasaleem antoniaprn aryazand zhiyu2023 mugpeng codus503 akashbahai wangpanqiao crowmane420

rasflow's Issues

problem with gtf file

Hi @zhxiaokang

It seems I am having a problem with the BAM step in my workflow.

I get the following output

`[Wed Jan 6 23:51:27 2021]
rule featureCount:
input: data/output/pva027/genome/bamFileSort/20-0357.sort.bam, data/example/ref/annotation/027-annot.gff
output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary
jobid: 2
wildcards: sample=20-0357

Job counts:
count jobs
1 featureCount
1

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
  v1.6.4

//========================== featureCounts setting ===========================\
|| ||
|| Input files : 1 BAM file ||
|| P 20-0357.sort.bam ||
|| ||
|| Output file : 20-0357_count.tsv ||
|| Summary : 20-0357_count.tsv.summary ||
|| Annotation : 027-annot.gff (GTF) ||
|| Dir for temp files : output-pva/pva027/genome/countFile ||
|| ||
|| Threads : 8 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\============================================================================//

//================================= Running ==================================\
|| ||
|| Load annotation file 027-annot.gff ... ||
ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The porgram has to terminate.

[Wed Jan 6 23:51:28 2021]
Error in rule featureCount:
jobid: 0
output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary

RuleException:
CalledProcessError in line 109 of /home/sam/Downloads/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail; featureCounts -p -T 8 -t exon -g ID -a data/example/ref/annotation/027-annot.gff -o output-pva/pva027/genome/countFile/20-0357_count.tsv data/output/pva027/genome/bamFileSort/20-0357.sort.bam && tail -n +3 output-pva/pva027/genome/countFile/20-0357_count.tsv | cut -f1,7 > temp.20-0357 && mv temp.20-0357 output-pva/pva027/genome/countFile/20-0357_count.tsv ' returned non-zero exit status 255.
File "/home/sam/Downloads/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
`
Do you have any idea how I could fix it?

Cheers,
Pablo

Merging fastq files prior to salmon quantification

As part of my pipeline, I want to merge files based on groups in the config file prior to run on salmon. Is there an easy way to do this. My understanding of the wildcards of snakemake is very rudimentary. Thanks so much in advance!

BAM sorting error

Dear @zhxiaokang,

I am having problems now with the BAM sorting in the pipeline. More specifically, the error I get is this one:

`
[Wed Jun 9 15:27:30 2021]
rule sortBAM:
input: data/output/test/genome/bamFile/200357.bam
output: data/output/test/genome/bamFileSort/200357.sort.bam
jobid: 6
wildcards: sample=200357

/bin/bash: line 1: 3230 Killed samtools sort -@ 8 data/output/test/genome/bamFile/200357.bam -o data/output/test/genome/bamFileSort/200357.sort.bam
[Wed Jun 9 15:27:37 2021]
Error in rule sortBAM:
jobid: 6
output: data/output/test/genome/bamFileSort/200357.sort.bam

RuleException:
CalledProcessError in line 97 of /home/sam/Downloads/pablo2021/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail; samtools sort -@ 8 data/output/test/genome/bamFile/200357.bam -o data/output/test/genome/bamFileSort/200357.sort.bam ' returned non-zero exit status 137.
File "/home/sam/Downloads/pablo2021/RASflow/workflow/align_count_genome.rules", line 97, in __rule_sortBAM
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/sam/Downloads/pablo2021/RASflow/.snakemake/log/2021-06-09T152538.774946.snakemake.log
`

Could you please help me solve it?

Cheers,
Pablo

Can't run the examples

Hi, thank you for your work. I'm just installed RASflow by the Conda environment with the yaml file env.yaml like the tutorial and tried to run the example changing the project name and even another sample, but keeping get the error in the first QC to stop that is nice: ‘snakemake’: No such file or directory and the outputs files don't appear.

This is how is my config_main.yaml characteristics was:

PROJECT: teste
QC: yes # "yes" or "no"
TRIMMED: no # "yes" or "no"?
REFERENCE: genome # "genome" or "transcriptome"
DEA: yes # "yes" or "no"
VISUALIZE: no # "yes" or "no"
READSPATH: data/example/fastq_pair
METAFILE: configs/metadata.tsv
END: pair
NCORE: 40
OUTPUTPATH: data/output/teste # intermediate output. do not upload to github
FINALOUTPUT: output/teste
GENOME: data/example/ref/genome/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
ANNOTATION: data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf
ALIGNER: hisat2
COUNTER: featureCounts
alignmentQC: no # "yes" or "no" to specify whether you want to do alignment QC
DEATOOL: DESeq2
PAIR: FALSE # Is this a pair test or not? ("TRUE" or "FALSE")
CONTROL: ["Untreated"]
TREAT: ["Dexamethasone"]
obs.: a let false just for testing if goes.

FILTER:
yesOrNo: FALSE # Filter out low expressed transcripts/genes or not? (TRUE or FALSE) It's better to be set to TRUE. FALSE is set as default only for testing fake toy data

raw counts

I ran the RASflow in a docker container and got all the results. The RASflow is helpful to do RNAseq analysis automatically.

I need to find the raw counts in integer numbers. Nor gene_abundance.tsv nor gene_norm.tsv has integer numbers. After searching all the files in the output dir, in the subfolder, I found aux_info/ambig_info.tsv, which contains two columns: UniqueCount and AmbigCount. Are they the source of all the values in the tsv files? what's the actual meaning of UniqueCount and AmbigCount?

DEA error and visualisation.

Hello,

I am facing this error for 2 weeks and could not find any solution. I am running 3 group analyses (Control, Treat1, and Treat2). and during the DEA I face this error.

Start mapping using genome as reference!
Building DAG of jobs...
Nothing to be done.
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160020.335751.snakemake.log
Start doing DEA!
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DEA
1 all
2

[Mon Jul 12 16:00:21 2021]
rule DEA:
input: output/Arap11/genome/dea/countGroup/Emoy2_gene_count.tsv, output/Arap11/genome/dea/countGroup/H2O_gene_count.tsv, output/Arap11/genome/dea/countGroup/Waco9_gene_count.tsv
output: output/Arap11/genome/dea/countGroup/H2O_gene_norm.tsv, output/Arap11/genome/dea/countGroup/Emoy2_gene_norm.tsv, output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv, output/Arap11/genome/dea/DEA/deg_H2O_Emoy2.tsv
jobid: 1

Loading required package: limma
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:limma’:

plotMA

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colMeans, colSums, colnames,
dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

anyMissing, rowMedians

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

aperm, apply

Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.

Please read the vignette section 'Model matrix not full rank':

vignette('DESeq2')
Calls: DEA ... DESeqDataSetFromMatrix -> DESeqDataSet -> checkFullRank
Execution halted
[Mon Jul 12 16:00:27 2021]
Error in rule DEA:
jobid: 1
output: output/Arap11/genome/dea/countGroup/H2O_gene_norm.tsv, output/Arap11/genome/dea/countGroup/Emoy2_gene_norm.tsv, output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv, output/Arap11/genome/dea/DEA/deg_H2O_Emoy2.tsv

RuleException:
CalledProcessError in line 38 of /ebio/abt6/asaadeldin/RASflow/RASflow/workflow/dea_genome.rules:
Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1.
File "/ebio/abt6/asaadeldin/RASflow/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA
File "/ebio/abt6/asaadeldin/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160021.273441.snakemake.log
DEA is done!
Start visualization of DEA results!
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
1 plot
2

[Mon Jul 12 16:00:28 2021]
rule plot:
input: output/Arap11/genome/dea/countGroup, output/Arap11/genome/dea/DEA
output: output/Arap11/genome/dea/visualization/volcano_plot_H2O_Emoy2.pdf, output/Arap11/genome/dea/visualization/heatmap_H2O_Emoy2.pdf
jobid: 1

Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns

Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colMeans, colSums, colnames,
dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:hash’:

values, values<-

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘AnnotationDbi’

The following objects are masked from ‘package:hash’:

keys, keys<-

Loading required package: ggplot2
Loading required package: ggrepel
Error in file(file, "rt") : cannot open the connection
Calls: plot.volcano.heatmap -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'output/Arap11/genome/dea/DEA/dea_H2O_Emoy2.tsv': No such file or directory
Execution halted
[Mon Jul 12 16:00:34 2021]
Error in rule plot:
jobid: 1
output: output/Arap11/genome/dea/visualization/volcano_plot_H2O_Emoy2.pdf, output/Arap11/genome/dea/visualization/heatmap_H2O_Emoy2.pdf

RuleException:
CalledProcessError in line 53 of /ebio/abt6/asaadeldin/RASflow/RASflow/workflow/visualize.rules:
Command ' set -euo pipefail; Rscript scripts/visualize.R output/Arap11/genome/dea/countGroup output/Arap11/genome/dea/DEA output/Arap11/genome/dea/visualization ' returned non-zero exit status 1.
File "/ebio/abt6/asaadeldin/RASflow/RASflow/workflow/visualize.rules", line 53, in __rule_plot
File "/ebio/abt6/asaadeldin/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ebio/abt6/asaadeldin/RASflow/RASflow/.snakemake/log/2021-07-12T160028.139519.snakemake.log
Visualization is done!
RASflow is done!

transcriptome analysis

Hi,
Thanks for your work, now we're trying to use Rasflow to process our RNAseq data.
In Rasflow, a transcriptome file (line 55 in config_main.yaml, TRANS: data/example/ref/transcriptome/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz) is used to quantify transcripts. However, featurecounts have a parameter "-g transcript_id", which can be used to quantify transcripts too using the genome file (not transcriptome file). I think this is more convinence, because we only need to provide genome file and can get both genes and transcripts quantification. Are there any difference between these two methods ?

Latency during dea for large files

Hi, while trying to run the dea portion of the script, I get this error message

Waiting at most 5 seconds for missing files.
MissingOutputException in line 29 of /ufrc/kawahara/yashsondhi/rnaseq/test_analysis_ver1_2018-10-17/Workflow/snakemake_rnaseq/RASflow/workflow/dea_genome.rules:
Missing files after 5 seconds:
output/test_23_march/moth/genome/dea/countGroup/Night_gene_norm.tsv
output/test_23_march/moth/genome/dea/countGroup/Day_gene_norm.tsv
output/test_23_march/moth/genome/dea/DEA/dea_Night_Day.tsv
output/test_23_march/moth/genome/dea/DEA/deg_Night_Day.tsv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ufrc/kawahara/yashsondhi/rnaseq/test_analysis_ver1_2018-10-17/Workflow/snakemake_rnaseq/RASflow/.snakemake/log/2020-03-26T000012.982733.snakemake.log

error in rule quantify_trans.rules

Hi,
After investigating an error to launch the rule quantify_trans.rules in single end mode,
I identified an error in getReads: output is marked as "forward" when it should be "read".

Thanks

Error in fastqc

Hello,

I keeping get this error with the name of the files(???). I file names are like this: 1_R1.fastq.gz 1_R2.fastq.gz 2_R1.fastq.gz 2_R2.fastq.gz and in metadata file I wrote like this:: sample group subject 1 C 2 T

(rasflow)[xxxxxxx RASflow-master]$ python main.py
main.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(yamlfile)
Is quality control required?
 True
Is trimming required?
 False
Which mapping reference will be used?
 genome
Is DEA required?
 True
Is visualization required?
 True
Please double check the information above
Do you want to continue? (y/n)
y
Start RASflow on project: augusto
Are you sure that you want to do Quality Control?
 If yes, type 'y'; if not, type 'n' and set 'QC' to 'no' in the config file
y
Start Quality Control!
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       end
        2       getReads
        2       qualityControl
        1       summaryReport
        6

[Tue Mar 28 12:55:39 2023]
rule getReads:
    output: output/augusto/augusto/reads/1_forward.fastq.gz, output/augusto/augusto/reads/1_reverse.fastq.gz
    jobid: 5
    wildcards: sample=1

Job counts:
        count   jobs
        1       getReads
        1
cp: cannot stat 'home/xxxxxx/teste/1_*R1*.f*q.gz': No such file or directory
[Tue Mar 28 12:55:39 2023]
Error in rule getReads:
    jobid: 0
    output: output/augusto/augusto/reads/1_forward.fastq.gz, output/augusto/augusto/reads/1_reverse.fastq.gz

RuleException:
CalledProcessError in line 24 of /home/xxxxxx/RASflow-master/workflow/quality_control.rules:
Command ' set -euo pipefail;  scp -i None home/xxxxx/teste/1_*R1*.f*q.gz output/augusto/augusto/reads/1_forward.fastq.gz ' returned non-zero exit status 1.
  File "/home/xxxxxxx/RASflow-master/workflow/quality_control.rules", line 24, in __rule_getReads
  File "/home/xxxxxx/.conda/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/xxxxxxxx/RASflow-master/.snakemake/log/2023-03-28T125538.626392.snakemake.log
Quality control is done!
 Please check the report and decide whether trimming is needed
 Please remember to turn off the QC in the config file!

What I can be doing wrong? Thankyou in advance!!!

Quantification after trimming

In quantify_trans_rules,
I think that input_path is misspecified if there is a trimming step in the pipeline:

if trimmed:
input_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/trim"

instead of :
if trimmed:
input_path = config["OUTPUTPATH"] + "/trim"

Best regards,

Final visualisation step

Hi, I got the volcano plot working, but the heatmap portion is not showing up. I am mapping RNA seq reads to a genome.

This is the output of the script when I run it only using R.
(rasflow) [yashsondhi@login4 RASflow]$ Rscript scripts/visualize.R output/test_25_march/moth/genome/dea/countGroup output/test_25_march/moth/genome/dea/DEA output/test_25_march/moth/genome/dea/visualization
Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns

Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:hash’:

values, values<-

The following object is masked from ‘package:base’:

expand.grid

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘AnnotationDbi’

The following objects are masked from ‘package:hash’:

keys, keys<-

Loading required package: ggplot2
Loading required package: ggrepel
Querying chunk 1
Querying chunk 2
Querying chunk 3
Querying chunk 4
Querying chunk 5
Querying chunk 6
Querying chunk 7
Querying chunk 8
Querying chunk 9
Querying chunk 10
Querying chunk 11
Querying chunk 12
Querying chunk 13
Querying chunk 14
Querying chunk 15
Querying chunk 16
Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
Error in .request.get(mygene, paste("/query/", sep = ""), params) :
Request returned unexpected status code:
Response [http://mygene.info/v3/query/?fields=symbol]
Date: 2020-04-01 15:38
Status: 400
Content-Type: application/json; charset=UTF-8
Size: 65 B
{
"success": false,
"error": "Missing required parameters."
Calls: plot.volcano.heatmap ... query -> query -> query -> .request.get -> .request.get
Execution halted

Gtf file format error

Hi Zhang,
I am having issues with the gtf file format, I assume this is something I could fix by changing the index? I have attached the output of the cluster run, but I am not sure where I should edit this parameter?

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'transcript_id "evm.model.chr1.34";'
The program has to terminate.

First few lines of the gtf file

GWHABGR00000001 EVM transcript 15372 30018 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1"
GWHABGR00000001 EVM exon 15372 15520 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 16212 16351 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 17501 17758 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 18192 18405 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 18529 18690 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 20641 20838 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 22769 22861 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1";
GWHABGR00000001 EVM exon 23546 23685 . - . transcript_id "evm.model.chr1.1"; gene_id "evm.TU.chr1.1"
serial_test_9441360.log
;

Show top 50 or 100 genes on heatmap

Hello,

I notice the heatmap only shows the top 20 differentially expressed genes. Is there any way to change this to top 50 or 100 DEGs?
Thank you.

Where to get genome and annotation files

Hi,

I'm trying to use the RASflow and want to use the complete genome and annotation files(human). I tried to download files from https://uswest.ensembl.org/info/data/ftp/index.html, but don't know which files to download and fill the section for "Configuration for alignment to genome and feature count " in config_main.yaml.
The original setting in the config_main.yaml is:

genome and annotation files

GENOME: data/example/ref/genome/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
ANNOTATION: data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf

I want to replace these files with complete human genome files. Where can I download them and which files?

Thanks,

Multiple comparison with 2 treatments

Hi,

I was wondering wether the workflow can handle multiple comparisons.
[https://github.com/zhxiaokang/RASflow/blob/master/configs/metadata.tsv]
I have three groups:

1. Control
1. Treatment_A
1. Treatment_B
  In the metadata table I see this only works with a control group and one treatment comparison group at a time or should I run another workflow with my other treatment and the control?

Thanks in advanced

code?

where is code haha?!

reverse is reserved for internal use error

I just cloned this repository, created the conda environment and activated it.

The first issue I ran into is that snakemake requires the number of jobs to be set:

Start Quality Control!
Error: you need to specify the maximum number of CPU cores to be used at the same time. If you want to use N cores, say --cores N or -jN. For all cores on your system (be sure that this is appropriate) use --cores
all. For no parallelization use --cores 1 or -j1.
Quality control is done!
 Please check the report and decide whether trimming is needed
 Please remember to turn off the QC in the config file!

This is even though I have the number of cores specified to 1 or 4 on line 38 of the config:

## number of cores you want to allocate to this workflow
NCORE: 4  # Use command "getconf _NPROCESSORS_ONLN" to check the number of cores/CPU on your machine

Adding --jobs 1 to the main.py file on line 65 as follows luckily fixed that issue:

os.system("nice -5 snakemake --jobs 1 -s workflow/quality_control.rules 2>&1 | tee logs/log_quality_control.txt")

However, now I'm getting the following error:

Start Quality Control!
AttributeError in line 23 of /home/campbell/mgeuenic/RASflow/workflow/quality_control.rules:
invalid name for input, output, wildcard, params or log: reverse is reserved for internal use
  File "/home/campbell/mgeuenic/RASflow/workflow/quality_control.rules", line 23, in <module>
Quality control is done!

The snakemake version inside the environment is as it should be (I believe)

$ snakemake -v
5.32.0

I've also checked the output folder and the logs but there are no changes in either. Any ideas on what the issue is/how to fix it?

visualization.R EnhancedVolcano()

Hi! I realised that the volcano plot function not always plots the most significant genes. The reasons of this limitation are the xlim and ylim parameters from EnhancedVolcano() [lines 82 and 85 visualization.R script]. I modified the xlim and ylim parameters with something like:

extract x/y limits based on the log2FC and padj values from results DESeq2 table

log2FC_lim <- max(abs(dea.table$log2FoldChange))
 padj_lim <- -log10(min(dea.table$padj)) # NAs already removed from dea.table

redefine xlim and ylim graph coordinates
EnhancedVolcano(..., xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, padj_lim), ...)

This worked for me using DESeq2. In this way, the user will always get the most significant genes annotated in the volcano plot.

Thank you for your very useful tool.

All the best,
Raquel

how to construct metadata.tsv for paired-end data

Hi,

I was wondering how could I construct metadata.ts for paired_end data? For example, if I have Untreated sample 1,2 and treated sample 3,4. from 4 mice. Could I write metadata.tsv in this way:
sample group subject
S1_R1 untreated 1
S1_R2 untreated 1
S2_R1 untreated 2
S2_R2 untreated 2
S3_R1 treated 3
S3_R2 treated 3
S4_R1 treated 4
S4_R2 treated 4

Thanks!
Kun

process was be killed, ram is not enough.

hi@zhxiaokang,
please help, my ram is not enough, what solutions for this problem . 8g ram only, so bad, hopefully i own a new powerful notebook,...
sincerely,
luoying

DEA error

Hello,

For some reason, I am getting an error in the DEA steps and subsequent visualization. It says there is an error in the data frame, but I am not sure how to solve this error. Any help is appreciated. Thank you.

rule DEA:
input: output/analysis/genome/dea/countGroup/Treatment_1_gene_count.tsv, output/analysis/genome/dea/countGroup/Untreated_gene_count.tsv
output: output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv, output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv, output/analysis/genome/dea/DEA/deg_Untreated_Treatment_1.tsv
jobid: 1

Loading required package: limma
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:limma’:

plotMA

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, [pmax.int](http://pmax.int/),
pmin, [pmin.int](http://pmin.int/), Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

expand.grid

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

anyMissing, rowMedians

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

aperm, apply

Error in $<-.data.frame(*tmp*, "subject", value = integer(0)) :
replacement has 0 rows, data has 4
Calls: DEA -> $<- -> $<-.data.frame
Execution halted
[Sun Jul 3 02:54:56 2022]
Error in rule DEA:
jobid: 1
output: output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv, output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv, output/analysis/genome/dea/DEA/deg_Untreated_Treatment_1.tsv

RuleException:
CalledProcessError in line 38 of /root/RASflow/workflow/dea_genome.rules:
Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1.
File "/root/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA
File "/root/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job DEA since they might be corrupted:
output/analysis/genome/dea/countGroup/Untreated_gene_norm.tsv, output/analysis/genome/dea/countGroup/Treatment_1_gene_norm.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /root/RASflow/.snakemake/log/2022-07-03T025450.360606.snakemake.log
DEA is done!
Start visualization of DEA results!
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
1 plot
2

[Sun Jul 3 02:54:56 2022]
rule plot:
input: output/analysis/genome/dea/countGroup, output/analysis/genome/dea/DEA
output: output/analysis/genome/dea/visualization/volcano_plot_Untreated_Treatment_1.pdf, output/analysis/genome/dea/visualization/heatmap_Untreated_Treatment_1.pdf
jobid: 1

Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns

Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, [pmax.int](http://pmax.int/),
pmin, [pmin.int](http://pmin.int/), Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:hash’:

values, values<-

The following object is masked from ‘package:base’:

expand.grid

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘AnnotationDbi’

The following objects are masked from ‘package:hash’:

keys, keys<-

Loading required package: ggplot2
Loading required package: ggrepel
Error in file(file, "rt") : cannot open the connection
Calls: plot.volcano.heatmap -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'output/analysis/genome/dea/DEA/dea_Untreated_Treatment_1.tsv': No such file or directory
Execution halted
[Sun Jul 3 02:55:01 2022]
Error in rule plot:
jobid: 1
output: output/analysis/genome/dea/visualization/volcano_plot_Untreated_Treatment_1.pdf, output/analysis/genome/dea/visualization/heatmap_Untreated_Treatment_1.pdf

RuleException:
CalledProcessError in line 53 of /root/RASflow/workflow/visualize.rules:
Command ' set -euo pipefail; Rscript scripts/visualize.R output/analysis/genome/dea/countGroup output/analysis/genome/dea/DEA output/analysis/genome/dea/visualization ' returned non-zero exit status 1.
File "/root/RASflow/workflow/visualize.rules", line 53, in __rule_plot
File "/root/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /root/RASflow/.snakemake/log/2022-07-03T025456.705370.snakemake.log
Visualization is done!
RASflow is done!

Issue with environment creation on macOS Big Sur

Hi, when I try to create the rasflow environment with the given command "conda env create -n rasflow -f env.yaml" it gets stuck in the resolving environment step. I have successfully installed and used rasflow in Ubuntu LTS 20.4 and in a WSL vm with the same OS but can't really install it on my work computer. Is there any workaround?

Thanks in advance.

DEA error

Hello Mr.Zhang,
For some reason, I am getting an error in the DEA steps.
Error in tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no") :
all(file.exists(files)) is not TRUE
Calls: DEA -> tximport -> stopifnot
Execution halted
[Mon Sep 12 11:16:57 2022]
Error in rule DEA:
jobid: 1
output: output/WT/trans/dea/DEA/transcript-level/dea_wt_ko.tsv, output/WT/trans/dea/DEA/transcript-level/deg_wt_ko.tsv, output/WT/trans/dea/DEA/gene-level/dea_wt_ko.tsv, output/WT/trans/dea/DEA/gene-level/deg_wt_ko.tsv

RuleException:
CalledProcessError in line 40 of /mnt/d/RNAseq/qhd2/workflow/dea_trans.rules:
Command ' set -euo pipefail; Rscript scripts/dea_trans.R ' returned non-zero exit status 1.
File "/mnt/d/RNAseq/qhd2/workflow/dea_trans.rules", line 40, in __rule_DEA
File "/home/wbc/miniconda3/envs/ranseqSnakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/d/RNAseq/qhd2/.snakemake/log/2022-09-12T110512.827029.snakemake.log
DEA is done!
Visualization is not required and RASflow is done!
Building DAG of jobs...
Nothing to be done.
Complete log: /mnt/d/RNAseq/qhd2/.snakemake/log/2022-09-12T110510.915007.snakemake.log

Any help is appreciated. Thank you.

Running on a cluster

Hi,
I want to run it on a cluster using a bash script, is there some way to silence the checkpoints once the trimming is done. It cannot be run non-interactively.

Strange problem with shopt -s extglob (related to issue #13)

Dear Dr Kang,

First of all, well done, and thank You for RASflow - You made sequencing pipeline truly easy to understand and expand.

Still, I found a wierd behavior in rule getReads of align_count_genome.rules, which possibly generalizes to other extglob-based rules. When running this rule with TRIMMED: yes, it returns...

Command ' set -euo pipefail;  
            shopt -s extglob
            scp -i None /tmp/data/output/test/trim/MSG3701204_small?(.*).f*q.gz /tmp/data/output/test/genome/reads/MSG3701204_small.fastq.gz ' returned non-zero exit status 1.

... despite the file (MSG3701204_small_trimmed.fq.gz) being present in the directory.

The curious thing is, the same rule (getReads) in trim.rules works perfectly well. Similarly, when running align_count_genome.rules with TRIMMED: no, the pipeline finishes fine.

I'm running RASflow inside Singularity container (see: https://github.com/AdrianS85/rnaseq/blob/main/rasflow_sing). My config file: https://github.com/AdrianS85/rnaseq/blob/main/config_main.yaml.

Please let me know if You are interested in any other information.

Best regards,
Adrian

multiQC conflicts in conda install

MultiQC doesn't appear to run in the current state of the conda environment from the yaml file. Attempting to run on a minimal dataset (2 samples) throws 'SyntaxError: future feature annotations is not defined', which seems like it may be related to use of Python version <3.7 . This is overcome by using Python 3.7, but then copytree requires 3.8, which is incompatible with other dependencies.

This is occurring in Linux on both a HPC and a local WSL install.

Problems Nanopore Align

Hi!

I'm trying to use RASFlow for first time with ONT Minion (with master branch), after align step with HISAT2 no reads are aligned. I also tried bwa branch but there are some tokens errors so the pipeline fails (im using macOs Catalina). Any suggestion? maybe there are some hisat2 configuration for Nanopore??

Thank you in advance

Alignment ressults:
38280 reads; of these:
38280 (100.00%) were unpaired; of these:
38280 (100.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.00% overall alignment rate

Wildcards for fastq files

File path built with wildcards do not recognize "sample1.fastq.gz" as a single file if "sample10.fastq.gz" is in same directory, instead it returns a list with the 2 files.

My way to fix it:

IN align_count_genome.rules AND quantify_trans.rules
CHANGE
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}*.f*q.gz {output.read}")
TO
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}.f*q.gz {output.read}")

"Nothing to be done"

Hi,

Thank you for the nice pipeline development.

I first tried QC for the fastq samples without further downstream application. And the feedback from the pipeline is:

main.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(yamlfile)
Is quality control required?
 True
Is trimming required?
 False
Which mapping reference will be used?
 genome
Is DEA required?
 False
Is visualization required?
 False
Please double check the information above
Do you want to continue? (y/n)
y
Start RASflow on project: calderon_et_al
Are you sure that you want to do Quality Control?
 If yes, type 'y'; if not, type 'n' and set 'QC' to 'no' in the config file
y
Start Quality Control!
Building DAG of jobs...
Nothing to be done.
Complete log: /home/RASflow/.snakemake/log/2020-06-08T212542.413870.snakemake.log
Quality control is done!
 Please check the report and decide whether trimming is needed
 Please remember to turn off the QC in the config file!

And I found nothing in my pre-defined output directory.

Can you help sort out the issue?

Many thanks!

Jessie

Workflow exits during plotting

I am trying to figure out a fix for this error. Can you give any advice?

Error in rename(df, c(X_id = "_id")) : 
  Some 'from' names in value not found on 'x': X_id
Calls: plot.volcano.heatmap ... queryMany -> queryMany -> .return.as -> rename -> rename
Execution halted

Error with mapping to genome

How can I change default FDR treshold in DEA module?

Hi!

I am wondering if it is possible to change the default FDR value set for DEA which is 0.05. I would like to check if setting a higher threshold gives interesting results. I checked the corresponding rules file but it doesn't seem to include the value.

Please help! Many thanks :)

htseq-count test data error

Hi @zhxiaokang

i run into a problem when i execute the pipeline with the paired-end test data using htseq-count. There appears the following warning and finally the Error caused by missing file:

        count   jobs
        1       featureCount
        1
17933 GFF lines processed.

Warning: Read SRR1039509.3494278 claims to have an aligned mate which could not be found in an adjacent line.
Warning: 81048 reads with missing mate encountered.

89918 SAM alignment pairs processed.
MissingOutputException in line 99 of /data/tools/RASflow/workflow/align_count_genome.rules:
Missing files after 5 seconds:
output/test/genome/countFile/SRR1039509_count.tsv.summary
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /data/tools/RASflow/.snakemake/log/2022-11-09T132244.600008.snakemake.log
DEA is not required and RASflow is done!

The problem is, that the file SRR1039509_count.tsv.summary is missing but I don't know if the warning of htseq-count is related to this problem.

When I execute the isolated command

htseq-count -f bam -i gene_id -s no -t exon output/test/test/genome/bamFileSort/SRR1039509.sort.bam data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf | sed '/^__/ d' > output/test/genome/countFile/SRR1039509_count.tsv
only the SRR1039509_count.tsv file is created.

With kind regards
Arsenij

Issue with visualization

I am getting an error with the visualization step.

`Querying chunk 56

Finished

Pass returnall=TRUE to return lists of duplicate or missing query terms.

Error in '$<-.data.frame'('tmp', "lab", value = c("Lcn2", "Cd24a", "Sprr1a", :
replacement has 55422 rows, data has 55421

Calls: plot.volcano.heatmap -> EnhancedVolcano -> $<- -> $<-.data.frame

Execution halted
`

I am also getting the warning
`Loading required package: ggrepel

Warning message:

In readLines(con) :

incomplete final line found on 'configs/config_main.yaml'

Querying chunk 1

Querying chunk 2

Any help with these? Am I missing a package in R?
I also noticed it's not possible to use the main.py to run just the visualization without redoing the DEA, running it with DEA set to returns
Which mapping reference will be used? genome Is DEA requred? False Is visualization requred? True Please double check the information above Do you want to continue? (y/n) y Start RASflow on project: project Trimming is not required Start mapping using genome as reference! Building DAG of jobs... Nothing to be done.

Error during visualisation

Hi,
The visualisation fails for DESeq2, for EdgeR it runs, but comes up with this error message. I did not do it by gene since my species bombyx mori while on ensembl was not on the ensembl lookup file.
Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
Error in .request.get(mygene, paste("/query/", sep = ""), params) :
Request returned unexpected status code:
Response [http://mygene.info/v3/query/?fields=symbol]
Date: 2020-03-28 04:33
Status: 400
Content-Type: application/json; charset=UTF-8
Size: 65 B
{
"success": false,
"error": "Missing required parameters."
Calls: plot.volcano.heatmap ... query -> query -> query -> .request.get -> .request.get
Execution halted

Unable to render code block when opening Tutorial.pdf

I am not able to see the tutorial for RASflow. The link points to a pdf here:

https://github.com/zhxiaokang/RASflow/blob/66c9f378e10af7d36dc08fdef46ac97c5e48f713/Tutorial.pdf

But I get a "Unable to render code block" error

zhxiaokang / rasflow Goto Github PK

rasflow's People

Contributors

Stargazers

Watchers

Forkers

rasflow's Issues

genome and annotation files

Recommend Projects

Recommend Topics

Recommend Org