Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

problem with gtf file about rasflow HOT 16 CLOSED

pavlo888 commented on September 14, 2024

problem with gtf file

from rasflow.

Comments (16)

zhxiaokang commented on September 14, 2024

This is an error from featureCounts which seems not able to understand the GFF file you provided. Please check your GFF file and also the parameter "ATTRIBUTE" in config_main.yaml to make sure that they agree with each other.

from rasflow.

pavlo888 commented on September 14, 2024

Hi @zhxiaokang,

I am providing a gff file directly from a Prokka output. Also, the ATTRIBUTE is set as "ID" instead of "gene_id". Nevertheless, I still get an error.

Maybe I should edit the gff file or convert it to some other file type?

Cheers,
Pablo

from rasflow.

zhxiaokang commented on September 14, 2024

Not sure whether this is the issue, but the error says: no features were loaded in format GTF, but you're actually using a GFF. Maybe try to use a GTF file instead of GFF?

And do you mind sharing the GFF file you're using? At least show some lines that include the "ATTRIBUTE".

from rasflow.

pavlo888 commented on September 14, 2024

But Prokka doesn't provide gtf file. Also, gtf files are the old version of gff (gtf version 3, I think?)

I will share the file and the yaml line

from rasflow.

pavlo888 commented on September 14, 2024

This is how the GFF files looks like (these are screenshots since I cannot upload a gff file here)

And this is the line of the config yaml file

# genome and annotation files
GENOME: data/example/ref/genome/027-annot.fna
ANNOTATION: data/example/ref/annotation/027-annot.gff
ATTRIBUTE: ID  # the attribute used in annotation file. It's usually "gene_id", but double check that since it may also be "gene", "ID"...

from rasflow.

zhxiaokang commented on September 14, 2024

Hi, I'm sorry for the bad news but it seems that both featureCounts and htseq-count are designed for GTF format: https://help.galaxyproject.org/t/problems-with-attributes-in-featurecounts-gff3-input-instead-of-gtf/3046/2

As I tried out the example data on a GFF3 file with both tools, and they reported similar errors as what you got that they couldn't find the correct attribute. You may try to convert the GFF3 file into GTF file first. Here are some tools for the conversion: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gff_to_gtf.md

Hope this helps.

from rasflow.

pavlo888 commented on September 14, 2024

Hi @zhxiaokang

I am still having issues even when I have converted the gff file into gtf file using AGAT.
My GTF file looks like this

This is the error I obtain:
`ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'gene_id "nbis-gene-95"; transcript_id "CDACBCJP_00095_gene"; ID "nbis-exon-95"; Parent "CDACBCJP_00095_gene"; inference "ab initio prediction:Prodigal:002006" "similar to AA sequence:ISfinder:ISHaha5"; locus_tag "CDACBCJP_00095"; product "IS110 family transposase ISHaha5"; protein_id "gnl|X|CDACBCJP_00095";'
The program has to terminate.

`Error in rule featureCount:
jobid: 0
output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary

RuleException:
CalledProcessError in line 109 of /home/sam/Downloads/ilse/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail; featureCounts -p -T 4 -t exon -g gene_id -a data/example/ref/annotation/027-annot.gtf -o output-pva/pva027/genome/countFile/20-0357_count.tsv data/output/pva027/genome/bamFileSort/20-0357.sort.bam && tail -n +3 output-pva/pva027/genome/countFile/20-0357_count.tsv | cut -f1,7 > temp.20-0357 && mv temp.20-0357 output-pva/pva027/genome/countFile/20-0357_count.tsv ' returned non-zero exit status 255.
File "/home/sam/Downloads/ilse/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
`

from rasflow.

zhxiaokang commented on September 14, 2024

The GTF file looks good to me, at least the part in the screenshot. Could you share the whole file? I suspect that there are some issues somewhere in the file (not in the screenshot part).

from rasflow.

pavlo888 commented on September 14, 2024

do you have an e-mail for sending the file? I cannot upload here on github.

from rasflow.

zhxiaokang commented on September 14, 2024

[email protected]

from rasflow.

zhxiaokang commented on September 14, 2024

Hi, after testing with your GTF file, I found that featureCounts will throw this error when there are two fields with quotes in the "inference" part. In the ERROR message you posted above, there are "ab initio prediction:Prodigal:002006" and "similar to AA sequence:ISfinder:ISHaha5", and there are more such cases in your GTF file.

I'm trying to post this issue in their Google group but am still waiting to be admitted into the group.

For the time being, you may fix the issue by only cutting the gene_id part from the GTF file with such command:
cut -d';' -f1 027-annot.gtf > 027-annot_only_gene_id.gtf
And use 027-annot_only_gene_id.gtf instead. I have tested this strategy (only picking the gene_id part) on the example data in RASflow, it produced almost the same counts as using the original GTF file.

from rasflow.

pavlo888 commented on September 14, 2024

hi @zhxiaokang

Thank you for providing the command for fixing the GTF file. It worked perfectly!

However, there seems to be another error yet on the DEA visualization step

`Error in glmFit.default(y = y$counts, design = design, dispersion = dispersion, :
dispersion must be numeric
Calls: DEA ... glmFit -> glmFit.DGEList -> glmFit -> glmFit.default
In addition: Warning message:
In estimateDisp.default(y = y$counts, design = design, group = group, :
No residual df: setting dispersion to NA
Execution halted
[Wed Jan 20 09:27:00 2021]
Error in rule DEA:
jobid: 1
output: output-pva/pva027/genome/dea/countGroup/Untreated_gene_norm.tsv, output-pva/pva027/genome/dea/countGroup/Calcium_gene_norm.tsv, output-pva/pva027/genome/dea/DEA/dea_Untreated_Calcium.tsv, output-pva/pva027/genome/dea/DEA/deg_Untreated_Calcium.tsv

RuleException:
CalledProcessError in line 38 of /home/sam/Downloads/ilse/RASflow/workflow/dea_genome.rules:
Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1.
File "/home/sam/Downloads/ilse/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job DEA since they might be corrupted:
output-pva/pva027/genome/dea/countGroup/Untreated_gene_norm.tsv, output-pva/pva027/genome/dea/countGroup/Calcium_gene_norm.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/sam/Downloads/ilse/RASflow/.snakemake/log/2021-01-20T092655.885343.snakemake.log
DEA is done!
Start visualization of DEA results!
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 end
1 plot
2

[Wed Jan 20 09:27:00 2021]
rule plot:
input: output-pva/pva027/genome/dea/countGroup, output-pva/pva027/genome/dea/DEA
output: output-pva/pva027/genome/dea/visualization/volcano_plot_Untreated_Calcium.pdf, output-pva/pva027/genome/dea/visualization/heatmap_Untreated_Calcium.pdf
jobid: 1

Loading required package: plotscale
hash-3.0.1 provided by Decision Patterns

Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:hash’:

values, values<-

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘AnnotationDbi’

The following objects are masked from ‘package:hash’:

keys, keys<-

Loading required package: ggplot2
Loading required package: ggrepel
Error in file(file, "rt") : cannot open the connection
Calls: plot.volcano.heatmap -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'output-pva/pva027/genome/dea/DEA/dea_Untreated_Calcium.tsv': No such file or directory
Execution halted
[Wed Jan 20 09:27:06 2021]
Error in rule plot:
jobid: 1
output: output-pva/pva027/genome/dea/visualization/volcano_plot_Untreated_Calcium.pdf, output-pva/pva027/genome/dea/visualization/heatmap_Untreated_Calcium.pdf

RuleException:
CalledProcessError in line 53 of /home/sam/Downloads/ilse/RASflow/workflow/visualize.rules:
Command ' set -euo pipefail; Rscript scripts/visualize.R output-pva/pva027/genome/dea/countGroup output-pva/pva027/genome/dea/DEA output-pva/pva027/genome/dea/visualization ' returned non-zero exit status 1.
File "/home/sam/Downloads/ilse/RASflow/workflow/visualize.rules", line 53, in __rule_plot
File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/sam/Downloads/ilse/RASflow/.snakemake/log/2021-01-20T092700.946097.snakemake.log
Visualization is done!
RASflow is done!
`
could you direct me on what do to fix it?

Cheers,
Pablo

from rasflow.

zhxiaokang commented on September 14, 2024

Hi Pablo, glad to hear that it works. Regarding the new issue, it actually happens in DEA, or rather, edgeR, since when I searched the error message, it led me to this post: nanoporetech/pipeline-transcriptome-de#6 As mentioned here, the problem is that there are not enough replicates. So in your case, how many replicates do you have in each group?

from rasflow.

pavlo888 commented on September 14, 2024

Hi @zhxiaokang

There are two conditions and I have just one replicate for each condition, so only two sets of reads. Does this mean I cannot go any further with the visualization?

Cheers,
Pablo

from rasflow.

zhxiaokang commented on September 14, 2024

With only one replicate, you actually can't do differential expression analysis (DEA) since that's not enough to make statistics sense.

from rasflow.

pavlo888 commented on September 14, 2024

I was fearing that was the case. Thanks a lot for your help!!!

from rasflow.

problem with gtf file about rasflow HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent