hyunhwan-jeong / salmonte

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances

License: GNU General Public License v3.0

Python 16.68% R 7.00% Jupyter Notebook 76.33%

ngs-analysis pipeline salmon transposable-elements

salmonte's Issues

No version or release tagged

The README file mentions version 0.4; however, this version is not tagged in the repository.

December 19, 2018: SalmonTE has been updated to 0.4 with some improvements!

Please tag/release your versions here on GitHub too, so that we can provide reproducible software installations for our scientists.
Today's clone of the master branch is not necessarily the same as yesterday's.

Best,
Erich

condition problems

Hi!
quant created a condition.csv file, but since the README says that it should be called control.csv, I copied condition.csv to control.csv. I get this error, and I think it's telling me there is something wrong with the condition.csv file.

Step 1: Loading required libraries...
Step 2: Loading input data...
Step 3: Running the DE analysis...
Error in `$<-.data.frame`(`*tmp*`, "condition", value = integer(0)) : 
  replacement has 0 rows, data has 28
Calls: SalmonTE -> $<- -> $<-.data.frame
Execution halted

Both control.csv and condition.csv look like this:

SRR1111111_S1_L001_R2_001,condition
SRR2111111_S1_L001_R2_001,condition
SRR3111111_S1_L001_R2_001,control
SRR4111111_S1_L001_R2_001,control
SRR5111111_S1_L001_R2_001,control
SRR6111111_S1_L001_R2_001,condition
(...)
SRR2811111_S1_L001_R2_001,condition

Also, in the command line I specified --conditions=control,condition without whitespaces.

Do you have an idea for what is going wrong here?

Thank you

Failed SalmonTE example run

I followed the example code and ran
SalmonTE.py quant --reference=hs ~/programs/SalmonTE/example/

However, I got this error:
KeyError in line 3 of /home/wc376/programs/SalmonTE/snakemake/Snakefile.paired:
'index'
  File "/home/wc376/programs/SalmonTE/snakemake/Snakefile.paired", line 3, in <module>
Will exit after finishing currently running jobs.
2018-08-26 18:23:14,773 Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Any ideas? I am running with Python 3.5.1, snakemake-5.2.2, pandas-0.23.4, docopt-0.6.2

No module named 'pandas'

I ran SalmonTE like this:

./SalmonTE.py quant --reference=hs ./example/CTRL_1_R1.fastq

and got the error below:

But it seems the quantification files are generated here:
SalmonTE_output/CTRL_1_R1/quant.sf

but I am not sure if it is right. Appreciate your help. Thanks!

mapping strategy

Hi,

Thanks for your great tool ~

I have read the website and your recently published PSB paper but still could not understand the mapping strategy for repeats (sorry, I am not familiar with bioinformatics). I have some simple questions:

How does SalmonTE deal with multi-mapping reads? Or does it only use uniquely mapping reads for counting?
How does SalmonTE deal with multi-overlapping reads (a read which covers different genes)?
Is it possible to detect differentially expressed TEs if I only have one replicate?

Is there any documentation I could refer to?

Thanks for your help in advance. Sorry for my naive questions...

Best,
Alice

How to build a local index

Dear Hyun-Hwan Jeong,

I am trying to build my own SalmonTE index. In the previous post, I found this link https://github.com/hyunhwaj/SalmonTE/wiki/How-to-build-a-customized-index. However, it did not work. Could you please tell me how to build the SalmonTE index if I have my own fasta file and gtf file?

Thanks in advance,
Yong

quant mode error "name 'directory' is not defined"

Hi. I'm getting an error saying "name 'directory' is not defined".
It seemed to be related to the output directory, but changing the name and path of the output directory didn't solve the problem. What could be the cause of this error?

$ SalmonTE.py quant --reference=mm --outpath=test --num_threads=20 d0.1.control.1.val.1.fq.gz d0.1.control.2.val.2.fq.gz

2019-02-11 11:25:39,544 Starting quantification mode
2019-02-11 11:25:39,544 Collecting FASTQ files...
2019-02-11 11:25:39,545 The input dataset is considered as a paired-ends dataset.
2019-02-11 11:25:39,545 Collected 1 FASTQ files.
2019-02-11 11:25:39,546 Quantification has been finished.
2019-02-11 11:25:39,546 Running Salmon using Snakemake
NameError in line 24 of /home/bio0/bin/SalmonTE-master/snakemake/Snakefile.paired:
name 'directory' is not defined
  File "/home/bio0/bin/SalmonTE-master/snakemake/Snakefile.paired", line 24, in <module>
2019-02-11 11:25:39,673 NameError in line 24 of /home/bio0/bin/SalmonTE-master/snakemake/Snakefile.paired:
name 'directory' is not defined
  File "/home/bio0/bin/SalmonTE-master/snakemake/Snakefile.paired", line 24, in <module>
Traceback (most recent call last):
  File "/home/bio0/bin/SalmonTE-master/SalmonTE.py", line 286, in <module>
    run(args)
  File "/home/bio0/bin/SalmonTE-master/SalmonTE.py", line 237, in run
    run_salmon(param)
  File "/home/bio0/bin/SalmonTE-master/SalmonTE.py", line 153, in run_salmon
    with open(os.path.join(param["--outpath"], "EXPR.csv" ), "r") as inp:
FileNotFoundError: [Errno 2] No such file or directory: 'test/EXPR.csv'

SalmonTE on Galaxy server

Does anyone know if it is already available through a public Galaxy server?

Creation of index with fasta

Hello, yesterday I successfully installed SalmonTE, but I need to create a new index because I have a library of transposable elements for my species. The problem is that I cannot create the index because the names of the classes and species of elements in the FASTA file don't satisfy the requirements. The library is manually curated. How can I solve this problem?

Obtain the chromosome (start/end) position after running SalmonTE.py quant

Hello,

Is there a way to know the chromosome position (e.g. chr1 145233 145561) where my reads map to after running SalmonTE.py quant?

After running SalmonTE.py quant, I can only view the TPM or count values in EXPR.csv. I want to know the start/end positions on the reference that the TE reads actually map to.

Thank you for your help!

Confusion about mapping reads to repetitive elements.

Hi,

Thank you for your efforts to create and maintain this good package, as well as the wiki documents about Repbase. May I ask a question about this?

Based on my understanding, the repetitive elements are taken into consideration during the process. But I found that in the hg.fa file in the 'scripts' directory, besides Homo sapiens, there are other taxa like Mammalia, Eutheria, Primates, etc. Why are these also included here? I would expect that we only consider the repetitive elements in Homo sapiens.

I guess this would cause an overestimate of the reads from repetitive elements because some of them may be mapped to other species. Does this matter?
For example, I searched AGGCGGGCGGATCACGAGG in hg.fa file and I found so many matches in Primates; cgtagtggcgggcgcctgtagtcctagctacttgggaggctgaggcaggagaatggcgtgaacccgggag would be only mapped to ">AluYa8 SINE1/7SL Primates" but not the homo sapiens.

Thus, I have a worry about this problem.

Looking forward to your reply.

Thanks! Have a nice day!

SalmonTE quant outputs empty EXPR.csv file when used with mm reference

I ran the following command:
SalmonTE.py quant --reference=mm lane1.adarsg1_a.1.fastq lane1.adarsg1_a.2.fastq
with 2 fastq files from my experiment.

I am using SalmonTE 0.3 and snakemake 5.1.5. My output file EXPR.csv just contains the string "TE" and nothing else.

I got the following message with no errors:
2018-08-14 15:35:21,905 Starting quantification mode
2018-08-14 15:35:21,905 Collecting FASTQ files...
['lane1.adarsg1_a.1.fastq', 'lane1.adarsg1_a.2.fastq']
2018-08-14 15:35:21,913 The input dataset is considered as a paired-ends dataset.
_R1.fastq _R2.fastq
2018-08-14 15:35:21,913 Collected 1 FASTQ files.
2018-08-14 15:35:21,914 Quantification has been finished.
2018-08-14 15:35:21,914 Running Salmon using Snakemake
Building DAG of jobs...
2018-08-14 15:35:22,215 Building DAG of jobs...
Using shell: /usr/bin/bash
2018-08-14 15:35:22,273 Using shell: /usr/bin/bash
Provided cores: 1
2018-08-14 15:35:22,274 Provided cores: 1
Rules claiming more threads will be scaled down.
2018-08-14 15:35:22,274 Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 collect_abundance
2
2018-08-14 15:35:22,274 Job counts:
count jobs
1 all
1 collect_abundance
2

2018-08-14 15:35:22,275
rule collect_abundance:
output: /n/scratch2/AP403/RNaseProtection_6_11_18/pairedOut/Salmon_processing/SalmonTE_output/EXPR.csv
jobid: 1
2018-08-14 15:35:22,275 rule collect_abundance:
output: /n/scratch2/AP403/RNaseProtection_6_11_18/pairedOut/Salmon_processing/SalmonTE_output/EXPR.csv
jobid: 1

2018-08-14 15:35:22,275
Building DAG of jobs...
Using shell: /usr/bin/bash
Job counts:
count jobs
1 collect_abundance
1
/home/ap403/venv_py_3/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Finished job 1.
2018-08-14 15:35:24,395 Finished job 1.
1 of 2 steps (50%) done
2018-08-14 15:35:24,396 1 of 2 steps (50%) done

2018-08-14 15:35:24,397
localrule all:
input: /n/scratch2/AP403/RNaseProtection_6_11_18/pairedOut/Salmon_processing/SalmonTE_output/EXPR.csv
jobid: 0
2018-08-14 15:35:24,397 localrule all:
input: /n/scratch2/AP403/RNaseProtection_6_11_18/pairedOut/Salmon_processing/SalmonTE_output/EXPR.csv
jobid: 0

2018-08-14 15:35:24,397
Finished job 0.
2018-08-14 15:35:24,398 Finished job 0.
2 of 2 steps (100%) done
2018-08-14 15:35:24,398 2 of 2 steps (100%) done
Complete log: /n/scratch2/AP403/RNaseProtection_6_11_18/pairedOut/Salmon_processing/.snakemake/log/2018-08-14T153522.175632.snakemake.log

Count data for analysis

Thanks for developing SalmonTE and congrats on your recent papers.

I have been playing around with your tool and was wondering about the count output data:

Is it correct that the counts in the output file can be directly used for analysis/plots without further normalisation? Or should the TPM output values be used for that?

Many thanks!
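For context on the counts-versus-TPM question, TPM is a within-sample normalization: each count is divided by the feature length, and the resulting rates are rescaled so that they sum to one million. The sketch below illustrates that arithmetic only. The TE names, counts, and lengths are hypothetical, and the toy uses plain lengths, whereas Salmon computes TPM from effective lengths. It does not settle which values a particular downstream tool expects.

```python
def counts_to_tpm(counts, lengths):
    """Convert per-feature read counts to TPM using feature lengths in bases."""
    # Reads per base for each feature.
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    # Rescale so that all features in the sample sum to one million.
    return {name: rate / total * 1e6 for name, rate in rates.items()}


# Hypothetical values, for illustration only.
example_counts = {"TE_A": 100.0, "TE_B": 300.0}
example_lengths = {"TE_A": 1000, "TE_B": 6000}
tpm = counts_to_tpm(example_counts, example_lengths)
```

In this toy, TE_B has more reads than TE_A but a lower TPM, because its reads are spread over a longer sequence.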

pair-end files

I found that SalmonTE works well when I give it either the R1 FASTQ or the R2 FASTQ. But it gives the error below when I provide both:

I am using SalmonTE 0.3.

Request: add extended clades to references

Hello! I was looking through SalmonTE's references and noticed that their clades don't seem to have all of those listed in the "clades_extended.csv" file. It would just be really nice if all of the clades could be included in all of the references. :)

Input own reference file?

Hello,

I'm looking to quantify TE expression in Sorghum bicolor, and am currently using TEtranscripts. SalmonTE sounds like a very promising alternative, but I was disappointed to find that it only supports Homo sapiens, Mus musculus, Drosophila melanogaster, and Danio rerio.
Is it completely impossible to add functionality for inputting your own reference, should it be in the correct format?
I currently have a genome assembly fasta, TE annotation fasta and gff3, gene annotation fasta and gff3, so would it not be possible to make a reference file myself from these files and use that as the input instead of one of the four currently accepted references?

Best,
Conor

Bioconda support

To make it easier to install SalmonTE a Bioconda package (https://github.com/bioconda/bioconda-recipes) would be great.

Thanks for all your work!

SalmonTE crashes on macOS.

This is a follow-up to #1, which was opened by @sarah872 because she was looking for another way to run the tool. I believe the issue is now resolved, and I will close it once we can confirm that.

Thanks for reporting, @sarah872.

Hyun-Hwan Jeong

what is the expected mapping rate?

Thanks for updating the tool. I am curious what the expected "percent mapped" should be for mouse tumor samples? I am getting anywhere from 2-4%, which seems quite low considering there are a lot of TEs in the genome. Is this expected or do I need to adjust any settings?
Thanks!

Extract Reads similar to TE sequences

Hi,

Would it be possible/easy to incorporate an option where SalmonTE outputs a file containing the reads used to generate the TE count file for each fastq file? I want to manually inspect the reads and see if they really are similar to TE sequences, but I can't do that at the moment.

Thank you!

Test example failed

I deployed SalmonTE on macOS Catalina as follows.

% R --version
R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
% python -V
Python 3.8.2
% pip install --upgrade pip
% pip install docopt
% pip install snakemake

% cd NGSToools
% git clone https://github.com/hyunhwaj/SalmonTE
% echo -n 'export PATH=$HOME/NGSToools/Salmon:$PATH' >> ~/.zshrc
% source ~/.zshrc

I tried to run the test script and got an error.

% cd Salmon
% SalmonTE.py test --inpath=SalmonTE_output --outpath=SalmonTE_statistical_test --tabletype=csv --figtype=png --analysis_type=DE --conditions=control,treatment
2020-11-20 20:13:41,631 Input path is specified incorrectly!

I then tried to run quantification and got a different error.

% SalmonTE.py quant --reference=hs example
2020-11-20 20:15:17,489 Starting quantification mode
2020-11-20 20:15:17,489 Collecting FASTQ files...
2020-11-20 20:15:17,489 SalmonTE assumes that 'example' is a directory, and SalmonTE will search any FASTQ file in the directory.
2020-11-20 20:15:17,496 The input dataset is considered as a paired-ends dataset.
2020-11-20 20:15:17,497 Collected 4 FASTQ files.
2020-11-20 20:15:17,497 Quantification has been finished.
2020-11-20 20:15:17,497 Running Salmon using Snakemake
/var/folders/fh/6cn8tw8s2ns8shxj_m1gdy400000gn/T/tmpqb8nxbk9
/Users/olferievm/NGSTools/SalmonTE/SalmonTE_output/
/Users/olferievm/NGSTools/SalmonTE/reference/hs
/Users/olferievm/NGSTools/SalmonTE/snakemake/Snakefile.paired:103: SyntaxWarning: "is" with a literal. Did you mean "=="?
Job counts:
	count	jobs
	1	all
	1	collect_abundance
	1	collect_mappability
	3
2020-11-20 20:15:17,786 Job counts:
	count	jobs
	1	all
	1	collect_abundance
	1	collect_mappability
	3
/Users/olferievm/NGSTools/SalmonTE/snakemake/Snakefile.paired:103: SyntaxWarning: "is" with a literal. Did you mean "=="?
Job counts:
	count	jobs
	1	collect_abundance
	1
WorkflowError:
Failed to solve the job scheduling problem with pulp. Please report a bug and use --scheduler greedy as a workaround:
Pulp: cannot execute cbc cwd: /Users/olferievm/NGSTools/SalmonTE
Exiting because a job execution failed. Look above for error message
2020-11-20 20:15:18,147 Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/Users/olferievm/NGSTools/SalmonTE/SalmonTE.py", line 296, in <module>
    run(args)
  File "/Users/olferievm/NGSTools/SalmonTE/SalmonTE.py", line 247, in run
    run_salmon(param)
  File "/Users/olferievm/NGSTools/SalmonTE/SalmonTE.py", line 160, in run_salmon
    with open(os.path.join(param["--outpath"], "EXPR.csv" ), "r") as inp:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/olferievm/NGSTools/SalmonTE/SalmonTE_output/EXPR.csv'

It seems something is wrong with Snakemake. How can I fix it?

Sincerely,

Mike.

UnicodeDecodeError

Hi hyunhwaj,

Previously, SalmonTE worked fine for me, but recently when I run the program it gives the error below. Any comments?

~/software/SalmonTE-master/SalmonTE.py quant --reference=m S17_R1_001.fastq.gz S17_R2_001.fastq.gz

Jingyi

Fail at SalmonTE test

I tried to run SalmonTE with the following command:
python3.6 /pfs1/liuguanghui/suyao/tools/SalmonTE/SalmonTE.py test --inpath=/pfs1/liuguanghui/suyao/project/lncRNA/output2/SalmonTE/HOMOP7_H9P7/Output/SalmonTE_output --outpath=/pfs1/liuguanghui/suyao/project/lncRNA/output2/SalmonTE/HOMOP7_H9P7/Output/SalmonTE_statistical_test --tabletype=csv --figtype=png
I got this error:
Error in mutate_impl(.data, dots) : Evaluation error: 0 (non-NA) cases.
Calls: SalmonTE ... -> mutate -> mutate.tbl_df -> mutate_impl -> .Call
Execution halted
Could you tell me how to fix it?

running paired end data with SalmonTE

Hi,
I am trying to run the tool on paired end data.

./SalmonTE.py quant --reference=hs --outpath=outputPE --num_threads=8 PE1/ZI_JR_004_CTTGTA_L001_R1_001.fastq PE1/ZI_JR_004_CTTGTA_L001_R2_001.fastq 
2021-04-09 12:13:05,168 Starting quantification mode
2021-04-09 12:13:05,169 Collecting FASTQ files...
2021-04-09 12:13:05,170 The input dataset is considered as a paired-ends dataset.
2021-04-09 12:13:05,170 Collected 1 FASTQ files.
2021-04-09 12:13:05,170 Quantification has been finished.
2021-04-09 12:13:05,170 Running Salmon using Snakemake

Why is SalmonTE picking up only 1 FASTQ file?
Kindly help.

run_salmon_fq and EXPR.csv file not found error

I am using SalmonTE 0.4 with the default quant function and getting the following error.

fasta files as input

Thank you very much for your tool!
I have a question: can SalmonTE take FASTA files with raw reads as input, instead of FASTQ?
I am currently working on simulated data, and the output of the simulator is always .fasta. Salmon itself seems to accept .fa as raw input for quasi-mapping, but trying to run SalmonTE with my files results in an error for collect_mappability. So I was wondering whether the error I get is due to the input file format or may be connected to something else.
Thank you in advance for your reply.

The command I use is the following:
SalmonTE.py quant --reference=mm --outpath=salmonte --num_threads=20 --exprtype=count simulated/sample_01_1.fasta.gz simulated/sample_01_2.fasta.gz

The error details I get are in the screenshot.

build index from species which is not available in Repbase

Hi Hwan,

Thanks for this nice tool! I want to try it on my RNA-seq data but the species (Rhesus) is not included in Repbase. So I want to create a FASTA file for SalmonTE.

I've seen the tutorial "How to build a customized index" but still don't know how to do it. I'm not familiar with RepeatModeler and Censor. It seems that RepeatModeler is for de novo repeat family identification and Censor is for classification, and neither is easy to use.

Since the RepeatMasker website and the UCSC Table Browser provide detailed repeat information (loci, name, class, family) for many species, can we begin from that? For example, extract some specific TE classes and reformat the file for SalmonTE.

Can this work? I don't know how to reformat the file.

Any suggestions or thoughts would be welcomed.

Best,
Yiwei Niu

One of input files can be paired to multiple files

Hi,
I have paired fastq files for 60 samples in a directory "TRIMMED", then I ran

SalmonTE.py quant --reference=mm TRIMMED/

I got the following error.

2020-06-12 18:58:43,344 Starting quantification mode
2020-06-12 18:58:43,344 Collecting FASTQ files...
2020-06-12 18:58:43,344 SalmonTE assumes that 'TRIMMED/' is a directory, and SalmonTE will search any FASTQ file in the directory.
2020-06-12 18:58:44,711 One of input files can be paired to multiple files

Fastq file names are like these

muscle14f1_R1.fq.gz
muscle14f1_R2.fq.gz

muscle14f2_R1.fq.gz
muscle14f2_R2.fq.gz

I removed all delimiters from the sample names in case the file names were causing a bug.
What could be the problem here?

Thank you~
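One way to check a directory for ambiguous pairs before running quant is to group the files by the part of the name that precedes the read marker. The sketch below is an illustration under assumptions, not SalmonTE's own pairing logic: the _R1/_R2 naming convention is taken from the file names shown above, and check_pairs is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1.<ext> and <sample>_R2.<ext>,
# where <ext> is fq or fastq, optionally gzip-compressed.
READ_MARKER = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.(?:fq|fastq)(?:\.gz)?$")


def check_pairs(file_names):
    """Return (problems, unmatched) for a list of FASTQ file names.

    problems maps a sample name to its R1/R2 files whenever the sample does
    not have exactly one file per read; unmatched lists names that do not
    follow the assumed convention.
    """
    groups = defaultdict(lambda: {"1": [], "2": []})
    unmatched = []
    for name in file_names:
        match = READ_MARKER.match(name)
        if match is None:
            unmatched.append(name)
            continue
        groups[match.group("sample")][match.group("read")].append(name)

    problems = {
        sample: reads
        for sample, reads in groups.items()
        if len(reads["1"]) != 1 or len(reads["2"]) != 1
    }
    return problems, unmatched


files = [
    "muscle14f1_R1.fq.gz", "muscle14f1_R2.fq.gz",
    "muscle14f2_R1.fq.gz", "muscle14f2_R2.fq.gz",
]
problems, unmatched = check_pairs(files)
```

A non-empty problems dictionary points at samples with a missing or duplicated mate, for example when both a .fq.gz and a .fastq.gz copy of the same read file sit in the directory.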

mapping rate

Hello,

I wonder if there is a way to pull out the mapping rate of each sample from SalmonTE?
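Another issue on this page quotes a MAPPING_INFO.csv file with the columns SampleID, num_mapped, num_processed, and percent_mapped. Assuming a run produced a file in that layout, a short sketch like the following could recompute the rate for each sample. The function name is illustrative, and the two rows are copied from that other issue.

```python
import csv
import io


def mapping_rates(csv_text):
    """Return {sample: percent mapped} from MAPPING_INFO.csv-style text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["SampleID"]: 100.0 * int(row["num_mapped"]) / int(row["num_processed"])
        for row in reader
    }


# Rows in the layout quoted elsewhere on this page.
sample_text = """SampleID,num_mapped,num_processed,percent_mapped
sample_04,5611368,83734667,6.701367785937454
sample_03,5461187,83418478,6.546735364795316
"""
rates = mapping_rates(sample_text)
```

To read a real file instead, the text could come from open(path).read(), where path points at the MAPPING_INFO.csv in the output directory.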

SalmonTE test condition problem?

Good evening!
I am running into a problem with SalmonTE test.

Step 1: Loading required libraries...
Step 2: Loading input data...
Step 3: Running the DE analysis...
Error in `$<-.data.frame`(`*tmp*`, "condition", value = integer(0)) :
  replacement has 0 rows, data has 6
Calls: SalmonTE -> $<- -> $<-.data.frame
Execution halted

I used command
SalmonTE.py test --inpath salmonte --outpath salmonte_out --tabletype csv --analysis_type DE --conditions=control,condition

And my condition.csv is edited to be:

SampleID,control
sample_04,condition
sample_03,control
sample_06,condition
sample_01,control
sample_05,condition
sample_02,control

The head of my EXPR.csv is:

TE,sample_04,sample_03,sample_06,sample_01,sample_05,sample_02
B1,401146.0,389164.0,400909.0,391824.0,407338.0,395345.0

The MAPPING_INFO.csv :

SampleID,num_mapped,num_processed,percent_mapped
sample_04,5611368,83734667,6.701367785937454
sample_03,5461187,83418478,6.546735364795316
sample_06,5618886,83767436,6.707721124471328
sample_01,5458106,83385507,6.545629086359097
sample_05,5615023,83732805,6.705881882256303
sample_02,5457687,83411724,6.543069413119911

My issue seems to be same or similar to the one in Issue #39 ....
SalmonTE version I use is SalmonTE 0.4

Thank you in advance for your reply!

Best wishes,
Natalia.
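A quick sanity check before running test mode is to confirm that every sample column in EXPR.csv has a row in condition.csv and that the labels match the values passed to --conditions. The sketch below is only a diagnostic aid and does not claim to identify the cause of the error. It assumes the first column of condition.csv holds the sample ID and the second the label, which is inferred from the files quoted above rather than from SalmonTE's documentation; check_conditions and its has_header switch are hypothetical.

```python
import csv
import io


def check_conditions(expr_text, condition_text, expected_labels, has_header=True):
    """Compare sample IDs and labels between EXPR.csv and condition.csv text."""
    expr_header = next(csv.reader(io.StringIO(expr_text)))
    expr_samples = expr_header[1:]  # the first column holds the TE name

    rows = [row for row in csv.reader(io.StringIO(condition_text)) if len(row) >= 2]
    if has_header:
        rows = rows[1:]
    labels = {row[0].strip(): row[1].strip() for row in rows}

    return {
        "missing": [s for s in expr_samples if s not in labels],
        "extra": [s for s in labels if s not in expr_samples],
        "unexpected_labels": sorted(
            {v for v in labels.values() if v not in expected_labels}
        ),
    }


expr_text = """TE,sample_04,sample_03,sample_06,sample_01,sample_05,sample_02
B1,401146.0,389164.0,400909.0,391824.0,407338.0,395345.0
"""
condition_text = """SampleID,control
sample_04,condition
sample_03,control
sample_06,condition
sample_01,control
sample_05,condition
sample_02,control
"""
report = check_conditions(expr_text, condition_text, ["control", "condition"])
```

For the files quoted above, all three lists come back empty when the first row is treated as a header, so the sample IDs and labels are at least consistent with each other.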

Error using SalmonTE

Hi again,

I am actually still facing an issue with the mouse data I have. When I try to run on a set of fastq files downloaded from NCBI I get the following error:
(snakemake) [smegat@hpc-login1 scripts]$ ./mouseSalmonTE.sh
2019-06-20 15:48:40,434 Starting quantification mode
2019-06-20 15:48:40,435 Collecting FASTQ files...
2019-06-20 15:48:40,435 SalmonTE assumes that '/b/home/medecine/smegat/fus_tdp43_fastq/tdp43_fastq/' is a directory, and SalmonTE will search any FASTQ file in the directory.
2019-06-20 15:48:50,369 A paired-end sample and a single-end sample are placed together.

Which I do not understand since all my files in the folder seem to be properly paired....

Many thanks,

Salim.

Error when running "SalmonTE.py test"

Hi there,

Thanks for developing SalmonTE! It's really fast and suitable for large-scale projects.
But I got an error when running SalmonTE.py test:

Error in FUN(X[[i]], ...) :
  assay colnames() must be NULL or equal colData rownames()
Calls: SalmonTE ... SummarizedExperiment -> SummarizedExperiment -> .local -> vapply -> FUN

Here is a possible solution I found. I hope it's helpful.
By the way, I use R version 3.4.1.

Best,
Wen-Wei

Salmon TE does not load all the fastq

Hi,
I have been trying to run SalmonTE on a Linux server without success. The installation seems fine since I can call SalmonTE.py without errors. However, when I try to run "SalmonTE.py quant", it loads only 1 FASTQ file from the folder I am pointing at (the command is below), although the folder actually contains 36 FASTQ files. Is it a problem with the naming of the files? (See the FASTQ file names below.) Finally, these FASTQ files are paired-end, as you can see, yet it recognized them as single-end.

Thanks for your help !

Salim.

(snakemake) [smegat@hpc-login1 ~]$ SalmonTE.py quant --reference=mm --outpath=Salmon_out_quant --num_threads=12 ~/fusDNLS_fastq
2019-02-19 09:03:19,755 Starting quantification mode
2019-02-19 09:03:19,755 Collecting FASTQ files...
2019-02-19 09:03:19,756 The input dataset is considered as a single-end dataset.
2019-02-19 09:03:19,757 Collected 1 FASTQ files.
2019-02-19 09:03:19,757 Quantification has been finished.
2019-02-19 09:03:19,757 Running Salmon using Snakemake
Job counts:
count jobs
1 all
1 collect_abundance
1 collect_mappability
1 run_salmon_fq
4
2019-02-19 09:03:20,969 Job counts:
count jobs
1 all
1 collect_abundance
1 collect_mappability
1 run_salmon_fq
4

fusDNLS_fastq]$ ls
SRR6924174_1.fastq.gz SRR6924176_2.fastq.gz SRR6924179_1.fastq.gz SRR6924181_2.fastq.gz SRR6924194_1.fastq.gz SRR6924196_2.fastq.gz SRR6924199_1.fastq.gz SRR6924201_2.fastq.gz
SRR6924174_2.fastq.gz SRR6924177_1.fastq.gz SRR6924179_2.fastq.gz SRR6924192_1.fastq.gz SRR6924194_2.fastq.gz SRR6924197_1.fastq.gz SRR6924199_2.fastq.gz Salmon_out_quant
SRR6924175_1.fastq.gz SRR6924177_2.fastq.gz SRR6924180_1.fastq.gz SRR6924192_2.fastq.gz SRR6924195_1.fastq.gz SRR6924197_2.fastq.gz SRR6924200_1.fastq.gz
SRR6924175_2.fastq.gz SRR6924178_1.fastq.gz SRR6924180_2.fastq.gz SRR6924193_1.fastq.gz SRR6924195_2.fastq.gz SRR6924198_1.fastq.gz SRR6924200_2.fastq.gz
SRR6924176_1.fastq.gz SRR6924178_2.fastq.gz SRR6924181_1.fastq.gz SRR6924193_2.fastq.gz SRR6924196_1.fastq.gz SRR6924198_2.fastq.gz SRR6924201_1.fastq.gz

Nothing to be done

SalmonTE ran well on my previous samples, but when I ran it again today it finished without producing any output.

So I wonder if anything is going wrong. Thanks!

How can I obtain the location where reads map to TEs

Hi,

is there a way to check the genomic location where the TEs identified by SalmonTE are located? Then I could easily load the bams into IGV and validate SalmonTE findings and would not have to run a RT-qPCR in the first place.

Thanks for your help!

Repeats annotation

Hello SalmonTE developers,

Possibly a stupid question, but I can't find the answer in the documentation. May I ask which genome version SalmonTE is using for zebrafish? danRer7 or danRer10? (--reference dr).

What's more, how can I find the exact location of the repeats? It seems that SalmonTE only reports the repeat name. For example, after quantification, I found that BEL-40_DRe-LTR is activated in my sample and I would like to know the genomic location of BEL-40_DRe-LTR. I tried to compare with RepeatMasker, but it looks like the names are not always the same.

Thank you very much for your help!

Best,
Alice

Customise Annotation

Hey there,

I am uncertain about the annotation step for building a customised reference, and I was wondering if you could help me gain confidence in what I am doing.

We created a reference FASTA file as you described it using the following fasta-headers:

>SINE2 SINE2/tRNA Brachypodium distachyon

In a next step we run the indexing using the following command:
SalmonTE.py index --input_fasta=BdTE.fasta --ref_name=bd --te_only

The indexing finished without warnings or error and a new folder is added to the reference directory. I even can run the next step quant and it seems to work. So far so good. I am, however, not sure if (and how) I have to customise the clades_extended.csv file? The TEs from my fasta headers are different from the information listed in the clades extended file. Therefore my questions:

Do I have to change the clades_extended file?
Do I have to add my TEs to the existing file?
Would it be better to replace the existing file with a customised file?
What format should I use?

I am not sure about the format used in the clades_extended.csv file. In the provided file you are using the following format:

Mariner/Tc1,Mariner/Tc1,DNA transposon,Transposable Element

TE name
TE name (again?)
Class
TE or simple repeat

Do I understand the format correctly and do I have to use the same or could I use the following instead:

L1-5_BDi,LINE,Retrotransposons,Transposable Element

TE name
Order
Class
TE or simple repeat

Thanks for taking the time to consider my questions.
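To compare the labels in a custom FASTA file with those in a clades file, it can help to list the distinct class labels that the headers contain. The sketch below assumes the three-part header layout shown above: a name, a class label, and then a species name that may contain spaces. That reading is an assumption drawn from the single example header, and parse_header and class_labels are hypothetical helpers, not part of SalmonTE.

```python
def parse_header(line):
    """Split '>NAME CLASS SPECIES...' into (name, class label, species)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    # Split on whitespace at most twice so the species keeps its spaces.
    parts = line[1:].strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError("expected name, class and species: %r" % line)
    name, te_class, species = parts
    return name, te_class, species


def class_labels(fasta_lines):
    """Return the sorted set of class labels found in the FASTA headers."""
    return sorted(
        {parse_header(line)[1] for line in fasta_lines if line.startswith(">")}
    )


example = [
    ">SINE2 SINE2/tRNA Brachypodium distachyon",
    "ACGTACGT",  # placeholder sequence line
]
labels = class_labels(example)
```

The resulting list could then be checked against the first column of the clades file to see which labels have no matching row.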

Cannot run SalmonTE in Snakemake pipeline

Hello,

I get an error about locking the results dir when running SalmonTE in Snakemake pipeline. I forked the repo and got it to run by disabling SalmonTE directory locking. Should I submit a pull request?

command line query

Hi,
I am new to command line ...and a bit confused regarding this:
SalmonTE.py index [--ref_name=ref_name] (--input_fasta=fa_file) [--te_only]

what does --te_only mean ? Do I have to give the TE annotation file ? but in which format ?
kindly help.
I tried running:
./SalmonTE.py quant --reference=hs example/CTRL_1_R1.fastq example/CTRL_2_R1.fastq
2021-04-08 15:43:40,525 Starting quantification mode
2021-04-08 15:43:40,525 Collecting FASTQ files...
2021-04-08 15:43:40,526 The input dataset is considered as a single-end dataset.
2021-04-08 15:43:40,526 Collected 2 FASTQ files.
2021-04-08 15:43:40,526 Quantification has been finished.
2021-04-08 15:43:40,526 Running Salmon using Snakemake
Job counts:
count jobs
1 all
1 collect_abundance
1 collect_mappability
2 run_salmon_fq
5
2021-04-08 15:43:40,668 Job counts:
count jobs
1 all
1 collect_abundance
1 collect_mappability
2 run_salmon_fq
5
Job counts:
count jobs
1 collect_mappability
1
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.
Job counts:
count jobs
1 collect_abundance
1
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.

Documentations

As in #15, several people have pointed out that the documentation in the repository is insufficient, especially for:

Custom Annotation
Output files description

Counts data are not integers

Thanks for developing the SalmonTE tool. It is working wonderfully even on my humble laptop. Just as comparison, I could not run any other TE counting tool as they all needed STAR, and I do not have a server or a machine with 32 Gb RAM.

I have one question about the output file. I see that the count data are not integers; I get decimal values. Could you please explain how the 'counts' are generated?

I am asking because I have gene count data for the same files, and I was thinking about adding the TE counts to the gene counts. Is it possible/OK to concatenate the counts obtained from SalmonTE with the gene counts obtained from HT-Seq mapping -> featureCounts?

Also, does SalmonTE map only a fraction of the reads, or all of them? I ask so that I understand how to add TE counts to gene counts and perform correlation analysis and such.

Thanks!
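Salmon, which SalmonTE runs for quantification, estimates read counts probabilistically, so a read compatible with several sequences is shared among them rather than assigned to exactly one. That is one reason the estimated counts can be fractional. The toy expectation-maximization loop below illustrates the idea only. It ignores sequence length, fragment models, and bias correction, so it is not Salmon's actual algorithm, and the TE names and reads are made up.

```python
def em_counts(read_hits, n_iter=50):
    """Estimate per-target counts from reads that may hit several targets.

    read_hits is a list of lists; each inner list names the targets that one
    read is compatible with.
    """
    targets = sorted({t for hits in read_hits for t in hits})
    if not targets:
        return {}
    abundance = {t: 1.0 / len(targets) for t in targets}
    counts = {t: 0.0 for t in targets}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in targets}
        # E-step: split each read among its targets by current abundance.
        for hits in read_hits:
            if not hits:
                continue  # unmapped read
            weight = sum(abundance[t] for t in hits)
            for t in hits:
                counts[t] += abundance[t] / weight
        # M-step: re-estimate abundances from the expected counts.
        total = sum(counts.values())
        abundance = {t: counts[t] / total for t in targets}
    return counts


# Two reads unique to TE_A, one unique to TE_B, one shared by both.
reads = [["TE_A"], ["TE_A"], ["TE_B"], ["TE_A", "TE_B"]]
estimated = em_counts(reads)
```

Here the shared read is divided between TE_A and TE_B in proportion to their estimated abundances, so neither target ends up with a whole-number count even though four whole reads were processed.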

snakemake error - FileNotFoundError: [Errno 2] No such file or directory: EXPR.csv

Hello,
I am getting the following error while running the quant function:
SalmonTE.py quant --reference=hs --outpath=~/scratch60/SF3B1/SRP061203/SRX1098126/ --num_threads=8 $fq1 $fq2

I tried changing snakemake version to 5.1.5 but error repeats. I am running using a conda environment, not sure if that it is relevant.
thanks !
Manoj

2019-05-17 07:01:02,225 Starting quantification mode
2019-05-17 07:01:02,227 Collecting FASTQ files...
2019-05-17 07:01:02,417 The input dataset is considered as a paired-ends dataset.
2019-05-17 07:01:02,418 Collected 1 FASTQ files.
2019-05-17 07:01:02,418 Quantification has been finished.
2019-05-17 07:01:02,418 Running Salmon using Snakemake
NameError in line 24 of /ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/snakemake/Snakefile.paired:
name 'directory' is not defined
  File "/ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/snakemake/Snakefile.paired", line 24, in <module>
2019-05-17 07:01:04,603 NameError in line 24 of /ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/snakemake/Snakefile.paired:
name 'directory' is not defined
  File "/ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/snakemake/Snakefile.paired", line 24, in <module>
Traceback (most recent call last):
  File "/ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/SalmonTE.py", line 289, in <module>
    run(args)
  File "/ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/SalmonTE.py", line 240, in run
    run_salmon(param)
  File "/ysm-gpfs/home/mp758/project/Tools/SalmonTE/SalmonTE/SalmonTE.py", line 156, in run_salmon
    with open(os.path.join(param["--outpath"], "EXPR.csv" ), "r") as inp:
FileNotFoundError: [Errno 2] No such file or directory: '~/scratch60/SF3B1/SRP061203/SRX1098126/EXPR.csv'
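Separately from the NameError, the final path in the traceback still contains a literal ~, which suggests the shell did not expand it inside --outpath=~/... before the script received the argument. The sketch below shows how a script could normalize such a path itself using only the standard library. normalize_outpath is a hypothetical helper, not a function in SalmonTE.py.

```python
import os


def normalize_outpath(path):
    """Expand a leading '~' and return an absolute path."""
    return os.path.abspath(os.path.expanduser(path))


# The directory name is taken from the command above; it need not exist.
resolved = normalize_outpath("~/scratch60/SF3B1/SRP061203/SRX1098126/")
```

An alternative on the command line is to write the option value with $HOME or as a full absolute path, neither of which relies on tilde expansion.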

small RNA-seq

I did not find this in the manuscript, but would it make sense to use SalmonTE to quantify expression of small RNAs? Or is there a reason not to do it?

My use case is unusual, since it would be quantification of which transposons are being targeted by piRNAs, so not a traditional gene/transcript expression quantification.

I am willing to test it; I just wanted to know if there are any arguments against it before even starting.

Thank you.

Installation issue

Hello and a happy new year!

I have installed SalmonTE and seems everything went well. However, when I tried to run the example in order to test it, I am receiving the following error:

2019-01-07 14:29:47,305 Starting quantification mode
2019-01-07 14:29:47,305 Collecting FASTQ files...
2019-01-07 14:29:47,306 The input dataset is considered as a single-end dataset.
2019-01-07 14:29:47,306 The input dataset is considered as a single-end dataset.
Traceback (most recent call last):
File "/home/x.v.l.01/SalmonTE/SalmonTE.py", line 285, in <module>
run(args)
File "/home/x.v.l.01/SalmonTE/SalmonTE.py", line 231, in run
param = {**args, **collect_FASTQ_files(args['FILE'])}
File "/home/x.v.l.01/SalmonTE/SalmonTE.py", line 125, in collect_FASTQ_files
os.symlink(os.path.abspath(file), file_name)
FileExistsError: [Errno 17] File exists: '/lustrehome/home/x.v.l.01/SalmonTE/example/CTRL_2_R1.fastq' -> '/tmp/tmp0vsal2pn/x.fastq'

It looks like it has an issue in parsing the fastq files.
Any help with this would be much appreciated.

Kind regards,
Vasilis.
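
The `FileExistsError` above comes from `os.symlink`, which refuses to overwrite an existing link. The target name `x.fastq` suggests that two different inputs were mapped to the same link name in the temporary directory. How SalmonTE derives that name is not shown in this thread, so the following is only a minimal sketch of the general idea: keep each input's own base name and fail early, with a clear message, if two inputs would collide. The function `link_inputs` is hypothetical.

```python
import os


def link_inputs(files, tmp_dir):
    """Symlink each input file into tmp_dir under its own base name.

    Hypothetical helper: raises ValueError instead of letting os.symlink
    fail with FileExistsError when two inputs share a base name.
    """
    links = []
    seen = set()
    for path in files:
        name = os.path.basename(path)
        if name in seen:
            raise ValueError(f"duplicate input name: {name}")
        seen.add(name)
        link = os.path.join(tmp_dir, name)
        os.symlink(os.path.abspath(path), link)
        links.append(link)
    return links
```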

SnakeFile Single Issue

Hi,

I'm having an issue with single-end data on the latest version of SalmonTE and latest version of snakemake.

Output directories must be flagged with directory()

Hello, I tried to run SalmonTE both on the example data and on my own data, and the tool fails in both cases.
This is the terminal output:
`SalmonTE.py quant --reference=hs '/home/filippo/SalmonTE/example/CTRL_1_R1.fastq' '/home/filippo/SalmonTE/example/CTRL_1_R2.fastq'
2018-07-27 17:42:44,688 Starting quantification mode
2018-07-27 17:42:44,688 Collecting FASTQ files...
['/home/filippo/SalmonTE/example/CTRL_1_R1.fastq', '/home/filippo/SalmonTE/example/CTRL_1_R2.fastq']
2018-07-27 17:42:44,688 The input dataset is considered as a paired-ends dataset.
CTRL_1_R1.fastq CTRL_1_R2.fastq
2018-07-27 17:42:44,688 Collected 1 FASTQ files.
2018-07-27 17:42:44,688 Quantification has been finished.
2018-07-27 17:42:44,688 Running Salmon using Snakemake
Building DAG of jobs...
2018-07-27 17:42:44,749 Building DAG of jobs...
Using shell: /bin/bash
2018-07-27 17:42:44,757 Using shell: /bin/bash
Provided cores: 1
2018-07-27 17:42:44,757 Provided cores: 1
Rules claiming more threads will be scaled down.
2018-07-27 17:42:44,757 Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 collect_abundance
1 run_salmon_fq
3
2018-07-27 17:42:44,757 Job counts:
count jobs
1 all
1 collect_abundance
1 run_salmon_fq
3

2018-07-27 17:42:44,758
rule run_salmon_fq:
input: /home/filippo/SalmonTE/reference/hs, /tmp/tmp9fe3ywrf/CTRL_1_R1.fastq, /tmp/tmp9fe3ywrf/CTRL_1_R2.fastq
output: /home/filippo/SalmonTE_output/CTRL_1
jobid: 2
wildcards: sample_fq=CTRL_1
2018-07-27 17:42:44,758 rule run_salmon_fq:
input: /home/filippo/SalmonTE/reference/hs, /tmp/tmp9fe3ywrf/CTRL_1_R1.fastq, /tmp/tmp9fe3ywrf/CTRL_1_R2.fastq
output: /home/filippo/SalmonTE_output/CTRL_1
jobid: 2
wildcards: sample_fq=CTRL_1

2018-07-27 17:42:44,758
Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

ImproperOutputException in line 17 of /home/filippo/SalmonTE/snakemake/Snakefile.paired:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule run_salmon_fq:
/home/filippo/SalmonTE_output/CTRL_1
2018-07-27 17:42:45,267 ImproperOutputException in line 17 of /home/filippo/SalmonTE/snakemake/Snakefile.paired:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule run_salmon_fq:
/home/filippo/SalmonTE_output/CTRL_1
Removing output files of failed job run_salmon_fq since they might be corrupted:
/home/filippo/SalmonTE_output/CTRL_1
2018-07-27 17:42:45,268 Removing output files of failed job run_salmon_fq since they might be corrupted:
/home/filippo/SalmonTE_output/CTRL_1
Skipped removing non-empty directory /home/filippo/SalmonTE_output/CTRL_1
2018-07-27 17:42:45,268 Skipped removing non-empty directory /home/filippo/SalmonTE_output/CTRL_1
Shutting down, this might take some time.
2018-07-27 17:42:45,270 Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
2018-07-27 17:42:45,270 Exiting because a job execution failed. Look above for error message
Complete log: /home/filippo/.snakemake/log/2018-07-27T174244.739266.snakemake.log
2018-07-27 17:42:45,271 Complete log: /home/filippo/.snakemake/log/2018-07-27T174244.739266.snakemake.log
Traceback (most recent call last):
File "/home/filippo/SalmonTE/SalmonTE.py", line 276, in
run(args)
File "/home/filippo/SalmonTE/SalmonTE.py", line 235, in run
run_salmon(param)
File "/home/filippo/SalmonTE/SalmonTE.py", line 153, in run_salmon
with open(os.path.join(param["--outpath"], "EXPR.csv" ), "r") as inp:
FileNotFoundError: [Errno 2] No such file or directory: '/home/filippo/SalmonTE_output/EXPR.csv'`

Can you help me?
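
This report and the earlier `name 'directory' is not defined` report look like two sides of the same mismatch: a Snakefile that does not flag directory outputs run under a Snakemake that requires the flag, and a Snakefile that uses `directory()` run under a Snakemake that predates it. The sketch below checks a version string against a minimum. The threshold of 5.2.0 is an assumption based on my recollection of the Snakemake changelog and should be verified there. Both function names are hypothetical.

```python
# Assumed first Snakemake release with the directory() flag; verify in the changelog.
ASSUMED_MIN_DIRECTORY_VERSION = (5, 2, 0)


def parse_version(text):
    """Turn a string like '5.4.0' into (5, 4, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_directory_flag(version_text):
    """Return True if the given version is at or above the assumed minimum."""
    return parse_version(version_text) >= ASSUMED_MIN_DIRECTORY_VERSION
```

The installed version can be read with `snakemake --version` and passed to `supports_directory_flag` as a string.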

reference

I noticed there are two reference options: one is hs and the other is dm. Are they for human and mouse?

salmonTe

UnicodeDecodeError

Hi Hwan,

there was an error when running SalmonTE on my sequencing data:

SalmonTE.py quant --reference=dr --exprtype=TPM --num_threads=4 --outpath="SalmonTE_output/" wt_Fish_1dpf_rep1_R1.fastq.gz wt_Fish_1dpf_rep1_R2.fastq.gz
2018-01-29 14:34:40,158 Starting quantification mode
2018-01-29 14:34:40,159 Collecting FASTQ files...
Traceback (most recent call last):
  File "/fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE.py", line 262, in <module>
    run(args)
  File "/fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE.py", line 216, in run
    param = {**args, **collect_FASTQ_files(args['FILE'])}
  File "/fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE.py", line 77, in collect_FASTQ_files
    if get_first_readid(file_a) == get_first_readid(file_b):
  File "/fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE.py", line 48, in get_first_readid
    return inp.readline().split()[0]
  File "/fsimb/groups/imb-kettinggr/tools/python3/3.6.4/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

(tested on a single paired end sample to isolate the issue)

I traced the issue to the function get_first_readid and specifically to the read file call:
https://github.com/hyunhwaj/SalmonTE/blob/57597369a02c28e23e215338f38da6d6a489539e/SalmonTE.py#L47

Changing the encoding fixed it for me (source):

def get_first_readid(file_name):
    with open(file_name, "r", encoding="ISO-8859-1") as inp:
        return inp.readline().split()[0]

I considered submitting a pull request with the fix, but since there was no issue with the provided example files, I am unsure how general the problem is. I am leaving the solution here in case someone else runs into this problem, or in case you can think of a different (more general) way of solving it.

Other details which might be relevant:

NextSeq generated data
fastq.gz files
Linux operating system

Cheers.
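
The byte `0x8b` at position 1 in the error above is the second byte of the gzip magic number (`0x1f 0x8b`), so the file was being read as plain text while still compressed. Switching the encoding to ISO-8859-1 avoids the exception, but the "read ID" returned is then a fragment of compressed data rather than the real first read name. A more general approach might be to detect gzip input and decompress it on the fly. The sketch below assumes that idea; `open_fastq` is a hypothetical helper, and `get_first_readid` is rewritten here only for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_fastq(file_name):
    """Open a FASTQ file as text, decompressing it if it is gzip-compressed."""
    with open(file_name, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(file_name, "rt")
    return open(file_name, "r")


def get_first_readid(file_name):
    """Return the first whitespace-delimited token of the first line."""
    with open_fastq(file_name) as inp:
        return inp.readline().split()[0]
```

Checking the magic bytes rather than the `.gz` suffix keeps the function working for compressed files that were renamed.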

Empty output for fastq.gz files

Sorry again, Hwan, but I seem to be good at breaking things.

This time I would like to report an issue with compressed FASTQ files. I hit two unexpected results when using my data, which is in fastq.gz format. For debugging I used the example data shipped with SalmonTE, because it works as described in the README when uncompressed. I then compressed the example data and ran SalmonTE on it:

cd example/
for f in *.fastq; do gzip $f; done
cd ..
SalmonTE.py quant --reference=hs example
2018-01-29 15:54:26,614 Starting quantification mode
2018-01-29 15:54:26,614 Collecting FASTQ files...
2018-01-29 15:54:26,620 File extensions of all files must be same.

The error points at an issue with the file extensions but these appear to be ok:

ls example/
CTRL_1_R1.fastq.gz  CTRL_1_R2.fastq.gz  CTRL_2_R1.fastq.gz  CTRL_2_R2.fastq.gz  TARDBP_1_R1.fastq.gz  TARDBP_1_R2.fastq.gz  TARDBP_2_R1.fastq.gz  TARDBP_2_R2.fastq.gz

Using the paired reads of a single sample avoids the above error and SalmonTE runs to completion:

SalmonTE.py quant --reference=hs example/CTRL_1_R1.fastq.gz example/CTRL_1_R2.fastq.gz
2018-01-29 15:54:52,250 Starting quantification mode
2018-01-29 15:54:52,250 Collecting FASTQ files...
2018-01-29 15:54:52,253 The input dataset is considered as a single-end dataset.
2018-01-29 15:54:52,253 Collected 1 FASTQ files.
2018-01-29 15:54:52,253 Quantification has been finished.
2018-01-29 15:54:52,253 Running Salmon using Snakemake
Building DAG of jobs...
2018-01-29 15:54:52,843 Building DAG of jobs...
Using shell: /bin/bash
2018-01-29 15:54:52,995 Using shell: /bin/bash
Provided cores: 1
2018-01-29 15:54:52,995 Provided cores: 1
Rules claiming more threads will be scaled down.
2018-01-29 15:54:52,996 Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       collect_abundance
        2
2018-01-29 15:54:52,996 Job counts:
        count   jobs
        1       all
        1       collect_abundance
        2

2018-01-29 15:54:52,997 
rule collect_abundance:
    output: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE_output/EXPR.csv
    jobid: 1
2018-01-29 15:54:52,997 rule collect_abundance:
    output: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE_output/EXPR.csv
    jobid: 1

2018-01-29 15:54:52,997 
Finished job 1.
2018-01-29 15:54:55,637 Finished job 1.
1 of 2 steps (50%) done
2018-01-29 15:54:55,638 1 of 2 steps (50%) done

2018-01-29 15:54:55,640 
localrule all:
    input: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE_output/EXPR.csv
    jobid: 0
2018-01-29 15:54:55,640 localrule all:
    input: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/SalmonTE_output/EXPR.csv
    jobid: 0

2018-01-29 15:54:55,641 
Finished job 0.
2018-01-29 15:54:55,642 Finished job 0.
2 of 2 steps (100%) done
2018-01-29 15:54:55,643 2 of 2 steps (100%) done
Shutting down, this might take some time.
2018-01-29 15:54:55,644 Shutting down, this might take some time.
Complete log: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/.snakemake/log/2018-01-29T155452.783513.snakemake.log
2018-01-29 15:54:55,644 Complete log: /fsimb/groups/imb-kettinggr/tools/salmonTE/0.2/SalmonTE/.snakemake/log/2018-01-29T155452.783513.snakemake.log

However it doesn't seem to perform any quantifications:

cat SalmonTE_output/EXPR.csv
TE

TE is literally the only thing written to EXPR.csv.

I have now extracted my reads to fastq files and I am using these to run SalmonTE, which is a good enough workaround. I am just letting you know about these issues.

All the best and sorry again for all these bug reports.

António
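
The "File extensions of all files must be same" message appears even though every file in the listing ends in `.fastq.gz`, which suggests the check does not treat the two-part suffix as a single extension. How SalmonTE computes the extension is not shown in this thread, so the following is only a sketch of one way to do it: match against a fixed list of known FASTQ suffixes, longest first. The function names and the suffix list are assumptions.

```python
import os

# Known suffixes, longest first so '.fastq.gz' wins over '.fastq'.
KNOWN_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def fastq_extension(file_name):
    """Return the matching FASTQ suffix in lower case, or None if none matches."""
    base = os.path.basename(file_name).lower()
    for suffix in KNOWN_SUFFIXES:
        if base.endswith(suffix):
            return suffix
    return None


def common_extension(files):
    """Return the single suffix shared by all files, or raise ValueError."""
    suffixes = {fastq_extension(f) for f in files}
    if len(suffixes) != 1 or None in suffixes:
        raise ValueError("File extensions of all files must be same.")
    return suffixes.pop()
```

With the eight example files listed above, `common_extension` would return `.fastq.gz` rather than raising.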
