xunchen85 / ervcaller

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller can accurately detect TE insertions of any length, particularly ERVs. It allows the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and is run from the command line.

Home Page: http://www.uvm.edu/genomics/software/ERVcaller.html

Perl 99.47% R 0.53%

ervcaller's People

Contributors

jakewendt, xunchen85

ervcaller's Issues

Error in User Manual

The original text is as follows:
3.3.2 Detecting TE insertions using paired-end FASTQ file as input
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .fq.gz -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM
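If the -BWA_MEM parameter is used, the program treats the input files as BAM files. The -BWA_MEM parameter has to be removed for the program to run correctly.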

Argument "-" isn't numeric in subtraction (-) at ERVcaller_v.1.3.pl line 1305, <OUT3> line 2.

Hi!

Although I am getting a vcf with HERV-K insertions after running ERVcaller with some bam files (sorted by coordinate and with duplicates marked), I still have the sampleID_temp directory. Shouldn't this directory be erased when the program finishes?

In addition, I have been looking at the logs and I get some errors in different samples:

One of the errors is this:

Argument "-" isn't numeric in subtraction (-) at ERVcaller_v.1.3.pl line 1305, <OUT3> line 2.

In another sample, I get this error:

Error in `[<-.data.frame`(`*tmp*`, , 2, value = NA_real_) :
  replacement has 1 row, data has 0
Calls: [<- -> [<-.data.frame
Execution halted

Do you know how I can solve this?

By the way, I am still getting the output vcf in these samples, so I was wondering if maybe I could ignore it.

Thanks!

Required software

Hi chenxu:
Is Hydra-Version-0.5.3 required by ERVcaller? It seems that 2.2 and 3.1 conflict in the User Manual of ERVcaller v1.4.

Some chromosomes don't harbour any variation

Hello, Dr. Xun Chen,
My target plant species has 26 chromosomes and a genome size of about 2.2 Gb. We name the chromosomes chr01, chr02, chr03...chr24, chr25, chr26. I have run the whole pipeline for 139 individuals, but I found that the final VCF file doesn't have calls for chromosomes from chr01 to chr09. Are there any naming rules for chromosome IDs, or is this software usable for other species such as plants?
All the best,
Rain
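
A quick way to confirm which chromosomes have no calls is to count records per CHROM value in the final VCF. The following is a minimal sketch, not part of ERVcaller; the function names and the example records are made up for illustration, and the expected chromosome list is built from the chr01...chr26 naming described above.

```python
from collections import Counter


def count_records_per_chrom(vcf_lines):
    """Count data records per CHROM value, skipping header and blank lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split(None, 1)[0]  # first whitespace-delimited column
        counts[chrom] += 1
    return counts


def chroms_without_calls(vcf_lines, expected_chroms):
    """Return the expected chromosome names that have no record at all."""
    counts = count_records_per_chrom(vcf_lines)
    return [c for c in expected_chroms if counts[c] == 0]


# Illustrative placeholder records (not real results).
example = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr10\t1000\t.\tT\t<INS_MEI:TE>",
    "chr26\t2000\t.\tA\t<INS_MEI:TE>",
]
expected = [f"chr{i:02d}" for i in range(1, 27)]
print(chroms_without_calls(example, expected))
```

If the per-sample VCF files already lack chr01 to chr09, the cause is upstream of the merging step; if they contain those chromosomes and only the merged file does not, the merging step is the place to look.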

Error of Step 3

This is the command I used:
perl /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl \
-i MD4 \
-f .fastq.gz \
-H /vol6/home/quluj/zt/pkduck_ref/PK_ref.fa \
-T /vol6/home/quluj/zt/DUCK/teseq/LTR.fasta \
-t 12 -S 20 -G
However, it fails at step 3 every time, and I only get an empty VCF file.
Step 3: Validation...

Converting SAM to BAM file, and then Sort and index the BAM file......

[bam_sort_core] merging from 11 files and 11 in-memory blocks...
[bwa_index] Pack FASTA... [bns_fasta2bntseq] Failed to allocate 0 bytes at bntseq.c line 303: Success
[E::bwa_idx_load_from_disk] fail to locate the index files
[E::bwa_idx_load_from_disk] fail to locate the index files
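
The log shows bwa failing to locate index files. The thread does not say which FASTA file is being indexed at this step, so the following is only a sanity-check sketch for any FASTA you pass to bwa: it verifies that the file exists, is not empty, and has the five index files that `bwa index` writes next to it (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The function name is hypothetical.

```python
import os

# Suffixes of the files written by classic `bwa index` next to the FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def check_bwa_reference(fasta_path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not os.path.isfile(fasta_path):
        return [f"FASTA not found: {fasta_path}"]
    if os.path.getsize(fasta_path) == 0:
        problems.append(f"FASTA is empty: {fasta_path}")
    for suffix in BWA_INDEX_SUFFIXES:
        index_path = fasta_path + suffix
        if not os.path.isfile(index_path):
            problems.append(f"missing index file: {index_path}")
    return problems
```

Running this on the files given to `-H` and `-T`, and on any intermediate FASTA left in the temporary directory, may help narrow down which input bwa could not index.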

About using TGS data

Hello Dr.Chen,

I'm wondering whether ERVcaller can take long-read sequencing data as input. Also, do you have any suggestions on how to adjust the command parameters, or any validation results on identification accuracy when applying it to TGS data?

Thank you~

Error unexpected end of file

Hello,

I gave the program a FASTQ file as input, but during the process I got this problem:

gzip: ERR2304551_picard_sorted_soft.fastq.gz: unexpected end of file

How could I fix it?

Thank you!
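
The gzip message indicates that the compressed file ends before its end-of-stream marker, which typically means the file is truncated. A minimal sketch for checking a `.gz` file before running the pipeline, using only the Python standard library (the function name is hypothetical):

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream can be decompressed, else False."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end in chunks; a truncated or corrupt stream raises.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

If the check fails, re-downloading or regenerating the FASTQ file is usually the fix; the same check is available on the command line as `gzip -t`.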

About "xx_h1_sorted.bam" file

Dear ERVcaller team.

Hi, I'm Oh.

I ran your program and got an output file and an output directory.

When I use your test bam, I have only one output file (TE_seq.vcf), and one temp directory.

However, when I use my sample bam, I get one more file.

--> 2016000051.vcf (expected), 2016000051_h1_sorted.bam (not expected)

What is the "_h1_sorted.bam" file?

Do you have any recommended standard filter?

Thank you for developing a great tool.
I have some final VCF files.
I couldn't find any additional recommended filters.
So, for the status of a detected TE (0 to 5), is type 0 OK?
And are the chr and position in the VCF the breakpoint?
I don't understand where the breakpoint is, or what START and END mean.
For example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TE_seq
chr1 5617379 . T <INS_MEI:HERV> . . TSD=NULL,NULL;INFOR=HERVK, 1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000 GT:GQ:GL:DPN:DPI 1/1:40:0,0,1:0:67

Where is the HERVK insertion position?

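Question about SE_MEI

Hello Dr. Chen, the README says that this software includes the SE_MEI scripts, but I could not find them.
So I downloaded and installed SE_MEI from the original author's GitHub (changing the htslib version to 1.1), and ERVcaller now runs through. Is this acceptable?
Thank you!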

bwa-mem2

Hello,
I was wondering if this software works with bwa-mem2.
Thanks in advance!

Cannot remove "sort.bam.tmp.0000.bam" files

Hello, Dr. Chen,

Recently I ran into a problem: after the whole pipeline completes, there are sometimes many "sort.bam.tmp.0000.bam" files left in the output folder, while at other times they are removed, even though I set the same CPU and memory parameters for each run. I am not sure whether the output VCF file is complete when these BAM files are not removed. What might cause this problem, and how can I fix it?

Thank you!

failed at step 3

Hi,
I am not sure about what is happening here. I use the program on my data (which are non human) and everything seems fine until this step. Then the vcf output file at the end is empty.

Step 3: Validation...

Converting SAM to BAM file, and then Sort and index the BAM file......

[bam_sort_core] merging from 0 files and 11 in-memory blocks...
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[E::bwa_idx_load_from_disk] fail to locate the index files
[E::bwa_idx_load_from_disk] fail to locate the index files

Here is my command line to launch the program

perl /home/ubuntu/ERVcaller-1.4/ERVcaller_v1.4.pl -i Dmel_chr2L_sim_100X_150 -f .fq.gz -H /home/ubuntu/genome_Del_1.fasta -T /home/ubuntu/ref_TE.fasta -t 12 -S 20 -BWA_MEM -I /home/ubuntu/ -O /home/ubuntu/Dmel_100X_sim/ -r 150

Thank you in advance for your help!

RNAseq data

Hello, Thank you for sharing the ERVcaller.

I was wondering whether this tool can be used for single cell RNAseq data sets (fastq format).

Any suggestions will be appreciated.

same sample, different result.

hi,
I have human samples and 2 computers.
I ran the same sample on both computers,
but the results were different!
I used the same options, program versions (samtools, bwa, ERVcaller), and reference.
Additionally, I got an error message when the sample size was over 61G.

readline() on closed filehandle TAB at /home/super/Program/ERVcaller/ERVcaller_v1.4.pl line 1376.

This is my command.
perl $ERV_path/ERVcaller_v1.4.pl -i $samplename -f .sorted.RGfixed.Rmdup.bam -H $refer -T $ERV_path/Database/Human_TE_library.fa -I $input_path/ -O $output_path/$samplename/ -r 150 -t 12 -S 40 -BWA_MEM -G

  • I ran multiple samples at the same time, each in its own screen session.
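
Before looking for a cause, it can help to pin down exactly which calls differ between the two runs. The following is a minimal sketch, not part of ERVcaller, that compares two VCF files by (CHROM, POS, ALT) and reports the calls unique to each; it assumes tab-separated VCF records, and the function names are hypothetical.

```python
def vcf_site_keys(vcf_lines):
    """Collect (CHROM, POS, ALT) keys from tab-separated VCF data records."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        keys.add((fields[0], int(fields[1]), fields[4]))
    return keys


def compare_call_sets(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists of site keys."""
    a, b = vcf_site_keys(lines_a), vcf_site_keys(lines_b)
    return sorted(a - b), sorted(b - a)
```

Exact-position matching is strict on purpose: calls that shift by a few base pairs between runs show up in both lists, which is itself useful to know.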

Lack of "GQ" value in output.vcf and some question about "Merging" step

Hello Dr.Chen,

1. I noticed that the example output.vcf file contains "GQ:GL:DPN:DPI" values, while my output.vcf for each sample only has "GT". What might cause this? Is there an important step I missed?

2. When I checked my merge_output.vcf, I found that it did not contain all the insertions identified in each sample's VCF file. I read through the Combine_VCF_files.pl script, and it seems that nearby insertions are combined into one region. However, some separate insertions were not in my merge_output.vcf (for example, Sample1's output.vcf contained an insertion on Chr19, while merge_output.vcf had no insertions on Chr19). Why did this happen?

Sorry to bother you, and thank you so much!!
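
To list exactly which per-sample insertions have no counterpart in the merged file, one option is a windowed lookup: a sample site counts as present if the merged VCF has a record on the same chromosome within some distance. This is only a sketch; the `window` value of 100 bp is an arbitrary assumption, since the distance that Combine_VCF_files.pl uses to combine nearby insertions is not stated in this thread, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def read_sites(vcf_lines):
    """Map CHROM -> sorted list of POS from tab-separated VCF records."""
    sites = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        sites[fields[0]].append(int(fields[1]))
    for positions in sites.values():
        positions.sort()
    return sites


def unmatched_sites(sample_lines, merged_lines, window=100):
    """Return sample (CHROM, POS) pairs with no merged record within `window` bp."""
    merged = read_sites(merged_lines)
    missing = []
    for chrom, positions in read_sites(sample_lines).items():
        merged_pos = merged.get(chrom, [])
        for pos in positions:
            # First merged position that is >= pos - window.
            i = bisect_left(merged_pos, pos - window)
            if i == len(merged_pos) or merged_pos[i] > pos + window:
                missing.append((chrom, pos))
    return missing
```

The resulting list gives concrete records to inspect, for example to see whether the missing insertions share a FILTER value or a status code.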

Illegal division by zero

When @insert_size = 0, the script reports an error:
Illegal division by zero at /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl line 1623

About required programs

Dear Xun Chen,

Hi. I'm Oh.

I have three questions.

  1. In your programs, there is no process that uses "Hydra". Is that right?

  2. Where can I find "TE_consensus.fa" in your user manual?

In my ERVcaller-1.4/Database folder, there are three fasta files (ERV_library.fa, HERVK.fa, Human_TE_library.fa).

I want to use all fasta files in database (ERV_library.fa, HERVK.fa, Human_TE_library.fa).

Can I use a merged and indexed FASTA file built from the three FASTA files?

Like this,

$ cat ERV_library.fa HERVK.fa Human_TE_library.fa > All.fa
$ bwa index All.fa
$ perl ERVcaller_v1.4.pl -i TE_seq -f .bam -H hg38.fa -T All.fa

Thanks.
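
One thing worth checking before indexing a concatenated library is whether the same sequence name appears more than once, since the three files could overlap and duplicate names would make hits ambiguous. Whether the bundled libraries actually overlap is not stated in this thread. A minimal sketch (hypothetical function names) that reports duplicated FASTA record names:

```python
from collections import Counter


def fasta_names(lines):
    """Yield the record name (first word after '>') of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def duplicated_names(lines):
    """Return the sorted record names that occur more than once."""
    counts = Counter(fasta_names(lines))
    return sorted(name for name, n in counts.items() if n > 1)
```

Passing an open file handle for All.fa as `lines` works, because iterating a text file yields its lines.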

Step 2 - Improper reads onwards, bwa-mem fails to read index files

Hi there,

From the improper reads step, bwa-mem fails to read the index files - I did index both the input bam and reference genome fa files, which is why the previous step 'Chimeric and split reads' worked. Beyond that, bwa-mem does not want to work.

Any help would be much appreciated!

A question about the "INFOR" field

Hello, Dr. Xun Chen,
I have a question about the INFO column. For the "INFOR" field, the description is "NAME,START,END,LEN,DIRECTION,STATUS". However, it seems that "START" is not always 1. Here are some examples from my results:

#CHROM POS ID REF ALT QUAL FILTER INFO
chr01 629412 . T <INS_MEI:TE_00011020_LTR#ClassII_DNA_Mutator_nMITE> . PASS CR=114;GR=0,0.656,0.745,0.633,0.659,0.444,0.333;GTF=YES;INFOR=TE_00011020_LTR#DNA/Mutator,571,839,269,+,5;SR=8
chr01 951226 . G <INS_MEI:TE_00005384#ClassII_DNA_hAT_MITE> . PASS CR=23;GR=0,0.581,0.359;GTF=YES;INFOR=TE_00005384#DNA/hAT,3526,6274,2749,+,5;SR=9
chr01 951675 . T <INS_MEI:TE_00003559_INT#ClassI_LTR_Copia> . PASS CR=5;GR=0,0.556,0.467;GTF=YES;INFOR=TE_00003559_INT#LTR/Copia,1836,1908,73,-,5;SR=7
chr01 986317 . A <INS_MEI:TE_00006433_LTR#ClassI_LTR_Gypsy> . PASS CR=9;GR=0,1,0.444,0.5;GTF=YES;INFOR=TE_00006433_LTR#LTR/Gypsy,6,2539,2534,+,5;SR=6
chr01 990068 . A <INS_MEI:TE_00011590_LTR#ClassI_LTR_Gypsy> . PASS CR=8;GR=0,0.611;GTF=YES;INFOR=TE_00011590_LTR#LTR/Gypsy,2024,2437,414,+,5;SR=3

I'm wondering what the insertion range of a TE insertion is. For example, in the first record, the POS is chr01:629412, while the "START" of the "INFOR" field is 571. Does that mean the actual start position of the insertion is "629412 + 571" and the end position is "629412 + 839"? Or is the start position 629412? In my current study, I need to obtain the start and end positions of TE insertions. I'm confused; could you help me?
Best wishes,
Rain
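
For working with these records programmatically, the INFOR field can be split into the six named parts given in its description (NAME,START,END,LEN,DIRECTION,STATUS). The sketch below is not part of ERVcaller and its names are hypothetical. It does not settle what START and END refer to, but it makes one pattern easy to check: in every record quoted above, LEN equals END - START + 1.

```python
from typing import NamedTuple


class Infor(NamedTuple):
    name: str
    start: int
    end: int
    length: int
    direction: str
    status: int


def parse_info(info_column):
    """Split a VCF INFO column into a dict; flags without '=' map to True."""
    fields = {}
    for item in info_column.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_infor(info_column):
    """Extract INFOR as NAME,START,END,LEN,DIRECTION,STATUS."""
    raw = parse_info(info_column)["INFOR"]
    # Split from the right so a comma inside NAME would not break parsing.
    name, start, end, length, direction, status = raw.rsplit(",", 5)
    return Infor(name.strip(), int(start), int(end), int(length),
                 direction.strip(), int(status))


info = ("CR=114;GR=0,0.656,0.745,0.633,0.659,0.444,0.333;GTF=YES;"
        "INFOR=TE_00011020_LTR#DNA/Mutator,571,839,269,+,5;SR=8")
rec = parse_infor(info)
print(rec.name, rec.start, rec.end, rec.end - rec.start + 1 == rec.length)
```

That arithmetic is at least consistent with START and END being coordinates along a sequence of the reported length rather than offsets to add to POS, though the thread contains no confirmation from the maintainers.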

Problem in Step 2

Hello,

I ran ERVcaller and ran into this problem:

Step 2: Detecting TE insertions...

~~~~~ the input bam file was indexed
sh: extractSoftclipped: command not found

What should I do?
Thank you.
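
The "command not found" message means the shell could not find `extractSoftclipped` on the PATH. A minimal sketch for checking which external commands are visible before starting a run; the tool list below is drawn only from names mentioned in these issues (it is not a complete requirements list), and the function name is hypothetical.

```python
import shutil


def missing_tools(names, path=None):
    """Return the command names that cannot be found on PATH (or on `path`)."""
    return [name for name in names if shutil.which(name, path=path) is None]


# Names taken from this thread; adjust to the manual's requirements list.
tools = ["bwa", "samtools", "extractSoftclipped"]
print("not found on PATH:", missing_tools(tools))
```

Any tool reported here needs to be installed, or its directory added to PATH, in the same shell session that launches ERVcaller.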

The help message of ERVcaller_v1.4.pl

After I typed the command $ perl ERVcaller_v1.4.pl, I got the help message and examples, but I found that the examples conflict with the User Manual of ERVcaller v1.4.
The help message is as follows:
perl ./ERVcaller_v1.4.pl

Examples for detecting ERV and other TE insertions:

Detecting TE insertions with a BAM file as the input

   perl /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl -i TE_seq.fa -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM

Detecting TE insertions with paired-end FASTQ file as the input

   perl /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl -i TE_seq.fa -f .fq.gz -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM

Detecting TE insertions with separated BAM file(s) as the input

   perl /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl -i TE_seq.fa -f .list -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -m

Detecting and genotyping TE insertions with a BAM file as the input

   perl /vol6/home/quluj/zt/software/ERVcaller_v1.4/ERVcaller_v1.4.pl -i TE_seq.fa -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -G
