bioinformatics-centre / bayesembler

A Bayesian method for doing transcriptome assembly from RNA-seq data

License: MIT License

C++ 100.00%

bayesembler's Issues

Detect nd data

Bayesembler works well on big datasets, but it could be optimised further with a simple check for whether the *nd_minus and *nd_plus BAMs have already been created.

Why? Because it is desirable to run Bayesembler multiple times on the same data to test parameters, while avoiding the hefty IO of the initial duplicate removal stage.

The NGS dataset I am working with at the moment combines multiple samples into a single BAM of ~160 GB.

Thanks a lot, and also for the previous update.

Colin
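
Until such a check exists in Bayesembler itself, something similar can be done on the user side before launching a run. The sketch below is only that: the function names are hypothetical, and the file-name pattern (output prefix, then the BAM name, then `_nd_plus.bam` or `_nd_minus.bam`) is an assumption inferred from log lines such as `testaccepted_hits_nd_plus.bam` in other reports. It only detects the files; whether Bayesembler can be told to reuse them is not stated anywhere in these issues.

```python
from pathlib import Path


def find_nd_bams(directory, stem):
    """Return previously written duplicate-free BAMs, keyed by strand.

    The pattern '<prefix><stem>_nd_<strand>.bam' is an assumption based on
    file names seen in Bayesembler log output, not a documented convention.
    """
    found = {}
    for strand in ("plus", "minus"):
        matches = sorted(Path(directory).glob(f"*{stem}_nd_{strand}.bam"))
        # Ignore empty files, which may be left behind by an aborted run.
        matches = [m for m in matches if m.stat().st_size > 0]
        if matches:
            found[strand] = matches[0]
    return found


def can_skip_duplicate_removal(directory, stem):
    """True only when both strand-specific files already exist."""
    return set(find_nd_bams(directory, stem)) == {"plus", "minus"}
```

For example, `can_skip_duplicate_removal(".", "accepted_hits")` would report whether both files from an earlier run on `accepted_hits.bam` are present in the current directory.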

Feature request: Standardise Bayesembler output to Cufflinks

Hi Lasse,

I have now successfully run Bayesembler on several transcript sets and am trying to get the CDS sequences from the output. Do you have any experience with this?

In the past I have used a Perl script from the transdecoder 2012 package for this: http://transdecoder.github.io/
Script: cufflinks_gtf_genome_to_cdna_fasta.pl

It did a good job and I'd like to use it again, but on the Bayesembler data it only extracts the individual exon sequences, without combining them into one long multi-exon CDS (without introns).

I also saw some differences between the Cufflinks and Bayesembler outputs (see the two examples below, which are not the same transcript):

There is no overarching "transcript" feature in the Bayesembler output, which has also bothered some of my users in the graphical display in JBrowse.
There is no "exon_number" attribute in the Bayesembler output.

Would it be possible to add these? I believe compatibility with downstream programs and scripts that work with Cufflinks output would be a major advantage, if it is easy to implement.

Thanks very much,
Colin

1 Cufflinks transcript 93387 111087 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 93387 93516 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "1"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 109510 109675 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "2"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 109759 109935 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "3"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 110769 111087 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "4"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";

6 Bayesembler exon 20387288 20389207 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq0"; transcript_confidence "1"; FPKM "38.6952"; FPKM_sd "0.405607"; expected_count "8384.26";
6 Bayesembler exon 20468756 20470547 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq1"; transcript_confidence "1"; FPKM "12.247"; FPKM_sd "0.245679"; expected_count "2444.2";
6 Bayesembler exon 20532311 20533039 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq3"; transcript_confidence "1"; FPKM "11.7757"; FPKM_sd "0.456496"; expected_count "677.976";
6 Bayesembler exon 20540566 20541604 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq4"; transcript_confidence "1"; FPKM "15.0505"; FPKM_sd "0.3873"; expected_count "1489.79";
6 Bayesembler exon 20555972 20556532 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq5"; transcript_confidence "1"; FPKM "12.4851"; FPKM_sd "0.5949"; expected_count "438.632";
6 Bayesembler exon 20564403 20564704 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq7"; transcript_confidence "1"; FPKM "4.82021"; FPKM_sd "1.08253"; expected_count "20.3528";
6 Bayesembler exon 20571524 20573317 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq8"; transcript_confidence "1"; FPKM "6.77399"; FPKM_sd "0.183838"; expected_count "1353.73";
6 Bayesembler exon 20580763 20582535 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq9"; transcript_confidence "1"; FPKM "27.5812"; FPKM_sd "0.360588"; expected_count "5434.54";
6 Bayesembler exon 20492633 20493290 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq10"; transcript_confidence "1"; FPKM "6.17059"; FPKM_sd "0.36231"; expected_count "296.741";
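
As a stopgap, the two missing pieces can be derived from the exon rows alone. The following is a minimal post-processing sketch, not part of Bayesembler: `add_transcript_features` is a hypothetical helper, it assumes a tab-separated GTF (the examples above lost their tabs when pasted), and it numbers exons by ascending start coordinate because that is what the minus-strand Cufflinks example above does. The transcript row simply reuses the attributes of the first exon.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def add_transcript_features(gtf_lines):
    """Add 'transcript' rows and exon_number attributes to an exon-only GTF.

    Returns the new rows as a list of tab-separated strings. Non-exon rows,
    comments and malformed lines are skipped.
    """
    exons = defaultdict(list)  # transcript_id -> list of split exon rows
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append(fields)

    out = []
    for rows in exons.values():  # dicts keep insertion order
        rows.sort(key=lambda f: int(f[3]))
        first = rows[0]
        end = max(int(f[4]) for f in rows)
        # Transcript row: spans all exons, attributes copied from first exon.
        out.append("\t".join(
            first[:2] + ["transcript", first[3], str(end)] + first[5:]))
        for number, fields in enumerate(rows, start=1):
            # Place exon_number right after transcript_id, as Cufflinks does.
            attrs = TRANSCRIPT_ID.sub(
                lambda m: f'{m.group(0)}; exon_number "{number}"',
                fields[8], count=1)
            out.append("\t".join(fields[:8] + [attrs]))
    return out
```

Typical use would be something like `add_transcript_features(open("assembly.gtf"))`, writing the returned rows to a new file. Whether a given downstream script then accepts the result is untested here.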

Feature request: STAR

Dear all,

I know this is listed on the website and is being worked on, but I'd like to say that STAR support is the most critical of the planned features for me.
STAR provides huge practical advantages over Tophat in terms of speed and, according to my tests, accuracy, especially with its two-pass alignment mode.

Thanks,
Colin

core dump when removing duplicates

Hi,
I am trying to run Bayesembler on a tophat2 (2.0.8) generated file from human K562 whole cell polyA+ RNA-seq (dUTP protocol):
https://www.encodeproject.org/files/ENCFF412EYU/@@download/ENCFF412EYU.bam

Here is the command I ran on a SGE cluster:
~/bin/bayesembler_v1.2.0_linux_x86_64/bayesembler -b $bam -o K562_wcell_polya+_biorep1_cshl_tophat2 -p 4 -s first 2> bayesembler.err
where $bam refers to the above bam file

and I get this in the system error file
/usr/share/univage/soldierantcluster/spool/node-hp0100rg/job_scripts/1424097: line 3: 39971 Aborted (core dumped) ~/bin/bayesembler_v1.2.0_linux_x86_64/bayesembler -b /users/rg/jlagarde/projects/encode/scaling/whole_genome/3ncod3_production_files/www.encodeproject.org/files/ENCFF412EYU/@@download/ENCFF412EYU.bam -o K562_wcell_polya+_biorep1_cshl_tophat2 -p 4 -s first 2> bayesembler.err

this in the standard error:
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:460: std::vector<std::pair<std::list, std::basic_string > > Assembler::generateSpliceGraphs(double*): Assertion `!current_alignment.IsFailedQC()' failed.

and a core file of 80M

Please let me know if I did something wrong, or if you need more information to solve the problem.

thanks
sarah
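
The assertion above fires on an alignment flagged as failing quality control, and another report in this tracker hits a similar assertion on `IsPaired()`. Both properties are bits in the SAM FLAG field (0x200 for QC failure, 0x1 for paired), so a quick tally shows whether the input contains such records at all. The sketch below is a diagnostic only, with a hypothetical function name; it assumes SAM text as input and makes no claim about what Bayesembler does with the records.

```python
from collections import Counter

FLAG_PAIRED = 0x1     # template has multiple segments
FLAG_QC_FAIL = 0x200  # record did not pass quality controls


def tally_flags(sam_lines):
    """Count records that are unpaired or marked as failing QC."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        flag = int(fields[1])
        counts["total"] += 1
        if not flag & FLAG_PAIRED:
            counts["unpaired"] += 1
        if flag & FLAG_QC_FAIL:
            counts["qc_fail"] += 1
    return counts
```

The input can come from `samtools view in.bam` piped into a small script that calls `tally_flags(sys.stdin)`. If QC-failed records turn up, `samtools view -b -F 0x200` writes a BAM without them; whether that is enough to get past the assertion has not been verified here.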

Feature request: merge GTFs

Hi,

It would be nice to have a GTF merging feature similar to Cufflinks merge. However, I was always a little disappointed by Cufflinks merge, and I want to be able to merge transcripts from multiple samples more aggressively.

For example, I might have 20-40 input GTFs and want to create either a small, high-quality set of transcripts (30-50k) or an extended set (100-150k).

At the moment I have to do this via conversion to FASTA format and sequence-based clustering, which is time-consuming and can give unsatisfactory results.

How do you see this? Is it a good idea, too much effort, or already covered by tools like PASA, Cufflinks or alternatives?

Thanks

Error in Bayesembler v1.2.0

Hi,

I'm trying to test Bayesembler, but so far without success. My data is paired-end and I ran Tophat2 with default parameters. Still, I'm getting error messages like the following:

$ bayesembler -b tophat_out/accepted_hits.bam -o Bayes_test

You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk

bam_nd_pe_plus_file_nameBayes_testaccepted_hits_nd_plus.bam
[12/06/2015 11:23:05] Removing duplicate reads
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:336: std::vector<std::pair<std::list<GraphInfo>, std::basic_string > > Assembler::generateSpliceGraphs(double*): Assertion `current_alignment.IsPaired()' failed.
Aborted

What is wrong and how do I fix this?

best,
Patrik

Feature request: Output file

Hi,

Sorry if this is not the appropriate forum, but it would be useful (or rather, necessary) to add an 'outfile' option to Bayesembler. As it stands, it writes into the folder it is run from and always creates a statically named file (assembly.gtf). This makes it frustrating to use in pipelines or other complex workflows.

/Marc
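
A common workaround for tools with a fixed output name is to give each run its own working directory and move the result afterwards. The sketch below illustrates that pattern under stated assumptions: `run_and_collect` is a hypothetical helper, the file name `assembly.gtf` is taken from the report above, and the command list is whatever the caller would normally type. Because the working directory changes, input paths in the command need to be absolute.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_and_collect(command, out_path, produced_name="assembly.gtf"):
    """Run `command` in a scratch directory and move its output file.

    command       -- argument list, e.g. ["bayesembler", "-b", "/abs/in.bam"]
    out_path      -- where the statically named output should end up
    produced_name -- the fixed file name the tool writes (assumed)
    """
    out_path = Path(out_path).resolve()
    with tempfile.TemporaryDirectory() as workdir:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, cwd=workdir, check=True)
        produced = Path(workdir) / produced_name
        if not produced.exists():
            raise FileNotFoundError(f"{produced_name} was not written")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(produced), str(out_path))
    # Everything else written to the scratch directory is discarded here.
    return out_path
```

Concurrent runs then no longer overwrite each other's output. The trade-off is that intermediate files, such as the duplicate-free BAMs, are thrown away with the scratch directory, so a persistent per-sample directory may suit repeated runs better.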

running error

Hi.

I want to run bayesembler on my data, but the program shows me this error.
I don't know what the problem is. Please tell me how I can solve it.

$ bayesembler -b Atcol0.bam -p 64
You are using the Bayesembler v1.1.1. For more information go to bayesembler.binf.ku.dk

[17/12/2014 10:01:09] Removing duplicate reads
[17/12/2014 10:22:08] Removed duplicates from 19745500 mapped read pairs
[17/12/2014 10:22:08] Wrote 17018947 read pairs used for splice-graph construction

[17/12/2014 10:22:08] Spawning graph construction thread
[17/12/2014 10:22:08] Generating splice-graphs from Atcol0_nd_unstranded.bam using cem
bayesembler: /opt/bayesembler/src/assembler.cpp:687: void Assembler::graphConstructorCallback(std::string, std::list<GraphInfo>*, std::string, boost::mutex*, int*, int*): Assertion `system(instance_system_string.c_str()) == 0' failed.
Aborted (core dumped)

Compilation of source fails with cryptic error

Hi,

I have been trying to compile the source following the somewhat sparse instructions. The final 'make' fails cryptically with:

/sw/source/bayesembler/bayesembler/src/main.cpp:41: error: expected initializer before ‘’ token
/sw/source/bayesembler/bayesembler/src/main.cpp: In function ‘int main(int, char const_)’:
/sw/source/bayesembler/bayesembler/src/main.cpp:60: error: ‘class boost::program_options::typed_value<std::basic_string<char, std::char_traits, std::allocator >, char>’ has no member named ‘required’
/sw/source/bayesembler/bayesembler/src/main.cpp:120: error: no matching function for call to ‘parse_command_line(int&, char_ const_&, boost::program_options::options_description&)’
make[2]: *** [CMakeFiles/bayesembler.dir/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/bayesembler.dir/all] Error 2

CMakeList:

cmake_minimum_required(VERSION 2.6)
project(bayesembler)

set(CMAKE_CXX_COMPILER "g++")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")

set(BOOST_LIB_DIR "/sw/boost_1_55_0/lib/")
set(BOOST_INCLUDE_DIR "sw/boost_1_55_0/include/" )
set(BAMTOOLS_LIB_DIR "/sw/bioinfo/bamtools/2.3.0/lib/")
set(BAMTOOLS_INCLUDE_DIR "/sw/bioinfo/bamtools/2.3.0/include/")
set(EIGEN_INCLUDE_DIR "/sw/system/eigen/3.0.7/include/eigen3")

include_directories(. ${BOOST_INCLUDE_DIR} ${BAMTOOLS_INCLUDE_DIR} ${EIGEN_INCLUDE_DIR})
link_directories(${BOOST_LIB_DIR} ${BAMTOOLS_LIB_DIR})

[SNIP]

Error running bayesembler 1.2.0

When trying to run bayesembler 1.2.0 on an alignment generated by tophat2, I get the following error:

bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:186: void Assembler::markDuplicates(BamTools::BamAlignment&, Assembler::FirstReads_, Assembler::ReadPairs_): Assertion `cur_pos_first_reads_it->second.insert(pair<ReadId, BamTools::BamAlignment*>(ri, new BamTools::BamAlignment(current_alignment))).second' failed.

I assumed it was an issue with the order of the alignments in the BAM file, but it still happens after re-sorting the BAM file with samtools, regardless of the samtools version. I was able to run bayesembler on other datasets using the same version of tophat2 without issue, so it doesn't seem to be a problem with the installation.

Insufficient number of observations

Can you please suggest what values I should set manually to resolve the following error?

[17/10/2017 11:26:33] Sorting splice-graphs by read count
[17/10/2017 11:26:33] Finished sorting splice-graphs by read count

[17/10/2017 11:26:33] Spawning 1 thread(s) for fetching alignments and 1 i/o thread
[17/10/2017 11:26:33] Estimating fragment length distribution from 4 transcripts longer than 2500 nucleotides
ERROR: Insufficient number of observations for estimating the fragment length distribution parameters. Please specify these manually using the --frag-mean and --frag-sd options.
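
One way to arrive at values for `--frag-mean` and `--frag-sd` is to compute them from observed fragment lengths, for example the absolute TLEN values (column 9) of properly paired reads in the BAM. A log in another report on this page shows a median of 178, a median absolute deviation of 57 and a resulting Gaussian SD of 84.5082, which matches SD = 1.4826 × MAD. The sketch below assumes that same conversion; the function name is hypothetical, and extracting the lengths is left to the caller.

```python
import statistics

# Scaling constant that turns a median absolute deviation into a standard
# deviation for normally distributed data. Its use by Bayesembler is inferred
# from logged numbers (57 * 1.4826 = 84.5082), not from the source code.
MAD_TO_SD = 1.4826


def fragment_length_parameters(lengths):
    """Return (mean estimate, SD estimate) from observed fragment lengths.

    The median is used as the mean estimate and the scaled median absolute
    deviation as the SD estimate, so a few extreme values have little effect.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no fragment lengths given")
    median = statistics.median(lengths)
    mad = statistics.median(abs(x - median) for x in lengths)
    return median, MAD_TO_SD * mad
```

Note that TLEN spans introns for spliced pairs, so very large values should be filtered out before calling the function. Bayesembler itself restricts its estimate to long transcripts, which this sketch does not attempt to reproduce.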

error running bayesembler 1.2.0

I downloaded the bayesembler 1.2.0 binary for Linux and ran it on one of my tophat2-generated BAM files as follows:

bayesembler -b ~/work/ngs_dat/project1/tophat2/sample1/accepted_hits.bam -o test

But I got the following error:

You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk

bam_nd_pe_plus_file_nametestaccepted_hits_nd_plus.bam
[06/05/2015 09:00:31] Removing duplicate reads
[06/05/2015 09:33:21] Removed duplicates from 99688126 mapped read pairs
[06/05/2015 09:33:21] Wrote 41048494 read pairs used for splice-graph construction

[06/05/2015 09:33:21] Spawning graph construction thread
[06/05/2015 09:33:21] Generating splice-graphs from testaccepted_hits_nd_unstranded.bam using cem
[06/05/2015 09:59:12] Parsed 1056584 graph(s) from cem instance file

[06/05/2015 09:59:12] Parsed 1056584 splice graph(s) from cem instance file and collapsed them to 15469 assembly graph(s) (1040087 graph(s) excluded due to inference issues resulting from unstranded data).
[06/05/2015 09:59:12] 40223160 unique, non-redundant read pairs being used for quantification
[06/05/2015 09:59:12] 4.02232e+07 read pairs being used for FPKM normalisation

[06/05/2015 09:59:12] Sorting splice-graphs by read count
[06/05/2015 09:59:13] Finished sorting splice-graphs by read count

[06/05/2015 09:59:13] Spawning 1 thread(s) for fetching alignments and 1 i/o thread
[06/05/2015 09:59:34] Estimating fragment length distribution from 60 transcripts longer than 2500 nucleotides
[06/05/2015 09:59:34] Estimated fragment length "median"=178 and "median absolute deviation"=57 using 48850 observations
[06/05/2015 09:59:34] Using Gaussian fragment length distribution with parameters: Mean=178 and SD=84.5082

[06/05/2015 09:59:34] Starting Bayesembler on 11609 multi-path graph(s) and 3860 single-path graph(s)
[06/05/2015 09:59:34] Spawning 1 Bayesembler thread(s) and 2 i/o threads

bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:1411: double Assembler::calculateSequencingProbability(std::string&, std::string&, std::vector<BamTools::CigarOp>&): Assertion `*deletions_it == qualities.size() + 1' failed.

What is the problem? How do I fix this? Thank you very much for your help.

CIGAR String error

Hello,
I am trying to run bayesembler on a bam generated by GSNAP and have run into a CIGAR string error as below:

bayesembler -s first -b ../GsnapAlignments/Gsnap.T1990_ctx.33080.cmu059.19.Aug.2014/T1990_ctx.all.mm.rg.dup.srtd.bam

You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk

bam_nd_pe_plus_file_nameT1990_ctxallmmrgdupsrtd_nd_plus.bam
[24/06/2015 16:27:56] Removing duplicate reads
ERROR: Unhandled cigar string symbol 'S'!

I would be much obliged for any insight into this issue.
Cheers
Ashok
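
The 'S' symbol is a soft clip: bases that are present in the read sequence but not part of the alignment. One possible workaround is to remove soft clips from each record before handing the file to Bayesembler. The sketch below shows the per-record transformation only, under the assumption that soft clipping is the sole unsupported operation; the function name is hypothetical and the result has not been tested against Bayesembler. POS needs no adjustment, since in SAM it already refers to the first aligned base.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def strip_soft_clips(cigar, seq, qual):
    """Remove soft clips from a CIGAR string and trim SEQ/QUAL to match.

    Returns (new_cigar, new_seq, new_qual). Hard clips (H) are kept, as
    they do not consume bases of the stored sequence.
    """
    left = right = 0
    kept = []
    seen_aligned = False
    for length, op in ((int(n), o) for n, o in CIGAR_OP.findall(cigar)):
        if op == "S":
            # Soft clips can only occur at the ends of the alignment.
            if seen_aligned:
                right += length
            else:
                left += length
        else:
            if op != "H":
                seen_aligned = True
            kept.append(f"{length}{op}")

    def trim(text):
        # '*' means the field is not stored, so there is nothing to trim.
        return text if text == "*" else text[left:len(text) - right]

    return "".join(kept), trim(seq), trim(qual)
```

For instance, a record with CIGAR `5S20M3S` and a 28-base read becomes `20M` with the middle 20 bases. Applying this across a whole file means converting to SAM and back, for example with `samtools view -h`.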

Feature Request : Hisat2

Hi Lasse,

This is similar to the request for STAR alignment support in Bayesembler. Is there any way to give the program HISAT2 alignments instead of Tophat2 alignments? HISAT2, as you know, is the successor to Tophat.

Thanks
Abhijit

unable to run Bayesembler v.1.2.0

Hello.

I downloaded the 1.2.0 executable from here but get the following error when I run it:

./bayesembler: error while loading shared libraries: libboost_math_c99.so.1.56.0: cannot open shared object file: No such file or directory

However, I'm able to execute v1.1.1, which I obtained here.

Thanks.

bayesembler running errors

We recently used bayesembler to assemble a transcriptome and ran into two errors.
First, the samtools binary in the dependencies folder does not work, although a rebuilt samtools does.
Second, we get the following assertion failure:
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_1_1_pub_release/src/assembler.cpp:894: void Assembler::graphConstructorCallback(std::string, std::list<GraphInfo>*, std::string, boost::mutex*, int*, int*): Assertion `current_ref_strand == "."' failed.

Pre-compiled version - Samtools not working under SL6.5 / zlib 1.2.3.29

Hi,

The samtools binary included with the pre-compiled version does not work under Scientific Linux 6.5 with zlib version 1.2.3.29 (it was compiled against 1.2.3.3). Replacing it with a samtools binary previously compiled and tested on the same system works, though.
