bioinformatics-centre / bayesembler
A Bayesian method for doing transcriptome assembly from RNA-seq data
License: MIT License
Bayesembler works well on big datasets, but it could be optimised further with a simple check for whether the *nd_minus and *nd_plus BAMs have already been created.
Why? Because it is desirable to run Bayesembler multiple times on the same data to test parameters, while avoiding the heavy I/O of the initial duplicate-removal stage.
The NGS dataset I am dealing with at the moment consists of multiple samples, totalling a BAM of ~160GB.
Thanks a lot, and also for the previous update.
Colin
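A minimal sketch of the kind of check being requested, written as an external helper rather than as part of Bayesembler itself. The function name `cached_nd_bams` and the `work_dir` argument are hypothetical; the `_nd_plus.bam` and `_nd_minus.bam` suffixes are taken from the file names mentioned in this thread. It only tests that the files exist and are non-empty, so it cannot tell whether they are complete or were produced from the same input and parameters.

```python
from pathlib import Path


def cached_nd_bams(work_dir, suffixes=("_nd_plus.bam", "_nd_minus.bam")):
    """Return {suffix: path} if exactly one non-empty file exists per suffix, else None."""
    found = {}
    for suffix in suffixes:
        # Collect non-empty files in work_dir whose names end with this suffix.
        matches = sorted(
            p for p in Path(work_dir).glob(f"*{suffix}") if p.stat().st_size > 0
        )
        if len(matches) != 1:
            # Missing or ambiguous: the caller should redo duplicate removal.
            return None
        found[suffix] = matches[0]
    return found
```

A wrapper script could call this before launching a run and skip the duplicate-removal stage only when it returns a mapping, assuming the assembler were given a way to accept pre-filtered BAMs.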
Hi Lasse,
I have now successfully run Bayesembler on several transcript sets and am trying to get the CDS sequences from the output. Do you have any experience with this?
In the past I have used a Perl script from the transdecoder 2012 package for this: http://transdecoder.github.io/
Script: cufflinks_gtf_genome_to_cdna_fasta.pl
It did a good job and I'd like to use it again, but with the Bayesembler output it only extracts the individual exon sequences, without combining them into a single multi-exon CDS (without introns).
I also saw some differences between the Cufflinks and Bayesembler outputs (see below for two examples, which are not the same transcript).
Would it be possible to calculate these? I believe compatibility with downstream programs and scripts that work with Cufflinks output would be a major advantage, if easy to implement.
Thanks very much,
Colin
1 Cufflinks transcript 93387 111087 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 93387 93516 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "1"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 109510 109675 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "2"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 109759 109935 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "3"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
1 Cufflinks exon 110769 111087 1000 - . gene_id "CUFF.8"; transcript_id "CUFF.8.1"; exon_number "4"; FPKM "2.3661993829"; frac "0.157337"; conf_lo "1.895853"; conf_hi "2.833129"; cov "44.134227";
6 Bayesembler exon 20387288 20389207 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq0"; transcript_confidence "1"; FPKM "38.6952"; FPKM_sd "0.405607"; expected_count "8384.26";
6 Bayesembler exon 20468756 20470547 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq1"; transcript_confidence "1"; FPKM "12.247"; FPKM_sd "0.245679"; expected_count "2444.2";
6 Bayesembler exon 20532311 20533039 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq3"; transcript_confidence "1"; FPKM "11.7757"; FPKM_sd "0.456496"; expected_count "677.976";
6 Bayesembler exon 20540566 20541604 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq4"; transcript_confidence "1"; FPKM "15.0505"; FPKM_sd "0.3873"; expected_count "1489.79";
6 Bayesembler exon 20555972 20556532 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq5"; transcript_confidence "1"; FPKM "12.4851"; FPKM_sd "0.5949"; expected_count "438.632";
6 Bayesembler exon 20564403 20564704 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq7"; transcript_confidence "1"; FPKM "4.82021"; FPKM_sd "1.08253"; expected_count "20.3528";
6 Bayesembler exon 20571524 20573317 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq8"; transcript_confidence "1"; FPKM "6.77399"; FPKM_sd "0.183838"; expected_count "1353.73";
6 Bayesembler exon 20580763 20582535 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq9"; transcript_confidence "1"; FPKM "27.5812"; FPKM_sd "0.360588"; expected_count "5434.54";
6 Bayesembler exon 20492633 20493290 . + . gene_id "graph_118857_plus"; transcript_id "graph_118857_plus_seq10"; transcript_confidence "1"; FPKM "6.17059"; FPKM_sd "0.36231"; expected_count "296.741";
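For the exon-joining part of the question, a minimal sketch of how spliced transcript sequences could be built from exon-only GTF records like the Bayesembler lines above. It groups exons by `transcript_id`, concatenates them in genomic order and reverse-complements minus-strand transcripts. The function name `spliced_transcripts` and the in-memory `genome` dictionary are assumptions made for illustration; a real GTF is tab-separated (the tabs were lost in the lines pasted here), and the result is the spliced transcript sequence, so ORF/CDS prediction would still be a separate step.

```python
from collections import defaultdict
import re

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_transcripts(gtf_lines, genome):
    """Concatenate exon sequences per transcript_id.

    gtf_lines: iterable of tab-separated GTF lines.
    genome: dict mapping chromosome name -> sequence string.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines and non-exon features such as "transcript".
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    sequences = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda exon: exon[1])  # genomic order by start
        chrom, strand = parts[0][0], parts[0][3]
        # GTF coordinates are 1-based and inclusive.
        seq = "".join(genome[chrom][start - 1:end] for _, start, end, _ in parts)
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
        sequences[transcript_id] = seq
    return sequences
```

Because the grouping key is `transcript_id` only, it does not depend on a `transcript` line or an `exon_number` attribute being present.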
Dear all,
I know this is on the website and is being done, but I'd like to say that STAR support is the most critical of the listed planned features for me.
STAR provides huge practical advantages over Tophat in terms of speed and, according to my tests, accuracy, especially with its two-pass alignment mode.
Thanks,
Colin
Hi,
I am trying to run Bayesembler on a tophat2 (2.0.8) generated file from human K562 whole cell polya+ RNAseq (dUTP protocol):
https://www.encodeproject.org/files/ENCFF412EYU/@@download/ENCFF412EYU.bam
Here is the command I ran on a SGE cluster:
~/bin/bayesembler_v1.2.0_linux_x86_64/bayesembler -b $bam -o K562_wcell_polya+_biorep1_cshl_tophat2 -p 4 -s first 2> bayesembler.err
where $bam refers to the above bam file
and I get this in the system error file
/usr/share/univage/soldierantcluster/spool/node-hp0100rg/job_scripts/1424097: line 3: 39971 Aborted (core dumped) ~/bin/bayesembler_v1.2.0_linux_x86_64/bayesembler -b /users/rg/jlagarde/projects/encode/scaling/whole_genome/3ncod3_production_files/www.encodeproject.org/files/ENCFF412EYU/@@download/ENCFF412EYU.bam -o K562_wcell_polya+_biorep1_cshl_tophat2 -p 4 -s first 2> bayesembler.err
this in the standard error:
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:460: std::vector<std::pair<std::list<GraphInfo>, std::basic_string<char> > > Assembler::generateSpliceGraphs(double*): Assertion `!current_alignment.IsFailedQC()' failed.
and a core file of 80M
Please let me know if I did something wrong or if you need more information to solve the problem.
thanks
sarah
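The assertion text suggests the run stopped on an alignment flagged as failing quality control. A possible workaround, untested with Bayesembler, is to remove such records before assembly. The sketch below works on SAM text lines and relies only on the SAM FLAG bit 0x200 ("not passing filters"); the function name `drop_qc_failed` is hypothetical.

```python
QC_FAIL = 0x200  # SAM FLAG bit: read fails platform/vendor quality checks


def drop_qc_failed(sam_lines):
    """Yield header lines unchanged and alignment lines whose QC-fail bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through untouched
            continue
        flag = int(line.split("\t", 2)[1])  # FLAG is the second SAM column
        if not flag & QC_FAIL:
            yield line
```

The same filter can be expressed with `samtools view -F 0x200`, which excludes records that have that bit set.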
Hi,
it would be nice to have a GTF merging feature similar to Cufflinks merge. However, I was always a little disappointed by Cufflinks merge, and I would like to be able to merge transcripts from multiple samples more aggressively.
For example, I might have 20-40 input GTFs and want to create either a high-quality set (few transcripts, 30-50k) or an extended set (100-150k).
At the moment I have to do this via conversion to FASTA format and sequence-based clustering, which is time-consuming and can give unsatisfactory results.
What do you think? Good idea? Too much effort, or already provided by tools like PASA, Cufflinks or alternatives?
Thanks
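One simple way to approach this without sequence clustering is to match multi-exon transcripts across samples by their intron chain and count how many samples support each chain. This is a swapped-in illustration, not how Cufflinks merge or Bayesembler work; the function names are hypothetical, transcript ends are ignored, and single-exon transcripts are left out because they have no introns.

```python
from collections import defaultdict
import re


def intron_chains(gtf_lines):
    """Map transcript_id -> (chrom, strand, tuple of (intron_start, intron_end))."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append(
                (fields[0], fields[6], int(fields[3]), int(fields[4]))
            )
    chains = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda exon: exon[2])
        # An intron spans from one exon's end + 1 to the next exon's start - 1.
        introns = tuple((a[3] + 1, b[2] - 1) for a, b in zip(parts, parts[1:]))
        if introns:
            chains[transcript_id] = (parts[0][0], parts[0][1], introns)
    return chains


def chain_support(samples):
    """Count, for every intron chain, the number of samples that contain it."""
    support = defaultdict(int)
    for gtf_lines in samples:
        # set() ensures a chain is counted once per sample.
        for key in set(intron_chains(gtf_lines).values()):
            support[key] += 1
    return dict(support)
```

A strict set could then keep chains seen in many samples and an extended set could keep chains seen in at least one or two; the thresholds are a judgment call for the dataset at hand.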
Hi,
I'm trying to test Bayesembler, but so far without success. My data is paired-end and I ran Tophat2 with default parameters. Still, I'm getting error messages like the following:
$ bayesembler -b tophat_out/accepted_hits.bam -o Bayes_test
You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk
bam_nd_pe_plus_file_nameBayes_testaccepted_hits_nd_plus.bam
[12/06/2015 11:23:05] Removing duplicate reads
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:336: std::vector<std::pair<std::list<GraphInfo>, std::basic_string<char> > > Assembler::generateSpliceGraphs(double*): Assertion `current_alignment.IsPaired()' failed.
Aborted
What is wrong and how do I fix this?
best,
Patrik
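Judging by the assertion text, the program met an alignment that is not flagged as paired. A quick way to check whether a BAM mixes paired and unpaired records is to count the SAM FLAG bit 0x1 over its SAM text representation. This is a diagnostic sketch only; the function name `count_pairing` is hypothetical.

```python
PAIRED = 0x1  # SAM FLAG bit: template has multiple segments (paired-end read)


def count_pairing(sam_lines):
    """Return (paired, unpaired) counts of alignment records in SAM text lines."""
    paired = unpaired = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t", 2)[1])  # FLAG is the second SAM column
        if flag & PAIRED:
            paired += 1
        else:
            unpaired += 1
    return paired, unpaired
```

A non-zero unpaired count in a file that is expected to be paired-end would be worth investigating before rerunning the assembly.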
Hi,
sorry if this is not the appropriate forum, but it would be useful (or rather necessary, really) to add an 'outfile' option to Bayesembler. As it stands, it will write into the same folder from where it is run and always create a statically named file (assembly.gtf). This makes it somewhat frustrating to use it in pipelines or other complex workflows.
/Marc
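Until such an option exists, a pipeline can work around the fixed file name by running each job in its own directory and moving the result afterwards. A minimal sketch, assuming only what is stated above (the output is written to the working directory as `assembly.gtf`); the function name `run_with_outfile` and its arguments are hypothetical, and the command is passed in as a list so nothing about the Bayesembler command line is assumed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_with_outfile(command, outfile, fixed_name="assembly.gtf"):
    """Run command in a private temporary directory and move its fixed-name output."""
    outfile = Path(outfile).resolve()
    with tempfile.TemporaryDirectory() as work_dir:
        # The working directory changes, so input paths in command must be absolute.
        subprocess.run(command, cwd=work_dir, check=True)
        produced = Path(work_dir) / fixed_name
        if not produced.is_file():
            raise FileNotFoundError(f"{fixed_name} was not produced")
        outfile.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(produced), str(outfile))
    return outfile
```

Any other files the program writes to the working directory are discarded with the temporary directory, so copy them out inside the `with` block if they are needed.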
Hi.
I want to run Bayesembler on my data, but the program shows me this error.
I don't know what the problem is. Please tell me how I can solve it.
$ bayesembler -b Atcol0.bam -p 64
You are using the Bayesembler v1.1.1. For more information go to bayesembler.binf.ku.dk
[17/12/2014 10:01:09] Removing duplicate reads
[17/12/2014 10:22:08] Removed duplicates from 19745500 mapped read pairs
[17/12/2014 10:22:08] Wrote 17018947 read pairs used for splice-graph construction
[17/12/2014 10:22:08] Spawning graph construction thread
[17/12/2014 10:22:08] Generating splice-graphs from Atcol0_nd_unstranded.bam using cem
bayesembler: /opt/bayesembler/src/assembler.cpp:687: void Assembler::graphConstructorCallback(std::string, std::list<GraphInfo>*, std::string, boost::mutex*, int*, int*): Assertion `system(instance_system_string.c_str()) == 0' failed.
Aborted (core dumped)
Hi,
been trying to compile the source following the somewhat sparse instructions. The final 'make' fails cryptically with:
/sw/source/bayesembler/bayesembler/src/main.cpp:41: error: expected initializer before ‘’ token
/sw/source/bayesembler/bayesembler/src/main.cpp: In function ‘int main(int, char* const*)’:
/sw/source/bayesembler/bayesembler/src/main.cpp:60: error: ‘class boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>’ has no member named ‘required’
/sw/source/bayesembler/bayesembler/src/main.cpp:120: error: no matching function for call to ‘parse_command_line(int&, char* const*&, boost::program_options::options_description&)’
make[2]: *** [CMakeFiles/bayesembler.dir/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/bayesembler.dir/all] Error 2
CMakeList:
cmake_minimum_required(VERSION 2.6)
project(bayesembler)
set(CMAKE_CXX_COMPILER "g++")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(BOOST_LIB_DIR "/sw/boost_1_55_0/lib/")
set(BOOST_INCLUDE_DIR "sw/boost_1_55_0/include/" )
set(BAMTOOLS_LIB_DIR "/sw/bioinfo/bamtools/2.3.0/lib/")
set(BAMTOOLS_INCLUDE_DIR "/sw/bioinfo/bamtools/2.3.0/include/")
set(EIGEN_INCLUDE_DIR "/sw/system/eigen/3.0.7/include/eigen3")
include_directories(. ${BOOST_INCLUDE_DIR} ${BAMTOOLS_INCLUDE_DIR} ${EIGEN_INCLUDE_DIR})
link_directories(${BOOST_LIB_DIR} ${BAMTOOLS_LIB_DIR})
[SNIP]
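One detail that may be worth checking in the excerpt above: `BOOST_INCLUDE_DIR` is the only path without a leading slash. If that is also the case in the real file, the compiler may be picking up a different Boost installation than intended, which could explain the `program_options` errors; this is a guess, not a confirmed diagnosis. A small sketch that flags such settings in a CMake file (the function name `relative_dir_settings` is hypothetical, and the regular expression only covers the simple `set(NAME_DIR "path")` form used here):

```python
import re

# Matches lines such as: set(BOOST_LIB_DIR "/sw/boost_1_55_0/lib/")
_SET_DIR = re.compile(r'set\(\s*(\w+_DIR)\s+"([^"]*)"\s*\)')


def relative_dir_settings(cmake_text):
    """Return names of *_DIR variables whose quoted value is not an absolute POSIX path."""
    return [
        name
        for name, value in _SET_DIR.findall(cmake_text)
        if not value.startswith("/")
    ]
```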
When trying to run bayesembler 1.2.0 on an alignment generated by tophat2, I get the following error:
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:186: void Assembler::markDuplicates(BamTools::BamAlignment&, Assembler::FirstReads*, Assembler::ReadPairs*): Assertion `cur_pos_first_reads_it->second.insert(pair<ReadId, BamTools::BamAlignment*>(ri, new BamTools::BamAlignment(current_alignment))).second' failed.
I assumed it was an issue with the order of the alignments in the bam file, but it still happens after resorting the bam file with samtools, regardless of the version. I was able to run bayesembler on other datasets using the same version of tophat2 without issue, so it doesn't seem to be an issue with the installs.
Can you please suggest what values should be manually assigned for resolving the following error?
[17/10/2017 11:26:33] Sorting splice-graphs by read count
[17/10/2017 11:26:33] Finished sorting splice-graphs by read count
[17/10/2017 11:26:33] Spawning 1 thread(s) for fetching alignments and 1 i/o thread
[17/10/2017 11:26:33] Estimating fragment length distribution from 4 transcripts longer than 2500 nucleotides
ERROR: Insufficient number of observations for estimating the fragment length distribution parameters. Please specify these manually using the --frag-mean and --frag-sd options.
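The error names the two options to set, `--frag-mean` and `--frag-sd`. A minimal sketch of one way to derive candidate values, assuming fragment lengths for the library are available from some other source (the `lengths` input and the function name `fragment_length_params` are hypothetical). It uses the median as the centre and 1.4826 × MAD as a robust standard deviation. A successful run log elsewhere in this thread reports median 178, median absolute deviation 57 and SD 84.5082, which is consistent with that scaling, although nothing here confirms how Bayesembler computes it.

```python
import statistics

MAD_TO_SD = 1.4826  # factor relating MAD to SD under a normal distribution


def fragment_length_params(lengths):
    """Return (centre, sd) estimated from the median and median absolute deviation."""
    lengths = list(lengths)
    median = statistics.median(lengths)
    mad = statistics.median([abs(x - median) for x in lengths])
    return median, MAD_TO_SD * mad
```

With a median of 178 and a MAD of 57, this gives an SD of about 84.51, so values like `--frag-mean 178 --frag-sd 84.5` would be the analogous manual setting for that library.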
I downloaded the Bayesembler 1.2.0 binary for Linux and ran it on one of my tophat2-generated BAM files as follows:
bayesembler -b ~/work/ngs_dat/project1/tophat2/sample1/accepted_hits.bam -o test
But I got the following error:
You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk
bam_nd_pe_plus_file_nametestaccepted_hits_nd_plus.bam
[06/05/2015 09:00:31] Removing duplicate reads
[06/05/2015 09:33:21] Removed duplicates from 99688126 mapped read pairs
[06/05/2015 09:33:21] Wrote 41048494 read pairs used for splice-graph construction
[06/05/2015 09:33:21] Spawning graph construction thread
[06/05/2015 09:33:21] Generating splice-graphs from testaccepted_hits_nd_unstranded.bam using cem
[06/05/2015 09:59:12] Parsed 1056584 graph(s) from cem instance file
[06/05/2015 09:59:12] Parsed 1056584 splice graph(s) from cem instance file and collapsed them to 15469 assembly graph(s) (1040087 graph(s) excluded due to inference issues resulting from unstranded data).
[06/05/2015 09:59:12] 40223160 unique, non-redundant read pairs being used for quantification
[06/05/2015 09:59:12] 4.02232e+07 read pairs being used for FPKM normalisation
[06/05/2015 09:59:12] Sorting splice-graphs by read count
[06/05/2015 09:59:13] Finished sorting splice-graphs by read count
[06/05/2015 09:59:13] Spawning 1 thread(s) for fetching alignments and 1 i/o thread
[06/05/2015 09:59:34] Estimating fragment length distribution from 60 transcripts longer than 2500 nucleotides
[06/05/2015 09:59:34] Estimated fragment length "median"=178 and "median absolute deviation"=57 using 48850 observations
[06/05/2015 09:59:34] Using Gaussian fragment length distribution with parameters: Mean=178 and SD=84.5082
[06/05/2015 09:59:34] Starting Bayesembler on 11609 multi-path graph(s) and 3860 single-path graph(s)
[06/05/2015 09:59:34] Spawning 1 Bayesembler thread(s) and 2 i/o threads
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_2_0/src/assembler.cpp:1411: double Assembler::calculateSequencingProbability(std::string&, std::string&, std::vector<BamTools::CigarOp>&): Assertion `*deletions_it == qualities.size() + 1' failed.
What is the problem? How do I fix this? Thank you very much for your help.
Hello,
I am trying to run bayesembler on a bam generated by GSNAP and have run into a CIGAR string error as below:
bayesembler -s first -b ../GsnapAlignments/Gsnap.T1990_ctx.33080.cmu059.19.Aug.2014/T1990_ctx.all.mm.rg.dup.srtd.bam
You are using the Bayesembler v1.2.0. For more information go to bayesembler.binf.ku.dk
bam_nd_pe_plus_file_nameT1990_ctxallmmrgdupsrtd_nd_plus.bam
[24/06/2015 16:27:56] Removing duplicate reads
ERROR: Unhandled cigar string symbol 'S'!
I will be much obliged for any insight into this issue
Cheers
Ashok
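The error indicates that soft-clipped alignments ('S' in the CIGAR string) are not handled. A possible workaround, untested with Bayesembler, is to strip the soft clips before running it. The sketch below removes 'S' operations from a CIGAR string and trims SEQ and QUAL to match; POS needs no adjustment because in SAM it already refers to the first aligned base. The function name `strip_soft_clips` is hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def strip_soft_clips(cigar, seq, qual):
    """Drop 'S' operations from a CIGAR string and trim SEQ/QUAL accordingly."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Soft clips sit inside any hard clips, so ignore 'H' when locating them.
    core = [(length, op) for length, op in ops if op != "H"]
    lead = core[0][0] if core and core[0][1] == "S" else 0
    trail = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0

    def trim(field):
        # "*" means the field is absent in SAM and must be left alone.
        return field if field == "*" else field[lead:len(field) - trail]

    new_cigar = "".join(f"{length}{op}" for length, op in ops if op != "S")
    return new_cigar, trim(seq), trim(qual)
```

For example, `strip_soft_clips("2S5M3S", "AACCCCCGGG", "0123456789")` returns `("5M", "CCCCC", "23456")`.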
Hi Lasse,
This is similar to the request for STAR alignment support in Bayesembler. Is there any way to give the program HISAT2 alignments instead of Tophat2? HISAT2, as you know, is the successor to Tophat.
Thanks
Abhijit
Hello.
I downloaded the 1.2.0 executable from here but get the following error when I run it:
./bayesembler: error while loading shared libraries: libboost_math_c99.so.1.56.0: cannot open shared object file: No such file or directory
However, I'm able to execute v1.1.1, which I obtained here.
Thanks.
We recently used Bayesembler to assemble a transcriptome and found two errors.
First, the samtools binary in the dependencies folder does not work, but a rebuilt samtools works fine.
Second, the following assertion fails:
bayesembler: /seqdata/krogh/jola/projects/transcriptome_assembly/code/release/bayesembler_1_1_1_pub_release/src/assembler.cpp:894: void Assembler::graphConstructorCallback(std::string, std::list<GraphInfo>*, std::string, boost::mutex*, int*, int*): Assertion `current_ref_strand == "."' failed.
Hi,
the samtools binary included with the pre-compiled version does not work under Scientific Linux 6.5 with zlib version 1.2.3.29 (it was compiled against 1.2.3.3). Replacing it with a previously compiled version tested on the same system works, though.