lh3 / bwa Goto Github PK
View Code? Open in Web Editor NEWBurrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
License: GNU General Public License v3.0
Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
License: GNU General Public License v3.0
Hi, @lh3
As I may use bwa
for some calling repeatability test for some low frequency variants, so I suspect that there may need an option for initialize the randomization seed, but unfortunately, I can't find it.
Is it convenient to add this option in bwa mem
?
For the use of multithread in bwa, I'm not that sure whether it is available.
Best Regards and Thanks.
Edit: Yep, just my slow computer.
Hey guys, so I ran the bwa mem pacbio command as follows:
bwa mem -x pacbio reference.fa PacBioReads.fastq > PacBioRawtoReference.sam
The bwa indexed reference files are in the same directory as the .fasta
But I get a recursive output that has repeated for hours now:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1004 sequences (10009168 bp)...
[M::process] read 950 sequences (10028116 bp)...
[M::mem_process_seqs] Processed 1004 reads in 109.896 CPU sec, 110.125 real sec
[M::process] read 876 sequences (10032186 bp)...
[M::mem_process_seqs] Processed 890 reads in 123.662 CPU sec, 123.758 real sec
[M::process] read 916 sequences (10010820 bp)...
[M::mem_process_seqs] Processed 876 reads in 129.959 CPU sec, 130.107 real sec
[M::process] read 856 sequences (10018929 bp)...
[M::mem_process_seqs] Processed 916 reads in 132.331 CPU sec, 133.119 real sec
[M::process] read 846 sequences (10008623 bp)...
[M::mem_process_seqs] Processed 856 reads in 118.908 CPU sec, 118.961 real sec
[M::process] read 761 sequences (9136775 bp)...
[M::mem_process_seqs] Processed 846 reads in 126.991 CPU sec, 127.065 real sec
I'm wondering if this means there is a problem with input files, an installation error, or bwa is simply still running and will finish sometime in the near indiscriminate future due to the lower processing power of the computer?
Thanks
I am running follwoing commad to convert .sai output of bwa aln to sam file
bwa sampe ref.fa read1.sai read2.sai read1.fastq read2.fastq >read.sam
and I am getting following error.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[fread] Unexpected end of file
sam file has no sequences
When I use samse converted read1.sai and read2.sai to separte sam files it works fine.
I also to 50 sequences from read1.fastq and read2.fastq did alignment and converted to sam file using sampe it also worked.
I am not sure why I am getting this error.
Really appreciate any help.
Thanks
Hi,
I've uploaded an archive containing a data set that leads to a segfault when doing
bwa bwasw -t 20 00-optrApE349.fas 2708-UW15005_S1_L001_R1_001.fastq.gz 2708-UW15005_S1_L001_R2_001.fastq.gz -f test.sam
The compressed tar archive contains the data set, a script named run_bwa_Segmentation_fault you can simply run to reproduce the issue and the output of the run in the file bwa.out.
I have tested the problem with four versions of bwa (0.7.10, 0.7.14 and 0.7.15) all with the same result and also on two different machines (all running Debian stable, the bwa installation was also from according Debian packages).
Are you able to reproduce the issue and even more importantly fixing it? As a member of the Debian Med packaging team I'd happily fix the issue in the Debian packaging as soon as there is a fix available.
Kind regards and thanks for providing bwa as Free Software, Andreas.
If you have a 64 bit system, binary and index file set but no matching reference.fasta.64.alt file (say only a reference.fasta.alt that comes from run-gen-ref) then bwa will not run as alt-aware.
While you can just copy/softlink the reference.fasta.alt to reference.fasta.64.alt to fix this, it might be better to remove the .64 from the .alt file's path since the .alt file is not created by the index command and thus should not be pathed with other files that are.
The README-alt.md document describes the strategy for handling ALTs as requiring the post-processing step with bwa-postalt.js, but I was told that BWA now handles that internally so it's no longer necessary to run the post-processing script separately. Is the documentation out of date or does this internal handling feature only apply to a development version and not to the latest release?
We run into a deadlock when running BWA (0.7.8, 0.7.10, 0.7.13, 0.7.15) on new machines on our HPC.
BWA has been working fine on the old nodes (Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz) but runs into a deadlock when running on the new machines (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz).
This is reproducible and does not depend on number of threads specified.
We run BWA which following paramters:
bwa mem -t 8 -M -R \@RG\\tID:dummy\\tSM:dummy\\tPL:Illumina\\tLB:dummy ref.fa input.fastq > output.sam
The last output lines:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 100000 sequences (10000000 bp)...
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.408 CPU sec, 20.355 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 19.992 CPU sec, 19.879 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.880 CPU sec, 20.755 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.092 CPU sec, 19.984 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.088 CPU sec, 19.967 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.540 CPU sec, 20.433 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 19.676 CPU sec, 19.546 real sec
[M::process] read 100000 sequences (10000000 bp)...
On the old machines this runs without any problems.
On the new machines after around 10 minutes CPU time (sometimes more and sometimes less) the bwa process hangs (0 % CPU usage).
When we attach with strace -p PID
the process continues again normally. This is reproducible and also happens with -t 1
setting.
We attached gdb
to the hung process and this is the stacktrace of all threads:
Thread 4 (Thread 0x2aaab876a700 (LWP 34136)):
#0 0x00002aaaab6fb5ab in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00002aaaab687b78 in _L_lock_3817 () from /lib64/libc.so.6
#2 0x00002aaaab682b59 in _int_free () from /lib64/libc.so.6
#3 0x000000000041b254 in process (shared=<optimized out>, step=<optimized out>, _data=0x2aaabc0008c0) at fastmap.c:91
#4 0x000000000042094d in ktp_worker (data=0x7fffffff87d0) at kthread.c:106
#5 0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#6 0x00002aaaab6ef02d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x2aaab896b700 (LWP 34137)):
#0 0x00002aaaab1ee4c2 in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000420b7e in kt_for (n_threads=<optimized out>, func=func@entry=0x43c000 <worker1>, data=data@entry=0x2aaab896ad10, n=<optimized out>) at kthread.c:59
#2 0x000000000043d24c in mem_process_seqs (opt=0x657120, bwt=<optimized out>, bns=0x657290, pac=<optimized out>, n_processed=<optimized out>, n=800000, seqs=0x2aaad8000010, pes0=0x0) at bwamem.c:1189
#3 0x000000000041b325 in process (shared=0x7fffffff8950, step=<optimized out>, _data=0x2aaacc0008c0) at fastmap.c:84
#4 0x000000000042094d in ktp_worker (data=0x7fffffff87f0) at kthread.c:106
#5 0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#6 0x00002aaaab6ef02d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2aaad1852700 (LWP 34161)):
#0 0x00002aaaab6fb5ab in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00002aaaab687f2c in _L_lock_11235 () from /lib64/libc.so.6
#2 0x00002aaaab6859e3 in malloc () from /lib64/libc.so.6
#3 0x0000000000440f29 in wrap_malloc (size=size@entry=8, file=file@entry=0x44dff2 "bwamem.c", line=line@entry=662, func=func@entry=0x44e867 <__func__.7724> "mem_chain2aln") at malloc_wrap.c:25
#4 0x0000000000437aec in mem_chain2aln (opt=opt@entry=0x657120, bns=bns@entry=0x657290, pac=pac@entry=0x2aaab64e2010 "W\001\\\005p\025\300]\340\327\303W\003W\003\177\003\\MN\r\\\f\\=\\\005`\026\277w\257\200\064\373\263\060\343\017\363o\374\354>|\373\356\354\217\377",
l_query=l_query@entry=100, query=query@entry=0x2aaacc1c47a0 "\001\001\003\003\002\003\003", c=0x2aaadcea1288, av=av@entry=0x2aaad1851e90) at bwamem.c:662
#5 0x000000000043be33 in mem_align1_core (opt=0x657120, bwt=<optimized out>, bns=0x657290, pac=0x2aaab64e2010 "W\001\\\005p\025\300]\340\327\303W\003W\003\177\003\\MN\r\\\f\\=\\\005`\026\277w\257\200\064\373\263\060\343\017\363o\374\354>|\373\356\354\217\377", l_seq=100,
seq=0x2aaacc1c47a0 "\001\001\003\003\002\003\003", buf=0x2aab061d56c0) at bwamem.c:1037
#6 0x000000000043c08d in worker1 (data=0x2aaab896ad10, i=<optimized out>, tid=5) at bwamem.c:1151
#7 0x00000000004209f3 in ktf_worker (data=0x2aaab896ac20) at kthread.c:41
#8 0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#9 0x00002aaaab6ef02d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aaaaaae8d80 (LWP 34135)):
#0 0x00002aaaab1ee4c2 in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000420ccf in kt_pipeline (n_threads=n_threads@entry=2, func=func@entry=0x41b0a0 <process>, shared_data=shared_data@entry=0x7fffffff8950, n_steps=n_steps@entry=3) at kthread.c:142
#2 0x000000000041bbd2 in main_mem (argc=argc@entry=8, argv=argv@entry=0x7fffffff8c60) at fastmap.c:353
#3 0x0000000000402ff0 in main (argc=<optimized out>, argv=0x7fffffff8c58) at main.c:85
The current @PG
line in the bam header from files that were converted from .sai to .bam using bwa samse
contain only the arguments used in the call to bwa samse
@PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa samse -n 100 mygenome.fa output.sai input.fastq.gz
But this is not very useful; I want to see how the alignment was done, not how the format conversion went. Or did I overlook something? Of course I can wrap things with awk and whatnot, but that requires rewriting GBytes of on-disk data ... (this is from running bwa aln
followed by bwa samse
, but I guess it's the same for bwa mem
and/or bwa sampe
)
Hi there!
I need to extract the nucleotide sequence from the coordinates provided by the 'bwa aln' procedure. Can I use bwa index for that? More generally, how can I use the index to fetch the nucleotide sequence at specified coordinates?
Thank you very much!
Shiran
Hi,
I downloaded reference genomes from Ensembl (fasta format).
But there are lots of sequences with name "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes
Such as Mouse_GRCm38 (mm10), except chromosome 1-19, Mt, X and Y; others should be removed before mapping ? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt
Such as Human_GRCh38.p5 (hg38), https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt, there are 516 sequences. In addition to chromosome 1-22, Mt, X and Y; others (such as CHR_HG2241_PATCH and KI270728.1) should be removed before mapping ?
I posted this to the bwa-help list recently.
I have a pair of reads that map perfectly (NM=0) to chr20 and also to chr20_KI270869v1_alt (hs38DH). The latest version of bwa mem produces the following output which seems right:
query 65 chr20 61894983 60 75M = 61894987 5 CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0 MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8678,75M,0;
query 129 chr20 61894987 60 75M = 61894983 -5 CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0 MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8682,75M,0;
However, when I run them through the post-alt script the result is four records:
query 65 chr20 61894983 60 75M = 61894987 5 CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0 MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8678,75M,0; om:i:60
query 2113 chr20_KI270869v1_alt 8678 60 75M chr20 61894987 5 CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0 lt:Z:chr20,61894981,61895056,+;
query 129 chr20 61894987 60 75M = 61894983 -5 CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0 MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8682,75M,0; om:i:60
query 2177 chr20_KI270869v1_alt 8682 60 75M chr20 61894983 -5 CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0 lt:Z:chr20,61894985,61895060,+;
The extra records are marked as supplementary. My understanding, mirrored by the language in the SAM spec, is that supplementary records are for recording pieces alignments of that are not co-linear with the reference and thus can't be stored in a single record, while secondary/not-primary records are for storing alternative alignments of the same read.
Having the reads marked as supplementary throws off tools that wish to ignore secondary alignments, but consume supplemental alignments. E.g. Picard's CollectHsMetrics will report a significantly deflated % selected, % on target etc. when capturing genes that have equally good mappings to the primary assembly and one or more alts, because the records are essentially duplicated.
bwa pac2bwtgen
does not work with very short references (len < 8)
Does not work:
On-y-go-2:20170407172100 karel$ printf '>\nAAAAAAA' > index.fa && bwa fa2pac index.fa index.fa && bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Version: 0.7.15-r1140
[main] CMD: bwa fa2pac index.fa index.fa
[main] Real time: 0.002 sec; CPU: 0.003 sec
[BWTIncCreate] textLength=14, availableWord=65536
BWTIncSetBuildSizeAndTextAddr(): Not enough space allocated to continue construction!
Works:
On-y-go-2:20170407172100 karel$ printf '>\nAAAAAAAA' > index.fa && bwa fa2pac index.fa index.fa && bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Version: 0.7.15-r1140
[main] CMD: bwa fa2pac index.fa index.fa
[main] Real time: 0.004 sec; CPU: 0.004 sec
[BWTIncCreate] textLength=16, availableWord=65536
[bwt_gen] Finished constructing BWT in 1 iterations.
[main] Version: 0.7.15-r1140
[main] CMD: bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Real time: 0.001 sec; CPU: 0.003 sec
Dear Author ,
About my question, I have searched so long time on the internet ,but find nothing useful.
somebody says :
XT ,XA , samtools view -q 1, samtools view -q 30, samtools view -q 40 and so on.
SO which should I to confirm and use ??
I just wanna keep uniquely mapped reads for NIPT.
thank you !
Would it be possible to handle ambiguous reference bases properly? That would increase the alignment quality for viral use cases.
Thank you.
Dear BWA developers,
First of all, I wish you an happy new year.
I am currently experiencing a problem with bwa-mem. In my lab, I have two nodes, one "old" node and a new node that we received four months ago, which has a "different" generation of processors (namely more recent processors). When I launch BWA on my "old" node, on one or more thread(s), I have no problem. When I launch BWA on 1 thread on my "new" node, I have no problem too. But when I launch BWA on multi-threads (e.g 5), the job runs without generating any results and never stops. I am obliged to stop him myself. For information, I run exactly the same command as on my old node. By setting the option -v to 2, I have no error/warning different between the jobs tested. On the new node, with multithreading, the job just keeps "turning in a vacuum".
Earlier I had already encountered this problem with the gsnap tool. The developer had advised me to use their latest version that solved this error.
But here, I use the last version of bwa :
bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.15-r1142-dirty
Contact: Heng Li [email protected]
Usage: bwa [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome withbwa index'. There are three alignment algorithms in BWA:
mem',bwasw', and
aln/samse/sampe'. If you are not sure which to use, trybwa mem' first. Please
man ./bwa.1' for the manual.
So, have you ever heard of this problem of multithreading ? And do you have a solution?
Thank you very much.
Have a nice day,
Amandine
Hi,
I am using bwa mem to align pacbio reads to a reference. I am wondering if there is a way to obtain an extended CIGAR string (sam v.1.4) in the sam output, as default in bwa aln. I'd like to retrieve details on match/mismatch (X,=) instead of a generic alignment match (M).
Thanks in advance,
Amina
I am trying to do:
"bwa index E617.polished_assembly.fasta"
But error message shows that:
"[bwa_index] Pack FASTA... [bns_fasta2bntseq] fail to open file 'E617.polished_assembly.fasta.pac'. Abort!
Aborted (core dumped)"
I want to know how to solve this problem.
Any help would be appreciated.
Thanks!
Currently the kopen
function defined in kopen.c
only handles http
inputs but not https
.
It might be useful for the user to be able to load sequence files from https
as well.
Trivial change on line 270 to make this happen:
if (strstr(fn, "http://") == fn)
to
if (strstr(fn, "http://") == fn || strstr(fn, "https://") == fn)
I am using BWA/0.7.15-foss-2016a
We did an strace and it seems to not do anything although it uses 800 % of the CPU.
I am using the command
bwa mem -t 8 -U 15 -M -R '@rg\tID:'$acc'\tSM:'$acc'\tPL:Illumina\tLB:'$acc $ref $acc.1.fastq.gz $acc.2.fastq.gz > $out/$acc.sam
where the variables $ref is the reference file
$acc is the sample being mapped
$out is the output directory
The log file only shows
14312
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 800000 sequences (80000000 bp)...
[M::process] read 800000 sequences (80000000 bp)...
I am using 1 node with 16 ncpus and 34GB of memory
and it has been running for more than 1 day now
Do you plan supporting mucl-libc? Compilation of BWA on Alpine linux, failed, first at kthread.c
.
I solved it by adding #include <stdint.h>
to kthread.c
.
But later I got an error I cannot work around:
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS utils.c -o utils.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS kthread.c -o kthread.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS kstring.c -o kstring.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS ksw.c -o ksw.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwt.c -o bwt.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bntseq.c -o bntseq.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwa.c -o bwa.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwamem.c -o bwamem.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwamem_pair.c -o bwamem_pair.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwamem_extra.c -o bwamem_extra.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS malloc_wrap.c -o malloc_wrap.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.c -o QSufSort.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwt_gen.c -o bwt_gen.o
bwt_gen.c: In function 'BWTIncBuildRelativeRank':
bwt_gen.c:878:10: warning: variable 'oldInverseSa0RelativeRank' set but not used [-Wunused-but-set-variable]
bgint_t oldInverseSa0RelativeRank = 0;
^
bwt_gen.c: In function 'BWTIncMergeBwt':
bwt_gen.c:952:15: warning: variable 'bitsInWordMinusBitPerChar' set but not used [-Wunused-but-set-variable]
unsigned int bitsInWordMinusBitPerChar;
^
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS rope.c -o rope.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS rle.c -o rle.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS is.c -o is.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwtindex.c -o bwtindex.o
ar -csru libbwa.a utils.o kthread.o kstring.o ksw.o bwt.o bntseq.o bwa.o bwamem.o bwamem_pair.o bwamem_extra.o malloc_wrap.o QSufSort.o bwt_gen.o rope.o rle.o is.o bwtindex.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwashm.c -o bwashm.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwase.c -o bwase.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwaseqio.c -o bwaseqio.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS bwtgap.c -o bwtgap.o
In file included from bwtgap.c:4:0:
bwtgap.h:9:2: error: unknown type name 'u_int32_t'
u_int32_t info; // score<<21 | i
^
bwtgap.h:10:2: error: unknown type name 'u_int32_t'
u_int32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
^
bwtgap.h:11:2: error: unknown type name 'u_int32_t'
u_int32_t n_ins:16, n_del:16;
^
bwtgap.c: In function 'gap_push':
bwtgap.c:62:13: error: 'u_int32_t' undeclared (first use in this function)
p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
^
bwtgap.c:62:13: note: each undeclared identifier is reported only once for each function it appears in
bwtgap.c:62:23: error: expected ';' before 'score'
p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
^
Makefile:25: recipe for target 'bwtgap.o' failed
@lh3 Heng,
I am writing in the hope that you will consider adding a -g
option to force global (really glocal, or end-to-end) alignment in BWA MEM.
I understand that BWA MEM currently will attempt to return a global alignment if the score is "close" to the best local alignment (based on the -L
clipping penalties), otherwise prefers local alignment and puts soft clips in the SAM file.
There are cases where I want to force global alignment. For example, correcting an assembly with the same reads that generated the assembly, where naively passing BWA MEM output will sometimes result in false SNPs.
I had thought that just setting -L 99999,99999
might produce my desired behaviour, but it doesn't - probably due to the if (gscore <= 0
conditional in these lines:
https://github.com/lh3/bwa/blob/master/bwamem.c#L736
https://github.com/lh3/bwa/blob/master/bwamem.c#L764
I tried forcing that with a patch but it didn't seem to make a difference. My attempt was probably naive so I am hoping you can help / advise here!
Torsten
CC: @drpowell
I'm trying to install GRCh38 and it fails each time with an unexpected end of file (run-gen-ref for hs38a and hs38DH fail exactly the same way). It seems to be unable to locate all of the chromosomes, unplaced and unlocalized contigs and EBV files.
Any help would be greatly appreciated.
Thank you very much.
-bash-4.1$ bwa.kit/run-gen-ref hs38
--2016-05-27 15:52:54-- ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
=> “-”
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.10, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.10|:21… connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD … done.
==> TYPE I ... done. ==> CWD (1) /genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines …
No such directory “genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines”.
gzip: stdin: unexpected end of file
Dear Professor,
I wanna use a specified gene sequence to build a reference , and then mapp PE fastq files (fq1,fq2) to this new reference gene genome via BWA .
that's just a thought ,I do not konw it's availble ?
if availble, which algorthims should I choose from BWA ??
BEST
bwa bwasw crashed with the error '[vfprintf(stdout)] Value too large for defined data type'.
Version: 0.7.15-r1142
I used 'bwa bwasw ' to compare two assembly reference.
The contig size maybe too large for bwasw, but it should not crash(coredump).
+ bwa bwasw -t 144 /ssd/biowrk/TAIR/canu.eval/ref.fa /ssd/biowrk/TAIR/canu.min4000/tair.contigs.fasta
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 85 sequences/pairs (121314957 bp) ...
[vfprintf(stdout)] Value too large for defined data type
Dear Professor,
I do not know how to calculate MAPQ (MAPping Quality),can you give me some details ?
thank you !
I'm aligning long assembled contigs (from Supernova) to hg19. Occasionally bwa mem will generate long runs of bad alignments with many long indels, for sequence that actually matches well.
From tracing through some examples, it appears that alignments merged by mem_patch_reg will sometime not be subsequently processed by bwa_gen_cigar2 with a large enough bandwidth to generate a correct alignment. I can work around the issue by raising PATCH_MIN_SC_RATIO of 0.95 (https://github.com/lh3/bwa/blob/master/bwamem.c#L404), but I imagine there's a better solution.
I attached an example sequence which triggers the problem when aligned to hg19, on the latest bwa mem code. You can see the issue at chr2:122,075,076-122,087,607.
Does not show the problem:
bwa mem -v 5 -x intractg -w 500 -d 200 /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa 4535_slice.fasta
Shows the problem:
bwa mem -v 5 -x intractg -w 600 -d 200 /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa 4535_slice.fasta
IGV tracks of the alignments resulting from the above command are show here:
Thanks!
Pat Marks
run-gen-ref cannot be used to download hs38/hs38DH as the dowload link is broken:
$ ./run-gen-ref hs38DH
--2017-03-16 10:58:37-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
=> â-â
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids ...
No such directory âgenomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_idsâ.
gzip: stdin: unexpected end of file
Please run 'bwa index hs38DH.fa'...
Hi,
I have a pair of read. The first reads can be mapped. The second one can NOT be mapped.
Here is the mapping results:
@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa mem ../../results/simu_TEP_ara/tair10.fa test_1read_R1.fq.gz test_1read_R2.fq.gz
chr1_8301745_8302017_2:0:0_1:0:0_0 121 chr1 8296984 60 100M = 8296984 0 TTTAGGCCCAATAAGGATAATTTGAAACCCACCCCAAAATTTAAACAGAGATGGTTGGATTATGGGCACAAGGTTCCAATTGCTAGATAGCTAAAATTTT 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:1 MD:Z:95C4 AS:i:95 XS:i:0
chr1_8301745_8302017_2:0:0_1:0:0_0 181 chr1 8296984 0 * = 8296984 0 GGATTTACTATTTGATCGATACTGCTTTGAACCCCAAATCAAACTGTGAGTTGAACAAATGGATTTACTATTTGATCGATACTGCTTTGAACCCCAAATC 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 AS:i:0 XS:i:0
In the second line, the summary of the FLAG is:
Summary:
read paired
mate unmapped
read reverse strand
mate reverse strand
first in pair
My question is why the seventh line is "=" rather than "". Does this mean that if I want to check whether the mate is mapped or not, I can NOT just select the seventh with "" and need to extract based on FLAG?
Best,
Shaojun Xie
If the first letter of the forth line in every reads, indicating the quality, is '@' symbol, will make BWA process terminated. For instance:
@sequences_name
ATTGCATGCATGC
+
@@dddad,AAFDF
To solve this problem, I have to replace the initio '@' in the quality line to 'A', hoping the replacement won't affect the coming analysis.
Also, if two reads are paired, and their names are almost identical but end with "-1" and "-2", which can distinguish them, BWA will abort too. I think you can identify two paired reads' names without considering the last number if it is "1" or "2".
if we look at this segment of code in bwa_aln2seq_core function:
if (set_main) {
best = aln[0].score;
for (i = cnt = 0; i < n_aln; ++i) {
const bwt_aln1_t *p = aln + i;
if (p->score > best) break;
if (drand48() * (p->l - p->k + 1 + cnt) > (double)cnt) {
s->n_mm = p->n_mm; s->n_gapo = p->n_gapo; s->n_gape = p->n_gape;
s->ref_shift = (int)p->n_del - (int)p->n_ins;
s->score = p->score;
s->sa = p->k + (bwtint_t)((p->l - p->k + 1) * drand48());
}
cnt += p->l - p->k + 1;
}
s->c1 = cnt;
for (; i < n_aln; ++i) cnt += aln[i].l - aln[i].k + 1;
s->c2 = cnt - s->c1;
s->type = s->c1 > 1? BWA_TYPE_REPEAT : BWA_TYPE_UNIQUE;
}
we can see that the s->type, the final XT tag, will be determined by the cnt variable.
What's more, cnt variable is determined by the break in for loop 'if (p->score > best) break;'
This may result in a conflict in final sam records, e.g. when XT report unique hit, but XA may result in several alternative hits.
Could anyone please enlighten me why do we need this break, what does this part actually doing?
Thanks!
Running as:
bwa mem -pC $base long.fq.gz
gives nonsense for the read name (and alignemtns)
If I do:
bwa mem -pC $base <(zcat long.fq.gz | awk '{ print substr($0, 0, 300) }')
or any value smaller, then it aligns and outputs as expected.
HI,
I have been struggling to improve my alignment but I am not sure which options will do the trick. I have paired illumina reads of mean 200 bp in length and mean quality is 35. Still i manage to get alignment not more than 75 percent.
bwa mem -M ref.fa read1.fq read2.fq > aln-pe.sam
Many thanks,
zusman
Hi,
I'm using bwakit for grch38 aligned bam. I'm developing tools to support grch38 and it requires correct pair information for reads (properly paired reads). However, I've noticed most of ALT reads are not properly paired. Mate pair of ALT reads exist in nonALT instead of ALT, although I clearly see many examples where both ends of the pair were present at ALT.
This issue was addressed previously (https://sourceforge.net/p/bio-bwa/mailman/message/32914937/). Is there any update yet?
Thanks,
MJ
Dear Professor,
I build a new reference with a gene sequence.
Step 1 :
Using PE fastqs to mapp complete genome (hg19)
Step 2 :
Using PE fastqs to mapp part genome sequence ,eg: gene region
what 's the difference between Step 1 and Step 2 ? I wanna use bwa mem way.
Best
Hi,
I failed to run
"bwa samse -n 1 ./p_ctg_formated.fa ./read.sai - | grep '(^@|XT:A:U)' | samtools view -S -h -b -F 0x4 - | samtools sort -T ./ -@ 20 -no - ./temporarySam > ./san_fct3_5k.map".
The error message is " [E::hts_open_format] fail to open file './temporarySam'
[bam_sort_core] fail to open './temporarySam': No such file or directory
[bam_sort] Note the <out.prefix> argument has been replaced by -T/-o options
[gzread] Bad file descriptor"
Anyone can help?
Thanks,
Quan
For some reads, split alignments are written in which one alignment is fully contained in the other alignment using the SA
tags (bwa 0.7.15
).
HISEQ1:93:H2YHMBCXX:1:2214:10015:52441 2115 chrUn_JTFH01001478v1_decoy 12 1 250M chr21 10325621 0 <snip> NM:i:13 SA:Z:chr21,10325808,-,240M10S,9,19;
How are such split alignments to be interpreted? Can the documentation be updated to reflect the meaning of split alignments for which all bases can be aligned to one of the locations.
I refer to bwa.kit version downloaded on 17Feb2017 using this link:
https://downloads.sourceforge.net/project/bio-bwa/bwakit/bwakit-0.7.15_x64-linux.tar.bz2
It seems that this version of bwa.kit contains an obsolete URL for GRCh38 download in run-gen-ref file.
It seems that the current URL for GRCh38 should be:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
See URL change notice here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README_change_notice.txt
Could you check and correct, if necessary?
Thank you,
Alexey
bwa fails to compile on IBM POWER8 platforms due to dependency on SSE2 intrinsics.
Make output:
[u0017592@sys-81783 bwa]$ make
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS ksw.c -o ksw.o
ksw.c:29:23: fatal error: emmintrin.h: No such file or directory
#include <emmintrin.h>
^
compilation terminated.
make: *** [ksw.o] Error 1
Make after removing ##include <emmintrin.h>
:
[u0017592@sys-81783 bwa]$ make
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS ksw.c -o ksw.o
ksw.c:48:2: error: unknown type name ‘__m128i’
__m128i *qp, *H0, *H1, *E, *Hmax;
^
ksw.c: In function ‘ksw_qinit’:
ksw.c:71:11: error: ‘__m128i’ undeclared (first use in this function)
q->qp = (__m128i*)(((size_t)q + sizeof(kswq_t) + 15) >> 4 << 4); // align memory
^
ksw.c:71:11: note: each undeclared identifier is reported only once for each function it appears in
ksw.c:71:19: error: expected expression before ‘)’ token
q->qp = (__m128i*)(((size_t)q + sizeof(kswq_t) + 15) >> 4 << 4); // align memory
^
ksw.c: In function ‘ksw_u8’:
ksw.c:114:2: error: unknown type name ‘__m128i’
__m128i zero, oe_del, e_del, oe_ins, e_ins, shift, *H0, *H1, *E, *Hmax;
^
ksw.c:130:2: warning: implicit declaration of function ‘_mm_set1_epi32’ [-Wimplicit-function-declaration]
zero = _mm_set1_epi32(0);
^
ksw.c:131:2: warning: implicit declaration of function ‘_mm_set1_epi8’ [-Wimplicit-function-declaration]
oe_del = _mm_set1_epi8(_o_del + _e_del);
^
ksw.c:139:3: warning: implicit declaration of function ‘_mm_store_si128’ [-Wimplicit-function-declaration]
_mm_store_si128(E + i, zero);
^
ksw.c:146:3: error: unknown type name ‘__m128i’
__m128i e, h, t, f = zero, max = zero, *S = q->qp + target[i] * slen; // s is the 1st score vector
^
ksw.c:147:3: warning: implicit declaration of function ‘_mm_load_si128’ [-Wimplicit-function-declaration]
h = _mm_load_si128(H0 + slen - 1); // h={2,5,8,11,14,17,-1,-1} in the above example
^
ksw.c:148:3: warning: implicit declaration of function ‘_mm_slli_si128’ [-Wimplicit-function-declaration]
h = _mm_slli_si128(h, 1); // h=H(i-1,-1); << instead of >> because x64 is little-endian
^
ksw.c:156:4: warning: implicit declaration of function ‘_mm_adds_epu8’ [-Wimplicit-function-declaration]
h = _mm_adds_epu8(h, _mm_load_si128(S + j));
^
ksw.c:157:4: warning: implicit declaration of function ‘_mm_subs_epu8’ [-Wimplicit-function-declaration]
h = _mm_subs_epu8(h, shift); // h=H'(i-1,j-1)+S(i,j)
^
ksw.c:159:4: warning: implicit declaration of function ‘_mm_max_epu8’ [-Wimplicit-function-declaration]
h = _mm_max_epu8(h, e);
^
ksw.c:184:5: warning: implicit declaration of function ‘_mm_movemask_epi8’ [-Wimplicit-function-declaration]
cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_subs_epu8(f, h), zero));
^
ksw.c:184:5: warning: implicit declaration of function ‘_mm_cmpeq_epi8’ [-Wimplicit-function-declaration]
ksw.c:190:3: warning: implicit declaration of function ‘_mm_srli_si128’ [-Wimplicit-function-declaration]
__max_16(imax, max); // imax is the maximum number in max
^
ksw.c:190:3: warning: implicit declaration of function ‘_mm_extract_epi16’ [-Wimplicit-function-declaration]
ksw.c: In function ‘ksw_i16’:
ksw.c:235:2: error: unknown type name ‘__m128i’
__m128i zero, oe_del, e_del, oe_ins, e_ins, *H0, *H1, *E, *Hmax;
^
ksw.c:251:2: warning: implicit declaration of function ‘_mm_set1_epi16’ [-Wimplicit-function-declaration]
oe_del = _mm_set1_epi16(_o_del + _e_del);
^
ksw.c:265:3: error: unknown type name ‘__m128i’
__m128i e, t, h, f = zero, max = zero, *S = q->qp + target[i] * slen; // s is the 1st score vector
^
ksw.c:269:4: warning: implicit declaration of function ‘_mm_adds_epi16’ [-Wimplicit-function-declaration]
h = _mm_adds_epi16(h, *S++);
^
ksw.c:271:4: warning: implicit declaration of function ‘_mm_max_epi16’ [-Wimplicit-function-declaration]
h = _mm_max_epi16(h, e);
^
ksw.c:275:4: warning: implicit declaration of function ‘_mm_subs_epu16’ [-Wimplicit-function-declaration]
e = _mm_subs_epu16(e, e_del);
^
ksw.c:292:5: warning: implicit declaration of function ‘_mm_cmpgt_epi16’ [-Wimplicit-function-declaration]
if(UNLIKELY(!_mm_movemask_epi8(_mm_cmpgt_epi16(f, h)))) goto end_loop8;
^
make: *** [ksw.o] Error 1
To support the POWER8 platform, this code will have to be changed to optionally use VSX/AltiVec.
More information: https://bitbucket.org/umarcts/ppc64le-wiki/wiki/vsx.md
Hi all,
Current databases are becoming increasingly large. Recently I've found myself indexing a large FASTA file and taking over 200CPU hours (single thread).
Searching for multithreaded support for bwa index
I've landed on a 5 year old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to this patch though.
Regardless, is there any ongoing or planned work to make bwa index
parallelizable in some form?
Hi,
samtools calmd
often outputs a warning like this:
[bam_fillmd1] different NM for read 'HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038': 0 -> 1
Per discussion at https://sourceforge.net/p/samtools/mailman/samtools-help/thread/8aaa2f1f-c31e-47c0-5e89-6fe10990c847%40gmail.com/#msg35256271 it turns out bwa could do better and calculate NM: tag not using the randomly modified sequence with N's replaced to any nucleotide, but increasing the NM: value by number of N's originally present in the aligned region. Is that possible? In turn, "samtools calmd"
would issue less warnings, and one could focus on real inconsistencies. Thank you.
HI,
like this read:
QAt17575867:1:0 99 1 10115 0 43M1I26M = 10189 126 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAAACCCAAACCCAAACCCAAACCTAAN ?<@?AF?@dcbf>?CBBF?@DCCG?@DCCG?@DCCCA?<8@@/(72AC**7*7@*173=@=<=<>+0)5# MC:Z:18S52M BD:Z:LLJNLPQMJNJMOKIMIMOKHLIMOKIMIMOKIMIIMOKIMIMOKCJNJOKCJOKPLELPMRNGNRNPLL MD:Z:48T5T5T7C0 RG:Z:VRa BI:Z:OONONTRMNOLRQLMNLRQLMNLRRMMOMSRMNOMMSRMNOMSRMGOPNSPHOPOTQIPQOTQIQRSRMO NM:i:5 MQ:i:0 AS:i:46 XS:i:46
Seems a quoted empty string upsets ARGV parser.
$ bwa mem -t 16 '' -M ref.fasta.gz pairs.fastq | head
[E::bwa_idx_load_from_disk] fail to locate the index files
$
$ bwa mem -t 16 -M ref.fasta.gz pairs.fastq | head
[M::bwa_idx_load_from_disk] read 0 ALT contigs
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
$
Now it is hard coded as 0.8. It would be useful if users can specify it manually.
In the README-alt.md file
https://github.com/lh3/bwa/blob/master/README-alt.md
there is a sentence saying:
“In theory, non-alt alignments from step 1 should be identical to alignments against the reference genome WITH alt contigs.”
It might be that there is a meaningful typo in this sentence. Should it be
“In theory, non-alt alignments from step 1 should be identical to alignments against the reference genome WITHOUT alt contigs.” ?
Please, check and correct if it is a typo.
Dear professor ,
I am a new guy for bioinformatics algorithm , My major is biology . I wann learn bioinformatics, especially about algorithm , So can you give me some suggestions ?
Thank you very much!
I have noticed that the CIGAR string column from the SAM output file sometimes does not fully contain the alignment difference between the query sequence and the reference sequence. For example with a sample alignment entry with a CIGAR string of 90M
:
SRR2650329.12.indiv_AAA252 83 R36106 686 60 90M = 30947000 -146 GCTTCTTAACAGCAACTCGCTTAGCCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAGTTAAACTGATGACCAAAGTCAGCCTGCA >DDDDEDDDCDDDDDDDDEEEEFFFFHHHGJIFJIJJIJIHJJJJJJJJJJJJJJJJJJJJJJIJJIIIJIGIFJJJJJJIJJJIJJHHH NM:i:2 MD:Z:23A37A28 AS:i:80 XS:i:31
Based on the CIGAR string, I expect to see a perfect 90bp perfect match of the query sequence to the reference sequence. Instead, when i compare those two sequences, I have found two omitted substitution:
OBS: GCTTCTTAACAGCAACTCGCTTAGCCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAGTTAAACTGATGACCAAAGTCAGCCTGCA
REF: GCTTCTTAACAGCAACTCGCTTAACCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAATTAAACTGATGACCAAAGTCAGCCTGCA
NOT: _______________________*_____________________________________*____________________________
QUA: >DDDDEDDDCDDDDDDDDEEEEFFFFHHHGJIFJIJJIJIHJJJJJJJJJJJJJJJJJJJJJJIJJIIIJIGIFJJJJJJIJJJIJJHHH
obs - as observed; ref - as ref; not - as notation, qua - as phred qual
This is one of the many cases. Was it intentional for bwa not to recognize these base pair substitution event? Both past version (0.7.8) and the latest version (0.7.15) of bwa mem & aln method commit the same type of action.
Inclusion of literal tab characters in the bwa command line (notably, when passing a @rg read group line) leads to inclusion of literal tab characters in the @rg portion of the CL: field within the @pg line written in the SAM header. Tab control characters within fields (instead of as field separators) violates the SAM spec, and causes downstream htslib / htsjdk to choke.
See the issue I filed with htsjdk here (although the fix should ultimately be a bwa update):
samtools/htsjdk#677
This could be fixed in a couple of different ways, including (1) forbidding tab characters and (2) rewriting the CL field of the @pg line. For simplicity and cliarity sake, I vote to forbid the literal tab characters. Rewriting the @pg line does not tell the true story of the actual command line passed to bwa, which could be argued also violates the spec.
Anyway, I fixed this and a pull request is incoming.
I have noticed that insert sizes from BWA-MEM seem to start high at the beginning of a chromosome, and drop down towards the end. This isn't really an issue (feel free to close and ignore) but I wanted to put it out here to see if anyone has a good explanation.
I'm mapping 100bp PE Illumina data to a reference genome from another species. The reference (Eucalyptus grandis) is ~5% diverged from the study species. It has 11 good chromosomes, and a bunch of rubbish scaffolds. Gene order etc. is well conserved between the two.
I'm running BWA-MEM and looking at the output with Qualimap like this:
bwa mem $ref $R1 $R2 -t $threads > out.sam
qualimap bamqc -bam out.bam -outdir $outBWAMEM"qualimap_all/" -nt $threads -c
What I don't understand is the plot of insert sizes. My expectation is that most insert sizes will be ~200bp. But what I observe from BWA-MEM looks very different.
nevertheless, the insert-size histogram looks fine, and looking at the mapping in IGV looks fine too.
So, it looks like the start of chromosomes must contain a small number of pairs with enormous insert sizes. This result is not shared by other mappers like NextGenMap, bbmap, or Stampy (all run with default params). They all look more like the plot below (this from NextGenMap):
I'm struggling to understand the difference.
Hi,
The bwa alt README states the following:
" If there are both ALT and non-ALT hits, non-ALT hits will be primary and ALT hits be supplementary"
However, I am seeing alignments such as this:
HWI-ST1023:220:H84P1ADXX:1:1208:13259:86761 83 chr1 2539909 52 107M = 2539909 -107 GTCCCTGCACTGCACACTGCACCTGGCCAGGGGCACCAAGTCAGGGGGCAGGGCCCTCTACCCATCTCCCGATCGCAACCAACCTGGCCACACGGGGACCCGTAGGG G@GFGBGI@HDHHAH?GCHI@FGBEGFHEEEEGH@FHAEG?GEDEDEFHDDDFFCF@F@?DDF@@F@BB?D@@?DEA?BEA?AEBACAD?D?>ABBD@CC=D?A?B@ NM:i:0 MD:Z:107 AS:i:107 XS:i:100 XA:Z:chr1,+2539395,104M3S,1;chr1_KI270762v1_alt,-91099,107M,0;chr1_KI270762v1_alt,+90585,104M3S,1;
In the above, there is single SAM record output, with alt hits in the XA tag. No supplementary record is output.
I also see cases where the alt hits are output as supplementary records like the following:
HWI-ST1023:220:H84P1ADXX:2:1116:14501:71115 83 chr1 2650066 43 151M = 2649959 -258 CTGACAGCCTGGAATGGCACTCTGCATCCCCAGGTGAGCATCTGACAGCCTGGAATGGTACTCTACACCCTCAGGTGAGCTTCTGACAGCCTGGAATGGCACTCTACATCCCCAGGTGAGCTTCTGACAGCCTGGAATGGCACCCCACACC ?@A=BB?<>@@GA?BEGF@EAGBGH@AB;BDCEGBHDGH@@EBH@HCGEBBFGB@BBB;@EAC:-B?7EC>CDCDDGCEEAAGCF@GEFDFDBDA@CBDF?E@E@?EA@CBBECBCCDCDEA@ECD?EBCBDBAC@?BBDE@CCCF>D>C@ NM:i:5 MD:Z:37A70C2T31T1T5 AS:i:128 XS:i:121 SA:Z:chr1_KI270762v1_alt,200851,-,151M,50,0; pa:f:0.848
HWI-ST1023:220:H84P1ADXX:2:1116:14501:71115 2131 chr1_KI270762v1_alt 200851 50 151M chr1 2649959 0 CTGACAGCCTGGAATGGCACTCTGCATCCCCAGGTGAGCATCTGACAGCCTGGAATGGTACTCTACACCCTCAGGTGAGCTTCTGACAGCCTGGAATGGCACTCTACATCCCCAGGTGAGCTTCTGACAGCCTGGAATGGCACCCCACACC ?@A=BB?<>@@GA?BEGF@EAGBGH@AB;BDCEGBHDGH@@EBH@HCGEBBFGB@BBB;@EAC:-B?7EC>CDCDDGCEEAAGCF@GEFDFDBDA@CBDF?E@E@?EA@CBBECBCCDCDEA@ECD?EBCBDBAC@?BBDE@CCCF>D>C@ NM:i:0 MD:Z:151 AS:i:151 XS:i:128 SA:Z:chr1,2650066,-,151M,43,5; XA:Z:chr1,-2650066,151M,5;chr1,-2650148,151M,6;
Is there known logic that determines whether a supplementary record is output for alt hits?
Thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.