lh3 / bwa Goto Github PK

View Code? Open in Web Editor NEW

1.5K 1.5K 550.0 1.67 MB

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)

License: GNU General Public License v3.0

Makefile 0.51% C 86.31% JavaScript 5.64% Shell 0.57% Perl 1.24% C++ 1.50% Roff 4.23%

bioinformatics fm-index genomics sequence-alignment

bwa's People

Contributors

Stargazers

Watchers

Forkers

aleidenroth tabbott roelkluin tobiasmarschall hyphaltip nh13 bujingde genome-vendor raonyguimaraes xied75 campagnelaboratory majulier pdragy sjackman genome traymond rpique drio ntthong gusevfe c0mpsc1 bishoyh wtsi-npg daviesrob drkeoni leemendelowitz dmeister jkaiser hammer sebhtml lindenb vsbuffalo kenongit mp15 jmarshall sonwell genecodes frogee genepi totakke tdanford zestrada cuskinfor bpow ak352 exascience janee solecisme wjc007 androniki gkno berutti kontakter pkerpedjiev bioboy2014 yfu torchmed nsoranzo annazare cfilip04 chrovis earonesty gt1 marghoob jwbargsten mmoisse cbolson1 cinquin a113n theboocock ilveroluca kg5 abdul59 guochangjiang whitwham allenday yinhao85 bioinf-wuerzburg liupfskygre muchabi ereckase jtarraga mcshane yvdriess mmuggli aukluobo heuermh tudyzhou rodoyle zengfengbo truman8432 intel-hss dpryan79 ganyfml neocruiser zhmz90 widewater zhangjikai boukn combine-lab

bwa's Issues

How to fix a randomization seed while aligning reads?

Hi, @lh3

As I may use bwa for some calling repeatability test for some low frequency variants, so I suspect that there may need an option for initialize the randomization seed, but unfortunately, I can't find it.
Is it convenient to add this option in bwa mem?
For the use of multithread in bwa, I'm not that sure whether it is available.

Best Regards and Thanks.

Perpetual [M::process] read x sequences (x bp) and [M::mem_process_seqs]

Edit: Yep, just my slow computer.

Hey guys, so I ran the bwa mem pacbio command as follows:

bwa mem -x pacbio reference.fa PacBioReads.fastq > PacBioRawtoReference.sam

The bwa indexed reference files are in the same directory as the .fasta

But I get a recursive output that has repeated for hours now:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1004 sequences (10009168 bp)...
[M::process] read 950 sequences (10028116 bp)...
[M::mem_process_seqs] Processed 1004 reads in 109.896 CPU sec, 110.125 real sec
[M::process] read 876 sequences (10032186 bp)...
[M::mem_process_seqs] Processed 890 reads in 123.662 CPU sec, 123.758 real sec
[M::process] read 916 sequences (10010820 bp)...
[M::mem_process_seqs] Processed 876 reads in 129.959 CPU sec, 130.107 real sec
[M::process] read 856 sequences (10018929 bp)...
[M::mem_process_seqs] Processed 916 reads in 132.331 CPU sec, 133.119 real sec
[M::process] read 846 sequences (10008623 bp)...
[M::mem_process_seqs] Processed 856 reads in 118.908 CPU sec, 118.961 real sec
[M::process] read 761 sequences (9136775 bp)...
[M::mem_process_seqs] Processed 846 reads in 126.991 CPU sec, 127.065 real sec

I'm wondering if this means there is a problem with input files, an installation error, or bwa is simply still running and will finish sometime in the near indiscriminate future due to the lower processing power of the computer?

Thanks

bwa sampe [fread] Unexpected end of file

I am running follwoing commad to convert .sai output of bwa aln to sam file
bwa sampe ref.fa read1.sai read2.sai read1.fastq read2.fastq >read.sam
and I am getting following error.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[fread] Unexpected end of file
sam file has no sequences
When I use samse converted read1.sai and read2.sai to separte sam files it works fine.
I also to 50 sequences from read1.fastq and read2.fastq did alignment and converted to sam file using sampe it also worked.
I am not sure why I am getting this error.
Really appreciate any help.
Thanks

Segmentation fault with specific data set

Hi,
I've uploaded an archive containing a data set that leads to a segfault when doing
bwa bwasw -t 20 00-optrApE349.fas 2708-UW15005_S1_L001_R1_001.fastq.gz 2708-UW15005_S1_L001_R2_001.fastq.gz -f test.sam
The compressed tar archive contains the data set, a script named run_bwa_Segmentation_fault you can simply run to reproduce the issue and the output of the run in the file bwa.out.
I have tested the problem with four versions of bwa (0.7.10, 0.7.14 and 0.7.15) all with the same result and also on two different machines (all running Debian stable, the bwa installation was also from according Debian packages).
Are you able to reproduce the issue and even more importantly fixing it? As a member of the Debian Med packaging team I'd happily fix the issue in the Debian packaging as soon as there is a fix available.
Kind regards and thanks for providing bwa as Free Software, Andreas.

64 bit index breaks alt aware alignment.

If you have a 64 bit system, binary and index file set but no matching reference.fasta.64.alt file (say only a reference.fasta.alt that comes from run-gen-ref) then bwa will not run as alt-aware.

While you can just copy/softlink the reference.fasta.alt to reference.fasta.64.alt to fix this, it might be better to remove the .64 from the .alt file's path since the .alt file is not created by the index command and thus should not be pathed with other files that are.

Is README-alt.md up to date?

The README-alt.md document describes the strategy for handling ALTs as requiring the post-processing step with bwa-postalt.js, but I was told that BWA now handles that internally so it's no longer necessary to run the post-processing script separately. Is the documentation out of date or does this internal handling feature only apply to a development version and not to the latest release?

Possible deadlock

We run into a deadlock when running BWA (0.7.8, 0.7.10, 0.7.13, 0.7.15) on new machines on our HPC.

BWA has been working fine on the old nodes (Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz) but runs into a deadlock when running on the new machines (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz).
This is reproducible and does not depend on number of threads specified.

We run BWA which following paramters:
bwa mem -t 8 -M -R \@RG\\tID:dummy\\tSM:dummy\\tPL:Illumina\\tLB:dummy ref.fa input.fastq > output.sam

The last output lines:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 100000 sequences (10000000 bp)...
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.408 CPU sec, 20.355 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 19.992 CPU sec, 19.879 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.880 CPU sec, 20.755 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.092 CPU sec, 19.984 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.088 CPU sec, 19.967 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 20.540 CPU sec, 20.433 real sec
[M::process] read 100000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 100000 reads in 19.676 CPU sec, 19.546 real sec
[M::process] read 100000 sequences (10000000 bp)...

On the old machines this runs without any problems.
On the new machines after around 10 minutes CPU time (sometimes more and sometimes less) the bwa process hangs (0 % CPU usage).
When we attach with strace -p PID the process continues again normally. This is reproducible and also happens with -t 1 setting.

We attached gdb to the hung process and this is the stacktrace of all threads:


Thread 4 (Thread 0x2aaab876a700 (LWP 34136)):
#0  0x00002aaaab6fb5ab in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab687b78 in _L_lock_3817 () from /lib64/libc.so.6
#2  0x00002aaaab682b59 in _int_free () from /lib64/libc.so.6
#3  0x000000000041b254 in process (shared=<optimized out>, step=<optimized out>, _data=0x2aaabc0008c0) at fastmap.c:91
#4  0x000000000042094d in ktp_worker (data=0x7fffffff87d0) at kthread.c:106
#5  0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#6  0x00002aaaab6ef02d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaab896b700 (LWP 34137)):
#0  0x00002aaaab1ee4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000420b7e in kt_for (n_threads=<optimized out>, func=func@entry=0x43c000 <worker1>, data=data@entry=0x2aaab896ad10, n=<optimized out>) at kthread.c:59
#2  0x000000000043d24c in mem_process_seqs (opt=0x657120, bwt=<optimized out>, bns=0x657290, pac=<optimized out>, n_processed=<optimized out>, n=800000, seqs=0x2aaad8000010, pes0=0x0) at bwamem.c:1189
#3  0x000000000041b325 in process (shared=0x7fffffff8950, step=<optimized out>, _data=0x2aaacc0008c0) at fastmap.c:84
#4  0x000000000042094d in ktp_worker (data=0x7fffffff87f0) at kthread.c:106
#5  0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#6  0x00002aaaab6ef02d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaad1852700 (LWP 34161)):
#0  0x00002aaaab6fb5ab in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab687f2c in _L_lock_11235 () from /lib64/libc.so.6
#2  0x00002aaaab6859e3 in malloc () from /lib64/libc.so.6
#3  0x0000000000440f29 in wrap_malloc (size=size@entry=8, file=file@entry=0x44dff2 "bwamem.c", line=line@entry=662, func=func@entry=0x44e867 <__func__.7724> "mem_chain2aln") at malloc_wrap.c:25
#4  0x0000000000437aec in mem_chain2aln (opt=opt@entry=0x657120, bns=bns@entry=0x657290, pac=pac@entry=0x2aaab64e2010 "W\001\\\005p\025\300]\340\327\303W\003W\003\177\003\\MN\r\\\f\\=\\\005`\026\277w\257\200\064\373\263\060\343\017\363o\374\354>|\373\356\354\217\377", 
   l_query=l_query@entry=100, query=query@entry=0x2aaacc1c47a0 "\001\001\003\003\002\003\003", c=0x2aaadcea1288, av=av@entry=0x2aaad1851e90) at bwamem.c:662
#5  0x000000000043be33 in mem_align1_core (opt=0x657120, bwt=<optimized out>, bns=0x657290, pac=0x2aaab64e2010 "W\001\\\005p\025\300]\340\327\303W\003W\003\177\003\\MN\r\\\f\\=\\\005`\026\277w\257\200\064\373\263\060\343\017\363o\374\354>|\373\356\354\217\377", l_seq=100, 
   seq=0x2aaacc1c47a0 "\001\001\003\003\002\003\003", buf=0x2aab061d56c0) at bwamem.c:1037
#6  0x000000000043c08d in worker1 (data=0x2aaab896ad10, i=<optimized out>, tid=5) at bwamem.c:1151
#7  0x00000000004209f3 in ktf_worker (data=0x2aaab896ac20) at kthread.c:41
#8  0x00002aaaab1ed0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002aaaab6ef02d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaaae8d80 (LWP 34135)):
#0  0x00002aaaab1ee4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000420ccf in kt_pipeline (n_threads=n_threads@entry=2, func=func@entry=0x41b0a0 <process>, shared_data=shared_data@entry=0x7fffffff8950, n_steps=n_steps@entry=3) at kthread.c:142
#2  0x000000000041bbd2 in main_mem (argc=argc@entry=8, argv=argv@entry=0x7fffffff8c60) at fastmap.c:353
#3  0x0000000000402ff0 in main (argc=<optimized out>, argv=0x7fffffff8c58) at main.c:85

bwa aln CL arguments from BAM @PG line output by samse

The current @PG line in the bam header from files that were converted from .sai to .bam using bwa samse contain only the arguments used in the call to bwa samse

@PG	ID:bwa	PN:bwa	VN:0.7.13-r1126	CL:bwa samse -n 100 mygenome.fa output.sai input.fastq.gz

But this is not very useful; I want to see how the alignment was done, not how the format conversion went. Or did I overlook something? Of course I can wrap things with awk and whatnot, but that requires rewriting GBytes of on-disk data ... (this is from running bwa aln followed by bwa samse, but I guess it's the same for bwa mem and/or bwa sampe )

get sequences from index

Hi there!
I need to extract the nucleotide sequence from the coordinates provided by the 'bwa aln' procedure. Can I use bwa index for that? More generally, how can I use the index to fetch the nucleotide sequence at specified coordinates?

Thank you very much!
Shiran

remove scaffold and other unplaced sequences before mapping ?

Hi,
I downloaded reference genomes from Ensembl (fasta format).
But there are lots of sequences with name "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes

Such as Mouse_GRCm38 (mm10), except chromosome 1-19, Mt, X and Y; others should be removed before mapping ? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt

Such as Human_GRCh38.p5 (hg38), https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt, there are 516 sequences. In addition to chromosome 1-22, Mt, X and Y; others (such as CHR_HG2241_PATCH and KI270728.1) should be removed before mapping ?

bwa-postalt.js marks some records as supplementary that (I think) should be secondary

I posted this to the bwa-help list recently.

I have a pair of reads that map perfectly (NM=0) to chr20 and also to chr20_KI270869v1_alt (hs38DH). The latest version of bwa mem produces the following output which seems right:

query   65  chr20   61894983    60  75M =   61894987    5   CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0  MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8678,75M,0;
query   129 chr20   61894987    60  75M =   61894983    -5  CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0  MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8682,75M,0;

However, when I run them through the post-alt script the result is four records:

query   65  chr20   61894983    60  75M =   61894987    5   CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0  MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8678,75M,0;  om:i:60
query   2113    chr20_KI270869v1_alt    8678    60  75M chr20   61894987    5   CAGCCACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGG NM:i:0  lt:Z:chr20,61894981,61895056,+;
query   129 chr20   61894987    60  75M =   61894983    -5  CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0  MD:Z:75 AS:i:75 XS:i:25 XA:Z:chr20_KI270869v1_alt,+8682,75M,0;  om:i:60
query   2177    chr20_KI270869v1_alt    8682    60  75M chr20   61894983    -5  CACAGCCATCATCACGGTGACAGATGTGAATGACAACCCGCCAGAATTTACCGCCAGCACGGTGAGTCCCTCGAA GGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:0  lt:Z:chr20,61894985,61895060,+;

The extra records are marked as supplementary. My understanding, mirrored by the language in the SAM spec, is that supplementary records are for recording pieces alignments of that are not co-linear with the reference and thus can't be stored in a single record, while secondary/not-primary records are for storing alternative alignments of the same read.

Having the reads marked as supplementary throws off tools that wish to ignore secondary alignments, but consume supplemental alignments. E.g. Picard's CollectHsMetrics will report a significantly deflated % selected, % on target etc. when capturing genes that have equally good mappings to the primary assembly and one or more alts, because the records are essentially duplicated.

pac2bwtgen error: Not enough space allocated to continue construction!

bwa pac2bwtgen does not work with very short references (len < 8)

Does not work:

On-y-go-2:20170407172100 karel$ printf '>\nAAAAAAA' > index.fa && bwa fa2pac index.fa index.fa && bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Version: 0.7.15-r1140
[main] CMD: bwa fa2pac index.fa index.fa
[main] Real time: 0.002 sec; CPU: 0.003 sec
[BWTIncCreate] textLength=14, availableWord=65536
BWTIncSetBuildSizeAndTextAddr(): Not enough space allocated to continue construction!

Works:

On-y-go-2:20170407172100 karel$ printf '>\nAAAAAAAA' > index.fa && bwa fa2pac index.fa index.fa && bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Version: 0.7.15-r1140
[main] CMD: bwa fa2pac index.fa index.fa
[main] Real time: 0.004 sec; CPU: 0.004 sec
[BWTIncCreate] textLength=16, availableWord=65536
[bwt_gen] Finished constructing BWT in 1 iterations.
[main] Version: 0.7.15-r1140
[main] CMD: bwa pac2bwtgen index.fa.pac index.fa.bwt
[main] Real time: 0.001 sec; CPU: 0.003 sec

how to get uniquely mapped reads from a bam produced by BWA version 0.7.10 ?

Dear Author ,
About my question, I have searched so long time on the internet ,but find nothing useful.
somebody says :
XT ,XA , samtools view -q 1, samtools view -q 30, samtools view -q 40 and so on.
SO which should I to confirm and use ??
I just wanna keep uniquely mapped reads for NIPT.

thank you !

Ambiguous reference bases

Would it be possible to handle ambiguous reference bases properly? That would increase the alignment quality for viral use cases.

Thank you.

Problem of multithreading.

Dear BWA developers,

First of all, I wish you an happy new year.

I am currently experiencing a problem with bwa-mem. In my lab, I have two nodes, one "old" node and a new node that we received four months ago, which has a "different" generation of processors (namely more recent processors). When I launch BWA on my "old" node, on one or more thread(s), I have no problem. When I launch BWA on 1 thread on my "new" node, I have no problem too. But when I launch BWA on multi-threads (e.g 5), the job runs without generating any results and never stops. I am obliged to stop him myself. For information, I run exactly the same command as on my old node. By setting the option -v to 2, I have no error/warning different between the jobs tested. On the new node, with multithreading, the job just keeps "turning in a vacuum".

Earlier I had already encountered this problem with the gsnap tool. The developer had advised me to use their latest version that solved this error.

But here, I use the last version of bwa :

bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.15-r1142-dirty
Contact: Heng Li [email protected]
Usage: bwa [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with bwa index'. There are three alignment algorithms in BWA: mem', bwasw', and aln/samse/sampe'. If you are not sure which to use, try bwa mem' first. Please man ./bwa.1' for the manual.

So, have you ever heard of this problem of multithreading ? And do you have a solution?

Thank you very much.

Have a nice day,
Amandine

Extended CIGAR (sam v.1.4)

Hi,

I am using bwa mem to align pacbio reads to a reference. I am wondering if there is a way to obtain an extended CIGAR string (sam v.1.4) in the sam output, as default in bwa aln. I'd like to retrieve details on match/mismatch (X,=) instead of a generic alignment match (M).

Thanks in advance,
Amina

bwa index problem

I am trying to do:
"bwa index E617.polished_assembly.fasta"
But error message shows that:
"[bwa_index] Pack FASTA... [bns_fasta2bntseq] fail to open file 'E617.polished_assembly.fasta.pac'. Abort!
Aborted (core dumped)"

I want to know how to solve this problem.
Any help would be appreciated.
Thanks!

Minor http vs https issue in kopen.c

Currently the kopen function defined in kopen.c only handles http inputs but not https.
It might be useful for the user to be able to load sequence files from https as well.

Trivial change on line 270 to make this happen:
if (strstr(fn, "http://") == fn)
to
if (strstr(fn, "http://") == fn || strstr(fn, "https://") == fn)

BWA not producing output

I am using BWA/0.7.15-foss-2016a
We did an strace and it seems to not do anything although it uses 800 % of the CPU.

I am using the command
bwa mem -t 8 -U 15 -M -R '@rg\tID:'$acc'\tSM:'$acc'\tPL:Illumina\tLB:'$acc $ref $acc.1.fastq.gz $acc.2.fastq.gz > $out/$acc.sam

where the variables $ref is the reference file
$acc is the sample being mapped
$out is the output directory

The log file only shows
14312
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 800000 sequences (80000000 bp)...
[M::process] read 800000 sequences (80000000 bp)...

I am using 1 node with 16 ncpus and 34GB of memory
and it has been running for more than 1 day now

BWA compilation on musl-libc fails

Do you plan supporting mucl-libc? Compilation of BWA on Alpine linux, failed, first at kthread.c.
I solved it by adding #include <stdint.h> to kthread.c.
But later I got an error I cannot work around:

gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  utils.c -o utils.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  kthread.c -o kthread.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  kstring.c -o kstring.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  ksw.c -o ksw.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwt.c -o bwt.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bntseq.c -o bntseq.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwa.c -o bwa.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem.c -o bwamem.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem_pair.c -o bwamem_pair.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem_extra.c -o bwamem_extra.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  malloc_wrap.c -o malloc_wrap.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  QSufSort.c -o QSufSort.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwt_gen.c -o bwt_gen.o
bwt_gen.c: In function 'BWTIncBuildRelativeRank':
bwt_gen.c:878:10: warning: variable 'oldInverseSa0RelativeRank' set but not used [-Wunused-but-set-variable]
  bgint_t oldInverseSa0RelativeRank = 0;
          ^
bwt_gen.c: In function 'BWTIncMergeBwt':
bwt_gen.c:952:15: warning: variable 'bitsInWordMinusBitPerChar' set but not used [-Wunused-but-set-variable]
  unsigned int bitsInWordMinusBitPerChar;
               ^
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  rope.c -o rope.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  rle.c -o rle.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  is.c -o is.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwtindex.c -o bwtindex.o
ar -csru libbwa.a utils.o kthread.o kstring.o ksw.o bwt.o bntseq.o bwa.o bwamem.o bwamem_pair.o bwamem_extra.o malloc_wrap.o QSufSort.o bwt_gen.o rope.o rle.o is.o bwtindex.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwashm.c -o bwashm.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwase.c -o bwase.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwaseqio.c -o bwaseqio.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwtgap.c -o bwtgap.o
In file included from bwtgap.c:4:0:
bwtgap.h:9:2: error: unknown type name 'u_int32_t'
  u_int32_t info; // score<<21 | i
  ^
bwtgap.h:10:2: error: unknown type name 'u_int32_t'
  u_int32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
  ^
bwtgap.h:11:2: error: unknown type name 'u_int32_t'
  u_int32_t n_ins:16, n_del:16;
  ^
bwtgap.c: In function 'gap_push':
bwtgap.c:62:13: error: 'u_int32_t' undeclared (first use in this function)
  p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
             ^
bwtgap.c:62:13: note: each undeclared identifier is reported only once for each function it appears in
bwtgap.c:62:23: error: expected ';' before 'score'
  p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
                       ^
Makefile:25: recipe for target 'bwtgap.o' failed

Add support to force global / glocal / end-to-end alignment

@lh3 Heng,

I am writing in the hope that you will consider adding a -g option to force global (really glocal, or end-to-end) alignment in BWA MEM.

I understand that BWA MEM currently will attempt to return a global alignment if the score is "close" to the best local alignment (based on the -L clipping penalties), otherwise prefers local alignment and puts soft clips in the SAM file.

There are cases where I want to force global alignment. For example, correcting an assembly with the same reads that generated the assembly, where naively passing BWA MEM output will sometimes result in false SNPs.

I had thought that just setting -L 99999,99999 might produce my desired behaviour, but it doesn't - probably due to the if (gscore <= 0 conditional in these lines:

https://github.com/lh3/bwa/blob/master/bwamem.c#L736
https://github.com/lh3/bwa/blob/master/bwamem.c#L764

I tried forcing that with a patch but it didn't seem to make a difference. My attempt was probably naive so I am hoping you can help / advise here!

Torsten

CC: @drpowell

run-gen-ref hs38 results in unexpected EOF "No such directory "genbank/genomes/Eukary…"

I'm trying to install GRCh38 and it fails each time with an unexpected end of file (run-gen-ref for hs38a and hs38DH fail exactly the same way). It seems to be unable to locate all of the chromosomes, unplaced and unlocalized contigs and EBV files.

Any help would be greatly appreciated.

Thank you very much.

-bash-4.1$ bwa.kit/run-gen-ref hs38
--2016-05-27 15:52:54-- ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
=> “-”
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.10, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.10|:21… connected.

Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD … done.
==> TYPE I ... done. ==> CWD (1) /genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines …
No such directory “genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines”.

gzip: stdin: unexpected end of file

Can I bulid a new reference for bwa ?

Dear Professor,
I wanna use a specified gene sequence to build a reference , and then mapp PE fastq files (fq1,fq2) to this new reference gene genome via BWA .
that's just a thought ,I do not konw it's availble ?
if availble, which algorthims should I choose from BWA ??

BEST

bwa bwasw crashed with the error '[vfprintf(stdout)] Value too large for defined data type'

bwa bwasw crashed with the error '[vfprintf(stdout)] Value too large for defined data type'.
Version: 0.7.15-r1142

I used 'bwa bwasw ' to compare two assembly reference.
The contig size maybe too large for bwasw, but it should not crash(coredump).

+ bwa bwasw -t 144 /ssd/biowrk/TAIR/canu.eval/ref.fa /ssd/biowrk/TAIR/canu.min4000/tair.contigs.fasta
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 85 sequences/pairs (121314957 bp) ...
[vfprintf(stdout)] Value too large for defined data type

how to calculate MAPQ in a bam ?

Dear Professor,
I do not know how to calculate MAPQ (MAPping Quality),can you give me some details ?

  thank you !

incorrect alignment of long contigs / problem with bandwidth tracking?

I'm aligning long assembled contigs (from Supernova) to hg19. Occasionally bwa mem will generate long runs of bad alignments with many long indels, for sequence that actually matches well.
From tracing through some examples, it appears that alignments merged by mem_patch_reg will sometime not be subsequently processed by bwa_gen_cigar2 with a large enough bandwidth to generate a correct alignment. I can work around the issue by raising PATCH_MIN_SC_RATIO of 0.95 (https://github.com/lh3/bwa/blob/master/bwamem.c#L404), but I imagine there's a better solution.

I attached an example sequence which triggers the problem when aligned to hg19, on the latest bwa mem code. You can see the issue at chr2:122,075,076-122,087,607.

Does not show the problem:
bwa mem -v 5 -x intractg -w 500 -d 200 /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa 4535_slice.fasta

Shows the problem:
bwa mem -v 5 -x intractg -w 600 -d 200 /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa 4535_slice.fasta

4535_slice.fasta.txt

IGV tracks of the alignments resulting from the above command are show here:

Thanks!
Pat Marks

run-gen-ref hs38 link broken

run-gen-ref cannot be used to download hs38/hs38DH as the dowload link is broken:

 $ ./run-gen-ref hs38DH
--2017-03-16 10:58:37--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
           => â-â
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids ...
No such directory âgenomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_idsâ.


gzip: stdin: unexpected end of file

Please run 'bwa index hs38DH.fa'...

Output for unmapped pair

Hi,

I have a pair of read. The first reads can be mapped. The second one can NOT be mapped.

Here is the mapping results:

@PG     ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem ../../results/simu_TEP_ara/tair10.fa test_1read_R1.fq.gz test_1read_R2.fq.gz
chr1_8301745_8302017_2:0:0_1:0:0_0      121     chr1    8296984 60      100M    =       8296984 0       TTTAGGCCCAATAAGGATAATTTGAAACCCACCCCAAAATTTAAACAGAGATGGTTGGATTATGGGCACAAGGTTCCAATTGCTAGATAGCTAAAATTTT        2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222        NM:i:1  MD:Z:95C4       AS:i:95     XS:i:0
chr1_8301745_8302017_2:0:0_1:0:0_0      181     chr1    8296984 0       *       =       8296984 0       GGATTTACTATTTGATCGATACTGCTTTGAACCCCAAATCAAACTGTGAGTTGAACAAATGGATTTACTATTTGATCGATACTGCTTTGAACCCCAAATC        2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222        AS:i:0  XS:i:0

In the second line, the summary of the FLAG is:
Summary:
read paired
mate unmapped
read reverse strand
mate reverse strand
first in pair

My question is why the seventh line is "=" rather than "". Does this mean that if I want to check whether the mate is mapped or not, I can NOT just select the seventh with "" and need to extract based on FLAG?

Best,
Shaojun Xie

BWA won't work when the first letter of the quality line in FASTQ file is '@' symbol.

If the first letter of the forth line in every reads, indicating the quality, is '@' symbol, will make BWA process terminated. For instance:

@sequences_name
ATTGCATGCATGC
+
@@dddad,AAFDF

To solve this problem, I have to replace the initio '@' in the quality line to 'A', hoping the replacement won't affect the coming analysis.
Also, if two reads are paired, and their names are almost identical but end with "-1" and "-2", which can distinguish them, BWA will abort too. I think you can identify two paired reads' names without considering the last number if it is "1" or "2".

(XT and XA tag conflict)Purpose of skipping in function bwa/bwase.c: bwa_aln2seq_core

if we look at this segment of code in bwa_aln2seq_core function:

if (set_main) {
		best = aln[0].score;
		for (i = cnt = 0; i < n_aln; ++i) {
			const bwt_aln1_t *p = aln + i;
			if (p->score > best) break;
			if (drand48() * (p->l - p->k + 1 + cnt) > (double)cnt) {
				s->n_mm = p->n_mm; s->n_gapo = p->n_gapo; s->n_gape = p->n_gape;
				s->ref_shift = (int)p->n_del - (int)p->n_ins;
				s->score = p->score;
				s->sa = p->k + (bwtint_t)((p->l - p->k + 1) * drand48());
			}
			cnt += p->l - p->k + 1;
		}
		s->c1 = cnt;
		for (; i < n_aln; ++i) cnt += aln[i].l - aln[i].k + 1;
		s->c2 = cnt - s->c1;
		s->type = s->c1 > 1? BWA_TYPE_REPEAT : BWA_TYPE_UNIQUE;
	}

we can see that the s->type, the final XT tag, will be determined by the cnt variable.
What's more, cnt variable is determined by the break in for loop 'if (p->score > best) break;'
This may result in a conflict in final sam records, e.g. when XT report unique hit, but XA may result in several alternative hits.
Could anyone please enlighten me why do we need this break, what does this part actually doing?
Thanks!

error with long comment+name

Running as:

bwa mem -pC $base long.fq.gz

gives nonsense for the read name (and alignemtns)

If I do:

bwa mem -pC  $base  <(zcat long.fq.gz | awk '{ print substr($0, 0, 300) }')

or any value smaller, then it aligns and outputs as expected.

long.fq.gz

how to Improve mapping results

HI,

I have been struggling to improve my alignment but I am not sure which options will do the trick. I have paired illumina reads of mean 200 bp in length and mean quality is 35. Still i manage to get alignment not more than 75 percent.
bwa mem -M ref.fa read1.fq read2.fq > aln-pe.sam
Many thanks,
zusman

bwa.1, line 1: recommend updating to reflect current release date and version

Pair information for ALT

Hi,

I'm using bwakit for grch38 aligned bam. I'm developing tools to support grch38 and it requires correct pair information for reads (properly paired reads). However, I've noticed most of ALT reads are not properly paired. Mate pair of ALT reads exist in nonALT instead of ALT, although I clearly see many examples where both ends of the pair were present at ALT.
This issue was addressed previously (https://sourceforge.net/p/bio-bwa/mailman/message/32914937/). Is there any update yet?

Thanks,
MJ

what's the difference between mapping PE fastqs to genome and to specified target region?

Dear Professor,
I build a new reference with a gene sequence.
Step 1 :
Using PE fastqs to mapp complete genome (hg19)
Step 2 :
Using PE fastqs to mapp part genome sequence ,eg: gene region

   what 's the difference between  Step 1 and Step 2 ?    I wanna use bwa mem way.

Best

bwa samse failed

Hi,

I failed to run
"bwa samse -n 1 ./p_ctg_formated.fa ./read.sai - | grep '(^@|XT:A:U)' | samtools view -S -h -b -F 0x4 - | samtools sort -T ./ -@ 20 -no - ./temporarySam > ./san_fct3_5k.map".
The error message is " [E::hts_open_format] fail to open file './temporarySam'
[bam_sort_core] fail to open './temporarySam': No such file or directory
[bam_sort] Note the <out.prefix> argument has been replaced by -T/-o options
[gzread] Bad file descriptor"
Anyone can help?

Thanks,
Quan

Split alignments are reported when one alignment contains another

For some reads, split alignments are written in which one alignment is fully contained in the other alignment using the SA tags (bwa 0.7.15).

HISEQ1:93:H2YHMBCXX:1:2214:10015:52441 2115 chrUn_JTFH01001478v1_decoy  12    1 250M    chr21   10325621       0        <snip>  NM:i:13 SA:Z:chr21,10325808,-,240M10S,9,19;

How are such split alignments to be interpreted? Can the documentation be updated to reflect the meaning of split alignments for which all bases can be aligned to one of the locations.

url for GRCh38 download in run-gen-ref in bwakit-0.7.15_x64-linux

I refer to bwa.kit version downloaded on 17Feb2017 using this link:

https://downloads.sourceforge.net/project/bio-bwa/bwakit/bwakit-0.7.15_x64-linux.tar.bz2

It seems that this version of bwa.kit contains an obsolete URL for GRCh38 download in run-gen-ref file.

It seems that the current URL for GRCh38 should be:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz

See URL change notice here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README_change_notice.txt

Could you check and correct, if necessary?
Thank you,
Alexey

Fails compile on POWER8: SSE2 intrinsics -> VSX/AltiVec

bwa fails to compile on IBM POWER8 platforms due to dependency on SSE2 intrinsics.

Make output:

[u0017592@sys-81783 bwa]$ make
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  ksw.c -o ksw.o
ksw.c:29:23: fatal error: emmintrin.h: No such file or directory
 #include <emmintrin.h>
                       ^
compilation terminated.
make: *** [ksw.o] Error 1

Make after removing ##include <emmintrin.h>:

[u0017592@sys-81783 bwa]$ make
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  ksw.c -o ksw.o
ksw.c:48:2: error: unknown type name ‘__m128i’
  __m128i *qp, *H0, *H1, *E, *Hmax;
  ^
ksw.c: In function ‘ksw_qinit’:
ksw.c:71:11: error: ‘__m128i’ undeclared (first use in this function)
  q->qp = (__m128i*)(((size_t)q + sizeof(kswq_t) + 15) >> 4 << 4); // align memory
           ^
ksw.c:71:11: note: each undeclared identifier is reported only once for each function it appears in
ksw.c:71:19: error: expected expression before ‘)’ token
  q->qp = (__m128i*)(((size_t)q + sizeof(kswq_t) + 15) >> 4 << 4); // align memory
                   ^
ksw.c: In function ‘ksw_u8’:
ksw.c:114:2: error: unknown type name ‘__m128i’
  __m128i zero, oe_del, e_del, oe_ins, e_ins, shift, *H0, *H1, *E, *Hmax;
  ^
ksw.c:130:2: warning: implicit declaration of function ‘_mm_set1_epi32’ [-Wimplicit-function-declaration]
  zero = _mm_set1_epi32(0);
  ^
ksw.c:131:2: warning: implicit declaration of function ‘_mm_set1_epi8’ [-Wimplicit-function-declaration]
  oe_del = _mm_set1_epi8(_o_del + _e_del);
  ^
ksw.c:139:3: warning: implicit declaration of function ‘_mm_store_si128’ [-Wimplicit-function-declaration]
   _mm_store_si128(E + i, zero);
   ^
ksw.c:146:3: error: unknown type name ‘__m128i’
   __m128i e, h, t, f = zero, max = zero, *S = q->qp + target[i] * slen; // s is the 1st score vector
   ^
ksw.c:147:3: warning: implicit declaration of function ‘_mm_load_si128’ [-Wimplicit-function-declaration]
   h = _mm_load_si128(H0 + slen - 1); // h={2,5,8,11,14,17,-1,-1} in the above example
   ^
ksw.c:148:3: warning: implicit declaration of function ‘_mm_slli_si128’ [-Wimplicit-function-declaration]
   h = _mm_slli_si128(h, 1); // h=H(i-1,-1); << instead of >> because x64 is little-endian
   ^
ksw.c:156:4: warning: implicit declaration of function ‘_mm_adds_epu8’ [-Wimplicit-function-declaration]
    h = _mm_adds_epu8(h, _mm_load_si128(S + j));
    ^
ksw.c:157:4: warning: implicit declaration of function ‘_mm_subs_epu8’ [-Wimplicit-function-declaration]
    h = _mm_subs_epu8(h, shift); // h=H'(i-1,j-1)+S(i,j)
    ^
ksw.c:159:4: warning: implicit declaration of function ‘_mm_max_epu8’ [-Wimplicit-function-declaration]
    h = _mm_max_epu8(h, e);
    ^
ksw.c:184:5: warning: implicit declaration of function ‘_mm_movemask_epi8’ [-Wimplicit-function-declaration]
     cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_subs_epu8(f, h), zero));
     ^
ksw.c:184:5: warning: implicit declaration of function ‘_mm_cmpeq_epi8’ [-Wimplicit-function-declaration]
ksw.c:190:3: warning: implicit declaration of function ‘_mm_srli_si128’ [-Wimplicit-function-declaration]
   __max_16(imax, max); // imax is the maximum number in max
   ^
ksw.c:190:3: warning: implicit declaration of function ‘_mm_extract_epi16’ [-Wimplicit-function-declaration]
ksw.c: In function ‘ksw_i16’:
ksw.c:235:2: error: unknown type name ‘__m128i’
  __m128i zero, oe_del, e_del, oe_ins, e_ins, *H0, *H1, *E, *Hmax;
  ^
ksw.c:251:2: warning: implicit declaration of function ‘_mm_set1_epi16’ [-Wimplicit-function-declaration]
  oe_del = _mm_set1_epi16(_o_del + _e_del);
  ^
ksw.c:265:3: error: unknown type name ‘__m128i’
   __m128i e, t, h, f = zero, max = zero, *S = q->qp + target[i] * slen; // s is the 1st score vector
   ^
ksw.c:269:4: warning: implicit declaration of function ‘_mm_adds_epi16’ [-Wimplicit-function-declaration]
    h = _mm_adds_epi16(h, *S++);
    ^
ksw.c:271:4: warning: implicit declaration of function ‘_mm_max_epi16’ [-Wimplicit-function-declaration]
    h = _mm_max_epi16(h, e);
    ^
ksw.c:275:4: warning: implicit declaration of function ‘_mm_subs_epu16’ [-Wimplicit-function-declaration]
    e = _mm_subs_epu16(e, e_del);
    ^
ksw.c:292:5: warning: implicit declaration of function ‘_mm_cmpgt_epi16’ [-Wimplicit-function-declaration]
     if(UNLIKELY(!_mm_movemask_epi8(_mm_cmpgt_epi16(f, h)))) goto end_loop8;
     ^
make: *** [ksw.o] Error 1

To support the POWER8 platform, this code will have to be changed to optionally use VSX/AltiVec.

More information: https://bitbucket.org/umarcts/ppc64le-wiki/wiki/vsx.md

Multithread support for bwa index

Hi all,

Current databases are becoming increasingly large. Recently I've found myself indexing a large FASTA file and taking over 200CPU hours (single thread).

Searching for multithreaded support for bwa index I've landed on a 5 year old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to this patch though.

Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?

bwa-0.7.13-r1126: calculate proper NM:i tag for reads contain N's

Hi,
samtools calmd often outputs a warning like this:

[bam_fillmd1] different NM for read 'HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038': 0 -> 1

Per discussion at https://sourceforge.net/p/samtools/mailman/samtools-help/thread/8aaa2f1f-c31e-47c0-5e89-6fe10990c847%40gmail.com/#msg35256271 it turns out bwa could do better and calculate NM: tag not using the randomly modified sequence with N's replaced to any nucleotide, but increasing the NM: value by number of N's originally present in the aligned region. Is that possible? In turn, "samtools calmd" would issue less warnings, and one could focus on real inconsistencies. Thank you.

A question for MD and NM meaning ?

HI,
like this read:
QAt17575867:1:0 99 1 10115 0 43M1I26M = 10189 126 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAAACCCAAACCCAAACCCAAACCTAAN ?<@?AF?@dcbf>?CBBF?@DCCG?@DCCG?@DCCCA?<8@@/(72AC**7*7@*173=@=<=<>+0)5# MC:Z:18S52M BD:Z:LLJNLPQMJNJMOKIMIMOKHLIMOKIMIMOKIMIIMOKIMIMOKCJNJOKCJOKPLELPMRNGNRNPLL MD:Z:48T5T5T7C0 RG:Z:VRa BI:Z:OONONTRMNOLRQLMNLRQLMNLRRMMOMSRMNOMMSRMNOMSRMGOPNSPHOPOTQIPQOTQIQRSRMO NM:i:5 MQ:i:0 AS:i:46 XS:i:46

NM value equal to mismatch plus deletion ? not equal to mismatch plus insertion ??
how to explain the meaning of this read MD tag ?

bwa mem 0.7.13-r1126: fix ARGV parsing not to choke on quoted empty strings

Seems a quoted empty string upsets ARGV parser.

$ bwa mem -t 16 '' -M ref.fasta.gz pairs.fastq | head
[E::bwa_idx_load_from_disk] fail to locate the index files
$

$ bwa mem -t 16 -M ref.fasta.gz pairs.fastq | head
[M::bwa_idx_load_from_disk] read 0 ALT contigs
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chr10        LN:133797422
$

making XA_drop_ratio an option

Now it is hard coded as 0.8. It would be useful if users can specify it manually.

Typo in README-alt.md file ?

In the README-alt.md file
https://github.com/lh3/bwa/blob/master/README-alt.md
there is a sentence saying:

“In theory, non-alt alignments from step 1 should be identical to alignments against the reference genome WITH alt contigs.”

It might be that there is a meaningful typo in this sentence. Should it be

“In theory, non-alt alignments from step 1 should be identical to alignments against the reference genome WITHOUT alt contigs.” ?

Please, check and correct if it is a typo.

how to learn bioinformatics algorithm

Dear professor ,
I am a new guy for bioinformatics algorithm , My major is biology . I wann learn bioinformatics, especially about algorithm , So can you give me some suggestions ?
Thank you very much!

validity of CIGAR string

I have noticed that the CIGAR string column from the SAM output file sometimes does not fully contain the alignment difference between the query sequence and the reference sequence. For example with a sample alignment entry with a CIGAR string of 90M:

SRR2650329.12.indiv_AAA252      83      R36106  686     60      90M     =       30947000        -146    GCTTCTTAACAGCAACTCGCTTAGCCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAGTTAAACTGATGACCAAAGTCAGCCTGCA      >DDDDEDDDCDDDDDDDDEEEEFFFFHHHGJIFJIJJIJIHJJJJJJJJJJJJJJJJJJJJJJIJJIIIJIGIFJJJJJJIJJJIJJHHH      NM:i:2  MD:Z:23A37A28   AS:i:80 XS:i:31

Based on the CIGAR string, I expect to see a perfect 90bp perfect match of the query sequence to the reference sequence. Instead, when i compare those two sequences, I have found two omitted substitution:

OBS:    GCTTCTTAACAGCAACTCGCTTAGCCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAGTTAAACTGATGACCAAAGTCAGCCTGCA
REF:    GCTTCTTAACAGCAACTCGCTTAACCTGGCAGCCAGCTGCACAAATGTGTCAGAGGAAACAATTAAACTGATGACCAAAGTCAGCCTGCA
NOT:    _______________________*_____________________________________*____________________________
QUA:    >DDDDEDDDCDDDDDDDDEEEEFFFFHHHGJIFJIJJIJIHJJJJJJJJJJJJJJJJJJJJJJIJJIIIJIGIFJJJJJJIJJJIJJHHH

obs - as observed; ref - as ref; not - as notation, qua - as phred qual

This is one of the many cases. Was it intentional for bwa not to recognize these base pair substitution event? Both past version (0.7.8) and the latest version (0.7.15) of bwa mem & aln method commit the same type of action.

Inclusion of literal tab characters in command line leads to SAM spec violation

Inclusion of literal tab characters in the bwa command line (notably, when passing a @rg read group line) leads to inclusion of literal tab characters in the @rg portion of the CL: field within the @pg line written in the SAM header. Tab control characters within fields (instead of as field separators) violates the SAM spec, and causes downstream htslib / htsjdk to choke.

See the issue I filed with htsjdk here (although the fix should ultimately be a bwa update):
samtools/htsjdk#677

This could be fixed in a couple of different ways, including (1) forbidding tab characters and (2) rewriting the CL field of the @pg line. For simplicity and cliarity sake, I vote to forbid the literal tab characters. Rewriting the @pg line does not tell the true story of the actual command line passed to bwa, which could be argued also violates the spec.

Anyway, I fixed this and a pull request is incoming.

explanation of insert sizes from BWA-MEM

I have noticed that insert sizes from BWA-MEM seem to start high at the beginning of a chromosome, and drop down towards the end. This isn't really an issue (feel free to close and ignore) but I wanted to put it out here to see if anyone has a good explanation.

I'm mapping 100bp PE Illumina data to a reference genome from another species. The reference (Eucalyptus grandis) is ~5% diverged from the study species. It has 11 good chromosomes, and a bunch of rubbish scaffolds. Gene order etc. is well conserved between the two.

I'm running BWA-MEM and looking at the output with Qualimap like this:

bwa mem $ref $R1 $R2 -t $threads > out.sam
qualimap bamqc -bam out.bam -outdir $outBWAMEM"qualimap_all/" -nt $threads -c

What I don't understand is the plot of insert sizes. My expectation is that most insert sizes will be ~200bp. But what I observe from BWA-MEM looks very different.

nevertheless, the insert-size histogram looks fine, and looking at the mapping in IGV looks fine too.

So, it looks like the start of chromosomes must contain a small number of pairs with enormous insert sizes. This result is not shared by other mappers like NextGenMap, bbmap, or Stampy (all run with default params). They all look more like the plot below (this from NextGenMap):

I'm struggling to understand the difference.

alt alignments and supplementary records

Hi,

The bwa alt README states the following:

" If there are both ALT and non-ALT hits, non-ALT hits will be primary and ALT hits be supplementary"

However, I am seeing alignments such as this:

HWI-ST1023:220:H84P1ADXX:1:1208:13259:86761 83 chr1 2539909 52 107M = 2539909 -107 GTCCCTGCACTGCACACTGCACCTGGCCAGGGGCACCAAGTCAGGGGGCAGGGCCCTCTACCCATCTCCCGATCGCAACCAACCTGGCCACACGGGGACCCGTAGGG G@GFGBGI@HDHHAH?GCHI@FGBEGFHEEEEGH@FHAEG?GEDEDEFHDDDFFCF@F@?DDF@@F@BB?D@@?DEA?BEA?AEBACAD?D?>ABBD@CC=D?A?B@ NM:i:0 MD:Z:107 AS:i:107 XS:i:100 XA:Z:chr1,+2539395,104M3S,1;chr1_KI270762v1_alt,-91099,107M,0;chr1_KI270762v1_alt,+90585,104M3S,1;

In the above, there is single SAM record output, with alt hits in the XA tag. No supplementary record is output.

I also see cases where the alt hits are output as supplementary records like the following:

HWI-ST1023:220:H84P1ADXX:2:1116:14501:71115 83 chr1 2650066 43 151M = 2649959 -258 CTGACAGCCTGGAATGGCACTCTGCATCCCCAGGTGAGCATCTGACAGCCTGGAATGGTACTCTACACCCTCAGGTGAGCTTCTGACAGCCTGGAATGGCACTCTACATCCCCAGGTGAGCTTCTGACAGCCTGGAATGGCACCCCACACC ?@A=BB?<>@@GA?BEGF@EAGBGH@AB;BDCEGBHDGH@@EBH@HCGEBBFGB@BBB;@EAC:-B?7EC>CDCDDGCEEAAGCF@GEFDFDBDA@CBDF?E@E@?EA@CBBECBCCDCDEA@ECD?EBCBDBAC@?BBDE@CCCF>D>C@ NM:i:5 MD:Z:37A70C2T31T1T5 AS:i:128 XS:i:121 SA:Z:chr1_KI270762v1_alt,200851,-,151M,50,0; pa:f:0.848
HWI-ST1023:220:H84P1ADXX:2:1116:14501:71115 2131 chr1_KI270762v1_alt 200851 50 151M chr1 2649959 0 CTGACAGCCTGGAATGGCACTCTGCATCCCCAGGTGAGCATCTGACAGCCTGGAATGGTACTCTACACCCTCAGGTGAGCTTCTGACAGCCTGGAATGGCACTCTACATCCCCAGGTGAGCTTCTGACAGCCTGGAATGGCACCCCACACC ?@A=BB?<>@@GA?BEGF@EAGBGH@AB;BDCEGBHDGH@@EBH@HCGEBBFGB@BBB;@EAC:-B?7EC>CDCDDGCEEAAGCF@GEFDFDBDA@CBDF?E@E@?EA@CBBECBCCDCDEA@ECD?EBCBDBAC@?BBDE@CCCF>D>C@ NM:i:0 MD:Z:151 AS:i:151 XS:i:128 SA:Z:chr1,2650066,-,151M,43,5; XA:Z:chr1,-2650066,151M,5;chr1,-2650148,151M,6;

Is there known logic that determines whether a supplementary record is output for alt hits?

Thanks!