
Duplex Sequencing

This version is no longer maintained. We encourage you to download the latest version of this software here.

Duplex Sequencing software, version 3.0 (July 11, 2016)
Programs by Scott Kennedy (1)
(1) Department of Pathology, University of Washington School of Medicine, Seattle, WA 98195

It is strongly recommended to use UnifiedConsensusMaker.py, as it is faster and easier to use. The version outlined in Kennedy et al. (2014) can be found in the Nat_Protocols_Version directory.

Glossary

Single Stranded Consensus Sequence (SSCS)

A construct created by comparing multiple reads and deciding ambiguities by simple majority.

Duplex Consensus Sequence (DCS)

A construct created by comparing two SSCSs.

Family

A group of reads that share the same tag sequence.

Summary of process

UnifiedConsensusMaker.py transforms an unaligned *.bam file, derived from data obtained on an Illumina sequencing run, into two final *.fastq files that have been collapsed into their final Duplex Consensus Sequences (DCS). The workflow also generates a file listing every tag that is present and how many times it occurred, as well as a file containing SSCSs that did not have a partner strand and were unable to make a DCS (extraConsensus.bam).

Dependencies

The following programs and packages must be installed on your computer.

Package Written with version
Samtools >=1.3.1
Python 2.7.X
Pysam >=0.9.0
MatPlotLib >=1.5.1 (optional)

Input

UnifiedConsensusMaker.py takes an unaligned bam file generated by Picard FastqToSam. Picard versions >2.2.X are known to work, and earlier versions are likely to work as well.
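The unaligned input can be generated with a command along these lines (a sketch only; the picard.jar path and file names are placeholders). The paired-end arguments F1, F2, O, and SM are the ones users report using in the issues below:

```shell
# Convert paired-end FASTQ files to an unaligned bam for UnifiedConsensusMaker.py
java -jar picard.jar FastqToSam \
    F1=read_1.fastq \
    F2=read_2.fastq \
    O=unaligned_bam_file.bam \
    SM=sample_name
```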

Usage

python UnifiedConsensusMaker.py --input unaligned_bam_file.bam --prefix name

Default values are in brackets

Arguments: -h, --help show this help message and exit

--input IN_BAM Path to unaligned, paired-end, bam file.

--taglen TAG_LEN Length in bases of the duplex tag sequence.[12]

--spacerlen SPCR_LEN Length in bases of the spacer sequence between duplex tag and the start of target DNA. [5]

--tagstats Output tagstats file

--minmem MINMEM Minimum number of reads allowed to comprise a consensus. [3]

--maxmem MAXMEM Maximum number of reads allowed to comprise a consensus. [200]

--cutoff CUTOFF Percentage of nucleotides at a given position in a read that must be identical in order for a consensus to be called at that position. [0.7]

--Ncutoff NCUTOFF With --filt 'n', maximum fraction of Ns allowed in a consensus [1.0]

--write-sscs Print the SSCS reads to file in FASTQ format [False]

--without-dcs Don't print final DCS reads [False]

--rep_filt REP_FILT Remove tags with homomeric runs of nucleotides of length x. [9]

--prefix PREFIX Sample name used as a prefix on the output files, to uniquely identify samples. [None]

Required arguments are --input and --prefix.
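The effect of --cutoff can be illustrated with a simplified majority-consensus sketch (not the program's actual code; `call_consensus` is a hypothetical helper):

```python
from collections import Counter

def call_consensus(reads, cutoff=0.7):
    """Simple majority consensus across equal-length reads.

    At each position the most common base is emitted only when its
    frequency meets the cutoff; otherwise an 'N' is written.
    """
    consensus = []
    for bases in zip(*reads):
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count / float(len(bases)) >= cutoff else "N")
    return "".join(consensus)

# Three reads agreeing everywhere except the last position:
print(call_consensus(["ACGT", "ACGT", "ACGA"]))  # ACGN (2/3 < 0.7)
```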

Data Outputs

The default output is two fastq files containing the final DCS reads, with the read 1 and read 2 files in register for proper paired-end analysis. The header line of each entry consists of the tag in alpha/beta or beta/alpha format (see Schmitt et al. 2012 and Kennedy et al. 2014 for details), such that the first half of the tag is always alphanumerically "less than" the second half. The third line indicates the number of reads that make up each SSCS that goes on to form the final DCS. UnifiedConsensusMaker.py also calculates the final quality score for each position based on the quality scores of the underlying raw reads. If a quality score is found to be >Q40 (almost all are), it is capped at 'J', the highest quality score currently supported by Illumina.
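The quality-capping rule described above can be sketched as follows (an illustration of the capping behavior, not the program's actual code):

```python
def cap_phred(qual_scores, max_q=41):
    """Clamp Phred scores at max_q and encode them as Phred+33 ASCII.

    chr(41 + 33) == 'J', so capped positions print as 'J'.
    """
    return "".join(chr(min(q, max_q) + 33) for q in qual_scores)

print(cap_phred([30, 40, 41, 75]))  # ?IJJ -- the last two positions are capped
```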

If --write-sscs is invoked, two additional fastq files are produced, one for each read. These reads are in register for paired-end sequencing and can be aligned directly using the aligner of your choice. As with the DCS reads, the quality score is calculated from the raw reads and capped at a maximum value of Q40.

If --tagstats is invoked, plots of the family-size distribution and of the correlation between the family sizes that make up each DCS are generated. This option requires that MatPlotLib be installed.

Contributors

bkohrn, jorgeboucas, mwschmit, nh13, nordhuang, scottrk, susinmotion, tsibley


Issues

Is the code for corrected_qual_score on line 230 correct?

Line 230 in UnifiedConsensusMaker.py reads:
corrected_qual_score = map(lambda x: x if x < 41 else 41, qual_dict['ba:1'])

Should it have been (for the second ba read, ba:2):
corrected_qual_score = map(lambda x: x if x < 41 else 41, qual_dict['ba:2'])
?

tag_to_header.py / alignment error

Dear loeblab,

I am following your duplex sequencing bioinformatics pipeline and have encountered an error when trying to align the output from the tag_to_header.py script.

At first, I got an error with the script because my read headers did not contain a "/". I therefore changed them so that a "/" would split the string right before the index sequence. For example:

initial read header:
@HWI-ST892:245:C41KRACXX:6:1101:1898:2153 1:N:0:CGATGTAT

new read header:
@HWI-ST892:245:C41KRACXX:6:1101:1898:2153 1:N:0/CGATGTAT

I then used these modified fastq files as input into the tag_to_header.py script and it seemed to work, generating the following output for the first read pair as an example (these are in separate files):

@HWI-ST892:245:C41KRACXX:6:1101:1898:2153_1:N:0|AAAAGCCGCGCC/CGATGTAT

@HWI-ST892:245:C41KRACXX:6:1101:1898:2153_2:N:0|AAAAGCCGCGCC/CGATGTAT

However, when I try to align the reads to create a SAM file using these files, I get the following error:

paired reads have different names: "HWI-ST892:245:C41KRACXX:6:1101:1898:2153_1:N:0|AAAAGCCGCGCC/CGATGTAT", "HWI-ST892:245:C41KRACXX:6:1101:1898:2153_2:N:0|AAAAGCCGCGCC/CGATGTAT"

However, when I align the read pairs without using the tag_to_header.py script, using just the raw original fastq files, I do not encounter this error and the alignment proceeds without error.

Do you have any suggestions about why this isn't working? I would be very grateful.

Many Thanks.

How to account for upstream sample barcodes?

How would a region upstream of the random 12-nucleotide barcode be accounted for? e.g., sample barcodes. The following sequence is how I set it up:

5' P5 region, Read 1, Sample Barcode, Universal Primer, Random Barcode, Spacer, TARGET SEQUENCE, Spacer, Random Barcode, Universal Primer, Sample Barcode, Read 2*, P7 region 3'

In other words, I have 72 bases upstream of the Random Barcode
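One possibility (an untested sketch, not a supported feature of the pipeline) is to trim the fixed upstream region before consensus making, so that each read begins with the random barcode; the 72-base offset comes from the layout described above:

```python
def trim_upstream(seq, qual, offset=72):
    """Drop a fixed-length upstream region (sample barcode + universal
    primer) so each read starts at the random duplex barcode."""
    return seq[offset:], qual[offset:]

# Toy example using a 5-base upstream region instead of 72:
seq, qual = trim_upstream("XXXXXACGTACGTACGT", "#####IIIIIIIIIIII", offset=5)
print(seq, qual)  # ACGTACGTACGT IIIIIIIIIIII
```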

mut-position.py

Hi,
I am running the SRR1799911.sra data downloaded from NCBI. After the mut-position.py step, I get the final result (file name: SRR1799911.dcs.d100-c0-0.001.mutpos), but it is difficult to understand the result file. Would you explain the meaning of the file? The first 5 lines of the file are shown below. Thank you!
1 G 169796677 104 0 0 0 0 0 0 0 0
1 C 169796678 106 0 0 0 0 0 0 0 0
1 T 169796679 108 0 0 0 0 0 0 0 0
1 G 169796680 109 0 0 0 0 0 0 0 0
1 A 169796681 108 0 0 0 0 0 0 0 0

problem with muts_by_read_position.py

Hi,
I am getting an error with the script. I have checked the size of the reads and they are all the same length (I filter the length to 133).

Error:
File "~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 253, in <module>
main()
File "~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py", line 205, in main
counter.reads[readNum-skips].addN()
IndexError: list index out of range

My commands are:
samtools mpileup -B -A -d 500000 -f ~/refs/hifi.fa DCS-final.bam > DCS-final.bam.mpileup

python2.7 ~/apps/Duplex-Sequencing-master/Nat_Protocols_Version/muts_by_read_position.py -i DCS-final.bam.mpileup -l 133 -o test -C 0.3

The protocol I use to get to generate the input bam file is more or less what is in the Nat Protocols paper, with the only exceptions being:

  1. I align with bwa mem
  2. I filter the reads with VariantBam to keep only 133 bp reads after running UnifiedConsensusMaker.py. This is because, if I don't, I get ~200 reads that are 30-70 bp (mostly 49 bp) compared with ~100,000 that are 133 bp, and I am not sure whether they are inflating the mutation rate.

I think this script is very useful to know if I am clipping enough bases on either end, so it would be really great to get it to work!

AttributeError with pysam

When running the UnifiedConsensusMaker.py on an unaligned bam-file I get the following error:

Parsing tags...
Traceback (most recent call last):
File "CapSeq_muts/UnifiedConsensusMaker.py", line 326, in <module>
main()
File "CapSeq_muts/UnifiedConsensusMaker.py", line 140, in main
temp_bam_entry.set_tag('X?', temp_read1_entry.query_name, 'Z')
AttributeError: 'pysam.calignmentfile.AlignedSegment' object has no attribute 'set_tag'

I have created the unaligned bam-file as specified with FastqToSam using Picard-2.8.3 (only using the arguments required for paired end: F1, F2, O, and SM).
I have the following dependencies installed:
Samtools-1.3.1
Pysam-0.10.0
I ran the Python script using only the required arguments, --input and --prefix.

Do you have any idea what caused the error and how I might resolve the issue?

Error running CountMuts.py

Dear Loeb Lab,

I am trying to run the CountMuts.py script on my pileup file as produced by Samtools pileup, using the exact commands indicated in the 2014 Nature Protocols publication. I encounter the following error each time:

Traceback (most recent call last):
File "CountMuts.py", line 254, in <module>
main()
File "CountMuts.py", line 250, in main
CountMutations(o, f, fOut)
File "CountMuts.py", line 129, in CountMutations
elif (float(linebins[4].count('N'))/(float(depth) + float(linebins[4].count('N')))) > o.n_cutoff:
ZeroDivisionError: float division by zero

I have tried changing the cut-off values for the % clonality and depth (e.g. -d 10 -C 0.1) and encounter the same error.

Can you suggest any solutions to this problem?

Many Thanks.
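The ZeroDivisionError above occurs when a pileup line has depth 0 and no Ns, so the denominator of the N-fraction test is zero. A possible guard (a hedged sketch for illustration, not an official patch to CountMuts.py) is to skip such zero-coverage lines:

```python
def n_fraction(depth, n_count):
    """Fraction of Ns at a pileup position.

    Returns None for zero-coverage lines -- the case where the
    original expression divides by zero.
    """
    total = depth + n_count
    if total == 0:
        return None  # no coverage at all: skip this line
    return float(n_count) / total

print(n_fraction(0, 0))  # None
print(n_fraction(8, 2))  # 0.2
```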

How to set --readlength while using ConsensusMaker.py

Hi,

Could I please ask you a question here?

After all the preceding steps, my "PE.sort.bam" includes reads of different lengths.

e.g., samtools view PE.sort.bam | cut -f 6 | sort -u
*
100M
100M1D17M
100M1D18M
etc.

Since the length of the reads is not fixed, how should I set --readlength when I run ConsensusMaker.py?

I appreciate your help.

Best,
Emily

Fastq file has read bases as N

Hi,

I am running the UnifiedConsensusMaker.py on the unaligned bam file generated by Picard FastqToSam.
I used --taglen 3 because the UMI is the first three bases of the read for both read 1 and read 2. The actual data starts at base 6 for both reads, so I used --spacerlen 2.

I used this command to run:
python ~/Git_Repos/Duplex-Sequencing/UnifiedConsensusMaker.py --input $name.unaligned.bam --prefix $name --taglen 3 --spacerlen 2

I get three outputs from running this command: a $name_temp.sort.bam file and 2 fastqs. My question is: is this the bam file after collapsing? If so, I can see all the reads in this bam. However, my fastq 1 and 2 both have Ns as the bases in my reads, and their sizes are very small (~2015 reads), whereas my bam is quite large (~2.2 GB). Is there something that I am doing wrong? Can you please help me fix the issue?

Thank you.
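The tag/spacer layout described in this issue (3-base UMI, 2-base spacer, insert starting at base 6) can be sketched as follows (an illustration of how --taglen and --spacerlen partition a read, not the program's code):

```python
def split_read(seq, taglen=3, spacerlen=2):
    """Partition a raw read into (tag, spacer, insert)."""
    tag = seq[:taglen]
    spacer = seq[taglen:taglen + spacerlen]
    insert = seq[taglen + spacerlen:]
    return tag, spacer, insert

# With taglen=3 and spacerlen=2 the insert starts at (1-based) base 6:
print(split_read("ACGTTGGCCAATT"))  # ('ACG', 'TT', 'GGCCAATT')
```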

Read name issues in ConsensusMaker.py and DuplexMaker.py

In the SRR1613972 dataset, when running the duplex seq pipeline, I get the same duplex barcode across multiple duplexes, which silently leads to inconsistent results.

  1. DuplexMaker.py could write the same read name to the final DCS BAM file. I have fixed that here: 93109ae
  2. ConsensusMaker.py could produce SSCS's with the same name.
ACCTAAGCGGATACGAGTGCGCAT:2:6    163 chr1    24615001    255 84M =   24615169    252 GTCGGTTTGATGTTACTGTTGCTTGATTTAGTCGGCCTGGGATGGCATCAGTTTTAAGTCCTAGGGAGGGGACTGCTCATGAGT    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACGAGTGCGCATACCTAAGCGGAT:1:6    99  chr1    24615001    255 84M =   24615169    252 GTCTGTTTGATGTTACTGTTGCTTGATTTAGTCGGCCTGGGATGGCATCAGTTTTAAGTCCTAGGGAGGGGACTGCTCATGAGT    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACCTAAGCGGATACGAGTGCGCAT:1:6    83  chr1    24615169    255 84M =   24615001    -252    GTTCACCAGGTTTTAGGTCGTTTGTTGGGATTATATATGAATCAAAGCATAGGTCTTCATAGTCAGTATATTCGTAGCTTCAGT    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACGAGTGCGCATACCTAAGCGGAT:2:6    147 chr1    24615169    255 84M =   24615001    -252    GTTCACCAGGTTTTAGGTCGTTTGTTNGGATTATATATGAATCAAAGCATAGGTCTTCATAGTCAGTATATTCGTAGCTTCAGT    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACCTAAGCGGATACGAGTGCGCAT:1:13   99  chrM    7326    255 84M =   7494    252 ACTGAAGCTACGAATATACTGACTATGAAGACCTATGCTTTGATTCATATATAATCCCAACAAACGACCTAAAACCTGGTGAAC    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACCTAAGCGGATACGAGTGCGCAT:2:13   147 chrM    7494    255 84M =   7326    -252    ACTCATGAGCAGTCCCCTCCCTAGGACTTAAAACTGATGCCATCCCAGGCCGACTAAATCAAGCAACAGTAACATCAAACCGAC    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

When fed into DuplexMaker.py, I would have expected one paired duplex and one unpaired duplex. Instead I get only one unpaired duplex. I think this is because DuplexMaker.py only uses the duplex barcode in the read name, assuming a duplex barcode is unique to a duplex.

consensus_maker.py not identifying first read in sorted BAM

Every time I run the Consensus_Maker.py script, it drops exactly one read from the sorted bam file. When I traced this error using the tagcounts file and the corresponding sam file, I found that it always misses the first read in the (sorted) sam file (and so the first read in the corresponding sorted bam file as well). I am not a strong Python coder, so I am not sure where or why this is occurring, but I am nearly positive it is not my use of samtools view or sort in the intervening steps between bwa sampe and Consensus_Maker.py.

Error with PE_BASH_MAKER.py

I was getting this error when running the PE_BASH_MAKER.py

Traceback (most recent call last):
File "PE_BASH_MAKER.py", line 183, in <module>
main()
File "PE_BASH_MAKER.py", line 131, in main
inBash = open(spath + "/bash_template.sh", "r")
IOError: [Errno 20] Not a directory: 'PE_BASH_MAKER.py/bash_template.sh'

I changed line 131 by erasing the "spath +" and it ran fine. Let me know if by doing that I messed anything up.

Thank you

Where can I find the files "read_1.fq" and "read_2.fq"?

I'm a student reading the paper "Detecting ultralow-frequency mutations by Duplex Sequencing" and trying to redo the bioinformatic processing in the paper.
I have installed bwa and python on Linux, but tag_to_header.py needs the file "read_1.fq" to run.
Can you provide the fq files so that I can run the program? Thanks.

IndexError: list index out of range

I have downloaded SRR1613972.sra from http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR1613972

after fastq-dump -I --split-file SRR1613972.sra

I run tag_to_header.py and I get the following error.

Thanks for your support!

python tag_to_header.py --infile1 SRR1613972_1.fastq --infile2 SRR1613972_2.fastq --outfile1 read_1.fq.smi --outfile2 read_2.fq.smi --barcode_length 12 --spacer_length 5

Traceback (most recent call last):
File "tag_to_header.py", line 196, in <module>
main()
File "tag_to_header.py", line 163, in main
read1.name = hdrRenameFxn(read1.name, tag1, tag2)
File "tag_to_header.py", line 121, in hdrRenameFxn
return("%s|%s%s/%s" % (x.split("/")[0], y, z, x.split("/")[1]))
IndexError: list index out of range
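The IndexError arises because hdrRenameFxn assumes the header contains a "/": when it does not, x.split("/") has only one element, and x.split("/")[1] fails. A minimal reproduction, plus a guarded variant (the guard is an illustrative workaround, not part of the released script):

```python
def hdr_rename(x, y, z):
    """Same format string as hdrRenameFxn; fails when x has no '/'."""
    return "%s|%s%s/%s" % (x.split("/")[0], y, z, x.split("/")[1])

def hdr_rename_safe(x, y, z):
    """Tolerate headers that lack a '/' (illustrative workaround)."""
    parts = x.split("/")
    if len(parts) < 2:
        return "%s|%s%s" % (x, y, z)
    return "%s|%s%s/%s" % (parts[0], y, z, parts[1])

try:
    hdr_rename("HWI-ST892:245 1:N:0:CGATGTAT", "AAAAGC", "CGCGCC")
except IndexError:
    print("IndexError: list index out of range")  # no '/' in the header

# With a '/' present, the output matches the format shown earlier:
print(hdr_rename_safe("HWI-ST892:245 1:N:0/CGATGTAT", "AAAAGC", "CGCGCC"))
# HWI-ST892:245 1:N:0|AAAAGCCGCGCC/CGATGTAT
```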

ConsensusMaker.py for Python3

Thanks for the very useful code.
I changed two parts of ConsensusMaker.py to make it run on Python 3.
The two changes are:
1)
xrange --> range
2)
line 286:
read_window = [bam_entry.next(), ''] # Get the first read
--> read_window = [next(bam_entry), ''] # Get the first read

ConsensusMaker.py

Hi,
I am running the SRR1799908.sra data downloaded from NCBI. In the ConsensusMaker.py step, I encounter the following error each time.

Traceback (most recent call last):
File "/home/jyc/zwl/soft/Duplex-Sequencing-master/ConsensusMaker.py", line 428, in <module>
main()
File "/home/jyc/zwl/soft/Duplex-Sequencing-master/ConsensusMaker.py", line 354, in main
if (consensus.count("N" )/ len(consensus) <= o.Ncutoff and 'n' in o.filt) or ('n' not in o.filt):
ZeroDivisionError: integer division or modulo by zero

Consensus maker

Hi,
I am running the ConsensusMaker script as part of the duplex pipeline, but I do not see the .UP.bam file or the tagcounts file being generated. I need to know why these files are absent. I do have the LCC.bam, NM.bam, and DCS.bam files.
Thank you,

Arnavaz

UnifiedConsensusMaker.py IOError

When I use UnifiedConsensusMaker.py to make a consensus from the test reads SRR1613972, I get this error message while the program is sorting reads on tag sequence:

Traceback (most recent call last):
  File "./Duplex-Sequencing/UnifiedConsensusMaker.py", line 326, in <module>
    main()
  File "./Duplex-Sequencing/UnifiedConsensusMaker.py", line 170, in main
    in_bam_file = pysam.AlignmentFile(o.prefix + '.temp.sort.bam', "rb", check_sq=False)
  File "pysam/calignmentfile.pyx", line 311, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4929)
  File "pysam/calignmentfile.pyx", line 480, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:6905)
IOError: file `USRR2.temp.sort.bam` not found

How can I solve this problem?
When I use UnifiedConsensusMaker.py to make a consensus, do I need to use SRAFixer.py to fix the reads of SRR1613972? Do I need to use tag_to_header.py to modify the reads of SRR1613972?

When I do Duplex Sequencing to find ultralow-frequency mutations, how high should the sequencing depth be?

ConsensusMaker.py error

Hi, I'm following the pipeline to do the analysis, but when I run ConsensusMaker.py there is always an error:
ConsensusMaker.py: error: argument --tag_file is required

I can't understand it; I clearly set this parameter. Why does this error occur? Isn't the tag_file an output file?
