Code Monkey home page Code Monkey logo

capsim's Introduction

capsim

Capsim is a tool to simulate capture sequencing. It was developed as part of the Japsa package.

Quick installation guide:

On a Linux/Mac machine with make' and git' installed, the software can be installed with

git clone https://github.com/mdcao/japsa
cd japsa
make install \
  [INSTALL_DIR=~/.usr/local \] 
  [MXMEM=7000m \] 
  [SERVER=true \]

Details of installation (including for Windows) and usage of Japsa can be found in its documentation hosted on ReadTheDocs

Usage

Before running capsim, the probe sequences need to be aligned to the reference sequence (or the genome sequence to simulate sequencing). We recommend using bowtie2 for the alignment.

#Skip this step if the index has been generated
bowtie2-build ref.fasta ref

#Align the probes into the reference
bowtie2 --local --very-sensitive-local --mp 8 --rdg 10,8 --rfg 10,8 -k 10000 -f -x ref -U probes.fa -S probes.sam

Note: for some reason, bowtie2 only accepts the query fasta file (probes.fa) containing one sequence per line.

After alignment, sort the and index the bam file with samtools

 samtools view -bSU probes.sam | samtools sort -o probes.bam -

Capsim takes the bam file as the input:

jsa.sim.capsim --reference ref.fasta --probe probes.bam --ID someid --fmedian 5000  --pacbio output --pblen 3000 --num 20000000

or

jsa.sim.capsim --reference ref.fasta --probe probes.bam --ID someid --fmedian 500  --miseq output --illen 300 --num 20000000

Options,

      --reference=s                    Name of genome to be
                                       (REQUIRED)
      --probe=s                        File containing probes mapped to the reference in bam format
                                       (default='null')
      --logFile=s                      Log file
                                       (default='-')
      --ID=s                           A unique ID for the data set
                                       (default='')
      --miseq=s                        Name of read file if miseq is simulated
                                       (default='null')
      --pacbio=s                       Name of read file if pacbio is simulated
                                       (default='null')
      --fmedian=i                      Median of fragment size at shearing
                                       (default='2000')
      --fshape=d                       Shape parameter of the fragment size distribution
                                       (default='6.0')
      --smedian=i                      Median of fragment size distribution
                                       (default='1300')
      --sshape=d                       Shape parameter of the fragment size distribution
                                       (default='6.0')
      --tmedian=i                      Median of target fragment size (the fragment size of the data).
                                       If specified, will override fmedian and smedian.
                                       Othersise will be estimated
                                       (default='0')
      --tshape=d                       Shape parameter of the effective fragment size distribution
                                       (default='0.0')
      --num=i                          Number of fragments
                                       (default='1000000')
      --pblen=i                        PacBio: Average (polymerase) read length
                                       (default='30000')
      --illen=i                        Illumina: read length
                                       (default='300')
      --ilmode=s                       Illumina: Sequencing mode: pe = paired-end, mp=mate-paired and se=singled-end
                                       (default='pe')
      --seed=i                         Random seed, 0 for a random seed
                                       (default='0')
      --help                           Display this usage and exit
                                       (default='false')

CapSim will output sequence reads in fastq format. Users can perform subsequenct analysis by aligning the simulated reads to the reference genome.

Off-target analysis

A script file off_target_probes.sh used to identify off target probes is provided in this repository. To run this script file,

bash off_target_probes.sh -b target_regions.bed -r ref.fasta -a cap_sim.bam -w 1000 -d 10000 -x 500 -q probes.txt -t 4 -p out

where,

      -b/--target-bed                  Bed file of the target regions.
      -r/--reference                   Reference genome fasta file.
      -a/--bam                         Bam file of CapSim simulated reads aligned to reference
                                       genome.
      -w/--window-size                 Window size for statistics of the depth of coverage of
                                       the off target regions (default 1000).
      -d/--min-depth                   Minimum depth of base coverage of the off target regions
                                       to analyse (default 10000).
      -x/--padding-size                Extension/padding size to the up and downstream of the
                                       target regions (default 500).
      -q/--probe-seq                   Text file containing the probe ID and sequence.
      -t/--threads                     Number of threads for alignment (default 1).
      -p/--prefix                      Prefix of the output files (default ./out).

The following tools should be installed and added to system path.

The script file consists of 12 main commands which could be run step by step.

1. generate reference index file,

samtools faidx ref.fasta

2. generate genome file for bedtools,

awk '{print $1"\t"$2}' ref.fasta.fai > ref.genome

3. generate fasta file of probe sequences,

tail -n +2 probes.txt | awk '{printf(">%s:%s\n%s\n",$1,$2,$3)}' > probes.fa

4. generate target bed file with extra regions upstream and downstream,

bedtools slop -i target_regions.bed -g ref.genome -b 500 > target_regions_500bp.bed

5. generate a bed file of regions in a genome that are not covered by the target bed file,

bedtools complement -i target_regions_500bp.bed -g ref.genome | awk -v w=100 '{num=($3-$2)/w; for(i=0;i<num-1;i++) print $1"\t"($2+w*i)"\t"($2+w*(i+1)-1); if(w*int(num)!=$3-$2) print $1"\t"($2+w*int(num))"\t"$3;}' > target_regions_complement.bed

This file will be used to calculate the depth of coverage of the bam file across the off target regions. Window size other than 1Kb could be used here. A smaller window size will generally result in more precise statistics but will be more time-consuming. The window size could be changed by the awk parameter w.

6. generate a bam file of alignments which overlap with the bed file,

bedtools intersect -abam cap_sim.bam -b target_regions_complement.bed > out_complement.bam

7. generate an alignment index,

samtools index out_complement.bam out_complement.bai

8. generate a base coverage,

samtools bedcov target_regions_complement.bed out_complement.bam | awk -v d="10000" '{if ($4>d) {print $1"\t"$2"\t"$3"\t"}}' > out_off_target_regions.bed

The threshold of the base coverage filtering could be specified by the awk parameter d.

9. generate a fasta sequence file for off_target regions,

bedtools getfasta -fi ref.fasta -bed out_off_target_regions.bed -fo out_off_target_regions.fas

10. generate bwa index files for the off target region fasta file,

bwa index -a bwtsw out_off_target_regions.fas

11. align the probe sequences to the off target region fasta file,

bwa mem -t 4 -Y -c 1000 out_off_target_regions.fas probes.fa | samtools view -@4 -bS | samtools sort -@4 -o out_off_target_regions.bam && samtools index out_off_target_regions.bam out_off_target_regions.bai

12. extract the off target alignment probe sequences,

samtools view -F4 out_off_target_regions.bam | cut -f1 | sort -u > out_off_target_probes.txt

capsim's People

Contributors

devika1 avatar mdcao avatar

Stargazers

gudeqing avatar  avatar  avatar Alejandro Reyes avatar Ousman Mahmud avatar

Watchers

James Cloos avatar  avatar Chenxi Zhou avatar  avatar

capsim's Issues

No index is available for this BAM file

Introduction

I guess I might be doing some wrong but I am getting "No index is available for this BAM file" error.

File listing:

GCF_000001405.26_GRCh38_genomic.dict       GCF_000001405.26_GRCh38_genomic.fna.fai        output.fastq.gz
GCF_000001405.26_GRCh38_genomic.fna        GCF_000001405.26_GRCh38_genomic.fna.rev.1.bt2  probes.bam
GCF_000001405.26_GRCh38_genomic.fna.1.bt2  GCF_000001405.26_GRCh38_genomic.fna.rev.2.bt2  probes.fasta
GCF_000001405.26_GRCh38_genomic.fna.2.bt2  GRCh38_header.txt                              probes.sam
GCF_000001405.26_GRCh38_genomic.fna.3.bt2  japsa                                          unaligned_reads.bam
GCF_000001405.26_GRCh38_genomic.fna.4.bt2  jsa.sim.capsim

jsa.sim.capsim command that generates the error

./jsa.sim.capsim --reference GCF_000001405.26_GRCh38_genomic.fna --probe probes.bam --ID someid --fmedian 5000  --pacbio output --pblen 3000 --num 20000000
#Mark capturable regions
Exception in thread "main" htsjdk.samtools.SAMException: No index is available for this BAM file.
	at htsjdk.samtools.BAMFileReader.getIndex(BAMFileReader.java:408)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:924)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:584)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:528)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at japsa.tools.bio.sim.SimulateCaptureCmd.main(SimulateCaptureCmd.java:406)

read length setting cannot work

Hi,

I am using CapSim with the following commands:

jsa.sim.capsim --reference ucsc.hg19.fasta
--probe chr11_chr16_rescue_probe.bam
--ID test_id
--miseq test_miseq
--fmedian 500
--illen 100
--num 5000

log file is as follows:
[main] INFO japsa.tools.bio.sim.SimulateCaptureCmd - Mark capturable regions
[main] INFO japsa.tools.bio.sim.SimulateCaptureCmd - Generated 7661 selected 5000; reject1 = 2; reject2 = 340; reject3 = 2319; reject4 = 0
Parameters for simulation
reference = ucsc.hg19.fasta
probe = chr11_chr16_rescue_probe.bam
logFile = -
ID = test_id
miseq = test_miseq
pacbio = (null)
fmedian = 500
fshape = 6.0
num = 5000
pblen = 30000
illen = 100
seed = 0
help = false
#Seed 2019171277
Read 93 chr 3137161264bp

However, the simulated reads are all with a read length of 250. And it remains 250bp whatever I changed the illen to 50,100, et. al.

Below is a simulated reads of 250bp in length:
@test_id_chr16_244488_245154
GTACCTGGCTTAAATGCCATCCTTGAACATCACCTTAAAGGTGTCCATCCATGTGTGCCCTGCTTCAGACACAGGATGAGCATGGGTGAAAGTTTAGAGCTGGAGATACCAGTCACCCAGCTCTGCATACTTTCCGTGCCCTTGTGGGGAGCGTGCTCTCATCCACACTGGGTATGGAAGATTATCCTTATGCCATCGTGGCTCTCTGGATGGAGAAGACTCGTGATTCTTTGTCTTACTGGCTCTGACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.