Code Monkey home page Code Monkey logo

splitreader's Introduction

SPLITREADER

SPLITREADER is a bioinformatic pipeline dedicated to the discovery of non-reference TE insertions with Target Site Duplications (TSDs) through the use of Illumina short genome sequence reads and a reference genome.

Table of contents

Table of contents generated with markdown-toc

Installation

You can either download the code by clicking the ZIP link on this webpage or clone the project using:

git clone https://github.com/LeanQ/SPLITREADER.git

Dependencies

SPLITREADER requires the following softwares:

SPLITREADER main steps

SPLITREADER pipeline consists in four steps:

1- Extraction of reads not mapping to the reference genome (using SAM flag 4)

2- Forced mapping to a collection of reference TE sequences (constructed using prepare_TEs.sh as indicated below). SPLITREADER then identify all reads with one end (>=20nt) mapping to a TE extremity (by locating reads where the CIGAR string starting or ending with 'S' character with a value equal or greater to 20)

3- Reads are then recursively soft clipped by 1nt from one end and mapped to the reference genome using Bowtie2 until the soft-clipped read length reached 20nt.

4- Read clusters composed by reads clipped from the same extremity and overlapping with read clusters composed of reads clipped from the other extremity were taken to indicate the presence of a bona fide TE insertion only if the size of the overlap was equal or less than 2-fold longer than that reported for the corresponding TE family.

Input data

This pipeline requieres as an input:

  1. A BAM file of reads mapped to the reference genome

  2. A tab-delimited file containing TE coordinates across the genome and the expected size for the TSDs. The file should have the following format:

    • Chromosome name (should be the same nomenclature as in the reference genome fasta file)
    • TE start
    • TE end
    • TE_ID
    • TSD

For instance as indicated in the ./Test_data/TE_list/test_TE_coordinates.bed file:

cat test_TE_coordinates.bed
#Chr start end  TE_ID TSD
Chr1 21524995 21525295 TE1  5
Chr1 21529551 21529851 TE1  5
Chr3 13369174 13369474 TE1  5
Chr3 13373808 13374108 TE1  5
Chr3 22059535 22059835 TE1  5
Chr3 22064029 22064329 TE1  5
Chr3 22695566 22695866 TE1  5
Chr3 22700222 22700522 TE1  5
Chr5 4208083 4208383 TE1 5
Chr5 4212784 4213084 TE1 5

NOTE: Better performance is obtained if the expected size for the TSDs are provided. If no TSD is provided, it will be set to 3.

Usage

Prepare index files for reference genome

Use the test data in order to quickly test the scripts. To do so, one needs to download the TAIR10 Arabidopsis thaliana fasta file and generate the bowtie2 indexes.

# Download the fasta file
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas

# Add Chr as prefix for each chromosome
sed -i 's/>\([1-5]\)/>Chr\1/' TAIR10_chr_all.fas

# Index the file with bowtie2 (use prefix TAIR10 for the index files)
bowtie2-build TAIR10_chr_all.fas TAIR10

The generated index files should be:

TAIR10.1.bt2
TAIR10.2.bt2
TAIR10.3.bt2
TAIR10.4.bt2
TAIR10.rev.1.bt2
TAIR10.rev.2.bt2

Edit the configuration file with the correct path to the dependencies SPLITREADER_configuration_file_example.txt.

Prepare TEs

Before runing TE detection, one must create a Bowtie2 index for each input TE.

bash prepare_TEs.sh -i <Input TE coordiantes described above>
	-g <Genome.fa>
	-d <directory used to store created TE indexes. If it contains a collection of indexed TE, the script will check if the index already exists>
	-p </path/to/bowtie2-build> 

In our example, it will be:

bash prepare_TEs.sh -i ./Test_data/TE_list/test_TE_coordinates.bed \
	-g ./TAIR10.fa \
	-d ./Test_data/TE_indexes  \
	-p /path/to/bowtie2-build

Detect TE insertions

Once the TEs are indexed, one can edit the configuration file and procceed with the discovery of non-reference TE insertions using the main script:

bash SPLITREADER-beta1.2.sh -c SPLITREADER_configuration_file.txt

Output

SPLITREADER outputs the location of the detected non-reference TE insertions and a BAM file for visualisation purposes. The output file have the following format:

- Chromosome name
- Insertion start
- Insertion end
- Reconstructed TSD size
- # Reads supporting the 5'end of the TE insertion 
- # Reads supporting the 3'end of the TE insertion 

For our example output, the expected output files are:

test-TE1-insertion-sites.bed
test-TE1-split.bam
test-TE1-split.bam.bai

And the content of test-TE1-insertion-sites.bed should be:

cat test-TE1-insertion-sites.bed
Chr1 23655713 23655723 TE1  9  8 3
Chr1 23655713 23655723 TE1  9  8 4

Cite

If you use this software, please cite:

The Arabidopsis thaliana mobilome and its impact at the species level. Quadrana L, Bortolini Silveira A, Mayhew GF, LeBlanc C, Martienssen RA, Jeddeloh JA, Colot V. Elife. 2016 Jun 3;5. pii: e15716. doi: 10.7554/eLife.15716. PubMed PMID: 27258693. (https://elifesciences.org/content/5/e15716)

splitreader's People

Contributors

leanq avatar johanzi avatar

Stargazers

 avatar  avatar Snowseed avatar Ivy avatar Arya avatar  avatar chenziru avatar Shuai Chen avatar Manyi Sun avatar Shangzhe Zhang avatar Michael Alonge avatar  avatar Wenbin Mei avatar Rahul Pisupati avatar leze avatar TOM YAN avatar  avatar  avatar LMann avatar Songtao Gui avatar balabala avatar Trevor Tanner avatar  avatar

Watchers

James Cloos avatar

splitreader's Issues

Ask for help

I used your test data, but I couldn't get the results in the last step.

code: bash SPLITREADER-beta1.2.sh -c SPLITREADER_configuration_file_example.txt

Error:
[E::hts_open_format] Failed to open file "/data1/home/mingzhang/SPLITREADER/SPLITREADER/Tmptest/QD-64848/results-64848/test-TE1.sam" : No such file or directory
samtools view: failed to open "/data1/home/mingzhang/SPLITREADER/SPLITREADER/Tmptest/QD-64848/results-64848/test-TE1.sam" for reading: No such file or directory

I'm pretty sure I wrote the right path of samtools.

I wonder if you could give me some advice.
Thanks a lot!
--Ming

Overlap reads between $BAMname-$TE-split-5+.bam and $BAMname-$TE-split-local-down.bam

It seems that reads in $BAMname-$TE-split-local-down.bam with more (LENGTH-20)/2 hard mapping should be contained in $BAMname-$TE-split-5+.bam.

(Take "SRR492382.16324611/1" for example)

So there are duplicate reads in $BAMname-$TE-up.bam which would influence the calculation of depth of $BAMname-$TE-up.bam. It will also influence the calculation of "($4+$8)>=rea "(depth of up.banm + depth of down.bam) when deciding the insertion of TE.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.