Code Monkey home page Code Monkey logo

ectools's Introduction

Long Read Correction and other Correction tools

This package is a loose collection of scripts. To run the correction
routine see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: [email protected]

###
#Correction
###

In short, the correction algorithm takes as input the unitigs from a short
read assembly and uses them to correct long read data.
More background information for the algorithm can be found:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Running Correction.

ECTOOLS_HOME=/path/to/this/directory

1. Create Celera Unitigs from Illumina (or other high identity) data
   
   Output: organism.utg.fasta


2. Create a new directory to house the correction
   
   $> mkdir organism_correct

3. Simlink your organism.utg.fasta into the correction dir

   $> cd organism_correct 
   $> ln -s /path/to/organism.utg.fasta

3. Generally we only correct reads greater than 1kb. Filter out shorter
   reads if possible. If you have it, 20-30x of raw long read coverage
   is preferred so that you have ~20x remaining after correction.

4. Partition pacbio reads into small batches

   $> python ${ECTOOLS_HOME}/partition.py 20 500 pbreads.legnth_filtered.fa 
   
   Output: Series of directories and a ReadIndex.txt which maps 
   reads to partitions 
   
   *Note: You probably only want a few reads 
   per file (20 is usually good)

5. Copy correction script to working dir
   
   $> cp ${ECTOOLS_HOME}/correct.sh .

6. Modify the global variables a the top of correct.sh

7. Ensure Nucmer suite is in your PATH
   
   $> nucmer

8. Launch Correction The correct.sh script should be run in each of
   the partition directories. The number of files in directory should
   be specified as the array parameter to grid engine.  This simple
   for loop works well:
   
   $> for i in {0001..000N}; do cd $i; qsub -cwd -j y -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done
   
   Where N is the number of partitions (directories) and
   NUM_FILES_PER_PARTITION is the number of files per partition
   specified in the partitions script (500 in this case)


9. Wait for correction to complete.  Concatenate all corrected output
   files from all parititons into a single fa file:
   
   $> cat ????/*.cor.fa > organism.cor.fa


10. Run Celera on the output file.

   *Notes: 
   Use convert-fasta-to-v2.pl to make celera frg file from
   organism.cor.fa 

   example:
   $> convert-fasta-to-v2.pl -l organism_pbcor -s organism.cor.fa \
      -q <( python ${ECTOOLS_HOME}/qualgen.py organism.cor.fa) \
      > organism.cor.frg
   

   Celera's default parameters work pretty well with
   corrected PB data, you can also drop the utgGraphErrorRate to 0.01.  

   Make sure you use OverlapBasedTrimming (it is on by default).
   

11. (Optional)
If SGE is not available in your compute environment, you can use
GNU parallel to run ECTools. Although not the recommended method,
as alignment can be very compute intensive, for small genomes 
(bacteria), this method can be used on a single machine.

For each of the directories created by the partition script (0001..000N),
cd into the directory and run:

   $>for j in {1..500}; do echo "SGE_TASK_ID=$j TMPDIR=/tmp ../correct.sh"; done  | \
     parallel -j <# of compute cores>


###
#Script Descriptions
###

@@ schtats @@
Calculates read length statistics for sequence files.


@@ simulate_reads.py @@
Simple read simulator. Takes a genome and a file of line
separated read lengths. A read is produced for each given read
length at the specified error rate.

@@ gccontent.py @@
Builds an ascii graph of gc content for each item in a fasta file

@@ io.py @@
Has various parsers for parsing delta files

@@ join.py @@
general purpose script to do database like join on flat file.

@@ group_by.py @@
general purpose script for grouping sorted line oriented data
by a column.

@@ filter.py @@
general purpose script for filtering line separated data.
Takes a database and uses it to filter the query file.

@@ gc_count.py @@
summarizes gc content from a set of alignments

@@ fix_delta_header.py @@
Because the correct.sh writes all output to a temporary directory in
SGE and the copies it back to the original working directory, delta
files will have the header of the temporary directory. This script
just fixes the header so that downstream mummer tools work after
the pipeline has completed. 


                   

ectools's People

Contributors

jgurtowski avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.