Code Monkey home page Code Monkey logo

rhoterm-peaks's Introduction

Rho termination peak-caller

This tool was designed to call peaks at Rho termination regions. The tool identifies genomic regions where read coverage is significantly higher in the 'Treated' sample vs 'Untreated'.

Usage

Prerequisite: Python 3.x or higher, SciPy.

Required input is 2 tab-delimited files ('Untreated' and 'Treated') with genome coordinates, read coverage, strand information and no headers. Strand information can be represented as '+' or '-' in a separate column, or as a positive/negative read coverage value.

Input parameters must be first entered in the parameters file:

Rho-termination_peak-caller_Input-parameters.json
  • 'optional' parameters can be left blank and will default to current working directory; if no input directory is provided, input, parameters and python executable files need to be in the same folder;
  • Parameters and paths need to be in double quotation marks;
  • 'window' : Number of positions to include upstream and dowstream of each coordinate when adding read coverage;
  • 'single_peak_window' : Min distance between regions with significant difference in transcriptional readthrough when calling peaks;
  • 'p_value_threshhold' : Max p value from fisher's exact test to use for selecting transcriptional readthrough regions;
  • 'ratio_threshhold' : Min 1/(untreated/treated) ratio direction to use when calling peaks; e.g. ratio > 1 means treated sample has a higher read coverage;
  • 'genome_size' : Required to calculate transcriptional readthrough for each position and select the best peak in the region;
  • 'coordinate_column' : Column number in the tab-delimited input file that holds genome coordinate;
  • 'read_count_column' : Column number in the tab-delimited input file that holds read count number for a specific coordinate;
  • 'optional_strand_column' : Can be blank if read count column has negative values to represent minus strand;
  • 'optional_input_file_directory' : Can be left blank if input read count files are in the same directory as the program;
  • 'untreated_file_name' : Full file name with extension; must be tab-delimited;
  • 'treated_file_name' : Full file name with extension; must be tab-delimited;
  • 'optional_output_file_directory' : Can be left blank and will save output files to to current working directory;
  • 'optional_out_peak_file_name' : Full file name with extension; if left blank, output files will have an auto-generated file name with parameters; Reports just selected peaks with associated information;
  • 'optional_out_sum-score_file_name' : Summary file; Full file name with extension or if left blank, output files will have an auto-generate a file name with parameters; Reports all positions that meet significance/pvalue threshold and were used for peak calling and associated information.

Run python executable file.

The program takes ~4h to run with a 4.6 Mbp genome. It can be tested by setting the 'genome_size' to a lower value (e.g. 100,000) and using test .gff files that include read coverage up to the genome position 100,000. Test output files are also provided.

Algorithm

1. The program first calculates read coverage upstream and downstream of each genomic position from the 'Untreated' and 'Treated' samples in a given window size;

2. 'Rho score' (1/((untreated_ds/untreated_us)/(treated_ds/treated_us))) and 'Significance score' (p value from fisher's exact test) is then calculated using the four read read coverage values; if those scores/positions meet set thresholds, they will be used for peak calling and saved in a 'Summary' file;

3. 'Rho termination peak' is called as a single position with the highest 'Rho score' in a region. Minimal required distance between peaks should be set as 'single_peak_window'.

Output

  1. Summary file for all positions that met 'Rho score' and 'Significance score' thresholds represents clusters/regions of significant Rho termination (increase in read coverage in the 'Treated' file conpared to 'Untreated' sample).
  2. Peak file of positions with highest 'Rho score' in a given region.

Both files include

  • Coordinate;
  • Strand;
  • Read coverage in a given window upstream and downstream of the coordinate;
  • Rho Readthrough score (R);
  • Significance score (p value from fisher's exact test).

Caveats

  • When 'Significance score' (p value) is too low to accurately report and is shown as '0.0';
  • If one of the regions used in the calculation has no read coverage, it is difficult to calculate 'Rho Readthrough score' (1/((untreated_ds/untreated_us)/(treated_ds/treated_us))) due to division by 0 error. In those cases, 'Readthrough score' is estimated and if it passes the set threshold, score is reported as 'N/A'.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.