cuRRBS

Created by Daniel E. Martin-Herranz, Antonio J.M. Ribeiro and Thomas M. Stubbs.

Copyright (C) 2016,2017 D.E. Martin-Herranz, A.J.M. Ribeiro, T.M. Stubbs.

What is cuRRBS?

cuRRBS (customised Reduced Representation Bisulfite Sequencing) is an easy-to-use computational method that predicts the optimal combination of restriction enzymes and size range to enrich for any given set of sites of interest in any genome. In other words, by modifying two steps of the original protocol (the restriction enzymes and the size selection), cuRRBS generalises RRBS. This allows the user to create 'personalised' reduced representations (i.e. subsets) of the genome that make DNA methylation next-generation sequencing experiments more cost-effective.

Citing cuRRBS

If you found cuRRBS useful for your research, please cite our original publication:

Daniel E. Martin-Herranz, António J. M. Ribeiro, Felix Krueger, Janet M. Thornton, Wolf Reik, Thomas M. Stubbs; cuRRBS: simple and robust evaluation of enzyme combinations for reduced representation approaches, Nucleic Acids Research, gkx814, https://doi.org/10.1093/nar/gkx814

Getting started

Before you can run cuRRBS on your computer, you will need to download and install the following software:

  1. Python 2.7. The programming language in which cuRRBS is written.
  2. NumPy. A Python module which contains many useful functions used by cuRRBS.
  3. Terminal. You will need it to run cuRRBS commands.

Afterwards, you can install cuRRBS on your computer by downloading it directly or by cloning it from this GitHub repository:

git clone https://github.com/demh/cuRRBS.git

Preparing cuRRBS input files

cuRRBS requires the following files as input:

1. Isoschizomer annotation file.

Restriction enzymes that cut the genome in the same way (i.e. generate the same fragment length distribution) are grouped in isoschizomer families. This file also contains information on the methylation-sensitivity (in a specific genomic context) of the commercially available restriction enzymes. These files can be found in utils/isoschizomers_*_annotation.csv, but users can also create their own file. By default, cuRRBS uses isoschizomers_CpG_annotation.csv, which should be used when considering only methylation-sensitivity in CpG contexts (i.e. for most vertebrate genomes).

2. Pre-computed files.

They contain information for the in silico digestions of a genome with certain restriction enzymes. There are two ways to obtain these files:

  • Download them (recommended option for human, mouse and Arabidopsis thaliana genomes). Only the enzymes which belong to an isoschizomer family with CpG methylation-insensitive members (see utils/isoschizomers_CpG_annotation.csv) are available. They need to be extracted using the following command in the terminal:

    tar xvjf pre_computed_files_for_genome_of_interest.tar.bz2 -C /path/to/precomputed/files/folder/
    
  • Generate them yourself (recommended only when using a novel genome). The script src/pre_compute_digestions.py can be used for this purpose:

    python pre_compute_digestions.py enzymes_to_pre_compute.txt genome.fa working_directory
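
For the download option, the archive can also be unpacked from Python rather than from the shell. The following is a minimal sketch using the standard-library tarfile module; the function name extract_precomputed and any paths passed to it are placeholders for illustration and are not part of cuRRBS.

```python
import os
import tarfile


def extract_precomputed(archive_path, dest_dir):
    """Extract a .tar.bz2 archive of pre-computed files into dest_dir.

    Hypothetical helper, not part of cuRRBS. Returns the sorted names of
    the entries found in dest_dir after extraction.
    """
    # Create the destination folder if it does not exist yet.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

The resulting folder is the one that would then be passed to cuRRBS as the pre-computed files folder.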
    

3. Enzymes to check.

This file contains the isoschizomer families that should be considered by cuRRBS (i.e. those that contain at least one methylation-insensitive member). When working with CpG methylation-insensitive enzymes, the file utils/enzymes_to_check_CpG.txt should be used.

4. Sites annotation file.

This file stores the information regarding the sites of interest (i.e. genomic coordinates) that you want to enrich for with the new customised protocol. This file needs to be created from scratch, since it will be different depending on the sites that need to be targeted. The file is provided with a CSV (comma-separated values) format, with each row containing the information for a site of interest and the following columns:

  • Site_ID: unique ID for the current site of interest. It cannot contain the '_' or ',' characters.

  • Chr: chromosome where the current site of interest is located. The same chromosome names should be used as in the case of the pre-computed files. e.g. in the case of the human (hg38) and mouse (mm10) pre-computed files, we use the chromosome nomenclature from UCSC (i.e. chr1, chr2, ...).

  • Coordinate: 1-based genomic coordinate where the current site of interest is located. Please make sure that the coordinates are based on the same genome assembly as the one used to generate the pre-computed files.

  • Weight: in case there is a preference in recovering certain sites of interest, their weights should be higher. The weights are always positive and are used afterwards to calculate the Score (see Interpreting cuRRBS output section).

The first line of the sites annotation file must be a header containing the column names. Some examples of this type of file for different biological systems can be found in the examples/ folder.
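
The file layout described above can be sketched in Python as follows. This is only an illustration: the column names come from the description above, but the helper names, the output file name and the two example sites (IDs, coordinates and weights) are made up and do not correspond to real sites of interest.

```python
import csv

# Column names as listed in the description of the sites annotation file.
HEADER = ["Site_ID", "Chr", "Coordinate", "Weight"]


def validate_site(site_id, chrom, coordinate, weight):
    """Check one site against the rules stated above (hypothetical helper)."""
    if "_" in site_id or "," in site_id:
        raise ValueError("Site_ID must not contain '_' or ',': %r" % site_id)
    if int(coordinate) < 1:
        raise ValueError("Coordinate is 1-based and must be >= 1")
    if float(weight) <= 0:
        raise ValueError("Weight must be positive")
    return [site_id, chrom, int(coordinate), weight]


def write_sites_annotation(path, sites):
    """Write (Site_ID, Chr, Coordinate, Weight) tuples as a CSV with a header."""
    seen_ids = set()
    with open(path, "w") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        for site in sites:
            row = validate_site(*site)
            if row[0] in seen_ids:
                raise ValueError("Site_ID must be unique: %r" % row[0])
            seen_ids.add(row[0])
            writer.writerow(row)


if __name__ == "__main__":
    # Made-up sites, for illustration only.
    example_sites = [
        ("siteA", "chr1", 1000000, 1),
        ("siteB", "chr2", 2500000, 2),  # higher weight: preferred site
    ]
    write_sites_annotation("my_sites_annotation.csv", example_sites)
```

Chromosome names and the genome assembly still need to be checked by hand against the pre-computed files, since the sketch cannot verify them.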

Running cuRRBS

There are two parameters that you definitely need to consider before running cuRRBS:

  • C_Score constant (compulsory). It sets the minimum Score that an enzyme combination must reach in order to be included in the output. This is important since there is a trade-off between the number of sites of interest that will be sequenced and the cost associated with the new protocol (i.e. higher C_Score values will be associated with higher sequencing costs). We recommend running the software initially with a value of 0.25, which restricts the reported customised protocols to those that capture at least 25 % of the maximum Score (when all the site weights are equal, this corresponds to 25 % of the sites of interest).

  • Experimental error (default: 20 bp). Your estimated experimental error during the size selection step. In general, this error should be higher for protocols that use AMPure XP beads than for those that use gel-slicing. If you run cuRRBS with a higher experimental error, fewer size ranges will be checked and the software will run faster. This parameter will also influence the robustness of the protocol (higher experimental errors generally imply lower robustness).
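
To make the C_Score threshold concrete, the short sketch below computes the minimum Score implied by a given C_Score. It assumes, as the 0.25 example suggests, that the threshold is simply C_Score multiplied by the maximum Score (the sum of all site weights); the function name and the weights are made up for illustration.

```python
def minimum_score(weights, c_score):
    """Minimum Score a protocol must reach, assuming threshold = c_score * max Score."""
    # Maximum Score: the Score obtained if every site of interest were sequenced.
    max_score = float(sum(weights))
    return c_score * max_score


# Six hypothetical sites, two of them weighted twice as much as the others.
weights = [1, 1, 1, 1, 2, 2]
print(minimum_score(weights, 0.25))  # 2.0, i.e. 25 % of a maximum Score of 8
```

With 100 equally weighted sites (weight 1 each), the same calculation gives 25.0, which matches the statement that 25 % of the sites of interest must be captured.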

You can find more information regarding cuRRBS parameters in the help page of the software:

python /path/to/cuRRBS/cuRRBS.py -h

A typical cuRRBS command (using the default parameters) would look like:

python /path/to/cuRRBS/cuRRBS.py -o /path/to/output/folder/ -p /path/to/hg38/pre-computed/files/folder/ -e /path/to/cuRRBS/utils/enzymes_to_check_CpG.txt -a /path/to/cuRRBS/examples/epigenetic_clock_human_hg38_sites_annotation.csv -r 75 -s 120 -c 0.25 -g 3088.286401
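
When exploring the trade-off controlled by C_Score, it can be convenient to assemble the command above programmatically. The sketch below only builds the argument list; the function name is hypothetical, the flags are the ones shown in the example command, and the values for -r, -s and -g are copied from that example (see the help page for their meaning).

```python
import os


def build_currbs_command(currbs_dir, output_dir, precomputed_dir,
                         annotation_csv, c_score):
    """Return the example cuRRBS command as a list of arguments (hypothetical helper)."""
    return [
        "python", os.path.join(currbs_dir, "cuRRBS.py"),
        "-o", output_dir,
        "-p", precomputed_dir,
        "-e", os.path.join(currbs_dir, "utils", "enzymes_to_check_CpG.txt"),
        "-a", annotation_csv,
        # Values below are copied verbatim from the example command.
        "-r", "75",
        "-s", "120",
        "-c", str(c_score),
        "-g", "3088.286401",
    ]


# One command per C_Score value, each writing to its own output folder.
for c_score in (0.25, 0.5):
    command = build_currbs_command(
        "/path/to/cuRRBS",
        "/path/to/output/folder_c%s/" % c_score,
        "/path/to/hg38/pre-computed/files/folder/",
        "/path/to/cuRRBS/examples/epigenetic_clock_human_hg38_sites_annotation.csv",
        c_score,
    )
    print(" ".join(command))
```

Each list could then be passed to subprocess.check_call from the standard library to actually run cuRRBS.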

Interpreting cuRRBS output

cuRRBS output is a CSV file (final_cuRRBS_output.csv) that can be opened with Excel. It contains different customised RRBS protocols (restriction enzymes and size range) ranked by their ability to enrich for the sites of interest. The output file contains the following columns:

  • Enzyme(s). Restriction enzymes that should be used in the customised protocol. Several methylation-insensitive isoschizomers can be reported (separated by OR and surrounded by parentheses) if available, so you can choose which ones to use depending on their experimental conditions or price. e.g. if the output is '(BmeT110I OR BsoBI) AND (BtgI)', the double-digestions BmeT110I+BtgI or BsoBI+BtgI could be used.
  • Experimental_size_range. Size range that needs to be implemented in the experimental protocol. This is equivalent to the Theoretical_size_range plus the size of the adapters (specified with the -s parameter).
  • Theoretical_size_range. Size range that is used internally by cuRRBS to make its calculations.
  • Score. Sum of the weights of the sites of interest that are sequenced with the customised protocol. cuRRBS attempts to maximise this variable.
  • %_max_Score. Percentage of the maximum Score (i.e. if all the sites of interest were sequenced) achieved.
  • NF/1000. Number of digested fragments that will be sequenced, divided by 1000. Since we assume this variable is proportional to the sequencing costs, cuRRBS attempts to minimise it.
  • Cost_Reduction_Factor. Estimated fold-reduction in sequencing costs as compared to Whole Genome Bisulfite Sequencing (WGBS).
  • Enrichment_Value. Variable which combines the Score and NF/1000 (see cuRRBS paper for more details). The final goal of cuRRBS is minimising this variable.
  • Robustness. It reflects how sensitive the customised protocol is to experimental errors during the size selection step. It takes values in the interval (0,1]. Generally speaking, protocols with Robustness > 0.95 can be considered robust (i.e. the prediction made by the software will mostly hold even if small experimental errors happen).
  • C_Score|C_NF/1000. Thresholds used for the Score and NF variables.
  • Number_of_sites. Number of sites of interest that will be sequenced with this customised protocol.
  • Sites_IDs (only if the -i option is specified). IDs of the sites of interest that will be sequenced with this customised protocol.
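
As an alternative to Excel, the output can be post-processed in Python. The sketch below expands the Enzyme(s) field into the individual digestions it allows and keeps only protocols with Robustness > 0.95. It assumes that the file is comma-separated with the column names listed above and that every isoschizomer family is written in parentheses, as in the example; the function names and the two demo rows are made up for illustration.

```python
import csv
import itertools
import re


def expand_enzymes(enzymes_field):
    """Turn '(A OR B) AND (C)' into [('A', 'C'), ('B', 'C')]."""
    # Each parenthesised group lists interchangeable isoschizomers.
    groups = re.findall(r"\(([^)]*)\)", enzymes_field)
    choices = [[name.strip() for name in group.split(" OR ")] for group in groups]
    # One enzyme is picked from every group to form a digestion.
    return list(itertools.product(*choices))


def robust_protocols(rows, min_robustness=0.95):
    """Keep the rows whose Robustness exceeds the given threshold."""
    return [row for row in rows if float(row["Robustness"]) > min_robustness]


# Made-up excerpt with only two of the output columns, for illustration.
demo_lines = [
    "Enzyme(s),Robustness",
    "(BmeT110I OR BsoBI) AND (BtgI),0.97",
    "(BtgI),0.90",
]
rows = list(csv.DictReader(demo_lines))
for row in robust_protocols(rows):
    print(expand_enzymes(row["Enzyme(s)"]))
```

For a real run, the lines would instead be read from final_cuRRBS_output.csv by passing an open file handle to csv.DictReader.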

Contacting us

If you experience any issues with cuRRBS or have any suggestions, please contact us at [email protected], [email protected] or [email protected].
