tfTarget

Transcription factors (TFs) regulate complex programs of gene transcription by binding to short DNA sequence motifs within transcription regulatory elements (TREs). Here we introduce tfTarget, a unified framework that identifies the "TF -> TRE -> target gene" networks that are differentially regulated between two conditions, e.g. experimental vs. control, using PRO-seq/GRO-seq/ChRO-seq data as the input. The package provides a convenient interface that does not assume familiarity with the R environment: users can run the scripts directly in a Linux console.

Cloud Computing Service:

We provide a computational gateway for running tfTarget on a GPU server. Users do not need to install any software; they only upload the bigWig files and wait for the results. Please click the link to try this site:

https://dreg.dnasequence.org/

Cite tfTarget:

Chromatin run-on and sequencing maps the transcriptional regulatory landscape of glioblastoma multiforme

Tinyi Chu, Edward J Rice, Gregory T Booth, Hans H Salamanca, Zhong Wang, Leighton J Core, Sharon L Longo, Robert J Corona, Lawrence S Chin, John T Lis, Hojoong Kwak, Charles Danko

Nature Genetics https://www.nature.com/articles/s41588-018-0244-3


A detailed example of running dREG and tfTarget from raw sequencing data can be found at https://doi.org/10.1002/cpbi.70 .

Workflow of tfTarget

Requires

  • R packages:

    rphast, rtfbsdb, grid, cluster, apcluster, DESeq2, gplots.

    rtfbsdb (https://github.com/Danko-Lab/rtfbs_db)

    bigWig (https://github.com/andrelmartins/bigWig.git)

  • Bioinformatics tools or external commands:

    awk, sort: Unix commands

    bedtools (http://bedtools.readthedocs.org/en/latest/)

    sort-bed (http://bedops.readthedocs.org/en/latest/index.html)

    twoBitToFa, faToTwoBit (http://hgdownload.cse.ucsc.edu/admin/exe/)

  • 2bit files for your genome of interest. Find links to these here:

    http://hgdownload.cse.ucsc.edu/downloads.html

  • A tfs object file for the species of interest, in .rdata format, which contains the curated transcription factor motif database. For Homo_sapiens, it is provided by the tfTarget package and used by default. For other species, we provide a convenient script, get.tfs.R, that calls rtfbsdb and generates species.tfs.rdata. When you call run_tfTarget.bsh for a species other than Homo_sapiens, use -tfs.path to specify the tfs object file.

    example:

     R --vanilla --slave --args Mus_musculus < get.tfs.R
     
     R --vanilla --slave --args your_species your_cisbp_zip_file < get.tfs.R
    

    Species names can be looked up in the "species" column (1st column) of http://cisbp.ccbr.utoronto.ca/summary.php?by=1&orderby=Species

  • TRE regions identified by dREG, or equivalent tools, in bed format.

    To prepare the input TRE file, we recommend merging the dREG sites from the query and control samples using bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html), e.g.,

     cat query.dREG.peak.score.bed control.dREG.peak.score.bed \
     | LC_COLLATE=C sort -k1,1 -k2,2n \
     | bedtools merge -i stdin > merged.dREG.bed
    

    For bed.gz files, use zcat instead of cat.
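
The merge step above can be wrapped so that plain and gzipped BED files are handled the same way. The following is a minimal sketch under stated assumptions: the helper names `read_bed`, `sorted_bed3` and `merge_tre_beds` are hypothetical and not part of tfTarget, and only the first three columns are kept because tfTarget reads only those.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of tfTarget).

# Print a BED file to stdout, decompressing it when the name ends in .gz.
read_bed() {
  case "$1" in
    *.gz) gzip -dc -- "$1" ;;
    *)    cat -- "$1" ;;
  esac
}

# Concatenate any number of BED files, keep the first three columns,
# and sort by chromosome and start, as bedtools merge expects.
sorted_bed3() {
  local f
  for f in "$@"; do
    read_bed "$f"
  done | cut -f1-3 | LC_ALL=C sort -k1,1 -k2,2n
}

# Merge overlapping intervals; requires bedtools on the PATH.
merge_tre_beds() {
  sorted_bed3 "$@" | bedtools merge -i stdin
}

# Usage (file names are placeholders):
#   merge_tre_beds query.dREG.peak.score.bed.gz control.dREG.peak.score.bed.gz > merged.dREG.bed
```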

  • Gene annotation file in bed6 format. It can be prepared from GENCODE or RefSeq GTF files. We recommend using the gene ID and gene name for the 4th and 5th columns; this information will show up in the output. https://www.gencodegenes.org/releases/current.html

    gtf.gz files can be converted to the gene annotation file for tfTarget input using the following command as an example:

     zcat gencode.v19.annotation.gtf.gz \
     |  awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$18,$7}}' \
     | tr -d '";' > gencode.v19.annotation.bed
    

    The following lines show the head of 'gencode.v19.annotation.bed', with the columns chromosome, start, end, gene ID, gene name and strand.

     chr1    11868   14412   ENSG00000223972.4       DDX11L1 +
     chr1    14362   29806   ENSG00000227232.4       WASH7P  -
     chr1    29553   31109   ENSG00000243485.2       MIR1302-11      +
     chr1    34553   36081   ENSG00000237613.2       FAM138A -
     chr1    52472   54936   ENSG00000268020.2       OR4G4P  +   
     chr1    62947   63887   ENSG00000240361.1       OR4G11P +
     chr1    69090   70008   ENSG00000186092.4       OR4F5   +
     chr1    89294   133566  ENSG00000238009.2       RP11-34P13.7   
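
The awk command above picks the gene ID and gene name by field position ($10 and $18), which matches the attribute order of the GENCODE v19 file used in the example. For GTF files whose attribute order differs, those positions may point at other attributes. A minimal sketch that looks the two attributes up by key instead is shown below; the function name `gtf_genes_to_bed6` is hypothetical and not part of tfTarget.

```shell
# Hypothetical helper: convert GTF "gene" records read from stdin to bed6,
# looking up gene_id and gene_name by attribute key rather than by position.
gtf_genes_to_bed6() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    $3 == "gene" {
      id = ""; name = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)                 # trim leading spaces
        if (a ~ /^gene_id /)   { id = a;   sub(/^gene_id /, "", id) }
        if (a ~ /^gene_name /) { name = a; sub(/^gene_name /, "", name) }
      }
      gsub(/"/, "", id); gsub(/"/, "", name)
      # GTF is 1-based inclusive; BED is 0-based half-open, hence start - 1.
      print $1, $4 - 1, $5, id, name, $7
    }'
}

# Usage:
#   zcat gencode.v19.annotation.gtf.gz | gtf_genes_to_bed6 > gencode.v19.annotation.bed
```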
    
  • bigWig files of query and control replicates. The requirements are the same as for preparing the input files for dREG; see https://github.com/Danko-Lab/dREG#data-preparation

Installation

  • If all dependent packages and commands have been installed, use the following code to install or update the package in an R terminal.
library("devtools");
install_github("Danko-Lab/tfTarget/tfTarget")

If you want to run the bash script (run_tfTarget.bsh), you also need to download all files after the package is installed. Use the following commands in a UNIX/Linux terminal.

git clone https://github.com/Danko-Lab/tfTarget.git
cd tfTarget

Usage

To use tfTarget after installing the package, run "bash run_tfTarget.bsh ... " in the directory that contains run_tfTarget.bsh and main.R, where ... stands for the tfTarget arguments detailed below.

Required arguments:

-query: query file names of PRO-seq in bigWig format, ordered by plus and minus pairings, 
	e.g. query1.plus.bw query1.minus.bw query2.plus.bw query2.minus.bw ... 
	(The default directory is the current working directory, use -bigWig.path to specify if otherwise.)

-control: control file names of PRO-seq in bigWig format, ordered by plus and minus pairings, 
	e.g. control1.plus.bw control1.minus.bw control2.plus.bw control2.minus.bw ...
	(The default directory is the current working directory, use -bigWig.path to specify if otherwise.)

-prefix: prefix for the output pdfs and txts. 

-TRE.path: input TRE regions, e.g. dREG sites, in bed3 format. Only the first three columns will be used. 

-gene.path: Gene annotation file in bed6 format. It can be prepared from GENCODE or RefSeq GTF files. 
	We recommend using the gene ID and gene name for the 4th and 5th columns. 
	This information will show up in the output. https://www.gencodegenes.org/releases/current.html

-2bit.path: 2bit files for your genome of interest. 
	Find links to these here: http://hgdownload.cse.ucsc.edu/downloads.html
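
Because -query and -control expect plus/minus pairs, the file lists can be checked before a long run is started. The following is a minimal sketch, not part of tfTarget: the function name `check_bigwig_pairs` and all file names in the usage comment are hypothetical, and the check only covers the file count and file existence, not strand order.

```shell
# Hypothetical pre-flight check (not part of tfTarget): bigWig files must be
# given as plus/minus pairs, so each list needs an even, non-zero count and
# every file must exist under the bigWig directory.
# Arguments: bigWig directory, then the file names.
check_bigwig_pairs() {
  local dir=$1; shift
  if [ "$#" -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
    echo "expected plus/minus pairs, got $# file(s)" >&2
    return 1
  fi
  local f
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing bigWig: $dir/$f" >&2
      return 1
    fi
  done
  return 0
}

# Example with placeholder file names:
#   query="query1.plus.bw query1.minus.bw query2.plus.bw query2.minus.bw"
#   control="control1.plus.bw control1.minus.bw control2.plus.bw control2.minus.bw"
#   check_bigwig_pairs . $query && check_bigwig_pairs . $control &&
#   bash run_tfTarget.bsh -query $query -control $control \
#     -prefix my_run -TRE.path merged.dREG.bed \
#     -gene.path gencode.v19.annotation.bed -2bit.path hg19.2bit
```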

Optional arguments:

Optional system arguments:
-bigWig.path: path to the bigWig files. 
	Default="./"
-ncores: number of threads to use. 
	Default=1.
-deseq: use this flag to run DESeq2 only. 
	No argument is required. Default is off.
-rtfbsdb: use this flag to run DESeq2 and then rtfbsdb only. 
	No argument is required. Default is off.

Optional DESeq2 arguments:
-pval.up: adjusted p-value cutoff below which a TRE is considered differentially transcribed. 
	Default=0.01.
-pval.down: adjusted p-value cutoff above which a TRE is considered not significantly changed between query and control. 
	Default=0.1
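
Read together, the two cutoffs split TREs into three groups: below -pval.up, above -pval.down, and in between. The sketch below illustrates that reading on a generic tab-separated table; it is not the package's implementation. The function name `classify_by_padj` and the label names are hypothetical, the column holding the adjusted p-value is passed as an argument because the table layout is not described here, and the input is assumed to have no header line.

```shell
# Hypothetical helper: label each row of a tab-separated table read from stdin
# by its adjusted p-value. Assumes no header line.
# Arguments: padj column number, pval.up (default 0.01), pval.down (default 0.1).
classify_by_padj() {
  awk -F '\t' -v col="$1" -v up="${2:-0.01}" -v down="${3:-0.1}" '
    BEGIN { OFS = "\t" }
    {
      p = $col
      if (p == "NA" || p == "")  label = "NA"
      else if (p + 0 < up)       label = "changed"
      else if (p + 0 > down)     label = "unchanged"
      else                       label = "intermediate"
      print $0, label
    }'
}

# Usage: classify_by_padj 2 0.01 0.1 < table.txt
```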

Optional rtfbsdb arguments:
-tfs.path: use this tag to specify tfs object from non-Homo sapiens species. 
	Can be prepared using get.tfs.R. See the "requires" section above.
-cycles: how many cycles to run GC-subsampled motif enrichment test. 
	Default=2.
-mTH: threshold over which the TF motif is defined as significantly different from the HMM background. 
	Default=7.
-fdr.cutoff: cutoff on the median of the p-values from multiple GC-subsampled runs, below which a motif is defined as significantly enriched.
	Default=0.01	

Optional mapTF arguments:	
-dist: the distance cutoff (in base pairs) for associating a TRE with the nearest annotated transcription start site. 
	Default=50000.
-closest.N: use this tag to report only the N genes closest to the TRE; can be used in combination with -dist. 
	Default is 2. To disable it, use "-closest.N off".
-pval.gene: use this tag to report only genes that are significantly differentially transcribed 
	1) in the same direction as the regulating TRE, and 
	2) with an adjusted p-value lower than the specified cutoff. 
	Default is 0.05. To disable it, use "-pval.gene off".
The default parameters were chosen based on the ChRO-seq paper "https://www.biorxiv.org/content/early/2018/05/13/185991".
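
To make the interplay of -dist and -closest.N concrete, the following sketch lists, for each TRE, up to N genes whose transcription start site lies within a distance cutoff. It is an illustration only and not the code used by tfTarget: the function name `closest_genes` is hypothetical, measuring from the TRE center is an assumption, and the nested loop is meant for small inputs (for genome-scale data a tool such as bedtools closest would be a more practical choice).

```shell
# Hypothetical illustration (not the package code): for each TRE (bed3 on stdin),
# list up to N genes whose TSS lies within DIST bp of the TRE center.
# Arguments: gene bed6 file, DIST (default 50000), N (default 2).
# Output columns: TRE chrom, start, end, gene name, distance, rank.
closest_genes() {
  awk -F '\t' -v dist="${2:-50000}" -v n="${3:-2}" '
    BEGIN { OFS = "\t" }
    NR == FNR {                        # first input: gene annotation
      if ($6 == "+") { tss = $2 } else { tss = $3 - 1 }
      g++; gchr[g] = $1; gtss[g] = tss; gname[g] = $5
      next
    }
    {                                  # second input: TREs
      center = int(($2 + $3) / 2)
      for (k = 1; k <= n; k++) {       # pick the nearest unused gene, n times
        best = 0
        for (i = 1; i <= g; i++) {
          if (gchr[i] != $1 || (i in used)) continue
          d = gtss[i] - center
          if (d < 0) d = -d
          if (d <= dist && (best == 0 || d < bestd)) { best = i; bestd = d }
        }
        if (best == 0) break
        used[best] = 1
        print $1, $2, $3, gname[best], bestd, k
      }
      split("", used)                  # portable way to clear the array
    }' "$1" -
}

# Usage: closest_genes gencode.v19.annotation.bed 50000 2 < merged.dREG.bed
```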

Output

A complete run of main.R produces .pdf and .txt files.

Specifically,
1) .cor.heatmap.pdfs for TF motifs clustered by genomic location.
2) .motif.ordered.pdfs for the visualization of TF motifs and their enrichment statistics,
   ordered by the clusters in 1).
3) .TRE.deseq.txt for each TRE and its DESeq2 statistics.
4) .gene.deseq.txt for each annotated gene body and its DESeq2 statistics.
   Rows in which all values are NA indicate genes that are too short (<=1 kb) to be included in the DESeq2 run.
5) .TF.TRE.gene.txt for each TF whose motif is enriched in up/down TREs, 
   the TREs that contain the motif, and the putative target genes of each TRE.
   We recommend that users further filter the target genes by log2 fold change (and p-value), 
   keeping those that change in the same direction as the TRE that regulates them.
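
The recommended direction filter in 5) can be sketched with awk. This is a hedged illustration: the function name `filter_same_direction` is hypothetical, and the three column numbers are passed as arguments because the column layout of .TF.TRE.gene.txt is not described here. A header line, if present, is dropped because its fields are not numeric.

```shell
# Hypothetical helper: keep rows of a tab-separated table (stdin) where the TRE
# and gene log2 fold changes have the same sign and the gene adjusted p-value
# is below a cutoff.
# Arguments: TRE log2FC column, gene log2FC column, gene padj column,
#            cutoff (default 0.05).
filter_same_direction() {
  awk -F '\t' -v t="$1" -v g="$2" -v p="$3" -v cut="${4:-0.05}" '
    $t != "NA" && $g != "NA" && $p != "NA" &&
    ($t + 0) * ($g + 0) > 0 && ($p + 0) < cut'
}

# Usage (column numbers are placeholders):
#   filter_same_direction 5 9 10 0.05 < my_run.TF.TRE.gene.txt
```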

Documents

  • R vignette: (Coming soon)

  • R manual: (Coming soon)

tftarget's Issues

Error when associating TFs to TREs and genes (mapTF / get.proximal.genes)

When running tfTarget via run_tfTarget.bsh with the following command:
bash run_tfTarget.bsh \
  -query $TREATMENT_SAMPLES \
  -control $CONTROL_SAMPLES \
  -bigWig.path $BIGWIG_PATH \
  -prefix gencode_test \
  -TRE.path $TRE_MERGED_BED \
  -gene.path $ANNOTATION_BED \
  -2bit.path $HG19_2BIT \
  -pval.up 0.1 \
  -pval.down 0.1 \
  -ncores 3 \
  -dist 50000 \
  -closest.N 2 \
  -pval.gene 0.1

I am getting the following error:

[1] "associating TFs to TREs and genes"
awk: syntax error at source line 1
context is
BEGIN{OFS=" "} {print >>> $1,$6== <<<
awk: illegal statement at source line 1
awk: illegal statement at source line 1
Error in `$<-.data.frame`(`*tmp*`, "closest.N", value = c(1L, 2L, 1L, :
replacement has 36 rows, data has 37
Calls: mapTF -> get.proximal.genes -> $<- -> $<-.data.frame
Execution halted

This appears to be related to the awk command at lines 18-20 or 43-45 of mapTF.R
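
The context printed by awk (`print >>> $1,$6== <<<`) suggests that a comparison is used directly inside a print statement, although the actual line from mapTF.R is not shown here. The POSIX awk grammar does not allow an unparenthesized comparison in a print list, and the awk shipped with macOS appears to enforce this, whereas gawk and mawk accept it. Assuming that is the cause, parenthesizing the expression (or running the pipeline with gawk) is a possible workaround. The function below is an illustration with a hypothetical name, not the code from mapTF.R.

```shell
# Illustration only; this is not the line from mapTF.R.
# Non-portable form (a syntax error in strictly POSIX awk implementations):
#   awk 'BEGIN{OFS="\t"} {print $1, $6=="+" ? $2 : $3}'
# Portable form: parenthesize the expression that contains the comparison.
strand_aware_start() {
  awk 'BEGIN { OFS = "\t" } { print $1, ($6 == "+" ? $2 : $3) }'
}

# Usage: strand_aware_start < gencode.v19.annotation.bed
```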

R session info with tfTarget loaded:

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] tfTarget_1.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6 matrixStats_0.54.0 rtfbsdb_0.4.5
[4] bit64_0.9-7 RColorBrewer_1.1-2 GenomeInfoDb_1.18.1
[7] tools_3.5.1 backports_1.1.3 R6_2.3.0
[10] KernSmooth_2.23-15 rpart_4.1-13 sm_2.2-5.4
[13] Hmisc_4.1-1 DBI_1.0.0 lazyeval_0.2.1
[16] BiocGenerics_0.28.0 colorspace_1.3-2 nnet_7.3-12
[19] tidyselect_0.2.5 gridExtra_2.3 DESeq2_1.22.1
[22] bit_1.1-14 compiler_3.5.1 Biobase_2.42.0
[25] htmlTable_1.12 DelayedArray_0.8.0 rphast_1.6.9
[28] caTools_1.17.1.1 scales_1.0.0 checkmate_1.8.5
[31] genefilter_1.64.0 stringr_1.3.1 apcluster_1.4.7
[34] digest_0.6.18 foreign_0.8-71 XVector_0.22.0
[37] vioplot_0.3.0 base64enc_0.1-3 pkgconfig_2.0.2
[40] htmltools_0.3.6 htmlwidgets_1.3 rlang_0.3.0.1
[43] rstudioapi_0.8 RSQLite_2.1.1 bindr_0.1.1
[46] zoo_1.8-5 BiocParallel_1.16.5 bigWig_0.2-9
[49] gtools_3.8.1 acepack_1.4.1 dplyr_0.7.8
[52] RCurl_1.95-4.11 magrittr_1.5 GenomeInfoDbData_1.2.0
[55] Formula_1.2-3 Matrix_1.2-15 Rcpp_1.0.0
[58] munsell_0.5.0 S4Vectors_0.20.1 stringi_1.2.4
[61] yaml_2.2.0 rtfbs_0.3.9 SummarizedExperiment_1.12.0
[64] zlibbioc_1.28.0 gplots_3.0.1 plyr_1.8.4
[67] grid_3.5.1 blob_1.1.1 gdata_2.18.0
[70] parallel_3.5.1 crayon_1.3.4 lattice_0.20-38
[73] splines_3.5.1 annotate_1.60.0 locfit_1.5-9.1
[76] knitr_1.21 pillar_1.3.0 GenomicRanges_1.34.0
[79] geneplotter_1.60.0 stats4_3.5.1 XML_3.98-1.16
[82] glue_1.3.0 latticeExtra_0.6-28 data.table_1.11.8
[85] gtable_0.2.0 purrr_0.2.5 assertthat_0.2.0
[88] ggplot2_3.1.0 xfun_0.4 xtable_1.8-3
[91] survival_2.43-3 tibble_1.4.2 AnnotationDbi_1.44.0
[94] memoise_1.1.0 IRanges_2.16.0 bindrcpp_0.2.2
[97] cluster_2.0.7-1
