Code Monkey home page Code Monkey logo

fishhook's Introduction

R-CMD-check codecov.io

fishHook

R package for applying Gamma-Poisson regression to identify statistical enrichment or depletion of somatic mutations in arbitrary (sets of) genomic intervals after correcting for genomic covariates, e.g. replication timing, sequence context, chromatin state.

License: GPL v3

Installation

  1. Install devtools from CRAN (if you don't have it already)
install.packages('devtools')
  1. Install gUtils
devtools::install_github('mskilab/gUtils')
  1. Install fishHook
devtools::install_github('mskilab/fishHook')

Documentation

fishHook Tutorial

fishHook Developer Reference

Attributions

Marcin Imielinski - Assistant Professor, Weill-Cornell Medical College. Core Member, New York Genome Center.

Zoran Gajic - Undergraduate Research Assistant, Rutgers University, Weill Cornell Medicine, New York Genome Center.

fishhook's People

Contributors

evanbiederstedt avatar imielinski avatar mskilab avatar stevenschumacher avatar walaj avatar xtyao avatar zining01 avatar zoranzg avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

fishhook's Issues

Nucleotide context

Hi there @mskilab @imielinski
I have a maf file containing ssvs for 100 samples.
How do i get the nucleotide context for these samples?
Do i need to generate nucleotide context for each one and average across the samples?
Or is there a generic nucleotide context that can be used?
I am using hg38.
Thanks for your help!
A

Error: covariates must be of class Covariate

Hi,
I am using fishHook. I can run the basic model (without covariates) without an issue.

When I add in the covariates, I am getting the following error -

fish$covariates = c(reptime, gc, hetchrom)
Error in (function (value) :
Error: covariates must be of class Covariate

I am not sure what is going wrong here.
The first class of the 3 variables reptime, gc and hetchrom is definitely "covariate"

class(reptime)
[1] "Covariate" "R6"
class(gc)
[1] "Covariate" "R6"
class(hetchrom)
[1] "Covariate" "R6"

reptime
1 Covariates with features:
name type class field signature na.rm pad grep
1: ReplicationTiming numeric GRanges score NA 0 NA
gc
2 Covariates with features:
name type class field signature na.rm pad grep
1: C numeric GRanges C NA 0 NA
2: G numeric GRanges G NA 0 NA
hetchrom
1 Covariates with features:
name type class field signature na.rm pad grep
1: Heterochromatin numeric GRanges het NA 0 NA

Here is my session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] MASS_7.3-51.4 org.Hs.eg.db_3.8.2
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.36.3
[5] AnnotationDbi_1.46.0 Biobase_2.44.0
[7] rtracklayer_1.44.0 fishHook_0.1
[9] plotly_4.9.0 ggplot2_3.2.0
[11] gUtils_0.2.0 data.table_1.12.2
[13] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
[15] IRanges_2.18.1 S4Vectors_0.22.0
[17] BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lattice_0.20-38 tidyr_0.8.3
[4] prettyunits_1.0.2 Rsamtools_2.0.0 Biostrings_2.52.0
[7] zoo_1.8-6 assertthat_0.2.1 zeallot_0.1.0
[10] digest_0.6.20 R6_2.4.0 backports_1.1.4
[13] RSQLite_2.1.1 httr_1.4.0 pillar_1.4.2
[16] progress_1.2.2 zlibbioc_1.30.0 rlang_0.4.0
[19] lazyeval_0.2.2 rstudioapi_0.10 blob_1.2.0
[22] Matrix_1.2-17 BiocParallel_1.18.0 stringr_1.4.0
[25] htmlwidgets_1.3 biomaRt_2.40.1 RCurl_1.95-4.12
[28] bit_1.1-14 munsell_0.5.0 DelayedArray_0.10.0
[31] compiler_3.6.0 pkgconfig_2.0.2 htmltools_0.3.6
[34] tidyselect_0.2.5 SummarizedExperiment_1.14.0 tibble_2.1.3
[37] GenomeInfoDbData_1.2.1 matrixStats_0.54.0 XML_3.98-1.20
[40] viridisLite_0.3.0 crayon_1.3.4 dplyr_0.8.3
[43] withr_2.1.2 GenomicAlignments_1.20.1 bitops_1.0-6
[46] grid_3.6.0 jsonlite_1.6 gtable_0.3.0
[49] DBI_1.0.0 magrittr_1.5 scales_1.0.0
[52] stringi_1.4.3 XVector_0.24.0 vctrs_0.2.0
[55] tools_3.6.0 bit64_0.9-7 glue_1.3.1
[58] purrr_0.3.2 hms_0.5.0 yaml_2.2.0
[61] colorspace_1.4-1 memoise_1.1.0

offset(eligible) should be offset(log(eligible))?

@karinkumar and I were debugging our model and noticed a math bug. Because the covariate terms are exponentiated in the likelihood function for a negative binomial regression, and the offset is a scaling term (e.g. half as much eligible territory should predict half as many events), the offset term should be log'ed first.

See pages 1 and 2 of https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Negative_Binomial_Regression.pdf noting that t_i is ln(t_i) in the likelihood.

It seems everywhere I look for glm.nb, they all use offset(log(x)). e.g. https://stats.stackexchange.com/questions/180260/r-glm-nb-and-when-to-consider-using-an-offset-or-including-a-covariate

This makes a difference. See the covariates plot below. Darkest red is evenly spaced bins with offset(log(eligible)). Medium-red is offset(eligible). The green bars are binned so that each bin covers almost exactly the same amount of eligible territory, so the parameterization of the offset should not matter.

image

interpreting output from WES fishHook analysis

Hi, I ran fishHook with WES data. Mutations were input as a granges object obtained from a maf file. Gene annotations were from the gencode v38 gtf file. For coverage, I lifted over the bigwig file for the hg19 version (given in the tutorial) to the hg38 version and used that. I could get the q-q plot but when I looked at the results, I found that there were multiple rows for a single gene. For example, for TP53 I found 221 rows (please see attached file). Could you please help me interpret this result? Thank you.
TP53.xlsx

How to interpret the results?

in tutorial data,

seqnames | start | end | strand | width | gene_name | hid | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | eligible | p.neg | fdr.neg
17 | 7565097 | 7590856 |   | 25760 | TP53 | 14970 | 1.1e-29 | 0.00000 | 99 | 5.58 | 2.070 | 0.0878 | 0.00184 | 1127 | 1 | 1
2 | 79384132 | 79386879 |   | 2748 | REG3A | 2447 | 9e-08 | 0.00079 | 16 | 4.05 | 0.964 | 0.0305 | 0.00184 | 525 | 1 | 1
19 | 10596796 | 10614417 |   | 17622 | KEAP1 | 16561 | 1.2e-07 | 0.00079 | 30 | 3.60 | 2.480 | 0.0222 | 0.00184 | 1350 | 1 | 1
8 | 88882973 | 88886296 |   | 3324 | DCAF4L2 | 8280 | 4.3e-06 | 0.01800 | 21 | 3.27 | 2.180 | 0.0177 | 0.00184 | 1185 | 1 | 1

Interpretation of this result is that most likely a driver mutation is in the TP53 gene region. Should I interpret it like this? One step over, are TP53, REG3A, KEAP1, and DCAF4L2 statistically significant mutations developing lung adenocarcinoma?
But when you look at other programs, the resolution is very precise, not on a genetic basis, but on a single loci.

I'm not sure if FISHHOOK is a gene prioritization program or finding driver mutation program.

Please fill me in.
Thanks.

Let me get this straight about 'eligible territory'

Thanks. for your reply before.
But still i don't understand exactly for making coverage.

First of all, I know which tool to use to create the coverage.

But what I still don't know is whether I can define the GRCh38 genome file as eligible territory. In other words, should GRCh38 gtf be converted to bed to bigwig?

Or, read coverage varies from sample to sample, so I have to use different coverage files for each sample.

I'm very thank you for your kindness.

annotateTargets: Error: attempt to apply non-function

Hi,

Thank you for sharing this algorithm.

I'm attempting to run your demo script using the supplied data but I'm running into some troubles and was wondering if this is something known. The following command generates an error message.

anno = fish$annotateTargets(maxPtGene = 1, verbose = FALSE, PtIDCol = "patient_code")
Error: attempt to apply non-function

Two questions:

  1. The function seems to be named annotate.targets in the manual, I don't know if this matters but is this only a typo?
  2. According to the manual the Annotated-object used in later stages should be generated by calling FishHook$annotateTargets(). However, even if I swap to annotate.targets I can't seem to pass the fish variables into the function, it only responds with Error: attempt to apply non-function. I can use it by supplying the variables directly to the function but can't get the covariates to work (Error: object of type 'closure' is not subsettable when supplying the Covariates variable from c(rept). Do you have any input for this?

Working in R 3.4.2 on MacOS 10.12.6

Thanks

Fish error; there are no seqlevels of events that match hypothesis

library(fishHook)
library(gTrack)
library(rtracklayer)

mutations = dt2gr(fread('test_150_maftools.maf'))
genes = gr.sub(import('gencode.v27.annotation.gtf'))
genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3)
cds = import("gencode.v27.annotation.gtf")
exomecov = import('Z_NSLA_1004_T.bw')

head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
eligible = exomecov %Q% which(score>0.95)
eligible = reduce(intersect(eligible, cds, ignore.strand = TRUE))
events = mutations %Q% (Variant_Classification != 'Silent')

fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')

Error in .subset2(public_bind_env, "initialize")(...) :
Error: there are no seqlevels of events that match hypotheses
In addition: Warning messages:
1: In .subset2(public_bind_env, "initialize")(...) :
Warning: seqlevels of Hypotheses and Eligible appear to be in different formats
2: In .subset2(public_bind_env, "initialize")(...) :
Warning: seqlevels of Hypotheses and Events appear to be in different formats

variables

Bundled data for hg38/ensembl 88

Hi ,
I was wondering is that possible to use fishhook on hg38/ensembl88 based mutation ? Since the GDC is providing new called TCGA mutations on hg38, it would be better to provide tutorial about how to prepare the CDS reference data(cds definition file) of fishhook

Yao

unable to load WGS rds object

Hi,
I am trying to use fishHook on WGS data and was trying out the analysis mentioned in the tutorial. I was able to install fishHook but when I tried to read in the mutations via
mutations.wgs = readRDS("/gpfs/commons/groups/imielinski_lab/projects/FishHook/CurrentProtocols/db/de_identified_LUAD_mutations.rds")
I get the following error
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file '/gpfs/commons/groups/imielinski_lab/projects/FishHook/CurrentProtocols/db/de_identified_LUAD_mutations.rds', probable reason 'No such file or directory'
Could you please help me with this error? Thanks.

Could not find function Cov

Hello Marcin,

I've been using Fishhook to run the 1D model for Rameen. I needed to run it again and had to reinstall the package to use on a different version of R. Now when I try to create a covariate object (for instance, if I run this line in your tutorial hetchrom = Cov(hetchromdata, name = 'Heterochromatin')) I get the error: could not find function "Cov".

A version of fishhook I installed months ago still works, so possibly something has changed since then?

erroneous qqplot when adding covariate

Hi all,
I have some WGS data that I'd like to look at.
Q: Any recurrent mutations within the coding regions?

Events:

> head(ssvs[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
GRanges object with 6 ranges and 5 metadata columns:
      seqnames    ranges strand | Tumor_Sample_Barcode Variant_Type Variant_Classification
         <Rle> <IRanges>  <Rle> |             <factor>     <factor>               <factor>
  [1]        1  16965728      + |               BCCA1T          SNP      Missense_Mutation
  [2]        1  20345468      + |               BCCA1T          SNP      Missense_Mutation
  [3]        1  21823627      + |               BCCA1T          SNP      Missense_Mutation
  [4]        1 109192794      + |               BCCA1T          SNP      Missense_Mutation
  [5]        1 111766331      + |               BCCA1T          SNP      Missense_Mutation
  [6]        1 112913924      + |               BCCA1T          SNP      Missense_Mutation
      Reference_Allele Tumor_Seq_Allele2
           <character>       <character>
  [1]                G                 A
  [2]                G                 A
  [3]                C                 T
  [4]                A                 C
  [5]                T                 C
  [6]                A                 T

Eligible territory (GATK's WGS calling region):

> wgs_interval
GRanges object with 356 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]        1      10002-207666      *
    [2]        1     257668-297968      *
    [3]        1     347970-535988      *
    [4]        1    585990-2702781      *
    [5]        1  2746292-12954384      *
    ...      ...               ...    ...
  [352]        Y 21748373-21750013      *
  [353]        Y 21750316-21789281      *
  [354]        Y 21805283-26673214      *
  [355]        Y 56673216-56771509      *
  [356]        Y 56821511-56887902      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlenghts

Hypothesis:

gencode <- rtracklayer::import("/Non-coding/Elements/gencode.v37.annotation.gff3.gz")
cds <- gencode %Q% which(type == "CDS")
cds <- cds %Q% which(seqnames != "chrM")
cds <- cds %Q% which(level < 3)
genes <- gr.sub((cds), 'chr', '')

Run with no covariates:

fish_SSVs <- Fish(hypotheses = genes, 
                             events = ssvs, 
                             eligible = wgs_interval,
                             idcol = 'Tumor_Sample_Barcode')
fish_SSVs$score()

image

Added replication timing as a covariate:
From https://www2.replicationdomain.com/database.php
Acc: Int93941761

RT = gr.sub(import("/annotation_files/RT_BG01_NPC_hg38.bedgraph.gz"), 'chr', '')
sort(RT@seqnames)
RT = Cov(RT, field = "score", name = "Replication_timing") 

RT$data
[[1]]
GRanges object with 2160723 ranges and 1 metadata column:
            seqnames            ranges strand |      score
               <Rle>         <IRanges>  <Rle> |  <numeric>
        [1]        1       27919-27977      * |   0.306177
        [2]        1       40797-40855      * |   0.303592
        [3]        1       41261-41319      * |   0.303496
        [4]        1       42246-42304      * |   0.303294
        [5]        1       46704-46762      * |   0.302369
        ...      ...               ...    ... .        ...
  [2160719]        Y 57198996-57199054      * |  -0.161866
  [2160720]        Y 57200358-57200416      * |  -0.113958
  [2160721]        Y 57201187-57201245      * | -0.0840507
  [2160722]        Y 57203083-57203141      * | -0.0135262
  [2160723]        Y 57204578-57204636      * |  0.0441656

The qqplot now shows lambda < 1 .

image

> summary(fish_SSVs$model)

Call:
glm.nb(formula = formula, data = as.data.frame(tdt), maxit = iter, 
    init.theta = 0.9268776227, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8467  -0.3721  -0.3041  -0.2458   9.3879  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -7.67285    0.01335  -574.9   <2e-16 ***
Replication_timing -0.32102    0.01486   -21.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.9269) family taken to be 1)

    Null deviance: 30791  on 99999  degrees of freedom
Residual deviance: 30329  on 99998  degrees of freedom
AIC: 52366

Number of Fisher Scoring iterations: 1


              Theta:  0.9269 
          Std. Err.:  0.0437 

 2 x log-likelihood:  -52360.3110 

Anyone has any ideas why this happened?

Thanks!
Al

Error when NB model doesn't fit: while ((it <- it + 1) < limit && abs(del) > eps) missing value where TRUE/FALSE needed

Hello @walaj and I were trying to run a fishhook on sparse data (very few breakpoints across the binned genome). Sometimes the data is sparse enough that the nb model doesn't fit. In fact, it throws an error that's within glm fitting function : Error in while ((it <- it + 1) < limit && abs(del) > eps) { : missing value where TRUE/FALSE needed

We are switching to a poisson model, but perhaps there is a better way to catch this error and print a message or redirect to a poisson model instead. Right now it just stops the fishhook model from running.

Help to plot 'QQplot' with command 'fish$qqp()' and Coverage problem

I ran test data (luad.maf and others).
Results are the same in the manual.
but 'fish$qqp()' not works.

how should I do that?

  1. I have paired sequencing data. how to make coverage file?
    one way is extracting common and another is extracting all.
    which is right?

Thanks you for your reply in advance.
Thanks.

trinucleotide context code broken link

Hi there,

Nice looking tool.
In the tutorial there is a link to access the code for generating the nucelotide context file for any genome.
However the link is broken, is there any chance of getting this updated?

many thanks

James

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.