mskilab-org / fishhook Goto Github PK

R package for applying Gamma-Poisson regression to identify statistical enrichment or depletion of somatic mutations in regions after correcting for genomic covariates.

R 100.00%

fishhook's Introduction

fishHook

R package for applying Gamma-Poisson regression to identify statistical enrichment or depletion of somatic mutations in arbitrary (sets of) genomic intervals after correcting for genomic covariates, e.g. replication timing, sequence context, chromatin state.

Installation

Install devtools from CRAN (if you don't have it already)

install.packages('devtools')

Install gUtils

devtools::install_github('mskilab/gUtils')

Install fishHook

devtools::install_github('mskilab/fishHook')

Documentation

fishHook Tutorial

fishHook Developer Reference

Attributions

Marcin Imielinski - Assistant Professor, Weill-Cornell Medical College. Core Member, New York Genome Center.

Zoran Gajic - Undergraduate Research Assistant, Rutgers University, Weill Cornell Medicine, New York Genome Center.

fishhook's People

Contributors

Stargazers

Watchers

Forkers

meermustafa xtyao xtmgah doaneas walaj williamhannon ahmedarslan shaiberalon zining01

fishhook's Issues

Nucleotide context

Hi there @mskilab @imielinski
I have a maf file containing ssvs for 100 samples.
How do i get the nucleotide context for these samples?
Do i need to generate nucleotide context for each one and average across the samples?
Or is there a generic nucleotide context that can be used?
I am using hg38.
Thanks for your help!
A

Error: covariates must be of class Covariate

Hi,
I am using fishHook. I can run the basic model (without covariates) without an issue.

When I add in the covariates, I am getting the following error -

fish$covariates = c(reptime, gc, hetchrom)
Error in (function (value) :
Error: covariates must be of class Covariate

I am not sure what is going wrong here.
The first class of the 3 variables reptime, gc and hetchrom is definitely "covariate"

class(reptime)
[1] "Covariate" "R6"
class(gc)
[1] "Covariate" "R6"
class(hetchrom)
[1] "Covariate" "R6"

reptime
1 Covariates with features:
name type class field signature na.rm pad grep
1: ReplicationTiming numeric GRanges score NA 0 NA
gc
2 Covariates with features:
name type class field signature na.rm pad grep
1: C numeric GRanges C NA 0 NA
2: G numeric GRanges G NA 0 NA
hetchrom
1 Covariates with features:
name type class field signature na.rm pad grep
1: Heterochromatin numeric GRanges het NA 0 NA

Here is my session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] MASS_7.3-51.4 org.Hs.eg.db_3.8.2
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.36.3
[5] AnnotationDbi_1.46.0 Biobase_2.44.0
[7] rtracklayer_1.44.0 fishHook_0.1
[9] plotly_4.9.0 ggplot2_3.2.0
[11] gUtils_0.2.0 data.table_1.12.2
[13] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
[15] IRanges_2.18.1 S4Vectors_0.22.0
[17] BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lattice_0.20-38 tidyr_0.8.3
[4] prettyunits_1.0.2 Rsamtools_2.0.0 Biostrings_2.52.0
[7] zoo_1.8-6 assertthat_0.2.1 zeallot_0.1.0
[10] digest_0.6.20 R6_2.4.0 backports_1.1.4
[13] RSQLite_2.1.1 httr_1.4.0 pillar_1.4.2
[16] progress_1.2.2 zlibbioc_1.30.0 rlang_0.4.0
[19] lazyeval_0.2.2 rstudioapi_0.10 blob_1.2.0
[22] Matrix_1.2-17 BiocParallel_1.18.0 stringr_1.4.0
[25] htmlwidgets_1.3 biomaRt_2.40.1 RCurl_1.95-4.12
[28] bit_1.1-14 munsell_0.5.0 DelayedArray_0.10.0
[31] compiler_3.6.0 pkgconfig_2.0.2 htmltools_0.3.6
[34] tidyselect_0.2.5 SummarizedExperiment_1.14.0 tibble_2.1.3
[37] GenomeInfoDbData_1.2.1 matrixStats_0.54.0 XML_3.98-1.20
[40] viridisLite_0.3.0 crayon_1.3.4 dplyr_0.8.3
[43] withr_2.1.2 GenomicAlignments_1.20.1 bitops_1.0-6
[46] grid_3.6.0 jsonlite_1.6 gtable_0.3.0
[49] DBI_1.0.0 magrittr_1.5 scales_1.0.0
[52] stringi_1.4.3 XVector_0.24.0 vctrs_0.2.0
[55] tools_3.6.0 bit64_0.9-7 glue_1.3.1
[58] purrr_0.3.2 hms_0.5.0 yaml_2.2.0
[61] colorspace_1.4-1 memoise_1.1.0

offset(eligible) should be offset(log(eligible))?

@karinkumar and I were debugging our model and noticed a math bug. Because the covariate terms are exponentiated in the likelihood function for a negative binomial regression, and the offset is a scaling term (e.g. half as much eligible territory should predict half as many events), the offset term should be log'ed first.

See pages 1 and 2 of https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Negative_Binomial_Regression.pdf noting that t_i is ln(t_i) in the likelihood.

It seems everywhere I look for glm.nb, they all use offset(log(x)). e.g. https://stats.stackexchange.com/questions/180260/r-glm-nb-and-when-to-consider-using-an-offset-or-including-a-covariate

This makes a difference. See the covariates plot below. Darkest red is evenly spaced bins with offset(log(eligible)). Medium-red is offset(eligible). The green bars are binned so that each bin covers almost exactly the same amount of eligible territory, so the parameterization of the offset should not matter.

interpreting output from WES fishHook analysis

Hi, I ran fishHook with WES data. Mutations were input as a granges object obtained from a maf file. Gene annotations were from the gencode v38 gtf file. For coverage, I lifted over the bigwig file for the hg19 version (given in the tutorial) to the hg38 version and used that. I could get the q-q plot but when I looked at the results, I found that there were multiple rows for a single gene. For example, for TP53 I found 221 rows (please see attached file). Could you please help me interpret this result? Thank you.
TP53.xlsx

How to interpret the results?

in tutorial data,

seqnames | start | end | strand | width | gene_name | hid | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | eligible | p.neg | fdr.neg
17 | 7565097 | 7590856 | | 25760 | TP53 | 14970 | 1.1e-29 | 0.00000 | 99 | 5.58 | 2.070 | 0.0878 | 0.00184 | 1127 | 1 | 1
2 | 79384132 | 79386879 | | 2748 | REG3A | 2447 | 9e-08 | 0.00079 | 16 | 4.05 | 0.964 | 0.0305 | 0.00184 | 525 | 1 | 1
19 | 10596796 | 10614417 | | 17622 | KEAP1 | 16561 | 1.2e-07 | 0.00079 | 30 | 3.60 | 2.480 | 0.0222 | 0.00184 | 1350 | 1 | 1
8 | 88882973 | 88886296 | | 3324 | DCAF4L2 | 8280 | 4.3e-06 | 0.01800 | 21 | 3.27 | 2.180 | 0.0177 | 0.00184 | 1185 | 1 | 1

Interpretation of this result is that most likely a driver mutation is in the TP53 gene region. Should I interpret it like this? One step over, are TP53, REG3A, KEAP1, and DCAF4L2 statistically significant mutations developing lung adenocarcinoma?
But when you look at other programs, the resolution is very precise, not on a genetic basis, but on a single loci.

I'm not sure if FISHHOOK is a gene prioritization program or finding driver mutation program.

Please fill me in.
Thanks.

Let me get this straight about 'eligible territory'

Thanks. for your reply before.
But still i don't understand exactly for making coverage.

First of all, I know which tool to use to create the coverage.

But what I still don't know is whether I can define the GRCh38 genome file as eligible territory. In other words, should GRCh38 gtf be converted to bed to bigwig?

Or, read coverage varies from sample to sample, so I have to use different coverage files for each sample.

I'm very thank you for your kindness.

annotateTargets: Error: attempt to apply non-function

Hi,

Thank you for sharing this algorithm.

I'm attempting to run your demo script using the supplied data but I'm running into some troubles and was wondering if this is something known. The following command generates an error message.

anno = fish$annotateTargets(maxPtGene = 1, verbose = FALSE, PtIDCol = "patient_code")
Error: attempt to apply non-function

Two questions:

The function seems to be named annotate.targets in the manual, I don't know if this matters but is this only a typo?
According to the manual the Annotated-object used in later stages should be generated by calling FishHook$annotateTargets(). However, even if I swap to annotate.targets I can't seem to pass the fish variables into the function, it only responds with Error: attempt to apply non-function. I can use it by supplying the variables directly to the function but can't get the covariates to work (Error: object of type 'closure' is not subsettable when supplying the Covariates variable from c(rept). Do you have any input for this?

Working in R 3.4.2 on MacOS 10.12.6

Thanks

Fish error; there are no seqlevels of events that match hypothesis

library(fishHook)
library(gTrack)
library(rtracklayer)

mutations = dt2gr(fread('test_150_maftools.maf'))
genes = gr.sub(import('gencode.v27.annotation.gtf'))
genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3)
cds = import("gencode.v27.annotation.gtf")
exomecov = import('Z_NSLA_1004_T.bw')

head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
eligible = exomecov %Q% which(score>0.95)
eligible = reduce(intersect(eligible, cds, ignore.strand = TRUE))
events = mutations %Q% (Variant_Classification != 'Silent')

fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')

Error in .subset2(public_bind_env, "initialize")(...) :
Error: there are no seqlevels of events that match hypotheses
In addition: Warning messages:
1: In .subset2(public_bind_env, "initialize")(...) :
Warning: seqlevels of Hypotheses and Eligible appear to be in different formats
2: In .subset2(public_bind_env, "initialize")(...) :
Warning: seqlevels of Hypotheses and Events appear to be in different formats

exome coverage file for hg38

Hi,
In the exome part of the fishHook suite, is there is a file for hg38 analogous to
exomecov = import('http://mskilab.com/fishHook/hg19/exome_coverage.bw')?
Or, is there a way to liftover the bigwig file from hg19 to hg38?
Thanks

Bundled data for hg38/ensembl 88

Hi ,
I was wondering is that possible to use fishhook on hg38/ensembl88 based mutation ? Since the GDC is providing new called TCGA mutations on hg38, it would be better to provide tutorial about how to prepare the CDS reference data(cds definition file) of fishhook

Yao

unable to load WGS rds object

Hi,
I am trying to use fishHook on WGS data and was trying out the analysis mentioned in the tutorial. I was able to install fishHook but when I tried to read in the mutations via
mutations.wgs = readRDS("/gpfs/commons/groups/imielinski_lab/projects/FishHook/CurrentProtocols/db/de_identified_LUAD_mutations.rds")
I get the following error
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file '/gpfs/commons/groups/imielinski_lab/projects/FishHook/CurrentProtocols/db/de_identified_LUAD_mutations.rds', probable reason 'No such file or directory'
Could you please help me with this error? Thanks.

Could not find function Cov

Hello Marcin,

I've been using Fishhook to run the 1D model for Rameen. I needed to run it again and had to reinstall the package to use on a different version of R. Now when I try to create a covariate object (for instance, if I run this line in your tutorial hetchrom = Cov(hetchromdata, name = 'Heterochromatin')) I get the error: could not find function "Cov".

A version of fishhook I installed months ago still works, so possibly something has changed since then?

erroneous qqplot when adding covariate

Hi all,
I have some WGS data that I'd like to look at.
Q: Any recurrent mutations within the coding regions?

Events:

> head(ssvs[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
GRanges object with 6 ranges and 5 metadata columns:
      seqnames    ranges strand | Tumor_Sample_Barcode Variant_Type Variant_Classification
         <Rle> <IRanges>  <Rle> |             <factor>     <factor>               <factor>
  [1]        1  16965728      + |               BCCA1T          SNP      Missense_Mutation
  [2]        1  20345468      + |               BCCA1T          SNP      Missense_Mutation
  [3]        1  21823627      + |               BCCA1T          SNP      Missense_Mutation
  [4]        1 109192794      + |               BCCA1T          SNP      Missense_Mutation
  [5]        1 111766331      + |               BCCA1T          SNP      Missense_Mutation
  [6]        1 112913924      + |               BCCA1T          SNP      Missense_Mutation
      Reference_Allele Tumor_Seq_Allele2
           <character>       <character>
  [1]                G                 A
  [2]                G                 A
  [3]                C                 T
  [4]                A                 C
  [5]                T                 C
  [6]                A                 T

Eligible territory (GATK's WGS calling region):

> wgs_interval
GRanges object with 356 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]        1      10002-207666      *
    [2]        1     257668-297968      *
    [3]        1     347970-535988      *
    [4]        1    585990-2702781      *
    [5]        1  2746292-12954384      *
    ...      ...               ...    ...
  [352]        Y 21748373-21750013      *
  [353]        Y 21750316-21789281      *
  [354]        Y 21805283-26673214      *
  [355]        Y 56673216-56771509      *
  [356]        Y 56821511-56887902      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlenghts

Hypothesis:

gencode <- rtracklayer::import("/Non-coding/Elements/gencode.v37.annotation.gff3.gz")
cds <- gencode %Q% which(type == "CDS")
cds <- cds %Q% which(seqnames != "chrM")
cds <- cds %Q% which(level < 3)
genes <- gr.sub((cds), 'chr', '')

Run with no covariates:

fish_SSVs <- Fish(hypotheses = genes, 
                             events = ssvs, 
                             eligible = wgs_interval,
                             idcol = 'Tumor_Sample_Barcode')
fish_SSVs$score()

Added replication timing as a covariate:
From https://www2.replicationdomain.com/database.php
Acc: Int93941761

RT = gr.sub(import("/annotation_files/RT_BG01_NPC_hg38.bedgraph.gz"), 'chr', '')
sort(RT@seqnames)
RT = Cov(RT, field = "score", name = "Replication_timing") 

RT$data
[[1]]
GRanges object with 2160723 ranges and 1 metadata column:
            seqnames            ranges strand |      score
               <Rle>         <IRanges>  <Rle> |  <numeric>
        [1]        1       27919-27977      * |   0.306177
        [2]        1       40797-40855      * |   0.303592
        [3]        1       41261-41319      * |   0.303496
        [4]        1       42246-42304      * |   0.303294
        [5]        1       46704-46762      * |   0.302369
        ...      ...               ...    ... .        ...
  [2160719]        Y 57198996-57199054      * |  -0.161866
  [2160720]        Y 57200358-57200416      * |  -0.113958
  [2160721]        Y 57201187-57201245      * | -0.0840507
  [2160722]        Y 57203083-57203141      * | -0.0135262
  [2160723]        Y 57204578-57204636      * |  0.0441656

The qqplot now shows lambda < 1 .

> summary(fish_SSVs$model)

Call:
glm.nb(formula = formula, data = as.data.frame(tdt), maxit = iter, 
    init.theta = 0.9268776227, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8467  -0.3721  -0.3041  -0.2458   9.3879  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -7.67285    0.01335  -574.9   <2e-16 ***
Replication_timing -0.32102    0.01486   -21.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.9269) family taken to be 1)

    Null deviance: 30791  on 99999  degrees of freedom
Residual deviance: 30329  on 99998  degrees of freedom
AIC: 52366

Number of Fisher Scoring iterations: 1


              Theta:  0.9269 
          Std. Err.:  0.0437 

 2 x log-likelihood:  -52360.3110

Anyone has any ideas why this happened?

Thanks!
Al

Error when NB model doesn't fit: while ((it <- it + 1) < limit && abs(del) > eps) missing value where TRUE/FALSE needed

Hello @walaj and I were trying to run a fishhook on sparse data (very few breakpoints across the binned genome). Sometimes the data is sparse enough that the nb model doesn't fit. In fact, it throws an error that's within glm fitting function : Error in while ((it <- it + 1) < limit && abs(del) > eps) { : missing value where TRUE/FALSE needed

We are switching to a poisson model, but perhaps there is a better way to catch this error and print a message or redirect to a poisson model instead. Right now it just stops the fishhook model from running.

Help to plot 'QQplot' with command 'fish$qqp()' and Coverage problem

I ran test data (luad.maf and others).
Results are the same in the manual.
but 'fish$qqp()' not works.

how should I do that?

I have paired sequencing data. how to make coverage file?
one way is extracting common and another is extracting all.
which is right?

Thanks you for your reply in advance.
Thanks.

trinucleotide context code broken link

Hi there,

Nice looking tool.
In the tutorial there is a link to access the code for generating the nucelotide context file for any genome.
However the link is broken, is there any chance of getting this updated?

many thanks

James