ctlab / fgsea Goto Github PK

View Code? Open in Web Editor NEW

356.0 356.0 65.0 3.33 MB

Fast Gene Set Enrichment Analysis

License: Other

R 73.49% C++ 26.51%

fgsea's People

Contributors

Stargazers

Watchers

Forkers

alfredshawn alenzhao vladpetyuk inambioinfo guangchuangyu frenkiboy yniknafs rysterzhu artem-ii yuzhu2216 wenliangz rdinatale nageuse vadimfarutin lekanwang kevinvervier hrk2109 bioinfonerd-forks adrianbzg lagzxadr budalnik chrismiddleton cmksqtx kc-lan kunju-pitt kant valleybluelily auberginekenobi huerqiang le-chang jun-lizst tauqeer9 kewalinsamart gitter-badger chenyangonlyone rosscm seifudd genomicsnx misaka-dayu joshuasalazar1699 beatbioinfo mariehidjo stjordanis sudolin shaiberalon maflaschl iyalue yanyang312 vdsukhov morganef06 sjtujx tbilab ywang0428 jjlea sultanghazala kevincjnixon zhangbei123 smartgamer bigomics 5l1v3r1 sukses24 yimjohns siwei-deng urbankunej

fgsea's Issues

`theme_bw()` is hard-coded in some plots

Hi, I noticed that theme_bw() is called inside plotEnrichment(), which makes it more difficult to customize. Would it be possible to remove this?

The current ggplot2 recommendation is to instead use theme_set() to define themes globally, such as:

theme_set(theme_bw())

Best,
Mike

running fgsea

I am getting the error message:
"Error in fmatch(p, names(stats)) : attempt to set an attribute on NULL"

from running this command:
fgsea(pathways = c5_pathways, stats = as.numeric(My_Data[,2]), nperm = 100)

Anyone know what to do about this?

fgsea is not reproducible on Windows and OSX

For some reason fgsea when used in parallel mode on Windows and OSX doesn't respect seed settings. See test failed at https://ci.appveyor.com/project/assaron/fgsea/build/1.0.27#L1265 and https://travis-ci.org/ctlab/fgsea/jobs/318271040#L4562

It is highly unusual for Rcpp to be included in the Depends section of the DESCRIPTION. I may be wrong here, but I believe it should be listed instead in the Imports section. (It is correctly listed in LinkingTo.)

fgsea never ending

I have some problem for a while, it seems like fgsea never return the results:

system.time(gseaSizeEffect <- fgsea(grouping, comp1, nperm = 10000))
##    user  system elapsed 
##   0.168   0.116   0.517 
system.time(gseaSizeEffect <- fgsea(grouping, comp1, nperm = 20000))
##
## Timing stopped at: 0.28 0.416 731.2
system.time(gseaSizeEffect <- fgsea(grouping, comp1, nperm = 1000))
##    user  system elapsed 
##   0.052   0.000   0.058 
system.time(gseaSizeEffect <- fgsea(grouping, comp1, nperm = 2000))
## 
## Timing stopped at: 0.204 0.32 147.1

So while the 10k permutations takes 0.116 and 0.516 for the system and in total, when I double it it takes much more and I had to stop the function (I had this same function running for 3 days without returning any value), despite being reported that to the user it took less than a second the output wasn't ready after 10 minutes. And a similar thing happened when I used 1k permutation and doubled to 2k.

Sorry I couldn't make it reproducible, I really don't understand what can be triggering this. Here is some information about the data I am testing:

length(comp1)
## 505
length(grouping)
## 127
table(lengths(grouping))
## 
##  1  2  3  4  5  6  7  8  9 11 12 14 21 22 42 
## 61 16 10 13  4  3  3  4  1  4  3  2  1  1  1 
quantile(comp1)
##         0%        25%        50%        75%       100% 
## -0.2198301  0.0000000  0.0000000  0.0000000  0.2639637

Perhaps the problem is that ~78% of my stats are 0, and so they are equivalent in the position. But when I removed them I also had the same problem.

The session_info("fgsea"):

Session info -----------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   i686, linux-gnu             
 ui       RStudio (1.1.423)           
 language en_US                       
 collate  en_US.UTF-8                 
 tz       Europe/Madrid               
 date     2018-02-15                  

Packages ---------------------------------------------------------------------------------------------------------------------------------
 package        * version  date       source         
 assertthat       0.2.0    2017-04-11 CRAN (R 3.4.3) 
 BH               1.65.0-1 2017-08-24 CRAN (R 3.4.2) 
 BiocParallel     1.12.0   2017-11-02 Bioconductor   
 cli              1.0.0    2017-11-05 CRAN (R 3.4.3) 
 colorspace       1.3-2    2016-12-14 CRAN (R 3.4.2) 
 compiler         3.4.3    2017-12-01 local          
 crayon           1.3.4    2017-09-16 CRAN (R 3.4.2) 
 data.table     * 1.10.4-3 2017-10-27 CRAN (R 3.4.3) 
 dichromat        2.0-0    2013-01-24 CRAN (R 3.4.2) 
 digest           0.6.15   2018-01-28 CRAN (R 3.4.3) 
 fastmatch        1.1-0    2017-01-28 CRAN (R 3.4.3) 
 fgsea          * 1.4.1    2018-01-26 Bioconductor   
 futile.logger    1.4.3    2016-07-10 CRAN (R 3.4.2) 
 futile.options   1.0.0    2010-04-06 CRAN (R 3.4.2) 
 ggplot2        * 2.2.1    2016-12-30 CRAN (R 3.4.3) 
 graphics       * 3.4.3    2017-12-01 local          
 grDevices      * 3.4.3    2017-12-01 local          
 grid             3.4.3    2017-12-01 local          
 gridExtra        2.3      2017-09-09 CRAN (R 3.4.2) 
 gtable           0.2.0    2016-02-26 CRAN (R 3.4.2) 
 labeling         0.3      2014-08-23 CRAN (R 3.4.2) 
 lambda.r         1.2      2017-09-16 CRAN (R 3.4.2) 
 lazyeval         0.2.1    2017-10-29 CRAN (R 3.4.2) 
 magrittr         1.5      2014-11-22 CRAN (R 3.4.2) 
 MASS             7.3-48   2017-12-24 CRAN (R 3.4.3) 
 methods        * 3.4.3    2017-12-01 local          
 munsell          0.4.3    2016-02-13 CRAN (R 3.4.2) 
 parallel       * 3.4.3    2017-12-01 local          
 pillar           1.1.0    2018-01-14 CRAN (R 3.4.3) 
 plyr             1.8.4    2016-06-08 CRAN (R 3.4.2) 
 R6               2.2.2    2017-06-17 CRAN (R 3.4.3) 
 RColorBrewer   * 1.1-2    2014-12-07 CRAN (R 3.4.2) 
 Rcpp           * 0.12.15  2018-01-20 cran (@0.12.15)
 reshape2       * 1.4.3    2017-12-11 CRAN (R 3.4.3) 
 rlang            0.1.6    2017-12-21 CRAN (R 3.4.3) 
 scales           0.5.0    2017-08-24 CRAN (R 3.4.2) 
 snow             0.4-2    2016-10-14 CRAN (R 3.4.2) 
 stats          * 3.4.3    2017-12-01 local          
 stringi          1.1.6    2017-11-17 CRAN (R 3.4.2) 
 stringr          1.2.0    2017-02-18 CRAN (R 3.4.3) 
 tibble           1.4.2    2018-01-22 CRAN (R 3.4.3) 
 tools            3.4.3    2017-12-01 local          
 utf8             1.1.3    2018-01-03 CRAN (R 3.4.3) 
 utils          * 3.4.3    2017-12-01 local          
 viridisLite      0.3.0    2018-02-01 CRAN (R 3.4.3)

BPPARAM error

I recently updated to R 4.4.0 and reinstalled fgsea, however I have ran into an issue when running fgsea or fgseaMultilevel:

Error in if (!bpschedule(BPPARAM) || length(X) == 1L || bpnworkers(BPPARAM) ==  : 
  missing value where TRUE/FALSE needed

I've tried reinstalling both via GitHub or Bioconductor and I have the same issue. Thanks!

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.3.0                          usethis_1.6.1                          
 [3] fgsea_1.14.0                            BiocParallel_1.22.0                    
 [5] msigdbr_7.1.1                           dplyr_1.0.0                            
 [7] data.table_1.12.8                       topconfects_1.4.0                      
 [9] enrichR_2.1                             tmod_0.40                              
[11] hugene11sttranscriptcluster.db_8.7.0    org.Hs.eg.db_3.11.4                    
[13] pd.hugene.1.1.st.v1_3.14.1              DBI_1.1.0                              
[15] oligo_1.52.0                            oligoClasses_1.50.0                    
[17] RSQLite_2.2.0                           Biostrings_2.56.0                      
[19] XVector_0.28.0                          annotate_1.66.0                        
[21] XML_3.99-0.3                            limma_3.44.1                           
[23] RColorBrewer_1.1-2                      plyranges_1.8.0                        
[25] DiffBind_2.16.0                         SummarizedExperiment_1.18.1            
[27] DelayedArray_0.14.0                     matrixStats_0.56.0                     
[29] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.40.0                 
[31] AnnotationDbi_1.50.0                    Biobase_2.48.0                         
[33] GenomicRanges_1.40.0                    GenomeInfoDb_1.24.0                    
[35] IRanges_2.22.2                          S4Vectors_0.26.1                       
[37] BiocGenerics_0.34.0                     ChIPseeker_1.24.0                      

loaded via a namespace (and not attached):
  [1] estimability_1.3         rappdirs_0.3.1           rtracklayer_1.48.0       AnnotationForge_1.30.1  
  [5] tidyr_1.1.0              ggplot2_3.3.1            bit64_0.9-7              hwriter_1.3.2           
  [9] RCurl_1.98-1.2           generics_0.0.2           preprocessCore_1.50.0    callr_3.4.3             
 [13] cowplot_1.0.0            europepmc_0.4            bit_1.1-15.2             enrichplot_1.8.1        
 [17] base64url_1.4            xml2_1.3.2               assertthat_0.2.1         batchtools_0.9.13       
 [21] viridis_0.5.1            amap_0.8-18              hms_0.5.3                plotwidgets_0.4         
 [25] fansi_0.4.1              progress_1.2.2           caTools_1.18.0           dbplyr_1.4.4            
 [29] Rgraphviz_2.32.0         igraph_1.2.5             geneplotter_1.66.0       htmlwidgets_1.5.1       
 [33] purrr_0.3.4              ellipsis_0.3.1           backports_1.1.7          V8_3.1.0                
 [37] biomaRt_2.44.0           vctrs_0.3.0              remotes_2.1.1            withr_2.2.0             
 [41] ggforce_0.3.1            triebeard_0.3.0          DOT_0.1                  BSgenome_1.56.0         
 [45] checkmate_2.0.0          emmeans_1.4.7            GenomicAlignments_1.24.0 prettyunits_1.1.1       
 [49] DOSE_3.14.0              crayon_1.3.4             genefilter_1.70.0        edgeR_3.30.3            
 [53] pkgconfig_2.0.3          tweenr_1.0.1             nlme_3.1-148             pkgload_1.1.0           
 [57] rlang_0.4.6              lifecycle_0.2.0          affyio_1.58.0            BiocFileCache_1.12.0    
 [61] qusage_2.22.0            GOstats_2.54.0           rsvg_2.1                 rprojroot_1.3-2         
 [65] polyclip_1.10-0          graph_1.66.0             Matrix_1.2-18            urltools_1.7.3          
 [69] boot_1.3-25              beeswarm_0.2.3           ggridges_0.5.2           processx_3.4.2          
 [73] pheatmap_1.0.12          png_0.1-7                viridisLite_0.3.0        rjson_0.2.20            
 [77] bitops_1.0-6             KernSmooth_2.23-17       blob_1.2.1               stringr_1.4.0           
 [81] qvalue_2.20.0            ShortRead_1.46.0         brew_1.0-6               jpeg_0.1-8.1            
 [85] gridGraphics_0.5-0       scales_1.1.1             memoise_1.1.0            GSEABase_1.50.1         
 [89] magrittr_1.5             plyr_1.8.6               gplots_3.0.3             gdata_2.18.0            
 [93] zlibbioc_1.34.0          compiler_4.0.0           scatterpie_0.1.4         plotrix_3.7-8           
 [97] DESeq2_1.28.1            Rsamtools_2.4.0          cli_2.0.2                systemPipeR_1.22.0      
[101] Category_2.54.0          ps_1.3.3                 MASS_7.3-51.6            tidyselect_1.1.0        
[105] fftw_1.0-6               stringi_1.4.6            yaml_2.2.1               GOSemSim_2.14.0         
[109] askpass_1.1              locfit_1.5-9.4           latticeExtra_0.6-29      ggrepel_0.8.2           
[113] grid_4.0.0               VariantAnnotation_1.34.0 fastmatch_1.1-0          tools_4.0.0             
[117] rstudioapi_0.11          foreach_1.5.0            tagcloud_0.6             gridExtra_2.3           
[121] farver_2.0.3             ggraph_2.0.3             digest_0.6.25            rvcheck_0.1.8           
[125] BiocManager_1.30.10      ff_2.2-14.2              Rcpp_1.0.4.6             httr_1.4.1              
[129] colorspace_1.4-1         fs_1.4.1                 splines_4.0.0            RBGL_1.64.0             
[133] graphlayouts_0.7.0       ggplotify_0.0.5          sessioninfo_1.1.1        xtable_1.8-4            
[137] jsonlite_1.6.1           tidygraph_1.2.0          testthat_2.3.2           R6_2.4.1                
[141] pillar_1.4.4             htmltools_0.4.0          affxparser_1.60.0        glue_1.4.1              
[145] DT_0.13                  codetools_0.2-16         pkgbuild_1.0.8           mvtnorm_1.1-0           
[149] utf8_1.1.4               lattice_0.20-41          tibble_3.0.1             curl_4.3                
[153] gtools_3.8.2             GO.db_3.11.4             openssl_1.4.1            survival_3.1-12         
[157] desc_1.2.0               munsell_0.5.0            DO.db_2.9                GenomeInfoDbData_1.2.3  
[161] iterators_1.0.12         reshape2_1.4.4           gtable_0.3.0

Loading own data

Hi there,

I have been trying to explore your vignette and now wanted to try applying the same workflow to my own pre-ranked files. However, I keep getting the same error reported regarding my rnk file

> rnk.file <- system.file("extdata", "Cluster11_GSEA.rnk", package="fgsea")
> ranks <- read.table(rnk.file, header = TRUE, colClasses = c("character", "numeric"))
Error in read.table(rnk.file, header = TRUE, colClasses = c("character", :
no lines available in input
In addition: Warning message:
In file(file, "rt") :
file("") only supports open = "w+" and open = "w+b": using the former

I not sure the rnk file was formatted properly, but also not sure about calling the functions exactly how I am supposed to. Any help you could provide on this matter would be highly appreciated.

Thanks!

Standard errors for Normal enrichment score

Hi
I was wondering if we can obtain standard errors for the normal enrichment scores. If I assume correctly, the p value is being calculated by bootstrapping. A similar method can be implemented to find random normal enrichment scores and take their standard deviation to include as the standard error.

Preranked metric ties

Hi, great package.
I am wondering how it is manage when the metric of the ranked genes it is a tie, this for values not equal to 0.

The Broad Institute java software has a warning about this, but I did not find any information about it in the preprint nor the vignette.

Should ties be solve with jittering the duplicated values? or a warning be printed when there are ties in the ranked gene list?

collapsePathways() errors in padj threshold-dependent fashion after successful fgsea() run

Hello,

First of all, thanks for writing/maintaining such a fantastic package! I have recently encountered an issue with the collapsePathways() function in fgsea 1.6.0, in which I get the following error after a successful fgsea() run: "Error in FUN(X[[i]], ...) :
GSEA statistic is not defined when all genes are selected"

I am running collapsePathways() as described in the vignette (see below) though filtering my input fgseaRes df such that padj < 0.05 rather than 0.01. On some datasets this has worked fine but on others I get the above error. Interestingly, lowering the padj threshold down to 0.01 alleviates the problem. Any help understanding this would be much appreciated!

I've added some code below showing my calls, error messages, and sessionInfo(). The .gmt file was downloaded from MSigDB. I'd also be happy to email a file containing the gene ranking for proper reproducibility if that would be helpful.

Thanks again!
Andrew

head(ranked_genes)

   KLHL1   SLC6A5     LBX1    LRP1B    GLRA2   TRIM67 
4021.458 3393.911 3320.055 2859.702 2844.379 2825.953

bp_go <- gmtPathways("/home/andrew/projects/daily_log/09292018/c5.bp.v6.2.symbols.gmt")

fgseaRes <- fgsea(pathways = bp_go, 
                  stats = ranked_genes,
                  minSize=15,
                  maxSize=1000,
                  nperm=100000)

fgsea runs normally as far as I can tell - only warning me about some ties:

Warning message: In fgsea(pathways = bp_go, stats = ranked_genes, minSize = 15, maxSize = 1000, : There are ties in the preranked stats (0.18% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.

The results are expected for my data and give a number of significant categories:

fgseaRes %>% arrange(pval) %>% select(pathway, pval, padj, ES, NES, nMoreExtreme, size) %>% head()

pathway	pval	padj	ES	NES	nMoreExtreme	size
GO_NEUROGENESIS	1.00E-05	0.006736106	0.7743695	1.315848	0	960
GO_CENTRAL_NERVOUS_SYSTEM_DEVELOPMENT	1.00E-05	0.006736106	0.8157949	1.384007	0	615
GO_REGULATION_OF_NERVOUS_SYSTEM_DEVELOPMENT	1.00E-05	0.006736106	0.7999929	1.355705	0	523
GO_HEAD_DEVELOPMENT	1.01E-05	0.006736106	0.8081064	1.369204	0	513
GO_CELL_CELL_SIGNALING	2.04E-05	0.009308852	0.8186953	1.378237	1	344
GO_SYNAPTIC_SIGNALING	2.08E-05	0.009308852	0.8530924	1.420684	1	229

collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.05], 
                                      bp_go, ranked_genes)

Running collapsePathways() as above, though, gives the below error. Running it with padj < 0.01 instead eliminates the error and returns a result (though still producing the tie warnings).

Error in FUN(X[[i]], ...) : 
  GSEA statistic is not defined when all genes are selected
In addition: Warning messages:
1: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.1% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
3: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
4: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
5: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.17% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
6: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u2],  :
  There are ties in the preranked stats (0.29% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
7: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
8: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.18% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
9: In fgsea(pathways = pathways[pathwaysToCheck], stats = stats[u1],  :
  There are ties in the preranked stats (0.16% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              fgsea_1.6.0                 Rcpp_0.12.18                DESeq2_1.20.0              
 [5] SummarizedExperiment_1.10.1 DelayedArray_0.6.5          BiocParallel_1.14.2         Biobase_2.40.0             
 [9] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0         IRanges_2.14.11             S4Vectors_0.18.3           
[13] BiocGenerics_0.26.0         tximport_1.8.0              matrixStats_0.54.0          genefilter_1.62.0          
[17] ballgown_2.12.0             forcats_0.3.0               stringr_1.3.1               dplyr_0.7.6                
[21] purrr_0.2.5                 readr_1.1.1                 tidyr_0.8.1                 tibble_1.4.2               
[25] ggplot2_3.0.0               tidyverse_1.2.1            

loaded via a namespace (and not attached):
 [1] nlme_3.1-137             bitops_1.0-6             lubridate_1.7.4          bit64_0.9-7              RColorBrewer_1.1-2      
 [6] httr_1.3.1               tools_3.5.1              backports_1.1.2          R6_2.2.2                 rpart_4.1-13            
[11] Hmisc_4.1-1              DBI_1.0.0                lazyeval_0.2.1           mgcv_1.8-24              colorspace_1.3-2        
[16] nnet_7.3-12              withr_2.1.2              gridExtra_2.3            tidyselect_0.2.4         bit_1.1-14              
[21] compiler_3.5.1           cli_1.0.0                rvest_0.3.2              htmlTable_1.12           xml2_1.2.0              
[26] labeling_0.3             rtracklayer_1.40.6       checkmate_1.8.5          scales_1.0.0             digest_0.6.17           
[31] Rsamtools_1.32.3         foreign_0.8-71           XVector_0.20.0           htmltools_0.3.6          base64enc_0.1-3         
[36] pkgconfig_2.0.2          limma_3.36.3             htmlwidgets_1.2          rlang_0.2.2              readxl_1.1.0            
[41] rstudioapi_0.7           RSQLite_2.1.1            bindr_0.1.1              jsonlite_1.5             acepack_1.4.1           
[46] RCurl_1.95-4.11          magrittr_1.5             GenomeInfoDbData_1.1.0   Formula_1.2-3            Matrix_1.2-14           
[51] munsell_0.5.0            stringi_1.2.4            zlibbioc_1.26.0          plyr_1.8.4               grid_3.5.1              
[56] blob_1.1.1               crayon_1.3.4             lattice_0.20-35          Biostrings_2.48.0        haven_1.1.2             
[61] splines_3.5.1            annotate_1.58.0          hms_0.4.2                locfit_1.5-9.1           knitr_1.20              
[66] pillar_1.3.0             geneplotter_1.58.0       fastmatch_1.1-0          XML_3.98-1.16            glue_1.3.0              
[71] latticeExtra_0.6-28      data.table_1.11.4        modelr_0.1.2             cellranger_1.1.0         gtable_0.2.0            
[76] assertthat_0.2.0         xtable_1.8-3             broom_0.5.0              survival_2.42-6          GenomicAlignments_1.16.0
[81] AnnotationDbi_1.42.1     memoise_1.1.0            cluster_2.0.7-1          sva_3.28.0

ranked_genes.tsv.gz
c5.bp.v6.2.symbols.gmt.gz

Error in plotEnrichment() when input contains NA values.

I found that this error occurs when the input named vector of scores contain NA values.

plotEnrichment(zebrafish_GO[["extracellular_matrix(2)"]], ranked_vec)
Error in if (NR == 0) { : missing value where TRUE/FALSE needed
traceback()
2: calcGseaStat(statsAdj, selectedStats = pathway, returnAllExtremes = TRUE)
1: plotEnrichment(zebrafish_GO[["extracellular_matrix(2)"]], ranked_vec)

It might be nice to check for NAs in the input and give a more informative error.

By the way, fgsea() does not give an error given the same NA-containing input.

collapsePathways

There is a typo or something in the collapsePathways function as it uses a variable called "pathway" that is not defined anywhere in the function. This seems to have appeared sometime between the current version and version 1.8.1 as my computer with the older version runs just fine.

Best,
Tallulah

fGSEA for counting data

Hi,

I have done a gene presence absence analysis across two populations. There are some genes that are significantly present in one population. I also have the related GOs of all genes in both populations. Can I use fGSEA to find enriched GOs among the significantly present genes? In this case the "Gene-level statistic" would be the genes p-value. Can fGSEA handles p-values like 2E-98, ...0.01 etc?

Thanks!

fgsea NA values?

Hi, fgsea works really well for me however occasionally I get a dataset that has NA for the pval value. Does anyone know why that is the case? Thanks!

here is an example output.

                                                                pathway pval padj          ES NES nMoreExtreme size                                  leadingEdge
 1:                                          BP_AMIDE_BIOSYNTHETIC_PROCESS   NA   NA -0.11664589  NA            0  736 ST8SIA2,PIWIL1,IGF2BP1,GGT5,EEF1A2,SMPD3,...
 2:                                          BP_ANIMAL_ORGAN_MORPHOGENESIS   NA   NA -0.24180857  NA            0  944    COL2A1,DMP1,AMBN,IFITM5,MMP20,COL11A1,...
 3:                                                    BP_AXON_DEVELOPMENT   NA   NA -0.19181135  NA            0  479       SPTA1,DPYSL5,TNN,MYPN,FEZF2,CTNNA2,...
 4:                                 BP_CALCIUM_ION_TRANSMEMBRANE_TRANSPORT   NA   NA -0.23682943  NA            0  285    SLC8A3,TRDN,S100A1,CACNA1E,CALCR,R

fgsea_1.12.0

this is the code I run.
fgsea(stats=glist, pathways=pathways.in , minSize=15, maxSize=50000, nperm=1000)

what is gseaParam?

What is the gseaParam argument in fgsea()?
The help page just says "GSEA parameter value" and I couldn't find it in the GSEA docs.
Thanks!

Results of fgsea

Hi there,

I installed "fgsea" and input a list of pre-ranked genes.
Those genes were selected based on breast cancer.
The weights were t score by comparing the expression between ER+ and ER- patients.
When I ran fgsea and the result returned 18 pathways.
Can I tell which pathways are enriched in ER+ patients and which are enriched in ER- ones?

Do you have any suggestion on the weights? Which kind of statistic data is the best weight for fgsea?

Thank you.

Change size of vertical lines

I am using fgsea with a larger dataset than usual and a high percentage of the rank overlaps with the pathway. This causes the vertical lines in the indicator bar to appear as a continuous black block. I have two inconvenient workarounds (subsampling the dataset and chaning the width of the plot) but, ideally I would like to change the size of the individual vertical lines to prevent them form melting together. Is there a way to change the size of the vertical line ?
plot_Gsea_OCAB.pdf

plotGseaTable saveable to a variable like plotEnrichment

I have a Shiny app that I'm trying to use to display multiple plots. I encountered a problem where the result of plotGseaTable gets displayed in the RStudio viewer window instead of in the browser with the rest of my page. I've found that plotEnrichment has no issue like this, but plotGseaTable does.

Here is a minimum not-working example:

library(dplyr)
library(fgsea)
library(ggplot2)
gseaResult <- fgsea(pathways = examplePathways, stats = exampleRanks, nperm = 10)
topPathways <- gseaResult[NES > 0][head(order(desc(NES)), n = 10), pathway]
gseaPlot <- plotGseaTable(examplePathways[topPathways], exampleRanks, gseaResult)

Even though it's been stored in a variable, it shows up in the RStudio plot window. In the context of my Shiny app, this causes it to show up in the plot panel and not in the browser.

Whereas if you try:

gseaEnrichment <- plotEnrichment(examplePathways[[topPathwayUp]], exampleRanks)

it doesn't show up in the plot window.

See this SO question for a larger example of how it's not working for me:

https://stackoverflow.com/questions/55421131/shiny-plot-shows-up-in-rstudio-window-instead-of-in-browser

A colleague could not reproduce the issue and they have version 1.6 of fgsea, whereas I've seen the problem with 1.8 (Bioconductor) and 1.95 (Github).

Error in plotGseaTable()

Good evening!

I posted here a while ago when I was starting to get familiarized with the fgsea package, and the help that was provided to you helped me go a long way! I'm also done with my analyses, except that I am now generating panels that would be suitable for figures in articles. I am getting an error when trying to get the plotGseaTable() over and over again, and from what I understand it seems that it might be a formatting issue. I'm putting the code and the error below, I'd be happy to share more of the coding if that is necessary!

> topPathwaysUp_1.1 <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
> topPathwaysDown_1.1 <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
> topPathways_1.1 <- c(topPathwaysUp_1.1, rev(topPathwaysDown_1.1))
> plotGseaTable(examplePathways[topPathways_1.1], ranks_1.1, fgseaRes, gseaParam = 0.5)
Error: Aesthetics must be either length 1 or the same as the data (1): x, xend, yend

Any idea / advice to fix this? Thanks much for your help!
-Pauline

Can not run fgsea in R 3.4.2

Hi,

I successfully installed fgsea in R 3.4.2.
But, when I run the fgsea, it caused R crash.
Can you figure it out?

Thanks,
Per

Issue Installing fgsea for R3.3.1

Hi,
First, thank you for the package.
I have issues install fgsea. I get the following error (I think it may have to do with GCC version l I am using GCC 4.4):
g++ -std=c++0x -I/home/USE/tools/R-3.3.1/include -DNDEBUG -I/usr/local/include -I"/home/lseridi/tools/R-3.3.1/library/Rcpp/include" -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
g++ -std=c++0x -I/home/USE/tools/R-3.3.1/include -DNDEBUG -I/usr/local/include -I"/home/lseridi/tools/R-3.3.1/library/Rcpp/include" -fpic -g -O2 -c fastGSEA.cpp -o fastGSEA.o
fastGSEA.cpp: In function ‘Rcpp::IntegerVector combination(const int&, const int&, std::mt19937&)’:
fastGSEA.cpp:318: error: ‘uniform_int_distribution’ is not a member of ‘std’
fastGSEA.cpp:318: error: expected primary-expression before ‘int’
fastGSEA.cpp:318: error: expected ‘;’ before ‘int’
fastGSEA.cpp:324: error: ‘uni’ was not declared in this scope
make: *** [fastGSEA.o] Error 1
ERROR: compilation failed for package ‘fgsea’

Thank you,

Cannot install fgsea

Hello,

I was using fgsea last week and it worked well. I had to reinstall the package today but now it's no longer working. I saw that the package was updated yesterday and I'm wondering if these errors may be related to that.

I tried installing fgsea and it seems to work fine:

> if (!requireNamespace("BiocManager", quietly = TRUE))
+   install.packages("BiocManager")
> 
> BiocManager::install("fgsea")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)
Installing package(s) 'fgsea'
trying URL 'https://bioconductor.org/packages/3.10/bioc/bin/windows/contrib/3.6/fgsea_1.12.0.zip'
Content type 'application/zip' length 1430372 bytes (1.4 MB)
downloaded 1.4 MB

package ‘fgsea’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\natha\AppData\Local\Temp\RtmpWE48pY\downloaded_packages
Installation path not writeable, unable to update packages: boot, foreign, MASS, nlme
Old packages: 'dplyr', 'mvtnorm', 'rlang', 'stringi', 'tidyselect'
Update all/some/none? [a/s/n]: 
a

  There are binary versions available but the source versions are later:
           binary source needs_compilation
dplyr       0.8.3  0.8.4              TRUE
mvtnorm    1.0-11 1.0-12              TRUE
rlang       0.4.3  0.4.4              TRUE
stringi     1.4.4  1.4.5              TRUE
tidyselect  0.2.5  1.0.0              TRUE

  Binaries will be installed
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/dplyr_0.8.3.zip'
Content type 'application/zip' length 3224388 bytes (3.1 MB)
downloaded 3.1 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/mvtnorm_1.0-11.zip'
Content type 'application/zip' length 271942 bytes (265 KB)
downloaded 265 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/rlang_0.4.3.zip'
Content type 'application/zip' length 1121133 bytes (1.1 MB)
downloaded 1.1 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/stringi_1.4.4.zip'
Content type 'application/zip' length 15307078 bytes (14.6 MB)
downloaded 14.6 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/tidyselect_0.2.5.zip'
Content type 'application/zip' length 624335 bytes (609 KB)
downloaded 609 KB

package ‘dplyr’ successfully unpacked and MD5 sums checked
package ‘mvtnorm’ successfully unpacked and MD5 sums checked
package ‘rlang’ successfully unpacked and MD5 sums checked
package ‘stringi’ successfully unpacked and MD5 sums checked
package ‘tidyselect’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\natha\AppData\Local\Temp\RtmpWE48pY\downloaded_packages

When I try to load the package, it doesn't work. Any ideas?

> library(fgsea)
Loading required package: Rcpp
Error: package or namespace load failed for ‘fgsea’ in get(Info[i, 1], envir = env):
 cannot open file 'C:/Users/natha/Documents/R/win-library/3.6/BiocParallel/R/BiocParallel.rdb': No such file or directory

Better - opensource licence ?

@assaron thanks for the package, starting to explore it now. I have a couple of questions, but i'll keep em as separate issues.

The licence for this repo aren't very permissive, from what I understand reading a bit here and here I can't take a fork of the repo and/or modify the source in any way. I'm not sure if this is what you've intended. It would be great if you could update your licence to be more opensource like. Quoting from one of the article I've mentioned above

When you make a creative work (which includes code), the work is under exclusive copyright by default. Unless you include a license that specifies otherwise, nobody else can copy, distribute, or modify your work without being at risk of take-downs, shake-downs, or litigation.

Also, this isn't directly related to this repo but shiny app https://github.com/ctlab/fgsea-web which you also contribute to, doesn't have a licence at all, would be great if you could add opensource licence to that repo too

This is pretty good site that helped me to pick a licence.

cheers

No Significant pvalue_adjusted

Hi, Thanks for creating this awesome package.

I have a doubt, I have been trying to do fgsea in R using pre-ranked based on log2fold change values. I notice that fgsea do the sorting from the highest positive to lowest negative. I tried several times with several datasets but never got a single significant (fdr < 0.01) pathway in several genesets. I thought this was very strange and did a faked pre-ranked file with all Glycolysis set from kegg in the edge of the list to get a super high enrichment score (Fig1). Even then, no significance was reported by fgsea. A low enriched pathway got more significance than the Glycolysis (Fig2). Can you point me to what I could be doing wrong?

I have attached the R objects that I am using. (pathway_rank_R_iamges.zip)

Thank you.

Fig1

Fig2

pathway_rank_R_iamges.zip

Parallel backend running out of resources

Hi,

First, very nice package.

If I run fgsea hundreds or thousands of times by means of an lapply or foreach::foreach, either on linux or Mac, it generates a lot of zombie processes and then runs out of resources. It would be nice to turn off the BPPARAM or nproc option and run serially. Even better, I can then wrap this inside a solid and stable backend like doParallel combined with an iterator like foreach and not have to deal with an unsatisfactory BiocParallel wrapper.

Why I am running this hundreds or thousands of times has to do with a specific application not (historically) common in bioinformatics data sets.

Thanks in advance,
Chris

Cumulative test

Hi.

I understand that both GSEA and FSEA yield to the same results, but I'm not sure the methodology it uses.

Does it uses Kolmogorov-Smirnov a custom cumulative test or a modified version of Kolmogorov?

Thanks

scoring scheme

Hi,
great package!
To calculate the enrichment score in GSEApreranked tool from the broad institute they offer 4 options (classic, weighted, weighted_p2, weighted_p3.) How are you handling these settings under the hood and are you considering to add these options to your package?

getting different results from multiple fgsea runs (unstable)

Dear Dr. Sergushichev,
Thanks for the great fast GSEA implementation in R.
I have issues when running fgsea multiple times (with same ranked list and settings).
I get slightly different results and this is kind of annoying for reproducibility concerns.
Is there a way to set the seed within the function? right now it is: seeds <- sample.int(10^9, length(permPerProc))
Would greatly appreciate your help
Best

Question about collapsePathways() logic

Hi,

I just discovered the fgsea package and I love it. Very clean and easy to use.

I tried using the collapsePathways function on my results and it threw out a few pathways that I intuitively believed shouldn't have much overlap. So I started looking into the source code and realized two things I wanted to let you know about.

First, the main problem was this parameter: nperm = 10/pval.threshold

In the following chunk of code, which uses the universe of genes WITHOUT those from the query set, the fgseaSimple() call for 200 permutations returned a p-value of ~.08. Therefore, the set was deemed redundant and discarded.

u1 <- setdiff(universe, pathways[[p]])

fgseaRes1 <- fgseaSimple(pathways = pathways[pathwaysToCheck], 
  stats = stats[u1], nperm = nperm, maxSize = length(u1) - 
    1, nproc = 1, gseaParam = gseaParam)

However, running the exact same parameters on fgseaMultilevel returned a p-value of ~10^-8, which fits my intuition that the gene sets have almost no overlap and are not redundant.

I tried increasing the number of permutations to the call above and got the following results.

As you can see, even though the "true" p-value for that set should be very low, at the default settings of pcutoff .05 and 200 permutations, it fails significance testing, and likewise fails again at pcutoff .01 and 1000 permutations.

I'm not sure whether this is due to some bizarre peculiarity with my gene sets or not (i am working with s. cerevisiae GO terms), but I would like to recommend changing the default setting to nperm = 100/pval.threshold or changing the test to fgseaMultilevel

Second, the logic of the second fgsea call makes no sense to me

At this part:

u2 <- pathways[[p]]
fgseaRes2 <- fgseaSimple(pathways = pathways[pathwaysToCheck], 
  stats = stats[u2], nperm = nperm, maxSize = length(u2) - 
    1, nproc = 1, gseaParam = gseaParam)

minPval[fgseaRes2$pathway] <- pmin(minPval[fgseaRes2$pathway], 
  fgseaRes2$pval)

parentPathways[names(which(minPval > pval.threshold))] <- p

The minimum p-value is merged into the minPval vector, and then any pathways not passing the p-value cutoff are marked redundant. This second part seems backwards... it is asking whether you still get pathway enrichment using just the genes from the query pathway "p." You would expect there to be NO enrichment for non-overlapping gene sets and a high p-value.

As an example, consider gene sets A and B. Both are highly significantly enriched in the dataset, and set A has 400 genes, and set B has the same 400 genes plus 1 more. Obviously, we would want to mark these as redundant.

In the above algorithm, the first test removes the 400 genes from set A and tests set B for enrichment. This returns a high p-value which would fail the p value cutoff.

In the second test, using only the 400 genes from set A, you would get a VERY significant enrichment in set B. Since the minimum p-value from both tests is stored in the minPval vector, gene set B will actually be marked as non-redundant.

Do I have this correct or am I missing something?

Perhaps instead we should do a second test here for anything passing the p-value cutoff, and mark those as redundant. Alternatively, I'm not sure this second test is needed at all. Just the first test, checking the pathways with the query pathway genes removed, seems sufficient.

Thanks,
Jason

Warning about recycled values

Hi,

I am getting this warning when running fgsea on reactome pathways obtained from the package's reactomePathways function, using my top 100 genes:

Warning messages:
1: In `[.data.table`(pvals, , `:=`(leadingEdge, list(leadingEdges))) :
  Supplied 192 items to be assigned to 745 items of column 'leadingEdge' (recycled leaving remainder of 169 items)

The returned value contains many NAs in place of the pathways.
Will try to narrow this issue down, but maybe you can check if there is no cbind operation that would lead to this.

pathways for which P-values were not calculated properly due to unbalanced gene-level statistic values

Hi, Dear developer
when using fgsea, I found the below warning

And this warning will make the correspondent pvalue in pathway become NA. So I am curious about the cause of this warning. Is it means my postive statistic values number is not same as negative?

> table(vals > 0)

FALSE  TRUE 
24237 13099

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/sysoft/R-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /opt/sysoft/R-3.6.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fgsea_1.12.0    Rcpp_1.0.3      forcats_0.4.0   stringr_1.4.0  
 [5] dplyr_1.0.0     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [9] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0    haven_2.2.0         lattice_0.20-38    
 [4] colorspace_1.4-1    vctrs_0.3.2         generics_0.0.2     
 [7] rlang_0.4.7         pillar_1.4.3        glue_1.4.1         
[10] withr_2.1.2         DBI_1.1.0           BiocParallel_1.19.6
[13] dbplyr_1.4.2        modelr_0.1.5        readxl_1.3.1       
[16] lifecycle_0.2.0     munsell_0.5.0       gtable_0.3.0       
[19] cellranger_1.1.0    rvest_0.3.5         parallel_3.6.1     
[22] fansi_0.4.0         broom_0.5.3         scales_1.1.0       
[25] backports_1.1.5     jsonlite_1.6        fs_1.3.1           
[28] gridExtra_2.3       fastmatch_1.1-0     hms_0.5.2          
[31] packrat_0.5.0       stringi_1.4.3       grid_3.6.1         
[34] cli_2.0.0           tools_3.6.1         magrittr_1.5       
[37] lazyeval_0.2.2      crayon_1.3.4        pkgconfig_2.0.3    
[40] ellipsis_0.3.0      Matrix_1.2-18       data.table_1.12.6  
[43] xml2_1.2.2          reprex_0.3.0        lubridate_1.7.4    
[46] assertthat_0.2.1    httr_1.4.1          rstudioapi_0.10    
[49] R6_2.4.1            nlme_3.1-143        compiler_3.6.1

Best wishes

Guandong Shang

Ranking metric ascending or descending

I read the manual but could not figure out what the ranking metric should be, ascending tor descending? In JAVA GSEA, there is an option to sort the gene list based on either ascending to descending metric.

Significance of gene-sets higher with increasing nperm values?

Hi,

A collaborator of mine recommended fgsea to apppy on some microarray data that we have generated. I was really happy that it's very fast. I made a gene set gathering BTMs (Li et al. 2014) + kegg + reactome + hallmark. To my surprise however, when I ran fgsea changing only the number of permutations (from 1000 to 100000) my padj goes from non significant (at 1000 perms they are all over 0.23) to very significant, to the point that with 100000 permutations the first seven processes are signifcant at padj < .05. How come this be? I know from the documentation that the minimum nominal p-value is a function of the nperm parameter.

I'm happy to send my rank file if you need.

This is how I ran fgsea:

# Kegg + Reactome + Hallmark + BTM gene-sets from Msigdb
pathways.kegg.react.hallmark <- gmtPathways("gene-sets/kegg_reactome_hallmark_gene_sets.gmt")

# Load rank file (decreasing log2FC values)
v1.7dpi.rnk <- read.delim(file="ranked/V1_7dpi.rnk", header = T, stringsAsFactors = F)
v1.7dpi.rnk <- deframe(v1.7dpi.rnk)

fgseaRes <- fgsea(pathways=pathways.kegg.react.hallmark, 
                  stats=v1.7dpi.rnk,
                  nperm=100000)

only the nperm parameter was changed between runs.

do you only support ENTREZ ids ?

@assaron here come my second question..

the gene names/ids have to be of the same type in both gene set and ranked list right? picking a the source a bit it seem that you only support ENTREZ ids right?

We mostly use ensembl ids, those that start with ENS. Running the shiny app from https://github.com/ctlab/fgsea-web I get this error

Error: You must provide at least one key to the has

which I think comes from fgse package and not shiny and has to do with the fact that my Ids don't match between ranked list and the gene set(s)

any ides how to fix the error and/or also make fgsea more generic to work with any gene sets?

slightly aside; often people have then own gene set that they want to test out i.e "favourite gene list" so the shiny app should support this. I can fix that given the licence have been updated

Cheers

struggling to understand fgsea output

Hi, Thanks for making this package which made gsea very convenient for R user. Recently, we have done a few gsea using custom gene list from differentially expression results from DESeq2. We used the t-statistics as the ranking parameter. Surprisingly, we see negative ES and NES whereas most of our genes are upregulated (from the custom list), only a handful are downregulated. Can you advise anything?

Best wishes
Nurun

plotEnrichment ranks labels

Is it possible to add an option to add ranks labels to plotEnrichment()? Currently, it is not possible to determine the ranks order from the plot without some additional information. For example, if you are comparing KO to WT, it would be helpful add "KO" and "WT" to opposite ends of the X (ranks) axis.

Here is an example from limma::barcodeplot() to illustrate what I am trying convey:

If that is outside the scope of this package, do you have any suggestions on how to best add something like that with least interference with other elements?

Add support for one-tailed GSEA tests

This may be outside of the desired scope of your software, but it would be quite useful if fgsea could be extended to support one-tailed GSEA tests.

This way, fgsea could be used to analyze other non-expression single-tailed gene statistics, and could also be used to look abs(fold change) rankings.

Here is an example of another package which implements such an approach:

scRNA-seq: how to create the Ranked list?

Hello,

I have some clusters with differentially expresses genes generated using Seurat. I also have a highly curated gene list and I wish to perform enrichment analysis using these on each cluster picking the top X genes. I understand the differentially expressed genes in each cluster can be input to your algorithm as a pathway. But I am confused what the ranked list should be? Is it the curated gene list? If yes, could you explain the values to be input into this list?

Thank you,
Asma

Crash running fgsea example

Hi,
I found a crash error when I tested the web example. I am not able to fix it by myself. I couldn't run my own experiment, neither the toy example from the web. Hope you could help me. By the way, I got the same error using two installation ways: Bioconductor and devtool/github.

I have tried to run the example from the web using:

# Loading libraries
library(data.table)
library(fgsea)
library(ggplot2)

# Loading example pathways and gene-level statistics:
data(examplePathways)
data(exampleRanks)

# Running fgsea (should take about 10 seconds):
fgseaRes <- fgsea(pathways = examplePathways, 
                  stats = exampleRanks,
                  minSize=15,
                  maxSize=500,
                  nperm=10000)

This is the error I got:

Error in registered()[[bpparamClass]] :    attempt to select less than one element in get1index

My session info:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_2.2.1       fgsea_1.3.1.9000    Rcpp_0.12.14        data.table_1.10.4-3

loaded via a namespace (and not attached):
 [1] lattice_0.20-35    assertthat_0.1     grid_3.4.0         plyr_1.8.4         gtable_0.2.0      
 [6] scales_0.4.1       lazyeval_0.2.0     Matrix_1.2-8       fastmatch_1.1-0    BiocParallel_1.8.2
[11] tools_3.4.0        munsell_0.4.3      parallel_3.4.0     compiler_3.4.0     colorspace_1.3-2  
[16] gridExtra_2.3      tibble_1.2

Thank you very much for your help!

All the best,
Javier

Bug when no genes pass the min/max size threshold

Great job on the tool!

I spotted a bug here: https://github.com/ctlab/fgsea/blob/master/R/fgsea.R#L140

If pathwaysFiltered is an empty vector numeric(0) (because no genes passed the min/max thresholds) calling sapply(pathwaysFiltered, length) on it will return a list() (one of R's kinks), which means that the next line will be max(list()), which will give an error. You can get around this by checking to see if length(pathwaysSizes) > 0 .

Identical padj among the top significant pathways

Hi,
one thing that caught my attention but I couldn't find an explanation is a reason for identical adjusted pvalues among the most significant pathways. Is this a normal thing to happen? I noticed that now in several fgsea runs.

Issue with ranking

I've started today with fgsea and I found something for which I've no explanation.

I’ve a ranking of 293 genes (it's just a test) from a DEG experiment, I’ve ordered genes by its p-value in two different ways:

A) sign of logFC * abs(log10(pvalue))
B) From 1 to 293, being 1 the most significant gene down-regulated, 293 the most significant up-regulated gene, 2 would be the second most significant gene down-regulated, 292 as you imagine is the second most significant gene up-regulated and so on for the rest of genes.

I took the top 20 up/down genes from this ranking as gene sets and performed a fgsea analysis.

fgseaRes <- fgsea(pathways = rnk.sig[1:2], stats = rnk.list[[1]], minSize=5, maxSize=500, nperm=10000, nproc = 15)

The ES were as expected whenever I used both rankings, for the UP gene set the ES was 1 and for the DN gene set the ES was -1.

The problem becomes when checking the p-values, if I use the ranking A, both gene sets (UP and DN) are significant with similar p-values.
But this doesn’t happen when I input the ranking B, the UP gene set is significant while the DN does not. P-value seems to range between 1 and 0.3.

I’ve check for each analysis the plots by using the function plotEnrichment() and in all cases are ok, with genes totally skewed towards one of the extremes of the ranking

I’m not sure what is happening.

Thanks for your help.

Plot modifications

Hi! Would it be possible to add arguments to modify plots? Especially the output of the function plotGseaTable is a bit limited and changing fonts and colors of signaling pathways and bars could be nice. Maybe that would be possible by altering the source code, but was not able to do that.

Nonetheless, this is a very nice and efficient package!

Ranking the genes from Seurat DE output file

I am using this tool for enrichment analysis from DESeq2 results file, where I use Wald statistic for genes prerank. It makes things very easy.
In Seurat, DESeq2 test doesn't return with Wald statistic values. How to prerank genes for fgsea? DE Output file contains p_val, avg_logFC, pct.1, pct.2 and p_val_adj?
I am new in this field, sorry I don't know much.

Leading Edges

Really great code Alexei, one missing feature that would be greatly helpful and should not be too difficult to implement is to return the leading edge genes (http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Running_a_Leading) for the pathway to understand the genes that drive the enrichment. The R code to do something like this should be along the lines of:

if(maxP > -minP) {
leadingEdge <- names(geneList[1:which(runningScore == maxP)])
leadingEdge <- leadingEdge[leadingEdge %in% geneSet]
leadingEdge <- paste0(leadingEdge, collapse='|')
}else{
leadingEdge <- names(geneList[which(runningScore == minP):
length(geneList)])
leadingEdge <- laggingEdge[leadingEdge %in% geneSet]
leadingEdge <- paste0(leadingEdge, collapse='|')
}

Setting colwidth to zero doesn't seem to work properly

Hi @assaron,

Setting colwidth to zero doesn't seem to work as expected: expected behavior would be not drawing the column at all. Please, see the example below:

library(data.table)
library(fgsea)
library(ggplot2)

data(examplePathways)
data(exampleRanks)

fgseaRes <- fgsea(pathways = examplePathways, 
                  stats = exampleRanks,
                  minSize=15,
                  maxSize=500,
                  nperm=10000)

topPathwaysUp <- fgseaRes[ES > 0, ][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0, ][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes, 
              gseaParam = 0.5, colwidths = c(0, 1, 0.5, 0.5, 0))

And after you plot GSEA table you can easily see "leftovers" on the most left and most right columns.

Cheers and best of luck
Konstantin

fgsea for many rank vectors

Hi.

I want to compute fgsea for many rank vectors (hundreds, thousands...) with the same genesets. Currently doing it in a loop or actually wrapped mclapply but it would be nice to let fgsea accept a matrix instead of vector for the stats to avoid any overhead. Would this be faster and feasible?

Ivo

fgseaMultilevel() hangs with some pathway/gene combinations

I successfully ran fgsea many times. I finally ran into a problem. One particular pathway and gene list combination causes fgsea to hang. I am just going to include a reproducible example.

Set up the environment (includes one ranked gene list and a list of four pathways):

library(fgsea)
library(BiocParallel)
register(SerialParam())

download.file("https://www.dropbox.com/s/e6lgirf3xt1b3to/fgsea-pathways.rds?raw=1", "fgsea-p.rds")
download.file("https://www.dropbox.com/s/8znl53ha7wylvdc/fgsea-stats.rds?raw=1", "fgsea-s.rds")

fgseap = readRDS("fgsea-p.rds")
fgseas = readRDS("fgsea-s.rds")

The original fsgea() method works:

fgsea_res = fgsea(pathways = fgseap, stats = fgseas, nperm = 1000)

Using fgseaMultilevel() method works if you skip the second pathway:

fgsea_res = fgseaMultilevel(pathways = fgseap[c(1,3,4)], stats = fgseas)

Randomly subsetting the genes works (from about 20k to 10k):

fgsea_res = fgseaMultilevel(pathways = fgseap, stats = sample(fgseas, 10000))

Including all the pathways and all the genes makes it hang (the command runs forever):

fgsea_res = fgseaMultilevel(pathways = fgseap, stats = fgseas)

I tried this on two different systems with different fgsea versions (1.10.0 and 1.11.1) and it happens on both.

installing error

install_github("ctlab/fgsea")
Downloading GitHub repo ctlab/fgsea@master
from URL https://api.github.com/repos/ctlab/fgsea/zipball/master
Installing fgsea
'/opt/apps/R/3.3.2/lib64/R/bin/R' --no-site-file --no-environ --no-save
--no-restore --quiet CMD INSTALL
'/tmp/RtmpeIaZPo/devtools9dd13505afcd/ctlab-fgsea-29b4e3c'
--library='/opt/apps/R/3.3.2/lib64/R/library' --install-tests

installing source package ‘fgsea’ ...
** libs
g++ -std=c++0x -I/opt/apps/R/3.3.2/lib64/R/include -DNDEBUG -I/opt/apps/bzip2-1.0.6/include -I/opt/apps/xz/5.2.2/include -I/opt/apps/pcre/8.39/include -I/opt/apps/curl/7.52.1/include -I"/opt/apps/R/3.3.2/lib64/R/library/Rcpp/include" -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
g++ -std=c++0x -I/opt/apps/R/3.3.2/lib64/R/include -DNDEBUG -I/opt/apps/bzip2-1.0.6/include -I/opt/apps/xz/5.2.2/include -I/opt/apps/pcre/8.39/include -I/opt/apps/curl/7.52.1/include -I"/opt/apps/R/3.3.2/lib64/R/library/Rcpp/include" -fpic -g -O2 -c fastGSEA.cpp -o fastGSEA.o
fastGSEA.cpp: In function ‘Rcpp::IntegerVector combination(const int&, const int&, std::mt19937&)’:
fastGSEA.cpp:318: error：‘uniform_int_distribution’ is not member of ‘std’
fastGSEA.cpp:318: error：expected primary-expression before ‘int’
fastGSEA.cpp:318: error：expected ‘;’ before ‘int’
fastGSEA.cpp:324: error：‘uni’ undefined
make: *** [fastGSEA.o] error 1
ERROR: compilation failed for package ‘fgsea’
removing ‘/opt/apps/R/3.3.2/lib64/R/library/fgsea’
ERROR: Command failed (1)

problem loading the package

Hello,
Im trying to run on my Mac (R studio; R version 3.5.1 )
library(fgsea)
I get this error:

Loading required package: Rcpp
Error in value[[3L]](cond) : 
  Package ‘Rcpp’ version 0.12.19 cannot be unloaded:
 Error in unloadNamespace(package) : namespace ‘Rcpp’ is imported by ‘tidyr’, ‘plyr’, ‘tidyselect’, ‘reticulate’, ‘scales’, ‘dplyr’, ‘reshape2’, ‘bindrcpp’, ‘htmltools’, ‘lubridate’, ‘ggrepel’, ‘fs’ so cannot be unloaded

Any ideas?

ctlab / fgsea Goto Github PK

fgsea's People

Contributors

Stargazers

Watchers

Forkers

fgsea's Issues

At this part:

Recommend Projects

Recommend Topics

Recommend Org