dputhier / scigenex


This repository stores the scigenex R library.

License: Other

R 99.30% Makefile 0.70%

r-package r scrna-seq scrna-seq-analysis scrnaseq classification unsupervised-learning filtering-algorithm

scigenex's Introduction

SciGeneX repository

⏬ Installation

System requirements

The partitioning steps are currently performed through a system call to the Markov Cluster (MCL) algorithm, which presently limits the use of DBF-MCL to Unix-like platforms. Importantly, the mcl command should be in your PATH and reachable from within R (see the dedicated section).
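
A quick way to check whether mcl is reachable from the current R session is base R's Sys.which(), which returns an empty string when a command is not found. This is only a minimal sketch; the variable names are illustrative and not part of scigenex.

```r
# Look up the mcl executable on the PATH seen by the current R session.
mcl_path <- Sys.which("mcl")

# Sys.which() returns "" when the command cannot be found.
mcl_available <- nzchar(mcl_path)[[1]]

if (mcl_available) {
  message("mcl found at: ", mcl_path)
} else {
  message("mcl was not found in the PATH of this R session.")
}
```

Running this inside the same session (RStudio, terminal R, a job scheduler) that will call scigenex matters, because each may see a different PATH.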

Step 1 - Installation of SciGeneX

From R

The scigenex library is currently not available on CRAN or Bioconductor. To install it from GitHub, use:

devtools::install_github("dputhier/scigenex")
library(scigenex)

From the terminal

Download the tar.gz from GitHub or clone the main branch. Uncompress it and run the following command from within the uncompressed scigenex folder:

R CMD INSTALL .

Then load the library from within R.

library(scigenex)

Step 2 - Installation of MCL

You may skip this step: the latest versions of SciGeneX call scigenex::install_mcl() to install MCL in the ~/.scigenex directory if the program is not found in the PATH.

Installation of MCL using install_mcl()

The install_mcl() function has been developed to ease MCL installation. It should be called automatically from within R when calling the gene_clustering() function. If install_mcl() does not detect MCL in the PATH, it installs it in ~/.scigenex.

Installation of MCL from source

You can also install MCL from source using the following code.

# Download the latest version of mcl 
wget http://micans.org/mcl/src/mcl-latest.tar.gz
# Uncompress and install mcl
tar xvfz mcl-latest.tar.gz
cd mcl-xx-xxx
./configure
make
sudo make install
# You should get mcl in your path
mcl -h

Installation of MCL using conda

Finally you may install MCL using conda. Importantly, the mcl command should be available in your PATH from within R.

conda install -c bioconda mcl

Example

The scigenex library contains several datasets, including pbmc3k_medium, which is a subset of the pbmc3k 10X dataset.

library(Seurat)
library(scigenex)
set_verbosity(1)

# Load a dataset
load_example_dataset("7871581/files/pbmc3k_medium")

# Select informative genes
res <- select_genes(pbmc3k_medium,
                     distance = "pearson",
                     row_sum=5)
                     
# Cluster informative features
 
## Construct and partition the graph
res <- gene_clustering(res, 
                       inflation = 1.5, 
                       threads = 4)
                        
# Display the heatmap of gene clusters
res <- top_genes(res)
plot_heatmap(res, cell_clusters = Seurat::Idents(pbmc3k_medium))

📖 Documentation

Documentation (in progress) is available at https://dputhier.github.io/scigenex/.


scigenex's Issues

Example dataset in doc

In the documentation, it would be better for pedagogical purposes to download and uncompress the file.

	url = "https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz"
	download.file(url=url, 
	                  destfile = "pbmc3k_filtered_gene_bc_matrices.tar.gz")
	system("tar xvfz pbmc3k_filtered_gene_bc_matrices.tar.gz")
	pbmc_data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

Store cell order/cluster

I don't know if there is any plan to store the cell order/cluster in the ClusterSet object after plot_heatmap. This would be valuable so that the user may manipulate the cell order/cluster later.
Best
Denis

example in plot_heatmap

I would suggest example code that produces a set of clusters rather than a single one:

    library(scigenex)
    set.seed(123)
    m <- matrix(rnorm(40000), nc=20)
    m[1:100,1:10] <- m[1:100,1:10] + 4
    m[101:200,11:20] <- m[101:200,11:20] + 3
    m[201:300,5:15] <- m[201:300,5:15] + -2
    res <- DBFMCL(data=m,
                  distance_method="pearson",
                  av_dot_prod_min = 0,
                  inflation = 2,
                  k=25,
                  fdr = 10)

    plot_heatmap(object = res)

ClusterSet should implement a slot with the number of clusters.

to avoid

     1:max(dbf_seurat@cluster) # see vignette. To be fixed.
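
One way such a feature could look is an accessor that returns the number of distinct clusters, combined with seq_len() instead of 1:max(...), which breaks when the cluster vector is empty. The sketch below uses a toy S4 class; ToyClusterSet and nclust() are hypothetical names, not the scigenex API.

```r
library(methods)

# Toy stand-in for ClusterSet: one cluster label per gene.
setClass("ToyClusterSet", representation(cluster = "numeric"))

# Hypothetical accessor returning the number of distinct clusters.
setGeneric("nclust", function(object) standardGeneric("nclust"))
setMethod("nclust", "ToyClusterSet", function(object) {
  length(unique(object@cluster))
})

obj <- new("ToyClusterSet", cluster = c(1, 1, 2, 3, 3, 3))

# seq_len() is also safe when there are no clusters (it returns integer(0)).
cluster_ids <- seq_len(nclust(obj))
```

An accessor keeps the count in sync with the data, whereas a stored slot would have to be updated every time clusters are filtered.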

Improvement for get_genes function

It could be nice to add a parameter "top" in get_genes function to extract the top genes from a gene signature.

Ex : get_genes(obj, cluster = 1, top = TRUE)

silent argument in DBFMCL

The help file indicates:

#' @param silent if set to TRUE (default), the progression of distance matrix
#' calculation is not displayed.

My feeling is that it does not control this anymore. Maybe it controls the MCL messages instead?

The methods of the ClusterSet object do not appear in the documentation

There should be something to fix with roxygen. The slots are listed but not the methods.
?"ClusterSet-class"

storing av_dot_prod

Hi,
I think it would be interesting to store the value of av_dot_prod in the DBFMCL object so that one could re-run DBFMCL with an increased value of av_dot_prod_min to delete some contaminating clusters.
Best

Default inflation value

I have previously observed that pushing inflation too far results in losing informative genes supported by few cells. Unless there are strong arguments against it, I would definitely suggest setting the default inflation value to 2.
Best

Improve DBFMCL performances....

The library should propose some parallelization... We may ask fafa13...

MCL implementation in R

Currently DBFMCL uses a call to the MCL Unix command. There is a package for performing MCL analysis in R. It would be a good idea to switch to it, as it would make DBFMCL easier to install.
To be tested and implemented if performance and results are similar.
https://cran.r-project.org/web/packages/MCL/MCL.pdf

If no results of functional enrichment analysis for a specific cluster then visualization of results will stop at this cluster

Add a warning message to inform the user that there is no result for cluster XX, and continue with the visualization of the next clusters.

> res_gene <- enrich_analysis(object = res_gene, specie = "mmusculus")
[1] "Enrichment analysis for cluster 1"
[1] "Enrichment analysis for cluster 2"
[1] "Enrichment analysis for cluster 3"
[1] "Enrichment analysis for cluster 4"
[1] "Enrichment analysis for cluster 5"
[1] "Enrichment analysis for cluster 6"
[1] "Enrichment analysis for cluster 7"
[1] "Enrichment analysis for cluster 8"
[1] "Enrichment analysis for cluster 9"
[1] "Enrichment analysis for cluster 10"
[1] "Enrichment analysis for cluster 11"
[1] "Enrichment analysis for cluster 12"
[1] "Enrichment analysis for cluster 13"
[1] "Enrichment analysis for cluster 14"
[1] "Enrichment analysis for cluster 15"
[1] "Enrichment analysis for cluster 16"
[1] "Enrichment analysis for cluster 17"
[1] "Enrichment analysis for cluster 18"
No results to show
Please make sure that the organism is correct or set significant = FALSE
[1] "Enrichment analysis for cluster 19"
[1] "Enrichment analysis for cluster 20"
[1] "Enrichment analysis for cluster 21"
No results to show
Please make sure that the organism is correct or set significant = FALSE
[1] "Enrichment analysis for cluster 22"

> res_gene <- enrich_viz(object = res_gene)
|--  Plot enrichment analysis results for cluster 1 
|--  Plot enrichment analysis results for cluster 2 
|--  Plot enrichment analysis results for cluster 3 
|--  Plot enrichment analysis results for cluster 4 
|--  Plot enrichment analysis results for cluster 5 
|--  Plot enrichment analysis results for cluster 6 
|--  Plot enrichment analysis results for cluster 7 
|--  Plot enrichment analysis results for cluster 8 
|--  Plot enrichment analysis results for cluster 9 
|--  Plot enrichment analysis results for cluster 10 
|--  Plot enrichment analysis results for cluster 11 
|--  Plot enrichment analysis results for cluster 12 
|--  Plot enrichment analysis results for cluster 13 
|--  Plot enrichment analysis results for cluster 14 
|--  Plot enrichment analysis results for cluster 15 
|--  Plot enrichment analysis results for cluster 16 
|--  Plot enrichment analysis results for cluster 17 
|--  Plot enrichment analysis results for cluster 18 
Error in gostplot(object@cluster_annotations[[cur_cluster]], interactive = TRUE) : 
  The following columns are missing from the result: source_order, term_size, term_name, term_id, source, significant
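
The requested behaviour (warn, then move on to the next cluster) can be sketched with tryCatch() around the per-cluster plotting call. Everything below is a toy: plot_one() and the annotations list are hypothetical stand-ins, not the actual enrich_viz() internals.

```r
# Hypothetical per-cluster plotting step that fails when there is no result.
plot_one <- function(annotation) {
  if (is.null(annotation)) stop("no enrichment result")
  paste("plotted", nrow(annotation), "terms")
}

# Toy annotations: cluster 2 has no enrichment result.
annotations <- list(
  `1` = data.frame(term_id = c("a", "b")),
  `2` = NULL,
  `3` = data.frame(term_id = "c")
)

results <- list()
for (cl in names(annotations)) {
  results[[cl]] <- tryCatch(
    plot_one(annotations[[cl]]),
    error = function(e) {
      # Warn and keep going instead of aborting the whole loop.
      warning("No enrichment result for cluster ", cl, "; skipping.",
              call. = FALSE)
      NA_character_
    }
  )
}
```

With this pattern a missing result costs one warning rather than the plots of every later cluster.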

Unstable results

There are some discrepancies in tests...

	test_that("Cheking DBFMCL is providing the right number of genes", {
	  set.seed(123)
	  m <- matrix(rnorm(80000), nc=20)
	  m[1:100,1:10] <- m[1:100,1:10] + 4
	  m[101:200,11:20] <- m[101:200,11:20] + 3
	  m[201:300,5:15] <- m[201:300,5:15] + -2
	  res <- DBFMCL(data=m,
	                distance_method="pearson",
	                av_dot_prod_min = 0,
	                inflation = 1.2,
	                k=25,
	                fdr = 10)
	  #plot_clust(res, ceil = 10, floor = -10)
	  expect_equal(length(res@size), 3)
	  expect_equal(res@size, c(109, 107, 81))
	})


	  ══ Failed tests ════════════════════════════════════════════════════════════════
	  ── Failure (test.dbfmcl.R:15:3): Cheking DBFMCL is providing the right number of genes ──
	  res@size not equal to c(109, 107, 81).
	  2/3 mismatches (average diff: 6)
	  [1] 115 - 109 == 6
	  [2] 113 - 107 == 6

Warning in PrepDR(object = object, features = features, verbose = verbose):

The following 11714 features requested have not been scaled (running reduction without them)

This warning message should be fixed in the tutorial (vignette/web page).

plot_heatmap and blank lines

In the plot_heatmap() function, the row names corresponding to the blank lines that separate the clusters are displayed. These row names should be discarded, or the corresponding lines removed if blank row names are not supported by the underlying function.

README example

The example in the README section is not very attractive...
I would propose

library(scigenex)
set.seed(123)
m <- matrix(rnorm(40000), nc=20)
m[1:100,1:10] <- m[1:100,1:10] + 4
m[101:200,11:20] <- m[101:200,11:20] + 3
m[201:300,5:15] <- m[201:300,5:15] + -2
res <- DBFMCL(data=m,
              distance_method="pearson",
              av_dot_prod_min = 0,
              inflation = 2,
              k=25,
              fdr = 10)
plot_clust(res, ceil = 10, floor = -10)
write_clust(res, "ALL.sign.txt")

plot_heatmap: Error in matrix(ncol = ncol(m)) : data is too long

This is most probably due to the fact that hierarchical clustering is no longer part of the function.

	> dbf_seurat <- top_genes(dbf, top = 10)
	|--  Results are stored in top_genes slot of ClusterSet object. 
	> plot_heatmap(dbf_seurat, use_top_genes = T)
	|--  Centering matrix. 
	|--  Ordering cells based on hierarchical clustering. 
	|--  Ceiling matrix. 
	|--  Flooring matrix. 
	Error in matrix(ncol = ncol(m)) : data is too long

It fails here when trying to access @cell_clusters$hclust_res$order which is NULL.

Best

    if (cell_ordering_method == "hclust") {
      print_msg("Ordering cells based on hierarchical clustering.", 
        msg_type = "DEBUG")
      m <- m[, object@cell_clusters$hclust_res$order]
      if (length(object@cell_clusters$labels) != 0) {
        object@cell_clusters$labels <- object@cell_clusters$labels[colnames(m)]
        object@cell_clusters$cores <- object@cell_clusters$cores[colnames(m)]
      }
    }
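
The failure mode described above can be reproduced on a toy matrix: indexing columns with NULL returns a matrix with zero columns, which later code cannot handle. A guard on the ordering vector is one possible fix; reorder_cells() below is a hypothetical helper, not the scigenex code.

```r
# Toy expression matrix: 3 genes x 4 cells.
m <- matrix(1:12, nrow = 3, dimnames = list(NULL, paste0("cell", 1:4)))

# Indexing with a NULL ordering silently drops every column.
hclust_order <- NULL
empty <- m[, hclust_order, drop = FALSE]
ncol(empty)  # 0 columns

# Hypothetical guard: keep the original order when none is available.
reorder_cells <- function(m, ord) {
  if (is.null(ord)) {
    message("No cell ordering available; keeping the original order.")
    return(m)
  }
  m[, ord, drop = FALSE]
}

m_kept <- reorder_cells(m, hclust_order)
m_rev <- reorder_cells(m, 4:1)
```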

0 cluster found with spearman as distance

Using pbmc dataset from seurat and spearman correlation coef as distance, 1 cluster was found containing 799 genes and then filtered.

|-- Done
|-- creating file : /mnt/NAS7/Workspace/bavaisj/ciml-splab/BecomingLTi/PBMC_DBFMCL/03_Script/02_GeneSelectionWithDBFMCL//mnt/NAS7/SPlab/BIOINFO_PROJECT/BecomingLTi/PBMC_DBFMCL/05_Output/02_GeneSelectionWithDBFMCL/TestR/BecomingLTi_PBMC_DBFMCL_intermediate_result.mcl_out.txt
|-- Reading MCL output: /mnt/NAS7/SPlab/BIOINFO_PROJECT/BecomingLTi/PBMC_DBFMCL/05_Output/02_GeneSelectionWithDBFMCL/TestR/BecomingLTi_PBMC_DBFMCL_intermediate_result.mcl_out.txt
|-- 0 clusters conserved after MCL partitioning.
|-- 1 clusters filtered out from MCL partitioning (size and mean dot product).

Parameters used for this run :
The following parameters will be used :
Working directory: /mnt/NAS7/Workspace/bavaisj/ciml-splab/BecomingLTi/PBMC_DBFMCL/03_Script/02_GeneSelectionWithDBFMCL
Name: /mnt/NAS7/SPlab/BIOINFO_PROJECT/BecomingLTi/PBMC_DBFMCL/05_Output/02_GeneSelectionWithDBFMCL/TestR/BecomingLTi_PBMC_DBFMCL_intermediate_result
Distance method: spearman
Minimum average dot product for clusters: 0.2
Minimum cluster size: 10
Number of neighbors: 75
Number of randomizations: 3
FDR: 1 %
Inflation: 5
Visualize standard outputs from both mcl and cluster commands: FALSE
Memory used : 1024

We also changed the parameters (average dot product, neighbors, inflation, and fdr) but nothing changed.
However, we obtain many clusters using pearson correlation as distance.

We checked the formula of the spearman correlation coef in the C code and we saw that the formula is (dist²*6)/(n_sample*((n_sample*n_sample)-1)).
Shouldn't it be 1 - (dist²*6)/(n_sample*((n_sample*n_sample)-1)) ?
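
For reference, the textbook Spearman coefficient for n samples without ties is rho = 1 - 6*sum(d^2)/(n*(n^2 - 1)), where d is the difference between the ranks of the two variables. The term 6*sum(d^2)/(n*(n^2 - 1)) is therefore equal to 1 - rho. Whether dist² in the C code is that sum of squared rank differences, and whether the function is meant to return a correlation or a distance, is not established here; the check below only confirms the identity on random data using base R.

```r
set.seed(1)
n <- 20
x <- rnorm(n)
y <- rnorm(n)

# Squared rank differences (no ties with continuous random data).
d <- rank(x) - rank(y)
term <- 6 * sum(d^2) / (n * (n^2 - 1))

# Reference value from base R.
rho <- cor(x, y, method = "spearman")

# The textbook identity: rho == 1 - term, i.e. term == 1 - rho.
all.equal(rho, 1 - term)
```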

Failure (test.dbfmcl.R:12:3): Cheking DBFMCL is providing the right number of clusters

When running make test on OSX, the number of selected genes seems to differ from UNIX. @JulieBvs, can you check that the test is working on Linux?

   make test

    Testing dbfmcl
    ✔ |  OK F W S | Context
    ⠏ |   0       | test.dbfmcl                                                     The following parameters will be used :
        Working directory:  /Users/puthier/Documents/git/project_dev/dbfmcl/tests/testthat
        Name:  exprs
        Distance method:  pearson
        Number of neighbors:  25
        Number of randomizations:  3
        FDR:  10 %
        Perform clustering:  FALSE
        Visualize standard outputs from both mcl and cluster commands:  FALSE
        Memory used :  1024

    Randomization: 7994001 (1/1.000    ratio)
    Seed = 123
    Pre-computation for distances
    Computing distances: 100.00%
    Randomization: 7994001 obtained, 7994001 asked
    Computing FDR: 100.00%
    Computing cut-off
    number of conserved genes = 309
    Building graph
    Genes   core = 309   extra = 0
    DBF done
    ✖ |   1 1     | test.dbfmcl [1.2 s]
    ────────────────────────────────────────────────────────────────────────────────
    Failure (test.dbfmcl.R:12:3): Cheking DBFMCL is providing the right number of clusters
    res@size not equal to 298.
    1/1 mismatches
    [1] 309 - 298 == 11

Need colnames/rownames method for a clusterSet object

Not user friendly to always ask for colnames(dbf_seurat@data).
Best
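
One low-cost way to get this for an S4 class is to define a dimnames() method that forwards to the data slot; base R's rownames() and colnames() call dimnames() internally, so they then work on the object directly. This is a sketch on a toy class (ToyExprSet is a hypothetical name), not the ClusterSet implementation.

```r
library(methods)

# Toy stand-in for an object holding an expression matrix in a data slot.
setClass("ToyExprSet", representation(data = "matrix"))

# Forward dimnames() to the underlying matrix.
setMethod("dimnames", "ToyExprSet", function(x) dimnames(x@data))

obj <- new("ToyExprSet",
           data = matrix(0, nrow = 2, ncol = 2,
                         dimnames = list(c("g1", "g2"), c("c1", "c2"))))

cell_names <- colnames(obj)
gene_names <- rownames(obj)
```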

Inconsistency between g_profiles clusters and legend

I noticed that the clusters shown on the tile plot from plot_clust range from 1 to 9, whereas the colors of the legend range from 0 to 9 and correspond to Seurat clusters.

MCL package not available since last commit

Hi @dputhier ,
I did a pull this morning before making a new branch and now I get this error when running reinstall.

reinstall()
ℹ Loading dbfmcl
Error: Dependency package(s) 'MCL' not available.
Run rlang::last_error() to see where the error occurred.
In addition: Warning message:
In (function (dep_name, dep_ver = "*") :
Dependency package 'MCL' not available.

cluster argument to plot_heatmap

I think it would be interesting to add an argument to select the clusters to be displayed (one or several)

Empty critical_distance slot

When running :

m <- matrix(rnorm(6000), nc=20)
m[1:100,1:10] <- m[1:100,1:10] + 4
m[101:200,11:20] <- m[101:200,11:20] + 3
m[201:300,5:15] <- m[201:300,5:15] + -2

res <- DBFMCL(data=m,
distance_method="pearson",
av_dot_prod_min = 0,
inflation = 1.2,
k=25,
fdr = 90)

res@critical_distance
[1] NA

enrich_go with Mmusculus

enrich_go fails when "Mmusculus" is passed as the specie parameter.
The hs object needs to be replaced in:

query_entrezid <- AnnotationDbi::select(hs, 
                                           keys = query,
                                           columns = c("ENTREZID", "SYMBOL"),
                                           keytype = "SYMBOL")
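
A possible direction is to pick the annotation database from the specie argument instead of hard-coding hs. The sketch below maps a species label to the matching Bioconductor package (org.Hs.eg.db and org.Mm.eg.db are real packages); the helper name org_db_for() and the label "Hsapiens" are assumptions made for illustration.

```r
# Map a species label to the name of the matching annotation package.
# "Hsapiens" is an assumed label; "Mmusculus" comes from the report above.
org_db_for <- function(specie) {
  switch(specie,
    Hsapiens = "org.Hs.eg.db",
    Mmusculus = "org.Mm.eg.db",
    stop("Unsupported specie: ", specie)
  )
}

pkg <- org_db_for("Mmusculus")

# Only fetch the annotation object when the package is installed.
if (requireNamespace(pkg, quietly = TRUE)) {
  org_db <- getExportedValue(pkg, pkg)  # e.g. org.Mm.eg.db::org.Mm.eg.db
}
```

The resulting org_db object would then take the place of hs in the AnnotationDbi::select() call.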

Example not working

In the ClusterSet examples, the default example does not work because the MCL R library is called by default. Adding "mcl_cmd_line = T" typically fixes the issue. In fact, mcl_cmd_line should be the default. We should even drop the dependency on the MCL R library, as this project is no longer maintained and it is a really poor implementation of the original MCL algorithm.

  ?"ClusterSet-class"

 res <- DBFMCL(data=m, distance_method="pearson", av_dot_prod_min = 0, inflation = 1.2, k=25, fdr = 10, mcl_cmd_line = T)

ClusterSet object does not store all parameters

The av_dot_prod_min and min_cluster_size are not stored.

    > dbf@parameters
    $distance_method
    [1] "pearson"
    
    $k
    [1] 80
    
    $fdr
    [1] 10
    
    $seed
    [1] 123
    
    $inflation
    [1] 4

top_genes and clusters in plot_heatmap

Got this error when using plot_heatmap

	> plot_heatmap(dbf, use_top_genes=TRUE, cluster = 1)
	|--  Centering matrix. 
	|--  Ordering cells based on hierarchical clustering. 
	|--  Ceiling matrix. 
	|--  Flooring matrix. 
	Error in m[genes_top, ] : subscript out of bounds
	> dbf@top_genes
	           gene.top.1      gene.top.2      gene.top.3      gene.top.4      gene.top.5     
	cluster.1  "H2-K1"         "Ccl21a"        "Fcgbp"         "Ifi27l2a"      "H2-D1"        
	cluster.2  "Gapdh"         "Stmn1"         "Ubb"           "Hmgn2"         "Pclaf"        
	cluster.3  "Gm48228"       "Mgat4c"        "Gucy2g"        "Abca17"        "5033417F24Rik"
	cluster.4  "1700024I08Rik" "Gm17200"       "Crocc2"        "Gm45418"       "Gm19689"      
	cluster.5  "Gm36356"       "1700029N11Rik" "Ttll6"         "Gmnc"          "Tsix"         
	cluster.6  "mt-Co1"        "mt-Atp6"       "Rps14"         "Rps28"         "Rpl37a"       
	cluster.7  "Gm553"         "Gm5784"        "5430431A17Rik" "Gabra2"        "Trav10n"      
	cluster.8  "Rpl30"         "Rpl22"         "Atp5mpl"       "Rpl23"         "Rpl34"        
	cluster.9  "Rmnd5a"        "Endou"         "Ccnd3"         "Satb1"         "Ets2"         
	cluster.10 "Rplp1"         "Rpsa"          "Actb"          "Rplp0"         "Laptm5"       
	cluster.11 "Rps21"         "Fau"           "Rps26"         "Chchd2"        "Rpl8"         
	cluster.12 "Tmod2"         "Cngb1"         "Gm43848"       "Gm29562"       "Tigd5"        
	cluster.13 "Rps17"         "Map1lc3b"      "Psma3"         "Gm10076"       "Uqcr10"       
	cluster.14 "G530011O06Rik" "1700041G16Rik" "Mkrn3"         "Cmklr1"        "Zfp456"       
	cluster.15 "mt-Co3"        "mt-Co2"        "Rps16"         "Rps20"         "Tpt1"         
	cluster.16 "Rps19"         "Rpl19"         "Rps6"          "Rpl12"         "Rps5"         
	cluster.17 "Gucy2e"        "Spink12"       "Serinc2"       "Gm16196"       "Gm13391"      
	cluster.18 "B2m"           "Pdia3"         "Tmsb4x"        "Npc2"          "Atp1b3"       
	cluster.19 "Boll"          "Fignl2"        "F8"            "Sntb1"         "Hoxa5"        
	cluster.20 "Sis"           "D630023F18Rik" "Mpo"           "Padi4"         "Arl4d"        
	cluster.21 "B930018H19Rik" "Loxl4"         "Gm38037"       "A430110L20Rik" "Pcdh1"        
	           gene.top.6 gene.top.7      gene.top.8 gene.top.9 gene.top.10    
	cluster.1  "H2-Ab1"   "H2-Eb1"        "H2-Aa"    "Mgp"      "Igfbp4"       
	cluster.2  "Snrpg"    "Slc25a5"       "Calm2"    "Rpl15"    "Cox7a2"       
	cluster.3  "Myo3b"    "Egf"           "Upk1a"    "Mamstr"   "Cldn3"        
	cluster.4  "Sypl2"    "9630002D21Rik" "Meig1"    "Gm16006"  "Dnah10"       
	cluster.5  "Lrfn2"    "Chst5"         "Adora2b"  "Adm2"     "Gjb6"         
	cluster.6  "Rps29"    "mt-Nd4"        "Rpl10a"   "Rps27"    "Rpl3"         
	cluster.7  "Dgkk"     "4632428C04Rik" "Cxxc4"    "Tmem136"  "Wee2"         
	cluster.8  "Rpl37"    "Rps7"          "Rps4x"    "Rps15a"   "Atp5j2"       
	cluster.9  "Themis"   "Myb"           "Aqp11"    "Rag2"     "Arl5c"        
	cluster.10 "Coro1a"   "Rps10"         "Rps9"     "Rplp2"    "Eef2"         
	cluster.11 "Arhgdia"  "Selplg"        "Mbnl1"    "Marcksl1" "Atp5g3"       
	cluster.12 "Wtip"     "Dennd5b"       "Caskin2"  "Zfp11"    "Vpreb1"       
	cluster.13 "Cct5"     "Uba1"          "Mapk1"    "Tpi1"     "Esd"          
	cluster.14 "Poll"     "Gpr137"        "Ogfod3"   "Ndufaf6"  "L2hgdh"       
	cluster.15 "mt-Nd1"   "Rpl27a"        "Rps15"    "mt-Nd3"   "mt-Nd2"       
	cluster.16 "Rps2"     "Rpl18a"        "Hsp90ab1" "Hspe1"    "Psmb8"        
	cluster.17 "Fam189a2" "Gm4890"        "Ticam2"   "Grk5"     "Pmaip1"       
	cluster.18 "Nfkbia"   "Sdf4"          "Tagln2"   "Rgs10"    "Lbh"          
	cluster.19 "Lrrc25"   "Plek2"         "Rom1"     "Anxa8"    "A730063M14Rik"
	cluster.20 "Taf4b"    "Lrg1"          "Fam110b"  "Sept10"   "Fitm2"        
	cluster.21 "Rassf6"   "Gpr160"        "Gm5134"   "Gm13546"  "Zdhhc23"
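
A "subscript out of bounds" error on m[genes_top, ] is what R raises when some requested row names are absent from the matrix. One plausible cause, stated here as an assumption, is that with cluster = 1 the matrix only holds the genes of that cluster while the top genes of all clusters are requested. The toy sketch below restricts the request to genes actually present.

```r
# Toy matrix restricted to the genes of a single cluster.
m <- matrix(rnorm(12), nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"), NULL))

# Requested top genes, one of which belongs to another cluster.
genes_top <- c("geneA", "geneX", "geneC")

# Keep only the genes that exist in the matrix and report the others.
present <- intersect(genes_top, rownames(m))
missing_genes <- setdiff(genes_top, rownames(m))
if (length(missing_genes) > 0) {
  message("Ignoring genes absent from the matrix: ",
          paste(missing_genes, collapse = ", "))
}

m_top <- m[present, , drop = FALSE]
```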

Fix plot_dist

DESCRIPTION:URL: https://github.com/dputhier/sigenex

In the description file the URL is wrong:
https://github.com/dputhier/sigenex -> https://github.com/dputhier/scigenex

R session crash when running DBF function

The R session crashes when optional_output is set to TRUE in the DBF or DBFMCL function. It may be caused by the recent modification of the C code (commit fccb0d5).
Note : a commit older than fccb0d5 works without any problems.

Error in function DBFMCL, "system mcl"

Hi,
I get an error when I try to run the example from the main GitHub page of DBFMCL.

I run those lines:

> m <- matrix(rnorm(80000), nc=20)
> m[1:100,1:10] <- m[1:100,1:10] + 4
> m[101:200,11:20] <- m[101:200,11:20] + 3
> m[201:300,5:15] <- m[201:300,5:15] + -2

I first got this error:

> res <- DBFMCL(data=m,
+               distance_method="pearson",
+               clustering=TRUE,
+               k=25)
Error in DBFMCL(data = m, distance_method = "pearson", clustering = TRUE,  : 
  unused argument (clustering = TRUE)

Then, when I removed the "clustering" parameter:

> res <- DBFMCL(data=m,
+               distance_method="pearson",
+               k=25)
The following parameters will be used : 
	Working directory:  /home/rstudio 
	Name:  1HCr50pP3f 
	Distance method:  pearson 
	Minimum average dot product for clusters:  2 
	Minimum cluster size:  10 
	Number of neighbors:  25 
	Number of randomizations:  3 
	FDR:  10 % 
	Inflation: 8 
	Visualize standard outputs from both mcl and cluster commands:  FALSE 
	Memory used :  1024 

Randomization: 7994001 (1/1.000    ratio)
Seed = 123
Pre-computation for distances
Computing distances: 100.00%
Randomization: 7994001 obtained, 7994001 asked
Computing FDR: 100.00%
Computing cut-off
number of conserved genes = 310
Building graph
Genes   core = 310   extra = 0
DBF done
sh: 1: mcl: not found
Error in if (system("mcl --version | grep 'Stijn van Dongen'", intern = TRUE) >  : 
  argument is of length zero
In addition: Warning message:
In system("mcl --version | grep 'Stijn van Dongen'", intern = TRUE) :
  running command 'mcl --version | grep 'Stijn van Dongen'' had status 1
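
The second error ("argument is of length zero") follows from the first: when the command is missing, the lookup returns a zero-length vector and if() cannot evaluate it. A clearer failure could be obtained by testing for the executable before using it. The helper check_command() below is a hypothetical sketch built on base R's Sys.which(), not the DBFMCL code.

```r
# A failed lookup yields a zero-length character vector:
out <- character(0)
# if (out > 0) ...   # would stop with "argument is of length zero"

# Hypothetical guard that fails early with an explicit message.
check_command <- function(cmd) {
  path <- Sys.which(cmd)
  if (!nzchar(path)) {
    stop("Command '", cmd, "' was not found in the PATH.", call. = FALSE)
  }
  unname(path)
}

# Demonstration with a command name that is assumed not to exist.
res <- tryCatch(
  check_command("a-command-that-does-not-exist-xyz"),
  error = function(e) conditionMessage(e)
)
```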

Mean Dot product

This step is quite long. I think you could simply take a random subset (e.g. 20%) of the clustered genes to compute the mean dot product (setting a minimum number of genes).
Best
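
The suggestion can be sketched as follows. The definition used here (mean of the pairwise dot products between gene vectors) is an assumption about what the package computes, and mean_dot_prod_subset(), frac and min_genes are illustrative names.

```r
set.seed(123)
m <- matrix(rnorm(200 * 20), nrow = 200)  # toy cluster: 200 genes x 20 samples

mean_dot_prod_subset <- function(m, frac = 0.2, min_genes = 10) {
  n <- nrow(m)
  # Use at least min_genes genes, and never more than the cluster holds.
  size <- min(n, max(min_genes, ceiling(frac * n)))
  idx <- sample.int(n, size)
  # Pairwise dot products between the sampled gene vectors.
  dp <- tcrossprod(m[idx, , drop = FALSE])
  # Average over distinct gene pairs only (needs at least two genes).
  mean(dp[upper.tri(dp)])
}

approx_dp <- mean_dot_prod_subset(m)
```

Because the number of pairs grows quadratically with the number of genes, a 20% sample reduces the work to roughly 4% of the full computation.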

Example diagram with GO terms

A suggestion.

README and Vignette should contain an example of a ClusterSet produced with an alternative to DBFMCL.

TODO

Path in write_clust function

Julie told me of some problems with write_clust(). Need to add tests.

Need a method to extract gene list from clusters stored in a ClusterSet object

top_genes return a matrix with bad rownames

When using top_genes function, the rownames of the object@top_genes always start at 1. It may be good to make it consistent with the cluster parameters used as input.

set.seed(123)
m <- matrix(rnorm(40000), nc=20)
m[1:100,1:10] <- m[1:100,1:10] + 4
m[101:200,11:20] <- m[101:200,11:20] + 3
m[201:300,5:15] <- m[201:300,5:15] + -2
res <- DBFMCL(data=m,
            distance_method="pearson",
            av_dot_prod_min = 0,
            inflation = 2,
            k=25,
            fdr = 10)

res <- top_genes(res, cluster = 2, top = 20)

res@top_genes

          gene.top.1 gene.top.2 gene.top.3 gene.top.4 gene.top.5 gene.top.6 gene.top.7 gene.top.8 gene.top.9 gene.top.10 gene.top.11 gene.top.12 gene.top.13 gene.top.14 gene.top.15
cluster.1 "gene166"  "gene186"  "gene192"  "gene183"  "gene117"  "gene155"  "gene180"  "gene114"  "gene122"  "gene163"   "gene121"   "gene150"   "gene168"   "gene181"   "gene200"  
          gene.top.16 gene.top.17 gene.top.18 gene.top.19 gene.top.20
cluster.1 "gene171"   "gene104"   "gene113"   "gene165"   "gene189"
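
The expected naming can be illustrated on a toy matrix, where the row name is built from the cluster identifier passed by the user rather than from the row position. This is only a sketch of the desired result, not the top_genes() implementation.

```r
# Cluster requested by the user and a few of its top genes.
cluster <- 2
genes <- c("gene166", "gene186", "gene192")

# Name the rows after the requested cluster identifiers.
top <- matrix(
  genes,
  nrow = length(cluster),
  dimnames = list(paste0("cluster.", cluster),
                  paste0("gene.top.", seq_along(genes)))
)

rownames(top)
```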

av_dot_prod_min

When selecting clusters by dot product, taking the average may be highly sensitive to outliers, resulting in numerous spurious clusters. We should compute the median instead.

  if (mean(cur_dot_prod) > av_dot_prod_min & length(h) >

 ===> 

if (median(cur_dot_prod) > av_dot_prod_min & length(h) >

This is exemplified here with numerous signatures selected while they should be discarded.
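
A toy vector shows the effect: a single large value is enough for the mean to pass the threshold while the median stays below it. The numbers are made up for illustration.

```r
# Mostly small dot products plus one outlier.
cur_dot_prod <- c(0.1, 0.2, 0.1, 0.3, 50)
av_dot_prod_min <- 2

pass_mean <- mean(cur_dot_prod) > av_dot_prod_min      # TRUE: the mean is 10.14
pass_median <- median(cur_dot_prod) > av_dot_prod_min  # FALSE: the median is 0.2
```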

DBFMCL filtering is unstable.

The same dataset may give different results over time with the same parameters.
This is highly problematic for reproducibility, to write tests, but also to create a documentation.
Would be cool if @fafa13 would help us to fix it !

    library(devtools)
    devtools::install_github("dputhier/scigenex")
    library(scigenex)
    m <- matrix(rnorm(80000), nc=20)
    m[1:100,1:10] <- m[1:100,1:10] + 4
    m[101:200,11:20] <- m[101:200,11:20] + 3
    m[201:300,5:15] <- m[201:300,5:15] + -2
    res <- DBFMCL(data=m,
                  distance_method="pearson",
                  av_dot_prod_min = 0,
                  inflation = 1.2,
                  k=25,
                  fdr = 10)
   nrow(res)
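
Note that the snippet above draws m with rnorm() without calling set.seed(), so the input matrix itself changes from run to run. Fixing the seed when building the test matrix helps separate variability in the input from variability in DBFMCL. Whether this accounts for the instability reported here is not established. The helper make_matrix() is illustrative.

```r
# Build the test matrix from an explicit seed.
make_matrix <- function(seed) {
  set.seed(seed)
  m <- matrix(rnorm(80000), ncol = 20)
  m[1:100, 1:10] <- m[1:100, 1:10] + 4
  m[101:200, 11:20] <- m[101:200, 11:20] + 3
  m[201:300, 5:15] <- m[201:300, 5:15] - 2
  m
}

m1 <- make_matrix(123)
m2 <- make_matrix(123)

# Same seed, same input.
same_input <- identical(m1, m2)
```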

add col dendrogram

Hi Julie,
It seems there is a function in iheatmapr to add a dendrogram to plot_heatmap. I think it would be a valuable option.
Best
Denis

Non-interactive heatmap

Add an option to plot_heatmap to generate a non-interactive heatmap.

Fix white line in plot_heatmap using top_genes

Improving the speed of filtering step

In the current implementation the dot product is computed whatever the size of the cluster (min_cluster_size). The cluster should be first tested for min_cluster_size and the remaining for av_dot_prod_min. This should improve the processing.
Best
Denis
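
The proposed ordering can be sketched as a filter that tests the cheap condition first and only computes dot products for clusters that are large enough. The function keep_cluster() is hypothetical, and the mean pairwise dot product used here is an assumed definition.

```r
set.seed(1)
m <- matrix(rnorm(100 * 20), nrow = 100)
m[1:50, ] <- m[1:50, ] + 4  # 50 genes sharing a strong signal

keep_cluster <- function(genes, m, min_cluster_size = 10, av_dot_prod_min = 2) {
  # Cheap size test first: small clusters are rejected without any dot product.
  if (length(genes) < min_cluster_size) {
    return(FALSE)
  }
  # Costly step, reached only by clusters that passed the size test.
  dp <- tcrossprod(m[genes, , drop = FALSE])
  mean(dp[upper.tri(dp)]) > av_dot_prod_min
}

small_kept <- keep_cluster(1:3, m)   # rejected on size alone
large_kept <- keep_cluster(1:50, m)  # large enough, then tested on dot product
```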

  > viz_enrich(dbf_seurat)
  Error in h(simpleError(msg, call)) : 
    error in evaluating the argument 'x' in selecting a method for function 'nrow': subscript out of bounds