Hello, Thank you for the development of distinct tool. I have been t

Runs Indefinitely about distinct HOT 9 CLOSED

bshim181 commented on July 30, 2024

Runs Indefinitely

from distinct.

Comments (9)

SimoneTiberi commented on July 30, 2024

Hi Bohoon,
3 days is indeed a very long time!
In our benchmarks, smaller datasets of 3,600 cells were analyzed in 3-4 minutes.
Parallelization of tasks is done at the cell-type level; so, distinct uses at most n_cell_types cores (our default, if n_cores is left unspecified).
How many cell types do you have?

Time heavily depends on the number of significant results: if you have a lot of differential genes, time could increase a lot.
That's the only physiological potential cause I can think of.
Alternatively, something is wrong with the input data, or with out internal functions.

If you want to share your scripts and data, I can have a deeper look.

Have a good one,
Simone

from distinct.

bshim181 commented on July 30, 2024

So currently, we have 6 cell types and I have tried n_cores = 6 as well which also ran for more than three days. Is there a possibility where its detecting no differentially expressed genes and the program wont terminate? Since in our case, we are looking at potential gene level differences within different states of Tregs(activated vs exhausted), there might not exist significant level of gene profile differences.

Here is the code block which I am using to run distinct! Thank you!

merged.sce <- as.SingleCellExperiment(merged,assay="RNA")

merged.sce <- computeLibraryFactors(merged.sce)
merged.sce <- logNormCounts(merged.sce)
cpm(merged.sce) <- calculateCPM(merged.sce,assay.type="counts")

(merged.sce <- prepSCE(merged.sce,
kid = "seurat_clusters",
gid = "Condition",
sid = "Sample",
drop=T))

colData(merged.sce)

Batch = factor(c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B"))
Sex = factor(c("Male", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female"))

merged.samples <- merged.sce@metadata$experiment_info$sample_id
merged.group <- merged.sce@metadata$experiment_info$group_id
#merged.design <- model.matrix(~merged.group + Batch + Sex)
merged.design <- model.matrix(~merged.group)

rownames(merged.design) <- merged.samples

set.seed(61217)

merged.res <- distinct_test(x = merged.sce,
name_assays_expression = "logcounts",
name_cluster = "cluster_id",
name_sample = "sample_id",
design = merged.design,
P_4 = 20000,
column_to_test = 2,
min_non_zero_cells = 50,
n_cores = 16)

merged.res <- log2_FC(res = merged.res,
x = merged.sce,
name_assays_expression = "cpm",
name_group = "group_id",
name_cluster = "cluster_id")

from distinct.

SimoneTiberi commented on July 30, 2024

Ok, what I meant is that using n_cores = n_cells (6) will speed-up calculations, but using more will not make a difference.

Everything seems ok; if you want, you can also account for "batch" and sex" using the commented design.

> Is there a possibility where its detecting no differentially expressed genes and the program wont terminate?
Actually, the more differential genes there are, the longer distinct will take.

I see you are using P_4 = 20,000: this actually increases the permutations used for significant genes, and hence the runtime.
I suggest you use the default values (P_4 = 10,000), which is more than enough to obtain good estimates of the p-values.

In addition, just to try and see if everything runs, you can set P_3 = 1,000 and P_4 = 1,000.
This will give less accurate estimates of small p-values, but it will run much faster.

If you want to share the data, I can try the package there.

Take care,
Simone

from distinct.

bshim181 commented on July 30, 2024

Hello,

I have tried with P_2 = 100, P_3=100 and P_4 = 100 which took about one hour and half.

One thing that I noticed which was slightly abnormal was that when I looked at top_results for cluster 2, I saw that all genes had p-val of 0.009 (all equal p-value). I know that since permutations used is very low, these p-values should not be taken as granted but is this what you would normally see?

from distinct.

SimoneTiberi commented on July 30, 2024

Hi Bohoon,
ok, this explains the huge computational cost: all p-values are significant, which means that they will use up to P_4 permutations (10 k by default), which increases a lot the computational cost.

Something is probably going wrong with this cluster.
Do you notice the same behaviour with the other clusters too?

An easy solution may be to run the analysis without cluster 2.
Alternatively, I can run the package on your data.

Simone

from distinct.

bshim181 commented on July 30, 2024

I am not 100% sure on the sharing permissibility since it is a patient dataset. One thing to consider is that this dataset is supposed to be all sorted for Tregs and therefore, the clusters are capturing potential variation in cell states among Tregs. I am assuming there should be very minute levels of differences in terms of expression profile.

When I ran muscat on this same dataset, it did not pick up any differentially expressed genes among all clusters. Is there a possibility that since differences are so miniscule, its considering all genes as differentially expressed (all-p values become significant).

from distinct.

SimoneTiberi commented on July 30, 2024

muscat detects DGE (differences in means), while distinct uses a non-parametric approach and also detects differences that do not involve the mean (e.g., difference in variance).
So it can be expected that distinct detects more differences than muscat, but the kind of pattern you observe is extreme.

One thing I just noticed is that top_results ranks results from most to least significant; so when doing "head" it is normal that you have all significant results printed.

Check how many significant results there are:

a = top_results( ...)
mean(a$p_val < 0.01)
hist(a$p_val)

from distinct.

bshim181 commented on July 30, 2024

I ran distinct with P_4 = 1000 as you suggested and it did finish within one day.
Interestingly, when i ran top_results for all clusters, there were some clusters that did not have any differentially expressed genes where other clusters had 7,000 to 10,000 differentially expressed genes detected.

These are the histogram of p values for clusters 0 and 1 where it demonstrated noticeably large number of differentially expressed genes. Are these numbers out of ordinary or would they eventually get filtered out after increasing P_3 and P_4 values? (I will give it around five days to run to check whether they successfully finish.)

from distinct.

SimoneTiberi commented on July 30, 2024

Hi Bohoon,
there are definitely A LOT of significant genes...this is certainly an unusual histogram of p-values (all p < 0.03).

Increasing P_3 and P_4 will not make a difference: this enables a more accurate estimate of small p-values, but we're talking about a small difference; this will not push p-values to 1.

Hard to say what is causing this.
If can make the data anonymous I can look at it.

Take care,
Simone

from distinct.

Runs Indefinitely about distinct HOT 9 CLOSED

Comments (9)

Related Issues (8)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent