Comments (9)
Hi Bohoon,
3 days is indeed a very long time!
In our benchmarks, smaller datasets of 3,600 cells were analyzed in 3-4 minutes.
Parallelization of tasks is done at the cell-type level; so, distinct uses at most n_cell_types cores (our default, if n_cores is left unspecified).
How many cell types do you have?
Time heavily depends on the number of significant results: if you have a lot of differential genes, time could increase a lot.
That's the only physiological potential cause I can think of.
Alternatively, something is wrong with the input data, or with out internal functions.
If you want to share your scripts and data, I can have a deeper look.
Have a good one,
Simone
from distinct.
So currently, we have 6 cell types and I have tried n_cores = 6 as well which also ran for more than three days. Is there a possibility where its detecting no differentially expressed genes and the program wont terminate? Since in our case, we are looking at potential gene level differences within different states of Tregs(activated vs exhausted), there might not exist significant level of gene profile differences.
Here is the code block which I am using to run distinct! Thank you!
merged.sce <- as.SingleCellExperiment(merged,assay="RNA")
merged.sce <- computeLibraryFactors(merged.sce)
merged.sce <- logNormCounts(merged.sce)
cpm(merged.sce) <- calculateCPM(merged.sce,assay.type="counts")
(merged.sce <- prepSCE(merged.sce,
kid = "seurat_clusters",
gid = "Condition",
sid = "Sample",
drop=T))
colData(merged.sce)
Batch = factor(c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B"))
Sex = factor(c("Male", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female"))
merged.samples <- merged.sce@metadata$experiment_info$sample_id
merged.group <- merged.sce@metadata$experiment_info$group_id
#merged.design <- model.matrix(~merged.group + Batch + Sex)
merged.design <- model.matrix(~merged.group)
rownames(merged.design) <- merged.samples
set.seed(61217)
merged.res <- distinct_test(x = merged.sce,
name_assays_expression = "logcounts",
name_cluster = "cluster_id",
name_sample = "sample_id",
design = merged.design,
P_4 = 20000,
column_to_test = 2,
min_non_zero_cells = 50,
n_cores = 16)
merged.res <- log2_FC(res = merged.res,
x = merged.sce,
name_assays_expression = "cpm",
name_group = "group_id",
name_cluster = "cluster_id")
from distinct.
Ok, what I meant is that using n_cores = n_cells (6) will speed-up calculations, but using more will not make a difference.
Everything seems ok; if you want, you can also account for "batch" and sex" using the commented design.
> Is there a possibility where its detecting no differentially expressed genes and the program wont terminate?
Actually, the more differential genes there are, the longer distinct will take.
I see you are using P_4 = 20,000
: this actually increases the permutations used for significant genes, and hence the runtime.
I suggest you use the default values (P_4 = 10,000), which is more than enough to obtain good estimates of the p-values.
In addition, just to try and see if everything runs, you can set P_3 = 1,000
and P_4 = 1,000
.
This will give less accurate estimates of small p-values, but it will run much faster.
If you want to share the data, I can try the package there.
Take care,
Simone
from distinct.
Hello,
I have tried with P_2 = 100, P_3=100 and P_4 = 100 which took about one hour and half.
One thing that I noticed which was slightly abnormal was that when I looked at top_results for cluster 2, I saw that all genes had p-val of 0.009 (all equal p-value). I know that since permutations used is very low, these p-values should not be taken as granted but is this what you would normally see?
from distinct.
Hi Bohoon,
ok, this explains the huge computational cost: all p-values are significant, which means that they will use up to P_4 permutations (10 k by default), which increases a lot the computational cost.
Something is probably going wrong with this cluster.
Do you notice the same behaviour with the other clusters too?
An easy solution may be to run the analysis without cluster 2.
Alternatively, I can run the package on your data.
Simone
from distinct.
I am not 100% sure on the sharing permissibility since it is a patient dataset. One thing to consider is that this dataset is supposed to be all sorted for Tregs and therefore, the clusters are capturing potential variation in cell states among Tregs. I am assuming there should be very minute levels of differences in terms of expression profile.
When I ran muscat on this same dataset, it did not pick up any differentially expressed genes among all clusters. Is there a possibility that since differences are so miniscule, its considering all genes as differentially expressed (all-p values become significant).
from distinct.
muscat detects DGE (differences in means), while distinct uses a non-parametric approach and also detects differences that do not involve the mean (e.g., difference in variance).
So it can be expected that distinct detects more differences than muscat, but the kind of pattern you observe is extreme.
One thing I just noticed is that top_results
ranks results from most to least significant; so when doing "head" it is normal that you have all significant results printed.
Check how many significant results there are:
a = top_results( ...)
mean(a$p_val < 0.01)
hist(a$p_val)
from distinct.
I ran distinct with P_4 = 1000 as you suggested and it did finish within one day.
Interestingly, when i ran top_results for all clusters, there were some clusters that did not have any differentially expressed genes where other clusters had 7,000 to 10,000 differentially expressed genes detected.
These are the histogram of p values for clusters 0 and 1 where it demonstrated noticeably large number of differentially expressed genes. Are these numbers out of ordinary or would they eventually get filtered out after increasing P_3 and P_4 values? (I will give it around five days to run to check whether they successfully finish.)
from distinct.
Hi Bohoon,
there are definitely A LOT of significant genes...this is certainly an unusual histogram of p-values (all p < 0.03).
Increasing P_3 and P_4 will not make a difference: this enables a more accurate estimate of small p-values, but we're talking about a small difference; this will not push p-values to 1.
Hard to say what is causing this.
If can make the data anonymous I can look at it.
Take care,
Simone
from distinct.
Related Issues (8)
- Replicate sample naming guidelines HOT 2
- Add log-fold-changes to res output table HOT 10
- Mat::init(): requested size is too large HOT 9
- object '.doSnowGlobals' not found HOT 4
- design matrix reference group HOT 6
- log2_FC error: Error in pb_2[, i] : subscript out of bounds HOT 1
- Result data frame is NULL HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from distinct.