jinmiaochenlab / batch-effect-removal-benchmarking
A benchmark of batch-effect correction methods for single-cell RNA sequencing data
When I try to run git lfs fetch/pull, I get this error:
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking.git/info/lfs'
May I get a Google Drive link to the datasets?
call_seurat3 <- function(batch_list, batch_label, celltype_label, npcs = 20,
plotout_dir = "", saveout_dir = "",
outfilename_prefix = "",
visualize = T, save_obj = T)
Sorry to disturb you, but I wonder why npcs is set to 20 in this step when running Seurat v3, since the default value for this step is 30. Thanks.
Hello, I cannot find in the original benchmark paper the source publication for the human PBMC 5' dataset (I know the 3' part comes from Zheng et al.). Where can I find the source for the 5' dataset? Thanks.
Sorry to disturb you again. I notice that in this paper you do not use any metric to evaluate BBKNN. I guess this is because BBKNN does not modify the original count matrix. However, it does affect the result after UMAP dimension reduction. Could I therefore use LISI and kBET on the UMAP output to evaluate this method? Thanks a lot.
What do you recommend trying for "big data" from RNA-seq rather than single-cell experiments?
In your script (https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/blob/master/Script/simulation/02_run/run_scanorama.ipynb) you used corrected_adata.var_names = adata.var_names to update the gene names in the corrected_adata object, which stores the integration results from Scanorama. Here adata is the object before it is passed to Scanorama.
However, after reading the Scanorama source code (https://github.com/brianhie/scanorama/blob/master/scanorama/scanorama.py, line 316, function merge_datasets), I found that Scanorama sorts the gene names passed to it. This means:
Given input gene names adata.var_names=('Gene1', 'Gene2', …, 'Gene5000') and data matrix adata.X=[x1, x2, …, x5000], Scanorama reorders both, so the returned values are corrected_adata.var_names=('Gene1', 'Gene10', 'Gene100', …, 'Gene999') and corrected_adata.X = [x1, x10, x100,…,x999].
Thus, after running corrected_adata.var_names = adata.var_names, you get:
corrected_adata.var_names=('Gene1', 'Gene2', …, 'Gene5000')
corrected_adata.X = [x1, x10, x100,…,x999]
The gene names no longer match the data, so the subsequent evaluation of differentially expressed genes (DEGs) will be wrong. After correcting this bug, I found that Scanorama achieved state-of-the-art performance on DEG recovery.
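The mismatch described above can be reproduced on a toy example without Scanorama itself. The sketch below only simulates the lexicographic sorting reported in the post; the gene names, values, and variable names are made up for illustration, and the fix shown (keeping the returned names and reindexing) is one possible remedy, not necessarily the one used in the repository.

```python
# Toy illustration: sorting gene names lexicographically reorders the columns,
# so the original names must not be copied back onto the reordered data.
genes = ["Gene1", "Gene2", "Gene3", "Gene10"]   # original column order
values = [1.0, 2.0, 3.0, 10.0]                  # one cell; values[i] belongs to genes[i]

# Simulate the reordering: sort the names and permute the columns with them.
order = sorted(range(len(genes)), key=lambda i: genes[i])
sorted_genes = [genes[i] for i in order]
sorted_values = [values[i] for i in order]

# Buggy relabelling: original names pasted onto the reordered columns.
mislabelled = dict(zip(genes, sorted_values))

# Safer relabelling: keep the names that came back with the reordered data,
# then look values up by name to restore the original column order.
relabelled = dict(zip(sorted_genes, sorted_values))
restored = [relabelled[g] for g in genes]

print(sorted_genes)          # lexicographic order of the gene names
print(mislabelled["Gene2"])  # value that actually belongs to Gene10
print(restored == values)    # True when the original order is recovered
```

With real AnnData objects the same idea would be to keep the gene list returned by the integration call and index by name rather than by position.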
Sorry to disturb you. Since Python cannot read RDS files, I cannot generate the h5ad file in Colab, and loading this dataset on my own laptop crashes it. Could you please suggest how to get the h5ad file mentioned in your script, or share a link where I can download it? Thanks.
I ran into some problems when trying to load data from a loom file into the pipeline provided in this code. Could you please give me some suggestions?
In addition, when you use IMAP to visualize your results, is the output the same every time you run the same dataset, or are there slight changes between runs? Thanks.
I am currently trying to run the methods from this paper on another platform (e.g. Harmony in Python), but I get different cLISI/ASW/iLISI values. Is that expected? Thanks.
Hello,
Thank you for the amazing work on providing the benchmarking scripts and datasets!
We are currently experiencing some issues accessing the datasets from LFS, please see attached error message:
fetch: Fetching reference refs/heads/master
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking.git/info/lfs'
Thank you for your kind help!
Cheers,
Chloe
Hello,
I want to use the dataset in your benchmarking paper to evaluate an algorithm. I notice that in dataset 1 you separate the DCs into two batches according to plate ID (P7, P8, P9, P10 as batch 1 and P3, P4, P13, P14 as batch 2).
I wonder why the batch group was defined like this.
Thanks a lot.
Hello,
I was hoping to use your simulated data, but I also wanted to look at which genes are truly up- and down-regulated. I saw that your splatter script creates a file with this information, but those files are missing from the simulation data directories.
Thanks!
Hello,
Thanks for your wonderful work!
I have a question about the input to the kBET algorithm.
I noticed that the input to kBET is the PCA embedding matrix of the integrated object instead of the cell-by-feature matrix.
So I tested the following three inputs:
1. Cell-by-feature matrix of the integrated data (seurat_V3_直接用细胞.png.pdf)
2. PCA embedding matrix of the integrated data (seurat_V3_intergrated_PCA.pdf)
3. PCA embedding matrix of the raw data (serat_v3_sct.pdf)
The results look better when the PCA embedding is used as input.
Why is this?
Hi,
Thanks for the extremely useful benchmark! I'm trying to reproduce some of the results and found that the dataset4 files are Git LFS pointers instead of the actual files.
I installed Git LFS and tried to fetch the files, but got this error message:
fetch: Fetching reference refs/heads/master
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/jykr/Batch-effect-removal-benchmarking.git/info/lfs'
Can you try uploading the data again? Thanks a lot!
Hello,
I was hoping to evaluate the performance of my algorithm, but I am unsure how to record memory usage. I could not find a description in your article of the tool used to record memory usage. Could you please tell me which tools you used?
Thanks!
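The thread does not say which tool the benchmark used. As one possible approach for Python code, and only as an assumption rather than the authors' method, the standard-library tracemalloc module can report the peak memory allocated while a function runs. The names measure_peak_memory and dummy_workload below are hypothetical.

```python
import tracemalloc


def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def dummy_workload(n):
    # Hypothetical stand-in for a batch-correction call.
    data = [float(i) for i in range(n)]
    return sum(data)


result, peak_bytes = measure_peak_memory(dummy_workload, 100_000)
print(f"peak traced memory: {peak_bytes / 1e6:.2f} MB")
```

Note that tracemalloc only sees allocations made through Python's memory allocator, so memory used inside native libraries may be undercounted. An operating-system-level measurement of the whole process would be needed to capture that.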
Hi,
There is only a call_harmony_2 function in call_harmony.R.
When I change call_harmony to call_harmony_2 in run_harmony_01.R, I get the error below:
b_seurat <- RunHarmony(object = b_seurat, batch_label, theta = theta_harmony, plot_convergence = TRUE,
nclust = numclust, max.iter.cluster = max_iter_cluster)
Error in UseMethod("RunHarmony") :
no applicable method for 'RunHarmony' applied to an object of class "seurat"