
scib's Introduction


Benchmarking atlas-level data integration in single-cell genomics

This repository contains the code for the scib package used in our benchmarking study of data integration tools. In our study, we benchmark 16 methods (see Tools) with 4 combinations of preprocessing steps, leading to 68 method combinations on 85 batches of gene expression and chromatin accessibility data.

Workflow

Resources

  • The git repository of the scib package and its documentation.
  • The reusable pipeline we used in the study can be found in the separate scib pipeline repository. It is reproducible and automates the computation of preprocessing combinations, integration methods and benchmarking metrics.
  • On our website we visualise the results of the study.
  • For reproducibility and visualisation we have a dedicated repository: scib-reproducibility.

Please cite:

Luecken, M.D., Büttner, M., Chaichoompu, K. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2022). https://doi.org/10.1038/s41592-021-01336-8

Package: scib

We created the Python package scib, which uses scanpy to streamline the integration of single-cell datasets and evaluate the results. The package contains several modules for preprocessing an AnnData object, running integration methods and evaluating the results with a number of metrics. For preprocessing, scib.preprocessing (or scib.pp) contains functions for normalising, scaling and batch-aware selection of highly variable genes. Functions for the integration methods are in scib.integration (or scib.ig for short), and metrics are under scib.metrics (or scib.me).

The scib python package is available on PyPI and can be installed through

pip install scib

Import scib in python:

import scib
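A minimal sketch of the preprocessing and integration steps is shown below. The function names follow the module layout described above, but exact names and signatures may differ between scib versions, and the input file and batch column are placeholders; please check the package documentation for the current API.

import scanpy as sc
import scib

# placeholder input: an AnnData object with a "batch" column in .obs
adata = sc.read_h5ad("my_dataset.h5ad")

# basic normalisation with scanpy, then batch-aware HVG selection with scib.pp
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata_hvg = scib.preprocessing.hvg_batch(adata, batch_key="batch", adataOut=True)

# run one of the wrapped integration methods (scib.ig), e.g. Scanorama
adata_int = scib.integration.scanorama(adata_hvg, batch="batch")

The integrated object can then be evaluated with the metrics described below.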

Optional Dependencies

The package contains optional dependencies that need to be installed manually if required. These include the R interface dependencies (rpy2, anndata2ri), which in turn require an installation of R and of the R packages of the integration methods. All optional dependencies are listed in setup.cfg under [options.extras_require] and can be installed through pip.

For example, to install the rpy2 and bbknn dependencies:

pip install 'scib[rpy2,bbknn]'

Optional dependencies outside of Python need to be installed separately. For instance, in order to run kBET, install it via the following commands in R:

install.packages('remotes')
remotes::install_github('theislab/kBET')

Metrics

We implemented different metrics for evaluating batch correction and biological conservation in the scib.metrics module.

Biological Conservation

  • Cell type ASW
  • Cell cycle conservation
  • Graph cLISI
  • Adjusted rand index (ARI) for cell label
  • Normalised mutual information (NMI) for cell label
  • Highly variable gene conservation
  • Isolated label ASW
  • Isolated label F1
  • Trajectory conservation

Batch Correction

  • Batch ASW
  • Principal component regression
  • Graph iLISI
  • Graph connectivity
  • kBET (k-nearest neighbour batch effect test)

For a detailed description of the metrics implemented in this package, please see our publication and the package documentation.
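As a rough sketch, the metrics can be computed together through the scib.metrics.metrics() wrapper; the keyword toggles below mirror the metrics listed above, while the input files, batch and label columns are placeholders, and the exact signature may differ between scib versions (check the documentation).

import scanpy as sc
import scib

# hypothetical inputs: the unintegrated object and the integrated output of a method;
# both are assumed to share obs_names and carry "batch" and "cell_type" in .obs
adata = sc.read_h5ad("unintegrated.h5ad")
adata_int = sc.read_h5ad("integrated.h5ad")

results = scib.metrics.metrics(
    adata,
    adata_int,
    batch_key="batch",
    label_key="cell_type",
    embed="X_emb",         # embedding used by embedding-based metrics
    ari_=True,             # adjusted rand index
    nmi_=True,             # normalised mutual information
    silhouette_=True,      # ASW-based scores
    pcr_=True,             # principal component regression
    isolated_labels_=True,
    graph_conn_=True,
    kBET_=True,
)
print(results)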

Integration Tools

Tools that are compared include:

Development

For developing this package, please make sure to install additional dependencies so that you can use pytest and pre-commit.

pip install -e '.[test,dev]'

Please refer to the setup.cfg for more optional dependencies.

Install the pre-commit hooks into the repository so that they run automatically every time you commit in git:

pre-commit install

scib's People

Contributors

adamgayoso, danielstrobl, hrovatin, johnarevalo, kridsadakorn, lauradmartens, lazappi, lisasikkema, luckymd, martaint, mbuttner, mumichae, mxmstrmn, scottgigante, scottgigante-immunai, szhorvat, zhongzheng1999, zhongzheng99


scib's Issues

runIntegration.py, error

Is there any update that I may have missed? I couldn't run any method via runIntegration.py on my 3 merged datasets.

Below is the error for BBKNN, which I could run successfully in a notebook.

(sc-tutorial) [chaichoompu@icb-lisa brain_atac_3datasets]$ ${PYSCRIPT} -i ${INPUTFILE} -o ${OUTPUTFILE} -b batch -v 5000 -m ${METHOD} > ${OUTPUTLOG}
Traceback (most recent call last):
  File "/home/icb/chaichoompu/Group/workspace/Benchmarking_data_integration/scripts/runIntegration.py", line 67, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "/home/icb/chaichoompu/Group/workspace/Benchmarking_data_integration/scripts/runIntegration.py", line 24, in runIntegration
    adataOut=True)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/preprocessing.py", line 264, in hvg_batch
    batch_key=batch_key)
  File "/home/icb/chaichoompu/.local/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 269, in highly_variable_genes
    flavor=flavor)
  File "/home/icb/chaichoompu/.local/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 123, in _highly_variable_genes_single_batch
    np.inf
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 270, in cut
    duplicates=duplicates,
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 387, in _bins_to_cuts
    "the 'duplicates' kwarg".format(bins=bins)
ValueError: Bin edges must be unique: array([  -inf, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12,
       1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12, 1.e-12,
       1.e-12, 1.e-12, 1.e-12, 1.e-12,    inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

update scanpy to current master

Hey @martaint @mumichae @danielStrobl @mbuttner,

A fix for a long-standing HVG selection bug was just introduced. That fix should change all of our results. I would suggest you either:

  1. Don't update your scanpy anymore for the analysis of your dataset, and note down the exact version you use
  2. Update now and rerun the analysis.

Depending on how far in you are, you may opt for option 1. Option 2 would obviously be better though. To run the integration & evaluation scripts we will need to use an updated scanpy either way. So if you choose to leave scanpy as it is for data pre-processing, you'll need a separate environment to run the integration + evaluation.

@kridsadakorn: This may also be relevant for you and Anna. See scverse/scanpy#705. I don't know whether Episcanpy also has this bug.

Using snakemake for metrics

Setup

  • Install snakemake into your conda environment:

    conda install -c bioconda -c conda-forge snakemake

  • Switch to the metrics_fixes branch.
  • Go to the scripts folder. You will find a Snakefile, which defines the rules.
  • Create a config.yaml.

Example of a config file

DATASETS: 
  - scanorama_embed
  - scanorama_full
  - seurat_full
  - trvae_embed
  - harmony_embed

batch_key: study
label_key: cell_type_union
organism: mouse

UNINTEGRATED: /storage/groups/ml01/workspace/group.daniela/atlases_merged_anno_new.h5ad
INTEGRATED_DIR: /storage/groups/ml01/workspace/group.daniela/atlases_integrated
METRICS_DIR: /storage/groups/ml01/workspace/group.daniela/metrics

Note on file names/paths

I would recommend moving/copying your output files into an integration output directory and creating a separate output folder for the metrics only. Also rename/copy your files to <nameWithoutUnderscores>.h5ad, so that snakemake can distinguish the file prefix from the output type. The DATASETS key will then contain entries in the format <nameWithoutUnderscores>_<outputType>. Output types are:

Output type   Description                                                  Methods
full          corrected expression matrix (full or reduced feature set)   MNN, Scanorama, Seurat
embed         corrected embedding                                          Scanorama, Harmony, trVAE
knn           corrected kNN graph                                          BBKNN

Call pipeline

Always run the pipeline from the scripts directory of the metrics_fixes branch.

snakemake -n          # dry run
snakemake             # run the pipeline
snakemake --cores 10  # parallelise jobs

Malte: Errors for Lung Data

Scanorama, hvgs=0

Traceback (most recent call last):
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 26, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 691, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py", line 30, in runScanorama
    emb, corrected = scanorama.correct_scanpy(split, return_dimred=True)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/scanorama/scanorama.py", line 205, in correct_scanpy
    **kwargs
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/scanorama/scanorama.py", line 97, in correct
    dimred=dimred)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/scanorama/scanorama.py", line 386, in process_data
    datasets[i] = normalize(ds, axis=1)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1614, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/sklearn/utils/validation.py", line 550, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 15148)) while a minimum of 1 is required by the normalize function.

HVG selection by intersect

@danielStrobl @mumichae It looks like we just keep going until we have 4k HVGs in the intersection in this function.
https://github.com/theislab/Benchmarking_data_integration/blob/81aa23057223970e75c193687b1f2f36b6a56478/scIB/preprocessing.py#L208-L213

I thought we agreed on:

  1. Using 2k HVGs
  2. Never going to more than 8k HVGs per dataset to find the intersection HVG list of 2000.

If this is how all the methods were integrated so far, we will have to redo everything... @mbuttner

Did not find X_emb error

Hi! I'm getting an error when running metrics.py, it looks as follows:

Options
type: embed
batch_key: dataset
label_key: scgen_labels
assay: expression
organism: human
n_hvgs: 1969
out_prefix: Barbry_Krasnow_Kropski_Lafyatis_Meyer_Misharin_MisharinNew_Nawijn_Teichmann_log1p_SCGEN_2khvgs_output_embed
optimised clustering results: /storage/groups/ml01/workspace/lisa.sikkema/scIB_metrics/Barbry_Krasnow_Kropski_Lafyatis_Meyer_Misharin_MisharinNew_Nawijn_Teichmann_log1p_SCGEN_2khvgs_output_embed_int_nmi.txt
reading adata before integration
AnnData object with n_obs × n_vars = 343717 × 22833
obs: 'age', 'age_range', 'anatomical_region', 'anatomical_region_detailed', 'batch', 'dataset', 'disease', 'donor', 'ethnicity', 'ethnicity_mixed', 'last_author/PI', 'lung_vs_nasal', 'original_celltype_ann', 'pack_years', 'sample', 'sample_alias', 'sample_type', 'sampling_method', 'sex', 'smoking', 'total_counts', 'log10_total_counts', 'n_genes_detected', 'mito_frac', 'ribo_frac', 'compl', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'ann_highest_res', 'ann_new', 'subject_type', 'n_counts', 'size_factors', 'scgen_labels'
var: 'n_cells', 'highly_variable_per_sample', 'highly_variable_per_dataset', 'highly_variable'
uns: 'anatomical_region_colors', 'ann_level_1_colors', 'ann_level_2_colors', 'ann_level_3_colors', 'dataset_colors', 'last_author', 'lung_vs_nasal_colors', 'neighbors', 'pca', 'scgen_labels_colors', 'sex_colors', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'counts'
reading adata after integration
AnnData object with n_obs × n_vars = 343717 × 1969
obs: 'age', 'age_range', 'anatomical_region', 'anatomical_region_detailed', 'ann_highest_res', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'ann_new', 'batch', 'compl', 'dataset', 'disease', 'donor', 'ethnicity', 'ethnicity_mixed', 'last_author/PI', 'log10_total_counts', 'lung_vs_nasal', 'mito_frac', 'n_counts', 'n_genes_detected', 'original_celltype_ann', 'pack_years', 'ribo_frac', 'sample', 'sample_alias', 'sample_type', 'sampling_method', 'scgen_labels', 'sex', 'size_factors', 'smoking', 'subject_type', 'total_counts', 'concat_batch'
uns: 'dataset_colors', 'neighbors', 'pca', 'scgen_labels_colors', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
reduce integrated data:
HVG selection: None
compute neighbourhood graph: True on X_emb
precompute PCA: True
PCA
Nearest Neigbours
Traceback (most recent call last):
File "metrics.py", line 159, in
pca=precompute_pca, umap=False)
File "/mnt/home/icb/lisa.sikkema/software/scib/scIB/preprocessing.py", line 382, in reduce_data
sc.pp.neighbors(adata, use_rep=use_rep)
File "/home/icb/lisa.sikkema/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scanpy/neighbors/init.py", line 94, in neighbors
random_state=random_state,
File "/home/icb/lisa.sikkema/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scanpy/neighbors/init.py", line 620, in compute_neighbors
X = _choose_representation(self._adata, use_rep=use_rep, n_pcs=n_pcs)
File "/home/icb/lisa.sikkema/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scanpy/tools/_utils.py", line 53, in _choose_representation
'You need to compute it first.'.format(use_rep))
ValueError: Did not find X_emb in .obsm.keys(). You need to compute it first.

I do have X_pca in both my objects, could it be that I have to rename it to X_emb?
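For what it's worth, a minimal workaround sketch under that assumption (that the PCA coordinates are the integrated embedding) would be to expose them under the .obsm key the script looks for; the file paths here are placeholders:

import scanpy as sc

# expose the PCA coordinates under the key that metrics.py expects
adata_int = sc.read_h5ad("integrated.h5ad")            # placeholder path
adata_int.obsm["X_emb"] = adata_int.obsm["X_pca"].copy()
adata_int.write("integrated_with_emb.h5ad")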

Mouse Atlas Annotations missing

The annotations provided only contain tissue and cell type. We also need sample and batch information, just like in Tabula Muris.

UnableToOpenBlob error when running plotSingleAtlas.R script

Hi!

I'm trying to run the plotSingleAtlas.R script to visualize the results of my benchmarking, but am getting an error when running its function in R:

plotSingleAtlas("/Users/lisa/Documents/scib_in_r/metrics_scaled.csv")
The following from values were not present in x: hvg, full_feature
The following from values were not present in x: Seurat, Mnn, Trvae
Error in magick_image_readpath(enc2native(path), density, depth, strip, :
R: UnableToOpenBlob `./img/embedding.png': No such file or directory @ error/blob.c/OpenBlob/2761
In addition: There were 21 warnings (use warnings() to see them)

Any idea where this error comes from or what piece of code relates to it?

Thanks!

Lisa

pip cannot find scanpy version for python conda environment

Hi!

Was trying to install the python conda environment (scIB-python.yml) and got this error:

Collecting scanpy==1.4.5.dev114+gd69832a (from -r /mnt/home/icb/lisa.sikkema/software/condaenv.s3o7wb4g.requirements.txt (line 49))

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement scanpy==1.4.5.dev114+gd69832a (from -r /mnt/home/icb/lisa.sikkema/software/condaenv.s3o7wb4g.requirements.txt (line 49)) (from versions: 0.2.1, 0.2.3, 0.2.3.4, 0.2.3.5, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.2.9.1, 0.3, 0.3.1, 0.3.2, 0.4, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 1.0, 1.0.1, 1.0.1.post1, 1.0.2, 1.0.3, 1.0.3.post1, 1.0.4, 1.1a1, 1.1a2, 1.1, 1.2.0, 1.2.1, 1.2.2, 1.3, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.3.8, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.4.post1, 1.4.5, 1.4.5.post1, 1.4.5.post2, 1.4.5.post3, 1.4.5.1, 1.4.6, 1.5.0a1, 1.5.0, 1.5.1)
ERROR: No matching distribution found for scanpy==1.4.5.dev114+gd69832a (from -r /mnt/home/icb/lisa.sikkema/software/condaenv.s3o7wb4g.requirements.txt (line 49))

CondaEnvException: Pip failed

Is it possible you need to adjust the scanpy version?

run conos via runIntegration.py failed

Hi,

I executed conos via runIntegration.py:

./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_conos.h5ad -b tech -v -m conos > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_conos.out

and obtained the following error message:

Traceback (most recent call last):
  File "./runIntegration.py", line 57, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 580, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 154, in runConos
    ro.r('con$buildGraph(k=10, k.self=10, space="PCA", ncomps=50)')
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/__init__.py", line 389, in __call__
    res = self.eval(p)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 121, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface.py", line 773, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in con$buildGraph(k = 10, k.self = 10, space = "PCA", ncomps = 50) : 
  insufficient number of comparable pairs

Gene scaling in data integration functions

When implementing the methods, do we generally follow the tutorials? We should be doing this.

However, the tutorials sometimes use different normalization procedures than we would use according to best practices. I guess we do not implement those, as we assume that the function receives an already pre-processed dataset. The one thing we should consider is whether or not to use feature scaling, depending on the method tutorials (e.g., https://github.com/quon-titative-biology/examples/blob/master/scAlign_multiway_alignment/scAlign_multiway_pancreas.md).

DL methods will not deal as well with non-scaled data, so we should follow the method recommendations here. Maybe add a scaling parameter to the function that defaults to whatever the method's tutorial uses?
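A rough sketch of that proposal; the wrapper and per-method defaults below are purely illustrative and not the scib API:

import scanpy as sc

# illustrative per-method defaults; the values here are made up for the sketch
SCALE_DEFAULTS = {"scanorama": False, "scalign": True}

def run_method(adata, batch, method, scale=None):
    """Hypothetical integration wrapper with a scaling switch."""
    if scale is None:
        # default to whatever the method's tutorial recommends
        scale = SCALE_DEFAULTS.get(method, False)
    if scale:
        adata = sc.pp.scale(adata, copy=True)  # feature scaling before integration
    # ... call the actual integration function for `method` here ...
    return adata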

LISI error

Just got this error while computing LISI on the BBKNN graph:

Traceback (most recent call last):
  File "metrics.py", line 138, in <module>
    lisi_=lisi_
  File "/home/icb/michaela.mueller/workspace/Benchmarking_data_integration/scIB/metrics.py", line 783, in metrics
    verbose=verbose), axis=1)
  File "/home/icb/michaela.mueller/workspace/Benchmarking_data_integration/scIB/metrics.py", line 529, in lisi
    simpson_estimate_batch = ro.r(f"simpson.estimate_batch <- compute_simpson_index(nn_indx, nn_dst, batch, n_batches, perplexity)") #batch_label_keys)")
  File "/home/icb/michaela.mueller/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/__init__.py", line 389, in __call__
    res = self.eval(p)
  File "/home/icb/michaela.mueller/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "/home/icb/michaela.mueller/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 121, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/icb/michaela.mueller/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "/home/icb/michaela.mueller/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface.py", line 785, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in compute_simpson_index(nn_indx, nn_dst, batch, n_batches, perplexity) : could not find function "compute_simpson_index"

LISI needs update for Rcpp package

Hi,

I have implemented the LISI score and started to test it.
I got the following error message and assume that the R package 'Rcpp' needs to be updated, but I do not know how to do that.

Traceback (most recent call last):
  File "metrics.py", line 99, in <module>
    pcr_=pcr_, kBET_=kBET_, lisi_ = lisi_, cell_cycle_=cc, verbose=False, organism=organism
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 673, in metrics
    verbose=False), axis=1)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 478, in lisi
    lisi_estimate = ro.r(f"lisi.estimate <- compute_lisi(data_mtrx, metadata, batch_label_keys)") #batch_label_keys)")
  File "/home/icb/daniel.strobl/miniconda3/envs/benchmarking_data_integration_dev/lib/python3.7/site-packages/rpy2/robjects/__init__.py", line 389, in __call__
    res = self.eval(p)
  File "/home/icb/daniel.strobl/miniconda3/envs/benchmarking_data_integration_dev/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "/home/icb/daniel.strobl/miniconda3/envs/benchmarking_data_integration_dev/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 121, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/benchmarking_data_integration_dev/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/benchmarking_data_integration_dev/lib/python3.7/site-packages/rpy2/rinterface.py", line 785, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in compute_simpson_index(t(dknn$nn.dists), t(dknn$nn.idx) - 1, labels,  : 
  function 'enterRNGScope' not provided by package 'Rcpp'

Help is much appreciated.

run MNN via runIntegration.py failed on huge dataset

Hi,
My dataset has 978k cells and I receive the following error message when running MNN:

nice ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/mouse_brain_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/integrated/mouse_brain_mnn_hvg2000.h5ad -b study -v 2000 -m mnn > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/integrated/mouse_brain_mnn.out
Traceback (most recent call last):
  File "./runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 27, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 599, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py", line 140, in runMNN
    corrected = mnnpy.mnn_correct(*split, var_subset=hvg)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/mnnpy/mnn.py", line 126, in mnn_correct
    svd_mode=svd_mode, do_concatenate=do_concatenate, **kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/mnnpy/mnn.py", line 157, in mnn_correct
    var_subset, n_jobs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/mnnpy/utils.py", line 54, in transform_input_data
    in_scaling = p_n.map(l2_norm, in_batches)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

runSeurat

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-5ff2ad57c105> in <module>
----> 1 res=scIB.integration.runSeurat(adata, batch='batch')

/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py in runSeurat(adata, batch, hvg)
     75     anndata2ri.activate()
     76 
---> 77     tmp = anndata.AnnData(X=adata.X.sorted_indices(), obs=adata.obs)
     78     ro.globalenv['adata'] = tmp
     79     ro.r('sobj = as.Seurat(adata, counts=NULL, data = "X")')

AttributeError: 'numpy.ndarray' object has no attribute 'sorted_indices'

cannot run kbet_ or hvg_score_ in scIB.metrics.metrics()

I'm getting the following errors when I try to run the scIB.metrics.metrics() function on my own data.
Below is the error traceback when I try to run the metrics on unintegrated data (i.e. scIB.metrics.metrics(adata, adata, 'RTcleanupMethod', 'timepoint', ...)); however, I'm getting the same errors when I provide the adata_int outputs from runScanorama(), runBBKNN() or runCombat().

kbet_=True

kBET...
/home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/umap/umap_.py:349: NumbaWarning: 
Compilation is falling back to object mode WITH looplifting enabled because Function "fuzzy_simplicial_set" failed type inference due to: Untyped global name 'nearest_neighbors': cannot determine Numba type of <class 'function'>

File "../../../../../../../../../../../home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/umap/umap_.py", line 467:
def fuzzy_simplicial_set(
    <source elided>
    if knn_indices is None or knn_dists is None:
        knn_indices, knn_dists, _ = nearest_neighbors(
        ^

  @numba.jit()
/home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/numba/compiler.py:742: NumbaWarning: Function "fuzzy_simplicial_set" was compiled in object mode without forceobj=True.

File "../../../../../../../../../../../home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/umap/umap_.py", line 350:
@numba.jit()
def fuzzy_simplicial_set(
^

  self.func_ir.loc))
/home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/numba/compiler.py:751: NumbaDeprecationWarning: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "../../../../../../../../../../../home/jamesb/anaconda3/envs/scIB-python/lib/python3.7/site-packages/umap/umap_.py", line 350:
@numba.jit()
def fuzzy_simplicial_set(
^

  warnings.warn(errors.NumbaDeprecationWarning(msg, self.func_ir.loc))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-f54eb36b4479> in <module>
     20     lisi_graph_=False, # bug .cat
     21 
---> 22     verbose=False
     23 )

/media/jamesb/OS/Users/jb2094/OneDrive - University Of Cambridge/PhD/Cambridge stem cell institute/phd/projects/_shared/pkg/scib-master/scIB/metrics.py in metrics(adata, adata_int, batch_key, label_key, hvg_score_, cluster_nmi, nmi_, ari_, nmi_method, nmi_dir, silhouette_, embed, si_metric, pcr_, cell_cycle_, organism, verbose, isolated_labels_, n_isolated, graph_conn_, kBET_, kBET_sub, lisi_graph_, trajectory_, type_)
   1848         kbet_score = 1-np.nanmean(kBET(adata_int, batch_key=batch_key, label_key=label_key, type_=type_,
   1849                                        embed = embed, subsample=kBET_sub,
-> 1850                                        heuristic=True, verbose=verbose)['kBET'])
   1851     else:
   1852         kbet_score = np.nan

/media/jamesb/OS/Users/jb2094/OneDrive - University Of Cambridge/PhD/Cambridge stem cell institute/phd/projects/_shared/pkg/scib-master/scIB/metrics.py in kBET(adata, batch_key, label_key, embed, type_, hvg, subsample, heuristic, verbose)
   1592         #check if neighborhood size too small or only one batch in subset
   1593         if np.logical_or(adata_sub.n_obs < 10, 
-> 1594                          len(adata_sub.obs[batch_key].cat.categories)==1):
   1595             print(f"{clus} consists of a single batch or is too small. Skip.")
   1596             score = np.nan

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5174             or name in self._accessors
   5175         ):
-> 5176             return object.__getattribute__(self, name)
   5177         else:
   5178             if self._info_axis._can_hold_identifiers_and_holds_name(name):

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    173             # we're accessing the attribute of the class, i.e., Dataset.geo
    174             return self._accessor
--> 175         accessor_obj = self._accessor(obj)
    176         # Replace the property with the accessor object. Inspired by:
    177         # http://www.pydanny.com/cached-property.html

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in __init__(self, data)
   2591 
   2592     def __init__(self, data):
-> 2593         self._validate(data)
   2594         self._parent = data.values
   2595         self._index = data.index

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in _validate(data)
   2601         if not is_categorical_dtype(data.dtype):
   2602             raise AttributeError(
-> 2603                 "Can only use .cat accessor with a " "'category' dtype"
   2604             )
   2605 

AttributeError: Can only use .cat accessor with a 'category' dtype

hvg_score_=True

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-f5d298b6424b> in <module>
     20     lisi_graph_=False, # bug .cat
     21 
---> 22     verbose=False
     23 )

/media/jamesb/OS/Users/jb2094/OneDrive - University Of Cambridge/PhD/Cambridge stem cell institute/phd/projects/_shared/pkg/scib-master/scIB/metrics.py in metrics(adata, adata_int, batch_key, label_key, hvg_score_, cluster_nmi, nmi_, ari_, nmi_method, nmi_dir, silhouette_, embed, si_metric, pcr_, cell_cycle_, organism, verbose, isolated_labels_, n_isolated, graph_conn_, kBET_, kBET_sub, lisi_graph_, trajectory_, type_)
   1875 
   1876     if hvg_score_:
-> 1877         hvg_score = hvg_overlap(adata, adata_int, batch_key)
   1878     else:
   1879         hvg_score = np.nan

/media/jamesb/OS/Users/jb2094/OneDrive - University Of Cambridge/PhD/Cambridge stem cell institute/phd/projects/_shared/pkg/scib-master/scIB/metrics.py in hvg_overlap(adata_pre, adata_post, batch, n_hvg)
    421 
    422     else:
--> 423         hvg_pre_list = precompute_hvg_batch(adata_pre, batch, hvg_post)
    424 
    425 

/media/jamesb/OS/Users/jb2094/OneDrive - University Of Cambridge/PhD/Cambridge stem cell institute/phd/projects/_shared/pkg/scib-master/scIB/metrics.py in precompute_hvg_batch(adata, batch, features, n_hvg, save_hvg)
    399             print(i.obs[batch][0]+' has less than the specified number of genes')
    400             print('Number of genes: '+str(i.n_vars))
--> 401         hvg = sc.pp.highly_variable_genes(i, flavor='cell_ranger', n_top_genes=n_hvg_tmp, inplace=False)
    402         hvg_dir[i.obs[batch][0]] = i.var.index[hvg['highly_variable']]
    403     adata_list=None

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace, batch_key)
    258             n_top_genes=n_top_genes,
    259             n_bins=n_bins,
--> 260             flavor=flavor,
    261         )
    262     else:

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_single_batch(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor)
    124             -np.inf,
    125             np.percentile(df['means'], np.arange(10, 105, 5)),
--> 126             np.inf
    127         ])
    128         disp_grouped = df.groupby('mean_bin')['dispersions']

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/reshape/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
    268         include_lowest=include_lowest,
    269         dtype=dtype,
--> 270         duplicates=duplicates,
    271     )
    272 

~/anaconda3/envs/scIB-python/lib/python3.7/site-packages/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
    385                 "Bin edges must be unique: {bins!r}.\nYou "
    386                 "can drop duplicate edges by setting "
--> 387                 "the 'duplicates' kwarg".format(bins=bins)
    388             )
    389         else:

ValueError: Bin edges must be unique: array([          -inf, 6.89655170e-03, 1.03448275e-02, 1.03448275e-02,
       1.37931034e-02, 1.72413792e-02, 2.06896551e-02, 2.41379309e-02,
       2.75862068e-02, 3.10344826e-02, 3.44827585e-02, 4.13793102e-02,
       5.51724136e-02, 7.24137947e-02, 9.31034461e-02, 1.31034479e-01,
       1.93103448e-01, 2.96551734e-01, 5.82758605e-01, 3.50999985e+01,
                  inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

I can't quite figure out what's going on. Any help appreciated!
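For what it's worth, the first traceback suggests the batch column is not stored as a pandas categorical (the .cat accessor fails); a minimal sketch of converting the columns used in the call above, with a placeholder file path:

import scanpy as sc

adata = sc.read_h5ad("my_data.h5ad")  # placeholder path

# kBET subsets on the batch column and uses the .cat accessor, so the batch
# (and label) columns should be stored as categoricals
adata.obs["RTcleanupMethod"] = adata.obs["RTcleanupMethod"].astype("category")
adata.obs["timepoint"] = adata.obs["timepoint"].astype("category")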

Analysis of robustness to order of datasets

Another analysis we should consider is robustness of results to the order of datasets. I heard today that Seurat v3 is not that robust to the order of datasets.

The idea would be to take 1 metric: e.g., NMI for clustering-label partition similarity, and then look at the variance of this result depending on random ordering of datasets.

Memory and time

Just to follow up on our meeting: I'm posting this as an issue since I am on my phone (before I forget it).

@danielStrobl, for runIntegration.py could you also please print out all parts of the memory usage, including the part after accounting for data loading? I want to figure out why BBKNN and scVI sometimes randomly report 0 MB usage, which should not be zero. I could check the memory usage from Slurm as well, and I would expect that when we sum up all parts, the numbers should be similar.

Second, about the time: is it also possible to print out the CPU time and the user time as well? I expect the user time to be more or less similar to the time that is currently reported by the script. The CPU time should roughly represent the time when we run on a single core.

From the internet:

Wall clock time is the actual amount of time taken to perform a job. This is equivalent to timing your job with a stopwatch and the measured time to complete your task can be affected by anything else that the system happens to be doing at the time.

User time measures the amount of time the CPU spent running your code. This does not count anything else that might be running, and also does not count CPU time spent in the kernel (such as for file I/O).

CPU time measures the total amount of time the CPU spent running your code or anything requested by your code. This includes kernel time.

The "User time" measurement is probably the most appropriate for measuring the performance of different jobs, since it will be least affected by other things happening on the system.

Difficulty installing the conda environments

Hi,

I'd like to try your snakemake pipeline on some data of my own. However, I'm having some trouble creating the two conda environments that require scanpy. When trying to create "scIB-python.yml", pip fails, because it can't find scanpy==1.4.5.dev114+gd69832a:

Pip subprocess error:
  ERROR: Could not find a version that satisfies the requirement scanpy==1.4.5.dev114+gd69832a (from versions: 0.2.1, 0.2.3, 0.2.3.4, 0.2.3.5, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.2.9.1, 0.3, 0.3.1, 0.3.2, 0.4, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 1.0, 1.0.1, 1.0.1.post1, 1.0.2, 1.0.3, 1.0.3.post1, 1.0.4, 1.1a1, 1.1a2, 1.1, 1.2.0, 1.2.1, 1.2.2, 1.3, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.3.8, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.4.post1, 1.4.5, 1.4.5.post1, 1.4.5.post2, 1.4.5.post3, 1.4.5.1, 1.4.6, 1.5.0a1, 1.5.0, 1.5.1)
ERROR: No matching distribution found for scanpy==1.4.5.dev114+gd69832a
CondaEnvException: Pip failed

And when I modify the .yml to just require scanpy==1.4.5, I get the following error:

Pip subprocess error:
ERROR: scanpy 1.4.5 has requirement h5py!=2.10.0, but you'll have h5py 2.10.0 which is incompatible.
ERROR: scanpy 1.4.5 has requirement scipy>=1.3, but you'll have scipy 1.2.1 which is incompatible.
ERROR: torchvision 0.5.0 has requirement torch==1.4.0, but you'll have torch 1.2.0 which is incompatible.
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Any help getting this to work would be very much appreciated!

NMI

NMI requires matching clusters of different annotations by their label. When comparing clusters before and after integration, the labels must somehow match. However, if you use "cell_ontology_class" and compare it with a new louvain clustering with labels 1, 2, ..., there won't be any match among the labels.

How can we solve this? My ideas so far:

  • rename unique clusters by integers and map these to the assignments
  • sort cluster labels by size, then rename them as integers
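A minimal sketch of the second idea (sorting labels by cluster size and renaming them as integers); the example annotations are made up:

import pandas as pd

def relabel_by_size(labels: pd.Series) -> pd.Series:
    """Rename labels as integers, largest cluster first."""
    order = labels.value_counts().index
    return labels.map({lab: i for i, lab in enumerate(order)})

# made-up annotations before and after integration
cell_types = pd.Series(["T cell", "T cell", "T cell", "B cell", "B cell", "NK"])
louvain = pd.Series(["1", "1", "1", "2", "2", "3"])

print(relabel_by_size(cell_types).tolist())  # [0, 0, 0, 1, 1, 2]
print(relabel_by_size(louvain).tolist())     # [0, 0, 0, 1, 1, 2]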

Tests for integration methods

One thing that is still outstanding is to write tests for the integration methods. This is not an immediate priority, but should be done soon.

Things to test:

  1. Method runs
  2. obs_names same after and before
  3. batch key is not overwritten

We should probably test this on some toy data that we create. Any other thoughts? Specifically @danielStrobl and @mumichae?
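A sketch of what such a test could look like on toy data; run_method below is a hypothetical stand-in for any of our integration wrappers:

import numpy as np
import pytest
import scanpy as sc

def run_method(adata, batch):
    # stand-in for an actual integration wrapper, e.g. runScanorama
    return adata.copy()

@pytest.fixture
def toy_adata():
    adata = sc.AnnData(np.random.rand(50, 20))
    adata.obs["batch"] = ["a"] * 25 + ["b"] * 25
    return adata

def test_integration_preserves_metadata(toy_adata):
    obs_before = toy_adata.obs_names.copy()
    batch_before = toy_adata.obs["batch"].copy()

    integrated = run_method(toy_adata, batch="batch")        # 1. method runs

    assert list(integrated.obs_names) == list(obs_before)    # 2. obs_names unchanged
    assert (integrated.obs["batch"] == batch_before).all()   # 3. batch key not overwritten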

To Do

  • combat
  • mnn in R
  • pca regression -> correct over highest variance confounder
  • correct over samples (i.e. "everything") -> time intensive
  • overcorrection via bio variance e.g. labels, trajectories
  • hvgs per batches

run MNN via runIntegration.py failed

Hi,

I executed MNN via runIntegration.py:

./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_mnn.h5ad -b tech -v -m mnn > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_mnn.out

and obtained the following error message:

Traceback (most recent call last):
  File "./runIntegration.py", line 57, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 580, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 130, in runMNN
    out.uns['emb']=False
NameError: name 'out' is not defined

notebooks/ folder excess

Hey @mumichae,

I'm wondering what the integration folder is for in the notebooks folder. I understand showing that the methods run... but conos.ipynb tests seurat rather than conos ;). And scgen is no longer used... Maybe we should get rid of at least those two notebooks?

run seurat via runIntegration.py failed

Hi,

I executed the runIntegration with the seurat method:

./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_seurat.h5ad -b tech -v -m seurat > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_seurat.out

and had the following error message:

Traceback (most recent call last):
  File "./runIntegration.py", line 57, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 580, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 63, in runSeurat
    tmp = anndata.AnnData(X=adata.X.sorted_indices(), obs=adata.obs)
AttributeError: 'ArrayView' object has no attribute 'sorted_indices'

run seurat via runIntegration failed

Hi,

I had problems to perform Seurat:

$ ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_seurat_hvg2000.h5ad -b tech -v 2000 -m seurat > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_seurat.out

Traceback (most recent call last):
  File "./runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 27, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 599, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 107, in runSeurat
    'eps = 0,'+
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/__init__.py", line 389, in __call__
    res = self.eval(p)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 121, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface.py", line 773, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in nn2(data = data.use[anchors.cells2, ], query = data.use, k = k +  : 
  Cannot find more nearest neighbours than there are points

Error when running runSeurat

@mumichae
There seems to be a bug in the runSeurat function, maybe it's connected to my seurat/scanpy version. It breaks at the IntegrateData() step. I used the merged_adata.h5ad for testing.
Error:

---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
<ipython-input-4-fda5300d5c32> in <module>
----> 1 corr_seurat = scIB.integration.runSeurat(adata, 'method')

/mnt/znas/icb_zstore01/groups/ml01/workspace/daniel.strobl/Benchmarking_data_integration/scIB/integration.py in runSeurat(adata, batch, hvg)
     90         'preserve.order = F,'+
     91         'do.cpp = T,'+
---> 92         'eps = 0,'+
     93         'verbose = T)'
     94     )

~/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/__init__.py in __call__(self, string)
    387     def __call__(self, string):
    388         p = _rparse(text=StrSexpVector((string,)))
--> 389         res = self.eval(p)
    390         return conversion.rpy2py(res)
    391 

~/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    190                 kwargs[r_k] = v
    191         return (super(SignatureTranslatedFunction, self)
--> 192                 .__call__(*args, **kwargs))
    193 
    194 

~/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    119             else:
    120                 new_kwargs[k] = conversion.py2rpy(v)
--> 121         res = super(Function, self).__call__(*new_args, **new_kwargs)
    122         res = conversion.rpy2py(res)
    123         return res

~/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     26 def _cdata_res_to_rinterface(function):
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType
     30         return _cdata_to_rinterface(cdata)

~/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface.py in __call__(self, *args, **kwargs)
    771                     error_occured))
    772             if error_occured[0]:
--> 773                 raise embedded.RRuntimeError(_rinterface._geterrmessage())
    774         return res
    775 

RRuntimeError: Error in nn2(data = data.use[anchors.cells2, ], query = data.use, k = k +  : 
  Cannot find more nearest neighbours than there are points
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> IntegrateData -> FindWeights -> nn2

scIB fails to import: AttributeError: module 'scanpy.plotting.palettes' has no attribute 'godsnot_64'

Installed scIB with pip install git+https://github.com/theislab/scib@master but cannot import the module.

>>> import scIB
/home/scottgigante/.local/lib/python3.6/site-packages/numba/core/errors.py:149: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scottgigante/.local/lib/python3.6/site-packages/scIB/__init__.py", line 2, in <module>
    from scIB import preprocessing, integration, metrics, clustering
  File "/home/scottgigante/.local/lib/python3.6/site-packages/scIB/preprocessing.py", line 51, in <module>
    palette=sc.pl.palettes.godsnot_64):
AttributeError: module 'scanpy.plotting.palettes' has no attribute 'godsnot_64'
>>> import scanpy as sc
>>> sc.__version__
'1.5.1'
>>> dir(sc.plotting.palettes)
['Mapping', 'Sequence', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_plot_color_cycle', 'cm', 'colors', 'default_102', 'default_20', 'default_28', 'godsnot_102', 'vega_10', 'vega_10_scanpy', 'vega_20', 'vega_20_scanpy', 'zeileis_28']

Metrics

  • distinguish embedded vs. non-embedded vs. kNN graph results (BBKNN vs. Harmony)
  • pca regression on embedded data
  • don't recompute neighbourhood graph

run via metrics.py

I ran with this command

METHOD=bbknn
TYPE=knn
INPUTFILE=/storage/groups/ce01/workspace/Benchmarking_data_integration/data/brain_atac_3datasets/merge_10x_CEMBA180312_3B_GSM3034638_bin_merged_filterRowCol_filterCountCell.h5ad
INTEGRATIONFILE=/storage/groups/ce01/workspace/Benchmarking_data_integration/data/brain_atac_3datasets/output/merge_10x_CEMBA180312_3B_GSM3034638_bin_merged_filterRowCol_filterCountCell_${METHOD}.h5ad
OUTDIR=/storage/groups/ce01/workspace/Benchmarking_data_integration/data/brain_atac_3datasets
BATCH=batchname
CELLTYPE=broad_cell_label
ORGANISM=human
HVGS=0
    
SCRIPTPATH=/home/icb/daniel.strobl/Benchmarking_data_integration/scripts
python ${SCRIPTPATH}/metrics.py -u ${INPUTFILE} -i ${INTEGRATIONFILE} -o ${OUTDIR} -b ${BATCH} -l ${CELLTYPE} --organism ${ORGANISM} --type ${TYPE} --hvgs ${HVGS} 

but I got the following error:

reading adata before integration
AnnData object with n_obs × n_vars = 14927 × 59822 
    obs: 'batch', 'batchname', 'n_genes', 'n_counts', 'counts'
    var: 'overlap10x-0-0', 'chrom-1-0', 'chrom2-1-0', 'n_cells-1-0', 'first_filtering-1', 'second_filtering-1', 'n_cells-1', 'n_cells'
    uns: 'batchname_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
reading adata after integration
AnnData object with n_obs × n_vars = 14927 × 59822 
    obs: 'batch', 'batchname', 'n_genes', 'n_counts', 'counts'
    var: 'overlap10x-0-0', 'chrom-1-0', 'chrom2-1-0', 'n_cells-1-0', 'first_filtering-1', 'second_filtering-1', 'n_cells-1', 'n_cells'
    uns: 'batchname_colors', 'mem', 'neighbors', 'pca', 'runtime'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
Nearest Neigbours
computing metrics
Traceback (most recent call last):
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/metrics.py", line 138, in <module>
    lisi_=lisi_
  File "/home/icb/chaichoompu/Benchmarking_data_integration/scIB/metrics.py", line 711, in metrics
    checkBatch(label_key, adata.obs)
  File "/home/icb/chaichoompu/Benchmarking_data_integration/scIB/utils.py", line 13, in checkBatch
    raise ValueError(f'column {batch} is not in obs')
ValueError: column broad_cell_label is not in obs

I can see that 'batchname' is in both files, but I am still curious why this error occurs.

Adapting scVI parameters

scVI seems to perform very poorly on the integration benchmark. This is likely due to sub-optimal parametrization. Currently the parameter choices were taken from the tutorial, but this may not be the best way to use scVI to compare it with other methods. Note we cannot optimize scVI too much as we don't do the same for other methods either. This would bias our results towards scVI.

Things we could change to improve performance:

  • increase number of epochs
    • from 100 to 400, 1000
    • for some representative datasets to get a good estimate
    • check model fit and when it overfits e.g. Elbo plot
  • in-/decrease n_latent
  • use batch-specific dispersion?

https://github.com/YosefLab/scVI/blob/975d9d86ba2be3341a07af07d879bca02c06a7ac/scvi/models/vae.py

runIntegration.py in progress

See here
https://1drv.ms/x/s!AkUC7od-rodKikyX9-bV9DgBUSgj

I'm starting to see the overall picture of scIB:

  • Seurat definitely still has an issue with big datasets.
  • I suspect that the errors in MNN on some datasets might be due to computational issues, i.e. a big-data issue?
  • As Malte mentioned before, the Lung data has some issues.
  • trVAE got killed by Slurm on some datasets. I don't know the reason. I will need to rerun and check the history in the Slurm database; since I didn't store the job IDs, I can't retrieve the info.

run scgen via runIntegration.py failed

Hi,

I executed the scgen data integration using the following command:

./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scgen.h5ad -b tech -v -m scgen > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scgen.out

and obtained this error message:

Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2890, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'louvain'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./runIntegration.py", line 57, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 580, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 43, in runScGen
    adata.obs['cell_type'] = adata.obs[cell_type].copy()
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/pandas/core/frame.py", line 2975, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2892, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'louvain'

The thing is that my data set does not contain a 'cell_type' field, so could you add something like the following to the runScGen function

# save cell types for later, but only if the annotation column actually exists
if cell_type in adata.obs.columns:
    cell_types = adata.obs[cell_type].copy()

batches = adata.obs[batch].copy()

?
Thanks!

Memory measurement

To follow up on our meeting:
I observed that the memory usage of BBKNN dropped in some datasets and started to question why.
Let's discuss it over here.
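For context, measureTM wraps each integration call in memory_profiler.memory_usage (as shown in the tracebacks in this thread). Below is a minimal, self-contained sketch of that sampling behaviour, with a toy function standing in for the integration call; sampling at discrete intervals is one possible reason a measured peak can drop between runs.

from memory_profiler import memory_usage

def integrate(n):
    # toy stand-in for an integration method; allocates a list to produce a visible peak
    data = list(range(n))
    return sum(data)

# memory_usage samples the process RSS at fixed intervals while the call runs;
# with retval=True it returns (samples_in_MiB, return_value). Short allocation
# spikes that fall between samples are missed, so measured maxima can fluctuate.
mem, result = memory_usage((integrate, (10_000_000,), {}), interval=0.1, retval=True)
print(f"peak ~ {max(mem):.1f} MiB, result = {result}")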

bbknn runIntegration.py ATAC big dataset

26.11.2019: I just re-ran bbknn multiple times on the sparse matrix of the big ATAC dataset
and got this error.

Traceback (most recent call last):
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 26, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/chaichoompu/Benchmarking_data_integration/scIB/metrics.py", line 691, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/chaichoompu/Benchmarking_data_integration/scIB/integration.py", line 216, in runBBKNN
    corrected = bbknn.bbknn(adata, batch_key=batch, copy=True)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/bbknn/__init__.py", line 283, in bbknn
    deep=('added to `.uns[\'neighbors\']`\n'
  File "/home/icb/chaichoompu/.local/lib/python3.7/site-packages/scanpy/logging.py", line 19, in info
    return msg(*args, v='info', **kwargs)
TypeError: msg() got an unexpected keyword argument 'deep'

Has anybody seen this before?

I'm not sure whether it's because of the data file itself or because of the move of our node to the new Slurm server.

run Seurat with runIntegration.py on huge dataset failed

Hi,

I ran Seurat on the huge dataset and obtained the following error message:

$./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/mouse_brain_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/integrated/mouse_brain_seurat_hvg2000.h5ad -b study -v 2000 -m seurat > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/mouse_brain/integrated/mouse_brain_seurat.out

Traceback (most recent call last):
  File "./runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "./runIntegration.py", line 27, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 599, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py", line 92, in runSeurat
    'max.features = 200,'+
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/__init__.py", line 389, in __call__
    res = self.eval(p)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 121, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/rpy2/rinterface.py", line 773, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in SNNAnchor(k_matrix = k.matrix, anchor_only = anchor.only) : 
  std::bad_alloc

Replace lm.score with metrics.r2_score

In metrics.py, in the pc_regression function (line 472), it might be desirable to replace lm.score() with metrics.r2_score(), since the former outputs an annoying warning at each iteration.
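For reference, both calls compute the same R² on the same fit; a minimal, self-contained sketch of the proposed swap (toy data, not the actual pc_regression code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# toy stand-ins for one principal component and a covariate;
# the real pc_regression loops over all PCs of the integrated data
rng = np.random.default_rng(0)
covariate = rng.normal(size=(100, 1))
pc = 0.5 * covariate[:, 0] + rng.normal(scale=0.1, size=100)

lm = LinearRegression().fit(covariate, pc)

r2_old = lm.score(covariate, pc)                      # current: R^2 via the estimator
r2_new = metrics.r2_score(pc, lm.predict(covariate))  # proposed: R^2 from predictions

assert np.isclose(r2_old, r2_new)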

installation instructions unclear

Hi!
I'd like to use scIB to test the quality of dataset integration on a dataset of mine. It is a bit unclear to me what I should install and how. There are installation instructions for scib that are not under the "Installation" header but above it, and then there are the conda environment instructions. Should I first install scIB and then the environment? It might be good to clarify this a bit in the readme.
Thanks!
Lisa

scvi on runIntegration.py

Test run on human_pancreas hvgs = 0

human_pancreas_norm_scvi_hvg0_err.txt
Traceback (most recent call last):
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 26, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 691, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py", line 89, in runScvi
    raise TypeError('Adata does not contain a `counts` layer in `adata.layers[`counts`]`')
TypeError: Adata does not contain a `counts` layer in `adata.layers[`counts`]`
human_pancreas_norm_scvi_hvg2000_err.txt
Traceback (most recent call last):
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 68, in <module>
    runIntegration(file, out, run, hvg, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scripts/runIntegration.py", line 26, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/metrics.py", line 691, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/mnt/znas/icb_zstore01/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scIB/integration.py", line 89, in runScvi
    raise TypeError('Adata does not contain a `counts` layer in `adata.layers[`counts`]`')
TypeError: Adata does not contain a `counts` layer in `adata.layers[`counts`]`
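The error indicates that runScvi expects raw counts under adata.layers['counts']. A rough sketch of storing that layer before normalisation so downstream scVI runs can find it (the input path is hypothetical, and this is not the actual preprocessing script used here):

import scanpy as sc

adata = sc.read_h5ad("human_pancreas_raw.h5ad")  # hypothetical path to the unnormalised data
adata.layers["counts"] = adata.X.copy()          # keep raw counts alongside the normalised X
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.write("human_pancreas_norm.h5ad")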
