
atlas-feature-selection-benchmark's Introduction

Atlas feature selection benchmarking

This repository contains code for benchmarking the effect of feature selection on scRNA-seq atlas construction and use.

For more information please refer to the documentation on the wiki.

Directory structure

  • analysis/ - Notebooks used to perform analysis of the results
    • R/ - R functions used in the analysis notebooks
  • bin/ - Scripts used in Nextflow workflows
    • functions/ - Functions used across multiple scripts
  • conf/ - Nextflow configuration files
  • envs/ - conda environment YAML files
  • output/ - Output from Nextflow workflows (not included in git)
  • reports/ - RMarkdown files and functions for output reports
  • work/ - The Nextflow working directory (not included in git)
  • workflows/ - Nextflow workflow files
  • LICENSE - The project license
  • main.nf - Main Nextflow workflow file
  • nextflow.config - Main Nextflow config file
  • README.md - This README
  • style_bin.sh - A script for styling the files in bin/

atlas-feature-selection-benchmark's People

Contributors

amitfrish, cramsuig, lazappi, oliverdietrich, sabrinarichter, wwxkenmo

atlas-feature-selection-benchmark's Issues

New metric: F1

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The F1 score is a commonly used evaluation metric that measures classification performance as the harmonic mean of precision and recall.
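
As a rough sketch of how this might be computed from true and predicted query labels, a minimal example using scikit-learn (the toy labels are placeholders):

    from sklearn.metrics import f1_score

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "NK"]

    # The `average` argument is where the class-averaging decision noted below is made
    print(f1_score(true, pred, average="micro"))
    print(f1_score(true, pred, average="macro"))     # unweighted mean over labels
    print(f1_score(true, pred, average="weighted"))  # weighted by label frequency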

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • Need to decide what kind of class averaging to use

New method: singleCellHaystack

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

singleCellHaystack is a package for predicting differentially expressed genes (DEGs) in single-cell transcriptome data without the use of cell labels. It uses Kullback-Leibler Divergence to find genes that are expressed in subsets of cells that are non-randomly positioned in a reduced dimensional space.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • use.advanced.sampling mode

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: scSEGIndex

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The single-cell Stably Expressed Gene (scSEG) index, available in the {scMerge} package, measures how stably expressed a gene is across a dataset and can be used to select stably expressed genes. This is the opposite of typical methods, which look for highly variable genes, so it would serve as a negative control for the benchmark.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

  • Need to decide which genes to select based on the score, the top n genes is probably the simplest approach

New metric: NMI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Normalised Mutual Information (NMI) measures the similarity between two sets of labels (in this case the ground truth cell labels and cluster assignments). It was used as part of the scIB project.
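
A minimal sketch using scikit-learn, assuming cluster assignments have already been computed (the toy inputs are placeholders):

    from sklearn.metrics import normalized_mutual_info_score

    # Toy example: ground truth cell labels vs. cluster assignments
    ground_truth = ["B", "B", "T", "T", "NK", "NK"]
    clusters = [0, 0, 1, 1, 1, 2]

    nmi = normalized_mutual_info_score(ground_truth, clusters)
    print(f"NMI: {nmi:.3f}")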

Links

Anything else?

Anything else you think is important about the metric

  • Clustering can be optimised for this metric using a function in scib

Things to think about

Promising:

  • reference mapping metrics
  • downstream analysis after integration and/or reference mapping, then metrics for this

Not sooo promising:

  • more integration metrics
  • potentially more methods

New method: seurat

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

{Seurat} is the most commonly used R toolbox. It contains a highly variable gene feature selection function that selects features by either performing a variance stabilising transformation and selecting variable features ("vst"), binning features by expression and selecting over-dispersed features ("mean.var.plot") or simply selecting the features with highest dispersion values ("dispersion").

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Method ("vst", "mean.var.plot", "dispersion")
  • Number of features (for "dispersion" and "vst")
  • Overall vs per batch

Links

Anything else?

  • Some methods are also implemented in scanpy but both should be included for comparison
  • Consider implementing each method separately (different scripts)

New feature: Pass reference and query to metrics

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Adapt pipeline to pass reference and query integrated objects to metrics. This is required by some of the mapping/unseen population metrics.

Anything else?

Anything else you think is important about the feature

New method: Wilcoxon

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features by taking the top marker genes for each label as detected using the Wilcoxon rank sum test
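
A rough sketch of one way this could look using scanpy's Wilcoxon rank-sum implementation (the label column name and number of genes per label are assumptions, and the data are assumed to be log-normalised):

    import scanpy as sc

    def select_wilcoxon_markers(adata, label_col="Label", n_per_label=20):
        """Select the union of the top marker genes per label (Wilcoxon rank-sum test)."""
        sc.tl.rank_genes_groups(adata, groupby=label_col, method="wilcoxon")
        selected = set()
        for label in adata.obs[label_col].unique():
            markers = sc.get.rank_genes_groups_df(adata, group=str(label))
            selected.update(markers["names"].head(n_per_label))
        return sorted(selected)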

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Filtering of genes (expression/proportion)
  • Ranking/sorting
  • Number of genes per label

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

Filtering 0 count genes in reference and query

Currently, removal of genes with 0 counts is done before splitting into reference and query datasets. This keeps the feature sets identical but means one of them could still contain genes with 0 counts. Consider whether filtering should be done on each separately and the intersection used.
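
A sketch of the separate-filter-then-intersect option, assuming both splits are AnnData objects that still share the same genes in the same order (names are placeholders):

    import numpy as np

    def shared_expressed_genes(reference, query):
        """Genes with at least one non-zero count in both the reference and the query."""
        ref_keep = np.asarray((reference.X > 0).sum(axis=0)).ravel() > 0
        query_keep = np.asarray((query.X > 0).sum(axis=0)).ravel() > 0
        return reference.var_names[ref_keep & query_keep]

    # genes = shared_expressed_genes(reference, query)
    # reference = reference[:, genes].copy()
    # query = query[:, genes].copy()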

New metric: ref-query ILIS

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

LISI between the reference and the query.
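
A simplified sketch of the idea, computing a plain inverse Simpson's index over each cell's k nearest neighbours with reference/query as the label (the perplexity-based weighting of the original LISI implementation is skipped; names are placeholders):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def simple_ref_query_lisi(embedding, is_query, n_neighbors=30):
        """Mean inverse Simpson's index of reference/query labels in each cell's kNN."""
        labels = np.asarray(is_query).astype(int)
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(embedding)
        _, idx = nn.kneighbors(embedding)
        scores = []
        for neighbours in idx:
            props = np.bincount(labels[neighbours], minlength=2) / n_neighbors
            scores.append(1.0 / np.sum(props ** 2))
        # Ranges from 1 (reference and query separated) to 2 (fully mixed)
        return float(np.mean(scores))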

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • This should be able to be implemented by modifying one of the existing LISI scripts to use different labels (reference/query instead of batch).
  • Will need to take both the reference and mapped query objects

New method: SCMER

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

SCMER is a feature selection method designed for single-cell data analysis. It selects a compact set of markers that preserve the manifold in the original data. It can also be used for multimodal data integration by using features in one modality to match the manifold of another modality.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Cell cycle conservation

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The cell cycle conservation score measures how much of the variance associated with the cell cycle in individual batches remains after integration. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • Requires the pipeline to be modified to pass species information for datasets to metrics

New dataset: Human Endoderm Atlas

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

The Human Endoderm Atlas [HEA] is a reference atlas of multiple endodermal organs from human development. It contains 34 samples from 14 individuals across 15 tissues from six organs. High quality labels with two hierarchical levels (major cell type, 7; cell type, 27) are available.

Links

Links to information about the dataset

Anything else?

Anything else you think is important about the dataset

  • This dataset is large (155,232 cells)

New method: scanpy

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

scanpy is the most commonly used Python toolbox and contains three methods for selecting highly variable genes: "seurat" (default), "cell_ranger" and "seurat_v3". For "seurat" and "cell_ranger" genes are binned by mean expression and normalised dispersions calculated per bin. "seurat" uses thresholds to select features while "cell_ranger" uses a target number of genes.

"seurat_v3" starts with raw counts (rather than log normalised) and applies a variance stabilising transformation and ranking genes by normalised variance.

More details on methods in the scanpy documentation.
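
A rough sketch covering the flavors described above (the batch_key argument corresponds to the per-batch variant listed below; how normalisation is handled beforehand is an assumption):

    import scanpy as sc

    def select_scanpy_hvgs(adata, flavor="seurat", n_top_genes=2000, batch_key=None):
        """Return highly variable genes for a given scanpy flavor."""
        adata = adata.copy()
        if flavor != "seurat_v3":
            # "seurat" and "cell_ranger" expect log-normalised data
            sc.pp.normalize_total(adata, target_sum=1e4)
            sc.pp.log1p(adata)
        # "seurat_v3" works directly on raw counts
        sc.pp.highly_variable_genes(
            adata, flavor=flavor, n_top_genes=n_top_genes, batch_key=batch_key
        )
        return adata.var_names[adata.var["highly_variable"]].tolist()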

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Flavors ("seurat", "cell_ranger" and "seurat_v3")
  • Number of features (for "cell_ranger" and "seurat_v3")
  • Overall vs per batch

Links

Anything else?

Anything else you think is important about the method

  • The "seurat" and "seurat_v3" flavors are also implemented in the original {Seurat} R package but it is worth including both to compare how similar the implementations are
  • Consider whether everything should be implemented as one method with variants (a single script) or as a separate method for each flavor (multiple scripts)

New metric: MILO

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Use the MILO method to identify cells with enriched query neighbourhoods in unseen cell labels.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

Might need some kind of summarisation to get an overall score.

Weight classification metrics by label rarity

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Classification metrics that are calculated per label (such as F1 score) can be averaged in various ways. The suggestion would be to weight averages by the rarity of the labels to focus on correct classifications of uncommon labels.
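
One possible sketch, averaging per-label F1 scores with weights inversely proportional to label frequency (the exact weighting scheme is an assumption):

    import numpy as np
    from sklearn.metrics import f1_score

    def rarity_weighted_f1(true, pred):
        """Average per-label F1 scores, weighted by inverse label frequency."""
        true = np.asarray(true)
        labels, counts = np.unique(true, return_counts=True)
        per_label = f1_score(true, pred, labels=labels, average=None)
        return float(np.average(per_label, weights=1.0 / counts))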

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New feature: Update to R 4.2

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Update all R environments to use R 4.2 and Bioconductor 3.16

Anything else?

Anything else you think is important about the feature

New method: Simple statistics

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on simple statistical values
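
A minimal sketch covering the mean and variance variants listed below (top-n selection and a cells x genes matrix are assumptions):

    import numpy as np

    def select_by_statistic(counts, gene_names, statistic="variance", n_features=2000):
        """Select the top n genes by a simple per-gene statistic."""
        counts = np.asarray(counts, dtype=float)  # cells x genes
        if statistic == "mean":
            values = counts.mean(axis=0)
        elif statistic == "variance":
            values = counts.var(axis=0)
        else:
            raise ValueError(f"Unknown statistic: {statistic}")
        order = np.argsort(values)[::-1][:n_features]
        return [gene_names[i] for i in order]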

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Mean
  • Variance

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: scPNMF

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The single-cell Projective Non-negative Matrix Factorization (scPNMF) method performs a dimensionality reduction and then filters bases to find those that show evidence of multimodal structure. Features can then be selected based on those bases.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Jaccard Index

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Jaccard index, or Jaccard similarity coefficient, is defined as the size of the intersection divided by the size of the union of two label sets.
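
A small sketch using scikit-learn; the `average` argument is where the class-averaging decision noted below would be made (toy labels are placeholders):

    from sklearn.metrics import jaccard_score

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "B"]

    print(jaccard_score(true, pred, average="macro"))     # unweighted mean over labels
    print(jaccard_score(true, pred, average="weighted"))  # weighted by label frequency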

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • Need to decide what kind of class averaging to use

New metric: kBET

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The k-Nearest Neighbour Batch effect Test (kBET) uses a statistical test to measure the mixing of batches and labels within the neighbourhood of a cell. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • kBET can be difficult and slow to implement (partly because it relies on rpy2), so it may not be worth including

New metric: Batch PCR

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The PCA Regression (PCR) comparison measures the amount of variance in the dataset explained by the batch label before and after integration. If the integration performs well, less variance should be explained by batch after integration than before. It was used as part of the scIB project.
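
A simplified sketch of the principal component regression part, computing the fraction of variance explained by batch for a single matrix or embedding (the number of components and the one-hot batch design are assumptions); the metric would compare this value before and after integration:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    def batch_variance_explained(matrix, batches, n_comps=50):
        """Fraction of PCA variance explained by regressing each PC on batch."""
        pca = PCA(n_components=n_comps)
        pcs = pca.fit_transform(matrix)
        design = pd.get_dummies(pd.Series(batches)).to_numpy(dtype=float)
        explained = 0.0
        for i in range(pcs.shape[1]):
            model = LinearRegression().fit(design, pcs[:, i])
            explained += model.score(design, pcs[:, i]) * pca.explained_variance_[i]
        return float(explained / pca.explained_variance_.sum())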

Links

Anything else?

Anything else you think is important about the metric

New metric: Imbalanced clustering metrics

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Add extensions to existing integration metrics based on clustering that take into account the imbalance in ground truth labels.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New method: Hotspot

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Hotspot is a graph-based method that selects features that are associated with similarity between cells (represented as a graph).

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Graph iLISI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The integration variant (iLISI) gives better scores when the neighbourhood contains cells from batches other than the target cell's. It was used as part of the scIB project where a more flexible graph-based implementation was developed.

Links

Anything else?

Anything else you think is important about the metric

New metric: ARI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Adjusted Rand Index (ARI) measures the similarity between two sets of labels (in this case the ground truth label and a clustering). It adjusts for the similarity that would be expected by chance depending on the size of the clusters. It was used as part of the scIB project.
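
A minimal sketch using scikit-learn, assuming cluster assignments have already been computed (toy inputs are placeholders); this could likely share a script with the NMI metric:

    from sklearn.metrics import adjusted_rand_score

    # Toy example: ground truth cell labels vs. cluster assignments
    ground_truth = ["B", "B", "T", "T", "NK", "NK"]
    clusters = [0, 0, 1, 1, 1, 2]

    ari = adjusted_rand_score(ground_truth, clusters)
    print(f"ARI: {ari:.3f}")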

Links

Anything else?

Anything else you think is important about the metric

  • Clustering can be optimised for this metric using a function in scib

New metric: Reconstruction error

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Distance between an average reconstructed cell and real query cells

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New method: DUBStepR

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

DUBStepR (Determining the Underlying Basis using Step-wise Regression) is a feature selection algorithm for cell type identification in single-cell RNA-sequencing data. It is based on the intuition that cell-type-specific marker genes tend to be well correlated with each other, i.e. they typically have strong positive and negative correlations with other marker genes.

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: OSCA

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The Orchestrating Single-Cell Analysis with Bioconductor book describes how to use core Bioconductor packages to analyse single-cell data. It proposes a method for feature selection that considers batches in the data and selects features with additional biological variation.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Anything else?

Anything else you think is important about the method

New metric: Graph connectivity

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The graph connectivity score measures how connected the subgraphs for each label are. For a well-integrated dataset it is expected that cells with the same label will be well connected while for a poorly integrated dataset they will be more disconnected. It was used as part of the scIB project.
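
A simplified sketch of the scIB-style calculation (for each label, the fraction of its cells in the largest connected component of the kNN subgraph, averaged over labels); using the neighbour graph stored by scanpy in adata.obsp["connectivities"] is an assumption:

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def graph_connectivity(connectivities, labels):
        """Mean over labels of the largest connected component fraction."""
        labels = np.asarray(labels)
        scores = []
        for label in np.unique(labels):
            mask = labels == label
            subgraph = connectivities[mask][:, mask]
            _, components = connected_components(subgraph, directed=False)
            scores.append(np.bincount(components).max() / mask.sum())
        return float(np.mean(scores))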

Links

Anything else?

Anything else you think is important about the metric

New metric: MCC

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Matthews Correlation Coefficient (MCC) measures the quality of binary and multiclass classifications and is regarded as a balanced measure that can be used even if the classes are of very different sizes.
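
A minimal sketch using scikit-learn (toy labels are placeholders):

    from sklearn.metrics import matthews_corrcoef

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "NK"]

    mcc = matthews_corrcoef(true, pred)
    print(f"MCC: {mcc:.3f}")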

Links

Anything else?

Anything else you think is important about the metric

New feature: Bump scIB version

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Update scIB environments to the latest release once changes have been merged

Anything else?

Anything else you think is important about the feature

Required to avoid workarounds in:

  • LISI metrics
  • Cell cycle metric

New method: Brennecke

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on excess coefficient of variation
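
A simplified sketch of the idea, fitting the trend between squared coefficient of variation and mean expression with ordinary least squares rather than the gamma GLM used in the original method (top-n selection and normalised counts as input are assumptions):

    import numpy as np

    def select_excess_cv2(norm_counts, gene_names, n_features=2000):
        """Select genes whose squared CV most exceeds the mean-dependent trend."""
        norm_counts = np.asarray(norm_counts, dtype=float)  # cells x genes
        mean = norm_counts.mean(axis=0)
        keep = mean > 0
        cv2 = norm_counts.var(axis=0)[keep] / mean[keep] ** 2
        # Fit cv2 ~ a1 / mean + a0 by least squares (the original uses a gamma GLM)
        design = np.column_stack([1.0 / mean[keep], np.ones(keep.sum())])
        (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
        excess = np.full(mean.shape, -np.inf)
        excess[keep] = cv2 / (a1 / mean[keep] + a0)
        order = np.argsort(excess)[::-1][:n_features]
        return [gene_names[i] for i in order]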

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Local structure

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Describes how well the local structure of each group prior to integration is preserved after integration

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New metric: Graph cLISI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The cell-type variant (cLISI) gives better scores when the neighbourhood consists of the same labels as the target cell. It was used as part of the scIB project where a more flexible graph-based implementation was developed.

Links

Anything else?

Anything else you think is important about the metric

New dataset: HLCA

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

The Human Lung Cell Atlas is a comprehensive catalogue of cells in the human lung. It contains samples from hundreds of individuals and high-quality consensus cell labels at different hierarchical levels.

Links

Anything else?

Anything else you think is important about the dataset

  • This dataset is very large and may be difficult in terms of computational resources

New metric: ALCS

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Accuracy Loss of Cell type Self-projection (ALCS) is the difference in accuracy between classifiers trained on the query alone (per batch) and on the query mapped into the reference space.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

Not sure whether the exact metric is available in the package or if it needs some wrapping code.

Bug: NBumi returns 0 features on test simulation

Thanks for taking the time to fill out this bug report!
Please use Markdown formatting for any code snippets.

What happened?

Briefly describe the issue

On the small test simulation datasets the NBumi method selects zero genes which breaks the integration stage of the pipeline

What were you doing?

Briefly describe what led to the issue. A reproducible example or other code snippets are great

What did you see?

Include any error messages, log output or other output that could help diagnose the problem

    Subsetting to 0 selected features...
    Setting up AnnData for scVI...
    ...
    IndexError: index 172 is out of bounds for size 0

Proposed solution

If you have a suggestion for how to solve the issue we would love to hear it!

Options:

  • Make NBumi error if no features are selected (and allow the Nextflow process to fail)
  • Return all genes if none are selected?
  • Modify the integration stage to error if there are no selected genes and allow the Nextflow process to fail (see the sketch below)
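
A minimal sketch of the third option, adding a guard to the integration script before it subsets to the selected features (shown for the Python/scVI stage; the variable names are assumptions):

    # In the integration script, before subsetting the AnnData to the selected features
    if len(selected_features) == 0:
        raise ValueError(
            "No features were selected, failing so the Nextflow process reports an error"
        )
    adata = adata[:, selected_features].copy()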

Your environment

Please include the information relevant to your issue

HMGU server

Anything else?

Anything else you want to tell us about the issue

New metric: query kNN-corr

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Correlation between the KNN of query cells before and after mapping.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

May be possible to reuse the Symphony implementation (preferred) but if not should not be too difficult to reimplement

New method: triku

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

triku is a graph-based feature selection method that selects features that show an unexpected number of zero counts and whose expression is located in cells that have similar expression profiles.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Label ASW

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. By calculating it on cell labels we evaluate whether cells of the same type are nearby or separated. It was used as part of the scIB project.
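
A minimal sketch using scikit-learn's silhouette score on the integrated embedding (rescaling to [0, 1] as in scIB is an assumption about how it would be reported):

    from sklearn.metrics import silhouette_score

    def label_asw(embedding, labels):
        """Silhouette width of cell labels in an embedding, rescaled to [0, 1]."""
        return (silhouette_score(embedding, labels) + 1) / 2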

Links

Anything else?

Anything else you think is important about the metric

New method: Pearson residuals

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on high Pearson residuals
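
A simplified sketch, computing analytic Pearson residuals under a negative binomial model with fixed overdispersion and ranking genes by residual variance (the overdispersion value, clipping and top-n selection are assumptions; as noted below this could instead reuse the existing scanpy script):

    import numpy as np

    def select_pearson_residuals(counts, gene_names, n_features=2000, theta=100.0):
        """Select genes with the highest variance of analytic Pearson residuals."""
        counts = np.asarray(counts, dtype=float)  # cells x genes, raw counts
        total = counts.sum()
        mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
        denom = np.sqrt(mu + mu ** 2 / theta)
        residuals = np.divide(counts - mu, denom, out=np.zeros_like(counts), where=denom > 0)
        # Clip residuals to sqrt(n_cells), a common choice
        clip = np.sqrt(counts.shape[0])
        residuals = np.clip(residuals, -clip, clip)
        order = np.argsort(residuals.var(axis=0))[::-1][:n_features]
        return [gene_names[i] for i in order]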

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

Can probably be implemented by modifying the existing scanpy script

New metric: Isolated labels score

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The isolated labels score measures how well labels that were present in few samples can be distinguished in the integrated dataset. It can be calculated using either a clustering-based approach with the F1 score or unsupervised ASW. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • Can probably be implemented as a single script with an option to select either the F1 or ASW score
    • Check what clustering information is required/used for the F1 variant

New feature: Pass dataset species

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

The pipeline should keep track of the species for each dataset and pass that information as needed (particularly to metrics). Required for the cell cycle conservation score #18.

Anything else?

Anything else you think is important about the feature

New metric: Batch ASW

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. For the scIB project a modified version was developed which evaluates how spread out (and therefore how well integrated) batches are.

Links

Anything else?

Anything else you think is important about the metric

New method: M3Drop

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

M3Drop fits a Michaelis-Menten model to the pattern of dropouts in single-cell RNASeq data. This model is used as a null to identify significantly variable (i.e. differentially expressed) genes for use in downstream analysis.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Variants for full-length (M3Drop) and UMI data (NBumi)

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

  • Package also contains implementations of other simple feature selection methods (Brennecke, giniFS, pcaFS, corFS, ConsensusFS)

New dataset: NeurIPS 2021

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

Open Problems in Single-Cell Analysis produced a dataset for a NeurIPS 2021 competition. It contains multi-omics samples from several individuals, produced at different sequencing facilities, with consensus cell labels.

Links

Anything else?

Anything else you think is important about the dataset

  • The samples contain multiple modalities (either CITE-seq or 10x Multiome); we are only interested in the RNA data
  • Possibly use a subset or create two datasets for the different technologies

New method: anticor

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Selects features based on those that have an excess of negative correlations

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method
