bimsbbioinfo / ikarus

Identifying tumor cells at the single-cell level using machine learning

License: MIT License

Python 3.22% Jupyter Notebook 28.86% HTML 67.92%

bioinformatics cancer-genomics machine-learning single-cell

ikarus's Introduction

ikarus

ikarus is a stepwise machine learning pipeline for distinguishing tumor cells from normal cells. Leveraging multiple annotated single-cell datasets, it can be used to define a gene set specific to tumor cells. This gene set is first used to rank cells and then to train a logistic classifier for the robust classification of tumor and normal cells. Finally, sensitivity is increased by propagating the cell labels based on a custom cell-cell network. ikarus has been tested on multiple single-cell datasets to ascertain that it achieves high sensitivity and specificity in multiple experimental contexts. More information is available in the corresponding publication.

Installation

ikarus currently supports python>=3.8, and can be installed from PyPI:

pip install ikarus

Alternatively, one can install ikarus' master branch directly from GitHub:

python -m pip install git+https://github.com/BIMSBbioinfo/ikarus.git

Usage

The easiest option to get started is to use the provided Tumor/Normal gene lists and the pretrained model:

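from pathlib import Path
from ikarus import classifier

# Provided gene lists and pretrained model; adjust the paths to your downloads
signatures_path = Path("signatures.gmt")
model_path = Path("core_model.joblib")

model = classifier.Ikarus(signatures_gmt=signatures_path)
model.load_core_model(model_path)
predictions = model.predict(test_adata, 'test_name')  # test_adata: your AnnData object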

More information on how to train a model or how to create your own gene lists is provided in the tutorial notebook.

Example notebooks
Data preparation and basic prediction

ikarus's Issues

Customize training data

I was very pleased that Ikarus predicted malignant cells with 0.93 accuracy in hepatocellular carcinoma (HCC) scRNA-seq data. I am planning to use it to find malignant cells in other unannotated HCC scRNA-seq datasets. So, I thought I could simply add an annotated HCC dataset to find a new tumor gene set to make it more specific to HCC datasets.

Is it possible to do so?
If yes, how was the "major_hallmark_corrected" found?
Also, what are the major, tier_0 ... tier_3 columns in adata.obs?

thank you,
Yulia

Failed to converge when training the model.

Hi, thanks for this great tool.
I managed to get the gmt file with public datasets, but I ran into this message when I tried to train the model:

/home/sxykdx/miniconda3/envs/ikarus/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

So I tried editing this line:

ikarus/ikarus/classifier.py

Line 533 in 0a4baf4

model = LogisticRegression(max_iter=1000)

I changed max_iter from the default of 1000 to 2000, 4000, 10000, and 40000, but still got the same message. I'm not sure how to fix this. Maybe I should subsample the data? Thanks.
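
The warning text itself names feature scaling as an alternative to raising max_iter. The following is a minimal sketch of that idea on synthetic data. It is not ikarus code: the two columns only stand in for features on very different scales (an assumption made for illustration), and whether scaling resolves the warning on real ikarus scores is untested here.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_cells = 500

# Synthetic stand-in features on very different scales (hypothetical data).
X = np.column_stack([
    rng.normal(0.0, 1.0, n_cells),
    rng.normal(0.0, 1e4, n_cells),
])
# Noisy labels so the two classes are not perfectly separable.
y = (X[:, 0] + X[:, 1] / 1e4 + rng.normal(0.0, 0.5, n_cells) > 0).astype(int)

# Standardize each feature before fitting the logistic regression.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a ConvergenceWarning into an exception so it cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    scaled_model.fit(X, y)

# Number of lbfgs iterations actually used by the fitted classifier.
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Wrapping the scaler and the classifier in one pipeline means the same scaling is applied automatically at prediction time.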

Implement different scoring functions

ssGSEA, signscore

incorrectly labels normal epithelial cells as tumor cells

Dear developers,

I'm testing ikarus using data (KUL3-21N and KUL3-21T) from "Lee, Hae-Ock, et al. "Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer." Nature Genetics 52.6 (2020): 594-603."

I noticed that when I use the normal samples, ikarus still labels normal epithelial cells as tumor cells. Do you have any suggestions for that?

The analysis can be found in https://github.com/neil-n-zhang/ikarus_LeeCRC/tree/main/code

Thanks so much!
Neil

Problems about the feature selection of the model

Hi, I think this is cool research. I am trying to understand some details of the model and have run into a few questions. I wonder if you would be willing to answer them:
1. Feature selection: I noticed that the resulting features of ikarus contain two gene sets: 162 tumor cell-specific genes and 1313 normal cell-specific genes. The article uses the intersection and cross-validation of multiple single-cell datasets for feature selection. Which single-cell datasets were intersected to arrive at these final features? (Is it the five datasets used for cross-validation in the original text below: For cross validation, we have used the two lung cancer datasets from Laughney [27] and Lambrechts [28], a colorectal cancer from [29], neuroblastoma dataset from Kildisiute [30], and a head and neck cancer datasets from [31].)
2. Using the ikarus model: I loaded the pretrained ikarus model downloaded from GitHub. When I tested it on a public single-cell dataset, I found that if fewer than 80% of the model's feature genes are present in the dataset, the model test fails with the error "input data contains NaN". I checked my h5ad file and confirmed that it contains no NaN values. When I prune the model's features so that more than 80% of them appear in the single-cell dataset, the model test runs fine. Has anyone had a similar problem and tried to fix it? (Maybe this is an inherent problem with AUCell?)
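
One way to check the overlap before calling the model is to compare the signature genes against the dataset's gene names directly. The sketch below is plain Python and does not use ikarus or AUCell. The function names are hypothetical, and the 0.8 default is taken from the "Less than 80% of the genes" message quoted in other reports, not from the ikarus source.

```python
def read_gmt(path):
    """Parse a GMT file: set name, description, then tab-separated genes."""
    signatures = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            genes = [gene for gene in fields[2:] if gene]
            if genes:  # skip blank or malformed lines
                signatures[fields[0]] = genes
    return signatures


def signature_overlap(signatures, dataset_genes):
    """Return, per signature, the fraction of its genes found in dataset_genes."""
    present = set(dataset_genes)
    return {
        name: sum(gene in present for gene in genes) / len(genes)
        for name, genes in signatures.items()
    }


def below_threshold(overlap, threshold=0.8):
    """List the signatures whose overlap is under the threshold (0.8 is assumed)."""
    return [name for name, fraction in overlap.items() if fraction < threshold]
```

With an AnnData object this could be called as signature_overlap(read_gmt("signatures.gmt"), adata.var_names). A low fraction often points to a naming mismatch (for example, gene symbols versus Ensembl IDs), though that is only one possible cause.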

adjust sensitivity of prediction

Hello,

I have been applying Ikarus to a few Pancreatic ductal adenocarcinoma samples.
What I find is that in some samples, there is "over prediction" or "under prediction" for the core prediction.
But what I also saw is that the final prediction (cell-cell network label propagation) was "overdoing it".
Are there parameters to calibrate (or fine-tune) the predictions of either or both steps? Is this possible?
The first step uses logistic regression to produce core_pred; the second step uses cell-cell network label propagation to produce final_pred.

For example:

Here ikarus predicts correctly that there are cancer cells in "Ductal cell type 2" but those are not enough to make it through the cell-cell network propagation (there are no malignant cells in the final_pred).
I am looking for a way to boost the tumor predictions on the first step, or the second step.
The ultimate goal is to have tumor cells in the final_pred.

In case I have something wrong here please do correct me. This is based on how I understand the method works.
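
A traceback in another report shows the core step calling predict on a scikit-learn LogisticRegression, which for two classes corresponds to a 0.5 probability cutoff. A generic way to trade specificity for sensitivity is to apply a lower cutoff to the predicted probabilities. The sketch below illustrates only that idea. It is not a documented ikarus option, the function name is hypothetical, and how to obtain per-cell tumor probabilities from ikarus (and which predict_proba column is the tumor class, see classes_) is an assumption left to the reader.

```python
import numpy as np


def call_tumor(tumor_probability, threshold=0.5):
    """Label a cell 'Tumor' when its tumor probability reaches the threshold."""
    tumor_probability = np.asarray(tumor_probability, dtype=float)
    return np.where(tumor_probability >= threshold, "Tumor", "Normal")


# Hypothetical per-cell tumor probabilities.
probabilities = np.array([0.20, 0.45, 0.60])

default_calls = call_tumor(probabilities)         # cutoff 0.5
sensitive_calls = call_tumor(probabilities, 0.4)  # lower cutoff, more tumor calls
```

Lowering the cutoff increases tumor calls in the first step only; whether they survive the label propagation step would need to be checked separately.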

glmer model failed to converge

Hi everyone,
As with almost everyone, I have run into this warning "model failed to converge" and I need help to clear this warning message while running my glmer analysis.
" Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00272495 (tol = 0.002, component 1)"

I managed to clear it before by changing the optimizer in the first two models I am running, but I have now tried all the optimizers and nothing seems to work.
glmer code:
model_300 <- glmer(correlation_effect ~ language + proficiency.level + verb.freq + language:proficiency.level + language:verb.freq + proficiency.level:verb.freq + (1 | ID) + (1 | test.verb),
                   data = accuracy.data, family = binomial(link = "logit"),
                   control = glmerControl(optimizer = "nloptwrap", optCtrl = list(algorithm = "NLOPT_LN_NELDERMEAD")))

Model summary:

allFit summary:

Trying different optimizers using the following functions:

What else could I do to solve this issue?
Thanks in advance ;)

I ran into some problems when reproducing results of your article.

Hello,

After reading your article “Identifying tumor cells at the single-cell level using machine learning”, I was very interested in the Ikarus method because of its excellent performance in classifying Tumor and Normal cells. But when reproducing the results, I ran into some problems presented as below:

When selecting the gene signature, I used the create_all function in “gene_list.py” uploaded on GitHub, but I don’t know how to set label_upregs_list and label_downregs_list. I tried every setting I could think of, but in every case the overlap between the signature genes and the dataset was below 80%, so AUCell couldn’t return scores.
Your article states that “a user-defined number of top genes is selected (for our analyses, we used the top 300 genes)”, but if I set top_x to 300, I can’t get a gene signature of more than 1000 genes. I then tried setting top_x to 1500, but AUCell still couldn’t return scores.

The screenshots of codes and their results are attached below.

ValueError: output array is read-only when calling preprocess_adata

I am trying to run ikarus from R with reticulate. To prepare the data I used this line following the tutorial:

adata = ikarus$data$preprocess_adata(adata)

This gives the error:

Error in py_call_impl(callable, dots$args, dots$keywords) :
  ValueError: output array is read-only
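
NumPy raises this ValueError when an in-place operation targets an array whose writeable flag is off. The sketch below reproduces the message on a small array and shows that a copy is writeable again. That the matrix passed in from R arrives as a read-only array is an assumption about this report, not something confirmed here.

```python
import numpy as np

counts = np.array([[1.0, 3.0], [0.0, 4.0]])
counts.setflags(write=False)  # simulate a matrix handed over as read-only

try:
    np.log1p(counts, out=counts)  # in-place write into a read-only array
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

writable = counts.copy()  # a copy owns its data and is writeable by default
np.log1p(writable, out=writable)
```

If this is the cause, copying the expression matrix on the Python side before preprocessing may be worth trying.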

BTW, I also wonder what preprocess_adata does internally. If not using this function, would a regular log-normalization (i.e. normalize by total UMI counts, multiplied by a scaling factor, then log-transform with a pseudocount) do the job?
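
For reference, the log-normalization described in the question can be written in a few lines of NumPy. This sketch implements only what the question describes, with a hypothetical function name and a scaling factor of 1e4 chosen for illustration. It makes no claim about what preprocess_adata does internally.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


# Two cells, two genes; returns a new array and leaves the input untouched.
example = log_normalize([[1, 1], [0, 4]], target_sum=10)
```

In scanpy, the equivalent steps are sc.pp.normalize_total(adata, target_sum=1e4) followed by sc.pp.log1p(adata).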

I am using R 4.2.1, python 3.10.13. Since this seems to be an issue happening in python, below I list the python module versions installed:

Package                 Version
----------------------- ------------
aiohttp                 3.8.5
aiosignal               1.3.1
anndata                 0.9.2
arboreto                0.1.6
async-timeout           4.0.3
attrs                   23.1.0
BitVector               3.5.0
bokeh                   3.2.2
boltons                 23.0.0
certifi                 2023.7.22
charset-normalizer      3.2.0
click                   8.1.7
cloudpickle             2.2.1
contourpy               1.1.0
ctxcore                 0.2.0
cycler                  0.11.0
cytoolz                 0.12.2
dask                    2023.9.0
dill                    0.3.7
distributed             2023.9.0
docutils                0.20.1
fonttools               4.42.1
frozendict              2.3.8
frozenlist              1.4.0
fsspec                  2023.9.0
GMM-Demux               0.2.2.1
h5py                    3.9.0
idna                    3.4
igraph                  0.10.6
ikarus                  0.0.3
importlib-metadata      6.8.0
interlap                0.2.7
Jinja2                  3.1.2
joblib                  1.3.2
kiwisolver              1.4.5
leidenalg               0.10.1
llvmlite                0.40.1
locket                  1.0.0
loompy                  3.0.7
lz4                     4.3.2
MarkupSafe              2.1.3
matplotlib              3.7.2
msgpack                 1.0.5
multidict               6.0.4
multiprocessing-on-dill 3.5.0a4
natsort                 8.4.0
networkx                3.1
numba                   0.57.1
numexpr                 2.8.5
numpy                   1.24.4
numpy-groupies          0.9.22
packaging               23.1
pandas                  2.1.0
partd                   1.4.0
patsy                   0.5.3
Pillow                  10.0.0
pip                     23.2.1
psutil                  5.9.5
pyarrow                 13.0.0
pynndescent             0.5.10
pyparsing               3.0.9
pyscenic                0.12.1
python-dateutil         2.8.2
pytz                    2023.3.post1
PyYAML                  6.0.1
requests                2.31.0
scanpy                  1.9.4
scikit-learn            1.3.0
scipy                   1.11.2
seaborn                 0.12.2
session-info            1.0.0
setuptools              68.1.2
six                     1.16.0
sklearn                 0.0.post7
sortedcontainers        2.4.0
statistics              1.0.3.5
statsmodels             0.14.0
stdlib-list             0.9.0
tabulate                0.9.0
tblib                   2.0.0
texttable               1.6.7
threadpoolctl           3.2.0
toolz                   0.12.0
tornado                 6.3.3
tqdm                    4.66.1
tzdata                  2023.3
umap-learn              0.5.3
urllib3                 2.0.4
wheel                   0.41.2
xyzservices             2023.7.0
yarl                    1.9.2
zict                    3.0.0
zipp                    3.16.2

Proposing some initial codes

!python -m pip install git+https://github.com/BIMSBbioinfo/ikarus.git

!pip install pyarrow==0.16.0

append prediction results to anndata object

So far, prediction results are returned as an array and optionally stored as CSV. It would be convenient to optionally append them to the obs section of the anndata object, too, e.g. by enabling an option in model.predict(...) via something like "append=True" or "to_adata=True".
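
Until such an option exists, the same effect can be approximated by hand. The sketch below uses a plain pandas DataFrame as a stand-in for adata.obs (which is a DataFrame). The function and column names are hypothetical, and it assumes the predictions come back in the same order as the cells in obs.

```python
import pandas as pd


def append_predictions(obs, predictions, column="ikarus_prediction"):
    """Return a copy of obs with the predictions added as a categorical column."""
    if len(predictions) != len(obs):
        raise ValueError("predictions and obs must have the same number of cells")
    result = obs.copy()
    result[column] = pd.Categorical(list(predictions))
    return result


# Stand-in for adata.obs with three cells.
obs = pd.DataFrame({"sample": ["s1", "s1", "s2"]},
                   index=["cell_1", "cell_2", "cell_3"])
annotated = append_predictions(obs, ["Tumor", "Normal", "Tumor"])
```

With a real AnnData object, the result could be assigned back with adata.obs = append_predictions(adata.obs, predictions).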

ContextualVersionConflict

!python -m pip install git+https://github.com/BIMSBbioinfo/ikarus.git
from ikarus import classifier, utils, data

ContextualVersionConflict: (pyarrow 6.0.1 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('pyarrow<0.17.0,>=0.11.1'), {'ctxcore'})
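
The message says that ctxcore requires pyarrow>=0.11.1,<0.17.0 while pyarrow 6.0.1 is installed. A quick way to see what an environment actually contains is the standard-library importlib.metadata module. The helper names below are hypothetical, and the sketch only reports versions; it does not resolve the conflict.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def declared_requirements(package):
    """Return the requirement strings a package declares, or an empty list."""
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []


for name in ("pyarrow", "ctxcore"):
    print(name, installed_version(name), declared_requirements(name))
```

Comparing the printed pyarrow version with the range ctxcore declares shows whether the environment satisfies it.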

test model load

test for whether the model can be loaded

NA value error?

Hi Jan,

Running into this error here:
Less than 80% of the genes in Normal are present in the expression matrix.
Less than 80% of the genes in Tumor are present in the expression matrix.
/juno/work/greenbaum/users/rahmanj/vardhana_lab/rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but LogisticRegression was fitted without feature names
warnings.warn(
Traceback (most recent call last):
File "ikarus_cell_classifier.py", line 21, in <module>
predictions = model.predict(adata, 'Impact_Rux')
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/ikarus/classifier.py", line 375, in predict
y_pred = self.core_model.predict(scores)
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 447, in predict
scores = self.decision_function(X)
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 429, in decision_function
X = self._validate_data(X, accept_sparse="csr", reset=False)
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/base.py", line 577, in _validate_data
X = check_array(X, input_name="X", **check_params)
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/utils/validation.py", line 899, in check_array
_assert_all_finite(
File "rux_b_cell_single_cell_analysis/rux/lib/python3.8/site-packages/sklearn/utils/validation.py", line 146, in _assert_all_finite
raise ValueError(msg_err)
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

pyscenic does not contain genesig

The latest release of pySCENIC no longer provides pyscenic.genesig:

aertslab/pySCENIC@a137cd1

This has moved to ctxcore.
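
Code that has to run against both older and newer pySCENIC releases can try the new location first and fall back to the old one. The helper below is a generic sketch with a hypothetical name; the two module paths in the usage comment are the ones named in this report.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Intended use (requires ctxcore or an older pySCENIC to be installed):
# genesig = import_first_available(["ctxcore.genesig", "pyscenic.genesig"])
# GeneSignature = genesig.GeneSignature
```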

KeyError: 'tier_0' when trying to plot results

Hello! Thank you for making Ikarus available! I managed to make an initial prediction using the pre-trained model but can't seem to properly visualize the results. This occurs when I try to print the classification metrics and show confusion matrix and UMAPs.

This line of code seems to give me the error: y = adata.obs.loc[:, "tier_0"]. After loading the data with adata = anndata.read_h5ad(path / "adata_umap.h5ad"), I do not see anything named tier_0.

I was wondering whether there are any thoughts on solving this issue.

Any help would be much appreciated!
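
The KeyError indicates that adata.obs has no column called tier_0. A defensive lookup makes that explicit and lists the columns that do exist. The sketch below uses a pandas DataFrame as a stand-in for adata.obs. The function name is hypothetical, and the reading that tier_0 holds reference annotations in the tutorial data (so metrics need an equivalent column in your own data) is an assumption.

```python
import pandas as pd


def reference_labels(obs, column="tier_0"):
    """Return the reference-label column if present, otherwise None."""
    if column not in obs.columns:
        print(f"'{column}' not found; available columns: {list(obs.columns)}")
        return None
    return obs[column]


# Stand-in for adata.obs without any reference annotation.
obs = pd.DataFrame({"n_genes": [1200, 900]}, index=["cell_1", "cell_2"])
labels = reference_labels(obs)
```

Without reference labels, classification metrics and a confusion matrix cannot be computed, but the predictions themselves can still be plotted.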

Key error- related to joblib version?

code:
import urllib.request
import anndata
import pandas as pd
from pathlib import Path
from ikarus import classifier, utils, data

signatures_path = Path("signatures.gmt")

model = classifier.Ikarus(signatures_gmt = signatures_path, out_dir = "ikarus_model")

core_model_path = Path("core_model.joblib")

model.load_core_model(core_model_path)

error:
File "ikarus_cell_classifier.py", line 13, in <module>
model.load_core_model(core_model_path)
File "/work/rux/lib/python3.8/site-packages/ikarus/classifier.py", line 486, in load_core_model
self.core_model = joblib.load(core_model_path)
File "/work/rux/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 587, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "/work/rux/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 506, in _unpickle
obj = unpickler.load()
File "/work/lib/python3.8/pickle.py", line 1210, in load
dispatch[key[0]](self)
KeyError: 10

I get the same error with Python 3.8.0 and 3.8.5. I suspect it has something to do with a discrepancy in joblib versions?
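
The last frame of the traceback is the opcode lookup in pickle.py, so KeyError: 10 means the unpickler read byte value 10 (a newline) where it expected a pickle opcode. One possible explanation, offered as a guess and not a diagnosis, is that core_model.joblib is not a model file at all, for example an HTML page or a Git LFS pointer saved under that name. The sketch below, with a hypothetical function name, inspects the first bytes of the file to tell these cases apart.

```python
def sniff_model_file(path, n_bytes=64):
    """Guess from the first bytes whether a file is a pickle or plain text."""
    with open(path, "rb") as handle:
        head = handle.read(n_bytes)
    if head.startswith(b"\x80"):
        return "pickle"  # protocol 2+ pickles start with the PROTO opcode 0x80
    stripped = head.lstrip()
    if stripped.startswith(b"<") or stripped.startswith(b"version https://git-lfs"):
        return "text"  # looks like HTML/XML or a Git LFS pointer, not a model
    return "unknown"  # e.g. a compressed joblib file; inspect further
```

If the result is "text", downloading the raw file again would be the first thing to try; a joblib version mismatch remains another possibility.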