Please note: at the moment this package is being actively developed and might not always be stable.
SCEPIA predicts transcription factor motif activity from single cell RNA-seq data. It uses computationally inferred epigenomes of single cells to identify transcription factors that determine cellular states. The regulatory inference is based on a two-step process:
- Single cells are matched to a combination of (bulk) reference H3K27ac ChIP-seq or ATAC-seq profiles.
- Using the H3K27ac ChIP-seq or ATAC-seq signal in enhancers associated with hypervariable genes, the TF motif activity is inferred.
Currently five different references are available, three for human and two for mouse. Different data sets may give different results, based on a) the type of data (H3K27ac ChIP-seq or ATAC-seq) and b) the different cell types being represented. While SCEPIA does not require exactly matching cell types to give good results, it does work best when relatively similar cell types are in the reference.
The following references can be used:
ENCODE.H3K27ac.human
- All H3K27ac experiments from ENCODE. Includes cell lines and tissues.
BLUEPRINT.H3K27ac.human
- All H3K27ac cell types from BLUEPRINT (mostly hematopoietic cell types).
Domcke.ATAC.fetal.human
- Fetal single cell-based ATAC-seq clusters from 15 different organs (Domcke et al., 2020).
Cusanovich.ATAC.adult.mouse
- ATAC-seq data of single cell-based clusters from 13 adult mouse tissues (Cusanovich et al., 2018).
ENCODE.H3K27ac.mouse
- All H3K27ac experiments from mouse ENCODE.
So sorry, but only human and mouse are supported for now. However, if you have data from other species, you can try it if gene names tend to match. Make sure you use gene names as identifiers, and scepia will run fine. In our (very limited) experience this can yield good results, but it relies on many assumptions about the conservation of regulatory interactions. If you have a large collection of ATAC-seq or ChIP-seq reference experiments available, you can also create your own reference with ScepiaDataset.create(). This is not well documented at the moment; let us know if you need help to do so.
You will need conda using the bioconda channel.
Make sure you have conda installed. If you have not used bioconda before, first set up the necessary channels (in this order!). You only have to do this once.
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
Now you can create an environment for scepia:
conda create -n scepia "scepia>=0.5.0"
# Note: if you want to use scepia in a Jupyter notebook, you also have to install the following packages: `ipywidgets nb_conda`.
conda activate scepia
You have to install the genomes that scepia uses through genomepy. The genomes that are used include hg38, hg19, mm10 and mm9, depending on the reference. For example, to install hg38:
$ conda activate scepia
$ genomepy install hg38
You only need to do this once for each genome.
Note: this is independent of which genome / annotation you used for your single cell RNA-seq!
Remember to activate the environment before using it:
conda activate scepia
The command line script scepia infer_motifs works on any file that is supported by scanpy.read(). We recommend processing your data, including QC, filtering, normalization and clustering, using scanpy. If you save the results to an .h5ad file, scepia can continue from your analysis to infer motif activity. However, the command line tool also works on formats such as CSV files or tab-separated files. In that case, scepia will run some basic pre-processing steps. To run scepia:
scepia infer_motifs <input_file> <output_dir>
A tutorial on how to use scepia interactively in Jupyter can be found here.
Single cell data should be loaded in an AnnData object. Make sure of the following:
- Gene names are used in adata.var_names, not Ensembl identifiers or any other gene identifiers.
- adata.raw stores the raw, log-transformed single cell expression data.
- The main adata object is filtered to contain only hypervariable genes.
- Louvain or Leiden clustering has been run.
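Some of these requirements can be checked before starting a long run. The sketch below is a hypothetical helper, not part of scepia. It assumes clustering results are stored under scanpy's default column names (louvain or leiden), and it does not check the hypervariable gene filter. It works on any object that exposes var_names, raw and obs like an AnnData object does, so simple stand-in objects are used here.

```python
import re
from types import SimpleNamespace

# Ensembl gene identifiers look like ENSG00000141510 or ENSMUSG00000059552.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d+")


def check_adata(adata):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper, not part of scepia. Works on any object that
    exposes .var_names, .raw and .obs the way an AnnData object does.
    """
    problems = []
    n_ensembl = sum(bool(ENSEMBL_GENE_ID.match(str(name))) for name in adata.var_names)
    if n_ensembl:
        problems.append(f"{n_ensembl} entries in var_names look like Ensembl identifiers")
    if adata.raw is None:
        problems.append("adata.raw is not set")
    # Assumption: clustering used scanpy's default key names.
    if "louvain" not in adata.obs and "leiden" not in adata.obs:
        problems.append("no 'louvain' or 'leiden' column in adata.obs")
    return problems


# Stand-in objects so the sketch runs without scanpy installed.
good = SimpleNamespace(var_names=["GATA1", "SPI1"], raw=object(), obs={"leiden": ["0", "1"]})
bad = SimpleNamespace(var_names=["ENSG00000102145"], raw=None, obs={})

print(check_adata(good))
print(check_adata(bad))
```

With a real AnnData object the call is the same: check_adata(adata).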
Once these requirements are met, infer_motifs() can be run to infer the TF motif activity. On the first run the reference data will be downloaded, so this will take somewhat longer.
from scepia.sc import infer_motifs
# load and preprocess single-cell data using scanpy
infer_motifs(adata, dataset="ENCODE.H3K27ac.human")
The resulting AnnData object can be saved and loaded as normal.
The approach to determine the enhancer-based regulatory potential (ERP) score per gene is based on the approach developed by Wang et al., 2016. There is one difference: in this approach the score is calculated based only on H3K27ac signal in enhancers. We use log-transformed, z-score normalized H3K27ac read counts in 2kb windows centered at enhancer locations. The ERP score is used to match single cell RNA-seq data to the reference H3K27ac profiles.
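As a rough illustration of the normalization described above, the sketch below builds a 2kb window around an enhancer center and applies a log transform followed by a z-score. It is not scepia's implementation. The function names are hypothetical, and the use of log(1 + x), the population standard deviation, and normalization across the enhancers of a single sample are all assumptions.

```python
import math
import statistics


def enhancer_window(center, width=2000):
    """Start and end of a window of `width` bp centered at `center`."""
    half = width // 2
    return max(0, center - half), center + half


def log_zscore(counts):
    """Log-transform read counts, then z-score normalize them."""
    # Assumption: log(1 + x) is used so that zero counts stay finite.
    logged = [math.log1p(c) for c in counts]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)  # zero if all counts are equal
    return [(value - mean) / sd for value in logged]


# Made-up read counts for four enhancer windows in one sample.
counts = [0, 12, 150, 37]
scores = log_zscore(counts)
print(enhancer_window(5000))
print([round(s, 2) for s in scores])
```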
To use it, an H3K27ac BAM file (mapped to hg38) is needed. The -N argument specifies the number of threads to use.
scepia area27 <bamfile> <outfile> -N 12
scepia's Issues
Error in infer_motifs()
Hi everyone, I am trying to use your package on a dataset I am working on.
When I try to compute:
adata = infer_motifs(adata, reference='ENCODE')
I get the following error, I suppose during the computation of the permutation-based p-values:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/anaconda3/envs/sc/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2888 try:
-> 2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'combined'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-12-f132522313ca> in <module>
----> 1 adata = infer_motifs(adata, dataset="ENCODE")
~/anaconda3/envs/sc/lib/python3.7/site-packages/scepia/sc.py in infer_motifs(adata, dataset, cluster, n_top_genes, max_cell_types, pfm, min_annotated, num_enhancers, maelstrom, indirect, n_sketch, n_permutations)
694
695 correlate_tf_motifs(
--> 696 adata, indirect=indirect, n_sketch=n_sketch, n_permutations=n_permutations
697 )
698
~/anaconda3/envs/sc/lib/python3.7/site-packages/scepia/sc.py in correlate_tf_motifs(adata, n_sketch, n_permutations, indirect)
831 )[1]
832
--> 833 f2m2["p_adj"] = multipletests(f2m2["combined"], method="fdr_bh")[1]
834 f2m2["-log10(p-value)"] = -np.log10(f2m2["p_adj"])
835
~/anaconda3/envs/sc/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2897 if self.columns.nlevels > 1:
2898 return self._getitem_multilevel(key)
-> 2899 indexer = self.columns.get_loc(key)
2900 if is_integer(indexer):
2901 indexer = [indexer]
~/anaconda3/envs/sc/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
-> 2891 raise KeyError(key) from err
2892
2893 if tolerance is not None:
KeyError: 'combined'
Could you please take a look at this, or point me towards a solution?
Downsample cells for motif correlation significance
Currently, the correlation step with shuffled motif activities takes relatively long with large cell numbers. This can probably be done with a smaller set of subsampled cells.
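A minimal sketch of the idea, assuming plain uniform subsampling is acceptable. The helper name is hypothetical, and scepia itself may use a different selection strategy.

```python
import random


def subsample_cells(cell_ids, n_cells, seed=0):
    """Pick at most n_cells cell identifiers uniformly at random."""
    cell_ids = list(cell_ids)
    if len(cell_ids) <= n_cells:
        return cell_ids
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(cell_ids, n_cells)


all_cells = [f"cell_{i}" for i in range(10_000)]
subset = subsample_cells(all_cells, n_cells=2_000)
print(len(subset))
```

The permutation-based correlations would then be computed on the subset only. Uniform sampling can under-represent rare cell types, which is one reason a smarter selection may be preferable.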
Enable addition of custom reference data.
It should be possible to add custom H3K27ac reference data to the default ENCODE reference.
- Input BAM file.
- Calculate coverage in reference regions followed by quantile normalization.
- Calculate regulatory potential and add to ENCODE table.
In addition, add saving / loading of these custom references.
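The quantile normalization step could look roughly like the sketch below, written with the standard library only. The function name is hypothetical, and breaking ties by position is a simplification rather than a statement about how scepia handles them.

```python
def quantile_normalize(samples):
    """Quantile-normalize equally long lists of coverage values.

    Each inner list holds one sample's coverage over the same regions.
    Ties are broken by position, which is a simplification.
    """
    n_regions = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # Mean across samples at each rank defines the shared target distribution.
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_regions)
    ]
    normalized = []
    for sample in samples:
        # Region indices ordered from lowest to highest coverage.
        order = sorted(range(n_regions), key=lambda i: sample[i])
        out = [0.0] * n_regions
        for rank, region in enumerate(order):
            out[region] = rank_means[rank]
        normalized.append(out)
    return normalized


# Made-up coverage for three regions in two samples.
coverage = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(coverage)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```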
Is it possible to supply my own H3K27ac data (in the form of alignment files)?
Thanks for your effort in creating this tool! I wonder if there's a straightforward way to supply my own H3K27ac BAM files to the algorithm, or must I process them to generate output files exactly like your example data? Any advice is appreciated. On a separate but similar note, is it possible for me to supply my own cell type labelling, and infer motifs in my labelled cell clusters?
Thank you in advance!
`inf` as probability of correlation between motif and tf
@okan-aydin got an inf value for a probability. This can be solved by setting n_permutations to a higher value.
Putting this here for future reference. Ideally a warning/error is raised.
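One way such a value can arise, sketched generically: if no permuted statistic is as extreme as the observed one, the empirical p-value is exactly 0, and -log10(0) evaluates to inf in numpy. How scepia computes its p-values is not shown here, so the function below is a hypothetical illustration. The pseudocount variant is a common way to keep the p-value above zero.

```python
import math
import random


def permutation_p_value(observed, null_values, pseudocount=True):
    """Fraction of permuted statistics at least as extreme as the observed one."""
    n_extreme = sum(abs(v) >= abs(observed) for v in null_values)
    if pseudocount:
        # Counts the observed value as one of the permutations, so p > 0.
        return (n_extreme + 1) / (len(null_values) + 1)
    return n_extreme / len(null_values)


rng = random.Random(42)
# Stand-in null distribution: correlations with shuffled motif activity.
null = [rng.uniform(-0.3, 0.3) for _ in range(100)]
observed = 0.8  # stronger than anything in the null

p_raw = permutation_p_value(observed, null, pseudocount=False)
p_adj = permutation_p_value(observed, null)
print(p_raw, p_adj)
print(-math.log10(p_adj))  # finite, unlike -log10 of p_raw
```

With the pseudocount the smallest possible p-value is 1 / (n_permutations + 1), so more permutations are still needed to resolve very strong correlations.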
tutorial, use `scanpy.datasets.pbmc3k_processed` instead of `scanpy.datasets.pbmc3k`
Just a suggestion. The preprocessing seems irrelevant to the actual tutorial.
cli param selections should be extended
e.g. which database to use
Flexibility in gene names
Add other options for gene identifiers, such as ensembl_id.
Add command line option
Command line should be able to read an .h5ad file, infer motifs and write an output .h5ad file with all motif properties.
error during installation
Hi Simon,
I was trying to install scepia with pip but I'm facing the following error:
pip install git+https://github.com/vanheeringen-lab/scepia.git
Error
Failed building wheel for sklearn-contrib-lightning
Running setup.py clean for sklearn-contrib-lightning
Successfully built scepia
Failed to build sklearn-contrib-lightning
...
...
error: invalid argument '-std=c99' not allowed with 'C++'
error: Command "g++ -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/venu/anaconda3/include -arch x86_64 -I/Users/venu/anaconda3/include -arch x86_64 -std=c99 -I/Users/venu/anaconda3/lib/python3.7/site-packages/numpy/core/include -I/private/var/folders/17/cwzq14vx6fb2m4kzc1ln2tsm0000gn/T/pip-install-_xwmsqoa/sklearn-contrib-lightning/lightning/impl/randomkit -I/Users/venu/anaconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/venu/anaconda3/include/python3.7m -c lightning/impl/adagrad_fast.cpp -o build/temp.macosx-10.7-x86_64-3.7/lightning/impl/adagrad_fast.o -MMD -MF build/temp.macosx-10.7-x86_64-3.7/lightning/impl/adagrad_fast.o.d" failed with exit status 1
Are there any additional dependencies I should be careful about?
Thank you.
Allow flexible index column name in adata.var
adata.var.index.name or adata.var_names.name can be other than None, as I noticed with a publicly available dataset.
SCEPIA expects this column to be called "index" (see code below where this throws an error), which would be the case with adata.var.index.name=None.
A solution could be to set adata.var.index.name to None within SCEPIA. If we change the index column name within SCEPIA, we should do the same for adata.raw. Alternatively, change the code so that it does not rely on the hardcoded "index".
Lines in sc.py where the index column name is taken from the adata object
unique_factors = my_adata.raw.var_names[detected].str.upper()
real = pd.DataFrame(
real,
index=unique_factors,
columns=my_adata.uns["scepia"]["motif_activity"].index,
)
Lines 612-617 in sc.py where the error is thrown
tmp = (
real.reset_index()
.melt(id_vars="index", var_name="motif", value_name="correlation")
.rename(columns={"index": "factor"})
.set_index(["motif", "factor"])
)
Throws a KeyError: 'index'
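A possible way to avoid the hardcoded "index", sketched on a toy DataFrame rather than on the actual code in sc.py: give the index a known name before calling reset_index(), so the melt no longer depends on whatever name the input carried. The function name and the example values are made up for illustration.

```python
import pandas as pd

# Toy correlation matrix: rows are factors, columns are motifs (made-up values).
real = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index(["GATA1", "SPI1"], name="gene_symbols"),  # named index, as in the report
    columns=["motif_a", "motif_b"],
)


def melt_correlations(df):
    """Long-format table indexed by (motif, factor), independent of the index name."""
    return (
        df.rename_axis("factor")  # force a known index name before reset_index()
        .reset_index()
        .melt(id_vars="factor", var_name="motif", value_name="correlation")
        .set_index(["motif", "factor"])
    )


tmp = melt_correlations(real)
print(tmp)
```

The same function also works when the index name is None, so no separate handling of that case is needed.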