dpeerlab / harmony

Harmony framework for connecting scRNA-seq data from discrete time points

License: GNU General Public License v2.0

Jupyter Notebook 94.06% Python 5.94%

scrna-seq scrna-seq-analysis mnn batch-correction developmental-trajecotries

harmony's Introduction

Harmony

Harmony is a unified framework for data visualization, analysis and interpretation of scRNA-seq data measured across discrete time points. Harmony constructs an augmented affinity matrix by augmenting the kNN graph affinity matrix with mutually nearest neighbors between successive time points. This augmented affinity matrix forms the basis for generating a force-directed layout for visualization and also serves as input for computing the diffusion operator, which can be used for trajectory detection using Palantir.
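To make the idea of mutually nearest neighbors concrete, here is a minimal sketch in plain Python. It is not Harmony's implementation: the function names, the toy 2-D coordinates and the use of Euclidean distance are assumptions made for illustration, and the kernel Harmony uses to turn neighbors into affinities is not shown. The sketch only finds pairs of cells from two successive time points that are in each other's k nearest neighbors.

```python
import math

def knn_indices(query, reference, k):
    """For each point in `query`, return the set of indices of its k nearest points in `reference`."""
    result = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_neighbors(cells_t1, cells_t2, k=1):
    """Return sorted (i, j) pairs where cell i of t1 and cell j of t2 are in each other's k nearest neighbors."""
    nn_12 = knn_indices(cells_t1, cells_t2, k)  # neighbors in t2 for every t1 cell
    nn_21 = knn_indices(cells_t2, cells_t1, k)  # neighbors in t1 for every t2 cell
    return sorted((i, j) for i, js in enumerate(nn_12) for j in js if i in nn_21[j])

# Hypothetical toy coordinates standing in for cells at two successive time points.
t1 = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
t2 = [(0.1, 0.0), (5.0, 5.2), (9.0, 9.0)]

pairs = mutual_nearest_neighbors(t1, t2, k=1)
print(pairs)
```

In this toy example the second cell of `t1` picks the first cell of `t2` as its nearest neighbor, but the choice is not reciprocated, so that pair is dropped. Only reciprocated pairs would be candidates for the extra edges between time points.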

Installation and dependencies

Harmony has been implemented in Python3 and can be installed using:
```
 $> pip install harmonyTS
 $> pip install palantir
```
Harmony depends on a number of Python 3 packages available on PyPI; these dependencies are listed in setup.py. All the dependencies will be installed automatically by the above commands.
To uninstall:
```
 $> pip uninstall harmonyTS
```
If you would like to determine gene expression trends, please install the R programming language and the R package GAM. You will also need to install the rpy2 module using
```
 $> pip install rpy2
```
To speed up the analysis of large datasets, you can run the main functions of this package on a CUDA GPU. To do so, please install rapids-0.17 as well as cupy>=9.0.

Usage

A tutorial on Harmony usage and results visualization for single cell RNA-seq data can be found in this notebook: http://nbviewer.jupyter.org/github/dpeerlab/Harmony/blob/master/notebooks/Harmony_sample_notebook.ipynb

The datasets generated as part of the manuscript and harmonized using Harmony are available for exploration at: endoderm-explorer.com

Citations

Harmony was used to harmonize datasets across multiple time points in our manuscript characterizing mouse gut endoderm development. This manuscript is available at Nature. If you use Harmony for your work, please cite our paper.

harmony's Issues

Pandas error in sample Harmony code

Hello all,

We're trying out the Harmony code in the sample notebook and it runs fine until it hits the line:
hvg_genes = harmony.utils.hvg_genes(norm_df)
where it throws the following error:

    Traceback (most recent call last):
        ...
        [snip]
        ...  
        File "/home/.../.local/lib/python3.6/site-packages/pandas/core/indexes/category.py", line 503, in reindex
        raise ValueError("cannot reindex with a non-unique indexer")
    ValueError: cannot reindex with a non-unique indexer

Has anyone seen this before? Alternatively, are there specific library versions we should be running against? Here are the versions of the underlying libraries that we're using:

Library	Version
Harmony	0.1
Palantir	0.2.1
Pandas	0.24.2
NumPy	1.16.3

We'd really love to use the software - so any ideas or advice is welcome.

Cheers,
Bill

RuntimeWarning: divide by zero encountered in true_divide dists = dists**2/(scaling_factors.values[rows]**2)

Dear authors,

Thanks for developing Harmony! It's really cool! I got the warning "RuntimeWarning: divide by zero encountered in true_divide" when following the standard pipeline, and the embedding looks a little weird.

Do you have any idea what the possible reasons are?
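The expression in the warning divides by the per-cell scaling factors, so the warning means at least one of them is zero. Why that happens in a given dataset is not stated here; one possibility (an assumption, not confirmed in this thread) is duplicated cells whose distance to the neighbor used for scaling is exactly zero. The sketch below uses made-up arrays, not Harmony's internals, to reproduce the effect and show a simple way to locate the affected entries.

```python
import numpy as np

# Hypothetical values: neighbor distances and the matching per-cell scaling factors.
dists = np.array([1.0, 2.0, 3.0])
scaling_factors = np.array([0.0, 1.0, 2.0])  # the zero is what triggers the warning

# Silence the warning here only so the snippet runs quietly; the result is unchanged.
with np.errstate(divide="ignore"):
    scaled = dists ** 2 / scaling_factors ** 2
print(scaled)  # the first entry is infinite

# Diagnostic: positions whose scaling factor is exactly zero.
zero_scale = np.flatnonzero(scaling_factors == 0)
print(zero_scale)
```

Infinite scaled distances translate into degenerate affinities, which could plausibly explain an odd-looking embedding. Checking the input for duplicated rows is a reasonable first step.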

PyPI release

hi, now that harmony is in scanpy.ext, it should be available from pip.

It’s no problem that there’s already something called “harmony”, you just have to make the distribution name (the one on PyPI that you can pip install) different from the module name (the one you import)

Meaning and function of the `n_components` parameter

Harmony has an n_components parameter which, according to the docstring, means:

:param pc_components: Minimum number of principal components to use. Specify `None` to use pre-computed components

That value is used for utils.run_pca, but it's not passed on to scanpy's neighbor computation; see

Harmony/src/harmony/core.py

Line 67 in eca0771

sc.pp.neighbors(temp, n_pcs=0, n_neighbors=n_neighbors)

So I wonder what the significance of that parameter actually is?

Also, I find the default value of 1000 a bit high, as scanpy's default here is much smaller, 50 I believe

can't import harmony

Hi, my name is Martin and I want to use Harmony for my data, but I have a small problem:

    import harmony
    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/usr/local/lib/python3.5/dist-packages/harmony/__init__.py", line 1, in <module>
        from . import core
      File "/usr/local/lib/python3.5/dist-packages/harmony/core.py", line 58
        print(f'Constucting affinities between {t1} and {t2}...')
                                                               ^

Can you help me to understand this error?
Do I have the right version of Python to use harmony and palantir?
thanks for your help
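The line flagged in the traceback uses an f-string, a syntax introduced in Python 3.6 (PEP 498), while the paths in the traceback point to a Python 3.5 installation. The small check below is a sketch, with a hypothetical helper name, that tests whether an interpreter version is new enough to parse such code.

```python
import sys

MIN_VERSION = (3, 6)  # f-strings were introduced in Python 3.6 (PEP 498)

def supports_fstrings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION

print(supports_fstrings((3, 5)))  # the interpreter version seen in the traceback
print(supports_fstrings())        # the interpreter running this snippet
```

Running the package under Python 3.6 or later should get past this particular error.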

Computational efficiency of KNN graph computation

I realised that the first step of harmony, the standard KNN graph computation (output: Nearest neighbor computation...) takes really long - around 10 to 20x longer than the same operation takes in scanpy. Would it be possible to use scanpy's KNN graph implementation in the first step of the algorithm? It would speed up things considerably.

I know that harmony uses a different kernel to get the graph connectivities/affinities, but I guess this could be implemented as a method in scanpy.pp.neighbors, besides umap, gauss and rapid.

Please help with the exact definition of iLISI & cLISI

Hello,

I read your wonderful paper "Fast, sensitive and accurate integration of single-cell data with Harmony".

Could you please provide some details about calculating iLISI and cLISI?

Thanks !!!

Don’t install things in setup.py

This is very bad behavior:

Harmony/setup.py

Line 19 in 531b4e8

call(['pip3', 'install', 'git+https://github.com/dpeerlab/Palantir.git'])

You have to specify it as a dependency in setup() instead.

Palantir should be on PyPI too, of course.
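As a rough sketch of what this issue and the PyPI issue above ask for, the fragment below declares dependencies as metadata instead of calling pip from setup.py, and gives the distribution a name different from the import name. The distribution name `harmonyTS`, the `src/harmony` layout, `palantir` and `scikit-learn` all appear elsewhere on this page; the variable name `SETUP_KWARGS` and the shortened dependency list are assumptions for illustration, not the project's actual setup.py.

```python
# Hypothetical fragment of a setup.py; the real dependency list is longer.
SETUP_KWARGS = dict(
    name="harmonyTS",              # distribution name: what `pip install` uses
    packages=["harmony"],          # module name: what `import harmony` uses
    package_dir={"": "src"},       # the sources live under src/harmony
    install_requires=[
        "palantir",                # declared as a dependency, not installed via a pip call
        "scikit-learn",            # the distribution is scikit-learn; the import is sklearn
    ],
)

# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
print(SETUP_KWARGS["install_requires"])
```

Declaring requirements this way lets pip resolve and install them itself, and keeps installation free of side effects.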

Use precomputed pca

Hi dear,
Great job with Harmony! Would it be possible to extend harmony.core.augmented_affinity_matrix so that the user can input precomputed reduced-dimension data (e.g. PCA, ICA, ...)?
Thanks

int16 not suitable for fluidigm data

I'm using harmony to analyse fluidigm data - although it's working brilliantly now I've had to alter the int16 datatype in utils.load_from_csvs as some of our gene counts are too high to be saved as int16.

The data didn't fail to load initially, just gave me negative counts in the matrix, so thought I would flag this up.

Thanks!

Anna
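The negative counts described here are what a silent integer wrap-around looks like: a 16-bit signed integer holds values from -32768 to 32767, and casting larger values wraps them. The sketch below reproduces this with NumPy on made-up counts and picks a wider dtype. The helper name is hypothetical and nothing here is taken from `utils.load_from_csvs` itself.

```python
import numpy as np

counts = np.array([120, 32767, 40000])  # hypothetical raw counts
as_int16 = counts.astype(np.int16)      # values above 32767 wrap around silently
print(as_int16)

def smallest_signed_dtype(values):
    """Return the narrowest signed integer dtype that can hold every value."""
    lo, hi = int(np.min(values)), int(np.max(values))
    for dtype in (np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise OverflowError("values do not fit in a 64-bit signed integer")

safe_dtype = smallest_signed_dtype(counts)
print(safe_dtype)
```

A check like `(matrix < 0).any()` after loading is a cheap way to detect that counts have wrapped.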

Harmony data pre-processing and time connection

Hi,

when I run harmony to load the data with:
counts = harmony.utils.load_from_csvs(csv_files, sample_names)

I found that values greater than 32767 were turned into negative values. I noticed that the default dtype is int16 (so the maximum allowed value is 32767). Should the dtype be changed to a larger one?

And if I have multiple time points, should the time connection data frame look like this?

       0  1
    0  0  1
    1  1  2
    2  2  3
    3  3  4
    4  4  5
    5  5  6
    6  6  7
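The layout in the question, where each row links one time point to the next, matches how the scanpy wrapper quoted in the "Ordering of the timepoints" issue builds its connections (`timepoints[:-1]` paired with `timepoints[1:]`). A minimal sketch of that pairing in plain Python follows. The helper name and the integer labels are assumptions, and whether the labels must equal the sample names passed to `load_from_csvs` is not confirmed here.

```python
def successive_connections(timepoints):
    """Pair each time point with the next one: [t0, t1, t2] -> [(t0, t1), (t1, t2)]."""
    return list(zip(timepoints[:-1], timepoints[1:]))

timepoints = list(range(8))  # hypothetical labels 0..7, already in chronological order
connections = successive_connections(timepoints)
print(connections)
```

Each tuple corresponds to one row of the two-column data frame, so eight time points give seven connections.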

load 10X data

Hi,

I'm trying to use Harmony on my own 10X scRNA-seq data. The test data is a CSV file, so I'm wondering how I can load 10X data into Harmony. Should I convert it to CSV, or is it possible to use output from scanpy?

Thanks,
Jphe

scikit-learn dependency

Hi all,

I just wanted to point out that this dependency should probably be scikit-learn.

Harmony/setup.py

Line 34 in eca0771

"sklearn",

Find dynamics genes in branches

Hi,

I have used Harmony on my data and it works quite well; it's a great tool. I have another question, which may be outside the scope of Harmony as it stands. I have already found the different branches in my data, and I would like to understand what drives them: which genes are dynamically expressed along one branch, and which genes are differentially expressed between branches. Does Harmony have a function for this, or do you have any suggestions for doing it with other tools? I would really appreciate your help.

Best regards,
Jphe

Unable to reproduce the result shown in the demo notebook

I recently downloaded the demo Harmony notebook just to see how it worked. I noticed that after running the palantir.plot.plot_palantir_results(), while the pseudotime looked correct the entropy graph looked a little bit off:

Essentially the area with the lowest value of pseudotime should agree with the area with the highest entropy (or differentiation potential). However, this is not the case in my run. It seemed that cells with the highest differentiation potential are the primitive endoderm cells at E3.5, not the epiblast cells at E3.5 as shown in the demo notebook.

Also, the imputed expression of FGF4 and GATA6 looked off:

Not sure what happened, but the only modification I have done to the notebook is to change the directory of the demo data provided alongside the package, and the force-directed layout did look promising. I would genuinely appreciate any advice on why this notebook produces weird results on my side...

Is Harmony deterministic?

I have run Harmony several times with the same input and gotten different results. Is Harmony expected to be deterministic? If so, I can try to work up a minimal example of what I am seeing. If not, then that's fine; I just want to know what to expect. Thanks!

Installation error

I use pip install harmonyTS and get the following error:

Building wheels for collected packages: fa2
Building wheel for fa2 (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [33 lines of output]
Installing fa2 package (fastest forceatlas2 python implementation)

  >>>> Cython is installed?
  Yes
  
  >>>> Starting to install!
  
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.9-x86_64-cpython-39
  creating build/lib.macosx-10.9-x86_64-cpython-39/fa2
  copying fa2/fa2util.py -> build/lib.macosx-10.9-x86_64-cpython-39/fa2
  copying fa2/__init__.py -> build/lib.macosx-10.9-x86_64-cpython-39/fa2
  copying fa2/forceatlas2.py -> build/lib.macosx-10.9-x86_64-cpython-39/fa2
  running egg_info
  writing fa2.egg-info/PKG-INFO
  writing dependency_links to fa2.egg-info/dependency_links.txt
  writing requirements to fa2.egg-info/requires.txt
  writing top-level names to fa2.egg-info/top_level.txt
  [03/11/24 17:51:10] ERROR    listing git files failed - pretending     git.py:24
                               there aren't any
  reading manifest file 'fa2.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'fa2.egg-info/SOURCES.txt'
  copying fa2/fa2util.c -> build/lib.macosx-10.9-x86_64-cpython-39/fa2
  copying fa2/fa2util.pxd -> build/lib.macosx-10.9-x86_64-cpython-39/fa2
  running build_ext
  Compiling fa2/fa2util.py because it changed.
  [1/1] Cythonizing fa2/fa2util.py
  building 'fa2.fa2util' extension
  error: unknown file type '.pxd' (from 'fa2/fa2util.pxd')
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for fa2
Running setup.py clean for fa2
Failed to build fa2
ERROR: Could not build wheels for fa2, which is required to install pyproject.toml-based projects

GPU accelerated package

Dear Harmony developers,

I really like your tool, as it is working great with my developmental data! To cut the waiting time when applying it to huge datasets, I converted the core as well as the ForceAtlas2 embedding generation functions to CUDA-accelerated ones here: https://github.com/LouisFaure/Harmony-GPU

If you think it would be worthwhile to integrate this into the original package, I could propose a pull request.

Cheers

Ordering of the timepoints

In my AnnData object, I have a categorical field adata.obs['day']. Calling adata.obs['day'].cat.categories yields

Index(['0', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
       '15', '21', '36'],
      dtype='object')

So the values are strings in the right order. However, when I call Harmony using the scanpy interface, the timepoint connections are created using

    timepoints = adata.obs[tp].unique().tolist()
    timepoint_connections = pd.DataFrame(np.array([timepoints[:-1], timepoints[1:]]).T)

which permutes my timepoints to a random order. To keep the order, I need to change this to

    timepoints = list(adata.obs[tp].cat.categories)
    timepoint_connections = pd.DataFrame(np.array([timepoints[:-1], timepoints[1:]]).T)

It would be very important, in my opinion, to document in the docstring the format that the time point annotation must have in order to produce the expected results. It would also be good to check the dtype of the passed .obs annotation and create the timepoint connections accordingly, as this is really critical.
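The difference between the two snippets above comes down to ordering: pandas `unique()` returns values in order of first appearance in the data, whereas `.cat.categories` returns the declared category order. The sketch below shows this on a made-up annotation. The series, the category list and the helper name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-cell annotation: the cells are not sorted by day.
raw = pd.Series(["10", "0", "2", "10", "2", "0"])
day = raw.astype(pd.CategoricalDtype(categories=["0", "2", "10"], ordered=True))

by_appearance = day.unique().tolist()   # order in which the values first occur
by_category = list(day.cat.categories)  # declared chronological order

def successive_connections(timepoints):
    """Pair each time point with the next one in the given order."""
    return list(zip(timepoints[:-1], timepoints[1:]))

print(successive_connections(by_appearance))
print(successive_connections(by_category))
```

Only the category-based order links day 0 to day 2 and day 2 to day 10. The appearance-based order depends on how the cells happen to be arranged in the AnnData object.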

Scanpy Implementation does not expose n_neighbors

The scanpy implementation computes the augmented affinity matrix via

    # compute the augmented and non-augmented affinity matrices
    aug_aff, aff = harmony.core.augmented_affinity_matrix(
        adata.to_df(), adata.obs[tp], timepoint_connections, pc_components=n_components,
    )

Unfortunately, it does not expose the n_neighbors parameter, which would be quite important to be able to choose.