netzoo / netzoopy

netZooPy is a network biology package implemented in Python.

Home Page: https://netzoo.github.io/

License: GNU General Public License v3.0

Python 43.81% Shell 0.02% Jupyter Notebook 56.17%

gene-regulatory-network transcription-factors

netzoopy's Introduction

netZooPy is tested on Ubuntu and macOS with Python 3.7, 3.8, 3.9, and 3.10.

Description

netZooPy is a Python package to reconstruct, analyse, and plot biological networks.

WARNING: on macOS arm64 architectures you might have to install pytables manually. Only the macos-13 Intel architecture is tested for the moment.

WARNING: the OTTER CLI and class still rely on a simple approach for reading and merging the inputs. Be careful if your data contain NAs. If you want something other than the intersection of W, P, and C, please rely on PANDA or on your own filtering.

Features

netZooPy currently integrates (gpu)PANDA, (gpu)LIONESS, (gpu)PUMA, SAMBAR, CONDOR, OTTER, DRAGON, COBRA, and BONOBO.

PANDA (Passing Attributes between Networks for Data Assimilation) [Glass et al. 2013]: PANDA is a method for estimating bipartite gene regulatory networks (GRNs) consisting of two types of nodes: transcription factors (TFs) and genes. An edge between TF $i$ and gene $j$ indicates that gene $j$ is regulated by TF $i$. The edge weight represents the strength of evidence for this regulatory relationship obtained by integrating three types of biological data: gene expression data, protein-protein interaction (PPI) data, and transcription factor binding motif (TFBM) data. PANDA is an iterative approach that begins with a seed GRN estimated from TFBMs and uses message passing between data types to refine the seed network to a final GRN that is consistent with the information contained in gene expression, PPI, and TFBM data.
PUMA (PANDA Using MicroRNA Associations) [Kuijjer et al.] extends the PANDA framework to model how microRNAs (miRNAs) participate in gene regulatory networks. PUMA networks are bipartite networks that consist of a regulatory layer and a layer of genes being regulated, similar to PANDA networks. While the regulatory layer of PANDA networks consists only of transcription factors (TFs), the regulatory layer of PUMA networks consists of both TFs and miRNAs. A PUMA network is seeded using a combination of input data sources such as motif scans or ChIP-seq data (for TF-gene edges) and an miRNA target prediction tool such as TargetScan or miRanda (for miRNA-gene edges). PUMA uses a message passing framework similar to PANDA to integrate this prior information with gene-gene coexpression and protein-protein interactions to estimate a final regulatory network incorporating miRNAs. Kuijjer and colleagues [7] apply PUMA to 38 GTEx tissues and demonstrate that PUMA can identify important patterns in tissue-specific regulation of genes by miRNA.
CONDOR (COmplex Network Description Of Regulators) [Platig et al. 2016]: CONDOR is a tool for community detection in bipartite networks. Many community detection methods for unipartite networks are based on the concept of maximizing a modularity metric that compares the weight of edges within communities to the weight of edges between communities, prioritizing community assignments with higher values of the former relative to the latter. CONDOR extends this concept to bipartite networks by optimizing a bipartite version of modularity defined by [Barber (2007)]. To enable bipartite community detection on large networks such as gene regulatory networks, CONDOR uses a fast unipartite modularity maximization method on one of the two unipartite projections of the bipartite network. In Platig et al. (2016), CONDOR is applied to bipartite networks of single nucleotide polymorphisms (SNPs) and gene expression, where a network edge from a SNP node to a gene node is indicative of an association between the SNP and the gene expression level, commonly known as an expression quantitative trait locus (eQTL). Communities detected with CONDOR contained local hub nodes ("core SNPs") enriched for association with disease, suggesting that functional eQTL relationships are encoded at the community level.
LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) [Kuijjer et al. 2019]: LIONESS is a flexible method for single-sample network integration. The machinery behind LIONESS is a leave-one-out approach. To construct a single-sample network for sample $i$, a first network is estimated on the full dataset and a second network is estimated on the dataset with sample $i$ withheld. The single-sample network is then estimated based on the difference between these two networks. Any method that can be used to estimate a network can be used with LIONESS to estimate single-sample networks. Two common use cases are the use of LIONESS to generate single-sample GRNs based on PANDA and the use of LIONESS to generate single-sample Pearson correlation networks.
SAMBAR (Subtyping Agglomerated Mutations By Annotation Relations) [Kuijjer et al.]: SAMBAR is a tool for studying cancer subtypes based on patterns of somatic mutations in curated biological pathways. Rather than characterize cancer according to mutations at the gene level, SAMBAR agglomerates mutations within pathways to define a pathway mutation score. To avoid bias based on pathway representation, these pathway mutation scores correct for the number of genes in each pathway as well as the number of times each gene is represented in the universe of pathways. By taking a pathway rather than gene-by-gene lens, SAMBAR both de-sparsifies somatic mutation data and incorporates important prior biological knowledge. Kuijjer et al. (2018) demonstrate that SAMBAR is capable of outperforming other methods for cancer subtyping, producing subtypes with greater between-subtype distances; the authors use SAMBAR for a pan-cancer subtyping analysis that identifies four diverse pan-cancer subtypes linked to distinct molecular processes.
OTTER (Optimization to Estimate Regulation) [Weighill et al.]: OTTER is a GRN inference method based on the idea that observed biological data (PPI data and gene co-expression data) are projections of a bipartite GRN between TFs and genes. Specifically, PPI data represent the projection of the GRN onto the TF-TF space and gene co-expression data represent the projection of the GRN onto the gene-gene space. OTTER reframes the problem of GRN inference as a problem of relaxed graph matching and finds a GRN that has optimal agreement with the observed PPI and coexpression data. The OTTER objective function is tunable in two ways: first, one can prioritize matching the PPI data or the coexpression data more heavily depending on one's confidence in the data source; second, there is a regularization parameter that can be applied to induce sparsity on the estimated GRN. The OTTER objective function can be solved using spectral decomposition techniques and gradient descent; the latter is shown to be closely related to the PANDA message-passing approach (Glass et al. 2013).

DRAGON (Determining Regulatory Associations using Graphical models on Omics Networks) [Shutta et al.] is a method for estimating multiomic Gaussian graphical models (GGMs, also known as partial correlation networks) that incorporate two different omics data types. DRAGON builds off of the popular covariance shrinkage method of Ledoit and Wolf with an optimization approach that explicitly accounts for the differences in two separate omics "layers" in the shrinkage estimator. The resulting sparse covariance matrix is then inverted to obtain a precision matrix estimate and a corresponding GGM. Although GGMs assume normally distributed data, DRAGON can be used on any type of continuous data by transforming data to approximate normality prior to network estimation. Currently, DRAGON can be applied to estimate networks with two different types of omics data. Investigators interested in applying DRAGON to more than two types of omics data can consider estimating pairwise networks and "chaining" them together.
COBRA (Co-expression Batch Reduction Adjustment). Batch effects and other covariates are known to induce spurious associations in co-expression networks and confound differential gene expression analyses. These effects are corrected for using various methods prior to downstream analyses such as the inference of co-expression networks and computing differences between them. In differential co-expression analysis, the pairwise joint distribution of genes is considered rather than independently analyzing the distribution of expression levels for each individual gene. Computing co-expression matrices after standard batch correction on gene expression data is not sufficient to account for the possibility of batch-induced changes in the correlation between genes as existing batch correction methods act solely on the marginal distribution of each gene. Consequently, uncorrected, artifactual differential co-expression can skew the correlation structure such that network-based methods that use gene co-expression can produce false, nonbiological associations even using data corrected using standard batch correction. Co-expression Batch Reduction Adjustment (COBRA) addresses this question by computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix. COBRA estimates a reduced set of parameters that express the co-expression matrix as a function of the sample covariates and can be used to control for continuous and categorical covariates. The method is computationally fast and makes use of the inherently modular structure of genomic data to estimate accurate gene regulatory associations and enable functional analysis for high-dimensional genomic data.
BONOBO (Bayesian Optimized Networks Obtained By assimilating Omics data) is a scalable Bayesian model for deriving individual sample-specific co-expression networks by recognizing variations in molecular interactions across individuals. For every sample, BONOBO assumes a Gaussian distribution on the log-transformed centered gene expression and a conjugate prior distribution on the sample-specific co-expression matrix constructed from all other samples in the data. Combining the sample-specific gene expression with the prior distribution, BONOBO yields a closed-form solution for the posterior distribution of the sample-specific co-expression matrices.
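
The leave-one-out idea behind LIONESS described above can be sketched in a few lines. The example below is a minimal illustration in pure Python and is not netZooPy's implementation: it uses Pearson correlation as the aggregate network and applies the linear interpolation from Kuijjer et al. (2019), e(q) = N * (e(all) - e(without q)) + e(without q), where N is the number of samples. The function names and the toy expression values are assumptions made for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(expr):
    """expr maps gene -> list of values (one per sample); returns edge -> weight."""
    genes = sorted(expr)
    return {
        (g1, g2): pearson(expr[g1], expr[g2])
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
    }

def lioness_network(expr, q):
    """Single-sample network for sample index q via leave-one-out interpolation."""
    n = len(next(iter(expr.values())))            # number of samples
    full = correlation_network(expr)              # network on all samples
    without_q = correlation_network(              # network with sample q withheld
        {g: v[:q] + v[q + 1:] for g, v in expr.items()}
    )
    return {e: n * (full[e] - without_q[e]) + without_q[e] for e in full}

# Toy data (hypothetical): 3 genes measured in 5 samples.
expr = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 6.0],
    "C": [5.0, 3.0, 4.0, 1.0, 2.0],
}
sample0_network = lioness_network(expr, 0)
```

Any other network estimator could replace correlation_network here, which is the sense in which LIONESS is agnostic to the underlying method.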

Quick guide

Clone the repository into your local disk:

git clone https://github.com/netZoo/netZooPy.git

Then install netZooPy through pip:

cd netZooPy
pip3 install -e .

Upon completion you can load netZooPy in your python code through

import netZooPy

Conda installation

On anaconda.org you will find the conda recipes for all platforms. We recommend using conda environments to keep your analyses self-contained and reproducible.

To install netzoopy through conda:

conda install -c netzoo -c conda-forge netzoopy

User guide

Please refer to the documentation website for installation instructions and usage.

License

The software is free and is licensed under the GNU General Public License v3.0, see the file LICENSE for details.

Feedback/Issues

Please report any issues to the issues page.

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Latest version: 0.10.6

netzoopy's Issues

DRAGON- Ambiguous exception thrown during calculation of p-values

The estimate_p_values_dragon function is raising an exception stating "f(a) and f(b) must have different signs"
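
For context, this message is the one raised by bracketing root finders such as scipy.optimize.brentq when the function has the same sign at both ends of the search interval. Whether that is the call involved in this issue is an assumption. The sketch below is a plain bisection written for illustration (bisect_root is a hypothetical helper, not part of netZooPy or SciPy) that shows where such a check sits and how it fails.

```python
def bisect_root(f, a, b, tol=1e-10, max_iter=200):
    """Find a root of f in [a, b] by bisection; requires a sign change on [a, b]."""
    fa, fb = f(a), f(b)
    if fa == 0:
        return a
    if fb == 0:
        return b
    if fa * fb > 0:
        # Same sign at both ends: the interval does not bracket a root.
        raise ValueError("f(a) and f(b) must have different signs")
    for _ in range(max_iter):
        mid = (a + b) / 2
        fm = f(mid)
        if fm == 0 or (b - a) / 2 < tol:
            return mid
        if fa * fm < 0:
            b, fb = mid, fm      # root lies in the left half
        else:
            a, fa = mid, fm      # root lies in the right half
    return (a + b) / 2

# A bracket that contains the root of x^2 - 2 works ...
root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)

# ... while a bracket that misses it raises the error quoted in the issue.
try:
    bisect_root(lambda x: x * x - 2, 2.0, 3.0)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Checking the sign of the function at both interval ends before calling the solver, and reporting the offending values, would make such a failure easier to interpret.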

Please refer to igraph instead of python-igraph

Please refer to igraph instead of python-igraph in the following location:

https://github.com/netZoo/netZooPy/blob/master/requirements.txt#L6

and remove this entry, since igraph is already included:

https://github.com/netZoo/netZooPy/blob/master/setup.py#L16

Please see igraph/python-igraph#699 for an explanation.

Additionally, the igraph-related troubleshooting section has been obsolete for a while and can be removed. Binary wheels are made available for all common platforms.

https://github.com/netZoo/netZooPy/blob/master/docs/install/index.md?plain=1#L36

Finally, I would strongly recommend not constraining the igraph version to before 0.10, as you are missing out on many bugfixes. I see that this was done to ensure that some functions keep returning the same cluster assignment. I am not sure which functions you are referring to, presumably some randomized ones. Keep in mind that generally there is no single correct cluster assignment, which is precisely why many of the algorithms are randomized. To get an accurate picture about clustering, you need to run the algorithm multiple times, and see if the result is reasonably stable. Picking just one result is not a scientifically solid decision ...
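
One way to act on the suggestion above is to compare the partitions from repeated runs with a label-independent score. The sketch below computes the Rand index between cluster assignments and averages it over all pairs of runs. It uses only the standard library, the run results are made-up toy data, and rand_index and mean_pairwise_agreement are hypothetical helpers rather than igraph or netZooPy functions.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two partitions agree (same vs. different cluster)."""
    pairs = list(combinations(sorted(labels_a), 2))
    agree = sum(
        (labels_a[u] == labels_a[v]) == (labels_b[u] == labels_b[v])
        for u, v in pairs
    )
    return agree / len(pairs)

def mean_pairwise_agreement(runs):
    """Average Rand index over all pairs of runs; values near 1 indicate stable clustering."""
    scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)

# Toy results of three runs of a randomized algorithm (node -> cluster label).
runs = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": 1, "g2": 1, "g3": 0, "g4": 0},  # same partition, labels swapped
    {"g1": 0, "g2": 0, "g3": 0, "g4": 1},  # g3 assigned to the other cluster
]
stability = mean_pairwise_agreement(runs)
```

Because the score ignores label names, a unit test built on it can tolerate relabeling between runs while still detecting real changes in the partition.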

no --with-header option in PANDA

Hello, I thought PANDA would have a --with-header option to be able to import expression data with a header (sample names). This feature would be great. Thank you!

Update CONDOR unit testing to account for stochasticity

Follow up on Issue #312

Reproducibility netzoo python/matlab

Using the 0.9.11 version of netzoopy (conda), I still see marginal differences between the results I obtain from matlab and netzoopy for PANDA and LIONESS. This may be due to different default ways of including genes that are absent from the prior but present in the expression matrix. Overall, the agreement is very good for both PANDA and LIONESS: Spearman's correlation coefficients between matlab and netzoopy results are above 0.96 for my dataset, which contains 66 samples, about 106 TFBS, and 32,165 genes.
When I compare matlab and netzoopy results for PANDA I have:
max(abs(network1-network2))=1.05436 (but 3rd quartile is 0.021781)
And the same for LIONESS:
max(abs(network1-network2))=31.80962 (but 3rd quartile is 0.05113)
So these are really just a few edge cases.
I can send the data through fileShare, but they are too heavy (and confidential) to be uploaded here :-).
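
The comparison reported above (maximum absolute difference versus third quartile) can be reproduced with a short helper. This is a minimal sketch using only the standard library. compare_networks, the threshold value, and the toy edge weights are assumptions for illustration, and both networks are assumed to be flattened in the same edge order.

```python
from statistics import quantiles

def compare_networks(net1, net2, threshold=0.1):
    """Summarize absolute edge-weight differences between two flattened networks."""
    if len(net1) != len(net2):
        raise ValueError("networks must have the same number of edges")
    diffs = [abs(a - b) for a, b in zip(net1, net2)]
    _, median, q3 = quantiles(diffs, n=4)   # quartile cut points
    return {
        "max": max(diffs),
        "median": median,
        "q3": q3,
        # number of edges whose difference exceeds the threshold
        "n_above": sum(d > threshold for d in diffs),
    }

# Toy example: most edges agree closely, one edge differs strongly.
matlab_net = [0.0, 1.0, 2.0, 3.0, 4.0]
python_net = [0.0, 1.01, 2.0, 2.98, 5.0]
summary = compare_networks(matlab_net, python_net)
```

Reporting the count of edges above a threshold alongside the maximum helps distinguish a handful of outlier edges from a systematic shift.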

-save_memory not very informative as a flag

-save_memory returns the adjacency matrix instead of the full edge list. So it saves space on your disk, but not memory. Wouldn't it be more constructive to rename this flag to --output_adj_matrix ?
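
To make the two output layouts concrete, the sketch below converts a small TF-by-gene adjacency matrix into the equivalent edge list. It is an illustration only: adjacency_to_edge_list and the toy values are hypothetical and do not reproduce netZooPy's own export code.

```python
def adjacency_to_edge_list(matrix, tfs, genes):
    """Expand a TF x gene weight matrix into (tf, gene, weight) rows."""
    return [
        (tf, gene, matrix[i][j])
        for i, tf in enumerate(tfs)
        for j, gene in enumerate(genes)
    ]

tfs = ["TF1", "TF2"]
genes = ["geneA", "geneB", "geneC"]
adjacency = [
    [0.5, -1.2, 0.0],
    [2.1, 0.3, -0.7],
]
edges = adjacency_to_edge_list(adjacency, tfs, genes)
```

The edge list repeats the TF and gene names on every row, which is why it takes more space on disk than the matrix form of the same network.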

netZooPy is an R package to reconstruct, analyse, and plot biological networks.

netzoopy is a python package

Error using netZooPY and PyPanda with toydata

Hi,

I have anaconda, python 3.7, and windows 10.

I installed netzoopy through conda:

git clone https://github.com/netZooPy/netZooPy.git
cd netZooPy
py -3 setup.py install

I followed the vignette to test the library:
https://github.com/netZoo/netZooPy/blob/master/UserGuide.md

I used py -3 instead of python3, since python3 was not recognized (but to my knowledge they should be the same).

from pypanda.panda import Panda
from pypuma.puma import Puma
from pypanda.lioness import Lioness

I encountered an error trying to import Puma:

ModuleNotFoundError: No module named 'pypuma'

panda and Lioness were imported without issues.

then:
panda_obj = Panda('../../tests/ToyData/ToyExpressionData.txt', '../../tests/ToyData/ToyMotifData.txt', '../../tests/ToyData/ToyPPIData.txt', remove_missing=False)

raised errors about the file location.
I updated the paths:

expression_data='netZooPy/tests/ToyData/ToyExpressionData.txt'
motif_data='netZooPy/tests/ToyData/ToyMotifData.txt'
ppi_data='netZooPy/tests/ToyData/ToyPPIData.txt'

panda_obj = Panda(expression_data, motif_data, ppi_data, remove_missing=False)

this is resulting in the following error:

File "", line 1, in
panda_obj = Panda(expression_data, motif_data, ppi_data, remove_missing=False)

File "C:\Users\Bio03\Anaconda3\lib\site-packages\pypanda-0.1-py3.7.egg\pypanda\panda.py", line 31, in __init__
self.__motif_data_to_matrix()

File "C:\Users\Bio03\Anaconda3\lib\site-packages\pypanda-0.1-py3.7.egg\pypanda\panda.py", line 83, in __motif_data_to_matrix
idx = np.ravel_multi_index((idx_tfs, idx_genes), self.motif_matrix.shape)

TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

I read online that this error can arise when the files are not in the correct format, but I can load the individual files with pd.read_csv.

How can I solve it?

Thank you very much for your help

Add sphinx_rtd_theme to yaml file

Read the docs returns: Could not import extension sphinx_rtd_theme.

lioness puma variable copy

Hi @violafanfani ,

I believe the variable copy issue exists in PUMA as well,
https://github.com/netZoo/netZooPy/blob/master/netZooPy/lioness/lioness_for_puma.py#L59

missing click dependency

Conda installation misses click as dependency

Puma object has no attribute

I am using my own mRNA data and miR->gene target priors from the PUMA paper (https://zenodo.org/record/1313768). My mRNA data consists of a matrix of genes (HUGO gene symbols) in the rows and normalized, log2-transformed counts for each individual subject in the columns. There is no header in the mRNA expression data file. The mRNA data was filtered to include only genes that are also in the miR->gene target file and vice versa because I was getting the error below, and Kimbie suggested filtering the mRNA and miR->gene target file to make sure they completely intersected. However, I'm still getting the same error below.

(netZooPy-ENV) sombrero07<09:22:04> /udd/reawa/netZooPy/netZooPy/puma/run_puma.py -e /udd/reawa/VDAART_atopic_march/PUMA/cbmrna_filtered.txt -i /udd/reawa/VDAART_atopic_march/PUMA/TargetScanPrior_cb_filtered.txt -o /udd/reawa/VDAART_atopic_march/PUMA/puma_cb.txt -q /udd/reawa/VDAART_atopic_march/PUMA/output_lioness_cb.txt
Input data:
Expression: /udd/reawa/VDAART_atopic_march/PUMA/cbmrna_filtered.txt
Motif data: None
PPI data: None
miR file: /udd/reawa/VDAART_atopic_march/PUMA/TargetScanPrior_cb_filtered.txt
Start Puma run ...
Loading expression data ...
Elapsed time: 1.16 sec.
Duplicate gene symbols detected. Consider averaging before running PANDA
No PPI data given: ppi matrix will be an identity matrix of size 0
Calculating coexpression network ...
Elapsed time: 2.60 sec.
Returning the correlation matrix of expression data in <Panda_obj>.correlation_matrix
Traceback (most recent call last):
File "/udd/reawa/netZooPy/netZooPy/puma/run_puma.py", line 89, in <module>
sys.exit(main(sys.argv[1:]))
File "/udd/reawa/netZooPy/netZooPy/puma/run_puma.py", line 76, in main
puma_obj = Puma(expression_data, motif, ppi, miR, save_tmp=True, remove_missing=rm_missing, keep_expression_matrix=bool(lioness_file))
File "/udd/reawa/netZooPy/netZooPy/puma/puma.py", line 78, in __init__
Panda.processData(self, modeProcess, motif_file, expression_file, ppi_file, remove_missing, keep_expression_matrix)
File "/udd/reawa/netZooPy/netZooPy/panda/panda.py", line 347, in processData
self.__pearson_results_data_frame()
AttributeError: 'Puma' object has no attribute '_Panda__pearson_results_data_frame'
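
The attribute name in this traceback, _Panda__pearson_results_data_frame, is the result of Python's name mangling of double-underscore attributes inside a class body. The sketch below reproduces the same kind of error with generic classes. Base, Child, and Unrelated are hypothetical stand-ins, and the example shows the mechanism only; it is not a diagnosis of this issue.

```python
class Base:
    def process(self):
        # Inside class Base, `self.__helper` is compiled as `self._Base__helper`.
        return self.__helper()

    def __helper(self):
        return "ok"

class Child(Base):
    """Inherits `_Base__helper`, so the mangled lookup succeeds."""

class Unrelated:
    """Does not inherit from Base, so it has no `_Base__helper` attribute."""

def call_process(obj):
    """Call Base.process on obj and return either the result or the error text."""
    try:
        return Base.process(obj)
    except AttributeError as err:
        return str(err)

child_result = call_process(Child())
unrelated_result = call_process(Unrelated())
```

The lookup fails whenever the object passed as self lacks the mangled attribute, for example because its class does not inherit from the class that defines the method, or because the private method was renamed or removed while a call to it remained.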

PANDA does not take both Paths and Dataframes

We ran into an issue when running PANDA and discovered that it cannot accept a mix of paths and data frames as inputs (for example, PPI as a path and expression as a data frame).

LionessPuma issue: compute_puma() missing 1 required positional argument: 'sorted_index'

from netZooPy.panda.panda import Panda
from netZooPy.puma.puma import Puma
from netZooPy.lioness.lioness import Lioness
from netZooPy.lioness.lioness_for_puma import LionessPuma
import numpy as np  # np.ndarray is referenced in run_lioness below
import pandas as pd
import os
from io import StringIO 
import sys

class Capturing(list):
    def __enter__(self):
        self._stdout = sys.stdout
        sys.stdout = self._stringio = StringIO()
        return self
    def __exit__(self, *args):
        self.extend(self._stringio.getvalue().splitlines())
        del self._stringio    # free up some memory
        sys.stdout = self._stdout
def read_expression_file(filepath = '',header = 'infer',server = 'PANDA',sample_mode = ''):
    header_float= False
    df = pd.read_csv(filepath,delimiter = ',',index_col = 0,header = 'infer')
    if server == 'LIONESS':
        if sample_mode == 'sample_name':
            return df
        elif sample_mode == "sample_num":
            try:
                header_int = [int(col) for col in  df.columns]
                header_float = [float(col) for col in df.columns]
                sample_list = list(range(len(df.columns)))
                if sample_list == header_int or sample_list == list(df.columns):
                    return df
                df = pd.read_csv(filepath,index_col =0 ,sep = ',', header = None)
                return df
            except:
                return df

    try:
        header_float = [float(col) for col in df.columns]
        header_float = True
    except:
        header_float = False
    if header_float:
        df = pd.read_csv(filepath,index_col =0 ,sep = ',', header = None)
    return df
def create_panda_obj(dataframes = {} , filepaths = {},alpha = 0.1,mode = 'union',precision = "double"):
    #print("panda_input_filepaths: ",filepaths)
    panda_obj = None
    filepaths_values = list(filepaths.values())
    if isinstance(filepaths_values,list):
        if len(filepaths_values) == 3:
            if filepaths_values[0] == "" or filepaths_values[1] == "" or filepaths_values[2] == "":
                filepaths_values = []
        elif len(filepaths_values) < 3:
            filepaths_values = []

    else:
        filepaths_values = []
    if len(dataframes.keys()) > 0:
        panda_obj = Panda(dataframes['expression'],dataframes['motif'],dataframes['ppi'],modeProcess = mode, alpha = alpha,save_memory=False,precision=precision,remove_missing=False, keep_expression_matrix=True)
    elif len(filepaths_values)>0:
        panda_obj = Panda(filepaths["expression"],filepaths["motif"],filepaths["ppi"],modeProcess = mode, alpha = alpha,save_memory=False,precision=precision,remove_missing=False, keep_expression_matrix=True)
    #print("panda_obj: ",panda_obj)
    return panda_obj
def create_puma_obj(dataframes = {} , filepaths = {},alpha = 0.1, mode = "union",precision = "double",keep_expression_matrix= True):
    #print("#######creating puma obj#####")
    puma_obj = None
    all_files_uploaded = True
    #print(filepaths)
    if len(filepaths.keys()) > 0:
        puma_obj = Puma(filepaths["expression"],filepaths["motif"],None,filepaths["mir"],save_memory=False,alpha = alpha, modeProcess = mode,precision = precision,keep_expression_matrix=keep_expression_matrix,save_tmp=False)
       
    if len(dataframes.keys()) > 0:
        puma_obj =  Puma(dataframes['expression'],dataframes['motif'],None,dataframes['mir'],alpha = alpha, modeProcess = mode,precision = precision,keep_expression_matrix=keep_expression_matrix,save_tmp=False)
    
    return puma_obj 
def run_lioness(dataframes = {},input_mode = 'PANDA',alpha = 0.1,mode = 'union',precision='double',start = 0 ,end = 0):
    t = type('test', (object,), {})()
    cwd = os.getcwd()
    output = []
    lioness_results_folder = os.path.join(cwd,"results","lioness")
    if not os.path.isdir(lioness_results_folder):
        os.makedirs(lioness_results_folder)
    
    try:
        if input_mode == "PANDA":
            
            with Capturing() as output:
                try:
                    panda_obj = create_panda_obj(dataframes= dataframes,alpha = float(alpha), mode = mode,precision=precision,remove_missing=False, keep_expression_matrix= True)
                except:
                    return {'status': 'failed', 'reason': 'PANDA_ANALYSIS_ERROR'}

                print("panda_obj:", panda_obj)
                try:
                    lioness_obj = Lioness(panda_obj,save_dir = lioness_results_folder,start = start,end= end,save_fmt = "aaa",alpha = float(alpha),precision=precision)
                except:
                    return {'status': 'failed', 'reason': 'LIONESS_ANALYSIS_ERROR'}

                try:
                    Panda.processData(self=t,modeProcess=mode, motif_file=dataframes["motif"], expression_file=dataframes["expression"], ppi_file=dataframes["ppi"], remove_missing=False, keep_expression_matrix=False)
                except:
                    return {'status': 'failed', 'reason': 'PANDA_ANALYSIS_ERROR'}

                W = t.motif_matrix_unnormalized
                tfs = t.unique_tfs
                genes = t.gene_names
                #print("####W######")
                # #print(W)
                # #print(W.shape)
                W_df = pd.DataFrame(W,index = tfs,columns = genes)
                W_df= W_df.stack().reset_index().rename(columns={'level_0':'tf','level_1':'gene', 0:'motif'})
                lioness_obj_df = pd.DataFrame(lioness_obj.export_lioness_results)
                print(lioness_obj_df)
                col_names =  ["tf","gene","force"]
                lioness_obj_df.columns = col_names
                lioness_obj_df["motif"] = W_df['motif']
                cols_order = ["tf","gene","motif","force"]
                lioness_obj_df = lioness_obj_df[cols_order]
                adj_matrix = create_adj_matrix_from_rows_df(df = lioness_obj_df,gene_names= genes, unique_tfs= tfs,server_name = "lioness")
                #print("after:",adj_matrix.shape)
                #print("###adj_matrix_targeting######")
                #print(adj_matrix)
                lioness_adj_mat_df = pd.DataFrame(adj_matrix,index = tfs,columns = genes)
        elif input_mode == "PUMA":
            #puma.py require mir to be a filepath
            
            with Capturing() as output:
                puma_obj = create_puma_obj(dataframes = dataframes,alpha = alpha, mode = mode,precision = precision,keep_expression_matrix=True)
                #print("puma_obj:",puma_obj)
                lioness_obj = LionessPuma(puma_obj,save_dir = lioness_results_folder,start = start,end= end,save_fmt = "aaa",alpha = float(alpha),precision=precision, )
                #print("lkionessPuma:",lioness_obj)
                Panda.processData(self=t,modeProcess=mode, motif_file=dataframes["motif"], expression_file=dataframes["expression"], ppi_file=None, remove_missing=False, keep_expression_matrix=False)
            W = t.motif_matrix_unnormalized
            tfs = t.unique_tfs
            genes = t.gene_names
            #print("####W######")
            # #print(W)
            # #print(W.shape)
            W_df = pd.DataFrame(W,index = tfs,columns = genes)
            W_df= W_df.stack().reset_index().rename(columns={'level_0':'tf','level_1':'gene', 0:'motif'})
            lioness_matrix = lioness_obj.export_lioness_results
            #print("###########3LIONESS_MATRIX############")
            # #print(type(lioness_matrix))
            # #print(lioness_matrix)
            if isinstance(lioness_matrix,np.ndarray):
                lioness_obj_df = pd.DataFrame(lioness_matrix,columns = ["tf","gene","motif","force"])
            adj_matrix = create_adj_matrix_from_rows_df(df = lioness_obj_df,gene_names= genes, unique_tfs= tfs,server_name = "lioness")
            #print("after:",adj_matrix.shape)
            #print("###adj_matrix_targeting######")
            #print(adj_matrix)
            lioness_adj_mat_df = pd.DataFrame(adj_matrix,index = tfs,columns = genes)
        elif input_mode == "Coexpression":

            with Capturing() as output:
                panda_obj = create_panda_obj(dataframes = dataframes,alpha = float(alpha), mode = mode,precision=precision,remove_missing=False, keep_expression_matrix= True)    
                lioness_obj = Lioness(panda_obj,save_dir = lioness_results_folder,start = start,end= end,save_fmt = "aaa",alpha = float(alpha),precision=precision)
            # Panda.processData(self=t,modeProcess=mode, motif_file=lioness_input_filepaths["motif"], expression_file=lioness_input_filepaths["expression"], ppi_file=None, remove_missing=False, keep_expression_matrix=False)
            # W = None
            # tfs = t.unique_tfs
            # genes = t.gene_names
            col_names = ["gene1","gene2","force"]
            #setting columns = ["tf,"gene","force"] just to make the frontend work 
            fake_col_names = ["tf","gene","force"]
            # otherwise the names should be gene1,gene2,force
            lioness_obj_df = pd.DataFrame(lioness_obj.export_lioness_results)
            lioness_obj_df.columns = fake_col_names
            # lioness_obj_df["motif"] = -1
            # cols_order = ["tf","gene","motif","force"]
            cols_order = ["tf","gene","force"]
            lioness_obj_df = lioness_obj_df[cols_order]
            lioness_adj_mat_df = Lioness(panda_obj,save_dir = lioness_results_folder,start = start,end= end,save_fmt = "aaa",alpha = float(alpha),precision=precision,output = 'gene_targeting').export_lioness_results
    except Exception as e:
        print('###run_lioness_exception###')
        print(e)
        lioness_obj_df,lioness_adj_mat_df = None,None
    return lioness_obj_df,lioness_adj_mat_df,output
def main():
  folder = '/home/'
  lioness_input_filepaths = {"expression": folder + 'ToyExpressionData.csv',  "motif":  folder + 'ToyMotifData.csv',"mir":  folder + 'ToyMiRList.csv','ppi':  folder + 'ToyPPIData.csv'}
  mode = 'union'
  precision = 'double'
  start = 1
  end = 1
  sample_mode = 'sample_num'
  input_mode = "PUMA"
  sample_num = 1
  alpha = 0.1


  sep = ','
  dataframes = {
      'expression':None,
      'motif': None,
      'ppi': None,
      'mir':None
      }
  dataframes['expression'] = read_expression_file(filepath = lioness_input_filepaths['expression'],header = 'infer',server = 'LIONESS',sample_mode = sample_mode)
  
  if input_mode == "PANDA":
      dataframes.update({
      'motif': pd.read_csv(lioness_input_filepaths['motif'],sep  = sep,header = None),
      'ppi': pd.read_csv(lioness_input_filepaths['ppi'],sep  = sep,header = None)
      })
      
   
  elif input_mode == "PUMA":
      #puma.py require mir to be a filepath
      dataframes.update({
      'motif': pd.read_csv(lioness_input_filepaths['motif'],sep  = sep,header = None),
      'mir': lioness_input_filepaths['mir'],#pd.read_csv(lioness_input_filepaths['mir'],sep  = sep,header = None)
          
      })
      
  elif input_mode == "Coexpression":
      pass
  else:
      return {'status': 'failed', 'reason': 'INVALID_INPUT_MODE_ARGUMENT'}
  lioness_obj_df,lioness_adj_mat_df,output = run_lioness(dataframes = dataframes,input_mode = input_mode,alpha = alpha,mode = mode,precision=precision,start =start ,end = end )
main()

Legacy mode without remove_missing=True causes shape mismatch

There is a missing [commind2] index/subset at https://github.com/netZoo/netZooPy/blob/master/netZooPy/panda/panda.py#L522

This bug is triggered if you invoke the Panda constructor with modeProcess='legacy', and keep the default value of remove_missing=False

Initialization of protein-protein interaction in Panda

I noticed in this line of code, a value of 1 is assigned to the self-interaction of proteins. I am concerned about the implications of this assignment on the normalization of PPI values. In my dataset, the PPI values are relatively small, and this assignment of 1 to self-interactions prior to normalization could render these values negligible. Is this assignment a necessary step before normalization, or should there be a specific format for the PPI values to address this issue?
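
The concern can be illustrated with a toy calculation. The sketch below applies a single global z-score to a small PPI matrix, once with zeros on the diagonal and once with self-interactions set to 1. This is a deliberate simplification: PANDA's actual normalization may differ (for example by working per row and per column), and the matrix values and helper names are assumptions made for this example.

```python
from statistics import mean, pstdev

def zscore(values):
    """Global z-score of a flat list of values."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def offdiag_spread(matrix):
    """Range (max - min) of the z-scored off-diagonal entries."""
    n = len(matrix)
    z = zscore([v for row in matrix for v in row])
    off = [z[i * n + j] for i in range(n) for j in range(n) if i != j]
    return max(off) - min(off)

# Toy PPI matrix with small interaction weights (hypothetical values).
ppi = [
    [0.0, 0.02, 0.01],
    [0.02, 0.0, 0.03],
    [0.01, 0.03, 0.0],
]
# Same matrix with self-interactions set to 1.
ppi_with_self = [
    [1.0 if i == j else v for j, v in enumerate(row)]
    for i, row in enumerate(ppi)
]

spread_plain = offdiag_spread(ppi)
spread_with_self = offdiag_spread(ppi_with_self)
```

Under this simplified scheme the ranking of the off-diagonal interactions is unchanged, but their spread in z-score units shrinks once the diagonal is set to 1, which is the effect the question describes.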

Error in generating Lioness file (-q) with command line panda

I am running Panda with normalized expression data and ppi/tf data downloaded from grand (tissues_ppi and tissues_motif). It appears to run without issues until the end on the lioness call, where it encounters an error that export_panda_results is not assigned:

step: 39, hamming: 0.0008884414281395395
Running panda took: 11017.05 seconds!
Saving PANDA network to data/panda_output/zm4_test_out.txt ...
  Elapsed time: 17.93 sec.
Loading input data ...
  Elapsed time: 0.00 sec.
Traceback (most recent call last):
  File "../netZooPy/netZooPy/panda/run_panda.py", line 85, in <module>
    sys.exit(main(sys.argv[1:]))
  File "../netZooPy/netZooPy/panda/run_panda.py", line 80, in main
    lioness_obj = Lioness(panda_obj)
  File "/Users/laurenhsu/Documents/GitHub/netZooPy/netZooPy/lioness/lioness.py", line 94, in __init__
    self.export_panda_results = obj.export_panda_results
AttributeError: 'Panda' object has no attribute 'export_panda_results'

The initial command I used was:

python ../netZooPy/netZooPy/panda/run_panda.py -e data/expr/zm4_test.txt -m data/fromgrand/tissues_motif.txt -p data/fromgrand/tissues_ppi.txt -o data/panda_output/zm4_test_out.txt -q data/panda_output/zm4_test_outq.txt

The -o file does save properly. I’d like to run lioness on the panda output, but am not sure how to proceed since the -q option file wasn’t outputted and the error occurred on that step.

Thanks!

lioness.export_lioness_results does not preserve sample names.

The input expression pandas DataFrame contains sample names in its header. After a panda -> lioness run, lioness.export_lioness_results does not preserve those sample names.

In expression dataset (truncated portion) from ToyExpressionData.txt:

         sample_1  sample_2  sample_3
0                                    
AACSL    0.141431 -4.153056  2.854971
AAK1     3.528478 -0.949701  1.039986
ABCA17P -2.597842  3.970710 -2.809212
ABCB8    0.352052 -1.866545 -0.007765
ABCC1   -4.638927  2.440799 -1.655580
...           ...       ...       ...
ZNF826  -4.294209 -4.498573  2.786462
ZNF845  -1.661144 -6.986089  2.273928
ZNF878   3.395504 -6.274497  0.455548
ZSWIM3  -0.494841  2.840674 -3.816640
ZWILCH   0.694298 -2.725693 -1.752258

[1000 rows x 3 columns]

Process is as follows:

panda_obj = Panda(exp_data,
    motif_data,
    ppi_data,
    remove_missing=False, 
    keep_expression_matrix=True, save_memory=False, modeProcess='legacy'
)
lioness_obj = Lioness(panda_obj, start=1, end=3)

Upon trying to extract the total networks, the sample names found in the panda dataframe (for expression data) column names are lost:

>>> lioness_obj.export_lioness_results
           tf    gene         0         1         2
0         AHR   AACSL -0.306151  -2.74038 -0.330492
1          AR   AACSL -0.955684 -0.390744 -0.814485
2      ARID3A   AACSL  3.104159  3.270957  3.185894
3        ARNT   AACSL  2.615047  1.257329  2.585935
4       BRCA1   AACSL  0.081374  0.660095  0.132886
...       ...     ...       ...       ...       ...
86995    TLX1  ZWILCH  -1.00726 -1.117813 -1.059336
86996    TP53  ZWILCH -0.731466 -0.389853 -0.738892
86997    USF1  ZWILCH -0.783222 -1.134095 -0.742434
86998     VDR  ZWILCH -1.002372 -0.696165 -1.090017
86999     YY1  ZWILCH  2.875387  2.784778  3.319501

[87000 rows x 5 columns]
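As a stopgap, the integer columns can be relabelled by hand. This is only a minimal sketch: it assumes the columns 0, 1, 2 follow the same order as the samples in the expression header, and the small DataFrame below is a hand-built stand-in for the real export, not output produced by netZooPy.

```python
import pandas as pd

# Sample names taken from the expression header shown above.
sample_names = ["sample_1", "sample_2", "sample_3"]

# Hand-built stand-in for the first two rows of the exported table.
lioness_df = pd.DataFrame(
    {
        "tf": ["AHR", "AR"],
        "gene": ["AACSL", "AACSL"],
        0: [-0.306151, -0.955684],
        1: [-2.74038, -0.390744],
        2: [-0.330492, -0.814485],
    }
)

# Map integer column labels 0..n-1 to the sample names (assumes matching order).
relabelled = lioness_df.rename(columns=dict(enumerate(sample_names)))
```

The "tf" and "gene" columns are untouched because only the integer labels appear in the mapping.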

sample names not strings

When sample names are numbers they are not recognized as strings and they don't match with the motif tab.

Add lioness start and end behavior

Add a flag so that lioness can use all samples as background but compute networks only for a subset of samples. Right now, start and end are used only to subselect the dataset.

PUMA

PUMA's description is missing from the README file.

save_tmp

I noticed that the default value of save_tmp is True. I'd suggest making it False, or adding an os.makedirs('./tmp', exist_ok=True) call so that the folder ./tmp is created if it doesn't exist.
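A minimal sketch of the second option, not the current netZooPy code. The helper name ensure_tmp_dir is hypothetical; the only real API used is os.makedirs, whose keyword is spelled exist_ok.

```python
import os
import tempfile

def ensure_tmp_dir(path="./tmp"):
    """Create `path` if it does not exist yet; do nothing if it already does."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrate inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    tmp_dir = ensure_tmp_dir(os.path.join(base, "tmp"))
    ensure_tmp_dir(tmp_dir)  # second call is a no-op, no FileExistsError
    created = os.path.isdir(tmp_dir)
```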

Lioness for dragon merge_col

I am trying to run Lioness Dragon and I am encountering an issue with the merge column.
Shouldn't the renaming be based on the ext2 here?

https://github.com/netZoo/netZooPy/blob/98a46ce0682fbfa35894190f690fb28a47379736/netZooPy/lioness/lioness_for_dragon.py#L90C35-L90C35

@katehoffshutta

Move IO functions outside of PANDA

Currently all reading functions (PPI, motif, expression) are inside the preprocessing steps of PANDA and are not accessible from the outside.
These need to be methods of the PANDA class.
Main issue is that this might change the way PANDA is called.

condor results change when updating python-igraph to > 0.7.1.post6

When python-igraph is updated to > 0.7.1.post6, the results of the community assignment change in the unit test.
I temporarily set the version to 0.7.1.post6 in #81 but we need to investigate.
@genisott any ideas?

IGRAPH version for condor

After the recent updates by @genisott, igraph has to be >= 0.9.6 for CONDOR to work without throwing errors (netZoo/netbooks#4).

run_lioness.py from command line doesn't work

The documentation here mentions that run_lioness.py can be run directly from the command line, for example python run_lioness.py -e expression.npy -m motif.npy -p ppi.npy -n panda.npy -o /tmp -f npy 1 100. However, I got the error below, which suggests a bug in this usage.

$ python3 run_lioness.py -e /Users/tian/Documents/GitHub/netZooPy/tests/ToyData/ToyExpressionData.txt
Traceback (most recent call last):
  File "run_lioness.py", line 22, in <module>
    from lioness import Lioness
  File "/Users/tian/Documents/GitHub/netZooPy/netZooPy/lioness/lioness.py", line 5, in <module>
    from .timer import Timer
ImportError: attempted relative import with no known parent package

DRAGON computes correlations but not p-values

Hi @katehoffshutta,

I am posting here for tracking purposes. In DRAGON, when the data is large, I sometimes can't compute the p-values needed to reduce large-scale networks.

The function fails in
https://github.com/netZoo/netZooPy/blob/master/netZooPy/dragon/dragon.py#L242

Here are the parameters of the failure:
Dlogli11 = lambda x: (1./4*p1*(p1-1)
                      *(sc.digamma(x/2)-sc.digamma((x-1)/2))
                      +term_Dlogli11)

term_Dlogli11=-77.92618920329457
p1=21337
n=832

Dlogli11(1.001)=227465526608.52557
Dlogli11(1000*n)=58.86679539046301

Here is the error:

Traceback (most recent call last):
File "", line 2, in
File "/home/ubuntu/netZooPy/netZooPy/dragon/dragon.py", line 311, in estimate_p_values_dragon
simultaneous=simultaneous) for seedi in range(10)]
File "/home/ubuntu/netZooPy/netZooPy/dragon/dragon.py", line 311, in
simultaneous=simultaneous) for seedi in range(10)]
File "/home/ubuntu/netZooPy/netZooPy/dragon/dragon.py", line 250, in estimate_kappa_dragon
kappa11 = optimize.bisect(Dlogli11, 1.001, 1000*n)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/scipy/optimize/zeros.py", line 549, in bisect
r = _zeros._bisect(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
ValueError: f(a) and f(b) must have different signs
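For tracking, here is a rough sketch of why the bracket fails and one possible workaround. It is not part of dragon.py: the function names and the doubling strategy are hypothetical, and it relies on the large-x approximation digamma(x/2) - digamma((x-1)/2) ≈ 1/x, which reproduces the value reported above at x = 1000*n to within about 0.001. Under that approximation the function is still positive at 1000*n, so the root lies above the upper bound passed to bisect.

```python
# Values reported in this issue.
term_Dlogli11 = -77.92618920329457
p1 = 21337
n = 832

def dlogli11_large_x(x):
    """Large-x approximation of Dlogli11 (assumes the digamma difference ~ 1/x)."""
    return 0.25 * p1 * (p1 - 1) / x + term_Dlogli11

def expand_upper_bound(f, lower, upper, max_doublings=60):
    """Double `upper` until f(lower) and f(upper) have opposite signs."""
    f_lower = f(lower)
    for _ in range(max_doublings):
        if f_lower * f(upper) < 0:
            return upper
        upper *= 2
    raise ValueError("no sign change found; the root may not exist in this range")

# Start from a lower end where the approximation is reasonable (x = 1000).
upper = expand_upper_bound(dlogli11_large_x, 1000.0, 1000.0 * n)
```

With these inputs a single doubling of the upper bound is enough to bracket a sign change. Whether a root that far out is statistically meaningful is a separate question.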

Header in savetxt

The header argument in
https://github.com/netZoo/netZooPy/blob/master/netZooPy/lioness/lioness.py#L392

should be a string not a boolean.

From the NumPy documentation:

header : str, optional
    String that will be written at the beginning of the file.

I suggest adding this case to the unit tests.
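A small sketch of what a string header looks like in practice. The array and the column labels are made up for illustration; only np.savetxt and its documented header, comments, and delimiter arguments are used.

```python
import io
import numpy as np

data = np.array([[0.1, 0.2], [0.3, 0.4]])
sample_names = ["sample_1", "sample_2"]  # hypothetical column labels

buffer = io.StringIO()
# header must be a str; comments="" drops the default "# " prefix.
np.savetxt(buffer, data, delimiter="\t", header="\t".join(sample_names), comments="")
first_line = buffer.getvalue().splitlines()[0]
```

A unit test could assert on the first line of the written file in the same way.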

Wrong prior order in ligress

The output panda table for each of the ligress networks reports the motif prior in the wrong order.

Tab delimiter for saving PANDA is wrong

The tab delimiter when saving PANDA networks seems to be wrong, i.e., /t instead of \t.
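To make the difference concrete, here is a short standard-library sketch. The rows are illustrative values, not real PANDA output, and the csv writer stands in for whatever routine actually saves the network.

```python
import csv
import io

rows = [("tf", "gene", "force"), ("AHR", "AACSL", "-0.306")]  # illustrative values

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
lines = buffer.getvalue().splitlines()

# "\t" is one tab character; "/t" is the two characters slash and t.
tab_len, slash_t_len = len("\t"), len("/t")
```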

Unable to find hdf5 when building locally (Mac M1)

When attempting to build netZooPy with local changes by running pip install -e . from the root directory of the repository, I get the following error:

      .. ERROR:: Could not find a local HDF5 installation.
         You may need to explicitly state where your local HDF5 headers and
         library can be found by setting the ``HDF5_DIR`` environment
         variable or by using the ``--hdf5`` command-line option.

The error persists after reinstalling hdf5 with homebrew and setting the HDF5_DIR environment variable as the error message suggested. This is happening on a Mac with M1 chip running Monterey 12.6.3.

PUMA MiR issue

Currently, PUMA's mir argument requires a filepath, but it should accept a pandas DataFrame as well.
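One way this could look, as a sketch only: a small loader that passes a DataFrame through unchanged and reads anything else with pandas. The helper name load_table is hypothetical and is not part of puma.py; the in-memory buffer stands in for a file path.

```python
import io
import pandas as pd

def load_table(source, sep="\t"):
    """Return a DataFrame whether `source` is already one or is a path/buffer."""
    if isinstance(source, pd.DataFrame):
        return source
    return pd.read_csv(source, sep=sep, header=None)

# Illustrative two-row table; the contents are made up.
text = "miR-1\tGENE_A\t1\nmiR-2\tGENE_B\t1\n"
from_buffer = load_table(io.StringIO(text))
from_frame = load_table(from_buffer)  # already a DataFrame: returned as-is
```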

Singular shrunken covariance matrix in DRAGON

The DRAGON precision matrix is the inverse of the shrunken covariance $(1-\lambda)S + \lambda T$, where $T$ is the target matrix of covariances on the diagonal and should be full rank. However, we do not currently check whether $T$ is full rank. If any of the variables have zero variance, $(1-\lambda)S + \lambda T$ can be singular. DRAGON will not recognize this problem; it will try to invert the matrix, and the numpy.linalg matrix inversion function will raise an exception that is unclear.

I plan to handle this by adding default behavior: any variables with zero variance are excluded from the analysis, and a note warning of the excluded variables is printed to the console.

I will also add a try-catch on the rank of $(1-\lambda)S + \lambda T$ so that DRAGON provides an informative exception before the problem can percolate down to numpy.linalg.
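A rough sketch of both pieces of the plan, under stated assumptions: the target is taken to be the diagonal of the sample covariance, an explicit rank check stands in for the try-catch, and the function names are hypothetical. This is not the DRAGON implementation.

```python
import numpy as np

def drop_zero_variance(X):
    """Return X without constant columns, plus the indices that were dropped."""
    variances = X.var(axis=0)
    keep = variances > 0
    dropped = np.flatnonzero(~keep)
    if dropped.size:
        print(f"Warning: excluding zero-variance variables at columns {dropped.tolist()}")
    return X[:, keep], dropped

def shrunken_covariance(X, lam):
    """(1 - lam) * S + lam * T with T = diag(S); raises a clear error if singular."""
    S = np.cov(X, rowvar=False)
    T = np.diag(np.diag(S))  # assumption: target is the diagonal of S
    sigma = (1 - lam) * S + lam * T
    if np.linalg.matrix_rank(sigma) < sigma.shape[0]:
        raise ValueError("shrunken covariance is singular; check for zero-variance variables")
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X[:, 2] = 1.0  # a constant (zero-variance) variable

X_clean, dropped = drop_zero_variance(X)
sigma = shrunken_covariance(X_clean, lam=0.1)
```

Calling shrunken_covariance on the unfiltered X raises the ValueError instead of failing inside the inversion.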

Rename Ligress to Bonobo

There is a ligress folder in netZooPy/netZooPy that should be named Bonobo instead.

suspect panda_indegree doesn't work

I suspect that panda_indegree doesn't work: line 361 in panda.py uses the attribute 'panda_results', which hasn't been defined before that point.

panda --save_memory outputs the adjacency but without gene names

When using --save_memory the output is an adjacency matrix but without index and column names (i.e., without the gene names).

LIONESS Py error

I am running lioness with cli like this:

netzoopy lioness -e /home/ubuntu/partial_expression.tsv -m /home/ubuntu/whole_blood_tissues_motif.txt -p /home/ubuntu/whole_blood_tissues_ppi.txt -op /home/ubuntu/copdgene_yes_cp_dev_panda_lion.txt -ol /home/ubuntu/yes_copd_dev_lion/ --computing gpu --panda_start 1 --panda_end 5 --precision single --save_memory --mode_process intersection --save_single_lioness --ignore_final

And I am getting this error:

File "/opt/conda/lib/python3.10/site-packages/netZooPy/lioness/lioness.py", line 135, in __init__
self.export_panda_results = obj.export_panda_results
AttributeError: 'Panda' object has no attribute 'export_panda_results'. Did you mean: 'save_panda_results'?

can you please help me :)

labels genes and tf in Panda output

Hi,

I have run PANDA on this data:
Gene expression initial shape: (5217, 50)
Motif Shape:
Tf 105
Gene 3551

I can't find the labels for the rows and columns of the output matrix panda.panda_network. I get:
panda.num_genes = 5088
panda.panda_network.shape = (105, 5088)
panda.motif_matrix.shape = (105, 5088)
len(panda.motif_genes) = 3551
len(panda.expression_genes) = 5217

Which genes should I use for the columns of panda.panda_network? Thanks :)

Lioness indices

Feature request by @talkhanz: add an argument to lioness that selects a non-contiguous set of samples, for example [4,5,10], as opposed to the start and end arguments, which produce all samples between start and end.
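A sketch of how the two selection modes could coexist. The function name and the subset argument are hypothetical, and start and end are assumed to be 1-based and inclusive; none of this reflects the current Lioness signature.

```python
def resolve_sample_indices(n_samples, start=1, end=None, subset=None):
    """Return 0-based sample indices from a 1-based range or an explicit list.

    `subset` is a hypothetical argument holding 1-based sample numbers.
    """
    if subset is not None:
        indices = sorted(set(subset))
        if indices and (indices[0] < 1 or indices[-1] > n_samples):
            raise ValueError("sample numbers must lie between 1 and n_samples")
        return [i - 1 for i in indices]
    end = n_samples if end is None else end
    return list(range(start - 1, end))

contiguous = resolve_sample_indices(12, start=4, end=6)
scattered = resolve_sample_indices(12, subset=[4, 5, 10])
```

Giving the explicit list priority over start and end keeps existing calls working unchanged.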

Circular import

Hi, I'm not sure if it's my environment causing the problem, but I can't import the package due to a circular import:

ImportError                               Traceback (most recent call last)
<ipython-input-2-600f66c3916d> in <module>
      1 ##
----> 2 from netZooPy.panda.panda import Panda
      3 import pandas as pd
      4 import matplotlib.pyplot as plt
      5 import numpy as np

~/.conda/envs/pypandaenv1/lib/python3.8/site-packages/netZooPy/__init__.py in <module>
      1 from __future__ import absolute_import
      2 
----> 3 from netZooPy import panda
      4 from netZooPy import puma
      5 from netZooPy import lioness

ImportError: cannot import name 'panda' from partially initialized module 'netZooPy' (most likely due to a circular import) (/***/****/.conda/envs/pypandaenv1/lib/python3.8/site-packages/netZooPy/__init__.py)

In my pip freeze I have only one version of netZooPy:

[...]
# Editable install with no version control (netZooPy==0.8)
-e /****/****/.conda/envs/pypandaenv1/lib/python3.8/site-packages
[...]

installed from git as suggested:

pip install git+git://github.com/netZoo/netZooPy.git

What do you suggest?

Catch gene mismatches in PANDA

This isn't so much a bug as an avoidable user error which would be easy to catch/implement. I can submit a PR with a basic patch, if you'd like.

Further explanation:
We encountered this in JQ's WebMeV project where users upload/explore their own data. Clearly, if one were using the tool directly, they would likely recognize that the symbols in their expression matrix and motif priors don't have any intersection. In our case, someone didn't read the directions carefully and submitted an expression matrix with mouse symbols. Naturally, these symbols did not have any intersection with the symbols contained in the human motif prior matrix (second column). As a result, an error is raised here: https://github.com/netZoo/netZooPy/blob/master/netZooPy/panda/panda.py#L542-L544

(namely, the idx_genes array is empty)

The exception was raised by NumPy, but the reason was not immediately clear:

  File "/opt/conda/lib/python3.10/site-packages/netZooPy/panda/panda.py", line 132, in __init__
    self.processData(
  File "/opt/conda/lib/python3.10/site-packages/netZooPy/panda/panda.py", line 555, in processData
    idx = np.ravel_multi_index(
TypeError: indices must be integral: the provided empty sequence was inferred as float. Wrap it with 'np.array(indices, dtype=np.intp)'

So an easy fix here would be to assert a non-zero length of idx_genes and idx_tfs, raising a clear exception if either is empty.
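A minimal sketch of such a check, written against plain lists rather than the actual idx_genes and idx_tfs arrays. The function name, the error wording, and the gene symbols are illustrative assumptions, not the proposed patch itself.

```python
def check_gene_overlap(expression_genes, motif_genes):
    """Return the shared gene symbols, or raise a clear error if there are none."""
    shared = set(expression_genes) & set(motif_genes)
    if not shared:
        raise ValueError(
            "No gene symbols are shared between the expression matrix and the "
            "motif prior; check that both use the same organism and identifier type."
        )
    return shared

human_motif_genes = ["TP53", "BRCA1", "MYC"]  # illustrative symbols
mouse_expression_genes = ["Trp53", "Brca1", "Myc"]

try:
    check_gene_overlap(mouse_expression_genes, human_motif_genes)
    message = None
except ValueError as err:
    message = str(err)
```

The same pattern would apply to the TF side of the prior.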

Deprecated append in Lioness with pandas 2.0

I get this error when running netZooR through the Python link

Error: AttributeError: 'DataFrame' object has no attribute 'append'

when I use netZooPy with pandas v2.0.

Reference

netZooR needs to be updated after this
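For context, DataFrame.append was removed in pandas 2.0, and pd.concat is the usual replacement. A minimal sketch with made-up rows, not the actual Lioness code path:

```python
import pandas as pd

results = pd.DataFrame({"tf": ["AHR"], "gene": ["AACSL"]})
new_rows = pd.DataFrame({"tf": ["AR"], "gene": ["AACSL"]})

# pandas < 2.0: results = results.append(new_rows, ignore_index=True)
results = pd.concat([results, new_rows], ignore_index=True)
```

pd.concat works on both older and newer pandas versions, so the change does not require pinning.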