engreitzlab / gene_network_evaluation

Evaluation framework for computationally inferred gene networks from single-cell data.
Do we even want this for programs? What does it add?
Ran out of time to add details.
Have the pipeline dump a `config.yaml` in the output directory that I can use for the dashboard. Here is an example:
```yaml
output_loc: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations/cNMF_60_0.2_gene_names.h5mu
categorical_keys: ['sample']
continuous_keys: None
dim_reduce_keys: None
workdir: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations
data_key: rna
annotations_loc: "annotations.csv"  # if None, defaults to annotations.csv
```
The user can manually specify a location for the annotations to be dumped, as above; otherwise it defaults to the evaluation directory.
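A minimal sketch of how the dashboard side might load this config and resolve the annotations default (the loader function itself is hypothetical; the field names come from the example above):

```python
import os
import yaml

def load_eval_config(path):
    """Load the pipeline-dumped config.yaml and fill in defaults (sketch)."""
    with open(path) as f:
        config = yaml.safe_load(f)
    # If no annotations location is given, default to annotations.csv
    # in the evaluation directory, as described above.
    if not config.get("annotations_loc"):
        config["annotations_loc"] = os.path.join(config["workdir"], "annotations.csv")
    return config
```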
This is important for things like the following: we used to have a column for this in the motif enrichment output file, but it is not there in the latest outputs (presumably because the E2G links weren't there yet).
| ProgramID | EPType | TFMotif | PValue | FDR | Enrichment |
|---|---|---|---|---|---|
| K60_1 | Promoter | AHR | 0.044631 | 0.210088 | 1.594955 |
| K60_10 | Promoter | AHR | 0.351685 | 0.67633 | 1.242518 |
| K60_11 | Promoter | AHR | 0.681555 | 0.885289 | 0.901666 |
| K60_12 | Promoter | AHR | 0.446282 | 0.745748 | 1.204339 |
How do we want to handle this more generally? One way I could see for the dashboard: include a `type` column for this regardless of what enrichment is run on.

It isn't easy to come up with a generic procedure for this calculation in the former scenario, since many methods further process the expression matrix internally without reporting it.
The model output can be arbitrarily worse for non-linear methods and produce negative values for this evaluation.
For k selection (assuming good model fit), we can get a knee plot by plotting the variance explained by each component w.r.t. the total modeled variance. This evaluation will focus on selecting the appropriate k, while we can introduce a different evaluation to assess goodness of fit.
Or we could come up with a generalisable evaluation (e.g. information-based) to compute goodness of fit that can also be used for k selection.
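For a linear factorization, here is a minimal sketch of the per-component calculation behind such a knee plot (function and argument names are hypothetical; it ignores cross-terms between non-orthogonal components, which is usually acceptable for a knee plot):

```python
import numpy as np

def component_variance_shares(cell_scores, gene_loadings):
    """Share of modeled variance per component, assuming a linear model
    X_hat = cell_scores @ gene_loadings (shapes: cells x k and k x genes)."""
    component_ss = np.array([
        np.square(np.outer(cell_scores[:, k], gene_loadings[k, :])).sum()
        for k in range(cell_scores.shape[1])
    ])
    return component_ss / component_ss.sum()

# Knee plot: sort shares in descending order and look for the elbow, e.g.
# shares = component_variance_shares(S, L); plt.plot(np.sort(shares)[::-1])
```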
Should we simplify the repo by branching the current main and then removing all the GRN stuff from it? This includes:
It would be great to have something to point people to. It should probably include either information or references on:
Display all the package versions used for inference and evaluation on a tab of the dashboard. Is this something that snakemake can easily do (or already does)? It would just be a file output to the evaluations directory, like `versions.yaml` or something. Here is an example I made up for the iPSC data:
```yaml
inference_software_versions:
  cnmf: 0.0.q
  numpy: 1.20.2
  pandas: 1.2.4
evaluation_software_versions:
  gene_program_evaluation: 0.0.1
  mudata: 0.0.1
  joblib: 1.0.1
  scipy: 1.6.2
  numpy: 1.20.2
  pandas: 1.2.4
  scikit-learn: 0.24.2
  scikit-posthocs: 0.6.6
  seaborn: 0.11.1
  gseapy: 0.10.1
  pymemesuite: 0.0.1
  google-cloud-bigquery: 2.26.0
```
But we can pretty much do anything as long as it's consistent and makes sense.
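As a sketch, the evaluation side could be a small helper that queries the installed distributions and dumps them (the function, package list, and output key are illustrative; it could be called from a Snakemake rule or an `onsuccess` handler):

```python
import yaml
from importlib.metadata import version, PackageNotFoundError

def dump_versions(packages, out_path="versions.yaml",
                  key="evaluation_software_versions"):
    """Write the installed version of each package to a YAML file (sketch)."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = version(pkg)
        except PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    with open(out_path, "w") as f:
        yaml.safe_dump({key: versions}, f)
```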
We need code for cNMF inference in `examples/inference/iPSC_EC/cNMF` for the jamboree: just a minimal example of what we have now to reproduce a cNMF result on an input dataset.
We need to add an evaluation that tests the robustness of programs across multiple runs (seeds) and also across multiple K-values.
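A minimal sketch of one way such an evaluation could work, matching programs between two runs by cosine similarity of their gene loadings (the function and argument names are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def program_robustness(loadings_a, loadings_b):
    """Mean similarity of optimally matched programs between two runs.

    loadings_a, loadings_b: (n_programs, n_genes) arrays from two seeds
    (or from runs at different K, in which case the smaller set of
    programs is matched into the larger one)."""
    # Cosine similarity between every pair of programs.
    a = loadings_a / np.linalg.norm(loadings_a, axis=1, keepdims=True)
    b = loadings_b / np.linalg.norm(loadings_b, axis=1, keepdims=True)
    sim = a @ b.T
    # Hungarian matching maximizes the total similarity of the pairing.
    row, col = linear_sum_assignment(-sim)
    return sim[row, col].mean()
```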
To generate a PheWAS plot, we need to run https://github.com/EngreitzLab/gene_network_evaluation/blob/main/src/plotting/plot_gwas_enrichment.py#L10. To avoid extra computation in the dashboard, we should just compute this automatically during evaluation and save it to a separate file, `trait_enrichment_processed.txt`.
I think this visualization does a nice job of showing how a given cell distributes its function across programs.
The Topyfic version is pretty complex and might need substantial refactoring to be plotted in the dashboard. A key part of this is efficiency: we don't want it to take 10 seconds every time the user reloads the page.
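One simple option (a sketch): memoize the expensive data preparation so page reloads hit an in-memory cache. The `cell_scores` slot name is hypothetical, and a real dashboard might use dash's own caching utilities instead:

```python
from functools import lru_cache
import mudata

@lru_cache(maxsize=4)
def structure_plot_frame(h5mu_path, data_key):
    """Compute per-cell program proportions once per (file, key) pair;
    subsequent page reloads reuse the cached result (sketch)."""
    mdata = mudata.read_h5mu(h5mu_path)
    scores = mdata[data_key].obsm["cell_scores"]  # hypothetical slot name
    return scores / scores.sum(axis=1, keepdims=True)  # assumes a dense array
```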
Most tests (e.g. covariate association, perturbation association) give highly inflated significance (very low p-values) under typical thresholds (e.g. 0.05). We need to implement methods to calibrate the output of these tests against appropriate null p-value distributions.
Separately, reporting log2FC would also aid interpretation compared to raw test statistics; in the dashboard we should prioritize reporting log2FC.
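As a sketch, one generic way to calibrate is an empirical permutation null: shuffle the labels to break the association, recompute the statistic, and take the empirical tail probability (the statistic function and all names here are hypothetical):

```python
import numpy as np

def permutation_pvalue(scores, labels, stat_fn, n_perm=1000, seed=0):
    """Empirical p-value of stat_fn(scores, labels) against a shuffled null."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(scores, labels)
    null = np.array([
        stat_fn(scores, rng.permutation(labels)) for _ in range(n_perm)
    ])
    # The +1 correction avoids returning exactly p = 0.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```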
We need code for Topyfic inference in `examples/inference/iPSC_EC/Topyfic`: just a minimal example of what we have now to reproduce a Topyfic result on an input dataset.
Current scripted implementation: https://github.com/EngreitzLab/gene_network_evaluation/blob/jamboree-gene-programs-2024/app/tests/motif_enrichment.ipynb
It was failing due to a documented tangermeme issue (jmschrei/tangermeme#18), which could be fixed by the v0.0.3 release today.
In many scenarios, it is preferable to test for significant perturbations per cell type or stage of differentiation.
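A minimal sketch of what such stratified testing could look like, using a Mann-Whitney U test within each cell type (the function and argument names are hypothetical):

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def per_celltype_perturbation_test(scores, perturbed, cell_type):
    """Test perturbed vs. control program scores within each cell type.

    scores: per-cell program scores; perturbed: boolean mask;
    cell_type: cell-type labels. All are index-aligned pandas Series."""
    results = []
    for ct, idx in cell_type.groupby(cell_type).groups.items():
        p_scores = scores.loc[idx][perturbed.loc[idx]]
        c_scores = scores.loc[idx][~perturbed.loc[idx]]
        if len(p_scores) and len(c_scores):
            _, p = mannwhitneyu(p_scores, c_scores)
            results.append((ct, p))
    return pd.DataFrame(results, columns=["cell_type", "pvalue"])
```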
I've been using these functions that @aron0093 originally implemented:
```python
import pandas as pd

def count(categorical_var, count_var, dataframe):
    # Count the number of rows per category, sorted in descending order.
    counts_df = dataframe.value_counts([categorical_var, count_var])
    counts_df = counts_df.groupby(categorical_var).sum()
    counts_df = counts_df.sort_values(ascending=False)
    counts_df = pd.DataFrame(counts_df.reset_index().values,
                             columns=[categorical_var, count_var])
    return counts_df

def count_unique(categorical_var, count_var, dataframe, cummul=False, unique=False):
    # Count terms per category, assigning each term only to the first
    # category it appears in (categories are visited in descending order
    # of their total counts).
    counts_df = count(categorical_var, count_var, dataframe)
    new_df = []
    terms = []
    for prog in counts_df[categorical_var].unique():
        terms_ = dataframe.loc[dataframe[categorical_var] == prog, count_var].unique()
        unique_terms = [term for term in terms_ if term not in terms]
        terms.extend(unique_terms)
        new_df.append([prog, len(unique_terms)])
    new_df = pd.DataFrame(new_df, columns=[categorical_var, count_var])
    if cummul:
        new_df[count_var] = new_df[count_var].cumsum()
    if unique:
        return counts_df
    else:
        return new_df
```
| program | geneset | p-value | adjusted p-value |
|---|---|---|---|
| program1 | genesetA | 0.01 | 0.01 |
| program1 | genesetA | 0.02 | 0.02 |
| program2 | genesetA | 0.03 | 0.03 |
| program2 | genesetA | 0.04 | 0.04 |
| program3 | genesetB | 0.05 | 0.05 |
I can think of three ways to count terms:

1. Count everything, including terms enriched multiple times in the same program (shouldn't happen, right?) and terms enriched in multiple programs:

```python
count(categorical_var=categorical_var, count_var=count_var, dataframe=data)
```
| program | geneset |
|---|---|
| program1 | 2 |
| program2 | 2 |
| program3 | 1 |
2. Count each term at most once per program, but in every program it is enriched in; i.e. if `program1` and `program2` are both enriched for `genesetA`, we count it for both programs:

```python
unique_data = data.drop_duplicates(subset=[categorical_var, count_var])
count(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
```
| program | geneset |
|---|---|
| program1 | 1 |
| program2 | 1 |
| program3 | 1 |
3. Count each term only for the program in which it is most significantly enriched; i.e. if `program1` and `program2` are both enriched for `genesetA`, but `program1` has a much lower adjusted p-value, we only count `genesetA` for `program1`:

```python
unique_data = data.sort_values(by=sig_var)
unique_data = unique_data.drop_duplicates(subset=count_var)
unique_df = count_unique(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
unique_df = unique_df.sort_values(count_var, ascending=False)
```
| program | geneset |
|---|---|
| program1 | 1 |
| program3 | 1 |
**Note:** I didn't use `count_unique(..., unique=True)` here because I think it arbitrarily selects which program to bin a term into when it is duplicated across programs, rather than selecting the one it is most enriched for.
I think it depends. Most of the time, option 2 seems right to me, since we could easily have redundancy between programs and we want that captured. But maybe we should make this something a dashboard user can select?
This is something that @nargesr implements in Topyfic: https://github.com/mortazavilab/Topyfic/blob/main/Topyfic/analysis.py#L438.
Correct me if I'm wrong, but it looks like it binarizes all the categorical traits you pass in and calculates a Spearman correlation between those binary vectors and each program's cell participation vector. It also tries to remove noise from the bottom of the participation vector, using the minimum value as a threshold.
I like the idea of having some kind of global view of the associations between each program (x-axis) and covariate (y-axis). A heatmap seems a logical choice, but we should agree on which statistic to show and how to present it. Spearman correlation could make sense for continuous variables, but maybe Kruskal-Wallis makes more sense for categorical ones.
Whatever we end up with, we should probably implement it in plotly (shouldn't be too hard); see the sketch below.
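A minimal sketch of that idea (the function and argument names are hypothetical; the statistic choice per dtype follows the suggestion above):

```python
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import spearmanr, kruskal

def covariate_program_heatmap(cell_scores, covariates):
    """-log10(p) heatmap of program-covariate associations (sketch).

    cell_scores: DataFrame (cells x programs) of program participation.
    covariates: DataFrame (cells x covariates), mixed dtypes, same index."""
    pvals = pd.DataFrame(index=covariates.columns,
                         columns=cell_scores.columns, dtype=float)
    for cov in covariates.columns:
        for prog in cell_scores.columns:
            if pd.api.types.is_numeric_dtype(covariates[cov]):
                # Continuous covariate: Spearman correlation.
                _, p = spearmanr(covariates[cov], cell_scores[prog])
            else:
                # Categorical covariate: Kruskal-Wallis across groups.
                groups = [g[prog].values
                          for _, g in cell_scores.groupby(covariates[cov])]
                _, p = kruskal(*groups)
            pvals.loc[cov, prog] = p
    return px.imshow(-np.log10(pvals), labels=dict(color="-log10(p)"))
```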
Currently the motif enrichment considers all motif matches with an FDR < 0.05. While this threshold can be adjusted in the internals, it is not a parameter in the user-facing enrichment function.
Rerun the evaluations using the updated "pipeline" on the TeloHAEC dataset.