
gene_network_evaluation's Issues

Align dashboard with config from evaluation pipeline

Have the pipeline dump a config.yaml in the output directory that I can use for the dashboard. Here is an example:

output_loc: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations/cNMF_60_0.2_gene_names.h5mu
categorical_keys: ['sample']
continuous_keys: None
dim_reduce_keys: None
workdir: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations
data_key: rna
annotations_loc: "annotations.csv"  # if none defaults to annotations.csv

The user can manually specify a location for the annotations to be dumped, as above; otherwise it defaults to the evaluation directory.

This is important for things like:

  1. Knowing what categorical keys to plot
  2. Knowing what the inference key (data_key) was
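
A minimal sketch of how the dashboard could read this file, assuming PyYAML and the keys shown in the example above (the helper name and fallback logic are mine, not part of the pipeline):

# Sketch: load the config.yaml dumped by the pipeline (assumes PyYAML).
# The keys mirror the example above; treat this as a proposal, not a final schema.
import os
import yaml

def load_eval_config(workdir):
    with open(os.path.join(workdir, "config.yaml")) as f:
        config = yaml.safe_load(f)
    # Note: a bare None in YAML parses as the string "None"; use null for a true null
    # Default the annotations location to the evaluation directory, as described above
    config.setdefault("annotations_loc", os.path.join(workdir, "annotations.csv"))
    return config

# Example usage (placeholder path):
# config = load_eval_config("path/to/evaluation/output")
# categorical_keys = config.get("categorical_keys", [])
# data_key = config.get("data_key", "rna")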

How are we handling promoter vs enhancer motif outputs from evaluation?

We used to have a column for this in the output file for motif enrichment, but that is not there in the latest outputs (presumably because E2G links weren't there yet).

ProgramID  EPType    TFMotif  PValue    FDR       Enrichment
K60_1      Promoter  AHR      0.044631  0.210088  1.594955
K60_10     Promoter  AHR      0.351685  0.67633   1.242518
K60_11     Promoter  AHR      0.681555  0.885289  0.901666
K60_12     Promoter  AHR      0.446282  0.745748  1.204339

How do we want to handle this more generally? The two ways I could see for the dashboard:

  1. Include a type column (e.g. EPType) regardless of what the enrichment is run on.
  2. Output separate files for each type and name them differently.

Option 1 seems better and more flexible (see the sketch below), but either would work.
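
A minimal sketch of option 1 from the dashboard side, assuming a single combined table with a type column named EPType as in the example above (function and file names are placeholders):

# Sketch: filter one combined motif-enrichment table by a type column (option 1).
# Column names (ProgramID, EPType, ...) follow the example above; the selected
# type would come from a dashboard dropdown.
import pandas as pd

def subset_motif_results(df, ep_type="Promoter"):
    if "EPType" not in df.columns:
        # Older outputs without the column: treat everything as a single type
        return df
    return df[df["EPType"] == ep_type]

# motifs = pd.read_csv("motif_enrichment.txt", sep="\t")   # placeholder filename
# promoter_hits = subset_motif_results(motifs, "Promoter")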

Code maintenance and good practices

  1. We should complete functional and integration tests once the codebase stabilises.
  2. We should implement continuous integration and code linting.

General procedure to assess goodness of fit rather than explained variance ratio.

  1. It isn't easy to come up with a generic procedure for this calculation in the former scenario since many methods further process the expression matrix internally without reporting it.

  2. The model output can be arbitrarily worse for non-linear methods and produce negative values for this evaluation.

For k selection (assuming a good model fit) we can get a knee plot by plotting the variance explained by each component relative to the total modeled variance (a sketch follows below). This evaluation will focus on selecting the appropriate k, while we can introduce a separate evaluation to assess goodness of fit.
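
A minimal sketch of that knee plot, assuming a linear factorization X ≈ W H with cell scores W (cells × k) and program loadings H (k × genes); the per-component contribution below is a rough proxy, not a method-specific formula:

# Sketch: knee plot of each component's share of the total modeled variance.
# Assumes a linear factorization X ≈ W @ H; per-component contribution is
# approximated by ||w_i h_i^T||_F^2 = ||w_i||^2 * ||h_i||^2.
import numpy as np
import matplotlib.pyplot as plt

def knee_plot(W, H):
    contrib = (W ** 2).sum(axis=0) * (H ** 2).sum(axis=1)
    share = np.sort(contrib / contrib.sum())[::-1]   # fraction of modeled variance, descending
    plt.plot(np.arange(1, len(share) + 1), share, marker="o")
    plt.xlabel("Component (ranked)")
    plt.ylabel("Fraction of modeled variance")
    plt.title("Knee plot for k selection")
    plt.show()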

Or we could come up with a generalisable evaluation (e.g. information-based) to compute goodness of fit that can also be used for k selection.

Adding package versions to outputs from evaluation and (potentially) inference

Display all the package versions used for inference and evaluation on a tab of the dashboard. Is this something that snakemake can easily do (or already does)?

This would just be a file written to the evaluations directory, like versions.yaml or something similar. Here is an example I made up for the iPSC data:

inference_software_versions:
  cnmf: 0.0.q
  numpy: 1.20.2
  pandas: 1.2.4
evaluation_software_versions:
  gene_program_evaluation: 0.0.1
  mudata: 0.0.1
  joblib: 1.0.1
  scipy: 1.6.2
  numpy: 1.20.2
  pandas: 1.2.4
  scikit-learn: 0.24.2
  scikit-posthocs: 0.6.6
  seaborn: 0.11.1
  gseapy: 0.10.1
  pymemesuite: 0.0.1
  google-cloud-bigquery: 2.26.0

But we can pretty much do anything as long as it's consistent and makes sense.
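
A minimal sketch of how the pipeline could write such a file, using importlib.metadata from the standard library; the package list and filename follow the made-up example above and are assumptions, not a decided format:

# Sketch: dump installed package versions to versions.yaml in the evaluations directory.
# The package list and filename follow the example above and are placeholders.
from importlib.metadata import version, PackageNotFoundError
import yaml

EVAL_PACKAGES = ["mudata", "joblib", "scipy", "numpy", "pandas",
                 "scikit-learn", "scikit-posthocs", "seaborn", "gseapy"]

def dump_versions(path="versions.yaml", packages=EVAL_PACKAGES):
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = version(pkg)
        except PackageNotFoundError:
            versions[pkg] = "not installed"
    with open(path, "w") as f:
        yaml.safe_dump({"evaluation_software_versions": versions}, f)

# dump_versions("path/to/evaluations/versions.yaml")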

Add an evaluation for model robustness

We need to add an evaluation that tests the robustness of programs across multiple runs (seeds) and also across multiple K-values.

  1. A weak test can assess similarity of the overall information captured by each run.
  2. A stronger test would compare programs across runs and assess their consistency (see the sketch below).
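
A minimal sketch of the stronger test, assuming program loading matrices (k × genes) over the same genes from two seeds; matching by cosine similarity with the Hungarian algorithm is one possible definition of consistency, not the decided one:

# Sketch: match programs between two runs (seeds) and report pairwise similarity.
# Assumes loadings_a and loadings_b are (k x genes) arrays over the same genes;
# cosine similarity + Hungarian matching is just one option.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.pairwise import cosine_similarity

def match_programs(loadings_a, loadings_b):
    sim = cosine_similarity(loadings_a, loadings_b)   # (k_a x k_b) similarity matrix
    row, col = linear_sum_assignment(-sim)            # maximize total matched similarity
    return list(zip(row, col)), sim[row, col]         # matched pairs + their similarities

# A robustness summary could then be the mean matched similarity across seed pairs,
# computed per K to also inform K selection.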

Single run analysis page, Section 2: Covariate association

I currently have a section of the dashboard meant to explore the association with covariates, and I'm not sure exactly what to populate it with.

Part of this is due to not having the latest covariate association output. But I think this warrants some discussion.

This issue is a meta-issue for a few others: #24, #23

Update gene set enrichment functionality

  • Take in a lower threshold on program weights, removing low-weight genes prior to running enrichment (see the sketch below)
  • Add the capability to use only the top N genes per program
  • Rename columns to a standardized set (see below)
  • Reconfigure the dashapp to load multiple results, with a dropdown to select which one to use
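
A minimal sketch of the gene selection step (low-weight threshold and/or top N), assuming program weights come as a programs × genes DataFrame; the names are placeholders, not the package's API:

# Sketch: select genes per program before running gene set enrichment.
# Assumes `weights` is a (programs x genes) DataFrame of program weights;
# function and argument names are placeholders.
import pandas as pd

def select_genes(weights, program, min_weight=None, top_n=None):
    w = weights.loc[program].sort_values(ascending=False)
    if min_weight is not None:
        w = w[w >= min_weight]   # drop low-weight genes prior to the run
    if top_n is not None:
        w = w.head(top_n)        # keep only the top N genes
    return w.index.tolist()

# genes = select_genes(weights, "K60_1", min_weight=0.01, top_n=300)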

Structure plot

I think this visualization does a nice job showing how a given cell distributes its function across programs.

The Topyfic version is pretty complex and might need substantial refactoring to be plotted in the dashboard. A key consideration is efficiency: we don't want it to take 10 s every time the user reloads the page.
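
A minimal sketch of a structure plot in Plotly (since the dashboard is Plotly/Dash), assuming a cells × programs DataFrame of per-cell program proportions; this is a lightweight stand-in, not the Topyfic implementation:

# Sketch: structure plot as a stacked bar chart of per-cell program proportions.
# Assumes `props` is a (cells x programs) DataFrame whose rows sum to ~1.
# Cells are ordered by their dominant program so that blocks are visible.
import pandas as pd
import plotly.graph_objects as go

def structure_plot(props):
    order = props.idxmax(axis=1).sort_values().index   # group cells by dominant program
    props = props.loc[order]
    fig = go.Figure()
    for program in props.columns:
        fig.add_trace(go.Bar(x=props.index, y=props[program], name=str(program)))
    fig.update_layout(barmode="stack", bargap=0,
                      xaxis_title="Cells", yaxis_title="Program proportion")
    fig.update_xaxes(showticklabels=False)
    return fig

# For large datasets, downsampling or pre-aggregating cells (and caching the figure)
# would help keep page reloads well under the 10 s mentioned above.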

Calibration of association tests

Most tests (e.g. covariate association, perturbation association) give highly inflated significance (very low p-values) under typical thresholds (e.g. 0.05). We need to implement methods to calibrate the output of these tests against appropriate null p-value distributions.

Separately, reporting log2FC would also help interpretation compared to relying on test statistics alone. In the dashboard we should prioritize reporting log2FC.
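
A minimal sketch of one calibration approach: build an empirical null by permuting labels and compare the observed p-value against it. The test function is a placeholder, and permutation is only one of several possible calibration strategies:

# Sketch: empirical-null calibration of an association p-value by label permutation.
# `test_fn(scores, labels)` is a placeholder returning a p-value (e.g. a
# Kruskal-Wallis test of program scores across covariate levels).
import numpy as np

def calibrated_pvalue(scores, labels, test_fn, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = test_fn(scores, labels)
    null_pvals = np.array([test_fn(scores, rng.permutation(labels))
                           for _ in range(n_perm)])
    # Fraction of null p-values at least as extreme as the observed one
    return (1 + np.sum(null_pvals <= observed)) / (1 + n_perm)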

How do we define a term as uniquely enriched or associated?

I've been using these functions that @aron0093 originally implemented:

import pandas as pd


def count(categorical_var, count_var, dataframe):
    # Count occurrences of each (categorical_var, count_var) combination,
    # then sum the counts per category and sort in descending order
    counts_df = dataframe.value_counts([categorical_var, count_var])
    counts_df = counts_df.groupby(categorical_var).sum()
    counts_df = counts_df.sort_values(ascending=False)
    counts_df = pd.DataFrame(counts_df.reset_index().values,
                             columns=[categorical_var, count_var])
    return counts_df


def count_unique(categorical_var, count_var, dataframe, cummul=False, unique=False):
    counts_df = count(categorical_var, count_var, dataframe)
    new_df = []
    terms = []
    # Walk programs in descending count order; each term is credited only to
    # the first program it is seen in
    for prog in counts_df[categorical_var].unique():
        terms_ = dataframe.loc[dataframe[categorical_var] == prog, count_var].unique()
        unique_terms = [term for term in terms_ if term not in terms]
        terms.extend(unique_terms)
        new_df.append([prog, len(unique_terms)])
    new_df = pd.DataFrame(new_df, columns=[categorical_var, count_var])
    if cummul:
        new_df[count_var] = new_df[count_var].cumsum()
    if unique:
        return counts_df
    else:
        return new_df

Data

program   geneset   p-value  adjusted p-value
program1  genesetA  0.01     0.01
program1  genesetA  0.02     0.02
program2  genesetA  0.03     0.03
program2  genesetA  0.04     0.04
program3  genesetB  0.05     0.05

Can think of three ways to count terms:

1. All enriched terms

Count everything, including terms enriched multiple times in the same program (shouldn't happen right?) and terms enriched in multiple programs.

count(categorical_var=categorical_var, count_var=count_var, dataframe=data)
program geneset
program1 2
program2 2
program3 1

2. Unique within a program, but can be repeated across programs

i.e. if program1 and program2 are both enriched for genesetA, we count it for both programs

unique_data = data.drop_duplicates(subset=[categorical_var, count_var])
count(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
program geneset
program1 1
program2 1
program3 1

3. Unique across all programs

i.e. if program1 and program2 are both enriched for genesetA, but program1 has a much lower adjusted p-value, we only count genesetA for program1

unique_data = data.sort_values(by=sig_var)
unique_data = unique_data.drop_duplicates(subset=count_var)
unique_df = count_unique(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
unique_df = unique_df.sort_values(count_var, ascending=False)
program geneset
program1 1
program3 1

Note: I didn't use count_unique(..., unique=True) here because I think it arbitrarily selects which program to bin a term into when it is duplicated across programs, rather than selecting the program it is most enriched for.

Which to use?

I think it depends. Most of the time option 2 seems right to me, since we could easily have redundancy between programs and we want that captured. But maybe we should make this something a dashboard user can select?

Topic-trait correlation heatmap

This is something that @nargesr implements in Topyfic: https://github.com/mortazavilab/Topyfic/blob/main/Topyfic/analysis.py#L438.

Correct me if I'm wrong, but it looks like it tries to binarize all the categorical traits you pass in and calculate a Spearman correlation between those binary vectors and the cell participation vector for each program. It also tries to remove noise from the bottom of the participation vector, using the minimum value as a threshold.

I like the idea of having some kind of global view of the associations between each program (x-axis) and covariate (y-axis). A heatmap seems like a logical choice, but we should agree on which statistic to show and how to present it. Spearman could make sense for continuous variables, but maybe Kruskal-Wallis makes more sense for categorical variables.

Whatever we end up with, we should probably implement it in Plotly (shouldn't be too hard); a sketch is below.
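
A minimal sketch of such a heatmap in Plotly, assuming a cells × programs participation DataFrame and a per-cell covariate table; Spearman for continuous and Kruskal-Wallis for categorical covariates follow the suggestion above, but the choice of statistic is still open:

# Sketch: program (x-axis) vs covariate (y-axis) association heatmap in Plotly.
# Assumes `participation` is (cells x programs) and `covariates` is (cells x covariates).
# Note the two statistics are not on a shared scale, which is part of the open question.
import pandas as pd
import plotly.express as px
from scipy.stats import spearmanr, kruskal

def association_heatmap(participation, covariates):
    stats = pd.DataFrame(index=covariates.columns, columns=participation.columns, dtype=float)
    for cov in covariates.columns:
        for prog in participation.columns:
            x = participation[prog]
            if pd.api.types.is_numeric_dtype(covariates[cov]):
                stats.loc[cov, prog] = spearmanr(x, covariates[cov]).correlation
            else:
                groups = [x[covariates[cov] == level] for level in covariates[cov].unique()]
                stats.loc[cov, prog] = kruskal(*groups).statistic
    return px.imshow(stats, aspect="auto",
                     labels=dict(x="Program", y="Covariate", color="Statistic"))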

Fix: categorical association dataframe output

  1. The results dataframe should contain a "program_name" column. Currently the program names are stored in the index, which is inconsistent with the output of the other evaluations.
  2. The p-values are being rounded off. Explicitly setting the dtype should fix this (see the sketch below).
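
A minimal sketch of the fix, assuming the current results DataFrame is indexed by program; "program_name" matches point 1 above, while the p-value column name is a placeholder:

# Sketch: move program names from the index into a column and keep p-values as floats.
# Assumes `results` is the categorical-association DataFrame indexed by program;
# the p-value column name ("pval") is a placeholder.
import pandas as pd

def tidy_association_output(results):
    results = results.copy()
    results.index.name = "program_name"               # name the index before resetting it
    results = results.reset_index()
    results["pval"] = results["pval"].astype(float)   # avoid rounded/truncated p-values
    return results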
