engreitzlab / gene_network_evaluation

Evaluation framework for computationally inferred gene networks from single-cell data.
Do we even want this for programs? What does it add?
Ran out of time to add details.
Have the pipeline dump a `config.yaml` in the output directory that I can use for the dashboard. Here is an example:
```yaml
output_loc: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations/cNMF_60_0.2_gene_names.h5mu
categorical_keys: ['sample']
continuous_keys: None
dim_reduce_keys: None
workdir: /cellar/users/aklie/opt/gene_program_evaluation/dashapp/example_data/iPSC_EC_evaluations
data_key: rna
annotations_loc: "annotations.csv"  # if None, defaults to annotations.csv
```
The user can manually specify a location for the annotations to be dumped, as above; otherwise it defaults to the evaluation directory.
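A minimal sketch of how the dashboard side might load this config and resolve the annotations default (the loader function itself is hypothetical; the field names come from the example above):

```python
import os
import yaml

def load_eval_config(path):
    """Load the pipeline-dumped config.yaml and fill in defaults (sketch)."""
    with open(path) as f:
        config = yaml.safe_load(f)
    # If no annotations location is given, default to annotations.csv
    # in the evaluation directory, as described above.
    if not config.get("annotations_loc"):
        config["annotations_loc"] = os.path.join(config["workdir"], "annotations.csv")
    return config
```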
This is important for things like the following: we used to have a column for this in the motif enrichment output file, but it is not there in the latest outputs (presumably because the E2G links weren't there yet).
| ProgramID | EPType | TFMotif | PValue | FDR | Enrichment |
|---|---|---|---|---|---|
| K60_1 | Promoter | AHR | 0.044631 | 0.210088 | 1.594955 |
| K60_10 | Promoter | AHR | 0.351685 | 0.67633 | 1.242518 |
| K60_11 | Promoter | AHR | 0.681555 | 0.885289 | 0.901666 |
| K60_12 | Promoter | AHR | 0.446282 | 0.745748 | 1.204339 |
How do we want to handle this more generally? One way I could see for the dashboard: include a `type` column for this regardless of what enrichment is run on.

It isn't easy to come up with a generic procedure for this calculation in the former scenario, since many methods further process the expression matrix internally without reporting it.
The model output can be arbitrarily worse for non-linear methods and produce negative values for this evaluation.
For k selection (assuming good model fit), we can get a knee plot by plotting the variance explained by each component w.r.t. the total modeled variance. This evaluation will focus on selecting the appropriate k, while we can introduce a different evaluation to assess goodness of fit.
Or we could come up with a generalisable evaluation (e.g. information-based) to compute goodness of fit that can also be used for k selection.
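For a linear factorization, here is a minimal sketch of the per-component calculation behind such a knee plot (function and argument names are hypothetical; it ignores cross-terms between non-orthogonal components, which is usually acceptable for a knee plot):

```python
import numpy as np

def component_variance_shares(cell_scores, gene_loadings):
    """Share of modeled variance per component, assuming a linear model
    X_hat = cell_scores @ gene_loadings (shapes: cells x k and k x genes)."""
    component_ss = np.array([
        np.square(np.outer(cell_scores[:, k], gene_loadings[k, :])).sum()
        for k in range(cell_scores.shape[1])
    ])
    return component_ss / component_ss.sum()

# Knee plot: sort shares in descending order and look for the elbow, e.g.
# shares = component_variance_shares(S, L); plt.plot(np.sort(shares)[::-1])
```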
Should we simplify the repo by branching the current main and then removing all the GRN stuff from it? This includes:
It would be great to have something to point people to. It should probably include either information or references on:
Display all the package versions used for inference and evaluation on a tab of the dashboard. Is this something that snakemake can easily do (or already does)? It would just be a file output to the evaluations directory, like `versions.yaml` or something. Here is an example I made up for the iPSC data:
```yaml
inference_software_versions:
  cnmf: 0.0.q
  numpy: 1.20.2
  pandas: 1.2.4
evaluation_software_versions:
  gene_program_evaluation: 0.0.1
  mudata: 0.0.1
  joblib: 1.0.1
  scipy: 1.6.2
  numpy: 1.20.2
  pandas: 1.2.4
  scikit-learn: 0.24.2
  scikit-posthocs: 0.6.6
  seaborn: 0.11.1
  gseapy: 0.10.1
  pymemesuite: 0.0.1
  google-cloud-bigquery: 2.26.0
```
But we can pretty much do anything as long as it's consistent and makes sense.
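As a sketch, the evaluation side could be a small helper that queries the installed distributions and dumps them (the function, package list, and output key are illustrative; it could be called from a Snakemake rule or an `onsuccess` handler):

```python
import yaml
from importlib.metadata import version, PackageNotFoundError

def dump_versions(packages, out_path="versions.yaml",
                  key="evaluation_software_versions"):
    """Write the installed version of each package to a YAML file (sketch)."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = version(pkg)
        except PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    with open(out_path, "w") as f:
        yaml.safe_dump({key: versions}, f)
```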
We need code for cNMF inference in `examples/inference/iPSC_EC/cNMF` for the jamboree: just a minimal example of what we have now to reproduce a cNMF result on an input dataset.
We need to add an evaluation that tests the robustness of programs across multiple runs (seeds) and also across multiple K-values.
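A minimal sketch of one way such an evaluation could work, matching programs between two runs by cosine similarity of their gene loadings (the function and argument names are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def program_robustness(loadings_a, loadings_b):
    """Mean similarity of optimally matched programs between two runs.

    loadings_a, loadings_b: (n_programs, n_genes) arrays from two seeds
    (or from runs at different K, in which case the smaller set of
    programs is matched into the larger one)."""
    # Cosine similarity between every pair of programs.
    a = loadings_a / np.linalg.norm(loadings_a, axis=1, keepdims=True)
    b = loadings_b / np.linalg.norm(loadings_b, axis=1, keepdims=True)
    sim = a @ b.T
    # Hungarian matching maximizes the total similarity of the pairing.
    row, col = linear_sum_assignment(-sim)
    return sim[row, col].mean()
```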
To generate a PheWAS plot, we need to run https://github.com/EngreitzLab/gene_network_evaluation/blob/main/src/plotting/plot_gwas_enrichment.py#L10. To avoid extra computation in the dashboard, we should just compute this automatically during evaluation and save it to a separate file, `trait_enrichment_processed.txt`.
I think this visualization does a nice job of showing how a given cell distributes its function across programs.
The Topyfic version is pretty complex and might need substantial refactoring to be plotted in the dashboard. A key part of this is efficiency: we don't want it to take 10 seconds every time the user reloads the page.
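One simple option (a sketch): memoize the expensive data preparation so page reloads hit an in-memory cache. The `cell_scores` slot name is hypothetical, and a real dashboard might use dash's own caching utilities instead:

```python
from functools import lru_cache
import mudata

@lru_cache(maxsize=4)
def structure_plot_frame(h5mu_path, data_key):
    """Compute per-cell program proportions once per (file, key) pair;
    subsequent page reloads reuse the cached result (sketch)."""
    mdata = mudata.read_h5mu(h5mu_path)
    scores = mdata[data_key].obsm["cell_scores"]  # hypothetical slot name
    return scores / scores.sum(axis=1, keepdims=True)  # assumes a dense array
```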
Most tests (e.g. covariate association, perturbation association) give highly inflated significance (very low p-values) under typical thresholds (e.g. 0.05). We need to implement methods to calibrate the output of these tests against appropriate null p-value distributions.
Separately, reporting log2FC would also aid interpretation compared to raw test statistics; in the dashboard we should prioritize reporting log2FC.
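As a sketch, one generic way to calibrate is an empirical permutation null: shuffle the labels to break the association, recompute the statistic, and take the empirical tail probability (the statistic function and all names here are hypothetical):

```python
import numpy as np

def permutation_pvalue(scores, labels, stat_fn, n_perm=1000, seed=0):
    """Empirical p-value of stat_fn(scores, labels) against a shuffled null."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(scores, labels)
    null = np.array([
        stat_fn(scores, rng.permutation(labels)) for _ in range(n_perm)
    ])
    # The +1 correction avoids returning exactly p = 0.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```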
We need code for Topyfic inference in `examples/inference/iPSC_EC/Topyfic`: just a minimal example of what we have now to reproduce a Topyfic result on an input dataset.
Current scripted implementation: https://github.com/EngreitzLab/gene_network_evaluation/blob/jamboree-gene-programs-2024/app/tests/motif_enrichment.ipynb
It was failing due to a documented tangermeme issue (jmschrei/tangermeme#18), which could be fixed by the v0.0.3 release today.
In many scenarios, it is preferable to test for significant perturbations per cell type or stage of differentiation.
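A minimal sketch of what such stratified testing could look like, using a Mann-Whitney U test within each cell type (the function and argument names are hypothetical):

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def per_celltype_perturbation_test(scores, perturbed, cell_type):
    """Test perturbed vs. control program scores within each cell type.

    scores: per-cell program scores; perturbed: boolean mask;
    cell_type: cell-type labels. All are index-aligned pandas Series."""
    results = []
    for ct, idx in cell_type.groupby(cell_type).groups.items():
        p_scores = scores.loc[idx][perturbed.loc[idx]]
        c_scores = scores.loc[idx][~perturbed.loc[idx]]
        if len(p_scores) and len(c_scores):
            _, p = mannwhitneyu(p_scores, c_scores)
            results.append((ct, p))
    return pd.DataFrame(results, columns=["cell_type", "pvalue"])
```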
I've been using these functions that @aron0093 originally implemented:
```python
import pandas as pd

def count(categorical_var, count_var, dataframe):
    # Count the number of rows per category, sorted in descending order.
    counts_df = dataframe.value_counts([categorical_var, count_var])
    counts_df = counts_df.groupby(categorical_var).sum()
    counts_df = counts_df.sort_values(ascending=False)
    counts_df = pd.DataFrame(counts_df.reset_index().values,
                             columns=[categorical_var, count_var])
    return counts_df

def count_unique(categorical_var, count_var, dataframe, cummul=False, unique=False):
    # Count terms per category, assigning each term only to the first
    # category it appears in (categories are visited in descending order
    # of their total counts).
    counts_df = count(categorical_var, count_var, dataframe)
    new_df = []
    terms = []
    for prog in counts_df[categorical_var].unique():
        terms_ = dataframe.loc[dataframe[categorical_var] == prog, count_var].unique()
        unique_terms = [term for term in terms_ if term not in terms]
        terms.extend(unique_terms)
        new_df.append([prog, len(unique_terms)])
    new_df = pd.DataFrame(new_df, columns=[categorical_var, count_var])
    if cummul:
        new_df[count_var] = new_df[count_var].cumsum()
    if unique:
        return counts_df
    else:
        return new_df
```
| program | geneset | p-value | adjusted p-value |
|---|---|---|---|
| program1 | genesetA | 0.01 | 0.01 |
| program1 | genesetA | 0.02 | 0.02 |
| program2 | genesetA | 0.03 | 0.03 |
| program2 | genesetA | 0.04 | 0.04 |
| program3 | genesetB | 0.05 | 0.05 |
I can think of three ways to count terms:

1. Count everything, including terms enriched multiple times in the same program (shouldn't happen, right?) and terms enriched in multiple programs:

```python
count(categorical_var=categorical_var, count_var=count_var, dataframe=data)
```
| program | geneset |
|---|---|
| program1 | 2 |
| program2 | 2 |
| program3 | 1 |
2. Count each term at most once per program, but in every program it is enriched in; i.e. if `program1` and `program2` are both enriched for `genesetA`, we count it for both programs:

```python
unique_data = data.drop_duplicates(subset=[categorical_var, count_var])
count(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
```
| program | geneset |
|---|---|
| program1 | 1 |
| program2 | 1 |
| program3 | 1 |
3. Count each term only for the program in which it is most significantly enriched; i.e. if `program1` and `program2` are both enriched for `genesetA`, but `program1` has a much lower adjusted p-value, we only count `genesetA` for `program1`:

```python
unique_data = data.sort_values(by=sig_var)
unique_data = unique_data.drop_duplicates(subset=count_var)
unique_df = count_unique(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
unique_df = unique_df.sort_values(count_var, ascending=False)
```
| program | geneset |
|---|---|
| program1 | 1 |
| program3 | 1 |
**Note:** I didn't use `count_unique(..., unique=True)` here because I think it arbitrarily selects which program to bin a term into when it is duplicated across programs, rather than selecting the one it is most enriched for.
I think it depends. Most of the time, option 2 seems right to me, since we could easily have redundancy between programs and we want that captured. But maybe we should make this something a dashboard user can select?
This is something that @nargesr implements in Topyfic: https://github.com/mortazavilab/Topyfic/blob/main/Topyfic/analysis.py#L438.
Correct me if I'm wrong, but it looks like it binarizes all the categorical traits you pass in and calculates a Spearman correlation between those binary vectors and each program's cell participation vector. It also tries to remove noise from the bottom of the participation vector, using the minimum value as a threshold.
I like the idea of having some kind of global view of the associations between each program (x-axis) and covariate (y-axis). A heatmap seems a logical choice, but we should agree on which statistic to show and how to present it. Spearman correlation could make sense for continuous variables, but maybe Kruskal-Wallis makes more sense for categorical ones.
Whatever we end up with, we should probably implement it in plotly (shouldn't be too hard); see the sketch below.
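A minimal sketch of that idea (the function and argument names are hypothetical; the statistic choice per dtype follows the suggestion above):

```python
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import spearmanr, kruskal

def covariate_program_heatmap(cell_scores, covariates):
    """-log10(p) heatmap of program-covariate associations (sketch).

    cell_scores: DataFrame (cells x programs) of program participation.
    covariates: DataFrame (cells x covariates), mixed dtypes, same index."""
    pvals = pd.DataFrame(index=covariates.columns,
                         columns=cell_scores.columns, dtype=float)
    for cov in covariates.columns:
        for prog in cell_scores.columns:
            if pd.api.types.is_numeric_dtype(covariates[cov]):
                # Continuous covariate: Spearman correlation.
                _, p = spearmanr(covariates[cov], cell_scores[prog])
            else:
                # Categorical covariate: Kruskal-Wallis across groups.
                groups = [g[prog].values
                          for _, g in cell_scores.groupby(covariates[cov])]
                _, p = kruskal(*groups)
            pvals.loc[cov, prog] = p
    return px.imshow(-np.log10(pvals), labels=dict(color="-log10(p)"))
```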
Currently the motif enrichment considers all motif matches with an FDR < 0.05. While this threshold can be adjusted in the internals, it is not a parameter in the user-facing enrichment function.
Rerun the evaluations using the updated "pipeline" on the TeloHAEC dataset.