
move's Introduction

MOVE (Multi-Omics Variational autoEncoder)


The code in this repository can be used to run our Multi-Omics Variational autoEncoder (MOVE) framework for integration of omics and clinical variables spanning both categorical and continuous data. Our approach includes training ensemble VAE models and using in silico perturbation experiments to identify cross-omics associations. The manuscript has been published in Nature Biotechnology:

Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. et al. Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01520-x

We developed the method based on a Type 2 Diabetes cohort from the IMI DIRECT project containing 789 newly diagnosed T2D patients. The cohort and data creation are described in Koivula et al. and Wesolowska-Andersen et al. For the analysis we included the following data:

Multi-omics data sets:

Genomics
Transcriptomics
Proteomics
Metabolomics
Metagenomics

Other data sets:

Clinical data (blood measurements, imaging data, ...)
Questionnaire data (diet, etc.)
Accelerometer data
Medication data

Installation

Installing MOVE package

MOVE is written in Python and can therefore be installed using pip:

pip install move-dl

Requirements

MOVE should run in any environment where Python is available. The variational autoencoder architecture is implemented in PyTorch.

The VAEs can be trained using CPUs only or with GPU acceleration. If you do not have powerful GPUs available, running on CPUs alone is feasible: for instance, the tutorial data set, consisting of simulated drug, metabolomics, and proteomics data for 500 individuals, runs fine on a standard MacBook.

Note: the pip installation of move-dl does not set up your local GPU automatically.
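If you are unsure whether PyTorch can see your GPU, a minimal check (generic PyTorch, not MOVE-specific) is:

import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will run on CPU.")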

The MOVE pipeline

MOVE has five main steps, plus an optional sixth:

01. Encode the data into a format that can be read by MOVE
02. Find the right architecture of the network, focusing on reconstruction accuracy
03. Find the right architecture of the network, focusing on stability of the model
04. Use the model determined in steps 02-03 to create and analyze the latent space
05. Identify associations between categorical and continuous datasets:
05a. using an ensemble of VAEs with the t-test approach
05b. using an ensemble of VAEs with the Bayesian decision theory approach
06. If both 05a and 05b were run, select the overlap between them

How to run MOVE

Please refer to our documentation for examples and tutorials on how to run MOVE.

Additionally, you can copy this notebook and follow its instructions to get familiar with our pipeline.
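For orientation, below is a hedged sketch of invoking the pipeline from the command line on the tutorial data set. It uses only task names that appear in the tutorial and the issues further down this page; task names for the remaining steps follow the same pattern, so check the documentation for your version:

cd tutorial                                                  # parent directory of the config folder
move-dl data=random_small task=encode_data                   # step 01: encode the data
move-dl data=random_small task=identify_associations_ttest   # step 05a: t-test associations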

Data sets

DIRECT data set

The data used in the notebooks are not available for testing due to the informed consent given by study participants, the various national ethical approvals for the study, and the European General Data Protection Regulation (GDPR). Therefore, individual-level clinical and omics data cannot be transferred from the centralized IMI-DIRECT repository. Requests for access to summary statistics of IMI-DIRECT data, including those presented here, can be made to [email protected]. Requesters will be informed on how summary-level data can be accessed via the DIRECT secure analysis platform following submission of an appropriate application. The IMI-DIRECT data access policy is available here.

Simulated and publicly available data sets

We have therefore provided two data sets to test the workflow: a simulated data set and a publicly available maize rhizosphere microbiome data set.

Citation

To cite MOVE, use the following information:

Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. et al. Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01520-x

move's People

Contributors

enryh, jakobnissen, ri-heme, rosaallesoe, simonrasmu, valentas1


move's Issues

02_optimize_reconstruction write out results

In 02_optimize_reconstruction, make it write out a table listing the best models. Currently these are only saved as numpy .npy files. As the plotting often fails, this table could be where one can see which models were actually the best.

Clean-up config files

Moving to the task-based format introduced in #37, the configuration files need to be cleaned up:

  • Remove unused fields in move/conf/data/base_data (currently marked with # DEPRECATE).
  • Remove unused config groups: tuning_reconstruction, tuning_stability, training_latent, training_association *.
  • Remove any duplicate fields.
  • Remove any other unused fields (e.g., name in main).

* These will be re-implemented as configs of the task config group. (See #38 and #40)

Error during __tune_reconstruction: score in calculate_accuracy (metrics.py) cannot be calculated

I'm currently training MOVE on proteomics data in combination with lots of categorical data (with a few missing values). My input data is structured as instructed (1 Feature/File, missing values = NA).

When MOVE tries to calculate the score during reconstruction tuning, it struggles with the missing values, since num_features has the original length (including masked entries) while y_true and y_pred have length n - n_masked. Excluding all categorical features containing missing values results in a successful run. What is the correct way to fix this error? (See analysis\metrics.py; a hypothetical sketch follows the traceback below.)

The error thrown is below:

Error executing job with overrides: ['task.batch_size=10', 'task.model.num_hidden=[500]', 'task.training_loop.num_epochs=40', 'experiment=mpn__tune_reconstruction']
Traceback (most recent call last):
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\__main__.py", line 38, in main
    move.tasks.tune_model(config)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\tasks\tune_model.py", line 249, in tune_model
    _tune_reconstruction(task_config)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\tasks\tune_model.py", line 216, in _tune_reconstruction
    accuracy = calculate_accuracy(cat[mask], cat_recon)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\analysis\metrics.py", line 36, in calculate_accuracy
    scores = np.ma.compressed(np.sum(y_true == y_pred, axis=1)) / num_features
ValueError: operands could not be broadcast together with shapes (118,) (131,)
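A hypothetical sketch of one way to make the division mask-aware (an assumption, not the project's confirmed fix): divide each sample's count of correct predictions by the number of unmasked features in that same sample, so the two arrays have matching shapes by construction.

import numpy as np

def masked_accuracy(y_true: np.ma.MaskedArray, y_pred: np.ndarray) -> np.ndarray:
    correct = np.ma.sum(y_true == y_pred, axis=1)  # correct predictions per sample
    counts = np.ma.count(y_true, axis=1)           # unmasked features per sample
    return correct / counts                        # masked where a sample has no observed features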

Two functions in 05_identify_associations do not work

Functions/calls

make_files(collected_overlap, groups, con_list_concat, processed_data_path, recon_average_corr_all_indi_new, con_names, continuous_names, drug_h, drug, all_hits, types, version)

and

df_indi_var = get_inter_drug_variation(con_names, drug_h, recon_average_corr_all_indi_new, groups, collected_overlap, drug, con_list_concat, processed_data_path, types)

currently do not work.

05_identify_associations: write results in tsv table format

Currently, 05_identify_associations does not write out significant hits in a table format, but rather as individual files. Change to a single table format that could include the following columns (see the sketch below):

Drug feature, omics dataset, omics feature, cor p-value, estimated change, confidence interval change
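A hedged sketch of writing such a table with pandas; the column names follow the list above, and collecting the hits as a list of tuples is an assumption for illustration:

import pandas as pd

columns = ["drug_feature", "omics_dataset", "omics_feature",
           "p_value", "estimated_change", "ci_change"]
hits = []  # e.g., hits.append(("drug_1", "proteomics", "protein_42", 1e-4, 0.3, 0.1))
pd.DataFrame(hits, columns=columns).to_csv("significant_hits.tsv", sep="\t", index=False)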

Organize code

I suggest the following way to organize the files and folders of the project. As discussed earlier with @valentas1, we can move the different code snippets into their own modules (e.g., move.analysis and move.models).

.
└── move/
    ├── .github/workflows   <= CI workflows to lint and test code
    ├── conf/               <= Configuration files
    │   ├── data/           <= Default data configuration (e.g., batch size)
    │   ├── model/          <= Default model configuration (e.g., layers)
    │   └── training/       <= Default model training (e.g., epochs, steps)
    │
    ├── notebooks/          <= Jupyter notebooks (step-by-step tutorials)
    ├── src/                <= Source code
    │   └── move/
    │       ├── analysis/   <= Scripts for post-analysis (e.g. feature
    │       │                  importance)
    │       ├── data/       <= Scripts to encode data, create datasets, and 
    │       │                  data loaders
    │       ├── models/     <= Architectures and custom layers
    │       └── viz/        <= Scripts to create visualization
    │
    ├── tests/              <= Unit tests
    ├── LICENSE             <= License
    ├── README.md           <= README.md
    ├── requirements.txt    <= Requirements file for reproducing the analysis
    └── setup.py            <= Setup script

QoL changes

The following is a list of low-priority changes to make to MOVE:

  • #39
  • Rename processed_data_path in config to results_path
  • Add type hints to methods in vae.py module

Add more user info on step 05

Add more information when testing associations in step 05, e.g. "Testing: <feature X>", where in our case the features would be drugs

NA class is created even without NA in categorical data

César and I are applying MOVE to a new dataset.
For the categorical data, we are having issues with the function encode_cat(). We realized the following:

Not an issue:

  1. The class NA (0,...,0) is added even though there are no NAs.

Issues:
  2. The function np.isnan(), which is used to check whether the array of unique classes contains a NaN class, does not work if the class labels are passed as strings. (TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
  3. If one class is not represented in the data of a new user (e.g., no patient of class 3 in a class 1-9 setting), the one-hot encoding is rearranged. This leads to the creation of keys 1-8 plus the NaN key with the vector (0,...,0) assigned. This causes problems at the end of the function when calling, e.g., encodings[lab] for a patient belonging to class 9 (lab=9): the key 9 no longer exists in the encodings dict.

Thanks!
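A hypothetical sketch of an encoding that avoids both problems (the helper one_hot_fixed is illustrative, not MOVE's encode_cat): fixing the full category set up front keeps the one-hot layout stable even when a class is absent from the data, and string labels never reach np.isnan.

import numpy as np
import pandas as pd

def one_hot_fixed(values, categories):
    codes = pd.Categorical(values, categories=categories).codes  # -1 marks NA/unknown
    out = np.zeros((len(values), len(categories)), dtype=np.float32)
    valid = codes >= 0
    out[np.flatnonzero(valid), codes[valid]] = 1.0               # NA rows stay all-zero
    return out

# e.g., one_hot_fixed(["1", "9", None], categories=[str(i) for i in range(1, 10)])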

Re-structure module 4 (analyze latent)

To match the style of modules 1 and 5 (encode data and identify associations, respectively), this module needs to be refactored.

Task List

  • Define config schema (add visualization targets as fields)
  • Create new module and function
  • Create example config YAML
  • Add to main entry point

Set-up CI workflow

Set up a CI workflow using GitHub Actions to automatically lint code and do unit testing.

Use Hydra multirun mode to do optimization tasks

The hyperparameter optimization can be re-adapted to make use of Hydra's multirun mode (and/or sweeper).

Overview of tasks:

Below is a list of tasks to convert these two modules from their current state:

Module and schema

  • Define new schemas for this task: OptimizeTaskConfig (inherits from TaskConfig) class in move.conf.schema.
  • Create new move.tasks.optimize_hyperparameters module and function.
  • Create example experiment config for hyperparameter search. See example YAML in hydra-app-example.

Function

For the function optimize_hyperparameters:

  • Log job number
  • Create objective function TSV (if it doesn't exist)
  • Load pre-processed data
  • Split data into training/test sets
  • Make dataloaders
  • Train model
  • Record values of objective function (append to TSV)

Misc.

  • Repeat for the second optimization, taking a similar approach to the "identify associations" task (see move.tasks.identify_associations), i.e., detect the type of task and change the value of the objective function (from accuracy to stability).
  • Add to move.__main__.
  • Re-format tutorial files for the random_small dataset.

Open Questions

  • Is the best set of hyperparameters automatically selected based on the objective function value (e.g., reconstruction accuracy)?
    • If so, I would suggest we also implement the Optuna plugin with a smarter sampler than greedy grid search.
    • If not, then I suggest just saving the results, and then providing some visualization functions so the users can decide on their hyperparameter set.
  • #22


04_analyze_latent fails due to hardcoded values

Names are hardcoded (Clinical, Genomics, etc.) and should instead be taken from the data input (see the sketch after the traceback below)

python -m move.04_analyze_latent

Error executing job with overrides: []
Traceback (most recent call last):
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/move/04_analyze_latent/__main__.py", line 74, in main
    plot_reconstruction_distribs(processed_data_path, cat_total_recon, all_values)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/move/utils/visualization_utils.py", line 285, in plot_reconstruction_distribs
    df = pd.DataFrame(cat_total_recon + all_values, index = ['Clinical\n(categorical)', 'Genomics', 'Drug data', 'Clinical\n(continuous)', 'Diet +\n wearables','Proteomics','Targeted\nmetabolomics','Untargeted\nmetabolomics', 'Transcriptomics', 'Metagenomics'])
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/frame.py", line 729, in __init__
    mgr = arrays_to_mgr(
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 125, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 628, in _homogenize
    com.require_length_match(val, index)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/common.py", line 557, in require_length_match
    raise ValueError(
ValueError: Length of values (3) does not match length of index (10)
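A hedged fix sketch: build the row labels from the configured dataset names so their length always matches the supplied data. The field names are assumptions taken from the Hydra config dump shown further down this page (data.categorical_names, data.continuous_names):

import pandas as pd

# config is the loaded Hydra configuration; cat_total_recon and all_values
# are the per-dataset values as in the traceback above
labels = list(config.data.categorical_names) + list(config.data.continuous_names)
df = pd.DataFrame(cat_total_recon + all_values, index=labels)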

Remove number of samples from categorical shapes

MOVEDataset stores categorical_shapes and continuous_shapes. The former is a list of tuples containing three numbers: number of samples, number of features, and number of categories per feature. The first item is not needed, and it is inconsistent with continuous_shapes (which stores only the number of features, not the number of samples).

Several functions will be affected by this change.
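A minimal sketch of the proposed change (variable names are illustrative):

# drop the leading sample count so categorical_shapes matches the
# convention of continuous_shapes
categorical_shapes = [(n_features, n_categories)
                      for (_n_samples, n_features, n_categories) in categorical_shapes]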

Simplify plotting functions

Is it possible to simplify the plotting functions so they work for all cases? I.e., one approach could be to make the plots as individual plots rather than composite figures

Best latent representation

We should discuss/analyse how the best latent representation is chosen at the end of step 03 (stability)

Expected MOVE tutorial runtime?

Hi, congrats on this great tool.

I am currently following the tutorial and trying to familiarize myself with MOVE.
How long should I expect the tutorial runtime to be using the random_small dataset?

System specifications:
RAM: 16 GB
Processor: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz, 2611 Mhz, 4 Core(s), 8 Logical Processor(s)

Many thanks,
Foteini

About the perturbation

Hello, I saw in your article that samples that have already received a drug are excluded when perturbing that drug. What I want to ask is: if I use the MOVE tool to perturb, do I need to remove the samples that received a certain drug myself, or will the current code automatically remove them for me?
Thanks.

Use of log-file

We should write most information to a log file instead of the screen.
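A minimal sketch using Python's standard logging module; the file name, location, and format are assumptions, not the final setup:

import logging

logging.basicConfig(
    filename="logs/move.log",  # assumed location; the logs/ directory must exist
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("Training started")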

NNFC Workshop

  • First give an overview of the steps that the student will do

    • Ie which tasks and what they do
  • Are they using CPU or GPUs? Can they use GPUs?

  • Mention that we only use a subset of the original data

  • Explicitly write (in the beginning) what is it that we want to achieve. (Learning objectives)

  • You sometimes use hmp2 and other times ibd for data and configs. This is confusing

  • 2. Inspecting the data

    • 1. How can I inspect the data? Please give directions
    • 2. In the data folder there are both ibd and hmp2 files (MGX, MBX and MTX). Explain whether they are for two different experiments (if they are)?
    • 3. Indicate when you make people run commands what it will output (e.g. 2 figures below etc)
    • 4. What are the differences of the two Mutual Information plots? And you only show for hmp2.mbx data whereas the PCC plots are ibd.mbx?
  • 3. Encoding the data

    • 1. Be more informative on how the data is encoded. For instance, most people will not know what a "binary bit flag" is. Instead write for instance, 0 and 1.
    • 2. Before/after normalization. Give more explanation of what you see and what the effect of this is. Why does it help to do the normalization? Also, it is not possible to see the legend on the z-axis
    • 3. Explain what is meant by the shape of the datasets and what the output means, e.g. "(283, 1, 2)"
  • 4. Hyperparameter optimization

    • 1. Specify which hyperparameters you are testing and why
    • 2. Indicate where this can be changed
    • 3. "The output of the previous command is a TSV table called ..." Can you indicate how to inspect it?
    • 4. The output of the hyperparameter training - why and what does the plots mean? Ie. What is reconstruction accuracy and what do we expect?
    • 5. Summarize this a bit. Which hyperparameters do we end up using?
  • 5. Latent space ...

    • 1. Which architecture is used for training? Is it the architecture from point 4? Also what are the actual numbers?
    • 2. Indicate expected runtime when expected to be longer than just very quick, e.g. for model training
    • 3. A bit more explanation of the plots. Can you print out in the output a brief title/explanation of the plots?
    • 4. Reconstruction plot comes after latent scatter plots in the output, but are described before in the text above
    • 5. Much more explanation of the individual plots, you must guide the user
    • 6. Also what does SHAP mean (intuitively, not in mathematical terms)
  • 6. Identifying associations between features

    • 1. Mention that you only use the Bayes form? (if that is the only one you run)
    • 2. Print associations in a cell as a table
    • 3. Which perturbation are you doing? Please indicate this before
  • 8. "Possible Issues" is numbered 8, but comes after point 9

Shape of original_input and reconstruction do not match

Running MOVE with two continuous datasets works, but adding a third results in the error below (created with the maize dataset). Adding the values of the third file to the second file runs without error.

Error executing job with overrides: ['task.batch_size=10', 'task.model.num_hidden=[500]', 'task.training_loop.num_epochs=40', 'experiment=maize__tune_reconstruction']
Traceback (most recent call last):
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\__main__.py", line 38, in main
    move.tasks.tune_model(config)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\tasks\tune_model.py", line 249, in tune_model
    _tune_reconstruction(task_config)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\tasks\tune_model.py", line 230, in _tune_reconstruction
    cosine_sim = calculate_cosine_similarity(con[mask], con_recon)
  File "C:\Users\t159g\.conda\envs\moveEnv\lib\site-packages\move\analysis\metrics.py", line 55, in calculate_cosine_similarity
    raise ValueError(
ValueError: Original input (4251, 716) and reconstruction (4251, 713) shapes do not match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I reproduced the error with the Maize dataset by adding a third dataset constructed with R:

### Testing maize (does not work with maize_rnorm.tsv file)
maize_ids <- read.table('MOVE_tutorial/maize/data/maize_ids.txt')
maize_ids$V2 <- rnorm(nrow(maize_ids),10, 2)
maize_ids$V3 <- rpois(nrow(maize_ids),100)

write.table(maize_ids, 'MOVE_tutorial/maize/data/maize_rnorm.tsv', row.names = F, quote = F, sep = '\t')

#Adding similar values to existing file works
maize_microbiome <- read.table('MOVE_tutorial/maize/data/maize_metadata.tsv')
maize_microbiome$V2 <- rnorm(nrow(maize_microbiome),10, 2)
maize_microbiome$V3 <- rpois(nrow(maize_microbiome),100)

write.table(maize_microbiome , 'MOVE_tutorial/maize/data/maize_metadata2.tsv', quote = F, sep = '\t')

Reduce memory of step 05

Reduce the memory use of step 05 by changing from dicts to another data structure. The problem seems to be in the function train_model_association when it saves the reconstruction results:

# Works:
with open(path + "results/results_" + version + ".npy", 'wb') as f:
    np.save(f, results)

# File is truncated:
with open(path + "results/results_recon_" + version + ".npy", 'wb') as f:
    np.save(f, recon_results)

with open(path + "results/results_groups_" + version + ".npy", 'wb') as f:
    np.save(f, groups)
with open(path + "results/results_recon_mean_baseline_" + version + ".npy", 'wb') as f:
    np.save(f, mean_bas)
with open(path + "results/results_recon_no_corr_" + version + ".npy", 'wb') as f:
    np.save(f, recon_results_1)

I.e., it fails when it tries to save recon_results as results/results_recon_v1.npy. When I try to load it, I get an EOFError:

recon_results = np.load(processed_data_path + "results/results_recon_" + version + ".npy", allow_pickle=True).item()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kjv627/miniforge3/envs/move/lib/python3.9/site-packages/numpy/lib/npyio.py", line 430, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/Users/kjv627/miniforge3/envs/move/lib/python3.9/site-packages/numpy/lib/format.py", line 747, in read_array
    array = pickle.load(fp, **pickle_kwargs)
EOFError: Ran out of input

So it seems it wasn't written correctly.
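A hedged workaround sketch (an assumption, not a confirmed fix): pickle the dictionary directly with protocol 4, which supports objects larger than 4 GiB, instead of routing a Python dict through np.save.

import pickle

# save (path, version, and recon_results as in the snippet above)
with open(path + "results/results_recon_" + version + ".pkl", "wb") as f:
    pickle.dump(recon_results, f, protocol=4)

# load
with open(path + "results/results_recon_" + version + ".pkl", "rb") as f:
    recon_results = pickle.load(f)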

tuning_stability.yaml

When step 02 is complete, add repeats: 5 to tuning_stability.yaml; this could potentially be decreased to 3 or 4 to save computational time.

Could the learning rate be reduced to only 0.0001, thus removing 1e-5?

Setting groups

Currently there is no way of loading in groups. These are used to define the grouping of drugs for the association analysis. Probably not necessary for other datasets, but needed for reproduction of the DIRECT results.

Update tutorial text

Changes to tutorial:

  • Indicate that one should do hyperparameter optimization before running analyze latent?
  • How to download tutorial data if not cloning via git (perhaps from link/wget on Zenodo or similar)
  • Indicate where and which output files one should look at. E.g. from Tuning the hyperparameters indicate where the output will be
  • In a how-to-run-on-my-data section: indicate what should be done to run on one's own data; as much help as possible will make it easier for beginners to run it
  • Indicate how long it took you to run the tutorial using CPUs, so users know they don't need a GPU to run
  • Add that log-file with more details is written as logs/identify_associations.log (for instance)
  • Indicate that identify_associations_ttest will run 4 models each with different latent size. Each model is run 10 times for a total of 40 models
  • Is it possible to add a small helper script to check the overlap of significant features and the truth data (probably done in notebook)
  • Note that running bayes association overwrites the output of ttest association
  • Front readme: Indicate how to install from git clone

plot_reconstruction

The plot_reconstruction function in visualization_utils.py has the headers hard-coded, so it won't work if other data or only a subset of data are used:

  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/move/04_analyze_latent/__main__.py", line 69, in main
    plot_reconstruction_distribs(processed_data_path, cat_total_recon, all_values)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/move/utils/visualization_utils.py", line 285, in plot_reconstruction_distribs
    df = pd.DataFrame(cat_total_recon + all_values, index = ['Clinical\n(categorical)', 'Genomics', 'Drug data', 'Clinical\n(continuous)', 'Diet +\n wearables','Proteomics','Targeted\nmetabolomics','Untargeted\nmetabolomics', 'Transcriptomics', 'Metagenomics'])
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/frame.py", line 729, in __init__
    mgr = arrays_to_mgr(
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 125, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 628, in _homogenize
    com.require_length_match(val, index)
  File "/Users/kjv627/miniforge3/envs/move_dev/lib/python3.9/site-packages/pandas/core/common.py", line 557, in require_length_match
    raise ValueError(
ValueError: Length of values (3) does not match length of index (10)

"move-dl" cannot be found after installing MOVE successfully

Dear Sir,
I encountered an issue at a certain step below, where the terminal responded with an error message stating that it could not find "move-dl". I would be immensely grateful if you could provide some guidance or solutions to rectify this issue.
From here:
on the parent directory of the config folder (in this example, it is the tutorial folder), and proceed to run:

cd tutorial
move-dl data=random_small task=encode_data — Cannot find move-dl

Your help would significantly contribute to my understanding and application of your work. I appreciate your time in advance.

Set up configuration system

I suggest using Hydra to set up a configuration file system, so users can easily modify and keep track of hyperparameters and other settings with a file (instead of manually typing them on the command line or a notebook).
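A minimal Hydra sketch of the suggested pattern (file names and fields are hypothetical): settings live in a YAML file, and any field can be overridden on the command line, e.g. python train.py model.lrate=0.001.

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="main", version_base=None)
def main(config: DictConfig) -> None:
    # values come from conf/main.yaml or from command-line overrides
    print(config.model.lrate)

if __name__ == "__main__":
    main()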


defining entrypoints

I tried out of curiosity to define an entrypoint in setup.cfg, and I ran into an error regarding the uncommon module naming convention:

[options.entry_points]
console_scripts =
    move-encode-data = move.01_encode_data.__main__:main
move-encode-data --help
Traceback (most recent call last):
  File "C:\Users\enryh\anaconda3\envs\move\lib\runpy.py", line 189, in _run_module_as_main
    mod_name, mod_spec, code = _get_main_module_details(_Error)
  File "C:\Users\enryh\anaconda3\envs\move\lib\runpy.py", line 223, in _get_main_module_details
    return _get_module_details(main_name)
  File "C:\Users\enryh\anaconda3\envs\move\lib\runpy.py", line 129, in _get_module_details
    spec = importlib.util.find_spec(mod_name)
  File "C:\Users\enryh\anaconda3\envs\move\lib\importlib\util.py", line 103, in find_spec
    return _find_spec(fullname, parent_path)
  File "<frozen importlib._bootstrap>", line 945, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1439, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1411, in _get_spec
  File "<frozen zipimport>", line 170, in find_spec
  File "<frozen importlib._bootstrap>", line 431, in spec_from_loader
  File "<frozen importlib._bootstrap_external>", line 741, in spec_from_file_location
  File "<frozen zipimport>", line 229, in get_filename
  File "<frozen zipimport>", line 760, in _get_module_code
  File "<frozen zipimport>", line 689, in _compile_source
  File "C:\Users\enryh\anaconda3\envs\move\Scripts\move-encode-data.exe\__main__.py", line 4
    from move.01_encode_data import main
                ^
SyntaxError: invalid decimal literal

Changing the module name from 01_encode_data to encode_data (and changing setup.cfg accordingly) solves the problem:
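The corrected declaration would then read:

[options.entry_points]
console_scripts =
    move-encode-data = move.encode_data.__main__:main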

move-encode-data --help
C:\Users\enrhy\Documents\repos\MOVE\src\move\encode_data\__main__.py:6: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="../conf", config_name="main")
__main__ is powered by Hydra.

== Configuration groups ==
Compose your configuration from those groups (group=option)

data: main
model: vae
training: main
training_association: main
training_latent: main
tuning_reconstruction: main
tuning_stability: main


== Config ==
Override anything in the config (foo.bar=value)

name: MOVE
seed: 123456
data:
  user_config: data.yaml
  na_value: NA
  raw_data_path: data/
  interim_data_path: interim_data/
  processed_data_path: processed_data/
  headers_path: headers/
  version: v1
  ids_file_name: baseline_ids.txt
  ids_has_header: true
  ids_colname: 0
  categorical_inputs:
  - name: diabetes_genotypes
    weight: 1
  - name: baseline_drugs
    weight: 1
  - name: baseline_categorical
    weight: 1
  continuous_inputs:
  - name: baseline_continuous
    weight: 2
  - name: baseline_transcriptomics
    weight: 1
  - name: baseline_diet_wearables
    weight: 1
  - name: baseline_proteomic_antibodies
    weight: 1
  - name: baseline_target_metabolomics
    weight: 1
  - name: baseline_untarget_metabolomics
    weight: 1
  - name: baseline_metagenomics
    weight: 1
  data_of_interest: baseline_drugs
  categorical_names: ${names:${data.categorical_inputs}}
  continuous_names: ${names:${data.continuous_inputs}}
  categorical_weights: ${weights:${data.categorical_inputs}}
  continuous_weights: ${weights:${data.continuous_inputs}}
  data_features_to_visualize_notebook4:
  - drug_1
  - clinical_continuous_2
  - clinical_continuous_3
  write_omics_results_notebook5:
  - baseline_target_metabolomics
  - baseline_untarget_metabolomics
model:
  _target_: move.models.vae.VAE
  user_config: model.yaml
  seed: 1
  cuda: false
  lrate: 0.0001
  num_epochs: 500
  patience: 100
  kld_steps:
  - 20
  - 30
  - 40
  - 90
  batch_steps:
  - 50
  - 100
  - 150
  - 200
  - 250
  - 300
  - 350
  - 400
  - 450
tuning_reconstruction:
  user_config: tuning_reconstruction.yaml
  num_hidden:
  - 500
  - 1000
  num_latent:
  - 20
  - 50
  num_layers:
  - 1
  - 2
  dropout:
  - 0.1
  - 0.2
  beta:
  - 1.0e-05
  - 0.0001
  batch_sizes:
  - 10
  repeats: 1
  max_param_combos_to_save: 12
tuning_stability:
  user_config: tuning_stability.yaml
  num_hidden:
  - 500
  - 1000
  num_latent:
  - 20
  - 50
  num_layers:
  - 1
  dropout:
  - 0.1
  - 0.2
  beta:
  - 1.0e-05
  batch_sizes:
  - 10
  repeats: 5
  tuned_num_epochs: 250
training_latent:
  user_config: training_latent.yaml
  num_hidden: 500
  num_latent: 20
  num_layers: 1
  dropout: 0.1
  beta: 1.0e-05
  batch_sizes: 10
  tuned_num_epochs: 250
training_association:
  user_config: training_association.yaml
  num_hidden: 500
  num_latent:
  - 150
  - 200
  - 250
  - 300
  num_layers: 1
  dropout: 0.1
  beta: 1.0e-05
  batch_sizes: 10
  repeats: 10
  tuned_num_epochs: 250


Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help
