Data Preparation Script for the IMI Project MELLODDY

Data preparation scripts, including locality sensitive hashing (LSH) for fold splitting and activity formatting.

Main authors: Lukas Friedrich (Merck KGaA), Jaak Simm (KU Leuven)

Contributors: Lina Humbeck (Boehringer Ingelheim), Ansgar Schuffenhauer (Novartis), Niko Fechner (Novartis), Noe Sturm (Novartis), Anastasia Pentina (Bayer), Wouter Heyndrickx (Janssen Pharmaceuticals), Peter Schmidtke (Servier/Discngine)

Requirements

The data preprocessing script requires:

  1. Python 3.6 or higher
  2. Local Conda installation (e.g. miniconda)
  3. Git installation

Setup the environment

First, clone the repository from the MELLODDY GitHub organization:

git clone https://github.com/melloddy/MELLODDY-TUNER.git

Then, you can install the conda environment and the required packages by running the following command:

sh install_environment.sh

A list with all installed packages can be found in the file: environment_melloddy_data_prep.yml

The environment can be activated by:

conda activate melloddy_tuner

After activating the environment, install the melloddy-tuner package with pip:

pip install -e .

Finally, you can run the processing scripts located in git_repo/bin/.

Prepare Input Files

The input files contain information about structures (T2) and activity data of these structures (T4) in certain assays.
A weight table file describes the input assays and their corresponding weights in multi-class prediction setup (T3).
As an example, you can download prepared input files from ChEMBL (v25) from the KU Leuven homepage:

ChEMBL25 example files

To run the preprocessing script, the input files should be in csv format and should contain:

a) structure file (T2) containing 2 columns with the headers:
1. input_compound_id
2. smiles

b) activity file (T4) containing 3 columns with the headers:
1. input_compound_id
2. classification_task_id
3. class_label

c) weight table (T3) containing the following columns with headers:
1. classification_task_id
2. input_assay_id (optional, your own assay identifier)
3. assay_type
4. weight (individual weight of each assay for multi-class predictions)

Additional columns are allowed in this weight table.
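
For orientation, here is a minimal, hypothetical sketch of the three input files; all identifiers, SMILES strings, labels and the assay_type value are made-up placeholders, and only the required columns are shown:

import pandas as pd

# T2: structure file (made-up compounds)
pd.DataFrame({
    "input_compound_id": [1, 2],
    "smiles": ["CCO", "c1ccccc1"],
}).to_csv("T2.csv", index=False)

# T4: activity file (class_label values follow your own assay definitions)
pd.DataFrame({
    "input_compound_id": [1, 2],
    "classification_task_id": [0, 0],
    "class_label": [1, 0],
}).to_csv("T4.csv", index=False)

# T3: weight table (additional columns are allowed)
pd.DataFrame({
    "classification_task_id": [0],
    "input_assay_id": ["my_assay_001"],
    "assay_type": ["ADME"],
    "weight": [1.0],
}).to_csv("T3.csv", index=False)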

An example configuration file for standardization is provided in:

/tests/structure_preparation_test/example_parameters.json

containing information about structure standardization options, fingerprint settings,
encryption key, high entropy bits for train/test splitting with LSH and activity data thresholds. The config file can also be modified by the user.
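
As a quick sanity check, you can list which top-level settings your copy of the configuration defines, for example:

import json

# Print the top-level sections of the example configuration file.
with open("tests/structure_preparation_test/example_parameters.json") as fh:
    params = json.load(fh)
for key in params:
    print(key)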

Run Data Preparation Script for Training

To standardize and prepare your input data and create ML-ready files, run the preparation script with the following arguments:

1. path to your T2 structure file (--structure_file)
2. path to your T4 activity file (--activity_file)
3. path to your weight table T3 (--weight_table)
4. path to the config file (--config_file)
5. path of the output directory, where all output files will be stored (--output_dir)
6. user-defined name of your current run (--run_name)
7. (Optional) Number of CPUs to use during the execution of the script (default: 2) (--number_cpu)
8. (Optional) JSON file with a reference hash key to ensure usage of the same parameters between different users (--ref_hash)
9. (Optional) Non-interactive mode for cluster/server runs (--non_interactive)

As an example, you can prepare your data for training by executing prepare_4_melloddy.py:

python bin/prepare_4_melloddy.py \
--structure_file {path/to/your/structure_file_T2.csv} \
--activity_file {/path/to/your/activity_data_file_T4.csv} \
--weight_table {/path/to/your/weight_table_T3.csv} \
--config_file {/path/to/the/distributed/parameters.json} \
--output_dir {path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {path/to/the/provided/ref_hash.json}

In the given output directory, the script will create a folder named after run_name, containing three subfolders:

path/to/the/output_directory/run_name/files_4_ml
path/to/the/output_directory/run_name/results
path/to/the/output_directory/run_name/results_tmp

The folder "results" contains the files which the model will use for the predictions (T11 and T10, and a T10 aggregated by counts, and the weight tables T3_mapped and T9). The folder "results_tmp" contains subfolders for standardization, descriptors and activity formatting including mapping tables and additional output files to track duplicates, failed entries or excluded data. The folder "files_4_ml" contains files which are ready to run the machine learning scripts. The script generates two mtx files (for structure (X), and activity (Y) data) and the folding vector as npy file. It also contains a copy of T10_counts.csv and a reduced version of the weight table T9 (weight_table_T9_red.csv)

The script will also generate a json file ("generated_hash.json") containing a hash key based on a reference set to ensure that every partner uses the same parameters. If a "ref_hash.json" is provided by the user, the "generated_hash.json" is compared to it and the script will stop if the keys do not match.
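
Conceptually, this check can be reproduced by hand by loading both JSON files and comparing them (a simplified sketch; the script performs the comparison itself):

import json

with open("generated_hash.json") as fh:
    generated = json.load(fh)
with open("ref_hash.json") as fh:
    reference = json.load(fh)

# The run should only proceed if both hash records agree.
if generated != reference:
    raise SystemExit("Parameter hash mismatch: check your configuration file.")
print("Hash keys match; parameters are consistent between partners.")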

An example reference hash key file for example_parameters.json is given in:

/tests/structure_preparation_test/ref_hash.json

Run Data Preparation Script for Prediction

For predicting new compounds with an already trained ML model, only a structure file (like T2.csv) has to be preprocessed. To prepare your structure files, please add the argument --prediction_only when running the two scripts.

To standardize and prepare your input data for prediction, run the following command with arguments:
1. Add the argument --prediction_only to process only structure data
2. path to your T2 structure file (--structure_file)
3. path to the config file (--config_file)
4. path of the output directory, where all output files will be stored (--output_dir)
5. user-defined name of your current run (--run_name)
6. (Optional) Number of CPUs to use during the execution of the script (default: 2) (--number_cpu)
7. (Optional) JSON file with a reference hash key to ensure usage of the same parameters between different users (--ref_hash)
8. (Optional) Non-interactive mode for cluster/server runs. (--non_interactive)

For example, you can run:

python bin/prepare_4_melloddy.py \
--prediction_only \
--structure_file {path/to/your/structure_file_T2.csv} \
--config_file {/path/to/the/distributed/parameters.json} \
--output_dir {path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {path/to/the/provided/ref_hash.json}

Run individual scripts

The data processing consists of several individual steps, which can be performed independently of each other.

  1. bin/standardize_smiles.py takes the input SMILES csv file and standardizes the SMILES according to pre-defined rules.
  2. bin/calculate_descriptors.py calculates descriptors from the standardized SMILES, hashes them with the given key, and splits the data into a given number of folds using locality-sensitive hashing (see the sketch after this list).
  3. bin/activity_data_formatting.py formats the input bioactivity data into the required output format, considering pre-defined rules.
  4. bin/hash_reference_set.py standardizes a reference set of molecules as a unit test to ensure that the same configuration was used.
  5. bin/csv_2_mtx.py converts the resulting csv files into ML-ready data formats.
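
To illustrate the fold splitting in step 2, the sketch below assigns a compound to a fold by hashing a selected subset of high-entropy fingerprint bits together with a shared secret key. This is a simplified illustration of the idea, not the package's actual implementation; all names and values are made up.

import hashlib

def assign_fold(fingerprint_bits, high_entropy_bits, secret_key, n_folds=5):
    # Compounds that agree on the selected high-entropy bits fall into the
    # same bucket and therefore into the same fold (the LSH idea).
    bucket = "".join(str(fingerprint_bits[i]) for i in high_entropy_bits)
    digest = hashlib.sha256((secret_key + bucket).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

# Toy usage with a made-up 8-bit fingerprint.
fp = [1, 0, 1, 1, 0, 0, 1, 0]
print(assign_fold(fp, high_entropy_bits=[0, 2, 3, 6], secret_key="example_key"))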

Comparison with Reference Result Files

To verify the common script, please run the pipeline with the provided public data sets (chembl25_T2.csv, chembl25_T3.csv, chembl25_T4.csv).

Please use the config file:

tests/structure_preparation_test/example_parameters.json

And as reference hash file:

tests/structure_preparation_test/ref_hash.json

To process the given ChEMBL files, run the following code:

python bin/prepare_4_melloddy.py \
--structure_file {path/to/chembl_T2.csv} \
--activity_file {/path/to/chembl_T4.csv} \
--weight_table {/path/to/chembl_T3.csv} \
--config_file {/path/to/tests/structure_preparation_test/example_parameters.json} \
--output_dir {path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {path/to/tests/structure_preparation_test/ref_hash.json}

The ref_hash.json file can be found in tests/structure_preparation_test/.

Already preprocessed files ready for multi-task machine learning with SparseChem can be found here: ChEMBL25_processed

Docker

Build the docker image

In order to build the docker image on your computer simply run:

docker build -t melloddy/data_prep .

Run the Data Preparation using the docker image

This is not intended for official use right now, but is made available in case it is of use for testing.

Prerequisite

You need to build the docker image prior to running it on your machine.

Command Line

Mount your data directory (and, if kept separately, your parameter file) into the container and call the preparation script. The template below mirrors the example further down; the mount points /data and /params and all bracketed paths are placeholders:

docker run -v {/path/to/your/data}:/data -v {/path/to/your/parameters}:/params melloddy/data_prep \
conda run -n melloddy_tuner python bin/prepare_4_melloddy.py \
--structure_file {/data/path/to/your/structure_file_T2.csv} \
--activity_file {/data/path/to/your/activity_data_file_T4.csv} \
--weight_table {/data/path/to/your/weight_table_T3.csv} \
--config_file /params/parameters.json \
--output_dir {/data/path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {path/to/the/provided/ref_hash.json}

Example command line on the ChEMBL reference test set:

docker run -v $PWD/chembl:/chembl  melloddy/data_prep conda run -n melloddy_tuner python bin/prepare_4_melloddy.py --structure_file /chembl/chembl_T2.csv --activity_file /chembl/chembl_T4.csv --weight_table /chembl/chembl_T3.csv --config_file /opt/data_prep/melloddy_tuner/parameters.json --output_dir /chembl --number_cpu 4 --run_name chembl_test --ref_hash /chembl/ref_hash.json
