
qptuna's Introduction

QSARtuna 𓆛: QSAR using Optimization for Hyperparameter Tuning (formerly Optuna AZ and QPTUNA)

Build predictive models for CompChem with hyperparameters optimized by Optuna.

Developed with Uncertainty Quantification and model explainability in mind.

Background

This library searches for the best ML algorithm and molecular descriptor for the given data.

The search itself is done using Optuna.

Developed models employ the latest state-of-the-art uncertainty estimation and explainability Python packages.

Further documentation is available in the GitHub Pages here.

The QSARtuna publication is available here.

The three-step process

QSARtuna is structured around three steps:

  1. Hyperparameter Optimization: Train many models with different parameters using Optuna. Only the training dataset is used here. Training is usually done with cross-validation.

  2. Build (Training): Pick the best model from Optimization, and optionally evaluate its performance on the test dataset.

  3. "Prod-build:" Re-train the best-performing model on the merged training and test datasets. This step has a drawback that there is no data left to evaluate the resulting model, but it has a big benefit that this final model is trained on the all available data.

JSON-based Command-line interface

Let's look at a trivial example of modelling molecular weight using a training set of 50 molecules.

Configuration file

We start with a configuration file in JSON format. It contains four main sections:

  • data - location of the data file, columns to use.
  • settings - details about the optimization run.
  • descriptors - which molecular descriptors to use.
  • algorithms - which ML algorithms to use.

Below is an example of such a file:

{
  "task": "optimization",
  "data": {
    "training_dataset_file": "tests/data/DRD2/subset-50/train.csv",
    "input_column": "canonical",
    "response_column": "molwt"
  },
  "settings": {
    "mode": "regression",
    "cross_validation": 5,
    "direction": "maximize",
    "n_trials": 100,
    "n_startup_trials": 30
  },
  "descriptors": [
    {
      "name": "ECFP",
      "parameters": {
        "radius": 3,
        "nBits": 2048
      }
    },
    {
      "name": "MACCS_keys",
      "parameters": {}
    }
  ],
  "algorithms": [
    {
      "name": "RandomForestRegressor",
      "parameters": {
        "max_depth": {"low": 2, "high": 32},
        "n_estimators": {"low": 10, "high": 250},
        "max_features": ["auto"]
      }
    },
    {
      "name": "Ridge",
      "parameters": {
        "alpha": {"low": 0, "high": 2}
      }
    },
    {
      "name": "Lasso",
      "parameters": {
        "alpha": {"low": 0, "high": 2}
      }
    },
    {
      "name": "XGBRegressor",
      "parameters": {
        "max_depth": {"low": 2, "high": 32},
        "n_estimators": {"low": 3, "high": 100},
        "learning_rate": {"low": 0.1, "high": 0.1}
      }
    }
  ]
}

The data section specifies the location of the dataset file. In this example, it uses a relative path to the tests/data folder.

Settings section specifies that:

  • we are building a regression model,
  • we want to use 5-fold cross-validation,
  • we want to maximize the value of the objective function (maximization is the standard for scikit-learn models),
  • we want to have a total of 100 trials,
  • and the first 30 trials ("startup trials") should be random exploration (to not get stuck early on in one local minimum).

We specify two descriptors and four algorithms, and the optimization is free to pair any specified descriptor with any of the algorithms.
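To make the search mechanics concrete, here is a minimal, hypothetical sketch (not QSARtuna's actual internals) of how JSON ranges like {"low": 2, "high": 32} typically map onto an Optuna objective; parameter names are taken from the config above:

import optuna

def objective(trial: optuna.Trial) -> float:
    # {"low": 2, "high": 32} becomes an integer search range.
    max_depth = trial.suggest_int("max_depth", 2, 32)
    n_estimators = trial.suggest_int("n_estimators", 10, 250)
    # {"low": 0, "high": 2} for Ridge/Lasso alpha becomes a float range.
    alpha = trial.suggest_float("alpha", 0.0, 2.0)
    # ... train the candidate model with cross-validation here ...
    return 0.0  # placeholder for the cross-validated score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)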

When we have our data and our configuration, it is time to start the optimization.

Run from Python/Jupyter Notebook

Create a conda environment with Jupyter and install QSARtuna there:

module purge
module load Miniconda3
conda create --name my_env_with_qsartuna python=3.10.10 jupyter pip
conda activate my_env_with_qsartuna
module purge  # Just in case.
which python  # Check. Should output path that contains "my_env_with_qsartuna".
python -m pip install https://github.com/MolecularAI/QSARtuna/releases/download/3.1.1/qsartuna-3.1.1.tar.gz

Then you can use QSARtuna inside your Notebook:

from qsartuna.three_step_opt_build_merge import (
    optimize,
    buildconfig_best,
    build_best,
    build_merged,
)
from qsartuna.config import ModelMode, OptimizationDirection
from qsartuna.config.optconfig import (
    OptimizationConfig,
    SVR,
    RandomForestRegressor,
    Ridge,
    Lasso,
    XGBRegressor,
)
from qsartuna.datareader import Dataset
from qsartuna.descriptors import ECFP, MACCS_keys, ECFP_counts, PathFP

# Prepare hyperparameter optimization configuration.
config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
    ),
    descriptors=[ECFP.new(), ECFP_counts.new(), MACCS_keys.new(), PathFP.new()],
    algorithms=[
        SVR.new(),
        RandomForestRegressor.new(),
        Ridge.new(),
        Lasso.new(),
        XGBRegressor.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=3,
        n_trials=100,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

# Run Optuna Study.
study = optimize(config, study_name="my_study")

# Get the best Trial from the Study and make a Build (Training) configuration for it.
buildconfig = buildconfig_best(study)
with open("best_config.txt", "w") as f:
    f.write(str(buildconfig.__dict__))

# Build (re-Train) and save the best model.
build_best(buildconfig, "target/best.pkl")

# Build (Train) and save the model on the merged train+test data.
build_merged(buildconfig, "target/merged.pkl")
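Once built, the saved model can be loaded back for inference. A minimal sketch, assuming the persisted model object exposes a predict_from_smiles method (verify against the API of your installed QSARtuna version):

import pickle

# Load the model trained on the merged train+test data (path from the example above).
with open("target/merged.pkl", "rb") as f:
    model = pickle.load(f)

# Assumed API: predict directly from a list of SMILES strings.
predictions = model.predict_from_smiles(["CCC", "CC(=O)Nc1ccc(O)cc1"])
print(predictions)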

Running via CLI

QSARtuna can be deployed directly from the CLI.

To run commands, QSARtuna uses the following syntax:

qsartuna-<optimize|build|predict|schemagen> <command>

We can run the three-step process from the command line with the following command:

  qsartuna-optimize \
  --config examples/optimization/regression_drd2_50.json \
  --best-buildconfig-outpath ~/qsartuna-target/best.json \
  --best-model-outpath ~/qsartuna-target/best.pkl \
  --merged-model-outpath ~/qsartuna-target/merged.pkl

Optimization accepts the following command line arguments:

qsartuna-optimize -h
usage: qsartuna-optimize [-h] --config CONFIG [--best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH] [--best-model-outpath BEST_MODEL_OUTPATH] [--merged-model-outpath MERGED_MODEL_OUTPATH] [--no-cache]

optbuild: Optimize hyper-parameters and build (train) the best model.

options:
  -h, --help            show this help message and exit
  --best-buildconfig-outpath BEST_BUILDCONFIG_OUTPATH
                        Path where to write Json of the best build configuration.
  --best-model-outpath BEST_MODEL_OUTPATH
                        Path where to write (persist) the best model.
  --merged-model-outpath MERGED_MODEL_OUTPATH
                        Path where to write (persist) the model trained on merged train+test data.
  --no-cache            Turn off descriptor generation caching

required named arguments:
  --config CONFIG       Path to input configuration file (JSON): either Optimization configuration, or Build (training) configuration.

Since optimization can be a long process, we should avoid running it on the login node and submit it to the SLURM queue instead.

Submitting to SLURM

We can submit our script to the queue by giving sbatch the following script:

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=4G
#SBATCH --time=100:0:0
#SBATCH --partition core

# This script illustrates how to run one configuration from QSARtuna examples.
# The example we use is in examples/optimization/regression_drd2_50.json.

module load Miniconda3
conda activate my_env_with_qsartuna

# The example we chose uses relative paths to data files, change directory.
cd /{project_folder}/

  /<your-project-dir>/qsartuna-optimize \
  --config {project_folder}/examples/optimization/regression_drd2_50.json \
  --best-buildconfig-outpath ~/qsartuna-target/best.json \
  --best-model-outpath ~/qsartuna-target/best.pkl \
  --merged-model-outpath ~/qsartuna-target/merged.pkl

When the script is complete, it will create pickled model files inside your home directory under ~/qsartuna-target/.

Using the model

When the model is built, run inference:

  qsartuna-predict \
  --model-file target/merged.pkl \
  --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
  --input-smiles-csv-column "canonical" \
  --output-prediction-csv-file target/prediction.csv
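The resulting file is a plain CSV, so it can be inspected with pandas. A small sketch, assuming the default output layout (the prediction column name is configurable via --output-prediction-csv-column):

import pandas as pd

# Read the predictions written by qsartuna-predict above.
preds = pd.read_csv("target/prediction.csv")
print(preds.head())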

Note that prediction accepts a variety of command line arguments:

 qsartuna-predict -h
usage: qsartuna-predict [-h] --model-file MODEL_FILE [--input-smiles-csv-file INPUT_SMILES_CSV_FILE] [--input-smiles-csv-column INPUT_SMILES_CSV_COLUMN] [--input-aux-column INPUT_AUX_COLUMN]
                        [--input-precomputed-file INPUT_PRECOMPUTED_FILE] [--input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN]
                        [--input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN] [--output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN]
                        [--output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE] [--predict-uncertainty] [--predict-explain] [--uncertainty_quantile UNCERTAINTY_QUANTILE]

Predict responses for a given OptunaAZ model

options:
  -h, --help            show this help message and exit
  --input-smiles-csv-file INPUT_SMILES_CSV_FILE
                        Name of input CSV file with Input SMILES
  --input-smiles-csv-column INPUT_SMILES_CSV_COLUMN
                        Column name of SMILES column in input CSV file
  --input-aux-column INPUT_AUX_COLUMN
                        Column name of auxiliary descriptors in input CSV file
  --input-precomputed-file INPUT_PRECOMPUTED_FILE
                        Filename of precomputed descriptors input CSV file
  --input-precomputed-input-column INPUT_PRECOMPUTED_INPUT_COLUMN
                        Column name of precomputed descriptors identifier
  --input-precomputed-response-column INPUT_PRECOMPUTED_RESPONSE_COLUMN
                        Column name of precomputed descriptors response column
  --output-prediction-csv-column OUTPUT_PREDICTION_CSV_COLUMN
                        Column name of prediction column in output CSV file
  --output-prediction-csv-file OUTPUT_PREDICTION_CSV_FILE
                        Name of output CSV file
  --predict-uncertainty
                        Predict with uncertainties (model must provide this functionality)
  --predict-explain     Predict with SHAP or ChemProp explainability
  --uncertainty_quantile UNCERTAINTY_QUANTILE
                        Apply uncertainty threshold to predictions

required named arguments:
  --model-file MODEL_FILE
                        Model file name

Optional: inspect

To inspect the performance of the different models tried during optimization, use the MLflow Tracking UI:

module load mlflow
mlflow ui

Then open the MLflow link in your browser.

[Screenshot: MLflow experiment selection]

If you run mlflow ui on SCP, you can forward your mlflow port with a separate SSH session started on your local ("non-SCP") machine:

ssh -N -L localhost:5000:localhost:5000 [email protected]

("-L" forwards ports, and "-N" just to not execute any commands).

In the MLflow Tracking UI, select the experiment on the left; it is named after the input file path. Then select all runs/trials in the experiment and choose "Compare". You will get a comparison page for the selected runs/trials.

[Screenshot: MLflow trial comparison]

The comparison page shows MLflow Runs (called Trials in Optuna), as well as their Parameters and Metrics. At the bottom there are plots. For the X-axis, select "trial_number". For the Y-axis, start with "optimization_objective_cvmean_r2".

You can get more details by clicking individual runs. There you can access the run/trial build (training) configuration.
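If you ran the optimization from Python, the study can also be inspected directly, without MLflow. A minimal sketch, assuming optimize() returns a standard Optuna Study (as the notebook example above suggests):

# `study` is the object returned by optimize() in the notebook example.
print("Best value:", study.best_value)
print("Best params:", study.best_trial.params)

# Full trial history as a pandas DataFrame.
df = study.trials_dataframe()
print(df[["number", "value", "state"]].head())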

Adding descriptors to QSARtuna

Add the descriptor code to the optunaz.descriptor.py file like so:

@dataclass
class YourNewDescriptor(RdkitDescriptor):
    """YOUR DESCRIPTION GOES HERE"""

    @apischema.type_name("YourNewDescriptorParams")
    @dataclass
    class Parameters:
        # Any parameters to pass to your descriptor here
        exampleOfAParameter: Annotated[
            int,
            schema(
                min=1,
                title="exampleOfAParameter",
                description="This is an example int parameter.",
            ),
        ] = field(
            default=1,
        )

    name: Literal["YourNewDescriptor"]
    parameters: Parameters

    def calculate_from_smi(self, smi: str):
        # Insert your code to calculate from SMILES here
        fp = code_to_calculate_fp(smi)
        return fp
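For illustration only (this descriptor is hypothetical, not part of QSARtuna), the calculation could look like the following standalone sketch of calculate_from_smi, assuming RDKit and NumPy are available:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def calculate_from_smi(smi: str):
    # Parse the SMILES; RDKit returns None for unparseable input.
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    # Example calculation: a 2048-bit Morgan fingerprint as a numpy array.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=2048)
    return np.array(fp)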

Then add the descriptor to the list here:

AnyUnscaledDescriptor = Union[
    Avalon,
    ECFP,
    ECFP_counts,
    PathFP,
    AmorProtDescriptors,
    MACCS_keys,
    PrecomputedDescriptorFromFile,
    UnscaledMAPC,
    UnscaledPhyschemDescriptors,
    UnscaledJazzyDescriptors,
    UnscaledZScalesDescriptors,
    YourNewDescriptor,  # Ensure your new descriptor is added here
]

and here:

CompositeCompatibleDescriptor = Union[
    AnyUnscaledDescriptor,
    ScaledDescriptor,
    MAPC,
    PhyschemDescriptors,
    JazzyDescriptors,
    ZScalesDescriptors,
    YourNewDescriptor,  # Ensure your new descriptor is added here
]

Then you can use YourNewDescriptor inside your Notebook:

from qsartuna.descriptors import YourNewDescriptor

config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
    ),
    descriptors=[YourNewDescriptor.new()],
    algorithms=[
        SVR.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=3,
        n_trials=100,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

or in a new config:

{
  "task": "optimization",
  "data": {
    "training_dataset_file": "tests/data/DRD2/subset-50/train.csv",
    "input_column": "canonical",
    "response_column": "molwt"
  },
  "settings": {
    "mode": "regression",
    "cross_validation": 5,
    "direction": "maximize",
    "n_trials": 100,
    "n_startup_trials": 30
  },
  "descriptors": [
    {
      "name": "YourNewDescriptor",
      "parameters": {
        "exampleOfAParameter": 3
      }
    }
  ],
  "algorithms": [
    {
      "name": "RandomForestRegressor",
      "parameters": {
        "max_depth": {"low": 2, "high": 32},
        "n_estimators": {"low": 10, "high": 250},
        "max_features": ["auto"]
      }
    }
  ]
}


qptuna's Issues

Some problems occurred when I initiated QSARtuna in Jupyter Notebook

Hi,
When I tested QSARtuna after downloading it through pip install, an error occurred:

[screenshot of the error]

I found that "optunaz" was used in the code, so I edited the file and tried to use "optunaz", but it doesn't work:

[screenshot]

I am not sure whether it's a problem on my side; I reinstalled QSARtuna but this error still occurred. So I need your assistance.

PermissionError: [Errno 13] Permission denied, for versions 3.0.0.1 and 3.0.0

Hi,

While trying to run the ChemProp part, following the example of the most basic ChemProp run (which trains the algorithm using the recommended, sensible defaults for the MPNN architecture), the following error occurred:

[I 2024-05-03 18:36:42,881] A new study created in memory with name: my_study
[I 2024-05-03 18:36:42,890] A new study created in memory with name: study_name_0
INFO:root:Enqueued ChemProp manual trial with sensible defaults: {'activation': 'ReLU', 'aggregation': 'mean', 'aggregation_norm': 100, 'batch_size': 50, 'depth': 3, 'dropout': 0.0, 'features_generator': 'none', 'ffn_hidden_size': 300, 'ffn_num_layers': 3, 'final_lr_ratio_exp': -1, 'hidden_size': 300, 'init_lr_ratio_exp': -1, 'max_lr_exp': -3, 'warmup_epochs_ratio': 0.1, 'algorithm_name': 'ChemPropRegressor'}
[W 2024-05-03 18:36:49,239] Trial 0 failed with parameters: {'algorithm_name': 'ChemPropRegressor', 'ChemPropRegressor_algorithm_hash': '668a7428ff5cdb271b01c0925e8fea45', 'activation': <ChemPropActivation.RELU: 'ReLU'>, 'aggregation': <ChemPropAggregation.MEAN: 'mean'>, 'aggregation_norm': 100.0, 'batch_size': 50.0, 'depth': 3.0, 'dropout': 0.0, 'ensemble_size': 1, 'epochs': 5, 'features_generator': <ChemPropFeatures_Generator.NONE: 'none'>, 'ffn_hidden_size': 300.0, 'ffn_num_layers': 3.0, 'final_lr_ratio_exp': -1, 'hidden_size': 300.0, 'init_lr_ratio_exp': -1, 'max_lr_exp': -3, 'warmup_epochs_ratio': 0.1, 'descriptor': '{"name": "SmilesFromFile", "parameters": {}}'} because of a ValueError (full traceback below).
Traceback (most recent call last):
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\optuna\study\_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "d:\Cheminfo_Workshop\3_MachineLearning_V1\QSARtuna-3.0.0.1\optunaz\objective.py", line 191, in __call__
    scores = sklearn.model_selection.cross_validate(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\sklearn\utils\_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\sklearn\model_selection\_validation.py", line 445, in cross_validate
    _warn_or_raise_about_fit_failures(results, error_score)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_failed_message)
ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\sklearn\model_selection\_validation.py", line 890, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "d:\Cheminfo_Workshop\3_MachineLearning_V1\QSARtuna-3.0.0.1\optunaz\algorithms\chem_prop.py", line 255, in fit
    pd.DataFrame(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\core\generic.py", line 3720, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\formats\format.py", line 1189, in to_csv
    csv_formatter.save()
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\formats\csvs.py", line 241, in save
    with get_handle(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\common.py", line 856, in get_handle
    handle = open(
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\lsy\\AppData\\Local\\Temp\\tmpnon7aoeh'

--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\sklearn\model_selection\_validation.py", line 890, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "d:\Cheminfo_Workshop\3_MachineLearning_V1\QSARtuna-3.0.0.1\optunaz\algorithms\chem_prop.py", line 255, in fit
    pd.DataFrame(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\core\generic.py", line 3720, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\formats\format.py", line 1189, in to_csv
    csv_formatter.save()
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\formats\csvs.py", line 241, in save
    with get_handle(
  File "c:\Users\lsy\anaconda3\envs\plantain\lib\site-packages\pandas\io\common.py", line 856, in get_handle
    handle = open(
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\lsy\\AppData\\Local\\Temp\\tmp25ubl3rt'

[W 2024-05-03 18:36:49,243] Trial 0 failed with value None.

My OS is Windows 10, with Python 3.10.12.

Why can't I access the temp folder? Could you please provide some suggestions on how to fix it?
Many thanks.

Missing files

There seem to be some missing files, like optunaz/algorithms.py.

Output of training and testing sets

Hello, I created the QSARtuna configuration in Python and set up a dataset splitting strategy. How do I output the training and testing set files? Looking forward to your response.

For choosing a scoring function, how to set OptimizationDirection

To improve the accuracy and performance of the model in QPTUNA: when neg_mean_absolute_error is used as the training objective, should minimization or maximization be chosen as the optimization direction?

    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=3,
        n_trials=100,
        n_startup_trials=50,
        scoring="neg_mean_absolute_error",  # Scoring function name from scikit-learn.
        direction=OptimizationDirection.MAXIMIZATION,  # MAXIMIZATION or MINIMIZATION?
        track_to_mlflow=False,
    ),

many thanks,

Does QSARtuna Utilize GPU Resources?

I noticed that the project uses PyTorch, but during execution, I observed that my GPU was not being utilized. Is additional configuration required to enable GPU acceleration?

CPU: AMD 5600G
GPU: GTX1060 6G
CUDA Version: CUDA 12.2
Driver Version: 535.183.01

import torch
torch.cuda.is_available()
# True

PrecomputedDescriptorFromFile doesn't accept stereo SMILES?

Hi,
I'm not sure here, but after spending some time not understanding why my own file was failing while the test data file worked, I used the following function to remove stereochemistry from SMILES:

from rdkit import Chem

def remove_stereo(smiles):
    # Parse the SMILES and write it back without stereochemistry information.
    mol = Chem.MolFromSmiles(smiles)
    smiles_no_stereo = Chem.MolToSmiles(mol, isomericSmiles=False)
    return smiles_no_stereo

Then I saved the file with the fp column and could run the study; otherwise I got an error that the descriptors could not be calculated.
I have no idea why, because other functions like "UnscaledPhyschemDescriptors" worked without any issues.

Another problem with the function: if the column we want to add is recognized as an integer by pandas, it fails because the code expects a float (I used value + 0.000001 to get around it).

ChemProp classifier warnings

When I use the following code:

from optunaz.utils.preprocessing.splitter import Random
from optunaz.utils.preprocessing.deduplicator import KeepMedian
from optunaz.config.optconfig import ChemPropHyperoptClassifier
from optunaz.descriptors import SmilesBasedDescriptor, SmilesFromFile

config = OptimizationConfig(
    data=Dataset(
        input_column="Smiles",  # Smiles column.
        response_column="Activity",  # Activity column.
        training_dataset_file="reinvent4_preparation.csv",  # This will be split into train and test.
        split_strategy=Random(fraction=0.2),
        deduplication_strategy=KeepMedian(),
    ),
    descriptors=[
        SmilesFromFile.new(),
    ],
    algorithms=[
        ChemPropHyperoptClassifier.new(epochs=100, num_iters=2),  # num_iters>2: enable hyperopt within ChemProp trials
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.CLASSIFICATION,
        cross_validation=5,
        n_startup_trials=50,
        n_trials=300,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

# Set up basic logging.
import logging
from importlib import reload
reload(logging)
logging.basicConfig(level=logging.INFO)

# Avoid deprecation warnings from packages etc.
import warnings
warnings.simplefilter("ignore")
def warn(*args, **kwargs):
    pass
warnings.warn = warn

study = optimize(config, study_name="BPA_ChemProp_hyperopt")

I get the following warnings:

/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

I am not sure if this is a problem. I also wonder what the optimal number of epochs, trials, etc. is, because it feels like everything in the tutorial is kept small for the sake of computational cost. I would think that the number of trials should be as high as possible, but when I use it for ChemProp, all trials have the same output value.
