
paccmann_datasets's Introduction

PyToDa


Overview

pytoda - PaccMann PyTorch Dataset Classes

A Python package that eases the handling of biochemical data for deep learning applications with PyTorch.

Installation

pytoda ships via PyPI:

pip install pytoda

Documentation

Please find the full documentation here.

Development

For development setup, we recommend working in a dedicated conda environment:

conda env create -f conda.yml

Activate the environment:

conda activate pytoda

Install in editable mode:

pip install -r dev_requirements.txt
pip install --user --no-use-pep517 -e .

Examples

For examples of how to use pytoda, see here.

References

If you use pytoda in your projects, please cite the following:

@article{born2021datadriven,
  author = {
    Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and
    Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and
    Cardinale, Antonio and Laino, Teodoro and 
    {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a
  },
  doi = {10.1088/2632-2153/abe808},
  issn = {2632-2153},
  journal = {Machine Learning: Science and Technology},
  number = {2},
  pages = {025024},
  title = {{
    Data-driven molecular design for discovery and synthesis of novel ligands: 
    a case study on SARS-CoV-2
  }},
  url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
  volume = {2},
  year = {2021}
}
@article{born2021paccmannrl,
  author = {
    Jannis Born and Matteo Manica and Ali Oskooei and Joris Cadow and
    Greta Markert and Mar{\'\i}a Rodr{\'\i}guez Mart{\'\i}nez
  },
  doi = {10.1016/j.isci.2021.102269},
  issn = {2589-0042},
  journal = {iScience},
  number = {4},
  title = {{
    PaccMann$^{RL}$: De novo generation of hit-like anticancer molecules from
    transcriptomic data via reinforcement learning
  }},
  url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
  volume = {24},
  year = {2021}
}

paccmann_datasets's People

Contributors

annaweber209, c-nit, drugilsberg, jannisborn, mfilipav, niklexical, niladell, pyup-bot


paccmann_datasets's Issues

Facing issues to install the repository with Conda

Hi, I'm following the instructions written in README.md to install pytoda with Conda but it still fails.
When I run the command:

$ conda env create -f conda.yml

I get the following message:

Could not find a version that satisfies the requirement rdkit==2019.03.1 (from -r /home/biohpc/paccmann_datasets/condaenv.5xlin6py.requirements.txt (line 1)) (from versions: )
No matching distribution found for rdkit==2019.03.1 (from -r /home/biohpc/paccmann_datasets/condaenv.5xlin6py.requirements.txt (line 1))

I also face similar issues with other packages, e.g. torch==1.0.1 does not exist, the latest available version is 0.1.2.post2.
To overcome this sort of problem, I tried changing the versions in conda.yml, but I didn't succeed.

Something similar happens when installing paccmann_predictor.
Could you help me install the repositories to make predictions?

Thanks

Pytorch import order

Description:
There are multiple reports of errors in packages that are imported downstream after imports of old versions of PyTorch.
This includes:

It can even lead to segmentation faults, as described here. The occurrence of the error seems to depend on the version of PyTorch, but also on the channel through which it was installed.

Solution:
I think it's important to fix this for pytoda users, since the error messages always surface in downstream packages and are thus very difficult to debug. A simple fix on our side is to always sort imports so that torch is imported last. The import-sorting configuration lives in pyproject.toml under [tool.isort], which has a force_to_top argument but unfortunately no force_to_bottom.
A fix could be to define force_to_top = ["rdkit", "scikit-learn"] in pyproject.toml.
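
For illustration, a minimal sketch of what the resulting import order would look like in a pytoda module (the imported modules here are just examples of non-torch dependencies):

from rdkit import Chem       # heavy non-torch dependencies first, per force_to_top
from sklearn import metrics

import torch                 # torch imported last to avoid the downstream crashes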

add polymer dataset

Add a dataset for polymers that creates an arbitrary tuple of SMILES datasets.

  • Same parameters as the SMILES dataset, but each parameter takes a list; the list length corresponds to the number of SMILES datasets.
  • A single SMILES language object shared by all of them.
  • The SMILES language needs modifications to handle start and stop tokens for the individual SMILES entities, e.g. <START_MONOMER> (--> a polymer_smiles_language class?).
  • The loader returns a tuple with one SMILES per dataset (see the sketch below).
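
A minimal sketch of the proposed loader behaviour (class name and constructor are hypothetical; the real implementation would share a single SMILES language and add the entity-specific start/stop tokens):

from typing import Sequence, Tuple

import torch
from torch.utils.data import Dataset


class PolymerDataset(Dataset):
    """One SMILES dataset per polymer entity; indexing returns a tuple."""

    def __init__(self, entity_datasets: Sequence[Dataset]) -> None:
        # one dataset per entity (e.g. monomer, catalyst), assumed equal length
        self.entity_datasets = list(entity_datasets)

    def __len__(self) -> int:
        return len(self.entity_datasets[0])

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, ...]:
        # one tokenized SMILES per entity, as described above
        return tuple(dataset[index] for dataset in self.entity_datasets)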

Using different protein_languages in PPI dataset

Currently, in the ProteinProteinInteractionDataset, we pass one protein_language that is used to encode all sequences. Since we currently have a number of different ways of encoding amino acid sequences (biophysical features, BLOSUM matrix, learned embedding), it should be possible to use different encodings for the two inputs to a PPI dataset. I.e., I might want to encode one of the proteins using the BLOSUM matrix, but the other with a learned embedding. For this, two different protein languages would need to be used in the PPI dataset.

I would propose that we enable passing a Sequence of ProteinLanguage objects, where the first entry is used to encode the first entry in sequence_filepaths, and the second for the second one. The default can still be to generate a single ProteinLanguage object and use it for both sequences (see the sketch after the code below).


class ProteinProteinInteractionDataset(Dataset):
    def __init__(
        self,
        sequence_filepaths: Union[Files, Sequence[Files]],
        entity_names: Sequence[str],
        labels_filepath: str,
        sequence_filetypes: Union[str, List[str]] = 'infer',
        annotations_column_names: Union[List[int], List[str]] = None,
        protein_language: ProteinLanguage = None,
        amino_acid_dict: str = 'iupac',
        paddings: Union[bool, Sequence[bool]] = True,
        padding_lengths: Union[int, Sequence[int]] = None,
        add_start_and_stops: Union[bool, Sequence[bool]] = False,
        augment_by_reverts: Union[bool, Sequence[bool]] = False,
        randomizes: Union[bool, Sequence[bool]] = False,
        device: torch.device = (
            torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        )
    ) -> None:
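
A sketch of how the proposed argument could be normalized inside the constructor (function and argument names are hypothetical, not pytoda's current API):

from typing import Any, Sequence, Union


def resolve_protein_languages(
    protein_languages: Union[Any, Sequence[Any]], num_entities: int
) -> list:
    # A single (or missing) ProteinLanguage is shared across all entities,
    # preserving today's default; a sequence is matched positionally against
    # sequence_filepaths.
    if not isinstance(protein_languages, (list, tuple)):
        return [protein_languages] * num_entities
    if len(protein_languages) != num_entities:
        raise ValueError('Provide one ProteinLanguage per sequence filepath.')
    return list(protein_languages)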

transform_encoding in smiles_language.py

Hi!

In my test (e.g. paccmann_predictor/examples/train_paccmann.py), I found that this function is not defined in this script (smiles_language.py).

Could you check this issue?


Thank you in advance!

SMILESLanguage transforms can't be set in datasets

If a SMILESLanguage or SMILESTokenizer object is passed to DrugAffinityDataset or any downstream class like SMILESTokenizerDataset, it is impossible to control the transforms of the tokenizer object. This is only possible if a new object is created.
See SMILESTokenizerDataset:

        if smiles_language is not None:
            self.smiles_language = smiles_language
        else:
            language_kwargs = {}  # SMILES default
            if selfies:
                language_kwargs = dict(
                    name='selfies-language', smiles_tokenizer=split_selfies
                )
            self.smiles_language = SMILESTokenizer(
                **language_kwargs,
                canonical=canonical,
                augment=augment,
                kekulize=kekulize,
                all_bonds_explicit=all_bonds_explicit,

I suggest the following:

  • Add a line in the first if branch that calls self.smiles_language.set_smiles_transforms() to set the transforms from what has been passed to the SMILESTokenizerDataset constructor (see the sketch below).
  • More importantly: find a way to save the transforms as part of the SMILESTokenizer object. These transformations can heavily affect performance. At this point, it is very easy to mess things up: whenever any transformation differed from the default during model training AND a SMILES language is passed at test time when the model is evaluated, e.g. on a different dataset, the language will behave differently. That's a bug.
  • I believe we should wrap all data needed to construct a SMILES language/tokenizer into a single .json. Right now we keep dragging around 3 files (vocab, tokenizer_config and token_count), of which effectively only the first is needed for restoring. If we unify these into a nested JSON that also has the logic to restore the transforms, we greatly simplify our lives.
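
A rough sketch of the first point, i.e. the modified branch in SMILESTokenizerDataset.__init__ (the keyword names are assumed to mirror the constructor arguments; the actual signature of set_smiles_transforms may differ):

        if smiles_language is not None:
            self.smiles_language = smiles_language
            # proposed addition: overwrite the transforms of the passed-in
            # language with what the dataset constructor received, instead of
            # silently keeping whatever the object was configured with before
            self.smiles_language.set_smiles_transforms(
                canonical=canonical,
                augment=augment,
                kekulize=kekulize,
                all_bonds_explicit=all_bonds_explicit,
            )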

low GPU utilization

I've noticed that the training (and inference) loops of multiple models that use default pytoda dataset settings have very low GPU utilization.

Initially I thought it was because people keep num_workers=0 in their configs, which makes the PyTorch DataLoader work in a single process. However, after increasing num_workers to any value larger than 0, it crashed.

At first it was just a matter of a few lambda functions in the pytoda datasets, which multiprocessing doesn't like due to pickling issues (pretty easy to fix).
But after that, I kept getting the error which I describe here.

In short, you get:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"

Using the suggested solution of setting to 'spawn' does not actually solve the problem.

Not transferring tensors to the GPU inside pytoda resulted in a 5x-20x training speedup, so I think this is a pretty critical aspect.

I think that device should not be part of the API, and if it is, it should at least default to CPU (with a warning when it is not).
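
A minimal sketch of that pattern (dataset is a placeholder and is assumed to yield CPU tensors): keep the dataset on the CPU, let the DataLoader use worker processes, and move only the current batch to the GPU inside the training loop.

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for batch in loader:
    # per-batch transfer instead of per-item transfers in __getitem__
    batch = batch.to(device, non_blocking=True)
    ...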

I can prepare an initial PR if you'd like.

edit: @jannisborn pointed me to a related issue: #105

complementing Protein Sequence Dataset

It would be nice to have a more flexible protein sequence dataset. In particular we want to have the option to:

_CashDataset has default 1GB size limit

There is a default file size limit for the cache:

size_limit (int, in bytes) default one gigabyte - approximate size limit of cache.
http://www.grantjenks.com/docs/diskcache/api.html#constants

This is not very useful when the data doesn't fit into memory. With larger data, a likely error for the user is KeyError: 0 when starting to iterate.

  • There should be a way to pass arguments to the cache instantiation from the inheriting classes _SmiLazyDataset and _CsvLazyDataset and their call sites (see the sketch below).
  • An error should be raised right away when the data being loaded is too large.
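
A sketch of the first point (the directory and the ~100 GB figure are arbitrary): diskcache accepts settings such as size_limit at instantiation, so the inheriting classes would only need to forward them.

from diskcache import Cache

cache = Cache(directory='/tmp/pytoda_cache', size_limit=int(1e11))  # ~100 GB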

Device handling in Datasets

@drugilsberg @jannisborn I'm opening this issue to revise and discuss whether there should be any device handling and what the default device or even datatype of items should be. Much of this is motivated by PyTorch Lightning.

First of all, a Dataset is not required to return a torch.Tensor when indexed (also briefly discussed in #98); e.g. SMILESDataset returns str. The default torch DataLoader/collate_fn handles np.arrays, strings, and even Mappings such as dicts. Actually, it handles np.arrays even better than we do (avoiding a copy if possible: https://github.com/pytorch/pytorch/blob/b643dbb8a4bf9850171b2de848b6b89206973972/torch/utils/data/_utils/collate.py#L51).

  • Would you agree that we should not cast to tensor if there is no benefit within the dataset itself?

I think our handling of device is uncommon. One would usually only send data to the GPU in advance of the training loop when the entire dataset is known to fit in GPU memory, as this avoids any sending of data between devices during training. This is not the case for us: most of the time we end up calling tensor.to(device) in __getitem__.

I'm assuming now that the data originates on the CPU anyway, and we merely have to decide when the transfer to other devices should happen. In the generic case where not all data can be sent to the GPU, doing it in __getitem__ has a big overhead versus sending batches, so it should be done in the training loop. (Lightning does this for us, btw.)

  • So I argue that we should never set the device parameter in our datasets to a GPU. I think we should remove any device handling from pytoda.datasets; at the very least, we should set the default to 'cpu'.

Leaving things on the CPU also allows sending only parts of the data to specific devices later, as torch.DistributedSampler does (and, again, Lightning does for us under the hood), enabling multi-GPU training. Currently, our datasets would send all data to all GPUs.
Also, the dataloader could prepare batches on the CPU with multiple workers. So instead of worrying about device, we should make sure the datasets perform well on the CPU and can be used with multiple workers (the torch docs issue a warning here not to return CUDA tensors!).

  • We should check if we have to do anything to support Dataloader(..., pin_memory=True).

Reading:
https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html
https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
https://pytorch.org/docs/stable/data.html#multi-process-data-loading
https://pytorch.org/docs/stable/data.html#memory-pinning

Shape of tensors returned by SMILES_dataset

By default, the SMILESDataset class currently returns tensors of shape sequence_length x 1. This is not ideal, since most models do not expect the additional dimension.
As a workaround, we use torch.squeeze() in the code that currently uses SMILESDataset objects.
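
For illustration, the workaround in miniature (the tensor here just stands in for a dataset item):

import torch

item = torch.zeros(100, 1, dtype=torch.long)  # shape (sequence_length, 1)
item = torch.squeeze(item, dim=1)
print(item.shape)  # torch.Size([100])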

This then causes problems when writing methods that use models and loaders generically (like the uncertainty estimation functions in paccmann_predictor).

Check test_smiles_eager_dataset.py for the details.

Transformation parameter in DrugSensitivityDataset

The abstract class DrugSensitivityDataset does not allow passing in the parameters used for standardization or min/max scaling. A workaround for this exists in GeneExpressionDataset, though.

This lowers SNR and thus hampers performance, but also violates the concept of entirely unseen data. I thus mark it as a bug.

We need to:

  • adapt DrugSensitivityDataset to allow the respective IC50 transforms with predefined parameters (see the sketch below)
  • adapt the call to GeneExpressionDataset inside DrugSensitivityDataset to also pass these values down.
  • merge and push the tag of pytoda
  • adapt the training script in paccmann_predictor to update the model_params.json with the respective transformation parameters since they are needed downstream, e.g. to compute the micromolar IC50
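
A sketch of the idea behind the first point (function and values are made up): the scaling parameters are fitted on the training data only and then passed in explicitly, so validation/test IC50 values are transformed with the same statistics.

def minmax_scale(ic50: float, minimum: float, maximum: float) -> float:
    return (ic50 - minimum) / (maximum - minimum)


train_min, train_max = 0.01, 10.0  # fitted on the training set only
scaled = minmax_scale(2.5, train_min, train_max)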

Remove invalid SMILES

At the moment, the SMILESDataset setup fails if there is a single invalid SMILES.

We should do two things:

  1. Raise a warning whenever an invalid SMILES is encountered. The transforms (like canonicalization) will not be applied and the raw SMILES will be tokenized (@jannisborn).
  2. Create a script that parses a .smi file and removes all invalid SMILES (@drugilsberg); a sketch follows below.
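
A rough sketch of the cleaning script from point 2 (file names are placeholders): keep only the lines of a .smi file whose SMILES RDKit can parse.

from rdkit import Chem

with open('input.smi') as fin, open('cleaned.smi', 'w') as fout:
    for line in fin:
        if not line.strip():
            continue
        smiles = line.split()[0]  # .smi: SMILES first, optional name after
        if Chem.MolFromSmiles(smiles) is not None:
            fout.write(line)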

Add TMAP to project dependencies

The visualization module uses TMAP (alongside faerun) to generate the graphs.

It should be added to the project. However, it is not available via pip, and I am not sure what the clean way of adding it is (or of adding it to the CI Docker image).

Routines to convert FASTA and SMILES

We should have routines for converting between AA sequences and SMILES (where possible). They should raise warnings if the conversion fails.

I'm not sure we need them as transforms in pytoda.smiles.transforms at the moment; I would rather write them as simple functions in pytoda.smiles.utils.
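
A minimal sketch of the AA-sequence-to-SMILES direction using RDKit (not the final pytoda API; a real utility would warn rather than raise where appropriate):

from rdkit import Chem


def sequence_to_smiles(sequence: str) -> str:
    mol = Chem.MolFromSequence(sequence)
    if mol is None:
        raise ValueError(f'Could not convert sequence: {sequence}')
    return Chem.MolToSmiles(mol)


print(sequence_to_smiles('GAV'))  # SMILES of the Gly-Ala-Val tripeptide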

PyPI release

With rdkit now being installable from PyPI, nothing prevents us from releasing pytoda via PyPI :)

  • twine/pypi setup
  • simplifying development installations
  • simplifying CI setup in GA workflows
  • releasing version 1.0 ? (any thoughts @drugilsberg @C-nit ?)

Documentation

Docstrings should be deployed and hosted on an external site, e.g. with readthedocs. No major update of readme/docstring/examples needed.

ProteinLanguage error handling

Currently, ProteinLanguage raises an error if it encounters an unknown token at runtime:

torch.tensor(token_indexes, dtype=self.dtype, device=self.device)
TypeError: an integer is required (got type str)
  • The error message is cryptic.
  • The position should be filled with an unknown token and a warning should be raised (see the sketch below).
  • If iterate_dataset is True, this issue should be detected at object construction (which is currently not the case).
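
A sketch of the proposed behaviour (token_to_index and unknown_token are assumed attributes of the language object, not necessarily pytoda's actual names): map unseen tokens to the unknown token and warn, instead of failing later inside torch.tensor with a cryptic TypeError.

import warnings


def to_index(language, token: str) -> int:
    if token not in language.token_to_index:
        warnings.warn(f'Unknown token {token!r}; using {language.unknown_token}.')
        token = language.unknown_token
    return language.token_to_index[token]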

Tensor shape bug

I think an error can occur in SMILESDataset if augmentation is used and the longest SMILES of the dataset is augmented into a form that is longer than the original. If max_token_length is 100, everything will be padded to 100, but if an augmented SMILES becomes longer than that, it does not fit.

We now deal with this case by running once over the dataset and adjusting max_token_length, but we should perform multiple augmentations of the longest sequence, just to be safe (see the sketch below).
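
A self-contained sketch of that safeguard (character length is used as a crude proxy for token length; a real fix would use pytoda's own augmentation transform and tokenizer):

from rdkit import Chem


def randomize(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)


longest = 'CC(=O)Oc1ccccc1C(=O)O'  # placeholder for the dataset's longest SMILES
max_token_length = max(len(randomize(longest)) for _ in range(32))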

selfies 1.0 release

Our tests for selfies tokenization fail with newer versions of the selfies dependency.

FAIL: test_tokenize_selfies (pytoda.smiles.tests.test_processing.TestProcessing)
Test tokenize_selfies.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dow/public_dev/paccmann_datasets/pytoda/smiles/tests/test_processing.py", line 142, in test_tokenize_selfies
    self.assertListEqual(tokenize_selfies(selfies), ground_truth)
AssertionError: Lists differ: ['[C]', '[C]', '[=N]', '[O]', '[C]', '[Expl=Ring1]', '[Branch1_1]'] != ['[c]', '[c]', '[n]', '[o]', '[c]', '[Ring1]', '[Ring2]']

First differing element 0:
'[C]'
'[c]'

- ['[C]', '[C]', '[=N]', '[O]', '[C]', '[Expl=Ring1]', '[Branch1_1]']
?    ^      ^      ^^      ^      ^      -----           ^^^ ^^^^^

+ ['[c]', '[c]', '[n]', '[o]', '[c]', '[Ring1]', '[Ring2]']
?    ^      ^      ^      ^      ^                 ^^ ^^


We should require a version >1 in setup.py, pin an up-to-date version in requirements.txt, and adapt the tests to the correct selfies tokens.
See https://github.com/aspuru-guzik-group/selfies/blob/master/CHANGELOG.md

Besides these obvious tasks, there is now selfies.utils.split_selfies.
Could it replace our implementation of tokenize_selfies altogether, @jannisborn?
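
For reference, a small usage sketch (the exact tokens depend on the installed selfies version):

import selfies as sf

tokens = list(sf.split_selfies(sf.encoder('c1ccccc1')))
print(tokens)  # e.g. ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']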

Bump selfies to 2.1

To avoid inconsistencies with downstream packages that use a higher selfies version (e.g., the PaccMann generative models).

ProteinSequence dataset setup

Based on a discussion with @YoelShoshan, the protein-sequence dataset can be improved:

  • Even if iterate_dataset is False, it iterates the dataset and adds tokens.
  • If we iterate the dataset, we can check for lowercase characters or non-string items and raise warnings/errors.

Dynamic type checking in protein_language.sequence_to_token_indexes with a single instance: I profiled this and it doesn't add significant runtime, whereas doing the same via the pydantic decorator validate_arguments increases runtime 4-5x (for short amino acid sequences).

Customizable column names in Datasets

Many datasets such as DrugAffinityDataset or DrugSensitivityDataset are packaged with heavy assumptions about the column names of the labelled .csv files.

E.g., see here:

drug_affinity_filepath (str): path to drug affinity
                .csv file. Currently, the only supported format is .csv,
                with an index and three header columns named: "ligand_name",
                "sequence_id", "label".

I think these three column names are nice defaults, but the user should be allowed to pass them in as arguments. E.g., for DrugSensitivityDataset, the label column is called IC50, which suggests a float and thus a regression task, although the dataset should be agnostic to that. One could obviously abuse the column name, but that's not neat.
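
A sketch of the proposed flexibility (read_labels and column_names are hypothetical, not pytoda's current API): the expected header names are supplied by the caller instead of being hard-coded.

import pandas as pd


def read_labels(filepath, column_names=('ligand_name', 'sequence_id', 'label')):
    df = pd.read_csv(filepath, index_col=0)
    return df[list(column_names)]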

Truly Lazy Datasets (netcdf, hdf5)

To train on large DBs like ZINC, we should create a super lazy SMILES dataset class, conceptually like this:

import os

from torch.utils.data import Dataset


class _SMILESLDataset(Dataset):
    def __init__(self, data_dir):
        # one file (shard) per item, sorted for a deterministic order
        self.data_files = sorted(
            os.path.join(data_dir, name) for name in os.listdir(data_dir)
        )

    def __getitem__(self, idx):
        # load_file stands in for the actual shard reader (e.g. netcdf/hdf5)
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)

Then we can create a loader with multiple workers.
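
Usage sketch (directory and batch size are placeholders): the worker processes read the shards lazily from disk instead of everything being held in memory.

from torch.utils.data import DataLoader

loader = DataLoader(_SMILESLDataset('zinc_shards/'), batch_size=1, num_workers=8)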

I think it should not support any RDKit operations or SMILES language transforms. SELFIES should be supported though.
