datamol-io / splito

Machine Learning dataset splitting for life sciences.

Home Page: https://splito-docs.datamol.io/

License: Apache License 2.0

Python 100.00%

splito's Issues

Support train/test/validation splitting

As far as I can tell, the splitters will only do train/test splits. It would be really useful to allow for a third validation split.
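
One way to picture the request is to chain two train/test splits. The snippet below is only a sketch: `two_way_split` and `three_way_split` are hypothetical names, not part of splito, and the plain shuffle stands in for whichever splitter is actually used.

```python
import random
from typing import List, Sequence, Tuple


def two_way_split(
    indices: Sequence[int], test_size: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Stand-in for any train/test splitter: shuffle, then cut off the test part."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def three_way_split(
    n_samples: int, val_size: float = 0.1, test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int], List[int]]:
    """Derive train/validation/test indices from two consecutive two-way splits."""
    train_val, test = two_way_split(range(n_samples), test_size, seed)
    # val_size is a fraction of the full dataset, but the second split only
    # sees what is left after removing the test set, so rescale it.
    relative_val_size = val_size / (1.0 - test_size)
    train, val = two_way_split(train_val, relative_val_size, seed + 1)
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The rescaling step is the part that is easy to get wrong when doing this by hand, which is one argument for supporting it in the library.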

Unpinned dependencies

At the moment, neither the dependencies in pyproject.toml nor the developer dependencies in env.yaml are pinned. This can create subtle discrepancies between developers' environments and the pip package, leading to irreproducible bugs and software rot.

Solution: Pin the dependencies. Install the environment, run pip freeze, and pin the installed versions in the .toml and .yaml files.

I like pip-tools, but I haven't used it for .toml and conda environments.

Alternatively, it could be postponed, e.g. until a stable release, but we need to be aware of these subtleties.
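
For illustration, the "look up what is installed and pin it" step could also be scripted with the standard library instead of copying from pip freeze by hand. This is a rough sketch: `pin_requirements` is a hypothetical helper, and the version numbers in the demo are made up so that the example does not depend on the local environment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, Iterable, List


def pin_requirements(
    names: Iterable[str], lookup: Callable[[str], str] = version
) -> List[str]:
    """Turn bare package names into `name==version` pins.

    Packages that cannot be found are left unpinned rather than guessed.
    """
    pinned = []
    for name in names:
        try:
            pinned.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            pinned.append(name)
    return pinned


# Demo with a fake lookup; the versions below are placeholders, not real pins.
fake_versions: Dict[str, str] = {"numpy": "1.2.3", "pandas": "4.5.6"}


def fake_lookup(name: str) -> str:
    if name not in fake_versions:
        raise PackageNotFoundError(name)
    return fake_versions[name]


pinned = pin_requirements(["numpy", "pandas", "not-installed"], lookup=fake_lookup)
print("\n".join(pinned))
```

Calling `pin_requirements(names)` without the `lookup` argument queries the active environment through `importlib.metadata.version`.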

Functional forms of splitters

Just a half-baked thought from the #9 discussion: It might be useful to have splitters in functional forms similar to torch.nn.functional.

Most of the splitters can be stateless, so we could provide functions that create a splitter object, call .split(), and return the results. This would simplify usage, but it would also add a second interface for the same task, which runs against PEP 20 (there should be one obvious way to do it).
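
To make the idea concrete, here is a small sketch of a class-based splitter next to its functional form. `ShuffleSplitter` and `shuffle_split` are hypothetical names for illustration and do not mirror an existing splito class.

```python
import random
from typing import Iterator, List, Sequence, Tuple


class ShuffleSplitter:
    """Hypothetical stateless splitter with an sklearn-like .split() method."""

    def __init__(self, test_size: float = 0.2, seed: int = 0):
        self.test_size = test_size
        self.seed = seed

    def split(self, X: Sequence) -> Iterator[Tuple[List[int], List[int]]]:
        indices = list(range(len(X)))
        random.Random(self.seed).shuffle(indices)
        n_test = round(len(indices) * self.test_size)
        yield indices[n_test:], indices[:n_test]


def shuffle_split(
    X: Sequence, test_size: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Functional form: build the splitter, call .split(), return the first result."""
    return next(ShuffleSplitter(test_size=test_size, seed=seed).split(X))


train_idx, test_idx = shuffle_split(list("abcdefghij"), test_size=0.3)
print(train_idx, test_idx)
```

The function adds no behaviour of its own, which is exactly the trade-off raised above: less typing for the user, but two entry points that must be kept in sync.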

SIMPD new implementation

In short, the idea is to develop a new approach that optimizes a dataset split according to multiple objectives and constraints. An example of such a genetic algorithm (GA) approach has been proposed at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189 and is already implemented in partitio.simpd.
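
As a toy illustration of "optimizing a split against several objectives", the sketch below scores a candidate split with two penalties and improves it by random single-sample mutations. This is not the SIMPD algorithm: the objectives, the way they are summed into one cost, and the names `split_cost` and `evolve_split` are all assumptions chosen to keep the example short.

```python
import random
from typing import List, Sequence, Tuple


def split_cost(
    mask: Sequence[bool], labels: Sequence[int], target_test_frac: float = 0.2
) -> float:
    """Sum of two penalties, lower is better. mask[i] is True for test samples."""
    n_test = sum(mask)
    n_train = len(mask) - n_test
    if n_test == 0 or n_train == 0:
        return float("inf")  # constraint: both sets must be non-empty
    # Objective 1: test-set size close to the target fraction.
    size_penalty = abs(n_test / len(mask) - target_test_frac)
    # Objective 2: similar fraction of positive labels in train and test.
    pos_test = sum(l for l, m in zip(labels, mask) if m) / n_test
    pos_train = sum(l for l, m in zip(labels, mask) if not m) / n_train
    return size_penalty + abs(pos_test - pos_train)


def evolve_split(
    labels: Sequence[int], n_steps: int = 2000, seed: int = 0
) -> Tuple[List[bool], float]:
    """Mutation-only search: flip one assignment, keep the child if not worse."""
    rng = random.Random(seed)
    best = [rng.random() < 0.5 for _ in labels]
    best_cost = split_cost(best, labels)
    for _ in range(n_steps):
        child = list(best)
        i = rng.randrange(len(child))
        child[i] = not child[i]
        cost = split_cost(child, labels)
        if cost <= best_cost:
            best, best_cost = child, cost
    return best, best_cost


labels = [1] * 15 + [0] * 35
mask, cost = evolve_split(labels)
print(sum(mask), round(cost, 3))
```

A real multi-objective GA would keep a population and treat the objectives separately instead of adding them up, so this only conveys the general shape of the problem.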

Make a general function for all splitting strategies.

Similar to sklearn.model_selection.train_test_split, implement a function general_split (or train_test_split) that can perform any kind of splitting.

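from typing import List, Literal, Sequence, Tuple, Union

import datamol


def general_split(
    mols: Sequence[Union[datamol.Mol, str]],
    test_size: Union[float, int],
    splitting_method: Literal["random", "scaffold", "kmeans"],
    random_state: int = 42,
    n_jobs: int = 0,
    *args,
    **kwargs,
) -> Tuple[List[int], List[int]]:
    # Placeholder body: dispatch on `splitting_method` and compute the indices here.
    print("Do some magic")
    train_idx, test_idx = [], []
    return train_idx, test_idx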
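
The "magic" could be a simple name-to-function registry. The sketch below is dependency-free and only registers a "random" method; the registry, the `register` decorator, and the reduced signature are assumptions for illustration, and "scaffold" or "kmeans" would be registered the same way on top of the real splitters.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, Union

Split = Tuple[List[int], List[int]]
_SPLITTERS: Dict[str, Callable[[Sequence, int, int], Split]] = {}


def register(name: str):
    """Decorator that makes a split function available under `name`."""
    def decorator(fn):
        _SPLITTERS[name] = fn
        return fn
    return decorator


@register("random")
def _random_split(items: Sequence, n_test: int, random_state: int) -> Split:
    indices = list(range(len(items)))
    random.Random(random_state).shuffle(indices)
    return indices[n_test:], indices[:n_test]


def general_split(
    items: Sequence,
    test_size: Union[float, int],
    splitting_method: str = "random",
    random_state: int = 42,
) -> Split:
    if splitting_method not in _SPLITTERS:
        raise ValueError(f"Unknown splitting method: {splitting_method!r}")
    # Like sklearn, treat a float as a fraction and an int as an absolute count.
    if isinstance(test_size, float):
        n_test = round(len(items) * test_size)
    else:
        n_test = test_size
    return _SPLITTERS[splitting_method](items, n_test, random_state)


smiles = ["C", "CC", "CCC", "CCCC", "CCCCC"]
train_idx, test_idx = general_split(smiles, 0.4, "random")
print(train_idx, test_idx)
```

With this layout, adding a new strategy does not require touching `general_split` itself.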

Add proxy functions for common splitting methods such as random

Forward these calls from here to sklearn. This is only for convenience, so that users do not have to import sklearn themselves.

That being said, I am not 100% convinced we should do that, so I am marking this as low priority for now.
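
A proxy of this kind could be as thin as the sketch below. `random_split` is a hypothetical name; the only real API used is `sklearn.model_selection.ShuffleSplit`, which is imported inside the function so that sklearn is only needed when the proxy is actually called.

```python
from typing import Optional, Sequence


def random_split(X: Sequence, test_size: float = 0.2, random_state: Optional[int] = None):
    """Forward to sklearn's ShuffleSplit and return one (train_idx, test_idx) pair."""
    # Imported lazily: callers never have to import sklearn themselves.
    from sklearn.model_selection import ShuffleSplit

    splitter = ShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    return next(splitter.split(X))
```

Usage would then look like `train_idx, test_idx = random_split(smiles, test_size=0.2, random_state=0)`, with both results being NumPy index arrays as returned by sklearn.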

Add support for SPECTRA

This package implements the spectral framework for model evaluation. All you need to get started is (1) a model, (2) a dataset, and (3) a definition of sample to sample similarity!
The SPECTRA package generates a series of splits with decreasing train-test similarity. Evaluating your models on these splits will give a better understanding of model generalizability. Read the preprint for more info on how this works.

See https://github.com/mims-harvard/SPECTRA and https://twitter.com/YEktefaie/status/1782449554077647054
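
To give a feel for what "a series of splits with decreasing train-test similarity" means, here is a toy sketch. It is not the SPECTRA algorithm and does not use the SPECTRA package: it simply drops training samples that are too similar to any test sample and sweeps the threshold. The function names and the character-set similarity are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def filtered_split(
    samples: Sequence[str],
    test_idx: Sequence[int],
    similarity: Callable[[str, str], float],
    max_similarity: float,
) -> Tuple[List[int], List[int]]:
    """Keep a training sample only if it is at most `max_similarity` to every test sample."""
    test_set = set(test_idx)
    train = [
        i
        for i in range(len(samples))
        if i not in test_set
        and all(similarity(samples[i], samples[j]) <= max_similarity for j in test_idx)
    ]
    return train, list(test_idx)


def jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of the character sets of two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


samples = ["CCO", "CCN", "CCC", "c1ccccc1", "OCC", "NCCN"]
test_idx = [0]
# Lower thresholds give training sets that look less and less like the test set.
splits = [filtered_split(samples, test_idx, jaccard, t) for t in (1.0, 0.5, 0.0)]
for train, test in splits:
    print(train, test)
```

Evaluating a model on each split in turn shows how quickly performance drops as the training data moves away from the test data, which is the kind of curve the SPECTRA description refers to.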
