datamol-io / splito
Machine Learning dataset splitting for life sciences.
Home Page: https://splito-docs.datamol.io/
License: Apache License 2.0
As far as I can tell, the splitters will only do train/test splits. It would be really useful to allow for a third validation split.
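One way to get a validation split out of a two-way splitter is to apply it twice: first hold out the test set, then hold out the validation set from what remains. The sketch below illustrates the idea with a stand-in random splitter; `random_split` and `train_val_test_split` are hypothetical names, not part of splito, and any train/test splitter could take the place of the stand-in.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    indices: Sequence[int], holdout_size: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Stand-in two-way splitter (hypothetical, not the splito API)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_size)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def train_val_test_split(
    n_samples: int, val_size: float = 0.1, test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int], List[int]]:
    # Step 1: hold out the test set from all samples.
    train_val, test = random_split(range(n_samples), test_size, seed)
    # Step 2: hold out the validation set from the remainder. The fraction is
    # rescaled so that val_size stays relative to the full dataset.
    relative_val_size = val_size / (1.0 - test_size)
    train, val = random_split(train_val, relative_val_size, seed + 1)
    return train, val, test


train_idx, val_idx, test_idx = train_val_test_split(100)
```

The same two-stage pattern should carry over to structure-aware splitters, with the caveat that the second split then only sees the molecules left after the first.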
At the moment, neither the dependencies in project.toml nor the developer dependencies in env.yaml are pinned. This might create subtle discrepancies between developers and the pip package, leading to unreproducible bugs and software rot.
Solution: Pin the dependencies. Install the environment, run pip freeze, and pin the installed versions in the .toml and .yaml files.
I like pip-tools, but I haven't used it with .toml files or conda environments.
Alternatively, it could be postponed, e.g. until a stable release, but we need to be aware of these subtleties.
Just a half-baked thought from the #9 discussion: It might be useful to have splitters in functional forms similar to torch.nn.functional.
Most of the splitters can be stateless, so we could create functions that create a splitter object, call .split(), and then return the results. This could simplify usage, but it would create an additional interface, which goes against PEP 20 ("There should be one-- and preferably only one --obvious way to do it").
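A minimal sketch of what such a functional form could look like is shown below. `ShuffleSplitter` and `shuffle_split` are hypothetical stand-ins written for illustration, not existing splito classes or functions; the only assumption about a real splitter is that it exposes a `.split()` method yielding `(train_idx, test_idx)` pairs.

```python
import random
from typing import Iterator, List, Sequence, Tuple


class ShuffleSplitter:
    """Hypothetical stand-in for a stateless splitter with a `.split()` method."""

    def __init__(self, test_size: float = 0.2, random_state: int = 42):
        self.test_size = test_size
        self.random_state = random_state

    def split(self, X: Sequence) -> Iterator[Tuple[List[int], List[int]]]:
        indices = list(range(len(X)))
        random.Random(self.random_state).shuffle(indices)
        n_test = round(len(indices) * self.test_size)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


def shuffle_split(
    X: Sequence, test_size: float = 0.2, random_state: int = 42
) -> Tuple[List[int], List[int]]:
    """Functional form: build the splitter, call `.split()`, return the first split."""
    splitter = ShuffleSplitter(test_size=test_size, random_state=random_state)
    return next(splitter.split(X))


smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCC"]
train_idx, test_idx = shuffle_split(smiles, test_size=0.2)
```

The wrapper adds no behavior of its own, which is both its appeal (one call instead of three) and the reason it duplicates the class-based interface.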
Hi,
I have been trying to plot a scaffold split using visualize_chemspace (as mentioned in the tutorial), but the visualize_chemspace code is not in utili.py. Has it changed?
I went with partitio to move forward, but if you have better ideas, I am open!
ping @datamol-io/partitio-maintain-team
See polaris-hub/polaris#20 (comment) for context
In short, the idea is to develop a new approach that optimizes a dataset split according to multiple objectives and constraints. An example of such a genetic algorithm (GA) approach has been proposed at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189 and is already implemented at partitio.simpd.
Similar to the sklearn.train_test_split method, implement another function, general_split or train_test_split, that does any kind of splitting.
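    from typing import Literal, Sequence, Union

    import datamol


    def general_split(
        mols: Sequence[Union[datamol.Mol, str]],
        test_size: Union[float, int],
        splitting_method: Literal["random", "scaffold", "kmeans"],
        random_state: int = 42,
        n_jobs: int = 0,
        *args,
        **kwargs,
    ):
        # Placeholder body: the splitter named by `splitting_method` would run here.
        print("Do some magic")
        train_idx, test_idx = [], []
        return train_idx, test_idx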
Forward from here to sklearn. This is only for convenience and to avoid the sklearn imports.
That being said, I am not 100% convinced we should do that so putting this as low priority for now.
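For illustration, the dispatch inside such a function could be a simple registry lookup. The sketch below is an assumption about how the proposal might be filled in, not an existing splito function: only a standard-library `"random"` method is implemented, `_SPLITTERS` and `_random_split` are hypothetical names, and `"scaffold"` or `"kmeans"` would be registered the same way. `n_jobs` and the extra `*args`/`**kwargs` are left out for brevity.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, Union

SplitFn = Callable[[Sequence, int, int], Tuple[List[int], List[int]]]


def _random_split(
    mols: Sequence, n_test: int, random_state: int
) -> Tuple[List[int], List[int]]:
    indices = list(range(len(mols)))
    random.Random(random_state).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# Registry mapping method names to implementations (hypothetical).
_SPLITTERS: Dict[str, SplitFn] = {"random": _random_split}


def general_split(
    mols: Sequence,
    test_size: Union[float, int],
    splitting_method: str = "random",
    random_state: int = 42,
) -> Tuple[List[int], List[int]]:
    if splitting_method not in _SPLITTERS:
        raise ValueError(f"Unknown splitting method: {splitting_method!r}")
    # A float is read as a fraction of the dataset, an int as an absolute count.
    if isinstance(test_size, float):
        n_test = round(len(mols) * test_size)
    else:
        n_test = test_size
    return _SPLITTERS[splitting_method](mols, n_test, random_state)


train_idx, test_idx = general_split(list("abcdefghij"), test_size=0.3)
```

Keeping the method table separate from the entry point means a new splitting strategy only needs one extra registry entry.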
This package implements the spectral framework for model evaluation. All you need to get started is (1) a model, (2) a dataset, and (3) a definition of sample to sample similarity!
The SPECTRA package generates a series of splits with decreasing train-test similarity. Evaluating your models on these splits will give a better understanding of model generalizability. Read the preprint for more info on how this works.
See https://github.com/mims-harvard/SPECTRA and https://twitter.com/YEktefaie/status/1782449554077647054