lewisacidic / scikit-chem
A high level cheminformatics package for the Scientific Python stack, built on RDKit.
Home Page: http://scikit-chem.readthedocs.io/en/latest/index.html
License: Other
They are atom descriptors.
Chemaxon wrappers for:
MOE wrappers for:
All use similar code, so should be dried up.
Things that are duplicated:
Issues/inconsistencies to be considered:
A higher level command line interface should be able to handle these issues.
The CI scripts have been failing for a long time now, as the install doesn't fetch the example ames.sdf file that isn't checked into VCS. The new dataset module system should fix this, once we adapt the tests, but the CI should be reassessed as it was created by a very inexperienced young @richlewis42.
The testing framework is currently py.test. We should have shared testing modules that provide appropriate testing data, so that fixtures are not reproduced for every test.
It would be great to do some Three.js 3d rendering of molecules.
SVG would probably be the best default rendering format for molecules.
We should make all keyword arguments explicit, and document them. This should be done at least for all wrappers of rdkit objects, and for all concrete classes.
We should write something a bit more positive and concrete in the README.
Code health has slipped massively due to time pressures on my PhD. This should be cleaned up when there is time.
scikit-chem releases v0.0.3-0.0.5 were very close together. This is because they were broken - exhaustive tests were not run before they were released!
Release system should:
setup.py and __init__.py as it is now).
The io package could do with a refactor to unify the API, and allow for more flexibility.
It would be good to be able:
pandas manages it - i.e. download it from the web if given a URL.
Doctests don't work in CI for chemaxon. This could be fixed by either:
Primary solution looks easier, and fits into the wider plan. Temporary solution is skipping all doctest lines where the tools are used.
Default morgan fingerprints are floats, but should be np.uint8s.
Defaults should be factored into a config file. We could use a .skchem.yaml, or a .skchem/config.py.
We should check parameters for all objects on initialisation to be sure that they are of the expected type. At the moment, one can do:
mf = skchem.descriptors.MorganFeaturizer(radius='break please')
And no warnings or errors are thrown.
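A minimal sketch of what such a check could look like. CheckedFeaturizer, its default values and the _check_int helper are hypothetical names for illustration, not part of skchem:

```python
class CheckedFeaturizer:
    """Hypothetical featurizer that validates its parameters on creation."""

    def __init__(self, radius=2, n_feats=2048):
        # Defaults here are assumptions made for the example only.
        self.radius = self._check_int('radius', radius, minimum=0)
        self.n_feats = self._check_int('n_feats', n_feats, minimum=1)

    @staticmethod
    def _check_int(name, value, minimum):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(
                f"{name} must be an int, got {type(value).__name__}")
        if value < minimum:
            raise ValueError(f"{name} must be >= {minimum}, got {value}")
        return value


# The bad call from above now fails loudly instead of passing silently.
try:
    CheckedFeaturizer(radius='break please')
except TypeError as err:
    message = str(err)
```

Raising in the constructor keeps the failure close to the mistake, rather than deep inside a later transform call.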
fuel comes with quite a few dependencies that are unnecessary. We could 'polyfill' the basic in-memory functionality it provides easily, whilst offering the more advanced functionality as an optional dependency.
External tools should have a mixin that checks if it is installed on creation of the object.
#9
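The install check for external tools could be sketched as a mixin along these lines. ExternalToolMixin, MissingTool and the executable name are hypothetical. The only library call relied on is shutil.which from the standard library, which returns None when a command is not on the PATH:

```python
import shutil


class ExternalToolMixin:
    """Hypothetical mixin: fail early if a command line tool is missing."""

    # Subclasses set this to the executable name, e.g. 'cxcalc'.
    executable = None

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Resolve the tool once, when the object is created.
        self.executable_path = self.find_executable()

    @classmethod
    def is_installed(cls):
        return shutil.which(cls.executable) is not None

    @classmethod
    def find_executable(cls):
        path = shutil.which(cls.executable)
        if path is None:
            raise RuntimeError(
                f"{cls.executable!r} was not found on the PATH; "
                "is it installed?")
        return path


class MissingTool(ExternalToolMixin):
    executable = 'skchem-no-such-tool'  # deliberately absent, for the demo


installed = MissingTool.is_installed()
try:
    MissingTool()
    error = None
except RuntimeError as err:
    error = str(err)
```

The is_installed classmethod also gives tests and doctests a cheap way to skip themselves when the tool is not available.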
Scikit-learn is changing their cross validation API. We should support their new one with the cross validation classes that are currently supported.
Conformers should have views.
There should be much better support for sparse features. We probably can't use pandas, unless they get their scipy.sparse wrapping good, or their sparse format is more widely adopted.
These should all inherit from a base class, implementing a consistent transform public method.
We should make sure Google Style conformant docstrings are written for all public objects. We should list here those that are not yet written.
Atom features could intelligently determine their max_atoms.
Before advertising the package, we need to improve the README.md.
Descriptors might be considered distinct from fingerprints (descriptors != fingerprints). Featurizers might be considered a better name.
Test coverage is very low, as we don't test the data module. We should test it!
Scikit-chem is designed to make doing cheminformatics things easier. We should provide wrappers for proprietary libraries. A good example of this would be the already implemented ChemAxonStandardizer.
Essentially, proprietary tools usually require .sdf files.
tempfile builtin.
subprocess builtin, outputting to another temp file.
io module.
Suitable proprietary tools:
sddesc
cxcalc
generatemd
There should be a consistent API that checks for the tool, and allows for configuration options for where/how the tool is installed to be specified.
Potentially, these could be extensions, so that those without the tools don't end up trying to use them.
For example: skchem-chemaxon-standardizer could be a separate package.
We will need to make a more advanced pipelining module than that available in scikit-learn, as we need to filter instances also.
There should be a mechanism to serialise and deserialise all skchem objects, probably using json (we're not using pickle!!). This will be most easily achieved by having a scikit-learn style hyperparameter introspection check in the constructor.
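One way this could work, sketched under the assumption that every hyperparameter is a constructor argument stored on the instance under the same name (the scikit-learn convention). ParamsMixin and ToyFeaturizer are hypothetical names, and only JSON-compatible parameter values are handled:

```python
import inspect
import json


class ParamsMixin:
    """Hypothetical base: hyperparameters are the constructor arguments."""

    def get_params(self):
        # Read parameter names from the constructor signature.
        names = inspect.signature(type(self).__init__).parameters
        return {n: getattr(self, n) for n in names if n != 'self'}

    def to_json(self):
        payload = {'class': type(self).__name__,
                   'params': self.get_params()}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        payload = json.loads(text)
        if payload['class'] != cls.__name__:
            raise ValueError(
                f"expected {cls.__name__}, got {payload['class']}")
        return cls(**payload['params'])


class ToyFeaturizer(ParamsMixin):
    def __init__(self, radius=2, n_feats=2048):
        self.radius = radius
        self.n_feats = n_feats


text = ToyFeaturizer(radius=3).to_json()
restored = ToyFeaturizer.from_json(text)
```

Because only plain parameter values are written out, loading a file never executes arbitrary code, which is the main reason to prefer this over pickle.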
Issues occur when index contains repeats.
e.g.
import pandas as pd
import skchem
std = skchem.standardizers.ChemAxonStandardizer()
ms = pd.Series([skchem.Mol.from_smiles('CC'), skchem.Mol.from_smiles('C')], index=['', ''])
The library has been written in a hurry to provide specific functions for my PhD. This has resulted in some ugly, untested code. This should be remedied! In particular:
The converter module should be refactored to offer more flexibility.
It might be nice to be able to add features after the package is generated.
This could be done by allowing a Converter to set itself up from a HDF5 file, rather than making it from scratch.
e.g. with preexisting data.h5, extra features and splits could be added as:
conv = Converter(..., output_path='data.h5')
conv.features += skchem.descriptors.MorganFingerprinter()
conv.splits += pd.Series([True, False ...])
This would probably be easiest once we have unique string representations for featurizers.
The interact module should be factored out, as the ipywidgets dependency is quite large.
We should allow for the standardizer to be configurable in python, rather than relying on a config file.
We should provide a simple function/class in utils to dry up the accessors, i.e.
uff = skchem.forcefields.get('uff')
mfper = skchem.descriptors.get('morgan')
Should all use the same interface. This could:
Preprocessed datasets should be hosted as:
Forcefields should not be a top level package name. The actual functionality provided is more like "molecular geometry optimisation", for which forcefields(/molecular mechanics) are just one technique. We should rename the package to something more general.
Perhaps geometry_optimisation, then molecular_mechanics as a sub package?
We could create a guide for the different types of features, standardisers, and have it be an informative place to learn cheminformatics in its own right.
Objects should have a verbosity argument, which controls whether they print loading bars.
We could have a plugin infrastructure, for each of the external tools, and for example the interact module.
We removed structure as a required name for the transformers by requiring the series. It may be acceptable to bring this back, but the name must be configurable as a keyword argument.
This must also be done in IO, where there is currently no option to name the column the compounds go into.
Both PCF and AF take a collection of ('name', func) pairs. The mechanism of treating them should be dried up and enhanced.
We should be able to add default implemented features by string name, and also add those implemented as functions.
It would be good to have a unique string representation for featurizers.
Perhaps:
Default: morg
With radius=3: morg-radius=3
With radius=3, n_feats=1024: morg-radius=3-n_feats=1024
etc.
This would be good to do with the JSON/YAML serialization part.
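The naming scheme proposed above could be generated by comparing each stored parameter with its constructor default. This is only a sketch: featurizer_id, ToyMorgan and the default values radius=2 and n_feats=2048 are assumptions made for illustration:

```python
import inspect


def featurizer_id(obj, short_name):
    """Build an id from the parameters that differ from their defaults."""
    parts = [short_name]
    signature = inspect.signature(type(obj).__init__)
    for name, param in signature.parameters.items():
        if name == 'self':
            continue
        value = getattr(obj, name)
        # Only non-default values appear in the id, in signature order.
        if value != param.default:
            parts.append(f"{name}={value}")
    return '-'.join(parts)


class ToyMorgan:
    # Hypothetical stand-in for a Morgan featurizer.
    def __init__(self, radius=2, n_feats=2048):
        self.radius = radius
        self.n_feats = n_feats


default_id = featurizer_id(ToyMorgan(), 'morg')
radius_id = featurizer_id(ToyMorgan(radius=3), 'morg')
both_id = featurizer_id(ToyMorgan(radius=3, n_feats=1024), 'morg')
```

With these assumed defaults the three ids come out as morg, morg-radius=3 and morg-radius=3-n_feats=1024. Using the signature order makes the id deterministic, although values that themselves contain a hyphen would make the string ambiguous to parse back.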
The base Converter class should have responsibilities for filtering, standardising and producing splits removed from it. These should be implemented individually for datasets.
That said, a default pipeline should be provided, that will be used for all the datasets at the moment, but it may not be appropriate for all datasets!
Add option to parallelise all transformers by calling transform on splits using pool.
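A rough sketch of the idea: split the input into chunks, call transform on each chunk in a pool, then concatenate the results in order. It uses a thread pool from the standard library so that it runs anywhere. A process pool is the usual choice for CPU-bound Python work, but would need the transformer to be picklable. The function names and the toy transform are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty chunks when there are few items
            chunks.append(items[start:stop])
        start = stop
    return chunks


def parallel_transform(transform, items, n_jobs=2):
    """Apply transform to each chunk in a pool and concatenate in order."""
    chunks = split_chunks(list(items), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = pool.map(transform, chunks)  # map preserves chunk order
        return [row for chunk in results for row in chunk]


def count_carbons(smiles_chunk):
    # Toy stand-in for a real transformer's transform method.
    return [s.count('C') for s in smiles_chunk]


counts = parallel_transform(
    count_carbons, ['CC', 'C', 'CCO', 'O', 'CCCC'], n_jobs=2)
```

Keeping the chunks contiguous means the output order matches the input order, so an index can be reattached afterwards.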
There should be good integration with web based tools. For example, we should be able to load compounds from at least:
using their web services. This could potentially be easily done by wrapping PubChemPy, ChemSpiPy etc.
I've been force pushing changes to master when I manage to break things (usually only until I get the results back from CI), assuming I am the only one working on the project. I am planning on using separate branches for separate features, which will let me get CI results without corrupting master. If anyone is considering working on the project with me, please let me know and I will stop - or fork the project on github to let me know!
We should also wrap the multiple conformation generator functionality.
Organic filter became slow when it subclassed element filter. It should be made great again, preferably without use of walls financed by foreign nations.
We will need online documentation. This will probably have to be done with sphinx, but could be done with mkdocs.