The scikit-chem from lewisacidic

Move FeatureInvariants and ConnectivityInvariants to AtomDescriptors

They are atom descriptors.

Dry up command line wrappers

Chemaxon wrappers for:

Atom features
Molecular Features
NMR spectra
Standardiser

MOE wrappers for:

Molecular Features

All use similar code, so should be dried up.

Things that are duplicated:

Writing to temp files
Calling the command line
Reading the file back

Issues/inconsistencies to be considered:

Monitoring: some tools give an SDF back, others a csv, others the NMR formatted data
Reading from stdout or from a file
Stderr handling

A higher level command line interface should be able to handle these issues.
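
The duplicated steps listed above could be pulled into one base class. The following is only a minimal sketch: CommandLineTool, UpperCaseTool and their method names are hypothetical and not part of skchem, and the stand-in "tool" is just the Python interpreter so that the example runs anywhere. It covers the file-in, file-out case only, not tools that write to stdout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class CommandLineTool:
    """Hypothetical base class; names are illustrative, not skchem API."""

    def command(self, in_path, out_path):
        """Return the argument list to run; subclasses override this."""
        raise NotImplementedError

    def parse(self, out_path):
        """Turn the tool's output file into a Python object."""
        return Path(out_path).read_text()

    def run(self, text):
        # Write the input to a temp file, call the tool, read the file back.
        with tempfile.TemporaryDirectory() as tmp:
            in_path = Path(tmp) / "input.txt"
            out_path = Path(tmp) / "output.txt"
            in_path.write_text(text)
            result = subprocess.run(
                self.command(str(in_path), str(out_path)),
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                # Surface stderr in one place rather than in every wrapper.
                raise RuntimeError(result.stderr)
            return self.parse(out_path)


class UpperCaseTool(CommandLineTool):
    """Stand-in for a real tool: upper-cases its input file."""

    def command(self, in_path, out_path):
        script = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        return [sys.executable, "-c", script, in_path, out_path]
```

A real wrapper would override command with the tool's arguments and parse with the appropriate reader (SDF, csv or NMR data), which keeps the format differences out of the shared code.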

The CI scripts have been failing for a long time now, as the install doesn't fetch the example ames.sdf file that isn't checked into VCS. The new dataset module system should fix this, once we adapt the tests, but the CI should be reassessed as it was created by a very inexperienced young @richlewis42.

Provide a better testing framework

The testing framework is currently py.test. We should have shared testing modules that provide appropriate testing data, so that fixtures are not reproduced for every test.

3D rendering

It would be great to do some Three.js 3D rendering of molecules.

Default _repr_html_ should be SVG

SVG would probably be the best default rendering for molecules.

Explicitly state all options as keyword arguments in constructors.

We should make all keyword arguments explicit, and document them. This should be done at least for all wrappers of rdkit objects, and for all concrete classes.

Update the README

We should write something a bit more positive and concrete in the README.

Clean up the code!

Code health has slipped massively due to time pressures on my PhD. This should be cleaned up when there is time.

Create a proper release system.

scikit-chem releases v0.0.3-0.0.5 were very close together. This is because they were broken - exhaustive tests were not run before they were released!

Release system should:

run the full battery of tests for all supported versions before release.
test pip builds (especially make sure data files get included!!)
test conda builds
make sure the versioning actually gets updated (single version for the whole package, not individual in setup.py and __init__.py as it is now).
create a setuptools command to do these
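
One common way to get a single version for the whole package is to keep it only in __init__.py and have setup.py read it from there. A minimal sketch, assuming the version is assigned as a plain string literal; read_version is a hypothetical helper name:

```python
import re


def read_version(init_source):
    """Extract __version__ from the text of a package's __init__.py."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", init_source, re.M
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)
```

setup.py could then pass the result of read_version on the contents of the package's __init__.py as its version argument, so the number is only ever edited in one place.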

IO Refactor

The io package could do with a refactor to unify the api, and allow for more flexibility.

It would be good to be able:

to download files analogously to the way pandas manages it - i.e. download it from the web if given a URL.
allow naming the molecule column (this would be a fix for #38.)
squeeze if no properties
consistent API to not read properties
offer reading for more file formats, such as HELM, Mol2, perhaps HDF(!), chemical-json, CML.
serialise to string if no argument given

make doctests work with pytest switch

doctests don't work in CI for chemaxon. This could be fixed by either:

Refactoring the Chemaxon wrappers into their own extension module.
Adding in some doctest optional skip directive

Primary solution looks easier, and fits into the wider plan. Temporary solution is skipping all doctest lines where the tools are used.
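
For the temporary solution, the standard library's doctest module already has a SKIP directive. A minimal sketch follows; the standardize function is a hypothetical stand-in for anything that needs an external tool, and it always raises here to show that the skipped line is never executed.

```python
import doctest


def standardize(smiles):
    """Hypothetical function whose first example needs an external tool.

    >>> standardize('CC')  # doctest: +SKIP
    'CC'
    >>> standardize.__name__
    'standardize'
    """
    raise RuntimeError("external tool not installed")


# Parse and run just this docstring; the +SKIP example is not executed,
# so the RuntimeError above is never raised.
test = doctest.DocTestParser().get_doctest(
    standardize.__doc__, {"standardize": standardize}, "standardize", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

The directive has to be added to every line that uses the tool, which is why moving the wrappers into their own extension module is the cleaner long-term fix.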

Morgan fingerprints are floats

Default morgan fingerprints are floats, but should be np.uint8s.

Defaults should be factored into a config file

Defaults should be factored into a config file. We could use a .skchem.yaml, or a .skchem/config.py.

Check parameters on initialisation.

We should check parameters for all objects on initialisation to be sure that they are of the expected type. At the moment, one can do:

mf = skchem.descriptors.MorganFeaturizer(radius='break please')

And no warnings or errors are thrown.
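
A minimal sketch of what such a check could look like. MorganFeaturizerSketch is a hypothetical stand-in rather than the real skchem class, and the default values are assumptions made for illustration:

```python
class MorganFeaturizerSketch:
    """Hypothetical stand-in, not the real skchem class."""

    def __init__(self, radius=2, n_feats=2048):  # defaults are assumed
        self.radius = self._check_int("radius", radius, minimum=0)
        self.n_feats = self._check_int("n_feats", n_feats, minimum=1)

    @staticmethod
    def _check_int(name, value, minimum):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(
                f"{name} must be an int, got {type(value).__name__}"
            )
        if value < minimum:
            raise ValueError(f"{name} must be >= {minimum}, got {value}")
        return value
```

With this in place, MorganFeaturizerSketch(radius='break please') fails immediately with a TypeError instead of failing later during a transform.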

make fuel an optional dependency, or factor it out to a new module, e.g. skchem_fuel

fuel comes with quite a few dependencies that are unnecessary. We could 'polyfill' the basic in-memory functionality it provides easily, whilst offering the more advanced functionality as an optional dependency.

Validate if external tool is installed

External tools should have a mixin that checks if it is installed on creation of the object.
#9
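
A minimal sketch of such a mixin, using shutil.which from the standard library to look the tool up on PATH. ExternalToolMixin and the executable attribute are hypothetical names, and a real version would probably also need to honour a configured install location:

```python
import shutil


class ExternalToolMixin:
    """Hypothetical mixin; `executable` is the command to find on PATH."""

    executable = None

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.validate_install()

    def validate_install(self):
        # shutil.which returns None when the command cannot be found.
        if shutil.which(self.executable) is None:
            raise RuntimeError(
                f"{self.executable!r} was not found on PATH"
            )
```

A wrapper class would set executable to its tool's command name and list the mixin before its other base classes, so the check runs on creation of the object.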

cross validation API

Scikit-learn is changing their cross validation API. We should support their new one with the cross validation classes that are currently supported.

Refactor Conformers

Conformers should have views.

Better Support for Sparse Features

There should be much better support for sparse features. We probably can't use pandas unless their scipy.sparse wrapping improves, or their sparse format is more widely adopted.

Refactor descriptors, standardizers

These should all inherit from a base class, implementing a consistent transform public method.

docstrings

We should make sure Google Style conformant docstrings are written for all public objects. We should list here those that are not yet written.

skchem.forcefields.UFF
skchem.forcefields.MMFF
skchem.forcefields.ForceField
skchem.forcefields.RoughEmbedding

Atom Features max_atoms auto detect

Atom features could intelligently determine their max_atoms.

Improve README.md

Before advertising the package, we need to improve the README.md.

Aims
Scope: quick example
Features
~~Examples~~

Rename the descriptors package to the featurizers package

Descriptors might not be considered to include fingerprints. Featurizers might be considered a better name.

Test the data module!

Test coverage is very low, as we don't test the data module. We should test it!

Wrappers for proprietary tools

Scikit-chem is designed to make doing cheminformatics things easier. We should provide wrappers for proprietary libraries. A good example of this would be the already implemented ChemAxonStandardizer.

Essentially, proprietary tools usually require .sdf files.

We write a data frame to a temp file using tempfile builtin.
Run the tool using subprocess builtin, outputting to another temp file.
Read the output using the io module.

Suitable proprietary tools:

MOE descriptors sddesc
ChemAxon descriptors cxcalc
ChemAxon generatemd

There should be a consistent API that checks for the tool, and allows for configuration options for where/how the tool is installed to be specified.

Potentially, these could be extensions, so that those without the tools don't end up trying to use them.
For example:
skchem-chemaxon-standardizer could be a separate package.

Pipelining

We will need to make a more advanced pipelining module than that available in scikit-learn, as we need to filter instances also.

'Hyperparameters' and serialisation

There should be a mechanism to serialise and deserialise all skchem objects, probably using json (we're not using pickle!!). This will be most easily achieved by having a scikit-learn style hyperparameter introspection check in the constructor.
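
A minimal sketch of the idea, assuming (as scikit-learn does) that every constructor argument is stored on the object under the same name. ParamsMixin and ExampleFeaturizer are hypothetical names, and the defaults are assumptions:

```python
import inspect
import json


class ParamsMixin:
    """Hypothetical mixin: introspect constructor arguments as params."""

    def get_params(self):
        signature = inspect.signature(type(self).__init__)
        names = [n for n in signature.parameters if n != "self"]
        # Relies on each argument being stored as an attribute of that name.
        return {name: getattr(self, name) for name in names}

    def to_json(self):
        payload = {"class": type(self).__name__, "params": self.get_params()}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text)["params"])


class ExampleFeaturizer(ParamsMixin):
    def __init__(self, radius=2, n_feats=2048):  # defaults are assumed
        self.radius = radius
        self.n_feats = n_feats
```

This only works for parameters that json can represent directly; anything richer (callables, nested objects) would need its own encoding.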

Non unique index issues.

Issues occur when index contains repeats.

e.g.

std = skchem.standardizers.ChemAxonStandardizer()
ms = pd.Series([skchem.Mol.from_smiles('CC'), skchem.Mol.from_smiles('C')], index=['', ''])

Clean up and Increase test coverage

The library has been written in a hurry to provide specific functions for my PhD. This has resulted in some ugly, untested code. This should be remedied! In particular:

descriptors
filters

skchem.data.converter module refactor

The converter module should be refactored to offer more flexibility.

It might be nice to be able to add features after the package is generated.

This could be done by allowing a Converter to set itself up from a HDF5 file, rather than making it from scratch.

e.g. with preexisting data.h5, extra features and splits could be added as:

conv = Converter(..., output_path='data.h5')
conv.features += skchem.descriptors.MorganFingerprinter()
conv.splits += pd.Series([True, False ...])

This would probably be easiest once we have unique string representations for featurizers.

Factor the interact module out into its own package.

The interact module should be factored out, as the ipywidgets dependency is quite large.

Configure ChemAxon Standardizer in Python

We should allow for the standardizer to be configurable in python, rather than relying on a config file.

Dry up object convenience accessors

We should provide a simple function/class in utils to dry up the accessors, i.e.

uff = skchem.forcefields.get('uff')
mfper = skchem.descriptors.get('morgan')

Should all use the same interface. This could:

Return default objects if a string is passed
List available objects
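
One possible shape for the shared interface is a small registry object that each package instantiates once. This is a minimal sketch; Registry, register and available are hypothetical names, and the UFF class here is an empty placeholder:

```python
class Registry:
    """Hypothetical helper mapping short names to classes."""

    def __init__(self):
        self._classes = {}

    def register(self, name):
        def decorator(cls):
            self._classes[name] = cls
            return cls
        return decorator

    def available(self):
        """List the names that can be passed to get."""
        return sorted(self._classes)

    def get(self, name_or_obj):
        # Strings give a default-constructed object; anything else is
        # assumed to be an already configured object and passed through.
        if not isinstance(name_or_obj, str):
            return name_or_obj
        if name_or_obj not in self._classes:
            raise ValueError(
                f"unknown name {name_or_obj!r}; choose from {self.available()}"
            )
        return self._classes[name_or_obj]()


forcefields = Registry()


@forcefields.register("uff")
class UFF:
    """Empty placeholder for illustration."""
```

Each package could then expose get = registry.get, so that skchem.forcefields.get('uff') and skchem.descriptors.get('morgan') share one implementation.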

Where to save preprocessed datasets

Preprocessed datasets should be hosted somewhere, as they:

Takes ages to process
May require proprietary tools

Rename 'forcefields'

Forcefields should not be a top level package name. The actual functionality provided is more like "molecular geometry optimisation", for which forcefields(/molecular mechanics) are just one technique. We should rename the package to something more general.

Perhaps geometry_optimisation, then molecular_mechanics as a sub package?

Documentation for what featurizers do!

We could create a guide for the different types of features, standardisers, and have it be an informative place to learn cheminformatics in its own right.

Verbosity for loading bars

Objects should have a verbosity argument, which controls whether they print loading bars.

Implement a plugin infrastructure

We could have a plugin infrastructure, for each of the external tools, and for example the interact module.

Avoid use of 'structure' as a reserved name.

We removed structure as a required name for the transformers by requiring the series. It may be acceptable to bring this back, but the name must be configurable as a keyword argument.

This must also be done in IO, where there is currently no option to name the column the compounds go into.

Dry up feature choice in Physicochemical and Atom featurizers

Both PCF and AF take a collection of ('name', func) pairs. The mechanism of treating them should be dried up and enhanced.

We should be able to add default implemented features by string name, and also add those implemented as functions.
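
A minimal sketch of a shared resolver that accepts string names, bare functions, or explicit ('name', func) pairs. Plain strings stand in for molecules here, and DEFAULT_FEATURES and resolve_features are hypothetical names:

```python
# Hypothetical table of default implemented features, keyed by name.
# Real entries would take a molecule or atom rather than a string.
DEFAULT_FEATURES = {
    "length": len,
    "n_upper": lambda s: sum(c.isupper() for c in s),
}


def resolve_features(features):
    """Normalise a mixed collection into a list of (name, func) pairs."""
    resolved = []
    for feature in features:
        if isinstance(feature, str):
            # Look up a default implemented feature by its string name.
            resolved.append((feature, DEFAULT_FEATURES[feature]))
        elif callable(feature):
            # A bare function is named after itself.
            resolved.append((feature.__name__, feature))
        else:
            # Otherwise expect an explicit (name, func) pair.
            name, func = feature
            resolved.append((name, func))
    return resolved
```

Both featurizers could call the same resolver in their constructors and differ only in the table of defaults they pass in.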

Unique string representations for featurizers.

It would be good to have a unique string representation for featurizers.

Perhaps:

Default: morg
With radius=3: morg-radius=3
With radius=3, n_feats=1024: morg-radius=3-n_feats=1024

etc.

This would be good to do with the JSON/YAML serialization part.
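
The naming scheme above can be derived from the constructor signature by listing only the arguments that differ from their defaults. A minimal sketch; short_name and MorganSketch are hypothetical, and the default values are assumptions:

```python
import inspect


def short_name(obj, prefix):
    """Build a string such as 'morg-radius=3' from non-default arguments."""
    parts = [prefix]
    signature = inspect.signature(type(obj).__init__)
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        # Relies on each argument being stored as an attribute of that name.
        value = getattr(obj, name)
        if value != param.default:
            parts.append(f"{name}={value}")
    return "-".join(parts)


class MorganSketch:
    """Hypothetical stand-in, not the real skchem class."""

    def __init__(self, radius=2, n_feats=2048):  # defaults are assumed
        self.radius = radius
        self.n_feats = n_feats
```

Arguments appear in signature order, so two objects with the same settings always get the same string.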

Data pipelines

The base Converter class should have responsibilities for filtering, standardising and producing splits removed from it. These should be implemented individually for datasets.

That said, a default pipeline should be provided, that will be used for all the datasets at the moment, but it may not be appropriate for all datasets!

Parallelism by default

Add option to parallelise all transformers by calling transform on splits using pool.
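
A minimal sketch of the split-and-map pattern. It uses multiprocessing.pool.ThreadPool so that the example runs with any callable; a process-based multiprocessing.Pool has the same map interface but requires the transform and its data to be picklable. parallel_transform and n_jobs are hypothetical names:

```python
from multiprocessing.pool import ThreadPool


def parallel_transform(transform, items, n_jobs=2):
    """Apply transform to contiguous splits of items and rejoin in order.

    transform takes a list and returns a list of the same length.
    """
    items = list(items)
    size = max(1, -(-len(items) // n_jobs))  # ceiling division
    splits = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPool(n_jobs) as pool:
        # map preserves the order of the splits.
        results = pool.map(transform, splits)
    return [value for split in results for value in split]
```

Threads only help if the underlying work releases the GIL, so for CPU-bound Python code the process pool is the more likely choice.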

Web based tools

There should be good integration with web based tools. For example, we should be able to load compounds from at least:

ChEMBL
DrugBank
ChemSpider
PubChem (C/S)IDs

using their web services. This could potentially be easily done by wrapping PubChemPy, ChemSpiPy etc.

Stop force pushing.

I've been force pushing changes to master when I manage to break things (usually only until I get the results back from CI), assuming I am the only one working on the project. I am planning on using separate branches for separate features, which will let me get CI results without corrupting master. If anyone is considering working on the project with me, please let me know and I will stop - or fork the project on github to let me know!

Generate multiple conformers

We should also wrap the multiple conformation generator functionality.

Wrap the multiple conformation generators
Prune similar conformers by rms

make OrganicFilter great again

Organic filter became slow when it subclassed element filter. It should be made great again, preferably without use of walls financed by foreign nations.

Documentation

We will need online documentation. This will probably have to be done with sphinx, but could be done with mkdocs.
