
scikit-chem's People

Contributors

jriccil, lewisacidic, waffle-iron

scikit-chem's Issues

Dry up command line wrappers

ChemAxon wrappers for:

  • Atom features
  • Molecular Features
  • NMR spectra
  • Standardiser

MOE wrappers for:

  • Molecular Features

All use similar code, so should be dried up.

Things that are duplicated:

  • Writing to temp files
  • Calling the command line
  • Reading the file back

Issues/inconsistencies to be considered:

  • Monitoring: some tools give an SDF back, others a csv, others the NMR formatted data
  • Reading from stdout or from a file
  • Stderr handling

A higher level command line interface should be able to handle these issues.
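
A minimal sketch of what such a higher level interface might look like, assuming each tool takes an input file and either writes an output file or prints to stdout. The name run_cli_tool and the "{infile}"/"{outfile}" placeholders are hypothetical, not existing skchem API, and the Python interpreter stands in for a real tool.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_cli_tool(args, input_text, in_suffix=".sdf", out_suffix=".csv",
                 read_stdout=False):
    """Write input to a temp file, run the tool, and return its output text.

    "{infile}" and "{outfile}" in args are replaced by temp file paths
    (a hypothetical convention for this sketch).
    """
    with tempfile.TemporaryDirectory() as tmp:
        infile = Path(tmp) / ("input" + in_suffix)
        outfile = Path(tmp) / ("output" + out_suffix)
        infile.write_text(input_text)
        cmd = [a.replace("{infile}", str(infile)).replace("{outfile}", str(outfile))
               for a in args]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface stderr instead of silently dropping it.
            raise RuntimeError("%s failed: %s" % (cmd[0], result.stderr.strip()))
        # Some tools print their result, others write a file.
        return result.stdout if read_stdout else outfile.read_text()


# Stand-in "tool": the Python interpreter copying its input file to stdout.
echo = [sys.executable, "-c",
        "import sys; sys.stdout.write(open(sys.argv[1]).read())", "{infile}"]
output = run_cli_tool(echo, "CC\n", read_stdout=True)
```

Parsing the returned text (SDF, CSV or NMR data) would stay the job of each wrapper, so only the temp file, subprocess and stderr handling are shared.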

Fix the CI

The CI scripts have been failing for a long time now, as the install doesn't fetch the example ames.sdf file that isn't checked into VCS. The new dataset module system should fix this, once we adapt the tests, but the CI should be reassessed as it was created by a very inexperienced young @richlewis42.

Provide a better testing framework

The testing framework is currently py.test. We should have shared testing modules that provide appropriate test data, so that fixtures are not reproduced for every test.

3D rendering

It would be great to do some Three.js 3D rendering of molecules.

Update the README

We should write something a bit more positive and concrete in the README.

Clean up the code!

Code health has slipped massively due to time pressures on my PhD. This should be cleaned up when there is time.

Create a proper release system.

scikit-chem releases v0.0.3-0.0.5 were very close together. This is because they were broken - exhaustive tests were not run before they were released!

Release system should:

  • run the full battery of tests for all supported versions before each release.
  • test pip builds (especially make sure data files get included!!)
  • test conda builds
  • make sure the versioning actually gets updated (single version for the whole package, not individual in setup.py and __init__.py as it is now).
  • create a setuptools command to do these
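
One common way to get a single version is to keep it only in __init__.py and have setup.py read it from there. A minimal sketch, assuming the version is assigned as a plain string literal; read_version is a hypothetical helper and the version number is illustrative only.

```python
import re


def read_version(init_source):
    """Extract the __version__ string from the text of an __init__.py file."""
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                      init_source, re.MULTILINE)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# Illustrative file contents; in setup.py this text would be read from disk.
init_source = '"""Example package."""\n__version__ = "1.2.3"\n'
version = read_version(init_source)
```

The result would then be passed as setup(version=version), so the number is only ever edited in one place.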

IO Refactor

The io package could do with a refactor to unify the api, and allow for more flexibility.

It would be good to be able to:

  • download files the way pandas manages it, i.e. fetch the file from the web if given a URL.
  • name the molecule column (this would be a fix for #38).
  • squeeze the result if there are no properties.
  • skip reading properties through a consistent API.
  • read more file formats, such as HELM, Mol2, perhaps HDF(!), chemical-json, CML.
  • serialise to a string if no output argument is given.
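
A minimal sketch of the URL handling and the "serialise to string" behaviour, using only the standard library. The helper names are hypothetical, and parsing the chemical file formats themselves is out of scope here.

```python
import io
from urllib.parse import urlparse
from urllib.request import urlopen

REMOTE_SCHEMES = ("http", "https", "ftp")


def is_url(source):
    """Return True if source looks like a remote URL rather than a local path."""
    return urlparse(str(source)).scheme in REMOTE_SCHEMES


def read_source(source):
    """Return the text behind a URL, a local path, or an open file object."""
    if hasattr(source, "read"):
        return source.read()
    if is_url(source):
        with urlopen(source) as response:
            return response.read().decode("utf-8")
    with open(source) as handle:
        return handle.read()


def write_or_return(text, target=None):
    """Write text to target, or return it as a string if no target is given."""
    if target is None:
        return text
    with open(target, "w") as handle:
        handle.write(text)
    return None


text = read_source(io.StringIO("CC\n"))
```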

make doctests work with pytest switch

Doctests don't work in CI for the ChemAxon wrappers. This could be fixed by either:

  • Refactoring the ChemAxon wrappers into their own extension module.
  • Adding an optional doctest skip directive.

The first solution looks easier, and fits into the wider plan. The temporary solution is to skip all doctest lines where the tools are used.
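
For the temporary solution, the standard library's doctest module already has a per-line SKIP directive. A small sketch of how it behaves; run_chemaxon_tool is a hypothetical name that is deliberately never defined, which shows that the skipped line is not executed.

```python
import doctest

# The second example would raise NameError if it ran; +SKIP prevents that.
EXAMPLE = """
>>> 1 + 1
2
>>> run_chemaxon_tool()  # doctest: +SKIP
'output that needs the proprietary tool'
"""

test = doctest.DocTestParser().get_doctest(EXAMPLE, {}, "example", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
```

This skips the line unconditionally, including on machines where the tools are installed, which is why it is only a stopgap.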

Check parameters on initialisation.

We should check parameters for all objects on initialisation to be sure that they are of the expected type. At the moment, one can do:

mf = skchem.descriptors.MorganFeaturizer(radius='break please')

And no warnings or errors are thrown.
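
A minimal sketch of the kind of check meant here, assuming integer parameters. RadiusFeaturizer is a hypothetical stand-in and its defaults are assumptions, not the real MorganFeaturizer signature.

```python
class RadiusFeaturizer:
    """Hypothetical stand-in for a featurizer with integer parameters."""

    def __init__(self, radius=2, n_feats=2048):
        self.radius = self._check_int("radius", radius, minimum=0)
        self.n_feats = self._check_int("n_feats", n_feats, minimum=1)

    @staticmethod
    def _check_int(name, value, minimum):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("%s must be an int, got %s"
                            % (name, type(value).__name__))
        if value < minimum:
            raise ValueError("%s must be >= %d, got %d" % (name, minimum, value))
        return value


try:
    RadiusFeaturizer(radius='break please')
    error = None
except TypeError as exc:
    error = str(exc)
```

With this in place the bad call fails at construction time instead of somewhere deep inside a later transform.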

cross validation API

Scikit-learn is changing their cross validation API. We should support their new one with the cross validation classes that are currently supported.

Better Support for Sparse Features

There should be much better support for sparse features. We probably can't use pandas, unless they get their scipy.sparse wrapping good, or their sparse format is more widely adopted.

docstrings

We should make sure Google Style conformant docstrings are written for all public objects. We should list here those that are not yet written.

  • skchem.forcefields.UFF
  • skchem.forcefields.MMFF
  • skchem.forcefields.ForceField
  • skchem.forcefields.RoughEmbedding

Improve README.md

Before advertising the package, we need to improve the README.md.

  • Aims
  • Scope: quick example
  • Features
  • Examples

Wrappers for proprietary tools

Scikit-chem is designed to make doing cheminformatics things easier. We should provide wrappers for proprietary libraries. A good example of this would be the already implemented ChemAxonStandardizer.

Essentially, proprietary tools usually require .sdf files.

  1. We write a data frame to a temp file using tempfile builtin.
  2. Run the tool using subprocess builtin, outputting to another temp file.
  3. Read the output using the io module.

Suitable proprietary tools:

  • MOE descriptors sddesc
  • ChemAxon descriptors cxcalc
  • ChemAxon generatemd

There should be a consistent API that checks for the tool, and allows for configuration options for where/how the tool is installed to be specified.

Potentially, these could be extensions, so that those without the tools don't end up trying to use them.
For example:
skchem-chemaxon-standardizer could be a separate package.
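
The "check for the tool" part could be sketched as below, assuming a tool is located either from an explicit path, an environment variable, or the system PATH. find_tool, ToolNotFoundError and any environment variable name passed in are hypothetical; the Python interpreter stands in for a proprietary executable.

```python
import os
import shutil
import sys


class ToolNotFoundError(RuntimeError):
    """Raised when an external tool cannot be located."""


def find_tool(name, path=None, env_var=None):
    """Locate an executable: explicit path, then env var, then the PATH."""
    candidates = [path]
    if env_var is not None:
        candidates.append(os.environ.get(env_var))
    candidates.append(name)
    for candidate in candidates:
        if candidate:
            found = shutil.which(candidate)
            if found:
                return found
    raise ToolNotFoundError("could not find %r; is it installed?" % name)


# Stand-in for a configured install location.
python_path = find_tool("python", path=sys.executable)
```

A wrapper would call this once in its constructor, so a missing tool is reported before any temp files are written.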

Pipelining

We will need to make a more advanced pipelining module than that available in scikit-learn, as we need to filter instances also.
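
To illustrate the difference, here is a toy sketch in which a filter step drops instances from both the inputs and the targets, keeping them aligned. FilteringPipeline and the step format are hypothetical, and plain lists stand in for pandas objects.

```python
class FilteringPipeline:
    """Apply steps in order; filter steps drop instances from both X and y."""

    def __init__(self, steps):
        # Each step is ("transform", func) or ("filter", predicate).
        self.steps = steps

    def run(self, X, y):
        X, y = list(X), list(y)
        for kind, func in self.steps:
            if kind == "filter":
                kept = [(xi, yi) for xi, yi in zip(X, y) if func(xi)]
                X = [xi for xi, _ in kept]
                y = [yi for _, yi in kept]
            elif kind == "transform":
                X = [func(xi) for xi in X]
            else:
                raise ValueError("unknown step kind: %r" % kind)
        return X, y


pipeline = FilteringPipeline([
    ("filter", lambda smiles: "C" in smiles),  # toy filter on the raw string
    ("transform", len),                        # toy feature: string length
])
X, y = pipeline.run(["CCO", "O", "CC"], [1, 0, 1])
```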

'Hyperparameters' and serialisation

There should be a mechanism to serialise and deserialise all skchem objects, probably using json (we're not using pickle!!). This will be most easily achieved by having a scikit-learn style hyperparameter introspection check in the constructor.
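
A rough sketch of how constructor introspection could drive JSON serialisation, assuming every constructor argument is stored on the instance under the same name (the scikit-learn convention) and is itself JSON serialisable. Configurable and ExampleFeaturizer are hypothetical names.

```python
import inspect
import json


class Configurable:
    """Mixin that serialises an object through its constructor arguments."""

    def get_params(self):
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def to_json(self):
        payload = {"class": type(self).__name__, "params": self.get_params()}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        payload = json.loads(text)
        if payload["class"] != cls.__name__:
            raise ValueError("expected %s, got %s" % (cls.__name__, payload["class"]))
        return cls(**payload["params"])


class ExampleFeaturizer(Configurable):
    def __init__(self, radius=2, n_feats=2048):
        self.radius = radius
        self.n_feats = n_feats


text = ExampleFeaturizer(radius=3).to_json()
restored = ExampleFeaturizer.from_json(text)
```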

Non unique index issues.

Issues occur when the index contains repeated values.

e.g.

std = skchem.standardizers.ChemAxonStandardizer()
ms = pd.Series([skchem.Mol.from_smiles('CC'), skchem.Mol.from_smiles('C')], index=['', '']) 

Clean up and Increase test coverage

The library has been written in a hurry to provide specific functions for my PhD. This has resulted in some ugly, untested code. This should be remedied! In particular:

  • descriptors
  • filters

skchem.data.converter module refactor

The converter module should be refactored to offer more flexibility.

It might be nice to be able to add features after the package is generated.

This could be done by allowing a Converter to set itself up from an HDF5 file, rather than making it from scratch.

e.g. with preexisting data.h5, extra features and splits could be added as:

conv = Converter(..., output_path='data.h5')
conv.features += skchem.descriptors.MorganFingerprinter()
conv.splits += pd.Series([True, False ...])

This would probably be easiest once we have unique string representations for featurizers.

Dry up object convenience accessors

We should provide a simple function/class in utils to dry up the accessors, i.e.

uff = skchem.forcefields.get('uff')
mfper = skchem.descriptors.get('morgan')

These should all use the same interface. This could:

  • Return default objects if a string is passed
  • List available objects
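
One possible shape for this is a small registry class that each package instantiates once. A sketch under that assumption; Registry is hypothetical, and the UFF and MMFF classes below are empty stand-ins rather than the real skchem.forcefields classes.

```python
class Registry:
    """Map short string names to classes and build default instances."""

    def __init__(self, kind):
        self.kind = kind
        self._classes = {}

    def register(self, name):
        def decorator(cls):
            self._classes[name] = cls
            return cls
        return decorator

    def available(self):
        return sorted(self._classes)

    def get(self, obj):
        if not isinstance(obj, str):
            return obj  # already an instance, pass it through
        if obj not in self._classes:
            raise ValueError("unknown %s %r; available: %s"
                             % (self.kind, obj, ", ".join(self.available())))
        return self._classes[obj]()


forcefields = Registry("forcefield")


@forcefields.register("uff")
class UFF:
    pass


@forcefields.register("mmff")
class MMFF:
    pass


uff = forcefields.get("uff")
```

Passing instances through unchanged means functions can accept either a string or a configured object and call get on it unconditionally.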

Rename 'forcefields'

Forcefields should not be a top level package name. The actual functionality provided is more like "molecular geometry optimisation", for which forcefields(/molecular mechanics) are just one technique. We should rename the package to something more general.

Perhaps geometry_optimisation, then molecular_mechanics as a sub package?

Documentation for what featurizers do!

We could create a guide for the different types of features, standardisers, and have it be an informative place to learn cheminformatics in its own right.

Avoid use of 'structure' as a reserved name.

We removed structure as a required name for the transformers by requiring the series. It may be acceptable to bring this back, but the name must be configurable as a keyword argument.

This must also be done in IO, where there is currently no option to name the column the compounds go into.

Unique string representations for featurizers.

It would be good to have a unique string representation for featurizers.

Perhaps:

Default: morg
With radius=3: morg-radius=3
With radius=3, n_feats=1024: morg-radius=3-n_feats=1024

etc.
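
This scheme could be generated by comparing each constructor argument with its default. A minimal sketch; MorganLike and its defaults (radius=2, n_feats=2048) are assumptions made for the example, and parameters are emitted in signature order.

```python
import inspect


class MorganLike:
    """Hypothetical featurizer; the defaults are assumptions for this sketch."""

    def __init__(self, radius=2, n_feats=2048):
        self.radius = radius
        self.n_feats = n_feats


def short_repr(obj, name):
    """Build a string from name plus every parameter that differs from its default."""
    parts = [name]
    signature = inspect.signature(type(obj).__init__)
    for pname, param in signature.parameters.items():
        if pname == "self":
            continue
        value = getattr(obj, pname)
        if value != param.default:
            parts.append("%s=%s" % (pname, value))
    return "-".join(parts)


default_name = short_repr(MorganLike(), "morg")
custom_name = short_repr(MorganLike(radius=3, n_feats=1024), "morg")
```

Parameter values containing "-" or "=" would make the string ambiguous, so some escaping rule would be needed before relying on it as a key.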

This would be good to do with the JSON/YAML serialization part.

Data pipelines

The base Converter class should have responsibilities for filtering, standardising and producing splits removed from it. These should be implemented individually for datasets.

That said, a default pipeline should be provided, that will be used for all the datasets at the moment, but it may not be appropriate for all datasets!

Parallelism by default

Add an option to parallelise all transformers by calling transform on splits of the data using a pool.
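
A sketch of the split, map and recombine pattern. It uses a thread pool so the example runs anywhere; for CPU-bound featurizers a process pool with the same map interface would presumably be swapped in. All names here are hypothetical.

```python
from multiprocessing.pool import ThreadPool


def split(items, n_chunks):
    """Split a list into at most n_chunks contiguous chunks."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_transform(transform, items, pool, n_chunks=4):
    """Run transform on each chunk via the pool and flatten the results."""
    chunks = split(list(items), n_chunks)
    results = pool.map(transform, chunks)  # map preserves chunk order
    return [value for chunk in results for value in chunk]


def toy_transform(chunk):
    # Stand-in for a featurizer: the "feature" is the string length.
    return [len(smiles) for smiles in chunk]


with ThreadPool(2) as pool:
    lengths = parallel_transform(toy_transform,
                                 ["CC", "C", "CCO", "CCCC", "O"], pool)
```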

Web based tools

There should be good integration with web based tools. For example, we should be able to load compounds from at least:

  • ChEMBL
  • DrugBank
  • ChemSpider
  • PubChem (C/S)IDs

using their web services. This could potentially be easily done by wrapping PubChemPy, ChemSpiPy etc.

Stop force pushing.

I've been force pushing changes to master when I manage to break things (usually only until I get the results back from CI), assuming I am the only one working on the project. I am planning on using separate branches for separate features, which will let me get CI results without corrupting master. If anyone is considering working on the project with me, please let me know and I will stop, or fork the project on GitHub to let me know!

Generate multiple conformers

We should also wrap the multiple conformation generator functionality.

  • Wrap the multiple conformation generators
  • Prune similar conformers by rms

make OrganicFilter great again

Organic filter became slow when it subclassed element filter. It should be made great again, preferably without use of walls financed by foreign nations.

Documentation

We will need online documentation. This will probably have to be done with Sphinx, but could be done with MkDocs.
