epistasislab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Home Page: http://epistasislab.github.io/tpot/

License: GNU Lesser General Public License v3.0

Topics: machine-learning, python, data-science, automl, automation, scikit-learn, hyperparameter-optimization, model-selection, parameter-tuning, automated-machine-learning

tpot's Introduction

Master status: [Mac/Linux build, Windows build, and coverage badges]

Development status: [Mac/Linux build, Windows build, and coverage badges]

Package information: Python 3.7, License: LGPL v3, PyPI version


To try the new TPOT2 (alpha), please go here!


TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

[TPOT demo animation]

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

[Figure: an example machine learning pipeline]

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

[Figure: an example TPOT pipeline]

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

TPOT is still under active development and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.

License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.

Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

Click on the corresponding links to find more information on TPOT usage in the documentation.
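
For orientation, a command-line run typically looks something like this (flag names have changed across TPOT versions, so treat this as a sketch and check tpot --help for your installed version):

tpot data.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2

Here -is sets the input column separator, -target names the outcome column, -o is the file the best pipeline's Python code is exported to, and -g/-p/-cv/-s/-v control generations, population size, cross-validation folds, the random seed, and verbosity.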

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set.

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

which should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics.36(1): 250-256.

BibTeX entry:

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry:

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

Alternatively, you can cite the repository directly with the following DOI:

DOI

Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.

tpot's People

Contributors

akshayvarik, apachaves, ayrtonb, bartdp1, bartleyn, beckernick, ckastner, dankoretsky, gjena, gosuto-inzasheru, jaimecclin, jamesjjcondon, jamesmyatt, jay-m-dev, jdromano2, jhmenke, joseortiz3, kadarakos, nickotto, perib, pgijsbers, pronojitsaha, rasbt, rhiever, sahil-b-shah, sohnam, tcfuji, tomaugspurger, weixuanfu, zoso95


tpot's Issues

Add Executable Examples

It would be nice if you could add an ./examples directory containing some IPython Notebooks with complete TPOT pipeline optimization examples and comments (e.g., using the prostate cancer dataset). You could link them in the README.md file, which would help people get an overview; plus, people could re-use these notebooks as templates for their own experiments.

example broken?

Getting the following traceback when running the minimal working example from the README:

Traceback (most recent call last):
  File "example.py", line 9, in <module>
    tpot.fit(X_train, y_train)
  File "/Users/marcel/workspace/tpot/tpot/tpot.py", line 162, in fit
    self.toolbox.register('evaluate', self.evaluate_individual, training_testing_data=training_testing_data)
AttributeError: 'TPOT' object has no attribute 'evaluate_individual'

Changing member function name back to _evaluate_individual resolves the issue.

Installation error with Python 2

[pete@dakota ~]$ pip install tpot
Collecting tpot
  Downloading TPOT-0.1.2.tar.gz (165kB)
    100% |████████████████████████████████| 167kB 811kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot/setup.py", line 18, in <module>
        package_version = calculate_version()
      File "/private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot/setup.py", line 13, in calculate_version
        version = next(filter(lambda x: '__version__' in x, initpy)).split('\'')[1]
    TypeError: list object is not an iterator

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot

Probably because filter() returns an iterator in Python 3 but a plain list in Python 2, and next() cannot be called on a list.

Broken Landscape Badge

For some reason, the Landscape badge is now broken after the last pull request. I don't think there is a problem with the link, since it worked previously; however, I noticed that it also no longer shows on landscape.io itself (the other badge styles still seem to work). Weird!

[screenshot: the broken Landscape badge, 2015-11-16]

I think this may be a GitHub-related caching error; see, e.g., the discussion here: github/markup#224

Let's keep an eye on that; maybe it resolves itself in a few hours. Otherwise, we'd have to investigate further...

Add initialization statements to script version

Currently, TPOT prints the settings etc. at the beginning of a run in the command-line version at high verbosity (=2), but not in the script version. Make the script version print the settings at high verbosity as well.

Basic project documentation

Flesh out the README to provide

  • a basic working example of TPOT
  • installation instructions
  • a longer description of TPOT

Using the tpot object for prediction

Error with .predict for iris example

from tpot import TPOT
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split

digits = load_iris()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOT(generations=10)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))

But when I try to use the pipeline as a predictor

tpot.predict(X_train, y_train, X_test)

this is the error I get (IPython debugger output):

TypeError                                 Traceback (most recent call last)
<ipython-input-8-74abe9ee292a> in <module>()
----> 1 tpot.predict(X_train, y_train, X_test)

/usr/local/lib/python2.7/dist-packages/tpot/tpot.pyc in predict(self, training_features, training_classes, testing_features)
    290 
    291         result = func(training_testing_data)
--> 292         return result[result['group'] == 'testing', 'guess'].values
    293 
    294     def score(self, training_features, training_classes, testing_features, testing_classes):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1656             return self._getitem_multilevel(key)
   1657         else:
-> 1658             return self._getitem_column(key)
   1659 
   1660     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1663         # get column
   1664         if self.columns.is_unique:
-> 1665             return self._get_item_cache(key)
   1666 
   1667         # duplicate columns & possible reduce dimensionaility

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1001     def _get_item_cache(self, item):
   1002         cache = self._item_cache
-> 1003         res = cache.get(item)
   1004         if res is None:
   1005             values = self._data.get(item)

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
    623     def __hash__(self):
    624         raise TypeError('{0!r} objects are mutable, thus they cannot be'
--> 625                         ' hashed'.format(self.__class__.__name__))
    626 
    627     def __iter__(self):

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Interpreting generated code

Running the iris example generated this piece of code

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, 
                                                                     n_iter=1, 
                                                                     train_size=0.75)))
result1 = tpot_data.copy()

# Perform classification with a decision tree classifier
dtc1 = DecisionTreeClassifier(max_features=min(83, len(result1.columns) - 1), max_depth=19)
dtc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['dtc1-classification'] = dtc1.predict(result1.drop('class', axis=1).values)

# Perform classification with a decision tree classifier
dtc2 = DecisionTreeClassifier(max_features='auto', max_depth=56)
dtc2.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result2 = result1
result2['dtc2-classification'] = dtc2.predict(result2.drop('class', axis=1).values)

I'm struggling a bit to understand the intended idea behind this result2 dataframe. There are two classification results in the above example, both from decision trees with different hyper-parameters, but how do these get combined?

Ensure Access to Model Parameters upon Early Termination

From @rasbt:

Let's make sure that we don't lose the model parameters if the run is terminated early.

  • Add a "verbose" parameter that writes to stderr (see the sketch after this list). This way, the user can pipe the output (e.g., model parameters and metrics) to a log file. This is especially useful for keeping track of the process when running TPOT as a PBS job, and it ensures access to the model parameters if the job crashes (or hits the wall time).
  • Also, make sure that the current state is saved gracefully if the program quits unexpectedly.
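
A minimal sketch of the first suggestion, assuming a hypothetical per-generation hook (illustrative only, not TPOT's actual API):

import sys

# Hypothetical progress hook: stream per-generation stats to stderr so a
# PBS job can capture them, e.g. `python run_tpot.py 2> tpot_progress.log`.
def report_progress(generation, best_score):
    sys.stderr.write('gen %d\tbest CV score: %.4f\n' % (generation, best_score))
    sys.stderr.flush()  # flush immediately in case the job is killed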

TPOT command line usage help

I downloaded a sample mnist data set into a CSV and installed TPOT and all the dependencies.

I tried running it through the command line; below are the command I ran and the results I got:

$ tpot -i mnist.csv -is , -g 100 -s 42 -v 2

TPOT settings:
crossover_rate  =   0.05
generations =   100
input_file  =   mnist.csv
input_separator =   ,
mutation_rate   =   0.9
population_size =   100
random_state    =   42
verbosity   =   2

gen nevals  Minimum accuracy    Average accuracy    Maximum accuracy
0   100     0.1                 0.404918            0.964608

^C
^CTraceback (most recent call last):
  File "/Users/moi/.pyenv/versions/tpot/bin/tpot", line 9, in <module>
    load_entry_point('TPOT==0.1.3', 'console_scripts', 'tpot')()
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/tpot/tpot.py", line 479, in main
    training_features, training_classes)))
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/tpot/tpot.py", line 207, in score
    training_testing_data.rename(columns={column: str(column).zfill(5)}, inplace=True)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/frame.py", line 2697, in rename
    **kwargs)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/generic.py", line 606, in rename
    result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 2587, in rename_axis
    obj = self.copy(deep=copy)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 3059, in copy
    do_integrity_check=False)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 2823, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 578, in copy
    values = values.copy()
KeyboardInterrupt

It took a couple of hours until TPOT printed the stats summary, and after another hour or so it was still running, so I terminated it. I'm curious what a completed TPOT run looks like. For some reason I was expecting code to be written to a directory, per the README:

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

Or maybe this Python source is printed to the terminal? Going to start reading the TPOT source more thoroughly.

Allow passing of sparse matrices

In some situations, it's easier to pass a scipy.sparse matrix object to sklearn model objects. This would reduce the memory requirements when fitting larger datasets.

Most sklearn models will accept a sparse matrix. For those that do not, checking for sparsity in the method call and calling matrix.todense() would work.
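
A minimal sketch of the proposed check, assuming a hypothetical flag that says whether the wrapped estimator accepts sparse input:

from scipy import sparse

def densify_if_needed(X, estimator_accepts_sparse):
    # Hypothetical helper: pass sparse input through when the estimator
    # supports it; otherwise fall back to a dense array (at a memory cost).
    if sparse.issparse(X) and not estimator_accepts_sparse:
        return X.toarray()
    return X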

Implement a predict() function

Create a predict() function to provide a similar interface as with scikit-learn models.

predict() takes 3 parameters:

  • training features
  • training classes
  • testing features

and returns the predicted testing classes.
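
A sketch of the proposed signature (the _best_pipeline() accessor is hypothetical; at the time, the optimized pipeline was a function that re-trains on each call):

def predict(self, training_features, training_classes, testing_features):
    # Re-fit the best discovered pipeline on the training data, then
    # return the predicted classes for the testing features.
    pipeline = self._best_pipeline()  # hypothetical accessor
    pipeline.fit(training_features, training_classes)
    return pipeline.predict(testing_features)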

Performance Metrics & Fitness Functions

I imagine that it would be useful to use TPOT to find models that optimize alternate performance metrics like precision, recall, etc. As such, I've come up with the following brainstorming questions:

  1. Does having alternate performance metrics/fitness functions make sense for the users?
  2. Does it make sense to add alternate metrics when reporting the best model? If so, which metric do we use as the fitness function?
  3. Since there is native support for multi-class/multi-label classification, regular precision, recall, and F1 may not be that useful. Should we just take the averaged versions of scores like these when necessary? (One macro-averaging option is sketched below.)

There are plenty of other questions, but I figured this would be a decent place to start. Let me know if I'm totally misguided in proposing this -- I won't profess to be an expert in genetic programming.
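
For question 3, macro-averaging is one standard way to extend binary metrics to multi-class problems, e.g. with scikit-learn (y_true and y_pred are placeholders for true and predicted labels):

from sklearn.metrics import f1_score, precision_score, recall_score

# Macro-averaging computes each metric per class, then averages the
# per-class scores, so no single class dominates the result.
f1_macro = f1_score(y_true, y_pred, average='macro')
precision_macro = precision_score(y_true, y_pred, average='macro')
recall_macro = recall_score(y_true, y_pred, average='macro')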

Export TPOT pipelines to Orange file format

Orange has a pretty nice interface for building sklearn pipelines in a GUI.

image

It'd be great to have a function that exports TPOT pipelines into a file format that can be opened with Orange.

I've contacted the Orange devs and they said that it should be relatively straightforward to accomplish this as long as Orange has each TPOT pipeline operator already implemented as an Orange widget.

I'd imagine a good first step would be to thoroughly explore the Orange software and see what widgets/pipeline operators they already have implemented.


Here's an example Orange file:

<?xml version='1.0' encoding='utf-8'?>
<scheme description="" title="" version="2.0">
    <nodes>
        <node id="0" name="File" position="(150, 150)" project_name="Orange" qualified_name="Orange.widgets.data.owfile.OWFile" title="File" version="" />
        <node id="1" name="Data Info" position="(319.0, 77.0)" project_name="Orange" qualified_name="Orange.widgets.data.owdatainfo.OWDataInfo" title="Data Info" version="" />
        <node id="2" name="Test &amp; Score" position="(318.0, 209.0)" project_name="Orange" qualified_name="Orange.widgets.evaluate.owtestlearners.OWTestLearners" title="Test &amp; Score" version="" />
        <node id="3" name="Logistic Regression" position="(162.0, 324.0)" project_name="Orange" qualified_name="Orange.widgets.classify.owlogisticregression.OWLogisticRegression" title="Logistic Regression" version="" />
    </nodes>
    <links>
        <link enabled="true" id="0" sink_channel="Data" sink_node_id="1" source_channel="Data" source_node_id="0" />
        <link enabled="true" id="1" sink_channel="Data" sink_node_id="2" source_channel="Data" source_node_id="0" />
        <link enabled="true" id="2" sink_channel="Learner" sink_node_id="2" source_channel="Learner" source_node_id="3" />
    </links>
</scheme>

Add more feature selection operators

Packaging and Unit tests

From @rasbt:

I think it would be worthwhile to turn TPOT into an importable python module/package and to add unit tests. This would help with the development, especially in collaboration.

For the first public release, I think it would be a big plus to add continuous integration, e.g,. Travis CI (in terms of trustworthiness)

Some bugs in the generated code with feature selection and scaler

I ran a couple of experiments on MNIST and observed that the code generation is a bit buggy at the moment. In the first example, the only operator generated is SelectPercentile:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's SelectPercentile for feature selection
training_features = result2.loc[training_indices].drop('class', axis=1)
training_class_vals = result2.loc[training_indices, 'class'].values

if len(training_features.columns.values) == 0:
result3 = result2.copy()
else:
selector = SelectPercentile(f_classif, percentile=100)
selector.fit(training_features.values, training_class_vals)
mask = selector.get_support(True)
mask_cols = list(training_features.iloc[:, mask].columns) + ['class']
result3 = result2[mask_cols]
  • No indentation
  • result2 is not defined
  • optimized_pipeline_ contains _select_percentile, svc, _standard_scaler, but svc and
    standard scaler don't appear in the generated code

Another example with RobustScaler:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's RobustScaler to scale the features
training_features = result3.loc[training_indices].drop('class', axis=1)
result4 = result3.copy()

if len(training_features.columns.values) > 0:
scaler = RobustScaler()
scaler.fit(training_features.values.astype(np.float64))
scaled_features = scaler.transform(result4.drop('class', axis=1).values.astype(np.float64))

for col_num, column in enumerate(result4.drop('class', axis=1).columns.values):
    result4.loc[:, column] = scaled_features[:, col_num]
  • No indentation
  • result3 is not defined
  • optimized_pipeline_ contains _robust_scaler, svc, svc, _select_percentile, but svc, svc and
    _select_percentile don't appear in the generated code

Open more ML model parameters to optimization

From #39, we discussed parameters that may be important to open up to search for the various ML models in TPOT. The sklearn devs have a general sense of some of the important parameters, below, but this is not an exhaustive list.

I think it would be valuable at some point to explore what parameters are most important to optimize for the various models used in TPOT, as I discussed here.

_DEFAULT_PARAM_GRIDS = {'AdaBoostClassifier':
                        [{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'AdaBoostRegressor':
                        [{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'DecisionTreeClassifier':
                        [{'max_features': ["auto", None]}],
                        'DecisionTreeRegressor':
                        [{'max_features': ["auto", None]}],
                        'ElasticNet':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'GradientBoostingClassifier':
                        [{'max_depth': [1, 3, 5]}],
                        'GradientBoostingRegressor':
                        [{'max_depth': [1, 3, 5]}],
                        'KNeighborsClassifier':
                        [{'n_neighbors': [1, 5, 10, 100],
                          'weights': ['uniform', 'distance']}],
                        'KNeighborsRegressor':
                        [{'n_neighbors': [1, 5, 10, 100],
                          'weights': ['uniform', 'distance']}],
                        'Lasso':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'LinearRegression':
                        [{}],
                        'LinearSVC':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'LogisticRegression':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SVC': [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
                                 'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'MultinomialNB':
                        [{'alpha': [0.1, 0.25, 0.5, 0.75, 1.0]}],
                        'RandomForestClassifier':
                        [{'max_depth': [1, 5, 10, None]}],
                        'RandomForestRegressor':
                        [{'max_depth': [1, 5, 10, None]}],
                        'Ridge':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SGDClassifier':
                        [{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
                          'penalty': ['l1', 'l2', 'elasticnet']}],
                        'SGDRegressor':
                        [{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
                          'penalty': ['l1', 'l2', 'elasticnet']}],
                        'LinearSVR':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SVR':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
                          'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}]}

Smart seeding of TPOT populations?

Sorry if the text below sounds like rambling -- I was using this issue to brainstorm.

I've been thinking about possible ways to make TPOT perform better right out of the box, without having to run it for several generations to finally discover the better pipelines. One of the ideas I've had is to seed the TPOT population with a smarter group of solutions.

For example, we know that a TPOT pipeline will need at least one model, so we can seed it with each of the 6 current models over a small range of parameters:

  • decision tree: all combinations of
    • max_features = [0 (--> auto), 1 (--> None)]
    • max_depth: [0 (--> None), 1, 5, 10, 20, 50]
    • = 12 total combinations
  • random forest: all combinations of
    • n_estimators = [100, 500]
    • max_features = [0 (--> auto), 1]
    • = 4 total combinations
  • logistic regression:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • svc:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • knnc:
    • n_neighbors = [2, 5, 10, 20, 50]
    • = 5 total combinations
  • gradient boosting: all combinations of
    • learning_rate: [0.01, 0.1, 0.5, 1.0]
    • n_estimators: [100, 500]
    • max_depth: [0 (--> None), 5, 10]
    • = 24 total combinations

That gives us 59 "classifier-only" TPOT pipelines to start with.

We also have 4 feature selectors:

  • RFE: all combinations of
    • num_features = [1, 5, 10, 50]
    • step = [0.1, 0.25, 0.5]
    • = 12 total combinations
  • select percentile:
    • percentile = [1, 5, 10, 25, 50, 75]
    • = 6 total combinations
  • select k best:
    • k = [1, 2, 5, 10, 20, 50]
    • = 6 total combinations
  • variance threshold:
    • threshold = [0.1, 0.2, 0.3, 0.4, 0.5]
    • = 5 total combinations

And 4 feature preprocessors:

  • standard scaler (no parameters)
    • = 1 total combinations
  • robust scaler (no parameters)
    • = 1 total combinations
  • polynomial features (no parameters)
    • = 1 total combinations
  • PCA:
    • n_components = [1, 2, 4, 10, 20]
    • = 5 total combinations

Thus, if we wanted to provide at least one feature preprocessor or selector in the pipeline before passing the data to the model, that would result in:

feature selection combinations = 12 * 59 + 6 * 59 + 6 * 59 + 5 * 59 = 1,711

feature preprocessor combinations = 5 * 59 + 1 * 59 + 1 * 59 + 1 * 59 = 472

Giving us a total = 59 + 1,711 + 472 = 2,242 pipeline combinations to start out with.

We'd evaluate all 2,242 of these pipelines then use the top 100 to seed the TPOT population. From there, the GP algorithm is allowed to tinker with the pipeline, fine-tune the parameters, and possibly discover better combinations of pipeline operators.

That's obviously a lot of pipelines to try out at the beginning -- about 23 generations' worth of pipelines, which will be quite slow on any decently sized data set. It may be necessary to cut down on the parameters that we try out at the beginning.
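
As a quick sanity check of the arithmetic above:

# Seed-pipeline counts from the lists above.
classifier_seeds = 12 + 4 + 7 + 7 + 5 + 24                 # 59 classifier-only pipelines
selector_combos = (12 + 6 + 6 + 5) * classifier_seeds      # 1,711 selector + classifier
preprocessor_combos = (5 + 1 + 1 + 1) * classifier_seeds   # 472 preprocessor + classifier
print(classifier_seeds + selector_combos + preprocessor_combos)  # 2242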

Address pipeline overfitting

Currently, TPOT has a tendency to build pipelines that overfit the data unless a good training sample is provided. We need to devise a method to combat overfitting on the pipeline level. Here's what I'm looking to explore:

Multi-objective fitness: Optimize along two fitness axes, where one is classification accuracy and the other is model complexity. Model complexity can be quantified in several ways:

  • The number of model pipeline operators in the pipeline
  • The number of pipeline operators in the pipeline
  • The sum of the number of features at every stage of the pipeline

Pareto optimization: Taking ideas from the famous NSGA-II algorithm, we can explore a two-fitness-axis optimization problem but treat the fitnesses as Pareto fronts instead. This results in a group of pipelines to select from at the end of the optimization process, where the user hand-selects the trade-off between complexity and accuracy (rather than strictly minimizing model complexity as in the multi-objective fitness approach).
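
TPOT's GP machinery is built on DEAP, where a two-axis fitness like this can be declared as follows (a minimal sketch; the weights maximize accuracy and minimize complexity):

from deap import base, creator, tools

# Maximize accuracy (+1.0) while minimizing pipeline complexity (-1.0).
creator.create('FitnessMulti', base.Fitness, weights=(1.0, -1.0))
creator.create('Individual', list, fitness=creator.FitnessMulti)

# NSGA-II selection then keeps the Pareto-efficient trade-offs, e.g.:
# selected = tools.selNSGA2(population, k=50)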

I'll be working on this over Winter break, so please feel free to provide feedback and ideas.

Break export() down into 3 separate functions

export() is currently too large. Break it down into 3 functions based on the primary steps of the function:

  • Replace all of the mathematical operators with their results
  • Unroll the nested function calls into serial code
  • Replace the function calls with their corresponding Python code

This change will make it easier to unit test the export() function as well.
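
A sketch of the decomposition (all three helper names are hypothetical):

def export(self, output_file_name):
    # Step 1: replace mathematical operators with their computed results.
    tree = self._replace_math_operators(self._optimized_pipeline)
    # Step 2: unroll the nested function calls into serial code.
    steps = self._unroll_nested_calls(tree)
    # Step 3: replace the function calls with their Python equivalents.
    code = self._generate_pipeline_code(steps)
    with open(output_file_name, 'w') as output_file:
        output_file.write(code)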

Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call?

Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can train the sklearn models on the training data again. This is required because the pipeline consists of functions: each random forest, decision tree, etc. is a function and the model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing score() and predict() calls against the pipeline.
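
One possible direction, sketched below: make each operator an object that caches its fitted estimator instead of a bare function (illustrative only, not a settled design):

class PersistentModelOperator(object):
    """Caches the fitted estimator so later score()/predict() calls
    don't have to re-train it (hypothetical design sketch)."""

    def __init__(self, estimator):
        self.estimator = estimator
        self._fitted = False

    def __call__(self, training_features, training_classes, features):
        # Train once; subsequent calls reuse the fitted model.
        if not self._fitted:
            self.estimator.fit(training_features, training_classes)
            self._fitted = True
        return self.estimator.predict(features)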

Issue with train/test split

Sometimes when both train_size and test_size aren't specified in StratifiedShuffleSplit() calls, the split doesn't use the entire data set. Change all split calls to explicitly specify both train_size and test_size.
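
With the modern scikit-learn API, the fix looks like this (X and y are placeholders for the features and labels):

from sklearn.model_selection import StratifiedShuffleSplit

# Specifying both sizes guarantees the split covers the entire data set.
splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.75, test_size=0.25)
train_indices, test_indices = next(splitter.split(X, y))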

Convenience function: Detect if there are non-numerical features and encode them as numerical features

(As discussed in #60)

Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:

  1. detects whether there exist non-numerical features in the feature set

  2. sends a warning to the user that they should preprocess the non-numerical features into numerical features

  3. ... but also tells the user that TPOT is automatically encoding the non-numerical features as numerical features, does so, and passes the new preprocessed feature set to the optimization process.
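
A minimal sketch of such a convenience step with pandas (the function name is hypothetical):

import pandas as pd

def encode_non_numerical_features(df):
    # 1. Detect non-numerical columns; 2. warn the user; 3. integer-encode
    # them so the optimization can proceed on a fully numerical feature set.
    non_numeric = df.select_dtypes(exclude='number').columns
    if len(non_numeric) > 0:
        print('Warning: automatically encoding non-numerical features: '
              + ', '.join(non_numeric))
        for column in non_numeric:
            df[column] = pd.factorize(df[column])[0]
    return df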

Support multiclass accuracy

Currently, TPOT only handles binary classification accuracy. Many ML problems are multiclass -- make sure TPOT can handle this.
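
Balanced accuracy (the average of per-class recalls) is one metric that generalizes cleanly to the multiclass case; a sketch, assuming NumPy arrays y_true and y_pred:

import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Average the per-class recalls so every class counts equally,
    # regardless of class imbalance or the number of classes.
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)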

Implement more classifier pipeline operators

Similar to the Decision Tree and Random Forest classifier pipeline operators, also implement:

@rasbt, do you think we should add any more than this? I'd like to add ANNs eventually, but since they're not directly supported in sklearn, that will wait for a later time.

Pickling TPOT objects

I wonder if there'd be any interest in generating something to pickle TPOT pipelines. Aside from the immediate use case of "I want to import my pipeline and work with it more easily than with the .export()-generated .py file", it'd also help immensely in parallelising pipeline search, say if I wanted to find several different pipelines simultaneously and compare their scores afterwards. Admittedly, I haven't looked into the structure of TPOT pipelines, so I'm not sure how complex they are. Is this completely nontrivial?
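
For reference, the use case sketched above would look something like this (the fitted_pipeline_ attribute is an assumption here; check which attribute your TPOT version actually exposes):

import pickle

# Persist the best pipeline found by a fitted TPOT object (attribute
# name assumed; it has varied across TPOT versions).
with open('tpot_pipeline.pkl', 'wb') as out:
    pickle.dump(tpot.fitted_pipeline_, out)

# Later, or in a parallel worker comparing several searches:
with open('tpot_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)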

Add more preprocessing pipeline operators

Project Documentation Enhancement

I was thinking that it may be worthwhile to set up project documentation somewhere other than this GitHub repo -- for example, via Sphinx or MkDocs. This would make it possible to create and organize API documentation and tutorials/examples. I could set up something like http://rasbt.github.io/biopandas/ if you'd find it useful.

tpot should handle --version argument

I was running into runtime errors when executing the example (MNIST data set): "AttributeError: TPOT instance has no attribute 'export'". I was hoping to check whether the correct tpot version is installed, but it seems that --version isn't supported.

Trim out data transformation operators that are downstream of the last classification step

Sometimes the optimized pipeline will look something like this:

transformation -> transformation -> classification -> transformation

The last transformation step adds nothing. We should clean up the pipeline by adding a post-processing step to tpot.fit that trims out unnecessary operators from the optimized pipeline. This will be trivial after incorporating the refactor in #63, as we could just add an attribute to the base classes to identify whether or not an operator can be the pipeline terminus. Something like:

class BasicOperator(object): 
        ...
        self._terminal_operator = False
        ...

class LearnerOperator(object): 
        ...
        self._terminal_operator = True
        ...

I felt it'd probably be better to create a new issue for this topic rather than unilaterally adding a commit downstream of the #63 HEAD.
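
With that attribute in place, the post-processing trim could be as simple as the following sketch:

def trim_trailing_operators(pipeline_operators):
    # Keep everything up to and including the last terminal operator
    # (e.g. a classifier); drop transformations that follow it.
    last_terminal = max(i for i, op in enumerate(pipeline_operators)
                        if op._terminal_operator)
    return pipeline_operators[:last_terminal + 1]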

Questions about TPOT

http://www.randalolson.com/2015/11/15/introducing-tpot-the-data-science-assistant/

Perhaps the most basic way to help is to give TPOT a try for your normal workflow and let me know how it works for you. What worked well? What didn't work well? What new features do you think would help? I have my way of doing things, but I'd like to design this tool to be useful for everyone.

Given that this is very much work in progress, I am primarily wondering:

  1. If/how can this be used to directly deal with ASTs (parse trees) -- i.e., beyond GAs, so that it can be used to seed/create and mutate syntax trees (e.g., from Python's ast module)?
  2. Are there any plans to support OpenCL, e.g., for running things concurrently on GPUs or idle CPU cores?

Thanks

(note that numpy based code can often be easily moved to OpenCL using pyOpenCL)

Expand project unit tests and integration tests

Currently, there are only a few unit tests in tests.py. These are basic unit tests and don't cover a large portion of the project. We should expand the unit tests to cover more of the core TPOT functions.

We also need integration tests that test TPOT as a whole. This can be done with a small, fixed data set and a fixed random number generator seed over only a few generations, with a few different parameter settings.
