epistasislab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Home Page: http://epistasislab.github.io/tpot/

License: GNU Lesser General Public License v3.0

Topics: machine-learning, python, data-science, automl, automation, scikit-learn, hyperparameter-optimization, model-selection, parameter-tuning, automated-machine-learning

tpot's Introduction

Master status: [Mac/Linux build, Windows build, and coverage badges]

Development status: [Mac/Linux build, Windows build, and coverage badges]

Package information: Python 3.7, License: LGPL v3, PyPI version


To try the new TPOT2 (alpha), please go here!


TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

[TPOT demo animation]

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

[Figure: an example machine learning pipeline]

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

[Figure: an example TPOT pipeline]

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

TPOT is still under active development and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.

License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.

Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

Click on the corresponding links to find more information on TPOT usage in the documentation.
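
For orientation, a command-line run typically looks something like this (flag names have changed across TPOT versions, so treat this as a sketch and check tpot --help for your installed version):

tpot data.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2

Here -is sets the input column separator, -target names the outcome column, -o is the file the best pipeline's Python code is exported to, and -g/-p/-cv/-s/-v control generations, population size, cross-validation folds, the random seed, and verbosity.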

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set.

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

which should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics.36(1): 250-256.

BibTeX entry:

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry:

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

Alternatively, you can cite the repository directly with the following DOI:

DOI

Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.

tpot's People

Contributors

akshayvarik, apachaves, ayrtonb, bartdp1, bartleyn, beckernick, ckastner, dankoretsky, gjena, gosuto-inzasheru, jaimecclin, jamesjjcondon, jamesmyatt, jay-m-dev, jdromano2, jhmenke, joseortiz3, kadarakos, nickotto, perib, pgijsbers, pronojitsaha, rasbt, rhiever, sahil-b-shah, sohnam, tcfuji, tomaugspurger, weixuanfu, zoso95


tpot's Issues

Add Executable Examples

It would be nice if you could add an ./examples directory containing some IPython Notebooks with complete TPOT pipeline optimization examples and comments (e.g., using the prostate cancer dataset). You could link them in the README.md file, which would help people get an overview; plus, people could re-use these notebooks as templates for their own experiments.

example broken?

Getting the following traceback when running the minimal working example from the README:

Traceback (most recent call last):
  File "example.py", line 9, in <module>
    tpot.fit(X_train, y_train)
  File "/Users/marcel/workspace/tpot/tpot/tpot.py", line 162, in fit
    self.toolbox.register('evaluate', self.evaluate_individual, training_testing_data=training_testing_data)
AttributeError: 'TPOT' object has no attribute 'evaluate_individual'

Changing member function name back to _evaluate_individual resolves the issue.

Installation error with Python 2

[pete@dakota ~]$ pip install tpot
Collecting tpot
  Downloading TPOT-0.1.2.tar.gz (165kB)
    100% |████████████████████████████████| 167kB 811kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot/setup.py", line 18, in <module>
        package_version = calculate_version()
      File "/private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot/setup.py", line 13, in calculate_version
        version = next(filter(lambda x: '__version__' in x, initpy)).split('\'')[1]
    TypeError: list object is not an iterator

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/c3/bhnfltk57zb_fs68gy3mcr300000gn/T/pip-build-be0XTk/tpot

Probably because filter() returns an iterator in Python 3 but a plain list in Python 2, and next() cannot be called on a list.

Broken Landscape Badge

For some reason, the Landscape badge is now broken after the last pull request. I don't think there is a problem with the link, since it worked previously; however, I noticed that it also no longer shows on landscape.io itself (the other badge styles still seem to work). Weird!

[screenshot: the broken Landscape badge, 2015-11-16]

I think this may be a GitHub-related caching error; see, e.g., the discussion here: github/markup#224

Let's keep an eye on that; maybe it resolves itself in a few hours. Otherwise, we'd have to investigate further...

Add initialization statements to script version

Currently, TPOT prints the settings etc. at the beginning of a run in the command-line version at high verbosity (=2), but not in the script version. Make the script version print the settings at high verbosity as well.

Basic project documentation

Flesh out the README to provide

  • a basic working example of TPOT
  • installation instructions
  • a longer description of TPOT

Using the tpot object for prediction

Error with .predict for iris example

from tpot import TPOT
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split

digits = load_iris()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOT(generations=10)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))

But when I try to use the pipeline as a predictor

tpot.predict(X_train, y_train, X_test)

this is the error I get (IPython debugger output):

TypeError                                 Traceback (most recent call last)
<ipython-input-8-74abe9ee292a> in <module>()
----> 1 tpot.predict(X_train, y_train, X_test)

/usr/local/lib/python2.7/dist-packages/tpot/tpot.pyc in predict(self, training_features, training_classes, testing_features)
    290 
    291         result = func(training_testing_data)
--> 292         return result[result['group'] == 'testing', 'guess'].values
    293 
    294     def score(self, training_features, training_classes, testing_features, testing_classes):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1656             return self._getitem_multilevel(key)
   1657         else:
-> 1658             return self._getitem_column(key)
   1659 
   1660     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1663         # get column
   1664         if self.columns.is_unique:
-> 1665             return self._get_item_cache(key)
   1666 
   1667         # duplicate columns & possible reduce dimensionaility

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1001     def _get_item_cache(self, item):
   1002         cache = self._item_cache
-> 1003         res = cache.get(item)
   1004         if res is None:
   1005             values = self._data.get(item)

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
    623     def __hash__(self):
    624         raise TypeError('{0!r} objects are mutable, thus they cannot be'
--> 625                         ' hashed'.format(self.__class__.__name__))
    626 
    627     def __iter__(self):

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Interpreting generated code

Running the iris example generated this piece of code

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, 
                                                                     n_iter=1, 
                                                                     train_size=0.75)))
result1 = tpot_data.copy()

# Perform classification with a decision tree classifier
dtc1 = DecisionTreeClassifier(max_features=min(83, len(result1.columns) - 1), max_depth=19)
dtc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['dtc1-classification'] = dtc1.predict(result1.drop('class', axis=1).values)

# Perform classification with a decision tree classifier
dtc2 = DecisionTreeClassifier(max_features='auto', max_depth=56)
dtc2.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result2 = result1
result2['dtc2-classification'] = dtc2.predict(result2.drop('class', axis=1).values)

I'm struggling a bit to understand the intended idea behind this result2 dataframe. There are two classification results in the above example, both from decision trees with different hyper-parameters, but how do these get combined?

Ensure Access to Model Parameters upon Early Termination

From @rasbt:

Let's make sure that we don't lose the model parameters if the run is terminated early.

  • Add a "verbose" parameter that writes to stderr (see the sketch after this list). This way, the user can pipe the output (e.g., model parameters and metrics) to a log file. This is especially useful for keeping track of the process when running TPOT as a PBS job, and it ensures access to the model parameters if the job crashes (or hits the wall time).
  • Also, make sure that the current state is saved gracefully if the program quits unexpectedly.
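
A minimal sketch of the first suggestion, assuming a hypothetical per-generation hook (illustrative only, not TPOT's actual API):

import sys

# Hypothetical progress hook: stream per-generation stats to stderr so a
# PBS job can capture them, e.g. `python run_tpot.py 2> tpot_progress.log`.
def report_progress(generation, best_score):
    sys.stderr.write('gen %d\tbest CV score: %.4f\n' % (generation, best_score))
    sys.stderr.flush()  # flush immediately in case the job is killed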

TPOT command line usage help

I downloaded a sample mnist data set into a CSV and installed TPOT and all the dependencies.

I tried running it through the command line; below are the command I ran and the results I got:

$ tpot -i mnist.csv -is , -g 100 -s 42 -v 2

TPOT settings:
crossover_rate  =   0.05
generations =   100
input_file  =   mnist.csv
input_separator =   ,
mutation_rate   =   0.9
population_size =   100
random_state    =   42
verbosity   =   2

gen nevals  Minimum accuracy    Average accuracy    Maximum accuracy
0   100     0.1                 0.404918            0.964608

^C
^CTraceback (most recent call last):
  File "/Users/moi/.pyenv/versions/tpot/bin/tpot", line 9, in <module>
    load_entry_point('TPOT==0.1.3', 'console_scripts', 'tpot')()
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/tpot/tpot.py", line 479, in main
    training_features, training_classes)))
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/tpot/tpot.py", line 207, in score
    training_testing_data.rename(columns={column: str(column).zfill(5)}, inplace=True)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/frame.py", line 2697, in rename
    **kwargs)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/generic.py", line 606, in rename
    result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 2587, in rename_axis
    obj = self.copy(deep=copy)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 3059, in copy
    do_integrity_check=False)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 2823, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/moi/.pyenv/versions/tpot/lib/python3.5/site-packages/pandas/core/internals.py", line 578, in copy
    values = values.copy()
KeyboardInterrupt

It took a couple of hours until TPOT printed the stats summary, and after another hour or so it was still running, so I terminated it. I'm curious what a completed TPOT run looks like. For some reason I was expecting code to be written to a directory, per the README:

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

Or maybe this Python source is printed to the terminal? Going to start reading the TPOT source more thoroughly.

Allow passing of sparse matrices

In some situations, it's easier to pass a scipy.sparse matrix object to sklearn model objects. This would reduce the memory requirements when fitting larger datasets.

Most sklearn models will accept a sparse matrix. For those that do not, checking for sparsity in the method call and calling matrix.todense() would work.
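
A minimal sketch of the proposed check, assuming a hypothetical flag that says whether the wrapped estimator accepts sparse input:

from scipy import sparse

def densify_if_needed(X, estimator_accepts_sparse):
    # Hypothetical helper: pass sparse input through when the estimator
    # supports it; otherwise fall back to a dense array (at a memory cost).
    if sparse.issparse(X) and not estimator_accepts_sparse:
        return X.toarray()
    return X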

Implement a predict() function

Create a predict() function to provide a similar interface as with scikit-learn models.

predict() takes 3 parameters:

  • training features
  • training classes
  • testing features

and returns the predicted testing classes.
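
A sketch of the proposed signature (the _best_pipeline() accessor is hypothetical; at the time, the optimized pipeline was a function that re-trains on each call):

def predict(self, training_features, training_classes, testing_features):
    # Re-fit the best discovered pipeline on the training data, then
    # return the predicted classes for the testing features.
    pipeline = self._best_pipeline()  # hypothetical accessor
    pipeline.fit(training_features, training_classes)
    return pipeline.predict(testing_features)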

Performance Metrics & Fitness Functions

I imagine that it would be useful to use TPOT to find models that optimize alternate performance metrics like precision, recall, etc. As such, I've come up with the following brainstorming questions:

  1. Does having alternate performance metrics/fitness functions make sense for the users?
  2. Does it make sense to add alternate metrics when reporting the best model? If so, which metric do we use as the fitness function?
  3. Since there is native support for multi-class/multi-label classification, regular precision, recall, and F1 may not be that useful. Should we just take the averaged versions of scores like these when necessary? (One macro-averaging option is sketched below.)

There are plenty of other questions, but I figured this would be a decent place to start. Let me know if I'm totally misguided in proposing this -- I won't profess to be an expert in genetic programming.
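
For question 3, macro-averaging is one standard way to extend binary metrics to multi-class problems, e.g. with scikit-learn (y_true and y_pred are placeholders for true and predicted labels):

from sklearn.metrics import f1_score, precision_score, recall_score

# Macro-averaging computes each metric per class, then averages the
# per-class scores, so no single class dominates the result.
f1_macro = f1_score(y_true, y_pred, average='macro')
precision_macro = precision_score(y_true, y_pred, average='macro')
recall_macro = recall_score(y_true, y_pred, average='macro')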

Export TPOT pipelines to Orange file format

Orange has a pretty nice interface for building sklearn pipelines in a GUI.

image

It'd be great to have a function that exports TPOT pipelines into a file format that can be opened with Orange.

I've contacted the Orange devs and they said that it should be relatively straightforward to accomplish this as long as Orange has each TPOT pipeline operator already implemented as an Orange widget.

I'd imagine a good first step would be to thoroughly explore the Orange software and see what widgets/pipeline operators they already have implemented.


Here's an example Orange file:

<?xml version='1.0' encoding='utf-8'?>
<scheme description="" title="" version="2.0">
    <nodes>
        <node id="0" name="File" position="(150, 150)" project_name="Orange" qualified_name="Orange.widgets.data.owfile.OWFile" title="File" version="" />
        <node id="1" name="Data Info" position="(319.0, 77.0)" project_name="Orange" qualified_name="Orange.widgets.data.owdatainfo.OWDataInfo" title="Data Info" version="" />
        <node id="2" name="Test &amp; Score" position="(318.0, 209.0)" project_name="Orange" qualified_name="Orange.widgets.evaluate.owtestlearners.OWTestLearners" title="Test &amp; Score" version="" />
        <node id="3" name="Logistic Regression" position="(162.0, 324.0)" project_name="Orange" qualified_name="Orange.widgets.classify.owlogisticregression.OWLogisticRegression" title="Logistic Regression" version="" />
    </nodes>
    <links>
        <link enabled="true" id="0" sink_channel="Data" sink_node_id="1" source_channel="Data" source_node_id="0" />
        <link enabled="true" id="1" sink_channel="Data" sink_node_id="2" source_channel="Data" source_node_id="0" />
        <link enabled="true" id="2" sink_channel="Learner" sink_node_id="2" source_channel="Learner" source_node_id="3" />
    </links>
</scheme>

Add more feature selection operators

Packaging and Unit tests

From @rasbt:

I think it would be worthwhile to turn TPOT into an importable python module/package and to add unit tests. This would help with the development, especially in collaboration.

For the first public release, I think it would be a big plus to add continuous integration, e.g,. Travis CI (in terms of trustworthiness)

Some bugs in the generated code with feature selection and scaler

I ran a couple of experiments on MNIST and observed that the code generation is a bit buggy at the moment. In the first example, the only operator generated is SelectPercentile:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's SelectPercentile for feature selection
training_features = result2.loc[training_indices].drop('class', axis=1)
training_class_vals = result2.loc[training_indices, 'class'].values

if len(training_features.columns.values) == 0:
result3 = result2.copy()
else:
selector = SelectPercentile(f_classif, percentile=100)
selector.fit(training_features.values, training_class_vals)
mask = selector.get_support(True)
mask_cols = list(training_features.iloc[:, mask].columns) + ['class']
result3 = result2[mask_cols]
  • No indentation
  • result2 is not defined
  • optimized_pipeline_ contains _select_percentile, svc, _standard_scaler, but svc and
    standard scaler don't appear in the generated code

Another example with RobustScaler:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's RobustScaler to scale the features
training_features = result3.loc[training_indices].drop('class', axis=1)
result4 = result3.copy()

if len(training_features.columns.values) > 0:
scaler = RobustScaler()
scaler.fit(training_features.values.astype(np.float64))
scaled_features = scaler.transform(result4.drop('class', axis=1).values.astype(np.float64))

for col_num, column in enumerate(result4.drop('class', axis=1).columns.values):
    result4.loc[:, column] = scaled_features[:, col_num]
  • No indentation
  • result3 is not defined
  • optimized_pipeline_ contains _robust_scaler, svc, svc, _select_percentile, but svc, svc and
    _select_percentile don't appear in the generated code

Open more ML model parameters to optimization

From #39, we discussed parameters that may be important to open up to search for the various ML models in TPOT. The sklearn devs have a general sense of some of the important parameters, below, but this is not an exhaustive list.

I think it would be valuable at some point to explore what parameters are most important to optimize for the various models used in TPOT, as I discussed here.

_DEFAULT_PARAM_GRIDS = {'AdaBoostClassifier':
                        [{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'AdaBoostRegressor':
                        [{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'DecisionTreeClassifier':
                        [{'max_features': ["auto", None]}],
                        'DecisionTreeRegressor':
                        [{'max_features': ["auto", None]}],
                        'ElasticNet':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'GradientBoostingClassifier':
                        [{'max_depth': [1, 3, 5]}],
                        'GradientBoostingRegressor':
                        [{'max_depth': [1, 3, 5]}],
                        'KNeighborsClassifier':
                        [{'n_neighbors': [1, 5, 10, 100],
                          'weights': ['uniform', 'distance']}],
                        'KNeighborsRegressor':
                        [{'n_neighbors': [1, 5, 10, 100],
                          'weights': ['uniform', 'distance']}],
                        'Lasso':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'LinearRegression':
                        [{}],
                        'LinearSVC':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'LogisticRegression':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SVC': [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
                                 'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'MultinomialNB':
                        [{'alpha': [0.1, 0.25, 0.5, 0.75, 1.0]}],
                        'RandomForestClassifier':
                        [{'max_depth': [1, 5, 10, None]}],
                        'RandomForestRegressor':
                        [{'max_depth': [1, 5, 10, None]}],
                        'Ridge':
                        [{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SGDClassifier':
                        [{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
                          'penalty': ['l1', 'l2', 'elasticnet']}],
                        'SGDRegressor':
                        [{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
                          'penalty': ['l1', 'l2', 'elasticnet']}],
                        'LinearSVR':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}],
                        'SVR':
                        [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
                          'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}]}

Smart seeding of TPOT populations?

Sorry if the text below sounds like rambling -- I was using this issue to brainstorm.

I've been thinking about possible ways to make TPOT perform better right out of the box, without having to run it for several generations to finally discover the better pipelines. One of the ideas I've had is to seed the TPOT population with a smarter group of solutions.

For example, we know that a TPOT pipeline will need at least one model, so we can seed it with each of the 6 current models over a small range of parameters:

  • decision tree: all combinations of
    • max_features = [0 (--> auto), 1 (--> None)]
    • max_depth: [0 (--> None), 1, 5, 10, 20, 50]
    • = 12 total combinations
  • random forest: all combinations of
    • n_estimators = [100, 500]
    • max_features = [0 (--> auto), 1]
    • = 4 total combinations
  • logistic regression:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • svc:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • knnc:
    • n_neighbors = [2, 5, 10, 20, 50]
    • = 5 total combinations
  • gradient boosting: all combinations of
    • learning_rate: [0.01, 0.1, 0.5, 1.0]
    • n_estimators: [100, 500]
    • max_depth: [0 (--> None), 5, 10]
    • = 24 total combinations

That gives us 59 "classifier-only" TPOT pipelines to start with.

We also have 4 feature selectors:

  • RFE: all combinations of
    • num_features = [1, 5, 10, 50]
    • step = [0.1, 0.25, 0.5]
    • = 12 total combinations
  • select percentile:
    • percentile = [1, 5, 10, 25, 50, 75]
    • = 6 total combinations
  • select k best:
    • k = [1, 2, 5, 10, 20, 50]
    • = 6 total combinations
  • variance threshold:
    • threshold = [0.1, 0.2, 0.3, 0.4, 0.5]
    • = 5 total combinations

And 4 feature preprocessors:

  • standard scaler (no parameters)
    • = 1 total combinations
  • robust scaler (no parameters)
    • = 1 total combinations
  • polynomial features (no parameters)
    • = 1 total combinations
  • PCA:
    • n_components = [1, 2, 4, 10, 20]
    • = 5 total combinations

Thus, if we wanted to provide at least one feature preprocessor or selector in the pipeline before passing the data to the model, that would result in:

feature selection combinations = 12 * 59 + 6 * 59 + 6 * 59 + 5 * 59 = 1,711

feature preprocessor combinations = 5 * 59 + 1 * 59 + 1 * 59 + 1 * 59 = 472

Giving us a total = 59 + 1,711 + 472 = 2,242 pipeline combinations to start out with.

We'd evaluate all 2,242 of these pipelines then use the top 100 to seed the TPOT population. From there, the GP algorithm is allowed to tinker with the pipeline, fine-tune the parameters, and possibly discover better combinations of pipeline operators.

That's obviously a lot of pipelines to try out at the beginning -- about 23 generations' worth of pipelines, which will be quite slow on any decently sized data set. It may be necessary to cut down on the parameters that we try out at the beginning.
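
As a quick sanity check of the arithmetic above:

# Seed-pipeline counts from the lists above.
classifier_seeds = 12 + 4 + 7 + 7 + 5 + 24                 # 59 classifier-only pipelines
selector_combos = (12 + 6 + 6 + 5) * classifier_seeds      # 1,711 selector + classifier
preprocessor_combos = (5 + 1 + 1 + 1) * classifier_seeds   # 472 preprocessor + classifier
print(classifier_seeds + selector_combos + preprocessor_combos)  # 2242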

Address pipeline overfitting

Currently, TPOT has a tendency to build pipelines that overfit the data unless a good training sample is provided. We need to devise a method to combat overfitting on the pipeline level. Here's what I'm looking to explore:

Multi-objective fitness: Optimize along two fitness axes, where one is classification accuracy and the other is model complexity. Model complexity can be quantified in several ways:

  • The number of model pipeline operators in the pipeline
  • The number of pipeline operators in the pipeline
  • The sum of the number of features at every stage of the pipeline

Pareto optimization: Taking ideas from the famous NSGA-II algorithm, we can explore a two-fitness-axis optimization problem but treat the fitnesses as Pareto fronts instead. This results in a group of pipelines to select from at the end of the optimization process, where the user hand-selects the trade-off between complexity and accuracy (rather than strictly minimizing model complexity as in the multi-objective fitness approach).
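
TPOT's GP machinery is built on DEAP, where a two-axis fitness like this can be declared as follows (a minimal sketch; the weights maximize accuracy and minimize complexity):

from deap import base, creator, tools

# Maximize accuracy (+1.0) while minimizing pipeline complexity (-1.0).
creator.create('FitnessMulti', base.Fitness, weights=(1.0, -1.0))
creator.create('Individual', list, fitness=creator.FitnessMulti)

# NSGA-II selection then keeps the Pareto-efficient trade-offs, e.g.:
# selected = tools.selNSGA2(population, k=50)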

I'll be working on this over Winter break, so please feel free to provide feedback and ideas.

Break export() down into 3 separate functions

export() is currently too large. Break it down into 3 functions based on the primary steps of the function:

  • Replace all of the mathematical operators with their results
  • Unroll the nested function calls into serial code
  • Replace the function calls with their corresponding Python code

This change will make it easier to unit test the export() function as well.
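
A sketch of the decomposition (all three helper names are hypothetical):

def export(self, output_file_name):
    # Step 1: replace mathematical operators with their computed results.
    tree = self._replace_math_operators(self._optimized_pipeline)
    # Step 2: unroll the nested function calls into serial code.
    steps = self._unroll_nested_calls(tree)
    # Step 3: replace the function calls with their Python equivalents.
    code = self._generate_pipeline_code(steps)
    with open(output_file_name, 'w') as output_file:
        output_file.write(code)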

Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call?

Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can train the sklearn models on the training data again. This is required because the pipeline consists of functions: each random forest, decision tree, etc. is a function and the model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing score() and predict() calls against the pipeline.
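
One possible direction, sketched below: make each operator an object that caches its fitted estimator instead of a bare function (illustrative only, not a settled design):

class PersistentModelOperator(object):
    """Caches the fitted estimator so later score()/predict() calls
    don't have to re-train it (hypothetical design sketch)."""

    def __init__(self, estimator):
        self.estimator = estimator
        self._fitted = False

    def __call__(self, training_features, training_classes, features):
        # Train once; subsequent calls reuse the fitted model.
        if not self._fitted:
            self.estimator.fit(training_features, training_classes)
            self._fitted = True
        return self.estimator.predict(features)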

Issue with train/test split

Sometimes when both train_size and test_size aren't specified in StratifiedShuffleSplit() calls, the split doesn't use the entire data set. Change all split calls to explicitly specify both train_size and test_size.
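
With the modern scikit-learn API, the fix looks like this (X and y are placeholders for the features and labels):

from sklearn.model_selection import StratifiedShuffleSplit

# Specifying both sizes guarantees the split covers the entire data set.
splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.75, test_size=0.25)
train_indices, test_indices = next(splitter.split(X, y))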

Convenience function: Detect if there are non-numerical features and encode them as numerical features

(As discussed in #60)

Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:

  1. detects whether there exist non-numerical features in the feature set

  2. sends a warning to the user that they should preprocess the non-numerical features into numerical features

  3. ... but also tells the user that TPOT is automatically encoding the non-numerical features as numerical features, does so, and passes the new preprocessed feature set to the optimization process.
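
A minimal sketch of such a convenience step with pandas (the function name is hypothetical):

import pandas as pd

def encode_non_numerical_features(df):
    # 1. Detect non-numerical columns; 2. warn the user; 3. integer-encode
    # them so the optimization can proceed on a fully numerical feature set.
    non_numeric = df.select_dtypes(exclude='number').columns
    if len(non_numeric) > 0:
        print('Warning: automatically encoding non-numerical features: '
              + ', '.join(non_numeric))
        for column in non_numeric:
            df[column] = pd.factorize(df[column])[0]
    return df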

Support multiclass accuracy

Currently, TPOT only handles binary classification accuracy. Many ML problems are multiclass -- make sure TPOT can handle this.
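
Balanced accuracy (the average of per-class recalls) is one metric that generalizes cleanly to the multiclass case; a sketch, assuming NumPy arrays y_true and y_pred:

import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Average the per-class recalls so every class counts equally,
    # regardless of class imbalance or the number of classes.
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)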

Implement more classifier pipeline operators

Similar to the Decision Tree and Random Forest classifier pipeline operators, also implement:

@rasbt, do you think we should add any more than this? I'd like to add ANNs eventually, but since they're not directly supported in sklearn, that will wait for a later time.

Pickling TPOT objects

I wonder if there'd be any interest in generating something to pickle TPOT pipelines. Aside from the immediate use case of "I want to import my pipeline and work with it more easily than with the .export()-generated .py file", it'd also help immensely in parallelising pipeline search, say if I wanted to find several different pipelines simultaneously and compare their scores afterwards. Admittedly, I haven't looked into the structure of TPOT pipelines, so I'm not sure how complex they are. Is this completely nontrivial?
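
For reference, the use case sketched above would look something like this (the fitted_pipeline_ attribute is an assumption here; check which attribute your TPOT version actually exposes):

import pickle

# Persist the best pipeline found by a fitted TPOT object (attribute
# name assumed; it has varied across TPOT versions).
with open('tpot_pipeline.pkl', 'wb') as out:
    pickle.dump(tpot.fitted_pipeline_, out)

# Later, or in a parallel worker comparing several searches:
with open('tpot_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)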

Add more preprocessing pipeline operators

Project Documentation Enhancement

I was thinking that it may be worthwhile to set up project documentation somewhere other than this GitHub repo -- for example, via Sphinx or MkDocs. This would make it possible to create and organize API documentation and tutorials/examples. I could set up something like http://rasbt.github.io/biopandas/ if you'd find it useful.

tpot should handle --version argument

I was running into runtime errors when executing the example (MNIST data set): "AttributeError: TPOT instance has no attribute 'export'". I was hoping to check whether the correct tpot version is installed, but it seems that --version isn't supported.

Trim out data transformation operators that are downstream of the last classification step

Sometimes the optimized pipeline will look something like this:

transformation -> transformation -> classification -> transformation

The last transformation step adds nothing. We should clean up the pipeline by adding a post-processing step to tpot.fit that trims out unnecessary operators from the optimized pipeline. This will be trivial after incorporating the refactor in #63, as we could just add an attribute to the base classes to identify whether or not an operator can be the pipeline terminus. Something like:

class BasicOperator(object): 
        ...
        self._terminal_operator = False
        ...

class LearnerOperator(object): 
        ...
        self._terminal_operator = True
        ...

I felt it'd probably be better to create a new issue for this topic rather than unilaterally adding a commit downstream of the #63 HEAD.
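
With that attribute in place, the post-processing trim could be as simple as the following sketch:

def trim_trailing_operators(pipeline_operators):
    # Keep everything up to and including the last terminal operator
    # (e.g. a classifier); drop transformations that follow it.
    last_terminal = max(i for i, op in enumerate(pipeline_operators)
                        if op._terminal_operator)
    return pipeline_operators[:last_terminal + 1]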

Questions about TPOT

http://www.randalolson.com/2015/11/15/introducing-tpot-the-data-science-assistant/

Perhaps the most basic way to help is to give TPOT a try for your normal workflow and let me know how it works for you. What worked well? What didn't work well? What new features do you think would help? I have my way of doing things, but I'd like to design this tool to be useful for everyone.

Given that this is very much work in progress, I am primarily wondering:

  1. If/how can this be used to directly deal with ASTs (parse trees) -- i.e., beyond GAs, so that it can be used to seed/create and mutate syntax trees (e.g., from Python's ast module)?
  2. Are there any plans to support OpenCL, e.g., for running things concurrently on GPUs or idle CPU cores?

Thanks

(note that numpy based code can often be easily moved to OpenCL using pyOpenCL)

Expand project unit tests and integration tests

Currently, there are only a few unit tests in tests.py. These are basic unit tests and don't cover a large portion of the project. We should expand the unit tests to cover more of the core TPOT functions.

We also need integration tests that test TPOT as a whole. This can be done with a small, fixed data set and a fixed random number generator seed over only a few generations, with a few different parameter settings.
