
pyshac's Introduction

PySHAC : A Python Library for Sequential Halving and Classification Algorithm


PySHAC is a Python library for easily using the Sequential Halving and Classification algorithm from the paper Parallel Architecture and Hyperparameter Search via Successive Halving and Classification.

Note : This library is not affiliated with Google.

Documentation

Stable build documentation can be found at PySHAC Documentation.

It contains a User Guide, as well as an explanation of the different engines that can be used with PySHAC.

Topic Link
Installation http://titu1994.github.io/pyshac/install/
User Guide http://titu1994.github.io/pyshac/guide/
Managed Engines http://titu1994.github.io/pyshac/managed/
Custom Hyper Parameters http://titu1994.github.io/pyshac/custom-hyper-parameters/
Serial Evaluation http://titu1994.github.io/pyshac/serial-execution/
External Dataset Training http://titu1994.github.io/pyshac/external-dataset-training/
Callbacks http://titu1994.github.io/pyshac/callbacks/

Installation

This library is available for Python 2.7 and 3.4+ via pip for Windows, MacOSX and Linux.

pip install pyshac

To install the master branch of this library :

git clone https://github.com/titu1994/pyshac.git
cd pyshac
pip install .

or pip install .[tests]  # to also include dependencies necessary for testing

To install the requirements before installing the library :

pip install -r "requirements.txt"

To build the docs, additional packages must be installed :

pip install -r "doc_requirements.txt"

Getting started with PySHAC

First, build the set of hyper parameters. The three main HyperParameter classes are :

  • DiscreteHyperParameter
  • UniformContinuousHyperParameter
  • NormalContinuousHyperParameter

There are also 3 additional hyper parameters, which are useful when a parameter needs to be sampled multiple times for each evaluation :

  • MultiDiscreteHyperParameter
  • MultiUniformContinuousHyperParameter
  • MultiNormalContinuousHyperParameter

These multi parameters have an additional argument sample_count which can be used to sample multiple times per step.

Note: The values will be concatenated linearly, so each multi parameter will have a list of values returned in the resultant OrderedDict. If you wish to flatten the entire search space, you can use pyshac.flatten_parameters on this OrderedDict.

import pyshac

# Discrete parameters
dice_rolls = pyshac.DiscreteHyperParameter('dice', values=[1, 2, 3, 4, 5, 6])
coin_flip = pyshac.DiscreteHyperParameter('coin', values=[0, 1])

# Continuous Parameters
classifier_threshold = pyshac.UniformContinuousHyperParameter('threshold', min_value=0.0, max_value=1.0)
noise = pyshac.NormalContinuousHyperParameter('noise', mean=0.0, std=1.0)
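
As a short sketch of the multi parameters described earlier (this assumes pyshac.flatten_parameters returns a flat list of the sampled values):

# Multi parameters : each one is sampled `sample_count` times per evaluation
multi_dice = pyshac.MultiDiscreteHyperParameter('multi_dice', values=[1, 2, 3, 4, 5, 6], sample_count=3)
multi_noise = pyshac.MultiNormalContinuousHyperParameter('multi_noise', mean=0.0, std=1.0, sample_count=2)

def evaluation(worker_id, parameters):
    # 'multi_dice' maps to a list of 3 values, 'multi_noise' to a list of 2 values
    flat_values = pyshac.flatten_parameters(parameters)  # assumed to be a flat list of the sampled values
    return float(sum(flat_values))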

Setup the engine

When setting up the SHAC engine, we need to define a few important parameters which will be used by the engine :

  • Hyper Parameter list: A list of parameters that have been declared. This will constitute the search space.
  • Total budget: The number of evaluations that will occur.
  • Number of batches: The number of samples per batch of evaluation.
  • Objective: String value which can be either max or min. Defines whether the objective should be maximised or minimised.
  • Maximum number of classifiers: As it suggests, decides the upper limit of how many classifiers can be trained. This is optional, and usually not required to specify.
import numpy as np
import pyshac

# define the parameters
param_x = pyshac.UniformContinuousHyperParameter('x', -5.0, 5.0)
param_y = pyshac.UniformContinuousHyperParameter('y', -2.0, 2.0)

parameters = [param_x, param_y]

# define the total budget as 100 evaluations
total_budget = 100  # 100 evaluations at maximum

# define the number of batches
num_batches = 10  # 10 samples per batch

# define the objective
objective = 'min'  # minimize the squared loss

shac = pyshac.SHAC(parameters, total_budget, num_batches, objective)

Training the classifiers

To train a classifier, the user must define an Evaluation function. This is a user-defined function that accepts 2 or more inputs as defined by the engine, and returns a Python floating point value.

The Evaluation Function receives at least 2 inputs :

  • Worker ID: Integer id that can be left alone when executing only on CPU or used to determine the iteration number in the current epoch of evaluation.
  • Parameter OrderedDict: An OrderedDict which contains the (name, value) pairs of the Parameters passed to the engine.
    • Since it is an ordered dict, if only the values are required, list(parameters.values()) can be used to get the list of values in the same order as when the Parameters were declared to the engine.
    • These are the values of the sampled hyper parameters which have passed through the current cascade of models.

An example of a defined evaluation function :

# define the evaluation function
def squared_error_loss(id, parameters):
    x = parameters['x']
    y = parameters['y']
    y_sample = 2 * x - y

    # assume the best values of x and y are 2 and 0 respectively
    y_true = 4.

    return np.square(y_sample - y_true)

A single call to shac.fit() will begin training the classifiers.

There are a few cases to consider:

  • There can be cases where the search space is not large enough to train the maximum number of classifiers (usually 18).
  • There may be instances where we want to allow some relaxation of the constraint that the next batch must pass through all of the previous classifiers. This allows classifiers to train on the same search space repeatedly, rather than divide the search space.

In these cases, we can utilize a few additional parameters to allow the training behaviour to better adapt to these circumstances. These parameters are :

  • skip_cv_checks: As the name suggests, this skips the cross validation checks. If the number of samples per batch is too small, it is preferable to skip these checks, as most classifiers will not pass them.
  • early_stop: Determines whether training should halt as soon as an epoch of failed learning occurs. This is useful when evaluations are very costly.
  • relax_checks: This relaxes the constraint that a sample must pass through all of the classifiers, requiring it instead to pass through most of them. In doing so, more samples can be obtained for the same search space.
# `early_stop` defaults to False, and it is preferred not to use it together with `relax_checks`
shac.fit(squared_error_loss, skip_cv_checks=True, early_stop=False, relax_checks=True)

Sampling the best hyper parameters

Once the models have been trained by the engine, it is as simple as calling predict() to sample single instances or batches of parameters.

Samples can be obtained on a per-instance or per-batch basis (or a combination of both) using the two parameters, num_samples and num_batches.

# sample a single instance of hyper parameters
parameter_samples = shac.predict()  # Gets 1 sample.

# sample multiple instances of hyper parameters
parameter_samples = shac.predict(10)  # Gets 10 samples.

# sample a batch of hyper parameters
parameter_samples = shac.predict(num_batches=5)  # samples 5 batches, each containing 10 samples.

# sample multiple batches and a few additional instances of hyper parameters
parameter_samples = shac.predict(5, 5)  # samples 5 batches (each containing 10 samples) and an additional 5 samples.
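
As a rough sketch of what can be done with the returned samples (this assumes each sample is an OrderedDict of (name, value) pairs, mirroring what the evaluation function receives):

# re-evaluate the sampled hyper parameters with the same loss function
samples = shac.predict(10)
losses = [squared_error_loss(0, sample) for sample in samples]

best_index = int(np.argmin(losses))
print('Best sampled parameters :', samples[best_index])
print('Best sampled loss :', losses[best_index])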

Examples

Examples based on the Branin and Hartmann6 problems can be found in the Examples folder.

An example of how to use the TensorflowSHAC engine is provided in the examples folder as well.

Comparison scripts of basic optimization, Branin and Hartmann6 using Tensorflow Eager 1.8 are provided in the respective folders.

Evaluation of Branin

Branin converged close to the true optimum, as described in the paper.

Evaluation of Hartmann6

Hartmann6 is a much harder problem, and the results are worse than both Random Search 2x and the result from the paper. This may have been due to a bad run, and might be improved with a larger training budget.

Evaluation of Simple Optimization Objective

The task is to sample two parameters x and y, such that z = 2 * x - y and we want z to approach the value of 4. We utilize MSE as the metric between z and the optimal value.

Evaluation of Hyper Parameter Optimization

The task is to sample hyper parameters which provide high accuracy values using TensorflowSHAC engine.



pyshac's Issues

callbacks

I think it isn't necessary, but I wanted to confirm: since model.fit is currently called inside an evaluate_model() function, the callbacks could be re-initialized each time anyway, so I guess it isn't a big deal in any case; they can just be created there.

tensor/generator input

I was looking through the code the other day and some of it appears to assume the dataset can fit in memory as a numpy array. My dataset is 0.5 TB so unfortunately that won't work. Am I mistaken or will changes be required to account for this situation?
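
Not part of pyshac itself, but since the evaluation function is entirely user defined, one possible pattern (a sketch only; the file name, memory-mapped loading and objective below are placeholders) is to stream the data from disk inside the evaluation function instead of holding it in memory:

import numpy as np

# Hypothetical sketch : lazily read a large on-disk array in batches instead of
# loading it all at once. 'large_dataset.npy' and the objective are placeholders.
def evaluation(worker_id, parameters):
    data = np.load('large_dataset.npy', mmap_mode='r')  # memory mapped, not read into RAM
    batch_size = 4096
    total = 0.0
    for start in range(0, len(data), batch_size):
        batch = np.asarray(data[start:start + batch_size])
        total += float(np.mean(batch)) * parameters['threshold']
    return total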

Default hyperparameters that won't be explored.

In my current hyperparam setup I happen to have a very similar API to what you chose. 👍 However, I don't always want to search every hyperparam, but I do want every hyperparam to be defined.

Here is my hyperparameter search space object:

import copy
import json

import numpy as np
import six


class HyperparameterOptions(object):
    """
     [apache v2 license](https://www.apache.org/licenses/LICENSE-2.0)
    """

    def __init__(self, verbose=0):
        self.index_dict = {}
        self.search_space = []
        self.verbose = verbose

    def add_param(self, name, domain, domain_type='discrete', enable=True, required=True, default=None):
        """

        # Arguments

            search_space: list of hyperparameter configurations required by BayesianOptimizer
            index_dict: dictionary that will be used to lookup real values
                and types when we get the hyperopt callback with ints or floats
            enable: this parameter will be part of hyperparameter search
            required: this parameter must be passed to the model
            default: default value if required

        """
        if self.search_space is None:
            self.search_space = []
        if self.index_dict is None:
            self.index_dict = {'current_index': 0}
        if 'current_index' not in self.index_dict:
            self.index_dict['current_index'] = 0

        if enable or required:
            param_index = self.index_dict['current_index']
            numerical_domain = domain
            needs_reverse_lookup = False
            lookup_as = 'float'
            # convert string domains to a domain of integer indexes
            if domain_type == 'discrete':
                if isinstance(domain, list) and isinstance(domain[0], str):
                    numerical_domain = [i for i in range(len(domain))]
                    lookup_as = 'str'
                    needs_reverse_lookup = True
                elif isinstance(domain, list) and isinstance(domain[0], bool):
                    numerical_domain = [i for i in range(len(domain))]
                    lookup_as = 'bool'
                    needs_reverse_lookup = True
                elif isinstance(domain, list) and isinstance(domain[0], float):
                    lookup_as = 'float'
                else:
                    lookup_as = 'int'

            opt_dict = {
                'name': name,
                'type': domain_type,
                'domain': numerical_domain}

            if enable:
                self.search_space += [opt_dict]
                # create a second version for us to construct the real function call
                opt_dict = copy.deepcopy(opt_dict)
                opt_dict['lookup_as'] = lookup_as
            else:
                opt_dict['lookup_as'] = None

            opt_dict['enable'] = enable
            opt_dict['required'] = required
            opt_dict['default'] = default
            opt_dict['domain'] = domain
            opt_dict['needs_reverse_lookup'] = needs_reverse_lookup
            self.index_dict[name] = opt_dict
            if enable:
                opt_dict['index'] = param_index
                self.index_dict['current_index'] += 1

    def params_to_args(self, x):
        """ Convert GPyOpt Bayesian Optimizer params back into function call arguments

        Arguments:

            x: the callback parameter of the GPyOpt Bayesian Optimizer
            index_dict: a dictionary with all the information necessary to convert back to function call arguments
        """
        if len(x.shape) == 1:
            # if we get a 1d array convert it to 2d so we are consistent
            x = np.expand_dims(x, axis=0)

        def lookup_as(name, value):
            """ How to lookup internally stored values.
            """
            if name == 'float':
                return float(value)
            elif name == 'int':
                return int(value)
            elif name == 'str':
                return str(value)
            elif name == 'bool':
                return bool(value)
            else:
                raise ValueError('Trying to lookup unsupported type: ' + str(name))

        # x is a funky 2d numpy array, so we convert it back to normal parameters
        kwargs = {}
        if self.verbose > 0:
            print('INDEX DICT: ' + str(self.index_dict))
        for key, opt_dict in six.iteritems(self.index_dict):
            if key == 'current_index':
                continue

            if opt_dict['enable']:
                arg_name = opt_dict['name']
                optimizer_param_column = opt_dict['index']
                if optimizer_param_column >= x.shape[-1]:
                    raise ValueError('Attempting to access optimizer_param_column' + str(optimizer_param_column) +
                                     ' outside parameter bounds' + str(x.shape) +
                                     ' of optimizer array with index dict: ' + str(self.index_dict) +
                                     'and array x: ' + str(x))
                param_value = x[:, optimizer_param_column]
                if opt_dict['type'] == 'discrete':
                    # the value is an integer indexing into the lookup dict
                    if opt_dict['needs_reverse_lookup']:
                        domain_index = int(param_value)
                        domain_value = opt_dict['domain'][domain_index]
                        value = lookup_as(opt_dict['lookup_as'], domain_value)
                    else:
                        value = lookup_as(opt_dict['lookup_as'], param_value)

                else:
                    # the value is a param to use directly
                    value = lookup_as(opt_dict['lookup_as'], param_value)

                kwargs[arg_name] = value
            elif opt_dict['required']:
                if self.verbose > 0:
                    print('REQUIRED NAME: ' + str(opt_dict['name']) + ' DEFAULT: ' + str(opt_dict['default']))
                kwargs[opt_dict['name']] = opt_dict['default']
        return kwargs

    def get_domain(self):
        """ Get the hyperparameter search space in the gpyopt domain format.
        """
        return self.search_space

    def save(self, filename):
        """ Save the HyperParameterOptions search space and argument index dictionary to a json file.
        """
        data = {}
        data['search_space'] = self.search_space
        data['index_dict'] = self.index_dict
        with open(filename, 'w') as fp:
            json.dump(data, fp)

Specifically, I have the extra options to enable params, determine if they are required, and set a default:

    def add_param(self, name, domain, domain_type='discrete', enable=True, required=True, default=None):
        """

        # Arguments

            search_space: list of hyperparameter configurations required by BayesianOptimizer
            index_dict: dictionary that will be used to lookup real values
                and types when we get the hyperopt callback with ints or floats
            enable: this parameter will be part of hyperparameter search
            required: this parameter must be passed to the model
            default: default value if required

        """

This lets me search subsets while still including the configuration data for the larger search space. That way I can re-use the data for narrower/wider searches. It also lets me see the data in the encoded space and decode options individually. Your style is actually better overall, but do you think the functions like UniformContinuousHyperParameter could get something equivalent to the enable, required, & default parameters?

side note: code above has apache v2 license.
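
There is no built-in equivalent yet, but a lightweight approximation with the current API (a sketch only; train_model and the parameter values below are placeholders) is to keep the defaults outside the engine and merge the searched subset into them inside the evaluation function:

import pyshac

# Hypothetical sketch : only 'learning_rate' is searched; the remaining settings
# are fixed defaults that are still passed to the (placeholder) training function.
DEFAULTS = {'learning_rate': 1e-3, 'batch_size': 128, 'use_dropout': True}

searched_params = [pyshac.UniformContinuousHyperParameter('learning_rate', 1e-5, 1e-1)]

def evaluation(worker_id, parameters):
    config = dict(DEFAULTS)
    config.update(parameters)       # searched values override the defaults
    return train_model(**config)    # placeholder user-defined training function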

weight-efficient pyshac, possibly tf eager

ENAS shows that keeping/reloading past model weights during search can make model search much more efficient, and with eager execution models could potentially be modified more often, perhaps on a per-batch basis even. Overhead of building/evaluating models would have to be kept very low, as would memory utilization and the component which selects a new set of model params.

Since I'm not sure how the threading/process model currently works here I'm not sure if I could simply declare a shared model object which would be called with the current choice of params. Do you have any thoughts on gotchas I might encounter, and could it be feasible to run pyshac with eager execution enabled?

I'm hoping it wouldn't be too troublesome, since I see some tfe code (though not using pyshac), and there is already a pytorch backend.

Aside: why is the actual pyshac fit call commented?

Out of range value for min_split_loss, value='0'

    best=self.invokeScoring(blackBoxIteration, pb, context)
  File "<censored>\nick\projects\uniopt\UniOpt\backends\pyshac.py", line 107, in invokeScoring
    shac.fit(pyshacScore, skip_cv_checks=self.skipCV, early_stop=self.earlyStop, relax_checks=self.relaxChecks)
  File "<censored>\Anaconda3\lib\site-packages\pyshac\core\engine.py", line 1201, in fit
    callbacks=callbacks)
  File "<censored>\Anaconda3\lib\site-packages\pyshac\core\engine.py", line 279, in fit
    model = self._train_classifier(x, y, num_splits=num_splits)
  File "<censored>\Anaconda3\lib\site-packages\pyshac\core\engine.py", line 862, in _train_classifier
    n_jobs=self.num_workers)
  File "<censored>\Anaconda3\lib\site-packages\pyshac\utils\xgb_utils.py", line 46, in train_single_model
    scores = cross_val_score(model, encoded_samples, labels, cv=kfold, n_jobs=1)
  File "<censored>\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 402, in cross_val_score
    error_score=error_score)
.....
  File "<censored>\Anaconda3\lib\site-packages\xgboost\core.py", line 165, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b"Out of range value for min_split_loss, value='0'"

I have searched for the name in the repo and traced the hyperparameter setting (by patching the relevant function in xgboost); the hyperparameter is really never set.

{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 200, 'nthread': 10, 'objective': 'binary:logistic', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample':1}

add version number to output files

Might I suggest adding extra pyshac_version and user_version columns to the output CSV files & other output files?

If behaviors/defaults are changed/incremented, old data can still be identified and used correctly based on these values. For example, if a bug in user code didn't utilize one of the hyperparams correctly in past data, that allows corrections to be made without throwing anything away.

Split dealing with CSV from training and dataset cleaning in fit_dataset

Hi again.

I'm implementing resumption and metaoptimization in UniOpt, so for the pyshac backend I need an API to inject points into the optimizer from memory:
a) bulk injection (needed for resumption; may be less efficient since it is done rarely);
b) individual point injection (needed for metaoptimization and should be efficient, so incremental learning).

I guess fit_dataset does this, but

  1. it deals with csv files, so I cannot use it as it is
  2. it does too much work, so I don't want to recreate it

It would be better if the stuff worked not only with arrays, but also with iterators.
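
In the meantime, a rough workaround sketch (the exact CSV layout fit_dataset expects is documented in the External Dataset Training guide; the column names below are an assumption) would be to dump the in-memory points to a CSV and hand that file to fit_dataset:

import csv

# Hypothetical sketch : `history` is a list of (parameter OrderedDict, score)
# pairs already held in memory. Verify the header and column order against the
# External Dataset Training guide before relying on this.
def dump_history_to_csv(history, path='shac_dataset.csv'):
    header = list(history[0][0].keys()) + ['scores']
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for params, score in history:
            writer.writerow(list(params.values()) + [score])

# dump_history_to_csv(history)
# shac.fit_dataset('shac_dataset.csv')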

existing dataset of hyperparameters + scores

Hey, I recently did a search with bayesian optimization so I have a few thousand hyperparameter sets with scores. I'd be interested to feed that data in then make a few new predictions to see how it does. Is that possible?
