Using the Built-in Schedulers
In this tutorial, you will learn how to use and configure the most important built-in HPO algorithms. Alternatively, you can also use most algorithms from Ray Tune.
This tutorial provides a walkthrough of some of the topics addressed here.
Schedulers and Searchers
The decision-making algorithms driving an HPO experiment are referred to as schedulers. As in Ray Tune, some of our schedulers are internally configured by a searcher. A scheduler interacts with the backend, making decisions on which configuration to evaluate next, and whether to stop, pause or resume existing trials. It relays "next configuration" decisions to the searcher. Some searchers maintain a surrogate model which is fitted to metric data coming from evaluations.
Note
There are two ways to create many of the schedulers of Syne Tune:

- Import a wrapper class from syne_tune.optimizer.baselines, for example RandomSearch for random search
- Use the template classes FIFOScheduler or HyperbandScheduler together with the searcher argument, for example FIFOScheduler with searcher="random" for random search
Importing from syne_tune.optimizer.baselines
is often simpler. However,
in this tutorial, we will use the template classes in order to expose the
common structure and to explain arguments only once.
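As a quick illustration (not part of the launcher scripts below), here is a minimal sketch of the two equivalent ways to obtain random search. The config_space, metric name, and mode are made up for this example only:

from syne_tune.config_space import loguniform, randint
from syne_tune.optimizer.baselines import RandomSearch
from syne_tune.optimizer.schedulers import FIFOScheduler

# Toy search space, invented for this illustration only
config_space = {
    "learning_rate": loguniform(1e-4, 1e-1),
    "batch_size": randint(8, 128),
    "epochs": 27,  # fixed parameter, passed through to the training script
}

# Variant 1: wrapper class from syne_tune.optimizer.baselines
scheduler_1 = RandomSearch(config_space, metric="accuracy", mode="max")

# Variant 2: template class plus searcher argument; equivalent to variant 1
scheduler_2 = FIFOScheduler(
    config_space, searcher="random", metric="accuracy", mode="max"
)

Both variants behave the same; which one you use is a matter of taste.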
FIFOScheduler
This is the simplest kind of scheduler. It cannot stop or pause trials; each evaluation proceeds to the end. Depending on the searcher, this scheduler supports:
- Random search [searcher="random"]
- Bayesian optimization with Gaussian processes [searcher="bayesopt"]
- Grid search [searcher="grid"]
- TPE with kernel density estimators [searcher="kde"]
- Constrained Bayesian optimization [searcher="bayesopt_constrained"]
- Cost-aware Bayesian optimization [searcher="bayesopt_cost"]
- BORE [searcher="bore"]
We will only consider the first two searchers in this tutorial. Here is a
launcher script using FIFOScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import FIFOScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.benchmark_definitions import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4
    max_wallclock_time = 120
    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' object contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark()
    config_space = benchmark.config_space
    backend = LocalBackend(entry_point=benchmark.script)
    # GP-based Bayesian optimization searcher. Many options can be specified
    # via `search_options`, but let's use the defaults
    searcher = "bayesopt"
    search_options = {'num_init_random': n_workers + 2}
    scheduler = FIFOScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        mode=benchmark.mode,
        metric=benchmark.metric,
    )
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(
            max_wallclock_time=max_wallclock_time
        ),
        n_workers=n_workers,
    )
    tuner.run()
What happens in this launcher script?
- We select the mlp_fashionmnist benchmark, adopting its default hyperparameter search space without modifications.
- We select the local backend, which runs up to n_workers = 4 processes in parallel on the same instance.
- We create a FIFOScheduler with searcher="bayesopt". This means that new configurations to be evaluated are selected by Bayesian optimization, and all trials are run to the end. The scheduler needs to know the config_space, the name of the metric to tune (metric), and whether to minimize or maximize this metric (mode). For mlp_fashionmnist, we have metric = "accuracy" and mode = "max", so we select a configuration which maximizes accuracy.
- Options for the searcher can be passed via search_options. We use defaults, except for changing num_init_random (see below) to the number of workers plus two.
- Finally, we create the tuner, passing trial_backend, scheduler, as well as the stopping criterion for the experiment (stop after 120 seconds) and the number of workers. The experiment is started by tuner.run().
FIFOScheduler provides the full range of arguments. Here, we list the most important ones:

- config_space: Hyperparameter search space. This argument is mandatory. Apart from hyperparameters to be searched over, the space may contain fixed parameters (such as epochs in the example above). A config passed to the training script is always extended by these fixed parameters. If you use a benchmark, you can use benchmark.config_space here, or you can modify this default search space.
- searcher: Selects the searcher to be used (see below).
- search_options: Options to configure the searcher (see below).
- metric, mode: Name of the metric to tune (i.e., the key used in the report call by the training script), which is either to be minimized (mode="min") or maximized (mode="max"). If you use a benchmark, just use benchmark.metric and benchmark.mode here.
- points_to_evaluate: Allows you to specify a list of configurations which are evaluated first. If your training code corresponds to some open source ML algorithm, you may want to use the defaults provided in the code. The entry (or entries) in points_to_evaluate do not have to specify values for all hyperparameters. For any hyperparameter not listed there, the following rule is used to choose a default. For float and int value types, the midpoint of the search range is used (in linear or log scaling). For categorical value types, the first entry in the value set is used. The default is a single config with all values chosen by the default rule. Pass an empty list in order not to specify any initial configs. See the sketch after this list for an example.
- random_seed: Master random seed. Random sampling in schedulers and searchers is done by a number of numpy.random.RandomState generators, whose seeds are derived from random_seed. If not given, a random seed is sampled and printed in the log.
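To make points_to_evaluate and random_seed concrete, here is a hedged sketch reusing config_space and benchmark from the launcher script above. The "learning_rate" entry is only an illustration of a hyperparameter name and may differ from the actual benchmark search space:

from syne_tune.optimizer.schedulers import FIFOScheduler

scheduler = FIFOScheduler(
    config_space,
    searcher="bayesopt",
    metric=benchmark.metric,
    mode=benchmark.mode,
    # Evaluate this configuration first; any hyperparameter not listed here
    # is filled in by the midpoint rule described above
    points_to_evaluate=[{"learning_rate": 1e-3}],
    # Seeds of all internal random generators are derived from this value
    random_seed=31415927,
)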
Random Search
The simplest HPO baseline is random search, which you obtain with searcher="random", or by using RandomSearch instead of FIFOScheduler. Search decisions are not based on past data; a new configuration is chosen by sampling attribute values at random, from distributions specified in config_space. These distributions are detailed here.
If points_to_evaluate is specified, configurations are first taken from this list before any are drawn at random. Options for configuring the searcher are given in search_options. These are:

- debug_log: If True, a useful log output about the search progress is printed.
- allow_duplicates: If True, the same configuration may be suggested more than once. The default is False, in which case sampling is without replacement. Both options are shown in the sketch below.
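A minimal sketch of random search with these two options set explicitly, again reusing config_space and benchmark from the launcher script above:

from syne_tune.optimizer.schedulers import FIFOScheduler

scheduler = FIFOScheduler(
    config_space,
    searcher="random",
    # Verbose search log, and sampling with replacement
    search_options={"debug_log": True, "allow_duplicates": True},
    metric=benchmark.metric,
    mode=benchmark.mode,
)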
Bayesian Optimization
Bayesian optimization is obtained by searcher="bayesopt", or by using BayesianOptimization instead of FIFOScheduler. More information about Bayesian optimization is provided here.
Options for configuring the searcher are given in search_options. These include the options for the random searcher. GPFIFOSearcher provides the full range of arguments. We list the most important ones:
- num_init_random: Number of initial configurations chosen at random (or via points_to_evaluate). In fact, the number of initial configurations is the maximum of this and the length of points_to_evaluate. Afterwards, configurations are chosen by Bayesian optimization (BO). In general, BO is only used once at least one metric value from past trials is available. We recommend setting this value to the number of workers plus two.
- opt_nstarts, opt_maxiter: BO employs a Gaussian process surrogate model, whose own hyperparameters (e.g., kernel parameters, noise variance) are chosen by empirical Bayesian optimization. In general, this is done whenever new data becomes available. It is the most expensive computation in each round. opt_maxiter is the maximum number of L-BFGS iterations. We run opt_nstarts such optimizations from random starting points and pick the best.
- max_size_data_for_model, max_size_top_fraction: GP computations scale cubically with the number of observations, and decision making can become very slow for too many trials. Whenever there are more than max_size_data_for_model observations, the dataset is downsampled to this size. Here, max_size_data_for_model * max_size_top_fraction of the entries correspond to the cases with the best metric values, while the remaining entries are drawn at random (without replacement) from all other cases. Defaults to DEFAULT_MAX_SIZE_DATA_FOR_MODEL.
- opt_skip_init_length, opt_skip_period: Refitting the GP hyperparameters in each round can become expensive, especially when the number of observations grows large. If so, you can choose to do it only every opt_skip_period rounds. Skipping optimizations is done only once the number of observations is above opt_skip_init_length.
- gp_base_kernel: Selects the covariance (or kernel) function to be used in the surrogate model. Current choices are "matern52-ard" (Matern 5/2 with automatic relevance determination; the default) and "matern52-noard" (Matern 5/2 without ARD).
- acq_function: Selects the acquisition function to be used. Current choices are "ei" (negative expected improvement; the default) and "lcb" (lower confidence bound). The latter has the form \(\mu(x) - \kappa \sigma(x)\), where \(\mu(x)\), \(\sigma(x)\) are predictive mean and standard deviation, and \(\kappa > 0\) is a parameter, which can be passed via acq_function_kwargs={"kappa": 0.5} for \(\kappa = 0.5\).
- input_warping: If this is True, inputs are warped before being fed into the covariance function; the effective kernel becomes \(k(w(x), w(x'))\), where \(w(x)\) is a warping transform with two non-negative parameters per component. These parameters are learned along with other parameters of the surrogate model. Input warping allows the surrogate model to represent non-stationary functions, while still keeping the number of parameters small. Note that only those components of \(x\) which belong to non-categorical hyperparameters are warped.
- boxcox_transform: If this is True, target values are transformed before being fitted with a Gaussian marginal likelihood. This is using the Box-Cox transform with a parameter \(\lambda\), which is learned alongside other parameters of the surrogate model. The transform is \(\log y\) for \(\lambda = 0\), and \(y - 1\) for \(\lambda = 1\). This option requires the targets to be positive.

Several of these options are combined in the sketch below.
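The following sketch sets several of these options at once; the particular values are illustrative only, not tuned recommendations, and config_space, benchmark, and n_workers are taken from the launcher script above:

from syne_tune.optimizer.schedulers import FIFOScheduler

search_options = {
    "num_init_random": n_workers + 2,   # initial configurations chosen at random
    "opt_nstarts": 2,                   # restarts for GP hyperparameter fitting
    "opt_maxiter": 50,                  # maximum L-BFGS iterations per restart
    "opt_skip_init_length": 100,        # start skipping refits beyond this data size
    "opt_skip_period": 3,               # then refit GP hyperparameters every 3 rounds
    "gp_base_kernel": "matern52-ard",   # Matern 5/2 kernel with ARD (the default)
    "acq_function": "lcb",              # lower confidence bound acquisition
    "acq_function_kwargs": {"kappa": 0.5},
    "input_warping": True,              # warp inputs to model non-stationary functions
}
scheduler = FIFOScheduler(
    config_space,
    searcher="bayesopt",
    search_options=search_options,
    metric=benchmark.metric,
    mode=benchmark.mode,
)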
HyperbandScheduler
This scheduler comes in at least two different variants: one may stop trials early (type="stopping"), the other may pause trials and resume them later (type="promotion"). For tuning neural network models, it tends to work much better than FIFOScheduler. You may have read about successive halving and Hyperband before. Chances are you read about synchronous scheduling of parallel evaluations, while both HyperbandScheduler and FIFOScheduler implement asynchronous scheduling, which can be substantially more efficient. This tutorial provides details about synchronous and asynchronous variants of successive halving and Hyperband.
Here is a launcher script using HyperbandScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import HyperbandScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.benchmark_definitions import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4
    max_wallclock_time = 120
    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' object contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark()
    config_space = benchmark.config_space
    backend = LocalBackend(entry_point=benchmark.script)
    # MOBSTER: Combination of asynchronous successive halving with
    # GP-based Bayesian optimization
    searcher = 'bayesopt'
    search_options = {'num_init_random': n_workers + 2}
    scheduler = HyperbandScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        type="stopping",
        max_resource_attr=benchmark.max_resource_attr,
        resource_attr=benchmark.resource_attr,
        mode=benchmark.mode,
        metric=benchmark.metric,
        grace_period=1,
        reduction_factor=3,
    )
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(
            max_wallclock_time=max_wallclock_time
        ),
        n_workers=n_workers,
    )
    tuner.run()
Much of this launcher script is the same as for FIFOScheduler, but HyperbandScheduler comes with a number of extra arguments we will explain in the sequel (type, max_resource_attr, grace_period, reduction_factor, resource_attr). The mlp_fashionmnist benchmark trains a two-layer MLP on FashionMNIST (more details are here). The accuracy is computed and reported at the end of each epoch:
for epoch in range(resume_from + 1, config['epochs'] + 1):
    train_model(config, state, train_loader)
    accuracy = validate_model(config, state, valid_loader)
    report(epoch=epoch, accuracy=accuracy)
While metric="accuracy" is the criterion to be optimized, resource_attr="epoch" is the resource attribute. In the schedulers discussed here, the resource attribute must be a positive integer.
HyperbandScheduler maintains reported metrics for all trials at certain rung levels (levels of the resource attribute epoch at which scheduling decisions are done). When a trial reports (epoch, accuracy) for a rung level == epoch, the scheduler decides whether to stop (pause) the trial or let it continue. This decision is based on all accuracy values encountered before at the same rung level. Whenever a trial is stopped (or paused), the executing worker becomes available to evaluate a different configuration.
Rung level spacing and stop/go decisions are determined by the parameters max_resource_attr, grace_period, and reduction_factor. The first is the name of the attribute in config_space which contains the maximum number of epochs to train (max_resource_attr == "epochs" in our benchmark). This allows the training script to obtain max_resource_value = config[max_resource_attr] (i.e., config["epochs"] here). Rung levels are \(r_{min}, r_{min} \eta, r_{min} \eta^2, \dots, r_{max}\), where \(r_{min}\) is grace_period, \(\eta\) is reduction_factor, and \(r_{max}\) is max_resource_value. In the example above, max_resource_value = 81, grace_period = 1, and reduction_factor = 3, so that rung levels are 1, 3, 9, 27, 81. The spacing is such that stop/go decisions are done less frequently for trials which already went further: they have earned trust by not being stopped earlier. \(r_{max}\) need not be of the form \(r_{min} \eta^k\). If max_resource_value = 56 in the example above, the rung levels would be 1, 3, 9, 27, 56.
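This spacing rule can be reproduced with a few lines of Python. This is not Syne Tune's internal code, just a sketch of the rule stated above:

def rung_levels(grace_period, reduction_factor, max_resource_value):
    # Geometric spacing grace_period * reduction_factor^k, capped at the maximum resource
    levels, r = [], grace_period
    while r < max_resource_value:
        levels.append(r)
        r *= reduction_factor
    levels.append(max_resource_value)
    return levels

print(rung_levels(1, 3, 81))  # [1, 3, 9, 27, 81]
print(rung_levels(1, 3, 56))  # [1, 3, 9, 27, 56]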
Given such a rung level spacing, stop/go decisions are made by comparing accuracy to the 1 / reduction_factor quantile of values recorded at the rung level. In the example above, our trial is stopped if accuracy is no better than the best 1/3 of previous values (the list includes the current accuracy value); otherwise it is allowed to continue.
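The following sketch spells out this rule for a metric that is maximized (such as accuracy). It is only an illustration of the idea; Syne Tune's exact quantile and tie-breaking conventions may differ:

import numpy as np

def trial_continues(new_value, recorded_values, reduction_factor):
    # Continue only if the new value lies in the best 1/reduction_factor
    # fraction of all values recorded at this rung (including the new one)
    values = np.array(list(recorded_values) + [new_value])
    threshold = np.quantile(values, 1.0 - 1.0 / reduction_factor)
    return new_value >= threshold

print(trial_continues(0.65, [0.9, 0.8, 0.7, 0.6, 0.5], reduction_factor=3))  # False
print(trial_continues(0.95, [0.9, 0.8, 0.7, 0.6, 0.5], reduction_factor=3))  # True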
Further details about HyperbandScheduler and multi-fidelity HPO methods are given in this tutorial.
HyperbandScheduler provides the full range of arguments. Here, we list the most important ones:
- max_resource_attr, grace_period, reduction_factor: As detailed above, these determine the rung levels and the stop/go decisions. The resource attribute is a positive integer. We need reduction_factor >= 2. Note that instead of max_resource_attr, you can also use max_t, as detailed here.
- rung_increment: This parameter can be used instead of reduction_factor (the latter takes precedence). In this case, rung levels are spaced linearly: \(r_{min} + j \nu, j = 0, 1, 2, \dots\), where \(\nu\) is rung_increment. The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels.
- rung_levels: Alternatively, the user can specify the list of rung levels directly (positive integers, strictly increasing). The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels. A sketch with user-specified rung levels follows after this list.
- type: The most important values are "stopping" and "promotion" (see above).
- brackets: Number of brackets to be used in Hyperband. More details are found here. The default is 1 (successive halving).
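For illustration, here is a hedged sketch with user-specified rung levels instead of the geometric spacing; the particular levels are made up (positive, strictly increasing integers, kept below the maximum number of epochs), and config_space and benchmark come from the launcher script above:

from syne_tune.optimizer.schedulers import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    searcher="random",
    type="promotion",
    # Replaces the grace_period / reduction_factor spacing
    rung_levels=[1, 2, 4, 8, 16, 32, 64],
    max_resource_attr=benchmark.max_resource_attr,
    resource_attr=benchmark.resource_attr,
    metric=benchmark.metric,
    mode=benchmark.mode,
)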
Depending on the searcher, this scheduler supports:
- Asynchronous successive halving (ASHA) [searcher="random"]
- MOBSTER [searcher="bayesopt"]
- Asynchronous BOHB [searcher="kde"]
- Hyper-Tune [searcher="hypertune"]
- Cost-aware Bayesian optimization [searcher="bayesopt_cost"]
- BORE [searcher="bore"]
- DyHPO [searcher="dyhpo", type="dyhpo"]
We will only consider the first two searchers in this tutorial.
Asynchronous Hyperband (ASHA)
If HyperbandScheduler is configured with a random searcher, we obtain ASHA, as proposed in A System for Massively Parallel Hyperparameter Tuning. More details are provided here.
Nothing much can be configured via search_options in this case. The arguments are the same as for random search with FIFOScheduler.
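A sketch of an equivalent ASHA setup via the wrapper class, assuming config_space and the benchmark object from the launcher scripts above:

from syne_tune.optimizer.baselines import ASHA

scheduler = ASHA(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    type="stopping",  # or "promotion"
    grace_period=1,
    reduction_factor=3,
)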
Model-based Asynchronous Hyperband (MOBSTER)
If HyperbandScheduler is configured with a Bayesian optimization searcher, we obtain MOBSTER, as proposed in Model-based Asynchronous Hyperparameter and Neural Architecture Search. By default, MOBSTER uses a multi-task Gaussian process surrogate model for metrics data observed at all resource levels. More details are provided here.
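And the same via the MOBSTER wrapper class; search_options is passed through to the GP-based searcher as in the Bayesian optimization section above (again assuming config_space, benchmark, and n_workers from the launcher scripts):

from syne_tune.optimizer.baselines import MOBSTER

scheduler = MOBSTER(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    search_options={"num_init_random": n_workers + 2},
)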
Recommendations
Finally, we provide some general recommendations on how to use our built-in schedulers.
- If you can afford it for your problem, random search is a useful baseline (RandomSearch). However, if even a single full evaluation takes a long time, try ASHA (ASHA) instead. The default for ASHA is type="stopping", but you should consider type="promotion" as well (more details on this choice are given here).
- Use these baseline runs to get an idea how long your experiment needs to run. It is recommended to use a stopping criterion of the form stop_criterion=StoppingCriterion(max_wallclock_time=X), so that the experiment is stopped after X seconds.
- If your tuning problem comes with an obvious resource parameter, make sure to implement it such that results are reported during the evaluation, not only at the end. When training a neural network model, choose the number of epochs as the resource. In other situations, choosing a resource parameter may be more difficult. Our schedulers require positive integers. Make sure that evaluations for the same configuration scale linearly in the resource parameter: an evaluation up to 2 * r should be roughly twice as expensive as one up to r.
- If your problem has a resource parameter, always make sure to try HyperbandScheduler, which in many cases runs much faster than FIFOScheduler.
- If you end up tuning the same ML algorithm or neural network model on different datasets, make sure to set points_to_evaluate appropriately. If the model comes from frequently used open source code, its built-in defaults will be a good choice. Any hyperparameter not covered in points_to_evaluate is set using a midpoint heuristic. While still better than choosing the first configuration at random, this may not be very good.
- In general, the defaults should work well if your tuning problem is expensive enough (at least a minute per unit of r). In such cases, MOBSTER (MOBSTER) can outperform ASHA substantially. However, if your problem is cheap, so you can afford a lot of evaluations, the searchers based on GP surrogate models may end up expensive. In fact, once the number of evaluations surpasses a certain threshold, the data is filtered down before fitting the surrogate model (see here). You can adjust this threshold or change opt_skip_period in order to speed up MOBSTER, as shown in the sketch after this list.
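As a final sketch of the last recommendation, both the filtering threshold and the refitting frequency can be set via search_options. The values below are illustrative, not tuned recommendations, and config_space and benchmark are taken from the launcher scripts above:

from syne_tune.optimizer.baselines import MOBSTER

search_options = {
    "max_size_data_for_model": 300,  # filter data down earlier than the default
    "opt_skip_init_length": 200,     # start skipping GP refits beyond this data size
    "opt_skip_period": 5,            # then refit GP hyperparameters every 5 rounds only
}
scheduler = MOBSTER(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    search_options=search_options,
)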