Using the Built-in Schedulers
In this tutorial, you will learn how to use and configure the most important built-in HPO algorithms. Alternatively, you can also use most algorithms from Ray Tune.
This tutorial provides a walkthrough of some of the topics addressed here.
Schedulers and Searchers
The decision-making algorithms driving an HPO experiment are referred to as schedulers. As in Ray Tune, some of our schedulers are internally configured by a searcher. A scheduler interacts with the backend, making decisions on which configuration to evaluate next, and whether to stop, pause or resume existing trials. It relays "next configuration" decisions to the searcher. Some searchers maintain a surrogate model which is fitted to metric data coming from evaluations.
Note
There are two ways to create many of the schedulers of Syne Tune:

- Import a wrapper class from syne_tune.optimizer.baselines, for example RandomSearch for random search
- Use the template classes FIFOScheduler or HyperbandScheduler together with the searcher argument, for example FIFOScheduler with searcher="random" for random search
Importing from syne_tune.optimizer.baselines
is often simpler. However,
in this tutorial, we will use the template classes in order to expose the
common structure and to explain arguments only once.
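As a quick illustration (not part of the launcher scripts below), here is a minimal sketch of the two equivalent ways to obtain random search. The config_space, metric name, and mode are made up for this example only:

from syne_tune.config_space import loguniform, randint
from syne_tune.optimizer.baselines import RandomSearch
from syne_tune.optimizer.schedulers import FIFOScheduler

# Toy search space, invented for this illustration only
config_space = {
    "learning_rate": loguniform(1e-4, 1e-1),
    "batch_size": randint(8, 128),
    "epochs": 27,  # fixed parameter, passed through to the training script
}

# Variant 1: wrapper class from syne_tune.optimizer.baselines
scheduler_1 = RandomSearch(config_space, metric="accuracy", mode="max")

# Variant 2: template class plus searcher argument; equivalent to variant 1
scheduler_2 = FIFOScheduler(
    config_space, searcher="random", metric="accuracy", mode="max"
)

Both variants behave the same; which one you use is a matter of taste.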
FIFOScheduler
This is the simplest kind of scheduler. It cannot stop or pause trials; each evaluation proceeds to the end. Depending on the searcher, this scheduler supports:
- Random search [searcher="random"]
- Bayesian optimization with Gaussian processes [searcher="bayesopt"]
- Grid search [searcher="grid"]
- TPE with kernel density estimators [searcher="kde"]
- Constrained Bayesian optimization [searcher="bayesopt_constrained"]
- Cost-aware Bayesian optimization [searcher="bayesopt_cost"]
- BORE [searcher="bore"]
We will only consider the first two searchers in this tutorial. Here is a
launcher script using FIFOScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import FIFOScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.benchmark_definitions import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4
    max_wallclock_time = 120
    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' object contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark()
    config_space = benchmark.config_space
    backend = LocalBackend(entry_point=benchmark.script)
    # GP-based Bayesian optimization searcher. Many options can be specified
    # via `search_options`, but let's use the defaults
    searcher = "bayesopt"
    search_options = {'num_init_random': n_workers + 2}
    scheduler = FIFOScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        mode=benchmark.mode,
        metric=benchmark.metric,
    )
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(
            max_wallclock_time=max_wallclock_time
        ),
        n_workers=n_workers,
    )
    tuner.run()
What happens in this launcher script?
- We select the mlp_fashionmnist benchmark, adopting its default hyperparameter search space without modifications.
- We select the local backend, which runs up to n_workers = 4 processes in parallel on the same instance.
- We create a FIFOScheduler with searcher="bayesopt". This means that new configurations to be evaluated are selected by Bayesian optimization, and all trials are run to the end. The scheduler needs to know the config_space, the name of the metric to tune (metric), and whether to minimize or maximize this metric (mode). For mlp_fashionmnist, we have metric = "accuracy" and mode = "max", so we select a configuration which maximizes accuracy.
- Options for the searcher can be passed via search_options. We use defaults, except for changing num_init_random (see below) to the number of workers plus two.
- Finally, we create the tuner, passing trial_backend, scheduler, as well as the stopping criterion for the experiment (stop after 120 seconds) and the number of workers. The experiment is started by tuner.run().
FIFOScheduler provides the full range of arguments. Here, we list the most important ones:

- config_space: Hyperparameter search space. This argument is mandatory. Apart from hyperparameters to be searched over, the space may contain fixed parameters (such as epochs in the example above). A config passed to the training script is always extended by these fixed parameters. If you use a benchmark, you can use benchmark.config_space here, or you can modify this default search space.
- searcher: Selects the searcher to be used (see below).
- search_options: Options to configure the searcher (see below).
- metric, mode: Name of the metric to tune (i.e., the key used in the report call by the training script), which is either to be minimized (mode="min") or maximized (mode="max"). If you use a benchmark, just use benchmark.metric and benchmark.mode here.
- points_to_evaluate: Allows you to specify a list of configurations which are evaluated first. If your training code corresponds to some open source ML algorithm, you may want to use the defaults provided in the code. The entry (or entries) in points_to_evaluate do not have to specify values for all hyperparameters. For any hyperparameter not listed there, the following rule is used to choose a default. For float and int value types, the midpoint of the search range is used (in linear or log scaling). For categorical value types, the first entry in the value set is used. The default is a single config with all values chosen by the default rule. Pass an empty list in order not to specify any initial configs. See the sketch after this list for an example.
- random_seed: Master random seed. Random sampling in schedulers and searchers is done by a number of numpy.random.RandomState generators, whose seeds are derived from random_seed. If not given, a random seed is sampled and printed in the log.
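To make points_to_evaluate and random_seed concrete, here is a hedged sketch reusing config_space and benchmark from the launcher script above. The "learning_rate" entry is only an illustration of a hyperparameter name and may differ from the actual benchmark search space:

from syne_tune.optimizer.schedulers import FIFOScheduler

scheduler = FIFOScheduler(
    config_space,
    searcher="bayesopt",
    metric=benchmark.metric,
    mode=benchmark.mode,
    # Evaluate this configuration first; any hyperparameter not listed here
    # is filled in by the midpoint rule described above
    points_to_evaluate=[{"learning_rate": 1e-3}],
    # Seeds of all internal random generators are derived from this value
    random_seed=31415927,
)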
Random Search
The simplest HPO baseline is random search, which you obtain with searcher="random", or by using RandomSearch instead of FIFOScheduler. Search decisions are not based on past data; a new configuration is chosen by sampling attribute values at random, from distributions specified in config_space. These distributions are detailed here.
If points_to_evaluate is specified, configurations are first taken from this list before any are drawn at random. Options for configuring the searcher are given in search_options. These are:

- debug_log: If True, a useful log output about the search progress is printed.
- allow_duplicates: If True, the same configuration may be suggested more than once. The default is False, in which case sampling is without replacement. Both options are shown in the sketch below.
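A minimal sketch of random search with these two options set explicitly, again reusing config_space and benchmark from the launcher script above:

from syne_tune.optimizer.schedulers import FIFOScheduler

scheduler = FIFOScheduler(
    config_space,
    searcher="random",
    # Verbose search log, and sampling with replacement
    search_options={"debug_log": True, "allow_duplicates": True},
    metric=benchmark.metric,
    mode=benchmark.mode,
)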
Bayesian Optimization
Bayesian optimization is obtained by searcher="bayesopt", or by using BayesianOptimization instead of FIFOScheduler. More information about Bayesian optimization is provided here.
Options for configuring the searcher are given in search_options. These include the options for the random searcher. GPFIFOSearcher provides the full range of arguments. We list the most important ones:
- num_init_random: Number of initial configurations chosen at random (or via points_to_evaluate). In fact, the number of initial configurations is the maximum of this and the length of points_to_evaluate. Afterwards, configurations are chosen by Bayesian optimization (BO). In general, BO is only used once at least one metric value from past trials is available. We recommend setting this value to the number of workers plus two.
- opt_nstarts, opt_maxiter: BO employs a Gaussian process surrogate model, whose own hyperparameters (e.g., kernel parameters, noise variance) are chosen by empirical Bayesian optimization. In general, this is done whenever new data becomes available. It is the most expensive computation in each round. opt_maxiter is the maximum number of L-BFGS iterations. We run opt_nstarts such optimizations from random starting points and pick the best.
- max_size_data_for_model, max_size_top_fraction: GP computations scale cubically with the number of observations, and decision making can become very slow for too many trials. Whenever there are more than max_size_data_for_model observations, the dataset is downsampled to this size. Here, max_size_data_for_model * max_size_top_fraction of the entries correspond to the cases with the best metric values, while the remaining entries are drawn at random (without replacement) from all other cases. Defaults to DEFAULT_MAX_SIZE_DATA_FOR_MODEL.
- opt_skip_init_length, opt_skip_period: Refitting the GP hyperparameters in each round can become expensive, especially when the number of observations grows large. If so, you can choose to do it only every opt_skip_period rounds. Skipping optimizations is done only once the number of observations is above opt_skip_init_length.
- gp_base_kernel: Selects the covariance (or kernel) function to be used in the surrogate model. Current choices are "matern52-ard" (Matern 5/2 with automatic relevance determination; the default) and "matern52-noard" (Matern 5/2 without ARD).
- acq_function: Selects the acquisition function to be used. Current choices are "ei" (negative expected improvement; the default) and "lcb" (lower confidence bound). The latter has the form \(\mu(x) - \kappa \sigma(x)\), where \(\mu(x)\), \(\sigma(x)\) are predictive mean and standard deviation, and \(\kappa > 0\) is a parameter, which can be passed via acq_function_kwargs={"kappa": 0.5} for \(\kappa = 0.5\).
- input_warping: If this is True, inputs are warped before being fed into the covariance function; the effective kernel becomes \(k(w(x), w(x'))\), where \(w(x)\) is a warping transform with two non-negative parameters per component. These parameters are learned along with other parameters of the surrogate model. Input warping allows the surrogate model to represent non-stationary functions, while still keeping the number of parameters small. Note that only those components of \(x\) which belong to non-categorical hyperparameters are warped.
- boxcox_transform: If this is True, target values are transformed before being fitted with a Gaussian marginal likelihood. This is using the Box-Cox transform with a parameter \(\lambda\), which is learned alongside other parameters of the surrogate model. The transform is \(\log y\) for \(\lambda = 0\), and \(y - 1\) for \(\lambda = 1\). This option requires the targets to be positive.

Several of these options are combined in the sketch below.
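The following sketch sets several of these options at once; the particular values are illustrative only, not tuned recommendations, and config_space, benchmark, and n_workers are taken from the launcher script above:

from syne_tune.optimizer.schedulers import FIFOScheduler

search_options = {
    "num_init_random": n_workers + 2,   # initial configurations chosen at random
    "opt_nstarts": 2,                   # restarts for GP hyperparameter fitting
    "opt_maxiter": 50,                  # maximum L-BFGS iterations per restart
    "opt_skip_init_length": 100,        # start skipping refits beyond this data size
    "opt_skip_period": 3,               # then refit GP hyperparameters every 3 rounds
    "gp_base_kernel": "matern52-ard",   # Matern 5/2 kernel with ARD (the default)
    "acq_function": "lcb",              # lower confidence bound acquisition
    "acq_function_kwargs": {"kappa": 0.5},
    "input_warping": True,              # warp inputs to model non-stationary functions
}
scheduler = FIFOScheduler(
    config_space,
    searcher="bayesopt",
    search_options=search_options,
    metric=benchmark.metric,
    mode=benchmark.mode,
)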
HyperbandScheduler
This scheduler comes in at least two different variants: one may stop trials early (type="stopping"), the other may pause trials and resume them later (type="promotion"). For tuning neural network models, it tends to work much better than FIFOScheduler. You may have read about successive halving and Hyperband before. Chances are you read about synchronous scheduling of parallel evaluations, while both HyperbandScheduler and FIFOScheduler implement asynchronous scheduling, which can be substantially more efficient. This tutorial provides details about synchronous and asynchronous variants of successive halving and Hyperband.
Here is a launcher script using HyperbandScheduler:
import logging
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.schedulers import HyperbandScheduler
from syne_tune import Tuner, StoppingCriterion
from benchmarking.benchmark_definitions import \
    mlp_fashionmnist_benchmark

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    n_workers = 4
    max_wallclock_time = 120
    # We pick the MLP on FashionMNIST benchmark
    # The 'benchmark' object contains arguments needed by scheduler and
    # searcher (e.g., 'mode', 'metric'), along with suggested default values
    # for other arguments (which you are free to override)
    benchmark = mlp_fashionmnist_benchmark()
    config_space = benchmark.config_space
    backend = LocalBackend(entry_point=benchmark.script)
    # MOBSTER: Combination of asynchronous successive halving with
    # GP-based Bayesian optimization
    searcher = 'bayesopt'
    search_options = {'num_init_random': n_workers + 2}
    scheduler = HyperbandScheduler(
        config_space,
        searcher=searcher,
        search_options=search_options,
        type="stopping",
        max_resource_attr=benchmark.max_resource_attr,
        resource_attr=benchmark.resource_attr,
        mode=benchmark.mode,
        metric=benchmark.metric,
        grace_period=1,
        reduction_factor=3,
    )
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(
            max_wallclock_time=max_wallclock_time
        ),
        n_workers=n_workers,
    )
    tuner.run()
Much of this launcher script is the same as for FIFOScheduler, but HyperbandScheduler comes with a number of extra arguments we will explain in the sequel (type, max_resource_attr, grace_period, reduction_factor, resource_attr). The mlp_fashionmnist benchmark trains a two-layer MLP on FashionMNIST (more details are here). The accuracy is computed and reported at the end of each epoch:
for epoch in range(resume_from + 1, config['epochs'] + 1):
    train_model(config, state, train_loader)
    accuracy = validate_model(config, state, valid_loader)
    report(epoch=epoch, accuracy=accuracy)
While metric="accuracy" is the criterion to be optimized, resource_attr="epoch" is the resource attribute. In the schedulers discussed here, the resource attribute must be a positive integer.
HyperbandScheduler maintains reported metrics for all trials at certain rung levels (levels of the resource attribute epoch at which scheduling decisions are done). When a trial reports (epoch, accuracy) for a rung level == epoch, the scheduler decides whether to stop (pause) the trial or let it continue. This decision is based on all accuracy values encountered before at the same rung level. Whenever a trial is stopped (or paused), the executing worker becomes available to evaluate a different configuration.
Rung level spacing and stop/go decisions are determined by the parameters max_resource_attr, grace_period, and reduction_factor. The first is the name of the attribute in config_space which contains the maximum number of epochs to train (max_resource_attr == "epochs" in our benchmark). This allows the training script to obtain max_resource_value = config[max_resource_attr] (i.e., config["epochs"] here). Rung levels are \(r_{min}, r_{min} \eta, r_{min} \eta^2, \dots, r_{max}\), where \(r_{min}\) is grace_period, \(\eta\) is reduction_factor, and \(r_{max}\) is max_resource_value. In the example above, max_resource_value = 81, grace_period = 1, and reduction_factor = 3, so that rung levels are 1, 3, 9, 27, 81. The spacing is such that stop/go decisions are done less frequently for trials which already went further: they have earned trust by not being stopped earlier. \(r_{max}\) need not be of the form \(r_{min} \eta^k\). If max_resource_value = 56 in the example above, the rung levels would be 1, 3, 9, 27, 56.
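This spacing rule can be reproduced with a few lines of Python. This is not Syne Tune's internal code, just a sketch of the rule stated above:

def rung_levels(grace_period, reduction_factor, max_resource_value):
    # Geometric spacing grace_period * reduction_factor^k, capped at the maximum resource
    levels, r = [], grace_period
    while r < max_resource_value:
        levels.append(r)
        r *= reduction_factor
    levels.append(max_resource_value)
    return levels

print(rung_levels(1, 3, 81))  # [1, 3, 9, 27, 81]
print(rung_levels(1, 3, 56))  # [1, 3, 9, 27, 56]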
Given such a rung level spacing, stop/go decisions are made by comparing accuracy to the 1 / reduction_factor quantile of values recorded at the rung level. In the example above, our trial is stopped if accuracy is no better than the best 1/3 of previous values (the list includes the current accuracy value); otherwise it is allowed to continue.
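The following sketch spells out this rule for a metric that is maximized (such as accuracy). It is only an illustration of the idea; Syne Tune's exact quantile and tie-breaking conventions may differ:

import numpy as np

def trial_continues(new_value, recorded_values, reduction_factor):
    # Continue only if the new value lies in the best 1/reduction_factor
    # fraction of all values recorded at this rung (including the new one)
    values = np.array(list(recorded_values) + [new_value])
    threshold = np.quantile(values, 1.0 - 1.0 / reduction_factor)
    return new_value >= threshold

print(trial_continues(0.65, [0.9, 0.8, 0.7, 0.6, 0.5], reduction_factor=3))  # False
print(trial_continues(0.95, [0.9, 0.8, 0.7, 0.6, 0.5], reduction_factor=3))  # True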
Further details about HyperbandScheduler and multi-fidelity HPO methods are given in this tutorial.
HyperbandScheduler provides the full range of arguments. Here, we list the most important ones:
- max_resource_attr, grace_period, reduction_factor: As detailed above, these determine the rung levels and the stop/go decisions. The resource attribute is a positive integer. We need reduction_factor >= 2. Note that instead of max_resource_attr, you can also use max_t, as detailed here.
- rung_increment: This parameter can be used instead of reduction_factor (the latter takes precedence). In this case, rung levels are spaced linearly: \(r_{min} + j \nu, j = 0, 1, 2, \dots\), where \(\nu\) is rung_increment. The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels.
- rung_levels: Alternatively, the user can specify the list of rung levels directly (positive integers, strictly increasing). The stop/go rule in the successive halving scheduler is set based on the ratio of successive rung levels. A sketch with user-specified rung levels follows after this list.
- type: The most important values are "stopping" and "promotion" (see above).
- brackets: Number of brackets to be used in Hyperband. More details are found here. The default is 1 (successive halving).
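For illustration, here is a hedged sketch with user-specified rung levels instead of the geometric spacing; the particular levels are made up (positive, strictly increasing integers, kept below the maximum number of epochs), and config_space and benchmark come from the launcher script above:

from syne_tune.optimizer.schedulers import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    searcher="random",
    type="promotion",
    # Replaces the grace_period / reduction_factor spacing
    rung_levels=[1, 2, 4, 8, 16, 32, 64],
    max_resource_attr=benchmark.max_resource_attr,
    resource_attr=benchmark.resource_attr,
    metric=benchmark.metric,
    mode=benchmark.mode,
)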
Depending on the searcher, this scheduler supports:
- Asynchronous successive halving (ASHA) [searcher="random"]
- MOBSTER [searcher="bayesopt"]
- Asynchronous BOHB [searcher="kde"]
- Hyper-Tune [searcher="hypertune"]
- Cost-aware Bayesian optimization [searcher="bayesopt_cost"]
- BORE [searcher="bore"]
- DyHPO [searcher="dyhpo", type="dyhpo"]
We will only consider the first two searchers in this tutorial.
Asynchronous Hyperband (ASHA)
If HyperbandScheduler is configured with a random searcher, we obtain ASHA, as proposed in A System for Massively Parallel Hyperparameter Tuning. More details are provided here.
Nothing much can be configured via search_options in this case. The arguments are the same as for random search with FIFOScheduler.
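A sketch of an equivalent ASHA setup via the wrapper class, assuming config_space and the benchmark object from the launcher scripts above:

from syne_tune.optimizer.baselines import ASHA

scheduler = ASHA(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    type="stopping",  # or "promotion"
    grace_period=1,
    reduction_factor=3,
)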
Model-based Asynchronous Hyperband (MOBSTER)
If HyperbandScheduler is configured with a Bayesian optimization searcher, we obtain MOBSTER, as proposed in Model-based Asynchronous Hyperparameter and Neural Architecture Search. By default, MOBSTER uses a multi-task Gaussian process surrogate model for metrics data observed at all resource levels. More details are provided here.
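And the same via the MOBSTER wrapper class; search_options is passed through to the GP-based searcher as in the Bayesian optimization section above (again assuming config_space, benchmark, and n_workers from the launcher scripts):

from syne_tune.optimizer.baselines import MOBSTER

scheduler = MOBSTER(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    search_options={"num_init_random": n_workers + 2},
)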
Recommendations
Finally, we provide some general recommendations on how to use our built-in schedulers.
- If you can afford it for your problem, random search is a useful baseline (RandomSearch). However, if even a single full evaluation takes a long time, try ASHA (ASHA) instead. The default for ASHA is type="stopping", but you should consider type="promotion" as well (more details on this choice are given here).
- Use these baseline runs to get an idea how long your experiment needs to run. It is recommended to use a stopping criterion of the form stop_criterion=StoppingCriterion(max_wallclock_time=X), so that the experiment is stopped after X seconds.
- If your tuning problem comes with an obvious resource parameter, make sure to implement it such that results are reported during the evaluation, not only at the end. When training a neural network model, choose the number of epochs as the resource. In other situations, choosing a resource parameter may be more difficult. Our schedulers require positive integers. Make sure that evaluations for the same configuration scale linearly in the resource parameter: an evaluation up to 2 * r should be roughly twice as expensive as one up to r.
- If your problem has a resource parameter, always make sure to try HyperbandScheduler, which in many cases runs much faster than FIFOScheduler.
- If you end up tuning the same ML algorithm or neural network model on different datasets, make sure to set points_to_evaluate appropriately. If the model comes from frequently used open source code, its built-in defaults will be a good choice. Any hyperparameter not covered in points_to_evaluate is set using a midpoint heuristic. While still better than choosing the first configuration at random, this may not be very good.
- In general, the defaults should work well if your tuning problem is expensive enough (at least a minute per unit of r). In such cases, MOBSTER (MOBSTER) can outperform ASHA substantially. However, if your problem is cheap, so you can afford a lot of evaluations, the searchers based on GP surrogate models may end up expensive. In fact, once the number of evaluations surpasses a certain threshold, the data is filtered down before fitting the surrogate model (see here). You can adjust this threshold or change opt_skip_period in order to speed up MOBSTER, as shown in the sketch after this list.
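As a final sketch of the last recommendation, both the filtering threshold and the refitting frequency can be set via search_options. The values below are illustrative, not tuned recommendations, and config_space and benchmark are taken from the launcher scripts above:

from syne_tune.optimizer.baselines import MOBSTER

search_options = {
    "max_size_data_for_model": 300,  # filter data down earlier than the default
    "opt_skip_init_length": 200,     # start skipping GP refits beyond this data size
    "opt_skip_period": 5,            # then refit GP hyperparameters every 5 rounds only
}
scheduler = MOBSTER(
    config_space,
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=benchmark.resource_attr,
    max_resource_attr=benchmark.max_resource_attr,
    search_options=search_options,
)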