Model-based Asynchronous Hyperband
We have seen that asynchronous decision-making tends to outperform synchronous variants in practice, and model-based extensions of the latter can outperform random sampling of new configurations. In this section, we discuss combinations of Bayesian optimization with asynchronous decision-making, leading to the currently best performing multi-fidelity methods in Syne Tune.
All examples here can either be run in stopping or promotion mode of ASHA. We will use the promotion mode here (i.e., pause-and-resume scheduling).
Surrogate Models of Learning Curves
Recall that validation error after \(r\) epochs is denoted by \(f(\mathbf{x}, r)\), with \(\mathbf{x}\) the configuration. Here, \(r\mapsto f(\mathbf{x}, r)\) is called learning curve. A learning curve surrogate model predicts \(f(\mathbf{x}, r)\) from observed data. A difficult requirement in the context of multi-fidelity HPO is that observations are much more abundant at smaller resource levels \(r\), while predictions are more valuable at larger \(r\).
In the context of Gaussian process based Bayesian optimization, Syne Tune supports a number of different learning curve surrogate models. The type of model is selected upon construction of the scheduler:
scheduler = MOBSTER(
config_space,
type="promotion",
search_options=dict(
model="gp_multitask",
gp_resource_kernel="exp-decay-sum",
),
metric=benchmark.metric,
mode=benchmark.mode,
resource_attr=resource_attr,
random_seed=random_seed,
max_resource_attr=max_resource_attr,
)
Here, options configuring the searcher are collected in search_options
. The
most important options are model
, selecting the type of surrogate model,
and gp_resource_kernel
selecting the covariance model in the case
model="gp_multitask"
.
Independent Processes at each Rung Level
A simple learning curve surrogate model is obtained by
search_options["model"] = "gp_independent"
. Here, \(f(\mathbf{x}, r)\)
at each rung level \(r\) is represented by an independent Gaussian process
model. The models have individual constant mean functions
\(\mu_r(\mathbf{x}) = \mu_r\) and covariance functions
\(k_r(\mathbf{x}, \mathbf{x}') = c_r k(\mathbf{x}, \mathbf{x}')\),
where \(k(\mathbf{x}, \mathbf{x}')\) is a Matern-5/2 ARD kernel without
variance parameter, which is shared between the models, and the \(c_r > 0\)
are individual variance parameters. The idea is that while validation errors at
different rung levels may be scaled and shifted, they should still exhibit
similar dependencies on the hyperparameters. The noise variance \(\sigma^2\)
used in the Gaussian likelihood is the same across all data. However, if
search_options["separate_noise_variances"] = True
, different noise
variances \(\sigma_r^2\) are used for data at different rung levels.
Multi-Task Gaussian Process Models
A more advanced set of learning curve surrogate models is obtained by
search_options["model"] = "gp_multitask"
(which is the default for
asynchronous MOBSTER). In this case, a single Gaussian process model
represents \(f(\mathbf{x}, r)\) directly, with mean function
\(\mu(\mathbf{x}, r)\) and covariance function
\(k((\mathbf{x}, r), (\mathbf{x}', r'))\). The GP model is selected by
search_options["gp_resource_kernel"]
, currently supported options are
"exp-decay-sum"
, "exp-decay-combined"
, "exp-decay-delta1"
,
"freeze-thaw"
, "matern52"
, "matern52-res-warp"
,
"cross-validation"
. The default choice is "exp-decay-sum"
, which is
inspired by the exponential decay model proposed
here. Details about these different
models are given here and in
the source code.
Decision-making is somewhat more expensive with "gp_multitask"
than with
"gp_independent"
, because the notorious cubic scaling of GP inference
applies over observations made at all rung levels. However, the extra cost is
limited by the fact that most observations by far are made at the lowest
resource level \(r_{min}\) anyway.
Additive Gaussian Models
Two additional models are selected by
search_options["model"] = "gp_expdecay"
and
search_options["model"] = "gp_issm"
. The former is the exponential
decay model proposed here, the latter is
a variant thereof. These additive Gaussian models represent dependencies across
\(r\) in a cheaper way than in "gp_multitask"
, and they can be fit to
all observed data, not just at rung levels. Also, joint sampling is cheap.
However, at this point, additive Gaussian models remain experimental, and they will not be further discussed here. They can be used with MOBSTER, but not with Hyper-Tune.
Asynchronous MOBSTER
MOBSTER combines ASHA and
asynchronous Hyperband with GP-based Bayesian optimization. A Gaussian process
learning curve surrogate model is fit to the data at all rung levels, and
posterior predictive distributions are used in order to compute acquisition
function values and decide on which configuration to start next. We distinguish
between MOBSTER-JOINT with a GP multi-task model ("gp_multitask"
) and
MOBSTER-INDEP with an independent GP model ("gp_independent"
), as detailed
above. The acquisition function is expected improvement (EI) at the rung level
\(r_{acq}\) also used by BOHB.
Our launcher script runs (asynchronous)
MOBSTER-JOINT if method="MOBSTER-JOINT"
. The searcher can be configured
with search_options
, but MOBSTER-JOINT with the "exp-decay-sum"
covariance model is the default.
API docs:
Baseline:
MOBSTER
Additional arguments:
HyperbandScheduler
search_options
:GPMultiFidelitySearcher
As shown below, MOBSTER can outperform ASHA significantly. This is achieved by starting many less trials that stop very early (after 1 epoch) due to poor performance. Essentially, MOBSTER rapidly learns some important properties about the NASBench-201 problem and avoids basic mistakes which random sampling of configurations runs into at a constant rate. While ASHA stops such poor trials early, they still take away resources, which MOBSTER can spend on longer evaluations of more promising configurations. This advantage of model-based over random sampling based multi-fidelity methods is even more pronounced when starting and stopping jobs comes with delays. Such delays are typically present in real world distributed systems, but are absent in our simulations.
Different to BOHB, MOBSTER takes into account pending evaluations, i.e. trials which have been started but did not return metric values yet. This is done by integrating out their metric values by Monte Carlo. Namely, we draw a certain number of joint samples over pending targets and average the acquisition function over these. In the context of multi-fidelity, if a trial is running, a pending evaluation is registered for the next recent rung level it will reach.
Why is the surrogate model in MOBSTER-JOINT fit to the data at rung levels
only? After all, training scripts tend to report validation errors after each
epoch, why not use all this data? Syne Tune allows to do so (for the
"gp_multitask"
model), by passing searcher_data="all"
when creating
the HyperbandScheduler
(another
intermediate is searcher_data="rungs_and_last"
). However, while this may
lead to a more accurate model, it also becomes more expensive to fit, and does
not tend to make a difference, so the default searcher_data="rungs"
is
recommended.
Finally, we can also combine ASHA with
BOHB decision-making, by choosing
searcher="kde"
in
HyperbandScheduler
. This is an
asynchronous version of BOHB.
MOBSTER-INDEP
Our launcher script runs
(asynchronous) MOBSTER-INDEP if method="MOBSTER-INDEP"
. The independent
GPs model is selected by search_options["model"] = "gp_independent"
.
MOBSTER tends to perform slightly better with a joint multi-task GP model than
with an independent GPs model, justifying the Syne Tune default. In our
experience so far, changing the covariance model in MOBSTER-JOINT has only
marginal impact.
API docs:
Baseline:
MOBSTER
Additional arguments:
HyperbandScheduler
search_options
:GPMultiFidelitySearcher
(here, we usesearch_options["model"] = "gp_independent"
)
MOBSTER and Hyperband
Just like ASHA can be run with multiple brackets,
so can MOBSTER, simply by selecting brackets
when creating
HyperbandScheduler
. In our experience so
far, just like with ASHA, MOBSTER tends to work best with a single bracket.
Controlling MOBSTER Computations
MOBSTER often outperforms ASHA substantially. However, when applied to a problem where many evaluations can be done, fitting the GP surrogate model to all observed data can become slow. In fact, Gaussian process inference scales cubically in the number of observations. The amount of computation spent by MOBSTER can be controlled:
Setting the limit
max_size_data_for_model
: Once the total number of observations is above this limit, the data is sampled down to this size. This is done in a way which retains all observations from trials which reached higher rung levels, while data from trials stopped early are more likely to be removed. This down sampling is redone every time the surrogate model is fit, so that new data (especially at higher rungs) is taken into account. Also, scheduling decisions about stopping, pausing, or promoting trials are always done based on all data.The default value for
max_size_data_for_model
isDEFAULT_MAX_SIZE_DATA_FOR_MODEL
. It can be changed by passingsearch_options = {"max_size_data_for_model": XYZ}
when creating the MOBSTER scheduler. You can switch off the limit mechanism by passingNone
or a very large value. As the current default value is on the smaller end, to ensure fast computations, you may want to experiment with larger values as well.Parameters
opt_skip_init_length
,opt_skip_period
: When fitting the GP surrogate model, the most expensive computation by far is refitting its own parameters, such as kernel parameters. The frequency of this computation can be regulated, as detailed here.
Hyper-Tune
Hyper-Tune is a model-based extension of
ASHA with some additional features compared to MOBSTER. It can be seen as
extending MOBSTER-INDEP (with the "gp_independent"
surrogate model) in two
ways. First, it uses an acquisition function based on an ensemble predictive
distribution, while MOBSTER relies on the \(r_{acq}\) heuristic from BOHB.
Second, if multiple brackets are used (Hyperband case), Hyper-Tune offers an
adaptive mechanism to sample the bracket for a new trial. Both extensions are
based on a quantification of consistency of data on different rung levels, which
is used to weight rung levels according to their reliability for making
decisions (namely, which configuration \(\mathbf{x}\) and bracket
\(r_{min}\) to associate with a new trial).
Our launcher script runs Hyper-Tune
if method="HYPERTUNE-INDEP"
. The searcher can be configured with
search_options
, but the independent GPs model "gp_independent"
is the
default. In this example, Hyper-Tune is using a single bracket, so the
difference to MOBSTER-INDEP is due to the ensemble predictive distribution for
the acquisition function.
Syne Tune also implements Hyper-Tune with the GP multi-task surrogate models
used in MOBSTER. In result plots for this tutorial, original Hyper-Tune is
called HYPERTUNE-INDEP, while this latter variant is called HYPERTUNE-JOINT.
Our launcher script runs this variant
if method="HYPERTUNE-JOINT"
.
API docs:
Baseline:
HyperTune
Additional arguments:
HyperbandScheduler
search_options
:HyperTuneSearcher
(search_options["model"] = "gp_independent"
by default, but HYPERTUNE-JOINT is using"gp_multitask"
)
Finally, computations of Hyper-Tune can be controlled in the same way as in MOBSTER.
Hyper-Tune with Multiple Brackets
Just like ASHA and MOBSTER, Hyper-Tune can also be run with multiple brackets,
simply by using the brackets
argument of
HyperbandScheduler
. If brackets > 1
,
Hyper-Tune samples the bracket for a new trial from an adaptive distribution
closely related to the ensemble distribution used for acquisitions. Our
launcher script runs Hyper-Tune with 4
brackets if method="HYPERTUNE4-INDEP"
.
Recall that both ASHA and MOBSTER tend to work better for one than for multiple brackets. This may well be due to the fixed, non-adaptive distribution that brackets are sampled from. Ideally, a method would learn over time whether a low rung level tends to be reliable in predicting the ordering at higher ones, or whether it should rather be avoided (and \(r_{min}\) should be increased). This is what the adaptive mechanism in Hyper-Tune tries to do. In our comparisons, we find that HYPERTUNE-INDEP with multiple brackets can outperform MOBSTER-JOINT with a single bracket.
Details
In this section, we provide some details about Hyper-Tune and our implementation. The Hyper-Tune extensions are based on a quantification of consistency of data on different rung levels For example, assume that \(r < r_{*}\) are two rung levels, with sufficiently many points at \(r_{*}\). If \(\mathcal{X}_{*}\) collects trials with data at \(r_{*}\), all these have also been observed at \(r\). Sampling \(f(\mathcal{X}_{*}, r)\) from the posterior distribution of the surrogate model, we can compare the ordering of these predictions at \(r\) with the ordering of observations at \(r_{*}\), using a pair-wise ranking loss. A large loss value means frequent cross-overs of learning curves between \(r\) and \(r_{*}\), and predictions at rung level \(r\) are unreliable when it comes to the ordering of trials \(\mathcal{X}_{*}\) at \(r_{*}\).
At any point during the algorithm, denote by \(r_{*}\) the largest rung
level with a sufficient number of observations (our implementation requires 6
points). Assuming that \(r_{*} > r_{min}\), we can estimate a distribution
\([\theta_r]\) over rung levels \(\mathcal{R}_{*} =
\{r\in\mathcal{R}\, |\, r\le r_{*}\}\) as follows. We draw \(S\) independent
samples from the model at these rung levels. For each sample \(s\), we
compute loss values \(l_{r, s}\) for \((r, r_{*})\) over all
\(r\in\mathcal{R}_{*}\), and determine the argmin
indicator
\([\text{I}_{l_{r, s} = m_s}]\), where
\(m_s = \text{min}(l_{r, s} | r\in\mathcal{R}_{*})\). The distribution
\([\theta_r]\) is obtained as normalized sum of these indicators over
\(s=1,\dots, S\). We also need to compute loss values \(l_{r_{*}, s}\),
this is done using a cross-validation approximation, see
here or the code in
syne_tune.optimizer.schedulers.searchers.bayesopt.gpautograd.hypertune
for details. In the beginning, with too little data at the second rung level,
we use \(\theta_{r_{min}} = 1\) and 0 elsewhere.
Decisions about a new configuration are based on an acquisition function over a predictive distribution indexed by \(\mathbf{x}\) alone. For Hyper-Tune, an ensemble distribution with weighting distribution \([\theta_r]\) is used. Sampling from this distribution works by first sampling \(r\sim [\theta_r]\), then \(f(\mathbf{x}) = f(\mathbf{x}, r)\) from the predictive distribution for that \(r\). This means that models from all rung levels are potentially used, weighted by how reliable they predict the ordering at the highest level \(r_{*}\) supported by data. In our experiments so far, this adaptive weighting can outperform the \(r_{acq}\) heuristic used in BOHB and MOBSTER.
Note that our implementation generalizes Hyper-Tune in that ranking losses and \([\theta_r]\) are estimated once \(r_{*} > r_{min}\) (i.e., once \(r_{*}\) is equal to the second rung level). In the original work, one has to wait until \(r_{*} = r_{max}\), i.e. the maximum rung level is supported by enough data. We find that for many expensive tuning problems, early decision-making can make a large difference, so if the Hyper-Tune extensions provide benefits, they should be used as early during the experiment as possible. For example, in the trial plots for Hyper-Tune shown above, it takes more than 10000 seconds for 6 trials to reach the full 200 epochs, so in the original variant of Hyper-Tune, advanced decision-making only starts when more than half of the experiment is already done.
If Hyper-Tune is used with more than one bracket, the \([\theta_r]\) is also used in order to sample the bracket for a new trial. To this end, we need to determine a distribution \(P(r)\) over all rung levels which feature as \(r_{min}\) in a bracket. In our NASBench-201 example, if Hyper-Tune is run with 5 brackets, the support of \(P(r)\) would be \(\mathcal{S} = \{1, 3, 9, 27, 81\}\). Also, denote the default distribution used in ASHA and MOBSTER by \(P_0(r)\). Let \(r_0 = \text{min}(r_{*}, \text{max}(\mathcal{S}))\). For \(r\in\mathcal{S}\), we define \(P(r) = M \theta_r / r\) for \(r\le r_0\), and \(P(r) = P_0(r)\) for \(r > r_0\), where \(M = \sum_{r\in\mathcal{S}, r\le r_0} P_0(r)\). In other words, we use \(\theta_r / r\) for rung levels supported by data, and the default \(P_0(r)\) elsewhere. Once more, this slightly generalizes Hyper-Tune.
DyHPO
DyHPO is another recent model-based
multi-fidelity method. It is a promotion-based scheduler like the ones below
with type="promotion"
, but differs from MOBSTER and Hyper-Tune in that
promotion decisions are done based on the surrogate model, not on the
quantile-based rule of successive halving. In a nutshell:
Rung levels are equi-spaced: \(\mathcal{R} = \{ r_{min}, r_{min} + \nu, r_{min} + 2 \nu, \dots \}\). If \(r_{min} = \nu\), this means that a trial which is promoted or started from scratch, always runs for \(\nu\) resources, independent of its current rung level.
Once a worker is free, we can either promote a paused trial or start a new one. In DyHPO, all paused trials compete with a number of new configurations for the next \(\nu\) resources to be spent. The scoring criterion is a special version of expected improvement, so depends on the surrogate model.
Different to MOBSTER, the surrogate model is used more frequently. Namely, in MOBSTER, if any trial can be promoted, the surrogate model is not accessed. This means that DyHPO comes with higher decision-making costs, which need to be controlled.
Since scoring trials paused at the highest rung populated so far requires extrapolation in terms of resource \(r\), it cannot be used with
search_options["model"] = "gp_independent"
. The other surrogate models are supported.
Our implementation of DyHPO differs from the published work in a number of important points:
DyHPO uses an advanced surrogate model based on a neural network covariance kernel which is fitted to the current data. Our implementation supports DyHPO with the GP surrogate models detailed above, except for
"gp_independent"
.Our decision rule is different from DyHPO as published, and can be seen as a hybrid between DyHPO and ASHA. Namely, we throw a coin \(\{0, 1\}\) with probability \(P_1\) being configurable as
probability_sh
. If this gives 1, we try to promote a trial using the ASHA rule based on quantiles. Here, the quantile thresholds are adjusted to the linear spacing of rung levels. If no trial can be promoted this way, we fall back to the DyHPO rule. If the coin comes up 0, we use the DyHPO rule. The algorithm as published is obtained for \(P_1 = 0\). However, we find that a non-zeroprobability_sh
is crucial for obtaining robust behaviour, since the original DyHPO rule on its own tends to start too many trials at the beginning before promoting any paused ones.Since in DyHPO, the surrogate model is used more frequently than in MOBSTER, it is important to control surrogate model computations, as detailed above. Apart from the default for
max_size_data_for_model
, we also useopt_skip_period = 3
as default for DyHPO.
API docs:
Baseline:
DyHPO
Additional arguments:
HyperbandScheduler
search_options
:GPMultiFidelitySearcher
(note thatsearch_options["model"]
must not be equal to"gp_independent"
)