Model-based Asynchronous Hyperband

We have seen that asynchronous decision-making tends to outperform synchronous variants in practice, and that model-based extensions of synchronous Hyperband can outperform the random sampling of new configurations. In this section, we discuss combinations of Bayesian optimization with asynchronous decision-making, leading to the best-performing multi-fidelity methods currently available in Syne Tune.

All examples here can be run in either the stopping or the promotion mode of ASHA. We will use the promotion mode (i.e., pause-and-resume scheduling).

Surrogate Models of Learning Curves

Recall that the validation error after \(r\) epochs is denoted by \(f(\mathbf{x}, r)\), with \(\mathbf{x}\) the configuration. Here, \(r\mapsto f(\mathbf{x}, r)\) is called a learning curve. A learning curve surrogate model predicts \(f(\mathbf{x}, r)\) from observed data. A particular challenge in multi-fidelity HPO is that observations are much more abundant at smaller resource levels \(r\), while predictions are more valuable at larger \(r\).

In the context of Gaussian process based Bayesian optimization, Syne Tune supports a number of different learning curve surrogate models. The type of model is selected upon construction of the scheduler:

from syne_tune.optimizer.baselines import MOBSTER

# config_space, benchmark, resource_attr, max_resource_attr, and random_seed
# are defined as in the previous sections of this tutorial
scheduler = MOBSTER(
    config_space,
    type="promotion",  # pause-and-resume (promotion) scheduling
    search_options=dict(
        model="gp_multitask",  # type of learning curve surrogate model
        gp_resource_kernel="exp-decay-sum",  # covariance model for "gp_multitask"
    ),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)

Here, options configuring the searcher are collected in search_options. The most important ones are model, which selects the type of surrogate model, and gp_resource_kernel, which selects the covariance model in the case model="gp_multitask".

Independent Processes at each Rung Level

A simple learning curve surrogate model is obtained by search_options["model"] = "gp_independent". Here, \(f(\mathbf{x}, r)\) at each rung level \(r\) is represented by an independent Gaussian process model. The models have individual constant mean functions \(\mu_r(\mathbf{x}) = \mu_r\) and covariance functions \(k_r(\mathbf{x}, \mathbf{x}') = c_r k(\mathbf{x}, \mathbf{x}')\), where \(k(\mathbf{x}, \mathbf{x}')\) is a Matern-5/2 ARD kernel without variance parameter, which is shared between the models, and the \(c_r > 0\) are individual variance parameters. The idea is that while validation errors at different rung levels may be scaled and shifted, they should still exhibit similar dependencies on the hyperparameters. The noise variance \(\sigma^2\) used in the Gaussian likelihood is the same across all data. However, if search_options["separate_noise_variances"] = True, different noise variances \(\sigma_r^2\) are used for data at different rung levels.
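For example, the following variant of the scheduler above selects independent GP models per rung level, with separate noise variances (a minimal sketch, assuming the same benchmark variables as in the example above):

scheduler = MOBSTER(
    config_space,
    type="promotion",
    search_options=dict(
        model="gp_independent",
        separate_noise_variances=True,  # noise variance sigma_r^2 per rung level
    ),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)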

Multi-Task Gaussian Process Models

A more advanced set of learning curve surrogate models is obtained by search_options["model"] = "gp_multitask" (the default for asynchronous MOBSTER). In this case, a single Gaussian process model represents \(f(\mathbf{x}, r)\) directly, with mean function \(\mu(\mathbf{x}, r)\) and covariance function \(k((\mathbf{x}, r), (\mathbf{x}', r'))\). The GP model is selected by search_options["gp_resource_kernel"]; currently supported options are "exp-decay-sum", "exp-decay-combined", "exp-decay-delta1", "freeze-thaw", "matern52", "matern52-res-warp", "cross-validation". The default choice is "exp-decay-sum", which is inspired by the exponential decay model proposed in the freeze-thaw Bayesian optimization work of Swersky et al. (2014). Details about these different models are given in the API docs and in the source code.

Decision-making is somewhat more expensive with "gp_multitask" than with "gp_independent", because the notorious cubic scaling of GP inference applies to the observations made at all rung levels jointly. However, the extra cost is limited by the fact that by far the most observations are made at the lowest resource level \(r_{min}\) anyway.

Additive Gaussian Models

Two additional models are selected by search_options["model"] = "gp_expdecay" and search_options["model"] = "gp_issm". The former is the exponential decay model of Swersky et al. (2014); the latter is a variant thereof. These additive Gaussian models represent dependencies across \(r\) more cheaply than "gp_multitask", they can be fit to all observed data (not just data at rung levels), and joint sampling is cheap.

However, at this point, additive Gaussian models remain experimental, and they will not be further discussed here. They can be used with MOBSTER, but not with Hyper-Tune.

Asynchronous MOBSTER

MOBSTER combines ASHA and asynchronous Hyperband with GP-based Bayesian optimization. A Gaussian process learning curve surrogate model is fit to the data at all rung levels, and posterior predictive distributions are used in order to compute acquisition function values and decide on which configuration to start next. We distinguish between MOBSTER-JOINT with a GP multi-task model ("gp_multitask") and MOBSTER-INDEP with an independent GP model ("gp_independent"), as detailed above. The acquisition function is expected improvement (EI) at the rung level \(r_{acq}\) also used by BOHB.

Our launcher script runs (asynchronous) MOBSTER-JOINT if method="MOBSTER-JOINT". The searcher can be configured with search_options, but MOBSTER-JOINT with the "exp-decay-sum" covariance model is the default.


In our comparisons, MOBSTER can outperform ASHA significantly. This is achieved by starting many fewer trials that are stopped very early (after 1 epoch) due to poor performance. Essentially, MOBSTER rapidly learns some important properties of the NASBench-201 problem and avoids basic mistakes which random sampling of configurations runs into at a constant rate. While ASHA stops such poor trials early, they still take away resources, which MOBSTER can spend on longer evaluations of more promising configurations. This advantage of model-based over random-sampling-based multi-fidelity methods is even more pronounced when starting and stopping jobs comes with delays. Such delays are typically present in real-world distributed systems, but are absent in our simulations.

Unlike BOHB, MOBSTER takes pending evaluations into account, i.e. trials which have been started but have not returned metric values yet. This is done by integrating out their metric values by Monte Carlo: we draw a number of joint samples over the pending targets and average the acquisition function over them. In the multi-fidelity context, if a trial is running, a pending evaluation is registered for the next rung level it will reach.
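To make the fantasizing idea concrete, here is a self-contained NumPy sketch (illustrative only; names and structure are hypothetical and do not reflect Syne Tune internals). Given the joint posterior mean and covariance over pending and candidate points, we draw fantasy values for the pending targets and average expected improvement over them:

import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best):
    # EI for minimization, given marginal means and stddevs
    z = (best - mean) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))

def fantasized_ei(mean, cov, pend, cand, best, num_fantasies=10, rng=None):
    # mean (n,), cov (n, n): joint posterior over pending + candidate points
    # pend, cand: integer index arrays into this joint posterior
    rng = np.random.default_rng() if rng is None else rng
    m_p, m_c = mean[pend], mean[cand]
    k_pp = cov[np.ix_(pend, pend)] + 1e-8 * np.eye(len(pend))
    k_cp = cov[np.ix_(cand, pend)]
    k_cc = cov[np.ix_(cand, cand)]
    a = np.linalg.solve(k_pp, k_cp.T)  # K_pp^{-1} K_pc
    # Conditional covariance does not depend on the fantasized values
    cond_std = np.sqrt(np.clip(np.diag(k_cc - k_cp @ a), 1e-12, None))
    # Draw joint fantasy samples for the pending targets and average EI
    fantasies = rng.multivariate_normal(m_p, k_pp, size=num_fantasies)
    acq = np.zeros(len(cand))
    for y_p in fantasies:
        cond_mean = m_c + a.T @ (y_p - m_p)
        acq += expected_improvement(cond_mean, cond_std, best)
    return acq / num_fantasies

For simplicity, the incumbent best is kept fixed across fantasies here; a more careful treatment would update it per sample.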

Why is the surrogate model in MOBSTER-JOINT fit to the data at rung levels only? After all, training scripts tend to report validation errors after each epoch, so why not use all of this data? Syne Tune allows you to do so (for the "gp_multitask" model) by passing searcher_data="all" when creating the HyperbandScheduler (an intermediate choice is searcher_data="rungs_and_last"). However, while this may lead to a more accurate model, it also becomes more expensive to fit, and it does not tend to make a difference in practice, so the default searcher_data="rungs" is recommended.
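For instance (a sketch, assuming the benchmark variables from the examples above), MOBSTER-JOINT fit to all reported data can be configured as:

from syne_tune.optimizer.schedulers import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    searcher="bayesopt",  # GP-based searcher, i.e. MOBSTER
    search_options=dict(model="gp_multitask"),
    searcher_data="all",  # fit the surrogate model to all reported metric values
    type="promotion",
    max_resource_attr=max_resource_attr,
    resource_attr=resource_attr,
    metric=benchmark.metric,
    mode=benchmark.mode,
)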

Finally, we can also combine ASHA with BOHB decision-making by choosing searcher="kde" in HyperbandScheduler. This gives an asynchronous version of BOHB.
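A minimal sketch, under the same assumptions as above:

from syne_tune.optimizer.schedulers import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    searcher="kde",  # density-ratio (TPE-style) searcher, as in BOHB
    type="promotion",
    max_resource_attr=max_resource_attr,
    resource_attr=resource_attr,
    metric=benchmark.metric,
    mode=benchmark.mode,
)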

MOBSTER-INDEP

Our launcher script runs (asynchronous) MOBSTER-INDEP if method="MOBSTER-INDEP". The independent GP model is selected by search_options["model"] = "gp_independent". MOBSTER tends to perform slightly better with a joint multi-task GP model than with independent GPs, justifying the Syne Tune default. In our experience so far, changing the covariance model in MOBSTER-JOINT has only marginal impact.


MOBSTER and Hyperband

Just like ASHA, MOBSTER can be run with multiple brackets, simply by selecting the brackets argument when creating HyperbandScheduler (a sketch is given below). In our experience so far, just like with ASHA, MOBSTER tends to work best with a single bracket.
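A minimal sketch (we assume the brackets argument is forwarded to HyperbandScheduler, as it is for ASHA):

scheduler = MOBSTER(
    config_space,
    type="promotion",
    brackets=3,  # run asynchronous Hyperband with 3 brackets (default: 1)
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)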

Controlling MOBSTER Computations

MOBSTER often outperforms ASHA substantially. However, when applied to a problem where many evaluations can be done, fitting the GP surrogate model to all observed data can become slow. In fact, Gaussian process inference scales cubically in the number of observations. The amount of computation spent by MOBSTER can be controlled:

  • Setting the limit max_size_data_for_model: Once the total number of observations is above this limit, the data is sampled down to this size. This is done in a way which retains all observations from trials which reached higher rung levels, while data from trials stopped early is more likely to be removed. This downsampling is redone every time the surrogate model is fit, so that new data (especially at higher rungs) is taken into account. Also, scheduling decisions about stopping, pausing, or promoting trials are always based on all data.

    The default value for max_size_data_for_model is DEFAULT_MAX_SIZE_DATA_FOR_MODEL. It can be changed by passing search_options = {"max_size_data_for_model": XYZ} when creating the MOBSTER scheduler. You can switch the limit mechanism off by passing None or a very large value. As the current default value is on the smaller end (to ensure fast computations), you may want to experiment with larger values as well; see the sketch after this list.

  • Parameters opt_skip_init_length, opt_skip_period: When fitting the GP surrogate model, the most expensive computation by far is refitting its own parameters, such as kernel parameters. The frequency of this computation can be regulated, as detailed in the API docs.
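Here is a sketch of how both mechanisms can be configured (the values are for illustration only; roughly, GP parameters are refit at every update early on and only periodically later, but consult the API docs for the precise semantics and defaults):

scheduler = MOBSTER(
    config_space,
    type="promotion",
    search_options=dict(
        max_size_data_for_model=1000,  # down-sample data beyond 1000 observations
        opt_skip_init_length=400,      # refit GP parameters every update early on ...
        opt_skip_period=3,             # ... later only every 3rd update
    ),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)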

Hyper-Tune

Hyper-Tune is a model-based extension of ASHA with some additional features compared to MOBSTER. It can be seen as extending MOBSTER-INDEP (with the "gp_independent" surrogate model) in two ways. First, it uses an acquisition function based on an ensemble predictive distribution, while MOBSTER relies on the \(r_{acq}\) heuristic from BOHB. Second, if multiple brackets are used (Hyperband case), Hyper-Tune offers an adaptive mechanism to sample the bracket for a new trial. Both extensions are based on a quantification of consistency of data on different rung levels, which is used to weight rung levels according to their reliability for making decisions (namely, which configuration \(\mathbf{x}\) and bracket \(r_{min}\) to associate with a new trial).

Our launcher script runs Hyper-Tune if method="HYPERTUNE-INDEP". The searcher can be configured with search_options, but the independent GPs model "gp_independent" is the default. In this example, Hyper-Tune is using a single bracket, so the difference to MOBSTER-INDEP is due to the ensemble predictive distribution for the acquisition function.

Syne Tune also implements Hyper-Tune with the GP multi-task surrogate models used in MOBSTER. In the result plots for this tutorial, the original Hyper-Tune is called HYPERTUNE-INDEP, while this latter variant is called HYPERTUNE-JOINT. Our launcher script runs it if method="HYPERTUNE-JOINT".
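Both variants can be constructed via the HyperTune baseline (a minimal sketch, assuming the same benchmark variables as in the MOBSTER example above):

from syne_tune.optimizer.baselines import HyperTune

scheduler = HyperTune(
    config_space,
    type="promotion",
    search_options=dict(model="gp_independent"),  # default; "gp_multitask" gives HYPERTUNE-JOINT
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)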


Finally, computations of Hyper-Tune can be controlled in the same way as in MOBSTER.

Hyper-Tune with Multiple Brackets

Just like ASHA and MOBSTER, Hyper-Tune can also be run with multiple brackets, simply by using the brackets argument of HyperbandScheduler. If brackets > 1, Hyper-Tune samples the bracket for a new trial from an adaptive distribution closely related to the ensemble distribution used for acquisitions. Our launcher script runs Hyper-Tune with 4 brackets if method="HYPERTUNE4-INDEP".

Recall that both ASHA and MOBSTER tend to work better with a single bracket than with multiple ones. This may well be due to the fixed, non-adaptive distribution that brackets are sampled from. Ideally, a method would learn over time whether a low rung level tends to be reliable in predicting the ordering at higher ones, or whether it should rather be avoided (and \(r_{min}\) be increased). This is what the adaptive mechanism in Hyper-Tune tries to do. In our comparisons, we find that HYPERTUNE-INDEP with multiple brackets can outperform MOBSTER-JOINT with a single bracket.

Details

In this section, we provide some details about Hyper-Tune and our implementation. The Hyper-Tune extensions are based on a quantification of the consistency of data at different rung levels. For example, assume that \(r < r_{*}\) are two rung levels, with sufficiently many points at \(r_{*}\). If \(\mathcal{X}_{*}\) collects the trials with data at \(r_{*}\), all of these have also been observed at \(r\). Sampling \(f(\mathcal{X}_{*}, r)\) from the posterior distribution of the surrogate model, we can compare the ordering of these predictions at \(r\) with the ordering of the observations at \(r_{*}\), using a pairwise ranking loss. A large loss value means frequent cross-overs of learning curves between \(r\) and \(r_{*}\), so that predictions at rung level \(r\) are unreliable when it comes to the ordering of the trials \(\mathcal{X}_{*}\) at \(r_{*}\).

At any point during the algorithm, denote by \(r_{*}\) the largest rung level with a sufficient number of observations (our implementation requires 6 points). Assuming that \(r_{*} > r_{min}\), we can estimate a distribution \([\theta_r]\) over the rung levels \(\mathcal{R}_{*} = \{r\in\mathcal{R}\, |\, r\le r_{*}\}\) as follows. We draw \(S\) independent samples from the model at these rung levels. For each sample \(s\), we compute loss values \(l_{r, s}\) for \((r, r_{*})\) over all \(r\in\mathcal{R}_{*}\), and determine the argmin indicator \([\text{I}_{l_{r, s} = m_s}]\), where \(m_s = \text{min}(l_{r, s}\, |\, r\in\mathcal{R}_{*})\). The distribution \([\theta_r]\) is obtained as the normalized sum of these indicators over \(s=1,\dots, S\). We also need the loss values \(l_{r_{*}, s}\); these are computed using a cross-validation approximation, see the code in syne_tune.optimizer.schedulers.searchers.bayesopt.gpautograd.hypertune for details. In the beginning, when there is too little data at the second rung level, we use \(\theta_{r_{min}} = 1\) and 0 elsewhere. A sketch of this estimation is given below.
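The following NumPy sketch illustrates this estimation (function and variable names are hypothetical; the real implementation, including the cross-validation treatment of \(l_{r_{*}, s}\), lives in the module referenced above):

import numpy as np

def pairwise_ranking_loss(pred, target):
    # Fraction of pairs (i, j) ordered differently by pred and target
    n, disagree, total = len(target), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if (pred[i] - pred[j]) * (target[i] - target[j]) < 0:
                disagree += 1
    return disagree / max(total, 1)

def estimate_theta(samples_by_rung, targets_at_rstar):
    # samples_by_rung: dict r -> array (S, n) of posterior samples of
    #   f(X_*, r); targets_at_rstar: array (n,) of observations at r_*
    rungs = sorted(samples_by_rung)
    num_samples = next(iter(samples_by_rung.values())).shape[0]
    counts = np.zeros(len(rungs))
    for s in range(num_samples):
        losses = [
            pairwise_ranking_loss(samples_by_rung[r][s], targets_at_rstar)
            for r in rungs
        ]
        counts[np.argmin(losses)] += 1  # argmin indicator for sample s
    return dict(zip(rungs, counts / num_samples))  # the distribution [theta_r]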

Decisions about a new configuration are based on an acquisition function over a predictive distribution indexed by \(\mathbf{x}\) alone. For Hyper-Tune, an ensemble distribution with weighting distribution \([\theta_r]\) is used. Sampling from this distribution works by first sampling \(r\sim [\theta_r]\), then \(f(\mathbf{x}) = f(\mathbf{x}, r)\) from the predictive distribution for that \(r\). This means that models from all rung levels are potentially used, weighted by how reliably they predict the ordering at the highest level \(r_{*}\) supported by data. In our experiments so far, this adaptive weighting can outperform the \(r_{acq}\) heuristic used in BOHB and MOBSTER.

Note that our implementation generalizes Hyper-Tune in that ranking losses and \([\theta_r]\) are estimated once \(r_{*} > r_{min}\) (i.e., once \(r_{*}\) is equal to the second rung level). In the original work, one has to wait until \(r_{*} = r_{max}\), i.e. until the maximum rung level is supported by enough data. We find that for many expensive tuning problems, early decision-making can make a large difference, so if the Hyper-Tune extensions provide benefits, they should be used as early during the experiment as possible. For example, in the trial plots for Hyper-Tune on this benchmark, it takes more than 10000 seconds for 6 trials to reach the full 200 epochs, so in the original variant of Hyper-Tune, advanced decision-making would only start when more than half of the experiment is already done.

If Hyper-Tune is used with more than one bracket, \([\theta_r]\) is also used to sample the bracket for a new trial. To this end, we need to determine a distribution \(P(r)\) over all rung levels which feature as \(r_{min}\) in a bracket. In our NASBench-201 example, if Hyper-Tune is run with 5 brackets, the support of \(P(r)\) would be \(\mathcal{S} = \{1, 3, 9, 27, 81\}\). Also, denote the default distribution used in ASHA and MOBSTER by \(P_0(r)\). Let \(r_0 = \text{min}(r_{*}, \text{max}(\mathcal{S}))\). For \(r\in\mathcal{S}\), we define \(P(r) = M \theta_r / r\) for \(r\le r_0\), and \(P(r) = P_0(r)\) for \(r > r_0\), where \(M\) is chosen such that the mass assigned to \(\{r\le r_0\}\) equals \(\sum_{r\in\mathcal{S}, r\le r_0} P_0(r)\). In other words, we use \(\theta_r / r\) for rung levels supported by data, and the default \(P_0(r)\) elsewhere. Once more, this slightly generalizes Hyper-Tune.
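A NumPy sketch of this bracket distribution (hypothetical names; we normalize \(\theta_r / r\) explicitly so that \(P(r)\) is a proper distribution, which is the role played by \(M\) above):

import numpy as np

def bracket_distribution(theta, p0, r_star):
    # theta: dict r -> theta_r (estimated as above); p0: dict r -> default
    #   ASHA/MOBSTER probability, support S = p0.keys(); r_star: largest
    #   rung level with enough data
    support = sorted(p0)
    r0 = min(r_star, max(support))
    low = [r for r in support if r <= r0]
    weights = np.array([theta.get(r, 0.0) / r for r in low])
    mass_low = sum(p0[r] for r in low)  # mass P_0 assigns to r <= r0
    if weights.sum() > 0:
        weights = mass_low * weights / weights.sum()
    else:
        weights = np.array([p0[r] for r in low])  # fall back to the default
    p = dict(zip(low, weights))
    p.update({r: p0[r] for r in support if r > r0})  # default elsewhere
    return p

For example, with \(\mathcal{S} = \{1, 3, 9, 27, 81\}\) and \(r_{*} = 9\), the weights \(\theta_r / r\) are used for \(r\in\{1, 3, 9\}\) and \(P_0(r)\) for \(r\in\{27, 81\}\).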

DyHPO

DyHPO is another recent model-based multi-fidelity method. It is a promotion-based scheduler like the ones discussed above with type="promotion", but it differs from MOBSTER and Hyper-Tune in that promotion decisions are based on the surrogate model rather than on the quantile-based rule of successive halving. In a nutshell:

  • Rung levels are equispaced: \(\mathcal{R} = \{ r_{min}, r_{min} + \nu, r_{min} + 2 \nu, \dots \}\). If \(r_{min} = \nu\), this means that a trial, whether promoted or started from scratch, always runs for \(\nu\) resources before the next decision, independent of its current rung level.

  • Once a worker is free, we can either promote a paused trial or start a new one. In DyHPO, all paused trials compete with a number of new configurations for the next \(\nu\) resources to be spent. The scoring criterion is a special version of expected improvement, so it depends on the surrogate model.

  • Compared to MOBSTER, the surrogate model is used more frequently: in MOBSTER, if any trial can be promoted, the surrogate model is not accessed at all. This means that DyHPO comes with higher decision-making costs, which need to be controlled.

  • Since scoring trials paused at the highest rung populated so far requires extrapolation in terms of the resource \(r\), DyHPO cannot be used with search_options["model"] = "gp_independent". The other surrogate models are supported.

Our implementation of DyHPO differs from the published work in a number of important points:

  • As published, DyHPO uses an advanced surrogate model based on a neural network covariance kernel which is fitted to the current data. Our implementation supports DyHPO with the GP surrogate models detailed above, except for "gp_independent".

  • Our decision rule differs from DyHPO as published, and can be seen as a hybrid between DyHPO and ASHA. Namely, we flip a coin with values in \(\{0, 1\}\), where the probability \(P_1\) of outcome 1 is configurable as probability_sh. If the outcome is 1, we try to promote a trial using the ASHA rule based on quantiles; here, the quantile thresholds are adjusted to the linear spacing of rung levels. If no trial can be promoted this way, we fall back to the DyHPO rule. If the outcome is 0, we use the DyHPO rule directly. The algorithm as published is obtained for \(P_1 = 0\). However, we find that a non-zero probability_sh is crucial for obtaining robust behaviour, since the original DyHPO rule on its own tends to start too many trials at the beginning before promoting any paused ones.

  • Since the surrogate model is used more frequently in DyHPO than in MOBSTER, it is important to control the surrogate model computations, as detailed above. Apart from the default for max_size_data_for_model, we also use opt_skip_period = 3 as the default for DyHPO. A usage sketch is given after this list.
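A minimal usage sketch (we assume the benchmark variables from the examples above; whether probability_sh is passed via rung_system_kwargs as shown here may depend on your Syne Tune version, so treat this plumbing as an assumption and check the API docs):

from syne_tune.optimizer.baselines import DyHPO

scheduler = DyHPO(
    config_space,
    search_options=dict(
        model="gp_multitask",  # "gp_independent" is not supported by DyHPO
        opt_skip_period=3,     # DyHPO default; limits surrogate model cost
    ),
    # ASSUMPTION: argument name, plumbing, and value are illustrative only
    rung_system_kwargs=dict(probability_sh=0.1),
    metric=benchmark.metric,
    mode=benchmark.mode,
    resource_attr=resource_attr,
    random_seed=random_seed,
    max_resource_attr=max_resource_attr,
)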
