Model-based Synchronous Hyperband

All methods considered so far have been extensions of random search by clever multi-fidelity scheduling. In this section, we consider combinations of Bayesian optimization with multi-fidelity scheduling, where configurations are chosen based on performance of previously chosen ones, rather than being sampled at random.

The tutorial "Basics of Syne Tune: Bayesian Optimization" provides an introduction to Bayesian optimization in Syne Tune.

Synchronous BOHB

The first model-based method we consider is BOHB, which uses the TPE formulation of Bayesian optimization. In the latter, an approximation to the expected improvement (EI) acquisition function is interpreted via a ratio of two densities. BOHB uses kernel density estimators rather than tree Parzen estimators (as in TPE) to model the two densities.
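The density-ratio idea can be illustrated with a minimal sketch. It is not Syne Tune's or BOHB's implementation: the one-dimensional setting, the fixed bandwidth, the `top_fraction` split, and all function names are assumptions made for illustration. Observations are split into a "good" and a "bad" set, a Gaussian kernel density estimate is fit to each, and candidates are ranked by the ratio of the two densities.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def density_ratio_scores(observations, candidates, top_fraction=0.25, bandwidth=0.1):
    """Score candidates by l(x) / g(x).

    `observations` is a list of (x, error) pairs, lower error is better.
    l is fit to the best `top_fraction` of the data, g to the remainder.
    At least two observations are required, so that both sets are non-empty.
    """
    ranked = sorted(observations, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(top_fraction * len(ranked)))
    n_good = min(n_good, len(ranked) - 1)  # keep the "bad" set non-empty
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    l_density = gaussian_kde(good, bandwidth)
    g_density = gaussian_kde(bad, bandwidth)
    eps = 1e-12  # guards against division by zero far away from all data
    return {x: l_density(x) / (g_density(x) + eps) for x in candidates}


# Toy data (made up): small x tends to give low validation error
observations = [
    (0.10, 0.30), (0.20, 0.25), (0.25, 0.22), (0.50, 0.40),
    (0.60, 0.50), (0.70, 0.55), (0.80, 0.60), (0.90, 0.65),
]
scores = density_ratio_scores(observations, candidates=[0.2, 0.8])
best_candidate = max(scores, key=scores.get)
```

In this toy example, the candidate close to the well-performing observations receives the larger ratio and would be the one suggested.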

BOHB uses the same scheduling mechanism (i.e., rung levels, promotion decisions) as synchronous Hyperband (or SH), but it uses a model fit to past data to suggest the configuration of every new trial. Recall that validation error after \(r\) epochs is denoted by \(f(\mathbf{x}, r)\), where \(\mathbf{x}\) is the configuration. BOHB fits KDEs separately to the data obtained at each rung level. When a new configuration is to be suggested, it first determines the largest rung level \(r_{acq}\) supported by enough data for the two densities to be properly fit. It then makes a TPE decision at this resource level. Our launcher script runs synchronous BOHB if method="BOHB".
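The choice of \(r_{acq}\) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `select_acquisition_rung`, the dictionary layout, and the `min_points` threshold are hypothetical, and the actual criterion for "enough data" may differ.

```python
def select_acquisition_rung(data_per_rung, min_points):
    """Return the largest rung level with at least `min_points` observations.

    `data_per_rung` maps a rung level r to the list of metric values
    observed at r. Returns None if no rung level has enough data.
    """
    eligible = [
        rung for rung, values in data_per_rung.items()
        if len(values) >= min_points
    ]
    return max(eligible) if eligible else None


# Toy data (made up): fewer trials reach the higher rung levels
data_per_rung = {
    1: [0.61, 0.58, 0.70, 0.55, 0.66, 0.59, 0.72, 0.63, 0.57],
    3: [0.48, 0.45, 0.51],
    9: [0.39],
}
r_acq = select_acquisition_rung(data_per_rung, min_points=3)
```

When the function returns `None`, no rung level supports a model fit yet. A natural fallback in that case, assumed here rather than taken from the text above, is to sample the configuration at random.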

API docs:

Baseline: SyncBOHB
Additional arguments: SynchronousGeometricHyperbandScheduler

While BOHB is often more efficient than SYNCHB, it is held back by synchronous decision-making. Note that BOHB does not model the random function \(f(\mathbf{x}, r)\) directly, which makes it hard to properly react to pending evaluations, i.e., trials which have been started but have not yet returned metric values. BOHB ignores pending evaluations if present, which can lead to redundant decisions if the number of workers (i.e., the parallelization factor) is large.

Synchronous MOBSTER

Another model-based variant is synchronous MOBSTER. We will provide more details on MOBSTER below, when discussing model-based asynchronous methods.

Our launcher script runs synchronous MOBSTER if method="SYNCMOBSTER". Note that the default surrogate model for SyncMOBSTER is gp_independent, where the data at each rung level is represented by an independent Gaussian process (more details are given here). It turns out that SyncMOBSTER outperforms SyncBOHB substantially on the benchmark chosen here.

API docs:

Baseline: SyncMOBSTER
Additional arguments: SynchronousGeometricHyperbandScheduler

When running these experiments with the simulator backend, we notice that an experiment suddenly takes quite some time to finish. While this is still many times faster than real time, we now need many minutes instead of seconds. This is a reminder that model-based decision-making can take time. In GP-based Bayesian optimization, the hyperparameters of a Gaussian process model are fit for every decision, and acquisition functions are optimized over many candidates. On the real time scale (the x axis in our result plots), this time is often well spent. After all, SyncMOBSTER outperforms SyncBOHB significantly. But since decision-making computations cannot be tabulated, they slow down the simulations.

As a consequence, we should be careful with result plots showing performance with respect to the number of training evaluations, as these hide both the time required to make decisions and potential inefficiencies in scheduling jobs in parallel. HPO methods should always be compared with real experiment time on the x axis, and the any-time performance of methods should be visualized by plotting curves, not just quoting “final values”. Examples are provided here.
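An any-time performance curve over wall-clock time can be computed with a few lines. The sketch below is generic and does not use Syne Tune's result-handling utilities. The function name `best_so_far_curve` and the toy data are assumptions for illustration.

```python
def best_so_far_curve(results):
    """Compute the best validation error seen up to each point in time.

    `results` is an iterable of (elapsed_seconds, validation_error) pairs,
    in any order. Returns a list of (elapsed_seconds, best_error_so_far),
    sorted by time, which can be plotted as a step curve.
    """
    curve = []
    best = float("inf")
    for elapsed, error in sorted(results):
        best = min(best, error)
        curve.append((elapsed, best))
    return curve


# Toy data (made up): results arrive out of order from parallel workers
results = [(120.0, 0.40), (30.0, 0.55), (300.0, 0.45), (600.0, 0.31)]
curve = best_so_far_curve(results)
# curve == [(30.0, 0.55), (120.0, 0.40), (300.0, 0.40), (600.0, 0.31)]
```

Because the x values are elapsed times rather than evaluation counts, time spent on decision-making and idle workers shows up in the curve as horizontal stretches.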

Note

Syne Tune allows experiments to be launched remotely and in parallel in order to still obtain results rapidly, as is detailed here.

Differential Evolution Hyperband

Another recent model-based extension of synchronous Hyperband is Differential Evolution Hyperband (DEHB). DEHB is typically run with multiple brackets. A main difference from Hyperband is that configurations promoted from one rung to the next are also modified by an evolutionary rule, involving mutation, crossover, and selection. Since configurations are not just sampled once, but potentially modified at every rung, the hope is to find well-performing configurations faster. Our launcher script runs DEHB if method="DEHB".
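To make the evolutionary rule more concrete, here is a minimal sketch of one differential evolution step. It is not Syne Tune's DEHB implementation, and it makes several assumptions: configurations are encoded as vectors in the unit cube, mutation uses the common rand/1 strategy, crossover is binomial, and the function names, the toy objective, and the values of `scale` and `crossover_prob` are chosen for illustration only.

```python
import random


def rand1_mutation(population, scale, rng):
    """rand/1 mutation: a + scale * (b - c) for three distinct members."""
    a, b, c = rng.sample(population, 3)
    # Clip to [0, 1], since configurations are encoded in the unit cube
    return [
        min(1.0, max(0.0, ai + scale * (bi - ci)))
        for ai, bi, ci in zip(a, b, c)
    ]


def binomial_crossover(target, mutant, crossover_prob, rng):
    """Mix target and mutant per dimension; one dimension always comes from the mutant."""
    forced = rng.randrange(len(target))
    return [
        mutant[j] if (j == forced or rng.random() < crossover_prob) else target[j]
        for j in range(len(target))
    ]


def select(target, target_error, offspring, offspring_error):
    """Keep the offspring only if it is at least as good as the target."""
    if offspring_error <= target_error:
        return offspring, offspring_error
    return target, target_error


def toy_error(config):
    """Made-up objective with its minimum at (0.3, 0.3)."""
    return sum((v - 0.3) ** 2 for v in config)


rng = random.Random(0)
population = [[rng.random(), rng.random()] for _ in range(5)]
target = population[0]
mutant = rand1_mutation(population, scale=0.5, rng=rng)
offspring = binomial_crossover(target, mutant, crossover_prob=0.5, rng=rng)
winner, winner_error = select(
    target, toy_error(target), offspring, toy_error(offspring)
)
```

In DEHB, the evaluation of the offspring would happen at the resource level of the rung in question. The toy objective above stands in for that evaluation.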

API docs:

Baseline: DEHB
Additional arguments: GeometricDifferentialEvolutionHyperbandScheduler

The main advantage of DEHB over synchronous Hyperband is that configurations can be modified at every rung. However, this feature also has a drawback: DEHB cannot make effective use of checkpointing. If a trial is resumed with a different configuration, starting from its most recent checkpoint is not admissible. However, our implementation is careful to make use of checkpointing in the very first bracket of DEHB, which is equivalent to a normal run of synchronous SH.
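The admissibility condition can be stated as a small rule. The sketch below is illustrative only. The function name `resume_epoch` and the dictionary representation of configurations are assumptions, not part of Syne Tune.

```python
def resume_epoch(previous_config, new_config, checkpoint_epoch):
    """Return the epoch after which training may continue.

    A checkpoint written for `previous_config` can only be reused if the
    trial is resumed with the very same configuration. Otherwise, training
    has to start from scratch (epoch 0).
    """
    return checkpoint_epoch if previous_config == new_config else 0


config = {"learning_rate": 0.01, "batch_size": 64}
modified = {"learning_rate": 0.02, "batch_size": 64}  # changed by evolution

epoch_unchanged = resume_epoch(config, dict(config), checkpoint_epoch=9)
epoch_modified = resume_epoch(config, modified, checkpoint_epoch=9)
```

In the first bracket, configurations are promoted unchanged, so the first case applies and checkpoints can be reused. Once configurations are modified on promotion, the second case applies.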