Extending Asynchronous Hyperband
Syne Tune provides powerful generic scheduler templates for popular methods like successive halving and Hyperband. These can be run with synchronous or asynchronous decision-making. The most important generic templates at the moment are:
FIFOScheduler: Full evaluation scheduler, base class for many others. See also FIFOScheduler.
HyperbandScheduler: Asynchronous successive halving and Hyperband. See also HyperbandScheduler.
SynchronousHyperbandScheduler: Synchronous successive halving and Hyperband. See also SynchronousHyperbandScheduler.
Chances are your idea for a new scheduler maps to one of these templates, in which case you can save a lot of time and headache by extending the template rather than reinventing the wheel. Due to Syne Tune’s modular design of schedulers and their components (e.g., searchers, decision rules), you may even get more than you bargained for.
In this section, we will walk through an example of how to furnish the asynchronous successive halving scheduler with a specific searcher.
HyperbandScheduler
Details about asynchronous successive halving and Hyperband are given in the
Multi-fidelity HPO tutorial. This is a
multi-fidelity scheduler, where trials report intermediate results (e.g.,
validation error at the end of each epoch of training). We can formalize this
notion by the concept of resource \(r = 1, 2, 3, \dots\) (e.g.,
\(r\) is the number of epochs trained). A generic implementation of this
method is provided in the class HyperbandScheduler.
Let us have a look at its arguments not shared with the base class
FIFOScheduler:
A mandatory argument is resource_attr, which is the name of a field in the result dictionary passed to scheduler.on_trial_report. This field contains the resource \(r\) for which metric values have been reported. For example, if a trial reports validation error at the end of the 5th epoch of training, result contains {resource_attr: 5}.
We already noted the arguments max_resource_attr and max_t in the class FIFOScheduler. They are used to determine the maximum resource \(r_{max}\) (e.g., the total number of epochs a trial is to be trained, if not stopped before). As discussed in detail elsewhere in the documentation, it is best practice to reserve a field in the configuration space scheduler.config_space to contain \(r_{max}\). If this is done, its name should be passed in max_resource_attr. Now, every configuration sent to the training script contains \(r_{max}\), which should not be hardcoded in the script. Moreover, if max_resource_attr is used, a pause-and-resume scheduler (e.g., HyperbandScheduler with type="promotion") can modify this field in the configuration of a trial which is only to be run until a certain resource less than \(r_{max}\). If max_resource_attr is not used, then \(r_{max}\) has to be passed explicitly via max_t (which is not needed if max_resource_attr is used).
reduction_factor, grace_period, and brackets are important parameters detailed in the tutorial. If brackets > 1, we run asynchronous Hyperband with this number of brackets, while for brackets == 1 we run asynchronous successive halving (this is the default).
As detailed in the tutorial, type determines whether the method uses early stopping (type="stopping") or pause-and-resume scheduling (type="promotion"). Further choices of type activate specific algorithms such as RUSH, PASHA, or cost-sensitive successive halving.
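To make the roles of resource_attr and max_resource_attr concrete, here is a minimal sketch of a training loop that reads \(r_{max}\) from its configuration and reports one result dictionary per epoch. It does not use Syne Tune itself: the field names "epoch" and "epochs", the metric name, and the functions train_one_epoch and run_trial are hypothetical and chosen for illustration only, and report stands in for whatever mechanism passes results to scheduler.on_trial_report.

```python
# Hypothetical field names, chosen for this sketch only
RESOURCE_ATTR = "epoch"        # would be passed as resource_attr
MAX_RESOURCE_ATTR = "epochs"   # would be passed as max_resource_attr
METRIC = "validation_error"


def train_one_epoch(config, epoch):
    """Stand-in for real training: returns a fake, decreasing validation error."""
    return 1.0 / (1.0 + epoch * config["learning_rate"])


def run_trial(config, report):
    """Train for r_max epochs, reporting a result dictionary after each one."""
    # r_max is read from the configuration instead of being hardcoded
    max_epochs = config[MAX_RESOURCE_ATTR]
    for epoch in range(1, max_epochs + 1):
        error = train_one_epoch(config, epoch)
        # The resource reached so far is stored under the resource_attr field
        report({RESOURCE_ATTR: epoch, METRIC: error})


# A trial that runs up to r_max = 5
full_run = []
run_trial({"learning_rate": 0.1, MAX_RESOURCE_ATTR: 5}, full_run.append)

# The same script runs a shorter trial if the field in the configuration is changed
short_run = []
run_trial({"learning_rate": 0.1, MAX_RESOURCE_ATTR: 2}, short_run.append)
```

Because the script takes its epoch budget from the configuration, changing that one field is enough to run a trial only up to a smaller resource, without touching the training code.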
Kernel Density Estimator Searcher
One of the most flexible ways of extending
HyperbandScheduler is to provide it with
a novel searcher. In order to
understand how this is done, we will walk through
MultiFidelityKernelDensityEstimator
and
KernelDensityEstimator.
This searcher implements suggest as in
BOHB, as also detailed in
this tutorial. In a
nutshell, the searcher splits all observations into two parts (good and
bad), depending on metric values lying above or below a certain quantile, and
fits kernel density estimators to these two subsets. It then makes decisions
based on a particular ratio of these densities, which approximates a
variant of the expected improvement acquisition function.
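The good/bad split and the density ratio can be illustrated with a toy example in one dimension. This is a sketch under simplifying assumptions, not the code of KernelDensityEstimator: it uses a plain Gaussian kernel with a fixed bandwidth, assumes smaller metric values are better, and the names gaussian_kde, split_good_bad, propose, top_fraction, and bandwidth are hypothetical.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a function of x."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def split_good_bad(xs, ys, top_fraction):
    """Split inputs into good and bad parts by metric value (smaller is better)."""
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    num_good = max(1, math.ceil(top_fraction * len(ys)))
    good = [xs[i] for i in order[:num_good]]
    bad = [xs[i] for i in order[num_good:]]
    return good, bad


def propose(xs, ys, num_candidates, rng, top_fraction=0.25, bandwidth=0.1):
    """Sample candidates at random and return the one with the largest density ratio."""
    good, bad = split_good_bad(xs, ys, top_fraction)
    good_density = gaussian_kde(good, bandwidth)
    bad_density = gaussian_kde(bad, bandwidth)
    candidates = [rng.random() for _ in range(num_candidates)]
    # Small constant avoids division by zero where the bad density vanishes
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))


# Toy data: 20 evaluations of a metric that is minimized at x = 0.3
rng = random.Random(0)
xs = [i / 19 for i in range(20)]
ys = [(x - 0.3) ** 2 for x in xs]
best = propose(xs, ys, num_candidates=64, rng=rng)
```

The proposed point lands where good observations are dense and bad ones are sparse, which in this toy problem is the region around the minimum.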
We begin with the base class
KernelDensityEstimator,
which works with schedulers implementing
TrialSchedulerWithSearcher
(the most important one being FIFOScheduler),
but already implements most of what is needed in the multi-fidelity context.
The code does quite a bit of bookkeeping concerned with mapping configurations to feature vectors. If you want to do this from scratch for your searcher, we recommend using HyperparameterRanges. However, KernelDensityEstimator was extracted from the original BOHB implementation.
Observation data is collected in self.X (feature vectors for configurations) and self.y (values for self._metric, negated if self.mode == "max"). In particular, the _update method simply appends new data to these members.
get_config fits KDEs to the good and bad parts of self.X, self.y. It then samples self.num_candidates configurations at random, evaluates the TPE acquisition function for each candidate, and returns the best one.
configure_scheduler is a callback which allows the searcher to check whether its scheduler is compatible, and to depend on details of this scheduler. In our case, we check whether the scheduler implements TrialSchedulerWithSearcher, which is the minimum requirement for a searcher.
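The bookkeeping and the compatibility check described above follow a simple pattern, sketched here with stand-in classes. None of these classes are Syne Tune's: SchedulerWithSearcher, UnrelatedScheduler, and ToySearcher are hypothetical names, and the real _update and configure_scheduler signatures may differ.

```python
class SchedulerWithSearcher:
    """Hypothetical stand-in for a scheduler base class that owns a searcher."""

    def __init__(self, searcher):
        self.searcher = searcher
        # The scheduler calls back into the searcher before its first use
        searcher.configure_scheduler(self)


class UnrelatedScheduler:
    """Hypothetical scheduler that does not support searchers."""


class ToySearcher:
    """Collects observations the way the text describes for self.X and self.y."""

    def __init__(self, metric, mode="min"):
        self._metric = metric
        self.mode = mode
        self.X = []  # feature vectors for configurations
        self.y = []  # metric values, negated if mode == "max"

    def configure_scheduler(self, scheduler):
        # Restrict usage to compatible schedulers
        if not isinstance(scheduler, SchedulerWithSearcher):
            raise TypeError("This searcher requires a SchedulerWithSearcher")

    def _update(self, features, result):
        value = result[self._metric]
        self.X.append(list(features))
        # Negation turns maximization into minimization internally
        self.y.append(-value if self.mode == "max" else value)


searcher = ToySearcher(metric="accuracy", mode="max")
scheduler = SchedulerWithSearcher(searcher)
searcher._update([0.1, 0.5], {"accuracy": 0.9})
```

Storing negated values for mode == "max" lets the rest of the searcher assume that smaller is always better.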
Note
Any scheduler configured by a searcher should inherit from
TrialSchedulerWithSearcher,
which mainly makes sure that
configure_scheduler()
is called before the searcher is first used. It is also strongly recommended
to implement configure_scheduler for a new searcher, restricting usage
to compatible schedulers.
The class
MultiFidelityKernelDensityEstimator
inherits from KernelDensityEstimator:
On top of self.X and self.y, it also maintains resource values \(r\) for each datapoint in self.resource_levels.
get_config remains the same, only its subroutine train_kde for training the good and bad density models is modified. The idea is to fit these to data from a single rung level, namely the largest level at which we have observed at least self.num_min_data_points points.
configure_scheduler restricts usage to schedulers implementing MultiFidelitySchedulerMixin, which all multi-fidelity schedulers need to inherit from (examples are HyperbandScheduler for asynchronous Hyperband and SynchronousHyperbandScheduler for synchronous Hyperband). It also calls configure_scheduler() of its base class KernelDensityEstimator. Moreover, self.resource_attr is obtained from the scheduler, so it does not have to be passed.
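The rule for choosing the rung level whose data is used to fit the densities can be sketched as follows. This is an illustration of the selection idea only, not the code of train_kde: the function names select_rung_level and data_at_level are hypothetical, and returning None when no level has enough data is an assumption of this sketch (the caller would need some fallback in that case).

```python
from collections import Counter


def select_rung_level(resource_levels, num_min_data_points):
    """Largest resource level with at least num_min_data_points observations."""
    counts = Counter(resource_levels)
    eligible = [level for level, count in counts.items() if count >= num_min_data_points]
    # None signals that no level has enough data yet (assumption of this sketch)
    return max(eligible) if eligible else None


def data_at_level(X, y, resource_levels, level):
    """Restrict the observations to those recorded at the given level."""
    X_sel = [x for x, r in zip(X, resource_levels) if r == level]
    y_sel = [v for v, r in zip(y, resource_levels) if r == level]
    return X_sel, y_sel


# Toy data: four points at r=1, three at r=3, one at r=9
X = [[0.1], [0.2], [0.3], [0.4], [0.1], [0.2], [0.3], [0.1]]
y = [0.9, 0.8, 0.7, 0.85, 0.6, 0.5, 0.55, 0.3]
resource_levels = [1, 1, 1, 1, 3, 3, 3, 9]

level = select_rung_level(resource_levels, num_min_data_points=3)
X_level, y_level = data_at_level(X, y, resource_levels, level)
```

In this toy data, the level r=9 has only one observation, so the densities would be fit to the three points at r=3.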
Note
Any multi-fidelity scheduler configured by a searcher should inherit from both
TrialSchedulerWithSearcher and
MultiFidelitySchedulerMixin.
The latter is a basic API to be implemented by multi-fidelity schedulers, which
is used by the configure_scheduler of searchers specialized to multi-fidelity
HPO. Doing so makes sure any new multi-fidelity scheduler can seamlessly be
used with any such searcher.
While being functional and simple, the
MultiFidelityKernelDensityEstimator does not showcase the full range of
information exchanged between HyperbandScheduler and a searcher. In
particular:
register_pending: BOHB does not take pending evaluations into account.
remove_case, evaluation_failed are not implemented.
get_state, clone_from_state are not implemented, so schedulers with this searcher are not properly serialized.
For a more complete and advanced example, the reader is invited to study
GPMultiFidelitySearcher and
GPFIFOSearcher.
These searchers take pending evaluations into account (by way of fantasizing).
Moreover, they can be configured with a Gaussian process model and an acquisition
function, which is optimized in a gradient-based manner.
Finally, as already noted elsewhere in the documentation,
HyperbandScheduler also allows configuring the decision rule for
stop/continue or pause/resume as part of on_trial_report. Examples of this
are found in
StoppingRungSystem,
PromotionRungSystem,
RUSHStoppingRungSystem,
PASHARungSystem,
CostPromotionRungSystem.
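To give a flavour of what such a decision rule does, here is a toy stop/continue rule for a single rung level in the style of successive halving: a trial continues only if its metric value ranks among the best 1 / reduction_factor of all values recorded at that rung so far. This is a simplified sketch, not the code of StoppingRungSystem; the class ToyStoppingRung and its method should_continue are hypothetical, and smaller metric values are assumed to be better.

```python
import math


class ToyStoppingRung:
    """Records metric values at one rung level and decides stop or continue."""

    def __init__(self, reduction_factor):
        self.reduction_factor = reduction_factor
        self.values = []

    def should_continue(self, value):
        """Record value; return True if it is in the top 1/reduction_factor fraction."""
        self.values.append(value)
        # Number of best-ranked entries that are allowed to continue
        num_top = max(1, math.ceil(len(self.values) / self.reduction_factor))
        cutoff = sorted(self.values)[num_top - 1]
        return value <= cutoff


rung = ToyStoppingRung(reduction_factor=3)
# Trials reach this rung one after the other, as in asynchronous scheduling
decisions = [rung.should_continue(v) for v in [0.5, 0.7, 0.3, 0.4]]
```

Decisions are made as soon as a trial reports at the rung, using only the values seen so far, which is what makes the rule asynchronous.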