Overview of Module Structure
We begin with an overview of the module structure of the Bayesian optimization (BO) code in Syne Tune. Feel free to directly move to the first example and come back here for reference.
Recall that Bayesian optimization is implemented in a searcher, which is the component of a scheduler responsible for suggesting the next configuration to sample, given data from earlier trials. While searchers using BO are located in syne_tune.optimizer.schedulers.searchers and its submodules, the BO code itself is found in syne_tune.optimizer.schedulers.searchers.bayesopt.
Recall that a typical BO algorithm is configured by a surrogate model and an acquisition function. In Syne Tune, acquisition functions are implemented generically, while (except for special cases) surrogate models can be grouped into two different classes:

- Gaussian process based surrogate models: implementations in gpautograd.
- Surrogate models based on scikit-learn like estimators: implementations in sklearn.
The remaining code in syne_tune.optimizer.schedulers.searchers.bayesopt
is generic or wraps lower-level code. Submodules are as follows:
- datatypes: Collects types related to maintaining data obtained from trials. The most important class is TuningJobState, which collects relevant data during an experiment. Note that other relevant classes are in syne_tune.optimizer.schedulers.searchers.utils, such as HyperparameterRanges, which wraps a configuration space and maps configurations to encoded vectors used as inputs to a surrogate model.
- models: Contains a range of surrogate models, both for single-fidelity and multi-fidelity tuning, along with the machinery to fit parameters of these models. In a nutshell, retraining of parameters and posterior computations for a surrogate model are defined in Estimator, which returns a Predictor to be used for posterior predictions, which in turn drive the optimization of an acquisition function. A model-based searcher interacts with a ModelStateTransformer, which maintains the state of the experiment (a TuningJobState object) and interacts with an Estimator. Subclasses of Estimator and Predictor are mainly wrappers of underlying code in gpautograd or sklearn. Details will be provided shortly, and a minimal sketch of this interplay follows the list below. This module also contains a range of acquisition functions, mostly in meanstd_acqfunc.
- tuning_algorithms: The Bayesian optimization logic resides here, mostly in BayesianOptimizationAlgorithm. Interfaces for all relevant concepts are defined in base_classes:
  - Predictor: Probabilistic predictor obtained from a surrogate model, to be plugged into an acquisition function.
  - AcquisitionFunction: Acquisition function, which is optimized in order to suggest the next configuration.
  - ScoringFunction: Base class of AcquisitionFunction which does not support gradient computations. Scoring functions can be used to rank a finite number of candidates.
  - LocalOptimizer: Local optimizer for minimizing the acquisition function.
- gpautograd: The Gaussian process based surrogate models, defined in models, can be implemented in different ways. Syne Tune currently uses the lightweight autograd library, and the corresponding implementation lies in this module.
- sklearn: Collects code required to implement surrogate models based on scikit-learn like estimators.
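The following is a minimal, hypothetical sketch of how the Estimator / Predictor split described above fits together. All class and method names here (the *Sketch classes, fit_from_state, predict) are illustrative stand-ins under assumed signatures, not Syne Tune's actual interfaces; the real ones live in models and in tuning_algorithms.base_classes.

```python
# Hypothetical, simplified mirror of the Estimator / Predictor split described
# above. Names and signatures are illustrative only, not Syne Tune's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class TuningJobStateSketch:
    """Stand-in for TuningJobState: encoded configs plus observed metrics."""
    features: np.ndarray  # shape (n, d), encoded configurations
    targets: np.ndarray   # shape (n,), observed metric values


class PredictorSketch:
    """Returned by the estimator; provides posterior predictions."""

    def __init__(self, mean: float, std: float):
        self._mean, self._std = mean, std

    def predict(self, inputs: np.ndarray):
        # Trivial constant posterior, just to show the call pattern: real
        # predictors return per-input posterior means and standard deviations.
        n = inputs.shape[0]
        return np.full(n, self._mean), np.full(n, self._std)


class EstimatorSketch:
    """(Re)fits model parameters from the state and returns a predictor."""

    def fit_from_state(self, state: TuningJobStateSketch) -> PredictorSketch:
        return PredictorSketch(
            mean=float(state.targets.mean()),
            std=float(state.targets.std() + 1e-9),
        )


if __name__ == "__main__":
    state = TuningJobStateSketch(
        features=np.random.rand(5, 3), targets=np.random.rand(5)
    )
    predictor = EstimatorSketch().fit_from_state(state)
    means, stds = predictor.predict(np.random.rand(2, 3))
    print(means, stds)
```

An acquisition function would consume the predictor's means and standard deviations to score candidate configurations; the role of ModelStateTransformer is to keep the experiment state up to date and trigger refits via the estimator.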
Note
The most low-level code for Gaussian process based Bayesian optimization is contained in gpautograd, which is specific to autograd and L-BFGS optimization. Unless you want to implement a new kernel function, you probably do not have to extend this code. As we will see, most extensions of interest can be done in models (new surrogate model, new acquisition function) or in tuning_algorithms (different BO workflow).
A Walk Through Bayesian Optimization
The key primitive of BO is to suggest a next configuration to evaluate the
unknown target function at (e.g., the validation error after training a
machine learning model with a hyperparameter configuration), based on all
data gathered about this function in the past. This primitive is triggered in
the get_config()
method of a BO searcher. It consists of two main steps:
- Estimate surrogate model(s), given all data obtained so far. Often, a single surrogate model represents the target metric of interest, but in generalized setups such as multi-fidelity, constrained, or multi-objective BO, surrogate models may be fit to several metrics. A surrogate model provides predictive distributions for the metric it represents, at any configuration, which allows BO to explore the space of configurations not yet sampled. For most built-in GP based surrogate models, estimation is done by maximizing the log marginal likelihood, as we see in more detail below; a self-contained sketch of this step follows the list.
- Use probabilistic predictions of surrogate models to search for the best next configuration to sample at. This is done in BayesianOptimizationAlgorithm, and is the main focus here.
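To make the first step concrete, here is a self-contained sketch, in plain NumPy/SciPy rather than Syne Tune code, of fitting GP hyperparameters by maximizing the log marginal likelihood with L-BFGS. The kernel choice, parameterization, and function names are assumptions for illustration; in Syne Tune this role is played by the gpautograd module.

```python
# Illustrative GP hyperparameter fitting by maximizing the log marginal
# likelihood (equivalently, minimizing its negation). Not Syne Tune code.
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(x1, x2, lengthscale, variance):
    # Squared exponential kernel between rows of x1 and x2
    sqdist = (
        np.sum(x1 ** 2, axis=1)[:, None]
        + np.sum(x2 ** 2, axis=1)[None, :]
        - 2.0 * x1 @ x2.T
    )
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)


def neg_log_marginal_likelihood(log_params, x, y):
    lengthscale, variance, noise = np.exp(log_params)
    k = rbf_kernel(x, x, lengthscale, variance) + noise * np.eye(len(x))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    # Negative log p(y | X, params), dropping the constant n/2 * log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(chol)))


def fit_gp_hyperparameters(x, y):
    # L-BFGS-B on the negative log marginal likelihood, in log space so that
    # lengthscale, signal variance, and noise variance stay positive
    result = minimize(
        neg_log_marginal_likelihood,
        x0=np.zeros(3),
        args=(x, y),
        method="L-BFGS-B",
    )
    return dict(zip(["lengthscale", "variance", "noise"], np.exp(result.x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(size=(20, 2))
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(20)
    print(fit_gp_hyperparameters(x, y))
```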
BayesianOptimizationAlgorithm can suggest a batch of num_requested_candidates > 1 configurations. If greedy_batch_selection == True, this is done greedily, one configuration at a time, yet diversity is maintained by inserting already suggested configurations as pending into the state. If greedy_batch_selection == False, we simply return the num_requested_candidates top-scoring configurations.
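Schematically, and glossing over how scores are computed, the two batch modes can be contrasted as follows. suggest_batch and score are hypothetical helpers, with score standing in for acquisition scoring conditioned on pending configurations; this is not Syne Tune's actual implementation.

```python
# Sketch of greedy batch selection with pending configurations versus simply
# returning the top-scoring candidates. Hypothetical helpers, not Syne Tune code.
from typing import Callable, Dict, List

Config = Dict[str, float]


def suggest_batch(
    candidates: List[Config],
    score: Callable[[Config, List[Config]], float],  # lower is better
    num_requested_candidates: int,
    greedy_batch_selection: bool,
) -> List[Config]:
    if greedy_batch_selection:
        # Greedy mode: pick one configuration at a time; configurations
        # already chosen are treated as pending, which changes the scores
        # and thereby encourages diversity within the batch.
        pending: List[Config] = []
        remaining = list(candidates)
        for _ in range(min(num_requested_candidates, len(remaining))):
            best = min(remaining, key=lambda c: score(c, pending))
            pending.append(best)
            remaining.remove(best)
        return pending
    # Non-greedy mode: rank all candidates once and return the top scorers
    ranked = sorted(candidates, key=lambda c: score(c, []))
    return ranked[:num_requested_candidates]
```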
For simplicity, we focus on num_requested_candidates == 1, so that a single configuration is suggested. This happens in several steps:
- First, a list of num_initial_candidates initial configurations is drawn at random from initial_candidates_generator of type CandidateGenerator.
- Next, these configurations are scored using initial_candidates_scorer of type ScoringFunction. This is a parent class of AcquisitionFunction, but acquisition functions support gradient computation as well. The scoring function typically depends on a predictor obtained from a surrogate model.
- Finally, local optimization of an acquisition function is run, using an instance of LocalOptimizer, which depends on an acquisition function and one or more predictors. Local optimization is initialized with the top-scoring configuration from the previous step. If it fails or does not result in a configuration with a better acquisition value, this initial configuration is returned. The final local optimization can be skipped by passing an instance of NoOptimization.
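Putting the three steps together, the end-to-end flow looks roughly like the sketch below. sample_configuration, acquisition, and local_search are hypothetical callables standing in for the candidate generator, the scoring/acquisition function, and the local optimizer; the actual logic lives in BayesianOptimizationAlgorithm.

```python
# Schematic version of the three steps above; not Syne Tune's actual code.
from typing import Callable


def suggest_one(
    sample_configuration: Callable[[], dict],
    acquisition: Callable[[dict], float],   # lower is better
    local_search: Callable[[dict], dict],
    num_initial_candidates: int = 50,
) -> dict:
    # Step 1: draw initial configurations at random
    candidates = [sample_configuration() for _ in range(num_initial_candidates)]
    # Step 2: score all candidates and keep the best one
    best = min(candidates, key=acquisition)
    # Step 3: local optimization, starting from the top-scoring candidate;
    # fall back to that candidate if the optimizer fails or does not improve
    try:
        refined = local_search(best)
    except Exception:
        return best
    return refined if acquisition(refined) < acquisition(best) else best
```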
This workflow offers a number of opportunities for customization:
- The initial_candidates_generator by default draws configurations at random with replacement (checking for duplicates is expensive and does not add value). It could be replaced by pseudo-random sampling with better coverage properties, or by Latin hypercube designs; a hypothetical example follows this list.
- The initial_candidates_scorer is often the same as the acquisition function used in the final local optimization. Other acquisition strategies, such as (independent) Thompson sampling, can be implemented here.
- You may want to customize the acquisition function feeding into local optimization (and initial scoring); more details are provided below.
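For instance, a quasi-random generator with better coverage than i.i.d. sampling could look roughly like the following. This is a hypothetical stand-alone function handling only numeric box bounds; it does not implement Syne Tune's CandidateGenerator interface.

```python
# Sketch of a quasi-random initial candidate generator using a Sobol sequence.
# Hypothetical helper for illustration only; not part of Syne Tune.
from scipy.stats import qmc


def sobol_initial_candidates(num_candidates: int, bounds: dict) -> list:
    """Draw numeric configurations from a scrambled Sobol sequence."""
    names = list(bounds)
    lows = [bounds[name][0] for name in names]
    highs = [bounds[name][1] for name in names]
    sampler = qmc.Sobol(d=len(names), scramble=True)
    points = qmc.scale(sampler.random(num_candidates), lows, highs)
    return [dict(zip(names, point)) for point in points]


if __name__ == "__main__":
    candidates = sobol_initial_candidates(
        8, {"lr": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
    )
    print(candidates[:2])
```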