Overview of Module Structure
We begin with an overview of the module structure of the Bayesian optimization (BO) code in Syne Tune. Feel free to directly move to the first example and come back here for reference.
Recall that Bayesian optimization is implemented in a searcher, which is the component of a scheduler responsible for suggesting the next configuration to sample, given data from earlier trials. While searchers using BO are located in syne_tune.optimizer.schedulers.searchers and its submodules, the BO code itself is found in syne_tune.optimizer.schedulers.searchers.bayesopt.
Recall that a typical BO algorithm is configured by a surrogate model and an acquisition function. In Syne Tune, acquisition functions are implemented generically, while (except for special cases) surrogate models can be grouped into two different classes:

- Gaussian process based surrogate models: implementations in gpautograd.
- Surrogate models based on scikit-learn like estimators: implementations in sklearn.
The remaining code in syne_tune.optimizer.schedulers.searchers.bayesopt
is generic or wraps lower-level code. Submodules are as follows:
- datatypes: Collects types related to maintaining data obtained from trials. The most important class is TuningJobState, which collects relevant data during an experiment. Note that other relevant classes are in syne_tune.optimizer.schedulers.searchers.utils, such as HyperparameterRanges, which wraps a configuration space and maps configurations to encoded vectors used as inputs to a surrogate model.
- models: Contains a range of surrogate models, both for single-fidelity and multi-fidelity tuning, along with the machinery to fit parameters of these models. In a nutshell, retraining of parameters and posterior computations for a surrogate model are defined in Estimator, which returns a Predictor to be used for posterior predictions, which in turn drive the optimization of an acquisition function. A model-based searcher interacts with a ModelStateTransformer, which maintains the state of the experiment (a TuningJobState object) and interacts with an Estimator. Subclasses of Estimator and Predictor are mainly wrappers of underlying code in gpautograd or sklearn. Details will be provided shortly, and a minimal sketch of this interplay follows the list below. This module also contains a range of acquisition functions, mostly in meanstd_acqfunc.
- tuning_algorithms: The Bayesian optimization logic resides here, mostly in BayesianOptimizationAlgorithm. Interfaces for all relevant concepts are defined in base_classes:
  - Predictor: Probabilistic predictor obtained from a surrogate model, to be plugged into an acquisition function.
  - AcquisitionFunction: Acquisition function, which is optimized in order to suggest the next configuration.
  - ScoringFunction: Base class of AcquisitionFunction which does not support gradient computations. Scoring functions can be used to rank a finite number of candidates.
  - LocalOptimizer: Local optimizer for minimizing the acquisition function.
- gpautograd: The Gaussian process based surrogate models, defined in models, can be implemented in different ways. Syne Tune currently uses the lightweight autograd library, and the corresponding implementation lies in this module.
- sklearn: Collects code required to implement surrogate models based on scikit-learn like estimators.
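The following is a minimal, hypothetical sketch of how the Estimator / Predictor split described above fits together. All class and method names here (the *Sketch classes, fit_from_state, predict) are illustrative stand-ins under assumed signatures, not Syne Tune's actual interfaces; the real ones live in models and in tuning_algorithms.base_classes.

```python
# Hypothetical, simplified mirror of the Estimator / Predictor split described
# above. Names and signatures are illustrative only, not Syne Tune's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class TuningJobStateSketch:
    """Stand-in for TuningJobState: encoded configs plus observed metrics."""
    features: np.ndarray  # shape (n, d), encoded configurations
    targets: np.ndarray   # shape (n,), observed metric values


class PredictorSketch:
    """Returned by the estimator; provides posterior predictions."""

    def __init__(self, mean: float, std: float):
        self._mean, self._std = mean, std

    def predict(self, inputs: np.ndarray):
        # Trivial constant posterior, just to show the call pattern: real
        # predictors return per-input posterior means and standard deviations.
        n = inputs.shape[0]
        return np.full(n, self._mean), np.full(n, self._std)


class EstimatorSketch:
    """(Re)fits model parameters from the state and returns a predictor."""

    def fit_from_state(self, state: TuningJobStateSketch) -> PredictorSketch:
        return PredictorSketch(
            mean=float(state.targets.mean()),
            std=float(state.targets.std() + 1e-9),
        )


if __name__ == "__main__":
    state = TuningJobStateSketch(
        features=np.random.rand(5, 3), targets=np.random.rand(5)
    )
    predictor = EstimatorSketch().fit_from_state(state)
    means, stds = predictor.predict(np.random.rand(2, 3))
    print(means, stds)
```

An acquisition function would consume the predictor's means and standard deviations to score candidate configurations; the role of ModelStateTransformer is to keep the experiment state up to date and trigger refits via the estimator.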
Note
The most low-level code for Gaussian process based Bayesian optimization is contained in gpautograd, which is specific to autograd and L-BFGS optimization. Unless you want to implement a new kernel function, you probably do not have to extend this code. As we will see, most extensions of interest can be done in models (new surrogate model, new acquisition function) or in tuning_algorithms (different BO workflow).
A Walk Through Bayesian Optimization
The key primitive of BO is to suggest a next configuration to evaluate the
unknown target function at (e.g., the validation error after training a
machine learning model with a hyperparameter configuration), based on all
data gathered about this function in the past. This primitive is triggered in
the get_config()
method of a BO searcher. It consists of two main steps:
- Estimate surrogate model(s), given all data obtained so far. Often, a single surrogate model represents the target metric of interest, but in generalized setups such as multi-fidelity, constrained, or multi-objective BO, surrogate models may be fit to several metrics. A surrogate model provides predictive distributions for the metric it represents, at any configuration, which allows BO to explore the space of configurations not yet sampled. For most built-in GP based surrogate models, estimation is done by maximizing the log marginal likelihood, as we see in more detail below; a self-contained sketch of this step follows the list.
- Use probabilistic predictions of surrogate models to search for the best next configuration to sample at. This is done in BayesianOptimizationAlgorithm, and is the main focus here.
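To make the first step concrete, here is a self-contained sketch, in plain NumPy/SciPy rather than Syne Tune code, of fitting GP hyperparameters by maximizing the log marginal likelihood with L-BFGS. The kernel choice, parameterization, and function names are assumptions for illustration; in Syne Tune this role is played by the gpautograd module.

```python
# Illustrative GP hyperparameter fitting by maximizing the log marginal
# likelihood (equivalently, minimizing its negation). Not Syne Tune code.
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(x1, x2, lengthscale, variance):
    # Squared exponential kernel between rows of x1 and x2
    sqdist = (
        np.sum(x1 ** 2, axis=1)[:, None]
        + np.sum(x2 ** 2, axis=1)[None, :]
        - 2.0 * x1 @ x2.T
    )
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)


def neg_log_marginal_likelihood(log_params, x, y):
    lengthscale, variance, noise = np.exp(log_params)
    k = rbf_kernel(x, x, lengthscale, variance) + noise * np.eye(len(x))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    # Negative log p(y | X, params), dropping the constant n/2 * log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(chol)))


def fit_gp_hyperparameters(x, y):
    # L-BFGS-B on the negative log marginal likelihood, in log space so that
    # lengthscale, signal variance, and noise variance stay positive
    result = minimize(
        neg_log_marginal_likelihood,
        x0=np.zeros(3),
        args=(x, y),
        method="L-BFGS-B",
    )
    return dict(zip(["lengthscale", "variance", "noise"], np.exp(result.x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(size=(20, 2))
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(20)
    print(fit_gp_hyperparameters(x, y))
```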
BayesianOptimizationAlgorithm can suggest a batch of num_requested_candidates > 1 configurations. If greedy_batch_selection == True, this is done greedily, one configuration at a time, yet diversity is maintained by inserting already suggested configurations as pending into the state. If greedy_batch_selection == False, we simply return the num_requested_candidates top-scoring configurations.
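Schematically, and glossing over how scores are computed, the two batch modes can be contrasted as follows. suggest_batch and score are hypothetical helpers, with score standing in for acquisition scoring conditioned on pending configurations; this is not Syne Tune's actual implementation.

```python
# Sketch of greedy batch selection with pending configurations versus simply
# returning the top-scoring candidates. Hypothetical helpers, not Syne Tune code.
from typing import Callable, Dict, List

Config = Dict[str, float]


def suggest_batch(
    candidates: List[Config],
    score: Callable[[Config, List[Config]], float],  # lower is better
    num_requested_candidates: int,
    greedy_batch_selection: bool,
) -> List[Config]:
    if greedy_batch_selection:
        # Greedy mode: pick one configuration at a time; configurations
        # already chosen are treated as pending, which changes the scores
        # and thereby encourages diversity within the batch.
        pending: List[Config] = []
        remaining = list(candidates)
        for _ in range(min(num_requested_candidates, len(remaining))):
            best = min(remaining, key=lambda c: score(c, pending))
            pending.append(best)
            remaining.remove(best)
        return pending
    # Non-greedy mode: rank all candidates once and return the top scorers
    ranked = sorted(candidates, key=lambda c: score(c, []))
    return ranked[:num_requested_candidates]
```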
For simplicity, we focus on num_requested_candidates == 1, so that a single configuration is suggested. This happens in several steps:
- First, a list of num_initial_candidates initial configurations is drawn at random from initial_candidates_generator of type CandidateGenerator.
- Next, these configurations are scored using initial_candidates_scorer of type ScoringFunction. This is a parent class of AcquisitionFunction, but acquisition functions support gradient computation as well. The scoring function typically depends on a predictor obtained from a surrogate model.
- Finally, local optimization of an acquisition function is run, using an instance of LocalOptimizer, which depends on an acquisition function and one or more predictors. Local optimization is initialized with the top-scoring configuration from the previous step. If it fails or does not result in a configuration with a better acquisition value, this initial configuration is returned. The final local optimization can be skipped by passing an instance of NoOptimization.
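Putting the three steps together, the end-to-end flow looks roughly like the sketch below. sample_configuration, acquisition, and local_search are hypothetical callables standing in for the candidate generator, the scoring/acquisition function, and the local optimizer; the actual logic lives in BayesianOptimizationAlgorithm.

```python
# Schematic version of the three steps above; not Syne Tune's actual code.
from typing import Callable


def suggest_one(
    sample_configuration: Callable[[], dict],
    acquisition: Callable[[dict], float],   # lower is better
    local_search: Callable[[dict], dict],
    num_initial_candidates: int = 50,
) -> dict:
    # Step 1: draw initial configurations at random
    candidates = [sample_configuration() for _ in range(num_initial_candidates)]
    # Step 2: score all candidates and keep the best one
    best = min(candidates, key=acquisition)
    # Step 3: local optimization, starting from the top-scoring candidate;
    # fall back to that candidate if the optimizer fails or does not improve
    try:
        refined = local_search(best)
    except Exception:
        return best
    return refined if acquisition(refined) < acquisition(best) else best
```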
This workflow offers a number of opportunities for customization:
- The initial_candidates_generator by default draws configurations at random with replacement (checking for duplicates is expensive and does not add value). It could be replaced by pseudo-random sampling with better coverage properties, or by Latin hypercube designs; a hypothetical example follows this list.
- The initial_candidates_scorer is often the same as the acquisition function used in the final local optimization. Other acquisition strategies, such as (independent) Thompson sampling, can be implemented here.
- You may want to customize the acquisition function feeding into local optimization (and initial scoring); more details are provided below.
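For instance, a quasi-random generator with better coverage than i.i.d. sampling could look roughly like the following. This is a hypothetical stand-alone function handling only numeric box bounds; it does not implement Syne Tune's CandidateGenerator interface.

```python
# Sketch of a quasi-random initial candidate generator using a Sobol sequence.
# Hypothetical helper for illustration only; not part of Syne Tune.
from scipy.stats import qmc


def sobol_initial_candidates(num_candidates: int, bounds: dict) -> list:
    """Draw numeric configurations from a scrambled Sobol sequence."""
    names = list(bounds)
    lows = [bounds[name][0] for name in names]
    highs = [bounds[name][1] for name in names]
    sampler = qmc.Sobol(d=len(names), scramble=True)
    points = qmc.scale(sampler.random(num_candidates), lows, highs)
    return [dict(zip(names, point)) for point in points]


if __name__ == "__main__":
    candidates = sobol_initial_candidates(
        8, {"lr": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
    )
    print(candidates[:2])
```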