Implementing a Surrogate Model

In Bayesian optimization (BO), a surrogate model represents the data observed from a target metric so far, and its probabilistic predictions at new configurations (typically involving both predictive mean and variance) guides the search for a most informative next acquisition. In this section, we will show how surrogate models are implemented in Syne Tune, and give an example of how a novel model can be added.

Recall from above that Syne Tune offers surrogate model from two broad classes: Gaussian process based models and scikit-learn estimator based models. Both are implemented in terms of the same abstractions Estimator, and Predictor. We will first walk through GP based surrogate models, then dive into an example of how to implement a new scikit-learn estimator based model. More details about how to extend GP based models are provided further below.

Example

Before diving into details, let us look at a simple example for how to implement a new surrogate model in Syne Tune, of the scikit-learn estimator based type. It does not come with some of the complexities of Gaussian process based surrogate models, to be discussed below:

  • Fantasizing is not supported

  • MCMC (or ensemble predictions) is not supported

  • Gradient-based optimization of an acquisition function is not supported, in that Bayesian optimization is scoring a finite number of candidates drawn at random, selecting the best

The full example code is given here. We implement subclasses of SKLearnPredictor and SKLearnEstimator. These are wrapped by SKLearnPredictorWrapper and SKLearnEstimatorWrapper.

examples/launch_sklearn_surrogate_bo.py
from syne_tune.optimizer.schedulers.searchers.bayesopt.sklearn import (
    SKLearnEstimator,
    SKLearnPredictor,
)


class BayesianRidgePredictor(SKLearnPredictor):
    """
    Predictor for surrogate model given by ``sklearn.linear_model.BayesianRidge``.
    """

    def __init__(self, ridge: BayesianRidge):
        self.ridge = ridge

    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        return self.ridge.predict(X, return_std=True)


class BayesianRidgeEstimator(SKLearnEstimator):
    """
    Estimator for surrogate model given by ``sklearn.linear_model.BayesianRidge``.

    None of the parameters of ``BayesianRidge`` are exposed here, so they are all
    fixed up front.
    """

    def __init__(self, *args, **kwargs):
        self.ridge = BayesianRidge(*args, **kwargs)

    def fit(
        self, X: np.ndarray, y: np.ndarray, update_params: bool
    ) -> SKLearnPredictor:
        self.ridge.fit(X, y.ravel())
        return BayesianRidgePredictor(ridge=copy.deepcopy(self.ridge))


  • The BayesianRidgeEstimator is wrapping the scikit-learn estimator sklearn.linear_model.BayesianRidge, which implements a form of Bayesian regression estimation. While this method has hyperparameters, they are automatically set in fit, so we do not need to make them explicit. The result of fit is a BayesianRidgePredictor instance which wraps a copy of the fitted scikit-learn estimator.

  • In BayesianRidgePredictor, the predict methods calls the equivalent of the scikit-learn estimator with return_std=True, so that both predictive means and stddevs are returned.

The remaining launcher script is much the same as other examples, except that FIFOScheduler is used with a particular searcher:

examples/launch_sklearn_surrogate_bo.py
    searcher = SKLearnSurrogateSearcher(
        config_space=config_space,
        metric=METRIC_ATTR,
        estimator=BayesianRidgeEstimator(),
        scoring_class=EIAcquisitionFunction,
    )

The Predictor Class

Scikit-learn based estimators are typically rather simple and based on deterministic machine learning methods. Bayesian optimization is usually run with Bayesian models, where proper quantification of uncertainty is center-stage, and supporting these is a little more difficult.

In Bayesian statistics, (surrogate) models are conditioned on data in order to obtain a posterior distribution, represented by a posterior state. Given this state, probabilistic predictions can be done at arbitrary input points. This is done by objects of type Predictor, whose methods deal with predictions on new configurations.

Note

Before moving on, it is important to understand the difference between conditioning a probabilistic model on data in order to obtain a posterior distribution, with which probabilistic predictions (i.e., mean and variance) can be computed at input points, and learning (or fitting) the (hyper)parameters of the model. For a Bayesian surrogate model, the latter involves Markov Chain Monte Carlo or marginal likelihood optimization, which requires conditioning on data several times. For non-Bayesian models, parameters are often fit by cross-validation.

At this point, there are a number of relevant concepts:

  • A model can be “fitted” by Markov Chain Monte Carlo (MCMC), in which case its predictive distribution is an ensemble. This is why prediction methods returns lists. In the default case (single model, no MCMC), these lists are of size one.

  • A model may support fantasizing in order to properly deal with pending configurations in the current state (see also register_pending in the discussion here). At least in the Gaussian process surrogate model case, fantasizing is done by drawing nf samples of target values for the pending configurations, then average predictions over this sample. The Gaussian predictive distributions in this average share the same variance, but have different means. A surrogate model which does not support fantasizing, can ignore this extra complexity.

Take the example of a basic Gaussian process surrogate model, which is behind BayesianOptimization. The predictor is GaussProcPredictor. This class can serve models fit by marginal likelihood optimization (empirical Bayes) or MCMC, but let us focus on the former. Predictions in this model are based on a posterior state, which maintains a representation of the Gaussian posterior distribution needed for probabilistic predictions. Say we would like do a prediction at some configuration \(\mathbf{c}\). First, this configuration is mapped to an (encoded) input vector \(\mathbf{x}\). Next, predictive distributions are computed, using the posterior state:

\[P(y | \mathbf{x}) = \left[ \mathcal{N}(y | \mu_j(\mathbf{x}), \sigma^2(\mathbf{x})) \right],\quad j=1,\dots, \mathrm{nf}.\]

Here, nf denotes the number of fantasy samples (nf=1 if fantasizing is not supported). This is served by methods of Predictor:

  • hp_ranges_for_prediction: Returns instance of HyperparameterRanges which is used to map a configuration \(\mathbf{c}\) to an encoded vector \(\mathbf{x}\).

  • predict: Given a matrix \(\mathbf{X}\) of input vectors (these are the rows \(\mathbf{x}_i\)), return a list of dictionaries. In our non-MCMC example, this list has length 1. The dictionary contains statistics of the predictive distribution. In our example, this would be predictive means (key “mean”) and predictive standard deviations (key “std”). More precisely, the entry for “mean” would be a matrix \([\mu_j(\mathbf{x}_i)]_{i,j}\) of shape (n, nf), where n is the number of input vectors, and the entry for “std” would be a vector \([\sigma(\mathbf{x}_i)]_i\) of shape (n,). If the surrogate model does not support fantasizing, the entry for “mean” is also a vector of shape (n,).

  • predict_candidates: Version of predict, where the input is a list of configurations \([\mathbf{c}_j]\), which are first mapped to rows of the matrix \(\mathbf{X}\) by using hp_ranges_for_prediction.

  • keys_predict: Keys of dictionaries returned by predict. If a surrogate model is to be used with a standard acquisition function, such as expected improvement, it needs to return at least means (“mean”) and standard deviations (“std”). However, in other contexts, a surrogate model may be deterministic, in which case only means (“mean”) are returned. This method allows an acquisition function to check whether it can work with surrogate models passed to it.

  • backward_gradient: This method is needed in order to support local gradient-based optimization of an acquisition function, as discussed here. It is detailed below.

  • current_best: A number of acquisition functions depend on the incumbent, which is a smooth approximation to the best target value observed so far. Typically, this is implemented as \(\mathrm{min}(\mu_j(\mathbf{x}_i))\) over all inputs \(\mathbf{x}_i\) already sampled for previous trials. As with predict, this returns a list of vectors of shape (nf,), catering for fantasizing. If fantasizing is not supported, this is a list of scalars, and the list size is 1 for non-MCMC.

Note

In fact, GaussProcPredictor inherits from BasePredictor, which extends the base interface by some helper code to implement the current_best method.

Supporting Local Gradient-based Optimization

As discussed above, BO in Syne Tune supports local gradient-based optimization of an acquisition function. This needs to be supported by an implementation of Predictor, in terms of the backward_gradient method.

In the most basic case, an acquisition function \(\alpha(\mathbf{x})\) has the following structure:

\[\alpha(\mathbf{x}) = \alpha(\mu(\mathbf{x}), \sigma(\mathbf{x})).\]

We ignore fantasizing here, otherwise \(\mu(\mathbf{x})\) becomes a vector. For gradient-based optimization, we need derivatives

\[\frac{\partial\alpha}{\partial\mathbf{x}} = \frac{\partial\alpha}{\partial\mu} \frac{\partial\mu}{\partial\mathbf{x}} + \frac{\partial\alpha}{\partial\sigma} \frac{\partial\sigma}{\partial\mathbf{x}}.\]

The backward_gradient method takes arguments \(\mathbf{x}\) (input) and a dictionary mapping “mean” to \(\partial\alpha/\partial\mu\) at \(\mu = \mu(\mathbf{x})\), “std” to \(\partial\alpha/\partial\sigma\) at \(\sigma = \sigma(\mathbf{x})\) (head_gradients), and returns the gradient \(\partial\alpha/\partial\mathbf{x}\).

Readers familiar with deep learning frameworks like PyTorch may wonder why we don’t just combine surrogate model and acquisition function into forming \(\alpha(\mathbf{x})\), and compute its gradient by reverse mode differentiation. However, this would strongly couple the two concepts, in that they would have to be implemented in the same auto-differentiation system. Instead, backward_gradient decouples the gradient computation into head gradients for the acquisition function, which (as we will see) can be implemented in native NumPy, and backward_gradient for the surrogate model itself. For Syne Tune’s Gaussian process surrogate models, the latter is implemented using autograd. If the predict method is implemented using this framework, gradients are obtained automatically as usual.

ModelStateTransformer and Estimator

An instance of Predictor represents the posterior distribution of a model conditioned on observed data. Where does this conditioning take place? Note that while machine learning APIs like scikit-learn couple fitting and prediction in a single API, these two are decoupled in Syne Tune by design:

  • Estimator: The most important method is fit_from_state(). It computes the posterior state by conditioning on observed data, which are sufficient statistics required for probabilistic predictions. Moreover, if update_params=True, this final conditioning is preceded by fitting the (hyper)parameters of the model (this is more expensive, and if update_params=False, the current parameters are used without updating them).

  • Predictor: Wraps the posterior state computed by the Estimator, allows for predictions.

The fitting of surrogate models underlying a Bayesian optimization experiment happens in ModelStateTransformer, which interfaces between a model-based searcher and the surrogate model. The ModelStateTransformer maintains the state of the experiment, where all data about observations and pending configurations are collected. Its fit() method triggers fitting the surrogate models to the current data (this step can be skipped for computational savings) and computing their posterior states.

ModelStateTransformer hands down these tasks to an object of type Estimator, which is specific to the surrogate model being used. For our Gaussian process example, this would be GaussProcEmpiricalBayesEstimator. Here, parameters of the Gaussian process models (such as parameters of the covariance function) are fitted by marginal likelihood maximization, and the GP posterior state is computed.

Note

To summarize, if your surrogate model needs to be fit to data, you need to implement a subclass of Estimator, whose fit_from_state method takes in data in form of a TuningJobState and returns a Predictor. You can use transform_state_to_data() in order to convert the TuningJobState object into the usual pair of feature matrix features and target vector targets, along with normalization of targets.