Benchmarking with Simulator Backend

The fastest and cheapest way to compare a number of different HPO methods, or variants thereof, is benchmarking with the simulator backend. In this case, all training evaluations are simulated by querying metric and time values from a tabulated blackbox or a surrogate model. Not only are expensive computations on GPUs avoided, but the experiment also runs faster than real time. In some cases, results for experiments with max_wallclock_time of several hours can be obtained in a few seconds.


In order to use surrogate benchmarks and the simulator backend, you need to have the blackbox-repository dependencies installed, as detailed here. For the YAHPO blackbox, you also need the yahpo dependencies. Note that the first time you use a surrogate benchmark, its data files are downloaded and stored to your S3 bucket, which can take a considerable amount of time. The next time you use the benchmark, it is loaded from your local disk or your S3 bucket, which is fast.


The experimentation framework in syne_tune.experiments, which is used here, is not limited to benchmarking (i.e., comparing the performance of different HPO methods), but is also the default way to run many experiments in parallel, say with different configuration spaces. This is explained in more detail in this tutorial.

Defining the Experiment

As usual in Syne Tune, the experiment is defined by a number of scripts. We will look at an example in benchmarking/examples/benchmark_hypertune/. Common code used in these benchmarks can be found in syne_tune.experiments.

Let us look at the scripts in order, and how you can adapt them to your needs:

  • benchmarking/examples/benchmark_hypertune/ Defines the HPO methods to take part in the experiment, in the form of a dictionary methods which maps method names to factory functions, which in turn map MethodArguments to scheduler objects. The MethodArguments class contains the union of attributes needed to configure schedulers. In particular, scheduler_kwargs contains constructor arguments. For your convenience, the mapping from MethodArguments to scheduler is defined for most baseline methods in syne_tune.experiments.default_baselines (as noted just below, this mapping involves merging argument dictionaries), but you can override arguments as well (for example, type in the examples here). Note that if you would like to compare different variants of a method, you need to create different entries in methods. For example, Methods.MOBSTER_JOINT and Methods.MOBSTER_INDEP are different variants of MOBSTER.

  • benchmarking/examples/benchmark_hypertune/ Defines the benchmarks to be considered in this experiment, in the form of a dictionary benchmark_definitions with values of type SurrogateBenchmarkDefinition. In general, you will just pick definitions from syne_tune.experiments.benchmark_definitions, unless you are using your own surrogate benchmark not contained in Syne Tune. But you can also modify parameters, for example surrogate and surrogate_kwargs in order to select a different surrogate model, or you can change the defaults for n_workers or max_wallclock_time.

  • benchmarking/examples/benchmark_hypertune/ Script for launching experiments locally. All you typically need to do here is to import syne_tune.experiments.launchers.hpo_main_simulator and (optionally) to add additional command line arguments you would like to parameterize your experiment with. In our example here, we add two options, num_brackets which configures Hyperband schedulers, and num_samples which configures the Hyper-Tune methods only. Apart from extra_args, you also need to define map_method_args, which modifies method_kwargs (the arguments of MethodArguments) based on the extra arguments. Details for map_method_args are given just below. Finally, main() is called with your methods and benchmark_definitions dictionaries, and (optionally) with extra_args and map_method_args. We will see shortly how the launcher is called, and what happens inside.

  • benchmarking/examples/benchmark_hypertune/ Script for launching experiments remotely, in that each experiment runs as its own SageMaker training job, in parallel with other experiments. You need to import syne_tune.experiments.launchers.launch_remote_simulator and pass the same methods, benchmark_definitions, extra_args as in benchmarking.examples.benchmark_hypertune.hpo_main. Moreover, you need to specify paths for source dependencies. If you installed Syne Tune from sources, it is easiest to specify source_dependencies=benchmarking.__path__, as this allows access to all benchmarks and examples included there. On top of that, you can pass an indicator function is_expensive_method to tag the HPO methods which are themselves expensive to run. As detailed below, our script runs different seeds (repetitions) in parallel for expensive methods, but sequentially for cheap ones. We will see shortly how the launcher is called, and what happens inside.

  • benchmarking/examples/benchmark_hypertune/requirements.txt: Dependencies needed to run as a SageMaker training job when launching experiments remotely. In particular, this needs the dependencies of Syne Tune itself. A safe bet here is syne-tune[extra] and tqdm (which is the default if requirements.txt is missing). However, you can decrease startup time by narrowing down the dependencies to those you really need (see FAQ). In our example here, we need gpsearchers and kde for methods. For simulated experiments, you always need to have blackbox-repository here. In order to use YAHPO benchmarks, also add yahpo.

Specifying Extra Arguments

In many cases, you will want to run different methods using their default arguments, or only change them as part of the method definition. But sometimes, it can be useful to be able to set options via extra command line arguments. This can be done via extra_args and map_method_args, which are typically used to configure scheduler arguments for certain methods. But in principle, any argument of MethodArguments can be modified. Here, extra_args simply extends the arguments of the command line parser, where the name field contains the name of the option without any leading "-".
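To make the role of the name field concrete, here is a minimal sketch of how entries of extra_args could be registered with a command line parser. It uses only the standard argparse module. The helper build_parser, the help strings, and the default value 50 are illustrative assumptions, not the launcher's actual code.

```python
import argparse

# Hypothetical entries mirroring the two options described in this tutorial.
# Each "name" is given without any leading "-".
extra_args = [
    dict(name="num_brackets", type=int, help="Number of brackets"),
    dict(name="num_samples", type=int, default=50, help="Number of samples"),
]


def build_parser(extra_args):
    """Illustrative helper (assumption): register each entry as a '--name' option."""
    parser = argparse.ArgumentParser()
    for entry in extra_args:
        kwargs = dict(entry)  # copy, so the caller's dictionary is not modified
        name = kwargs.pop("name")
        assert not name.startswith("-"), "name must not contain a leading '-'"
        parser.add_argument(f"--{name}", **kwargs)
    return parser


args = build_parser(extra_args).parse_args(["--num_brackets", "3"])
```

In this sketch, options which are not given on the command line fall back to their default (None if no default is specified), which is what map_method_args can test for.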

map_method_args has the signature

method_kwargs = map_method_args(args, method, method_kwargs)

Here, method_kwargs are arguments of MethodArguments, which can be modified by map_method_args (the modified dictionary is returned). args is the result of command line parsing, and method is the name of the method to be constructed based on these arguments. The latter argument allows map_method_args to depend on the method. In our example benchmarking/examples/benchmark_hypertune/, num_brackets applies to all methods, while num_samples only applies to the variants of Hyper-Tune. Both arguments modify the dictionary scheduler_kwargs in MethodArguments, which contains constructor arguments for the scheduler.

Note the use of recursive_merge. This means that the changes done in map_method_args are recursively merged into the prior method_kwargs. In our example, we may already have method_kwargs.scheduler_kwargs or even method_kwargs.scheduler_kwargs.search_options. While the new settings here take precedence, prior content of method_kwargs not affected remains in place. In the same way, extra arguments passed to baseline wrappers in syne_tune.experiments.default_baselines are recursively merged into the arguments determined by the default logic.
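The following sketch illustrates the merging behaviour described above. To stay self-contained, it defines its own simplified stand-in recursive_merge_sketch instead of importing the Syne Tune utility, and the scheduler argument names "brackets" and "hypertune_distribution_num_samples" are assumptions made for illustration.

```python
from argparse import Namespace
from typing import Any, Dict


def recursive_merge_sketch(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified stand-in: merge ``b`` into a copy of ``a``; ``b`` wins on conflicts."""
    result = dict(a)  # shallow copy at this level, ``a`` itself is not modified
    for key, value in b.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = recursive_merge_sketch(result[key], value)
        else:
            result[key] = value
    return result


def map_method_args(args, method: str, method_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    scheduler_kwargs = {}
    if method.startswith("HYPERTUNE"):
        # num_samples only applies to the Hyper-Tune variants
        # (the option name used here is an assumption)
        scheduler_kwargs["search_options"] = {
            "hypertune_distribution_num_samples": args.num_samples
        }
    if args.num_brackets is not None:
        # num_brackets applies to all methods (argument name is an assumption)
        scheduler_kwargs["brackets"] = args.num_brackets
    if scheduler_kwargs:
        method_kwargs = recursive_merge_sketch(
            method_kwargs, {"scheduler_kwargs": scheduler_kwargs}
        )
    return method_kwargs


# Prior content of method_kwargs which is not affected remains in place
prior = {"scheduler_kwargs": {"search_options": {"debug_log": False}, "grace_period": 1}}
merged = map_method_args(
    Namespace(num_brackets=3, num_samples=50), "HYPERTUNE-INDEP", prior
)
```

Here, merged contains the new brackets and search_options entries, while grace_period and debug_log from the prior dictionary are preserved.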


map_method_args is applied to rewrite method_kwargs just before the method is created. This means that all entries of MethodArguments can be modified from their default values. You can also use map_method_args independent of extra_args (however, if extra_args is given, then map_method_args must be given as well).

Writing Extra Results

By default, Syne Tune writes result files, including metadata.json and tuner.dill, for every experiment, see here. The results dataframe contains all data reported by training jobs, along with time stamps. The contents of this dataframe can be customized by adding extra columns to it. This is done by passing extra_results_composer of type ExtraResultsComposer when creating the StoreResultsCallback callback, which is passed in callbacks to Tuner. You can use this mechanism by passing an ExtraResultsComposer object as extra_results to main. This object extracts extra information and returns it as a dictionary, which is appended to the results dataframe. A complete example is benchmarking/examples/benchmark_dyhpo.

Launching Experiments Locally

Here is an example of how simulated experiments are launched locally (if you installed Syne Tune from source, you need to start the script from the benchmarking/examples directory):

python benchmark_hypertune/ \
  --experiment_tag tutorial-simulated --benchmark nas201-cifar100 \
  --method ASHA --num_seeds 10

This call runs a number of experiments sequentially on the local machine:

  • experiment_tag: Results of experiments are written to ~/syne-tune/{experiment_tag}/*/{experiment_tag}-*/. This name should conform to S3 conventions (alphanumerical and -; no underscores).

  • benchmark: Selects benchmark from keys of benchmark_definitions. If this is not given, experiments for all keys in benchmark_definitions are run in sequence.

  • method: Selects HPO method to run from keys of methods. If this is not given, experiments for all keys in methods are run in sequence.

  • num_seeds: Each experiment is run num_seeds times with different seeds (0, ..., num_seeds - 1). Due to random factors both in training and tuning, a robust comparison of HPO methods requires such repetitions. Fortunately, these are cheap to obtain in the simulation context. Another parameter is start_seed (default: 0), giving seeds start_seed, ..., num_seeds - 1. For example, --start_seed 5 --num_seeds 6 runs for a single seed equal to 5. The dependence of random choices on the seed is detailed below.

  • max_wallclock_time, n_workers: These arguments overwrite the defaults specified in the benchmark definitions.

  • max_size_data_for_model: Parameter for Bayesian optimization, MOBSTER or Hyper-Tune, see here and here.

  • scale_max_wallclock_time: If 1, and if n_workers is given as argument, but not max_wallclock_time, the benchmark default benchmark.max_wallclock_time is multiplied by B / min(A, B), where A = n_workers, B = benchmark.n_workers. This means we run for longer if n_workers < benchmark.n_workers, but keep benchmark.max_wallclock_time the same otherwise.

  • use_long_tuner_name_prefix: If 1, results for an experiment are written to a directory whose prefix is f"{experiment_tag}-{benchmark_name}-{seed}", followed by a postfix containing date-time and a 3-digit hash. If 0, the prefix is experiment_tag only. The default is 1 (long prefix).

  • restrict_configurations: See below.

  • fcnet_ordinal: Applies to FCNet benchmarks only. The hyperparameter hp_init_lr has domain choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). Since the parameter is really ordinal, this is not a good choice. With this option, the domain can be switched to different variants of ordinal. The default is nn-log, which is the domain logordinal([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]) (this is also the replacement which streamline_config_space() would do). In order to keep the original categorical domain, use --fcnet_ordinal none.
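Two of the options above involve a little arithmetic: the range of seeds determined by start_seed and num_seeds, and the factor applied with scale_max_wallclock_time. The sketch below spells out both, following the descriptions in the list. The function names are hypothetical and not part of Syne Tune.

```python
def seed_range(num_seeds: int, start_seed: int = 0) -> list:
    """Seeds start_seed, ..., num_seeds - 1 (hypothetical helper)."""
    return list(range(start_seed, num_seeds))


def scaled_max_wallclock_time(
    default_max_wallclock_time: float, default_n_workers: int, n_workers: int
) -> float:
    """Multiply the benchmark default by B / min(A, B), where A = n_workers and
    B = benchmark.n_workers (hypothetical helper)."""
    a, b = n_workers, default_n_workers
    return default_max_wallclock_time * b / min(a, b)


# --start_seed 5 --num_seeds 6 runs a single seed equal to 5
single_seed = seed_range(num_seeds=6, start_seed=5)
# Fewer workers than the benchmark default: run for longer
longer = scaled_max_wallclock_time(3600, default_n_workers=4, n_workers=2)
# More workers than the benchmark default: time stays the same
same = scaled_max_wallclock_time(3600, default_n_workers=4, n_workers=8)
```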

If you defined additional arguments via extra_args, you can use them here as well. For example, --num_brackets 3 would run all multi-fidelity methods with 3 brackets (instead of the default 1).

Launching Experiments Remotely

There are some drawbacks of launching experiments locally. First, they block the machine you launch from. Second, different experiments are run sequentially, not in parallel. Remote launching has exactly the same parameters as launching locally, but experiments are sliced along certain axes and run in parallel, using a number of SageMaker training jobs. Here is an example (if you installed Syne Tune from source, you need to start the script from the benchmarking/examples directory):

python benchmark_hypertune/ \
  --experiment_tag tutorial-simulated --benchmark nas201-cifar100 \
  --num_seeds 10

Since --method is not used, we run experiments for all methods. Also, we run experiments for 10 seeds. There are 7 methods, so the total number of experiments is 70 (note that we select a single benchmark here). Running this command will launch 43 SageMaker training jobs, which do the work in parallel. Namely, for methods ASHA, SYNCHB, BOHB, all 10 seeds are run sequentially in a single SageMaker job, since our is_expensive_method function returns False for them. Simulating experiments is so fast for these methods that it is best to run seeds sequentially. However, for MOBSTER-JOINT, MOBSTER-INDEP, HYPERTUNE-INDEP, HYPERTUNE-JOINT, our is_expensive_method returns True, and we use one SageMaker training job for each seed, giving rise to 4 * 10 = 40 jobs running in parallel. For these methods, the simulation time is quite a bit longer, because decision making takes more time (these methods fit Gaussian process surrogate models to data and optimize acquisition functions). Results are written to ~/syne-tune/{experiment_tag}/ASHA/ for the cheap method ASHA, and to ~/syne-tune/{experiment_tag}/MOBSTER-INDEP-3/ for the expensive method MOBSTER-INDEP and seed 3.
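The job count can be checked with a short sketch. It assumes, as described above, that cheap methods get one job covering all seeds, while expensive methods get one job per seed. The function plan_jobs is a hypothetical illustration, not the launcher's implementation.

```python
EXPENSIVE_METHODS = {
    "MOBSTER-JOINT",
    "MOBSTER-INDEP",
    "HYPERTUNE-INDEP",
    "HYPERTUNE-JOINT",
}
METHOD_NAMES = ["ASHA", "SYNCHB", "BOHB"] + sorted(EXPENSIVE_METHODS)


def is_expensive_method(method: str) -> bool:
    return method in EXPENSIVE_METHODS


def plan_jobs(method_names, num_seeds, is_expensive_method, start_seed=0):
    """Hypothetical helper: list of (method, seed) jobs, where seed=None means
    that all seeds are run sequentially inside a single job."""
    jobs = []
    for method in method_names:
        if is_expensive_method(method):
            jobs.extend((method, seed) for seed in range(start_seed, num_seeds))
        else:
            jobs.append((method, None))
    return jobs


jobs = plan_jobs(METHOD_NAMES, num_seeds=10, is_expensive_method=is_expensive_method)
num_jobs = len(jobs)  # 3 cheap jobs + 4 * 10 expensive jobs
```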

The command above selected a single benchmark nas201-cifar100. If --benchmark is not given, we iterate over all benchmarks in benchmark_definitions. This is done sequentially, which works fine for a limited number of benchmarks.

However, you may want to run experiments on a large number of benchmarks, and to this end also parallelize along the benchmark axis. To do so, you can pass a nested dictionary as benchmark_definitions. For example, we could use the following:

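from syne_tune.experiments.benchmark_definitions import (
    nas201_benchmark_definitions,
    fcnet_benchmark_definitions,
    lcbench_selected_benchmark_definitions,
)

# Nested dictionary: the outer keys define the axis along which to parallelize
benchmark_definitions = {
    "nas201": nas201_benchmark_definitions,
    "fcnet": fcnet_benchmark_definitions,
    "lcbench": lcbench_selected_benchmark_definitions,
}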
In this case, experiments are sliced along the axis ("nas201", "fcnet", "lcbench") to be run in parallel in different SageMaker training jobs.

Dealing with ResourceLimitExceeded Errors

When launching many experiments in parallel, you may run into your AWS resource limits, so that no more SageMaker training jobs can be run. The default behaviour in this case is to wait for 10 minutes and try again. You can influence this by --estimator_fit_backoff_wait_time <wait_time>, where <wait_time> is the waiting time between attempts in seconds. If this is 0 or negative, the script terminates with an error once your resource limits are reached.
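The waiting behaviour can be pictured as a simple retry loop. The sketch below is a generic illustration under stated assumptions, not the code used by the launcher: the exception class ResourceLimitExceededSketch and the function launch_with_backoff are hypothetical, and the sleep function is injectable so that the logic can be tried without actually waiting.

```python
import time


class ResourceLimitExceededSketch(Exception):
    """Hypothetical stand-in for the error raised when resource limits are hit."""


def launch_with_backoff(start_job, backoff_wait_time=600, sleep=time.sleep):
    """Try to start a job. On a resource limit error, wait and try again.

    If ``backoff_wait_time`` is 0 or negative, the error is raised immediately.
    The default of 600 seconds corresponds to waiting for 10 minutes.
    """
    while True:
        try:
            return start_job()
        except ResourceLimitExceededSketch:
            if backoff_wait_time <= 0:
                raise
            sleep(backoff_wait_time)
```

A larger waiting time reduces the number of unsuccessful attempts, while a smaller one lets jobs start sooner after resources become available.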

Pitfalls of Experiments from Tabulated Blackboxes

Comparing HPO methods on tabulated benchmarks, using simulation, has obvious benefits. Costs are very low. Moreover, results are often obtained many times faster than real time. However, we recommend that you do not rely on this kind of benchmarking alone. Here are some pitfalls:

  • Tabulated benchmarks are often of limited complexity, because more complex benchmarks cannot be sampled exhaustively.

  • Tabulated benchmarks do not reflect the stochasticity of real benchmarks (e.g., random weight initialization, random ordering of mini-batches).

  • While tabulated benchmarks like nas201 or fcnet are evaluated exhaustively or on a fine grid, other benchmarks (like lcbench) contain observations only at a set of randomly chosen configurations, while their configuration space is much larger or even infinite. For such benchmarks, you can either restrict the scheduler to suggest configurations only from the set supported by the benchmark (see subsection just below), or you can use a surrogate model which interpolates observations from those contained in the benchmark to all others in the configuration space. Unfortunately, the choice of surrogate model can strongly affect the benchmark, for the same underlying data. As a general recommendation, you should be careful with surrogate benchmarks which offer a large configuration space, but are based on only medium amounts of real data.

Restricting Scheduler to Configurations of Tabulated Blackbox

For a tabulated benchmark like lcbench, most entries of the configuration space are not covered by data. For such a benchmark, you can either use a surrogate, which can be configured by the attributes surrogate, surrogate_kwargs, and add_surrogate_kwargs of SurrogateBenchmarkDefinition, or you can restrict the scheduler to only suggest configurations covered by data. The latter is done with the option --restrict_configurations 1. The advantage of doing so is that your comparison does not depend on the choice of surrogate, but only on the benchmark data itself. However, there are also some drawbacks:

  • This option is currently not supported for the following schedulers:

    • Grid Search

    • SyncBOHB

    • BOHB

    • DEHB

    • REA

    • KDE

    • PopulationBasedTraining

    • ZeroShotTransfer


    • MOASHA

  • Schedulers like Gaussian process based Bayesian optimization typically use local gradient-based optimization of the acquisition function. This is not possible with --restrict_configurations 1. Instead, they evaluate the acquisition function at a finite number num_init_candidates of points and pick the best one.

  • In general, you should avoid using surrogate benchmarks which offer a large configuration space, but are based on only medium amounts of real data. When using --restrict_configurations 1 with such a benchmark, your methods may perform better than they should, just because they sample the space nearly exhaustively.
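The replacement for gradient-based optimization mentioned above can be sketched as follows: draw at most num_init_candidates configurations from the finite set covered by data, score each with the acquisition function, and return the best one. The function suggest_from_finite_set and the toy acquisition function are hypothetical illustrations, and higher scores are assumed to be better.

```python
import random


def suggest_from_finite_set(allowed_configs, acquisition, num_init_candidates, rng):
    """Hypothetical helper: pick the best of a finite number of candidates.

    ``allowed_configs`` is the list of configurations covered by benchmark data,
    ``acquisition`` maps a configuration to a score (higher is better).
    """
    num_candidates = min(num_init_candidates, len(allowed_configs))
    candidates = rng.sample(allowed_configs, num_candidates)
    return max(candidates, key=acquisition)


allowed = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 1.0)]


def toy_acquisition(config):
    # Toy score for illustration only: prefers learning rates close to 0.01
    return -abs(config["lr"] - 0.01)


# With at least as many candidates as table entries, the whole set is scored
best = suggest_from_finite_set(
    allowed, toy_acquisition, num_init_candidates=10, rng=random.Random(0)
)
```

Note that if num_init_candidates approaches the size of the table, the space is searched nearly exhaustively, which is the effect the last item above warns about.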

In general, --restrict_configurations 1 is supported for schedulers which select the next configuration from a finite set. In contrast, methods like DEHB or BOHB (or Bayesian optimization with local acquisition function optimization) optimize over encoded vectors, then round the solution back to a configuration. In order to use a tabulated benchmark like lcbench with these methods, you need to specify a surrogate. Maybe the least intrusive surrogate is nearest neighbor. Here is the benchmark definition for lcbench:

def lcbench_benchmark(dataset_name: str, datasets=None) -> SurrogateBenchmarkDefinition:
    """
    The default is to use nearest neighbour regression with ``K=1``. If
    you use a more sophisticated surrogate, it is recommended to also
    define ``add_surrogate_kwargs``, for example:

    .. code-block:: python

       add_surrogate_kwargs={
           "predict_curves": True,
           "fit_differences": ["time"],
       },

    :param dataset_name: Value for ``dataset_name``
    :param datasets: Used for transfer learning
    :return: Definition of benchmark
    """
    return SurrogateBenchmarkDefinition(
        # ... (further arguments omitted in this excerpt)
        surrogate="KNeighborsRegressor",  # 1-nn surrogate
        surrogate_kwargs={"n_neighbors": 1},
        # ... (further arguments omitted in this excerpt)
    )

The 1-NN surrogate is selected by surrogate="KNeighborsRegressor" and setting the number of nearest neighbors to 1. For each configuration, the surrogate finds the nearest neighbor in the table (w.r.t. Euclidean distance between encoded vectors) and returns its metric values.
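What the 1-NN surrogate does can be illustrated without any library. The sketch below is a pure-Python stand-in, not the KNeighborsRegressor implementation: the table layout, the function name nearest_neighbor_predict, and the example values are assumptions chosen for illustration.

```python
import math


def nearest_neighbor_predict(table, query):
    """Return the metric values of the table entry closest to ``query``.

    ``table`` is a list of (encoded_vector, metrics) pairs. Distance is the
    Euclidean distance between encoded vectors.
    """
    _, metrics = min(table, key=lambda row: math.dist(row[0], query))
    return metrics


# Toy table with two configurations encoded as 2-dimensional vectors
table = [
    ((0.0, 0.0), {"val_accuracy": 0.70, "time": 10.0}),
    ((1.0, 1.0), {"val_accuracy": 0.90, "time": 25.0}),
]

prediction = nearest_neighbor_predict(table, (0.9, 0.8))
```

A configuration contained in the table is its own nearest neighbor, so the surrogate reproduces the tabulated values exactly for covered configurations.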

Selecting Benchmarks from benchmark_definitions

Each family of tabulated (or surrogate) blackboxes accessible to the benchmarking tooling discussed here is represented by a Python file in syne_tune.experiments.benchmark_definitions (the same directory also contains definitions for real benchmarks). For example:

Typically, a blackbox concerns a certain machine learning algorithm with a fixed configuration space. Many of them have been evaluated over a number of different datasets. Note that in YAHPO, a blackbox is called scenario, and a dataset is called instance, so that a scenario can have a certain number of instances. In our terminology, a tabulated benchmark is obtained by selecting a blackbox together with a dataset.

The files in syne_tune.experiments.benchmark_definitions typically contain:

  • Functions named *_benchmark, which map arguments (such as dataset_name) to the benchmark definition SurrogateBenchmarkDefinition, with * being the name of the blackbox (or scenario).

  • Dictionaries named *_benchmark_definitions with SurrogateBenchmarkDefinition values. If a blackbox has a lot of datasets, we also define a dictionary *_selected_benchmark_definitions, which selects benchmarks which are interesting (e.g., not all baselines achieving the same performance rapidly). In general, we recommend starting with these selected benchmarks.

The YAHPO Family

A rich source of blackbox surrogates in Syne Tune comes from YAHPO, which is also detailed in this paper. YAHPO contains a number of blackboxes (called scenarios), some of which cover a large number of datasets (called instances). All our definitions are in syne_tune.experiments.benchmark_definitions.yahpo. Further details can also be found in the import code syne_tune.blackbox_repository.conversion_scripts.scripts.yahpo_import. Here is an overview:

  • yahpo_nb301: NASBench301. Single scenario and instance.

  • yahpo_lcbench: LCBench. Same underlying data as our own LCBench, but different surrogate model.

  • yahpo_iaml: Family of blackboxes, parameterized by ML method (yahpo_iaml_methods) and target metric (yahpo_iaml_metrics). Each of these has 4 datasets (OpenML datasets).

  • yahpo_rbv2: Family of blackboxes, parameterized by ML method (yahpo_rbv2_methods) and target metric (yahpo_rbv2_metrics). Each of these comes with a large number of datasets (OpenML datasets). Note that compared to YAHPO Gym, we filtered out scenarios which are invalid (e.g., F1 score 0, AUC/F1 equal to 1). We also determined useful max_wallclock_time values (yahpo_rbv2_max_wallclock_time), and selected benchmarks which show interesting behaviour (yahpo_rbv2_selected_instances).


At present (YAHPO Gym v1.0), the yahpo_lcbench surrogate has been trained on invalid LCBench original data (namely, the values at the first and last fidelity have to be removed). As long as this is not fixed, we recommend using our built-in lcbench blackbox instead.


In YAHPO Gym, yahpo_iaml and yahpo_rbv2 have a fidelity attribute trainsize with values between 1/20 and 1, which is the fraction of the full dataset the method has been trained on. Our import script multiplies trainsize values by 20 and designates type randint(1, 20), since common Syne Tune multi-fidelity schedulers require resource_attr values to be positive integers. yahpo_rbv2 has a second fidelity attribute repl, whose value is constant at 10; this attribute is removed by our import script.
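The conversion of trainsize to an integer resource level can be sketched in a few lines. This is an illustration of the mapping described above, not the import script itself: the function name trainsize_to_resource and the use of rounding to absorb floating point error are assumptions.

```python
def trainsize_to_resource(trainsize: float, num_levels: int = 20) -> int:
    """Map a fraction in [1/num_levels, 1] to an integer in 1, ..., num_levels."""
    # Rounding absorbs floating point error in the product (assumption)
    resource = round(trainsize * num_levels)
    if not 1 <= resource <= num_levels:
        raise ValueError(f"trainsize = {trainsize} is outside the supported range")
    return resource


# Fractions 1/20, 2/20, ..., 20/20 map to resource levels 1, 2, ..., 20
levels = [trainsize_to_resource(k / 20) for k in range(1, 21)]
```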