Comparison of Methods
=====================

In this section, we present an empirical comparison of all methods discussed in
this tutorial. The methodology of our study is as follows:

* We use the NASBench-201 benchmark (CIFAR100 dataset)
* All methods are run with a ``max_wallclock_time`` limit of 6 hours (21600
  seconds). We plot the minimum validation error attained as a function of
  wallclock time (which, in our case, is simulated time)
* Results are aggregated over a number of repetitions. The number of
  repetitions is 50 for SYNCSH, SYNCHB, BOHB, DEHB, ASHA-STOP, ASHA-PROM,
  ASHA6-STOP and SYNCMOBSTER, while MOBSTER-JOINT, MOBSTER-INDEP,
  HYPERTUNE1-INDEP, HYPERTUNE4-INDEP and HYPERTUNE-JOINT are repeated 30 times.
  Figures plot the interquartile mean (IQM) as a bold line and a bootstrap 95%
  confidence interval for this estimator as dashed lines (the IQM is a robust
  estimator of the mean, but uses more of the data than the median)
* SYNCSH, ASHA-STOP, ASHA-PROM, MOBSTER-JOINT, MOBSTER-INDEP, HYPERTUNE1-INDEP
  use 1 bracket, HYPERTUNE4-INDEP, HYPERTUNE-JOINT use 4 brackets, and SYNCHB,
  BOHB, DEHB, SYNCMOBSTER use the maximum of 6 brackets
* In SYNCSH, SYNCHB, ASHA-STOP, ASHA-PROM, ASHA6-STOP, new configurations are
  drawn at random, while BOHB, SYNCMOBSTER, MOBSTER-JOINT, MOBSTER-INDEP,
  HYPERTUNE1-INDEP, HYPERTUNE4-INDEP, HYPERTUNE-JOINT are variants of Bayesian
  optimization. In DEHB, configurations in the first bracket are drawn at
  random, but in later brackets, they are evolved from earlier ones
* ASHA-STOP, ASHA6-STOP use early stopping, while SYNCSH, SYNCHB, BOHB,
  SYNCMOBSTER, ASHA-PROM, MOBSTER-JOINT, MOBSTER-INDEP, HYPERTUNE1-INDEP,
  HYPERTUNE4-INDEP, HYPERTUNE-JOINT use pause-and-resume. DEHB is a synchronous
  method, but does not resume trials from checkpoints (except in the very first
  bracket)
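
The aggregation described above can be illustrated with a minimal sketch. It
is not the plotting code used for the figures: the function names, the
percentile bootstrap, the number of resamples, and the example error values
are all assumptions made for illustration. The IQM is computed here by
discarding the lowest and highest quarter of the values (rounded down) and
averaging the rest.

```python
import random
from statistics import mean


def interquartile_mean(values):
    """Mean of the middle half: drop floor(n/4) values from each end."""
    ordered = sorted(values)
    cut = len(ordered) // 4
    return mean(ordered[cut:len(ordered) - cut])


def bootstrap_ci(values, num_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the IQM (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the IQM each time
    estimates = sorted(
        interquartile_mean(rng.choices(values, k=n))
        for _ in range(num_resamples)
    )
    lower_idx = int((1 - level) / 2 * num_resamples)
    upper_idx = int((1 + level) / 2 * num_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical validation errors from 8 repetitions at one point in time
errors = [0.27, 0.28, 0.26, 0.35, 0.27, 0.29, 0.26, 0.28]
iqm = interquartile_mean(errors)
lower, upper = bootstrap_ci(errors)
print(f"IQM = {iqm:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

In this example, the outlier 0.35 falls into the discarded upper quarter, so
it does not affect the IQM, whereas it would shift the plain mean.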

Here are the results, grouped by synchronous decision-making, asynchronous
decision-making (promotion type), and asynchronous decision-making (stopping
type). ASHA-PROM results are repeated in all plots for reference.

.. |Synchronous HPO| image:: img/mf_tutorial_comparison_1.png

+--------------------------------+
| |Synchronous HPO|              |
+================================+
| Synchronous Multi-fidelity HPO |
+--------------------------------+

.. |Asynchronous HPO| image:: img/mf_tutorial_comparison_2.png

+---------------------------------------------+
| |Asynchronous HPO|                          |
+=============================================+
| Asynchronous Multi-fidelity HPO (promotion) |
+---------------------------------------------+

.. |Asynchronous Stopping| image:: img/mf_tutorial_comparison_3.png

+--------------------------------------------+
| |Asynchronous Stopping|                    |
+============================================+
| Asynchronous Multi-fidelity HPO (stopping) |
+--------------------------------------------+

These results are obtained on a single benchmark with a rather small
configuration space. Nevertheless, they are roughly in line with results we
obtained on a larger range of benchmarks. A few conclusions can be drawn, which
may help readers choose the best HPO method and its configuration for their
own problem.

* Asynchronous methods outperform synchronous ones in general, in particular
  when it comes to any-time performance. A notable exception (on this
  benchmark) is SYNCMOBSTER, which performs on par with the best asynchronous
  methods.
* Among the synchronous methods, SYNCMOBSTER performs best, followed by
  BOHB. SYNCHB and SYNCSH perform very similarly. The performance of DEHB is
  somewhat disappointing on this benchmark.
* The best-performing methods on this benchmark are MOBSTER-JOINT and
  HYPERTUNE1-INDEP, with HYPERTUNE4-INDEP a close runner-up. For MOBSTER, the
  joint multi-task surrogate model should be preferred, while for HYPERTUNE,
  the independent GPs model works better.
* On this benchmark, moving to multiple brackets does not pay off for the
  asynchronous methods. However, on benchmarks where the choice of
  :math:`r_{min}` is more critical, moving beyond successive halving can be
  beneficial. In such cases, we currently recommend using HYPERTUNE-INDEP,
  whose adaptive weighting and bracket sampling are clearly more effective
  than the simpler heuristics used in Hyperband or BOHB.
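
To make the link between :math:`r_{min}` and the number of brackets concrete,
the following sketch lists the rung levels per bracket. The values
``r_min=1``, ``r_max=200`` and ``eta=3`` are assumptions (they are not stated
in this section), chosen so that the number of brackets comes out as the
maximum of 6 mentioned above. The helper names are hypothetical and do not
belong to any library API.

```python
def rung_levels(r_min, r_max, eta):
    """Geometric grid r_min, r_min * eta, ... below r_max, then r_max itself."""
    levels = []
    r = r_min
    while r < r_max:
        levels.append(r)
        r *= eta
    levels.append(r_max)
    return levels


def bracket_rung_levels(r_min, r_max, eta):
    """Bracket s starts at the s-th rung level, so its effective r_min grows."""
    levels = rung_levels(r_min, r_max, eta)
    return [levels[s:] for s in range(len(levels))]


# Assumed settings: r_min=1, r_max=200, eta=3
brackets = bracket_rung_levels(r_min=1, r_max=200, eta=3)
for s, levels in enumerate(brackets):
    print(f"bracket {s}: rung levels {levels}")
```

With a single bracket, every configuration is first evaluated at the lowest
rung level. Additional brackets start some configurations at higher levels,
which hedges against a choice of :math:`r_{min}` that is too small to rank
configurations reliably.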