Getting Started with Hyperparameter Tuning

In this section, you will learn what is needed to get hyperparameter tuning up and running. We will look at an example where a deep learning language model is trained on natural language text.

What is Hyperparameter Tuning?

When solving a business problem with machine learning, there are parts which can be automated by spending compute resources, while other parts require human expert attention and choices to be made. By automating some of the more tedious parts of the latter, hyperparameter tuning shifts the needle between these cost factors. Like any other smart tool, it frees up your time so you can concentrate on where your strengths really lie, and where you can create the most value.

At a high level, hyperparameter tuning finds configurations of a system which optimize a target metric (or several, as we will see later). We can try any configuration from a configuration space, but each evaluation of the system has a cost and takes time. The main challenge of hyperparameter tuning is to run as few trials as possible, so that the total cost is minimal. Also, if possible, trials should be run in parallel, so that the total experiment time is minimal.

In this tutorial, we will mostly focus on making decisions and tuning free parameters in the context of training machine learning models on data, so that their predictions can be used as part of a solution to a business problem. There are many other steps between the initial need and a deployed solution, such as understanding business requirements; collecting, cleaning, and labeling data; and monitoring and maintenance. Some of these can be addressed with automated tuning as well; others need different tools.

A common paradigm for decision-making and parameter tuning is to try a number of different configurations and select the best in the end. A minimal code sketch of this paradigm is given after the following list.

  • A trial consists of training a model on a part of the data (the training data). Here, training is an automated process (for example, stochastic gradient descent on the weights and biases of a neural network model), given a configuration (e.g., which learning rate is used, which batch size, etc.). Then, the trained model is evaluated on another part of the data (validation data, disjoint from the training data), giving rise to a quality metric (e.g., validation error, AUC, F1), or even several ones. For small datasets, we can also use cross-validation, repeating training and evaluation on a number of different splits and reporting the average of the validation metrics.

  • This metric value (or values) is the response of the system to a configuration. Note that the response is stochastic: if we run again with the same configuration, we may get a different value. This is because training has random elements (e.g., initial weights are sampled at random, and the ordering of the training data is shuffled).
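
To make this concrete, here is a minimal sketch of the paradigm, using synthetic data and a scikit-learn model as a stand-in for the tutorial's PyTorch training script (everything in it is purely illustrative):

# Sketch of "try several configurations, select the best". The model, data and
# configurations are made up; only the structure of the loop matters.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = rng.randn(200, 10), rng.randn(200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)


def run_trial(config):
    # One trial: train with the given configuration, evaluate on validation data
    model = Ridge(alpha=config["alpha"]).fit(X_train, y_train)
    return mean_squared_error(y_valid, model.predict(X_valid))


# Try a number of configurations and select the best one at the end
configs = [{"alpha": alpha} for alpha in (0.01, 0.1, 1.0, 10.0)]
results = [(run_trial(config), config) for config in configs]
print(min(results, key=lambda item: item[0]))  # best (metric, configuration) pair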

Enough high-level talk and definitions; let us dive into an example.

Annotating a Training Script

First, we need a script which executes a trial by training a model and evaluating it. Since training models is bread and butter for machine learners, you will have no problem coming up with one. Our example is training_script_report_end.py. Ignoring the boilerplate, here are the important parts. To begin with, we define the hyperparameters which should be optimized over:

transformer_wikitext2/code/training_script_report_end.py – hyperparameters
from syne_tune import Reporter
from syne_tune.config_space import randint, uniform, loguniform, add_to_argparse


METRIC_NAME = "val_loss"

MAX_RESOURCE_ATTR = "epochs"


_config_space = {
    "lr": loguniform(1e-6, 1e-3),
    "dropout": uniform(0, 0.99),
    "batch_size": randint(16, 48),
    "momentum": uniform(0, 0.99),
    "clip": uniform(0, 1),
}


  • The keys of _config_space are the hyperparameters we would like to tune (lr, dropout, batch_size, momentum, clip). The dictionary also defines their ranges and data types; we come back to this below (a quick sampling sketch follows this list).

  • METRIC_NAME is the name under which the target metric is reported, and MAX_RESOURCE_ATTR is the key name for the number of epochs to train.
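
As a quick sanity check, you can draw random configurations from such a space. The sketch below assumes that the domain objects expose a sample() method (Syne Tune's configuration space API is derived from Ray Tune's, where this is the case):

from syne_tune.config_space import loguniform, randint, uniform

# Same configuration space as above, repeated so the sketch is self-contained
_config_space = {
    "lr": loguniform(1e-6, 1e-3),
    "dropout": uniform(0, 0.99),
    "batch_size": randint(16, 48),
    "momentum": uniform(0, 0.99),
    "clip": uniform(0, 1),
}

# Draw one random value per hyperparameter (assumes Domain.sample() is available)
config = {name: domain.sample() for name, domain in _config_space.items()}
print(config)  # e.g. {'lr': 3.2e-05, 'dropout': 0.41, 'batch_size': 27, ...}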

Next, here is the function which executes a trial:

transformer_wikitext2/code/training_script_report_end.py – objective
def objective(config):
    torch.manual_seed(config["seed"])
    use_cuda = config["use_cuda"]
    if torch.cuda.is_available() and not use_cuda:
        print("WARNING: You have a CUDA device, so you should run with --use-cuda 1")
    device = torch.device("cuda" if use_cuda else "cpu")
    # [1]
    # Download data, setup data loaders
    corpus = download_dataset(config)
    ntokens = len(corpus.dictionary)
    train_data = batchify(corpus.train, bsz=config["batch_size"], device=device)
    valid_data = batchify(corpus.valid, bsz=10, device=device)
    # Used for reporting metrics to Syne Tune
    report = Reporter()
    # [2]
    # Create model and optimizer
    model, optimizer, criterion = create_training_objects(config, ntokens, device)
    # [3]
    for epoch in range(1, config[MAX_RESOURCE_ATTR] + 1):
        train(model, train_data, optimizer, criterion, config, ntokens, epoch)
    # [4]
    # Report validation loss back to Syne Tune
    val_loss = evaluate(model, valid_data, criterion, config, ntokens)
    report(**{METRIC_NAME: val_loss})

  • The input config to objective is a configuration dictionary, containing values for the hyperparameters and other fixed parameters (such as the number of epochs to train).

  • [1] We start with downloading training and validation data. The training data loader train_data depends on hyperparameter config["batch_size"].

  • [2] Next, we create model and optimizer. This depends on the remaining hyperparameters in config.

  • [3] We then run config[MAX_RESOURCE_ATTR] epochs of training.

  • [4] Finally, we compute the error on the validation data and report it back to Syne Tune. The latter is done by creating report of type Reporter and calling it with a dictionary which uses METRIC_NAME as key (see the short sketch below).
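
To make the reporting step concrete, here is a minimal standalone sketch: calling the Reporter emits the metric values so that the Syne Tune backend can collect them (the value below is made up):

from syne_tune import Reporter

report = Reporter()
# Equivalent to report(**{METRIC_NAME: 1.23}), with METRIC_NAME = "val_loss"
report(val_loss=1.23)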

Finally, the script needs some command line arguments:

transformer_wikitext2/code/training_script_report_end.py – command line arguments
    parser = argparse.ArgumentParser(
        description="PyTorch Wikitext-2 Transformer Language Model",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "--" + MAX_RESOURCE_ATTR, type=int, default=40, help="upper epoch limit"
    )
    parser.add_argument("--use_cuda", type=int, default=1)
    parser.add_argument(
        "--input_data_dir",
        type=str,
        default="./",
        help="location of the data corpus",
    )
    parser.add_argument(
        "--optimizer_name", type=str, default="sgd", choices=["sgd", "adam"]
    )
    parser.add_argument("--bptt", type=int, default=35, help="sequence length")
    parser.add_argument("--seed", type=int, default=1111, help="random seed")
    parser.add_argument(
        "--precision", type=str, default="float", help="float | double | half"
    )
    parser.add_argument(
        "--log_interval",
        type=int,
        default=200,
        help="report interval",
    )
    parser.add_argument("--d_model", type=int, default=256, help="width of the model")
    parser.add_argument(
        "--ffn_ratio", type=int, default=1, help="the ratio of d_ffn to d_model"
    )
    parser.add_argument("--nlayers", type=int, default=2, help="number of layers")
    parser.add_argument(
        "--nhead",
        type=int,
        default=2,
        help="the number of heads in the encoder/decoder of the transformer model",
    )
    add_to_argparse(parser, _config_space)

    args, _ = parser.parse_known_args()
    args.use_cuda = bool(args.use_cuda)

    objective(config=vars(args))

  • We use an argument parser parser. Hyperparameters can be added with add_to_argparse(parser, _config_space) if the configuration space is defined in this script; otherwise, you can add them manually. We also need some further inputs which are not hyperparameters, for example MAX_RESOURCE_ATTR.

You can also provide the inputs to a training script as a JSON file.
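
As a generic illustration of this idea (the --config_json flag below is hypothetical, not a specific Syne Tune API), one can merge values from a JSON file into the configuration dictionary:

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--config_json", type=str, default=None)  # hypothetical flag
args, _ = parser.parse_known_args()

config = vars(args)
if args.config_json is not None:
    with open(args.config_json) as f:
        config.update(json.load(f))  # values from the JSON file take precedence
# objective(config=config)  # then run the trial as before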

Compared to a vanilla training script, we only added two lines: creating report and calling it to report the validation error at the end.

Choosing a Configuration Space

Apart from annotating a training script and making hyperparameters explicit as inputs, you also need to define a configuration space. In our example, we add this definition to the script, but you can also keep it separate and use the same training script with different configuration spaces:

transformer_wikitext2/code/training_script_report_end.py – configuration space
_config_space = {
    "lr": loguniform(1e-6, 1e-3),
    "dropout": uniform(0, 0.99),
    "batch_size": randint(16, 48),
    "momentum": uniform(0, 0.99),
    "clip": uniform(0, 1),
}


  • Each hyperparameter is assigned a data type and a range. In this example, batch_size is an integer, while lr, dropout, momentum, clip are floats. lr is searched on a log scale.

Syne Tune provides a range of data types. Choosing them well requires a bit of attention; guidelines are given here.
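
For illustration, here are a few commonly used domain types; the hyperparameter names and ranges below are examples, not part of the tutorial's configuration space:

from syne_tune.config_space import choice, lograndint, loguniform, randint, uniform

example_space = {
    "lr": loguniform(1e-6, 1e-3),  # float, searched on a log scale
    "batch_size": lograndint(16, 256),  # int, log scale (useful for wide ranges)
    "nlayers": randint(1, 4),  # int, linear scale
    "dropout": uniform(0.0, 0.99),  # float, linear scale
    "optimizer_name": choice(["sgd", "adam"]),  # categorical
}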

Specifying Default Values

Once you have annotated your training script and chosen a configuration space, you have specified all the inputs Syne Tune needs. You can now specify the details of your tuning experiment in code, as discussed here (a minimal code sketch is given at the end of this section). However, Syne Tune provides some tooling in syne_tune.experiments which makes life easier for most users, and we will use this tooling in the rest of the tutorial. To this end, we need to define some defaults for how experiments are to be run (most of these can be overwritten by command line arguments):

transformer_wikitext2/code/transformer_wikitext2_definition.py
from pathlib import Path

from transformer_wikitext2.code.training_script import (
    _config_space,
    METRIC_NAME,
    RESOURCE_ATTR,
    MAX_RESOURCE_ATTR,
)
from syne_tune.experiments.benchmark_definitions.common import RealBenchmarkDefinition
from syne_tune.remote.constants import (
    DEFAULT_GPU_INSTANCE_1GPU,
    DEFAULT_GPU_INSTANCE_4GPU,
)


def transformer_wikitext2_benchmark(sagemaker_backend: bool = False, **kwargs):
    if sagemaker_backend:
        instance_type = DEFAULT_GPU_INSTANCE_1GPU
    else:
        # For local backend, GPU cores serve different workers
        instance_type = DEFAULT_GPU_INSTANCE_4GPU
    fixed_parameters = dict(
        **{MAX_RESOURCE_ATTR: 40},
        d_model=256,
        ffn_ratio=1,
        nlayers=2,
        nhead=2,
        bptt=35,
        optimizer_name="sgd",
        input_data_dir="./",
        use_cuda=1,
        seed=1111,
        precision="float",
        log_interval=200,
    )
    config_space = {**_config_space, **fixed_parameters}
    _kwargs = dict(
        script=Path(__file__).parent / "training_script.py",
        config_space=config_space,
        metric=METRIC_NAME,
        mode="min",
        max_resource_attr=MAX_RESOURCE_ATTR,
        resource_attr=RESOURCE_ATTR,
        max_wallclock_time=5 * 3600,
        n_workers=4,
        instance_type=instance_type,
        framework="PyTorch",
    )
    _kwargs.update(kwargs)
    return RealBenchmarkDefinition(**_kwargs)

All you need to do is to provide a function (transformer_wikitext2_benchmark here) which returns an instance of RealBenchmarkDefinition. The most important fields are:

  • script: Filename of training script.

  • config_space: The configuration space to be used by default. This consists of two parts. First, the hyperparameters from _config_space, already discussed above. Second, the fixed_parameters, which are passed to each trial as they are. In particular, we would like to train for 40 epochs, so we pass {MAX_RESOURCE_ATTR: 40}.

  • metric, max_resource_attr, resource_attr: Names of inputs to and metrics reported from the training script. If mode == "max", the target metric metric is maximized, if mode == "min", it is minimized.

  • max_wallclock_time: Wallclock time the experiment is going to run (5 hours in our example).

  • n_workers: Maximum number of trials which run in parallel (4 in our example). The achievable degree of parallelism may be lower, depending on which execution backend is used and which hardware instance we run on.

Also, note the role of **kwargs in the function signature, which allows any of the default values (e.g., max_wallclock_time, n_workers, or instance_type) to be overwritten by command line arguments.
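
For example, the benchmark definition can be obtained with some of these defaults overwritten, as in the following sketch (it assumes transformer_wikitext2_definition.py is importable under the package path used above):

from transformer_wikitext2.code.transformer_wikitext2_definition import (
    transformer_wikitext2_benchmark,
)

# Overwrite some of the defaults set in the function above
benchmark = transformer_wikitext2_benchmark(
    max_wallclock_time=3 * 3600,  # run for 3 instead of 5 hours
    n_workers=2,  # 2 instead of 4 parallel workers
)
print(benchmark.script, benchmark.n_workers)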

Note

In the Syne Tune experimentation framework, a tuning problem (i.e., a training and evaluation script together with defaults) is called a benchmark. This terminology is used even if the goal of experimentation is not benchmarking (i.e., comparing different HPO methods), as is the case in this tutorial.
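
As mentioned above, you can also specify and run the tuning experiment directly in code instead of using the experimentation tooling. Here is a minimal sketch of what that could look like, using the local backend and random search, and assuming transformer_wikitext2_benchmark has been imported as in the previous sketch:

from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import LocalBackend
from syne_tune.optimizer.baselines import RandomSearch

benchmark = transformer_wikitext2_benchmark()
tuner = Tuner(
    trial_backend=LocalBackend(entry_point=str(benchmark.script)),
    scheduler=RandomSearch(
        benchmark.config_space, metric=benchmark.metric, mode=benchmark.mode
    ),
    stop_criterion=StoppingCriterion(max_wallclock_time=benchmark.max_wallclock_time),
    n_workers=benchmark.n_workers,
)
tuner.run()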