SageMaker Backend

Limitations of the Local Backend

We have been using the local backend LocalBackend in this tutorial so far. Due to its simplicity and very low overheads for starting, stopping, or resuming trials, this is the preferred choice for getting started. But with models and datasets getting larger, some disadvantages become apparent:

  • All concurrent training jobs (as well as the tuning algorithm itself) run as subprocesses on the same instance. This limits the number of workers to what the instance type offers. You can set n_workers to any value you like, but what you really get depends on the available resources. If you want 4 GPU workers, your instance type needs to have at least 4 GPUs, and each training job can use only one of them.

  • It is hard to encapsulate dependencies of your training code. You need to specify them explicitly, and they need to be compatible with the Syne Tune dependencies. You cannot use Docker images.

  • You may be used to working with SageMaker frameworks, or even specialized setups such as distributed training. In such cases, it is hard to get tuning to work with the local backend.
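
For comparison, here is roughly how trial_backend is created with the local backend earlier in this tutorial (a minimal sketch; entry_point points to the training script, as in the launcher script recalled below):

from syne_tune.backend import LocalBackend

# All trials run as subprocesses on the instance which executes this script
trial_backend = LocalBackend(entry_point=str(entry_point))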

Launcher Script for SageMaker Backend

Syne Tune offers the SageMaker backend SageMakerBackend as an alternative to the local one. Using it requires some preparation, as detailed here.

Recall our launcher script. In order to use the SageMaker backend, we need to create trial_backend differently:

trial_backend = SageMakerBackend(
    # We tune a PyTorch framework estimator from SageMaker
    sm_estimator=PyTorch(
        entry_point=entry_point.name,
        source_dir=str(entry_point.parent),
        instance_type="ml.c5.4xlarge",
        instance_count=1,
        role=get_execution_role(),
        dependencies=[str(repository_root_path() / "benchmarking")],
        # Allow some slack beyond the tuning budget before SageMaker stops the job
        max_run=int(1.05 * args.max_wallclock_time),
        framework_version="1.7.1",
        py_version="py3",
        disable_profiler=True,
        debugger_hook_config=False,
        sagemaker_session=default_sagemaker_session(),
    ),
    # Metrics reported by the training script, surfaced in the SageMaker dashboard
    metrics_names=[metric],
)

In essence, the SageMakerBackend is parameterized with a SageMaker estimator, which executes the training script. In our example, we use the PyTorch SageMaker framework as a pre-built container for the dependencies our training script requires. However, any other type of SageMaker estimator can be used here just as well. Finally, if you include any of the metrics reported by your training script in metrics_names, their values are visualized in the dashboard of the SageMaker training job.
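
The remainder of the launcher script stays the same: the backend is passed to the Tuner just as the local one was. A minimal sketch, assuming that scheduler, metric, and args.max_wallclock_time are defined as in the earlier parts of this tutorial:

from syne_tune import Tuner, StoppingCriterion

# Each worker now corresponds to a separate SageMaker training job
tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=args.max_wallclock_time),
    n_workers=4,
)
tuner.run()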

If your training script requires additional dependencies not contained in the chosen SageMaker framework, you can specify those in a requirements.txt file in the same directory as your training script (i.e., in the source_dir of the SageMaker estimator). In our example, this file needs to contain the filelock dependency.
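
For instance, with the training script and its requirements file living in the directory passed as source_dir above, the requirements.txt could be as simple as:

# requirements.txt, placed next to the training script in source_dir
filelock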

Note

This simple example sidesteps some complications, such as writing results to S3 in a unified manner, or special features of SageMaker which can speed up tuning substantially. For more information about the SageMaker backend, please consider this tutorial.