Iterative Problems#

Iterative problems refer to objectives that run over multiple iterations; in each iteration they can report the current metrics and may also checkpoint and resume. To optimize iterative problems, the main approach is to early stop the bad trials and move compute resources to the promising ones.

Deep learning models are the most typical iterative problems in hyperparameter optimization.
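The core idea can be sketched in plain Python, independent of any library (the function and variable names below are illustrative, not part of tune):

def successive_halving(trials, plan, train_one_rung):
    # Conceptual sketch of rung-based early stopping (not the tune API).
    # Each surviving trial resumes from its checkpoint, trains `budget` more
    # iterations, reports a smaller-is-better metric, and only the best
    # `keep` trials are promoted to the next rung.
    for budget, keep in plan:  # e.g. plan = [(1, 8), (2, 4), (4, 2), (8, 1)]
        scores = {t: train_one_rung(t, budget) for t in trials}
        trials = sorted(trials, key=scores.get)[:keep]
    return trials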

from fugue_notebook import setup
setup()
from tune import TUNE_OBJECT_FACTORY
TUNE_OBJECT_FACTORY.set_temp_path("/tmp")

Reorganize Your Keras Model Code#

In this demo we are going to use a Keras model as our iterative problem example. Developing a Keras model involves several steps: constructing the model, compiling it, and fitting it. To use the tune package for a Keras problem, you fill out our template class KerasTrainingSpec. It doesn’t really add anything to your logic, but it keeps your code organized. With the template you can still train the model in the usual way on a local machine for a given number of epochs, so you lose nothing and add nothing unnecessary by implementing this class.

from tensorflow.keras import layers, models
from tensorflow.keras.datasets import boston_housing

from tune_tensorflow import KerasTrainingSpec

class HousingSpec(KerasTrainingSpec):
    def __init__(self, params, dfs):
        assert 0 == len(dfs)
        super().__init__(params, dfs)
        (self.train_data, self.train_targets), (
            self.test_data,
            self.test_targets,
        ) = boston_housing.load_data()

    def get_model(self):
        l1, l2 = self.params.simple_value["l1"], self.params.simple_value["l2"]
        model = models.Sequential()
        model.add(
            layers.Dense(l1, activation="relu", input_shape=(self.train_data.shape[1],))
        )
        model.add(layers.Dense(l2, activation="relu"))
        model.add(layers.Dense(1))
        return model

    def get_compile_params(self):
        return dict(optimizer="rmsprop", loss="mse", metrics=["mae"])

    def get_fit_params(self):
        return [self.train_data, self.train_targets], dict(
            validation_data=(self.test_data, self.test_targets), shuffle=True
        )

    def get_fit_metric(self, history):
        return float(history.history["val_mae"][-1])

    def generate_sort_metric(self, metric: float) -> float:
        return metric

Notice that you can also define how to save and load the current model in the template. The default save stores only the weights; the default load calls get_model and then loads the weights.
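For reference, the weights-only pattern in plain Keras looks roughly like this (a generic sketch that mirrors the default behavior; the file path is hypothetical):

# illustrative sketch of weights-only save/load, not the tune implementation itself
spec = HousingSpec(dict(l1=8, l2=16), dfs={})
model = spec.get_model()
model.save_weights("/tmp/housing_weights.h5")     # default save: weights only

restored = spec.get_model()                       # default load: rebuild via get_model ...
restored.load_weights("/tmp/housing_weights.h5")  # ... then load the saved weights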

For iterative problems, we separate the execution into several rungs; for a deep learning model, each rung contains several epochs. Saving and loading happen at the rung level, not the epoch level, so if you want to checkpoint on every epoch, you can add your own callbacks.
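For example, a per-epoch checkpoint could be added with a standard Keras callback. The sketch below assumes the extra fit kwargs returned by get_fit_params are passed straight to model.fit; the subclass name and file path are hypothetical:

from tensorflow.keras.callbacks import ModelCheckpoint

class HousingSpecWithCheckpoint(HousingSpec):
    def get_fit_params(self):
        args, kwargs = super().get_fit_params()
        # standard Keras callback: save the weights at the end of every epoch
        kwargs["callbacks"] = [
            ModelCheckpoint("/tmp/housing_epoch_{epoch}.h5", save_weights_only=True)
        ]
        return args, kwargs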

Deep learning models are normally trained on GPUs and can be expensive, so before tuning it’s important to test on your local CPU or GPU whether the training spec can run one rung successfully. To do that, you only need to call compute_sort_metric; it produces the final output metric used for tuning. The metric must be smaller-is-better. You can implement generate_sort_metric for different types of metrics: errors are already smaller-is-better, but for something like precision you should return the negative value so that smaller still means better.
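For example, if your fit metric were a larger-is-better score such as accuracy, the spec could flip the sign in generate_sort_metric so that smaller still means better (an illustrative sketch; HousingSpec itself reports MAE, which is already smaller-is-better):

class LargerIsBetterSpec(HousingSpec):  # hypothetical example, not used below
    def get_fit_metric(self, history):
        # suppose the tracked metric were a validation accuracy (larger is better)
        return float(history.history["val_accuracy"][-1])

    def generate_sort_metric(self, metric: float) -> float:
        return -metric  # negate so the sort metric is smaller-is-better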

spec = HousingSpec(dict(l1=8, l2=16), dfs={})
spec.compute_sort_metric(epochs=3)
2022-01-17 20:14:13.626973: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:14:13.628269: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Epoch 1/3
404/404 [==============================] - 1s 2ms/sample - loss: 43595.9143 - mae: 204.9338 - val_loss: 31703.6621 - val_mae: 175.8306
Epoch 2/3
404/404 [==============================] - 0s 106us/sample - loss: 23562.3374 - mae: 150.2194 - val_loss: 17490.9506 - val_mae: 130.1817
Epoch 3/3
404/404 [==============================] - 0s 114us/sample - loss: 12554.4363 - mae: 108.1499 - val_loss: 8439.3577 - val_mae: 89.3966
89.39659118652344

Define the simplest case and run#

After calling compute_sort_metric, we know the spec can run a rung successfully. So now let’s define the search space and start tuning.

import pandas as pd

from tune import Space, Grid, Rand, RandInt, Choice

from tune_tensorflow import (
    keras_space,
    suggest_keras_models_by_continuous_asha,
    suggest_keras_models_by_hyperband,
    suggest_keras_models_by_sha,
)

If you have read the tutorial for non-iterative problems, you will find the utility functions highly consistent. For keras_space, just like sk_space, you specify the spec class type followed by the hyperparameters. The following space contains only two configurations.

space = keras_space(HousingSpec, l1=Grid(8, 16), l2=16)

# using successive halving to search in the space
result = suggest_keras_models_by_sha(
    space,
    plan=[(2.0,1)]  # for each of the two configurations, run 2 epochs, then select 1 as the final result based on sort_metric
)

result
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
Train on 404 samples, validate on 102 samples
Epoch 1/2
404/404 [==============================] - 1s 2ms/sample - loss: 738.2188 - mae: 23.1714 - val_loss: 242.4906 - val_mae: 12.4173
Epoch 2/2
404/404 [==============================] - 0s 104us/sample - loss: 153.6241 - mae: 9.1089 - val_loss: 94.3623 - val_mae: 7.3219
Train on 404 samples, validate on 102 samples
Epoch 1/2
404/404 [==============================] - 1s 2ms/sample - loss: 270.2734 - mae: 13.0701 - val_loss: 144.0259 - val_mae: 9.4996
Epoch 2/2
404/404 [==============================] - 0s 101us/sample - loss: 142.3855 - mae: 9.0876 - val_loss: 99.8096 - val_mae: 7.7550
[{'trial': {'trial_id': 'ac7e4624-96f7-5c75-b541-96931ce24f93', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 8, 'l2': 16}, 'metadata': {}, 'keys': []}, 'metric': 7.321929931640625, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 8, 'l2': 16}, 'metadata': {}, 'cost': 2.0, 'rung': 0, 'sort_metric': 7.321929931640625, 'log_time': datetime.datetime(2022, 1, 17, 20, 14, 15, 387890)}]

Successive Halving#

If you don’t have a lot of computing resources, for example you are tuning on your own machine with just 1 GPU, then successive halving is a great choice. In this implementation, the models are saved to disk after each rung and, if promoted, retrieved at the beginning of the next rung.
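As a rough budget estimate, starting from 16 sampled trials the plan below costs 16×1 + 8×2 + 4×4 + 2×8 = 64 epoch-units in total: each rung trains the surviving trials for the given number of additional epochs and keeps the given number of winners. If you prefer to generate such plans programmatically, a tiny helper is enough (halving_plan is a hypothetical convenience function, not part of tune):

def halving_plan(first_epochs: float, n_trials: int, eta: int = 2):
    # hypothetical helper: each rung multiplies the per-trial epochs by eta
    # and divides the surviving trials by eta
    plan, epochs, keep = [], first_epochs, n_trials // eta
    while keep >= 1:
        plan.append((float(epochs), keep))
        epochs *= eta
        keep //= eta
    return plan

halving_plan(1, 16)  # -> [(1.0, 8), (2.0, 4), (4.0, 2), (8.0, 1)]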

%%time
space = keras_space(HousingSpec, l1=RandInt(8,32), l2=RandInt(8,32))

result = suggest_keras_models_by_sha(
    space.sample(16,0),
    plan=[(1.0,8),(2.0,4),(4.0,2),(8.0,1)],  # a traditional successive halving plan, you can invent your own plan
    top_n=2, # we only keep up to 2 best results
)

result
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 222.8466 - mae: 11.1890 - val_loss: 112.0152 - val_mae: 8.8059
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 9989.0720 - mae: 98.1604 - val_loss: 6710.0068 - val_mae: 80.8969
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 1013.7837 - mae: 26.4647 - val_loss: 169.5791 - val_mae: 11.1495
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 1ms/sample - loss: 401.6506 - mae: 13.0272 - val_loss: 121.1484 - val_mae: 9.2976
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 1530.6857 - mae: 31.0897 - val_loss: 236.4487 - val_mae: 12.2354
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 1184.4694 - mae: 27.8627 - val_loss: 262.2075 - val_mae: 13.6149
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 2307.2402 - mae: 34.4787 - val_loss: 502.5728 - val_mae: 17.8211
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 396.2434 - mae: 14.3466 - val_loss: 94.6467 - val_mae: 7.4835
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 10797.1170 - mae: 84.4990 - val_loss: 3823.7644 - val_mae: 46.8471
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 174.3172 - mae: 8.9402 - val_loss: 67.0529 - val_mae: 6.1133
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 4577.9229 - mae: 62.8989 - val_loss: 1344.7374 - val_mae: 34.7966
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 6560.3082 - mae: 76.4061 - val_loss: 2116.2962 - val_mae: 44.6558
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 534.7672 - mae: 16.2389 - val_loss: 148.5417 - val_mae: 9.2515
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 1ms/sample - loss: 8301.4639 - mae: 70.3609 - val_loss: 3741.6719 - val_mae: 42.5120
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 2ms/sample - loss: 6877.0804 - mae: 75.8768 - val_loss: 2219.4958 - val_mae: 42.5003
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 1s 1ms/sample - loss: 1229.9572 - mae: 25.9055 - val_loss: 202.9586 - val_mae: 11.1453
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 109.0708 - mae: 7.3074 - val_loss: 79.0839 - val_mae: 6.2364
Epoch 3/3
404/404 [==============================] - 0s 108us/sample - loss: 72.0105 - mae: 6.0785 - val_loss: 61.2814 - val_mae: 5.7532
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 102.5736 - mae: 7.2892 - val_loss: 79.0816 - val_mae: 6.9023
Epoch 3/3
404/404 [==============================] - 0s 106us/sample - loss: 79.1249 - mae: 6.3079 - val_loss: 77.1768 - val_mae: 6.7798
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 149.6943 - mae: 9.3627 - val_loss: 92.7162 - val_mae: 7.5977
Epoch 3/3
404/404 [==============================] - 0s 111us/sample - loss: 106.2104 - mae: 7.7644 - val_loss: 94.0889 - val_mae: 7.2299
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 131.2470 - mae: 8.9093 - val_loss: 175.1610 - val_mae: 10.1505
Epoch 3/3
404/404 [==============================] - 0s 104us/sample - loss: 105.4369 - mae: 7.9283 - val_loss: 104.0349 - val_mae: 7.8951
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 99.8089 - mae: 7.4534 - val_loss: 150.5006 - val_mae: 10.9159
Epoch 3/3
404/404 [==============================] - 0s 102us/sample - loss: 67.4589 - mae: 5.9227 - val_loss: 71.2206 - val_mae: 6.6132
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 1ms/sample - loss: 166.9106 - mae: 9.2106 - val_loss: 87.9771 - val_mae: 7.0022
Epoch 3/3
404/404 [==============================] - 0s 100us/sample - loss: 107.1787 - mae: 7.7246 - val_loss: 81.7703 - val_mae: 6.9779
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 134.8245 - mae: 9.0797 - val_loss: 87.4413 - val_mae: 6.8109
Epoch 3/3
404/404 [==============================] - 0s 105us/sample - loss: 84.0637 - mae: 6.5991 - val_loss: 78.7119 - val_mae: 6.9185
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 2ms/sample - loss: 101.6866 - mae: 7.4164 - val_loss: 65.9363 - val_mae: 6.0129
Epoch 3/3
404/404 [==============================] - 0s 108us/sample - loss: 67.1361 - mae: 5.8809 - val_loss: 60.6061 - val_mae: 5.9089
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 2ms/sample - loss: 94.2489 - mae: 7.0036 - val_loss: 59.6185 - val_mae: 5.4919
Epoch 5/7
404/404 [==============================] - 0s 98us/sample - loss: 73.8441 - mae: 6.2665 - val_loss: 62.5825 - val_mae: 5.4681
Epoch 6/7
404/404 [==============================] - 0s 105us/sample - loss: 74.8096 - mae: 6.1939 - val_loss: 91.8281 - val_mae: 6.8802
Epoch 7/7
404/404 [==============================] - 0s 103us/sample - loss: 65.8624 - mae: 5.7054 - val_loss: 86.8929 - val_mae: 6.6587
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 2ms/sample - loss: 74.8165 - mae: 6.1639 - val_loss: 58.7839 - val_mae: 5.7485
Epoch 5/7
404/404 [==============================] - 0s 98us/sample - loss: 60.7689 - mae: 5.4857 - val_loss: 57.5218 - val_mae: 5.6207
Epoch 6/7
404/404 [==============================] - 0s 100us/sample - loss: 58.8136 - mae: 5.3922 - val_loss: 79.4196 - val_mae: 7.2567
Epoch 7/7
404/404 [==============================] - 0s 102us/sample - loss: 59.7249 - mae: 5.5724 - val_loss: 68.3266 - val_mae: 6.0944
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 2ms/sample - loss: 95.5928 - mae: 7.1267 - val_loss: 63.9240 - val_mae: 5.9148
Epoch 5/7
404/404 [==============================] - 0s 97us/sample - loss: 70.6773 - mae: 6.0004 - val_loss: 63.6641 - val_mae: 6.1819
Epoch 6/7
404/404 [==============================] - 0s 103us/sample - loss: 64.8519 - mae: 5.9047 - val_loss: 78.0059 - val_mae: 6.3313
Epoch 7/7
404/404 [==============================] - 0s 102us/sample - loss: 81.2107 - mae: 6.4607 - val_loss: 60.6446 - val_mae: 5.5617
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 2ms/sample - loss: 85.5458 - mae: 6.7063 - val_loss: 82.2353 - val_mae: 6.9334
Epoch 5/7
404/404 [==============================] - 0s 100us/sample - loss: 75.9433 - mae: 6.0527 - val_loss: 76.2981 - val_mae: 6.4629
Epoch 6/7
404/404 [==============================] - 0s 102us/sample - loss: 75.8841 - mae: 6.1496 - val_loss: 77.4582 - val_mae: 6.6357
Epoch 7/7
404/404 [==============================] - 0s 103us/sample - loss: 76.3783 - mae: 6.1970 - val_loss: 72.5362 - val_mae: 6.4215
NativeExecutionEngine doesn't respect num_partitions ROWCOUNT
Train on 404 samples, validate on 102 samples
Epoch 8/15
404/404 [==============================] - 1s 1ms/sample - loss: 95.0065 - mae: 6.5318 - val_loss: 70.3543 - val_mae: 6.8896
Epoch 9/15
404/404 [==============================] - 0s 99us/sample - loss: 62.5861 - mae: 5.7215 - val_loss: 59.2268 - val_mae: 5.4957
Epoch 10/15
404/404 [==============================] - 0s 101us/sample - loss: 72.4533 - mae: 6.0955 - val_loss: 57.9465 - val_mae: 5.7885
Epoch 11/15
404/404 [==============================] - 0s 102us/sample - loss: 56.1888 - mae: 5.3666 - val_loss: 82.5438 - val_mae: 6.6327
Epoch 12/15
404/404 [==============================] - 0s 99us/sample - loss: 65.8400 - mae: 5.7120 - val_loss: 53.4288 - val_mae: 5.2826
Epoch 13/15
404/404 [==============================] - 0s 103us/sample - loss: 64.2736 - mae: 5.7927 - val_loss: 53.7256 - val_mae: 5.4018
Epoch 14/15
404/404 [==============================] - 0s 98us/sample - loss: 64.7464 - mae: 5.8500 - val_loss: 69.4926 - val_mae: 6.8882
Epoch 15/15
404/404 [==============================] - 0s 98us/sample - loss: 53.1147 - mae: 5.3081 - val_loss: 62.2629 - val_mae: 6.2421
Train on 404 samples, validate on 102 samples
Epoch 8/15
404/404 [==============================] - 1s 1ms/sample - loss: 62.6136 - mae: 5.8689 - val_loss: 55.7373 - val_mae: 5.5580
Epoch 9/15
404/404 [==============================] - 0s 104us/sample - loss: 53.0064 - mae: 5.0910 - val_loss: 54.3579 - val_mae: 5.4173
Epoch 10/15
404/404 [==============================] - 0s 105us/sample - loss: 55.2219 - mae: 5.3011 - val_loss: 54.2386 - val_mae: 5.4190
Epoch 11/15
404/404 [==============================] - 0s 101us/sample - loss: 54.9532 - mae: 5.2254 - val_loss: 61.4893 - val_mae: 6.1040
Epoch 12/15
404/404 [==============================] - 0s 101us/sample - loss: 53.2293 - mae: 5.0438 - val_loss: 56.2200 - val_mae: 5.4832
Epoch 13/15
404/404 [==============================] - 0s 98us/sample - loss: 55.3434 - mae: 5.3792 - val_loss: 54.3165 - val_mae: 5.4973
Epoch 14/15
404/404 [==============================] - 0s 103us/sample - loss: 47.2370 - mae: 4.7986 - val_loss: 68.3576 - val_mae: 6.1569
Epoch 15/15
404/404 [==============================] - 0s 100us/sample - loss: 50.2139 - mae: 5.0862 - val_loss: 53.1893 - val_mae: 5.3379
CPU times: user 25.1 s, sys: 895 ms, total: 26 s
Wall time: 24 s
[{'trial': {'trial_id': '9f135e3f-83a6-5b7f-b5f3-af6637662c49', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 17, 'l2': 23}, 'metadata': {}, 'keys': []}, 'metric': 5.337876319885254, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 17, 'l2': 23}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.337876319885254, 'log_time': datetime.datetime(2022, 1, 17, 20, 14, 40, 212724)},
 {'trial': {'trial_id': '360e486c-6d27-5abc-bb8d-c9bd388e62c3', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 25, 'l2': 28}, 'metadata': {}, 'keys': []}, 'metric': 6.242136001586914, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 25, 'l2': 28}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 6.242136001586914, 'log_time': datetime.datetime(2022, 1, 17, 20, 14, 39, 247491)}]

Run with Spark#

If you have a Spark or Dask cluster, you can easily scale your search by providing a different execution_engine and execution_engine_conf. If you are training a deep learning model, you normally need a GPU for each worker; how to make Spark or Dask start workers with GPUs is an infrastructure question, and we can’t provide guidance here because it is very environment specific.

For this case we only use CPUs, so we can use local Spark to utilize the 4 CPUs.

from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from fugue_spark import SparkExecutionEngine

fconf = {
    "fugue.rpc.server": "fugue.rpc.flask.FlaskRPCServer",
    "fugue.rpc.flask_server.host": "0.0.0.0",
    "fugue.rpc.flask_server.port": "1234",
    "fugue.rpc.flask_server.timeout": "2 sec",
}

conf = SparkConf().set("spark.ui.showConsoleProgress", "false")
sc = SparkContext(appName="Tuning", conf=conf)
spark = SparkExecutionEngine(SparkSession(sc), conf=fconf)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/17 20:14:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%%time

space = keras_space(HousingSpec, l1=RandInt(8,32), l2=RandInt(8,32))

result = suggest_keras_models_by_sha(
    space.sample(16,0),
    plan=[(1.0,8),(2.0,4),(4.0,2),(8.0,1)],
    execution_engine="spark",
    top_n=2,
)

result
2022-01-17 20:15:19.976085: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:19.975319: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:19.992851: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:19.993326: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Train on 404 samples, validate on 102 samples
2022-01-17 20:15:22.247629: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:22.252117: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:22.261620: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:22.268666: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:22.273279: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:22.292436: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:22.297228: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:22.300151: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Train on 404 samples, validate on 102 samples
Train on 404 samples, validate on 102 samples
 32/404 [=>............................] - ETA: 27s - loss: 4676.2480 - mae: 66.7921Train on 404 samples, validate on 102 samples
404/404 [==============================] - 3s 7ms/sample - loss: 340.1798 - mae: 13.7698 - val_loss: 94.0060 - val_mae: 7.4925
404/404 [==============================] - 3s 7ms/sample - loss: 1552.3031 - mae: 31.9502 - val_loss: 293.8547 - val_mae: 13.4522
 32/404 [=>............................] - ETA: 29s - loss: 167.2086 - mae: 9.71872022-01-17 20:15:25.338086: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
384/404 [===========================>..] - ETA: 0s - loss: 153.2323 - mae: 8.8183 2022-01-17 20:15:25.363546: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:25.370134: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
192/404 [=============>................] - ETA: 2s - loss: 155.4403 - mae: 9.6472 2022-01-17 20:15:25.376542: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
404/404 [==============================] - 3s 7ms/sample - loss: 146.9411 - mae: 8.5968 - val_loss: 86.5214 - val_mae: 6.6218
404/404 [==============================] - 3s 7ms/sample - loss: 118.1191 - mae: 8.3546 - val_loss: 84.6783 - val_mae: 6.5059
404/404 [==============================] - 3s 7ms/sample - loss: 1079.9114 - mae: 24.8831 - val_loss: 150.3951 - val_mae: 9.8546
Train on 404 samples, validate on 102 samples
192/404 [=============>................] - ETA: 3s - loss: 44175.5143 - mae: 207.3689 Train on 404 samples, validate on 102 samples
404/404 [==============================] - 4s 9ms/sample - loss: 35045.6733 - mae: 183.4657 - val_loss: 22545.5962 - val_mae: 148.9371
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 2s 6ms/sample - loss: 417.7416 - mae: 14.7083 - val_loss: 224.5273 - val_mae: 12.1856
320/404 [======================>.......] - ETA: 0s - loss: 833.7399 - mae: 23.8774 Train on 404 samples, validate on 102 samples
404/404 [==============================] - 4s 10ms/sample - loss: 731.1726 - mae: 22.0633 - val_loss: 392.5416 - val_mae: 16.1036
404/404 [==============================] - 4s 10ms/sample - loss: 500.7441 - mae: 15.3740 - val_loss: 162.9564 - val_mae: 9.7072
404/404 [==============================] - 2s 4ms/sample - loss: 1596.1780 - mae: 33.5328 - val_loss: 712.5643 - val_mae: 23.2933
2022-01-17 20:15:32.052858: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:32.059224: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:32.062204: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:32.067582: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Train on 404 samples, validate on 102 samples
2022-01-17 20:15:33.206317: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:33.217654: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
2022-01-17 20:15:33.225448: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:33.235287: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Train on 404 samples, validate on 102 samples
2022-01-17 20:15:33.702748: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:33.712510: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 2s 5ms/sample - loss: 124.2025 - mae: 8.0326 - val_loss: 89.4502 - val_mae: 7.7794
404/404 [==============================] - 2s 5ms/sample - loss: 10704.2132 - mae: 95.8298 - val_loss: 2650.0053 - val_mae: 49.5344
 32/404 [=>............................] - ETA: 21s - loss: 498.1444 - mae: 19.55802022-01-17 20:15:35.386821: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:35.392949: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
404/404 [==============================] - 2s 6ms/sample - loss: 161.3598 - mae: 9.2978 - val_loss: 77.9805 - val_mae: 6.7786
404/404 [==============================] - 2s 6ms/sample - loss: 2646.1785 - mae: 38.8303 - val_loss: 165.9240 - val_mae: 10.1542
Train on 404 samples, validate on 102 samples
404/404 [==============================] - 2s 5ms/sample - loss: 2202.6670 - mae: 44.1604 - val_loss: 921.8530 - val_mae: 27.4930
404/404 [==============================] - 1s 3ms/sample - loss: 649.4187 - mae: 18.3541 - val_loss: 105.7603 - val_mae: 8.1020
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 2s 6ms/sample - loss: 96.3630 - mae: 7.1133 - val_loss: 86.7123 - val_mae: 6.4099
Epoch 3/3
404/404 [==============================] - 2s 6ms/sample - loss: 77.7906 - mae: 6.3446 - val_loss: 63.2674 - val_mae: 5.8088
Epoch 3/3
404/404 [==============================] - 2s 6ms/sample - loss: 120.7039 - mae: 8.5010 - val_loss: 89.4613 - val_mae: 7.0364
Epoch 3/3
404/404 [==============================] - 2s 6ms/sample - loss: 81.8934 - mae: 6.7243 - val_loss: 67.9964 - val_mae: 6.2066
Epoch 3/3
404/404 [==============================] - 3s 6ms/sample - loss: 96.6596 - mae: 7.5033 - val_loss: 109.5510 - val_mae: 8.2991
Epoch 3/3
404/404 [==============================] - 0s 460us/sample - loss: 63.5300 - mae: 5.7175 - val_loss: 88.4708 - val_mae: 6.7695
404/404 [==============================] - 0s 481us/sample - loss: 76.5753 - mae: 6.2638 - val_loss: 82.1614 - val_mae: 6.3239
404/404 [==============================] - 3s 7ms/sample - loss: 88.1857 - mae: 6.7593 - val_loss: 55.6522 - val_mae: 5.2754
Epoch 3/3
404/404 [==============================] - 0s 607us/sample - loss: 62.7232 - mae: 5.5297 - val_loss: 66.8899 - val_mae: 6.1538
404/404 [==============================] - 0s 643us/sample - loss: 82.6789 - mae: 6.7664 - val_loss: 71.9773 - val_mae: 6.1288
404/404 [==============================] - 3s 7ms/sample - loss: 105.7667 - mae: 7.7023 - val_loss: 79.4545 - val_mae: 6.6914
Epoch 3/3
404/404 [==============================] - 0s 555us/sample - loss: 71.9645 - mae: 6.2691 - val_loss: 92.1092 - val_mae: 7.3004
404/404 [==============================] - 0s 483us/sample - loss: 60.0886 - mae: 5.4920 - val_loss: 92.0157 - val_mae: 6.9520
404/404 [==============================] - 0s 498us/sample - loss: 85.2209 - mae: 6.7701 - val_loss: 78.1736 - val_mae: 6.4667
2022-01-17 20:15:51.461786: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:15:51.467061: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Epoch 2/3
404/404 [==============================] - 1s 3ms/sample - loss: 87.1320 - mae: 7.0783 - val_loss: 64.7436 - val_mae: 6.1645
Epoch 3/3
404/404 [==============================] - 0s 141us/sample - loss: 70.1167 - mae: 5.9839 - val_loss: 92.4013 - val_mae: 8.2311
Train on 404 samples, validate on 102 samples
Epoch 4/7
Train on 404 samples, validate on 102 samples
Epoch 4/7
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 3ms/sample - loss: 68.9716 - mae: 5.8691 - val_loss: 65.4753 - val_mae: 5.9187
Epoch 5/7
404/404 [==============================] - 1s 3ms/sample - loss: 87.4706 - mae: 6.9567 - val_loss: 88.7132 - val_mae: 6.8814
Epoch 5/7
404/404 [==============================] - 1s 3ms/sample - loss: 99.6994 - mae: 6.8227 - val_loss: 69.0279 - val_mae: 6.4795
Epoch 5/7
404/404 [==============================] - 0s 347us/sample - loss: 62.2751 - mae: 5.6500 - val_loss: 82.0277 - val_mae: 6.5475
Epoch 6/7
404/404 [==============================] - 0s 320us/sample - loss: 74.7241 - mae: 6.3082 - val_loss: 74.3547 - val_mae: 6.9067
Epoch 6/7
404/404 [==============================] - 0s 287us/sample - loss: 73.6867 - mae: 6.4039 - val_loss: 64.6798 - val_mae: 6.2432
Epoch 6/7
404/404 [==============================] - 0s 330us/sample - loss: 70.0915 - mae: 6.0928 - val_loss: 68.5346 - val_mae: 6.4581
Epoch 7/7
404/404 [==============================] - 0s 358us/sample - loss: 61.7143 - mae: 5.5699 - val_loss: 81.3843 - val_mae: 7.2658
Epoch 7/7
404/404 [==============================] - 0s 311us/sample - loss: 79.8676 - mae: 6.5244 - val_loss: 67.3844 - val_mae: 5.8493
Epoch 7/7
404/404 [==============================] - 0s 265us/sample - loss: 73.3285 - mae: 6.3710 - val_loss: 63.8848 - val_mae: 5.9922
404/404 [==============================] - 0s 309us/sample - loss: 59.3020 - mae: 5.4560 - val_loss: 62.7025 - val_mae: 6.0392
404/404 [==============================] - 0s 307us/sample - loss: 68.9727 - mae: 6.0001 - val_loss: 62.0919 - val_mae: 5.9221
2022-01-17 20:16:03.541848: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-17 20:16:03.547408: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
Train on 404 samples, validate on 102 samples
Epoch 4/7
404/404 [==============================] - 1s 3ms/sample - loss: 82.0983 - mae: 6.2723 - val_loss: 69.0765 - val_mae: 6.3377
Epoch 5/7
404/404 [==============================] - 0s 163us/sample - loss: 71.0467 - mae: 6.0588 - val_loss: 70.3952 - val_mae: 5.9410
Epoch 6/7
404/404 [==============================] - 0s 135us/sample - loss: 69.3331 - mae: 5.8237 - val_loss: 84.0916 - val_mae: 7.5332
Epoch 7/7
404/404 [==============================] - 0s 128us/sample - loss: 64.1412 - mae: 5.5068 - val_loss: 65.6148 - val_mae: 6.3422
Train on 404 samples, validate on 102 samples
Epoch 8/15
Train on 404 samples, validate on 102 samples
Epoch 8/15
404/404 [==============================] - 1s 2ms/sample - loss: 90.5388 - mae: 6.4925 - val_loss: 96.0720 - val_mae: 7.1895
Epoch 9/15
404/404 [==============================] - 1s 2ms/sample - loss: 77.9944 - mae: 6.2412 - val_loss: 63.7347 - val_mae: 5.7904
Epoch 9/15
404/404 [==============================] - 0s 145us/sample - loss: 61.0867 - mae: 5.5270 - val_loss: 64.8602 - val_mae: 5.9496
Epoch 10/15
404/404 [==============================] - 0s 144us/sample - loss: 69.5993 - mae: 5.9489 - val_loss: 64.3345 - val_mae: 5.6493
Epoch 10/15
404/404 [==============================] - 0s 140us/sample - loss: 66.8111 - mae: 5.9609 - val_loss: 67.1307 - val_mae: 6.4332
Epoch 11/15
404/404 [==============================] - 0s 135us/sample - loss: 68.1354 - mae: 5.9266 - val_loss: 70.3789 - val_mae: 6.7164
Epoch 11/15
404/404 [==============================] - 0s 135us/sample - loss: 63.4828 - mae: 5.5511 - val_loss: 58.8596 - val_mae: 5.7412
Epoch 12/15
404/404 [==============================] - 0s 132us/sample - loss: 60.6300 - mae: 5.7319 - val_loss: 64.1241 - val_mae: 6.0083
Epoch 12/15
404/404 [==============================] - 0s 138us/sample - loss: 71.7196 - mae: 6.0782 - val_loss: 153.0045 - val_mae: 10.5817
Epoch 13/15
404/404 [==============================] - 0s 133us/sample - loss: 63.2448 - mae: 5.6428 - val_loss: 70.0385 - val_mae: 6.7826
Epoch 13/15
404/404 [==============================] - 0s 136us/sample - loss: 63.4726 - mae: 5.6072 - val_loss: 65.5231 - val_mae: 6.3137
Epoch 14/15
404/404 [==============================] - 0s 137us/sample - loss: 64.7931 - mae: 5.8393 - val_loss: 69.3780 - val_mae: 5.8077
Epoch 14/15
404/404 [==============================] - 0s 131us/sample - loss: 69.3167 - mae: 5.9904 - val_loss: 90.2719 - val_mae: 6.9027
Epoch 15/15
404/404 [==============================] - 0s 130us/sample - loss: 60.3915 - mae: 5.5286 - val_loss: 58.7957 - val_mae: 5.6263
Epoch 15/15
404/404 [==============================] - 0s 138us/sample - loss: 59.7843 - mae: 5.4936 - val_loss: 60.0637 - val_mae: 5.9212
404/404 [==============================] - 0s 129us/sample - loss: 66.4969 - mae: 5.7762 - val_loss: 64.8697 - val_mae: 5.5372
CPU times: user 512 ms, sys: 163 ms, total: 675 ms
Wall time: 1min 15s
[{'trial': {'trial_id': '9f135e3f-83a6-5b7f-b5f3-af6637662c49', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 17, 'l2': 23}, 'metadata': {}, 'keys': []}, 'metric': 5.537221431732178, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 17, 'l2': 23}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.537221431732178, 'log_time': datetime.datetime(2022, 1, 17, 20, 16, 10, 702500)},
 {'trial': {'trial_id': '5679b586-e31c-5b98-84ea-2a779201a450', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 24, 'l2': 27}, 'metadata': {}, 'keys': []}, 'metric': 5.921214580535889, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 24, 'l2': 27}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.921214580535889, 'log_time': datetime.datetime(2022, 1, 17, 20, 16, 10, 699282)}]

Run with realtime monitoring#

On notebooks, it’s important (and fun) to track the progress in real time. In tune you can enable a realtime callback with a monitor to watch the progress. There are 3 types of built-in monitors for iterative problems:

  1. ts to monitor the up-to-date best metric collected

  2. hist to monitor the histogram of metrics collected

  3. rungs to monitor the trials' performance at each rung

Notice that to enable realtime monitoring, you must enable callback in execution_engine_conf.

In this example we also add a warmup period, so the first rung runs more epochs before discarding bad trials.

from typing import Optional
from tune import Monitor
from tune_notebook import (
    NotebookSimpleHist,
    NotebookSimpleRungs,
    NotebookSimpleTimeSeries,
    PrintBest,
)

def to_monitor(obj) -> Optional[Monitor]:
    if obj is None:
        return None
    if isinstance(obj, Monitor):
        return obj
    if isinstance(obj, str):
        if obj == "hist":
            return NotebookSimpleHist()
        if obj == "rungs":
            return NotebookSimpleRungs()
        if obj == "ts":
            return NotebookSimpleTimeSeries()
        if obj == "text":
            return PrintBest()
    raise NotImplementedError(obj)

TUNE_OBJECT_FACTORY.set_monitor_converter(to_monitor)
%%time
space = keras_space(HousingSpec, l1=RandInt(8,32), l2=RandInt(8,32))
warmup = 4

result = suggest_keras_models_by_sha(
    space.sample(16,0),
    plan=[(warmup+1.0,8),(2.0,4),(4.0,2),(8.0,1)],
    execution_engine_conf={"callback":True},
    monitor="rungs",
    top_n=2,
)

result
[figure: rungs monitor output (../../_images/iterative_19_0.png)]
[] 5.095827102661133 {'trial': {'trial_id': '416996d0-aff1-5def-bf8a-ab44dd136c89', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 30, 'l2': 27}, 'metadata': {}, 'keys': []}, 'metric': 5.095827102661133, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 30, 'l2': 27}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.095827102661133, 'log_time': datetime.datetime(2022, 1, 17, 20, 16, 45, 616816)}
CPU times: user 32.6 s, sys: 1.82 s, total: 34.5 s
Wall time: 31.3 s
[{'trial': {'trial_id': '416996d0-aff1-5def-bf8a-ab44dd136c89', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 30, 'l2': 27}, 'metadata': {}, 'keys': []}, 'metric': 5.095827102661133, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 30, 'l2': 27}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.095827102661133, 'log_time': datetime.datetime(2022, 1, 17, 20, 16, 45, 616816)},
 {'trial': {'trial_id': 'b02ea4d5-c747-5f4b-b0b0-c6f572b1869a', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 27, 'l2': 11}, 'metadata': {}, 'keys': []}, 'metric': 6.027777194976807, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 27, 'l2': 11}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 6.027777194976807, 'log_time': datetime.datetime(2022, 1, 17, 20, 16, 46, 708740)}]

Hyperband#

Traditional Hyperband is roughly a grid search over traditional Successive Halving configurations. In our solution we generalized Hyperband: you provide the plans as a two-dimensional array, and you can invent your own.

%%time
result = suggest_keras_models_by_hyperband(
    space.sample(16,0),
    plans=[
        [(1.0,8),(2.0,4),(4.0,2),(8.0,1)],
        [(2.0,4),(4.0,2),(8.0,1)],
        [(4.0,2),(8.0,1)],
    ],
    execution_engine=spark,
    execution_engine_conf={"callback":True},
    monitor="rungs",
    top_n=2,
)

result
[figure: rungs monitor output (../../_images/iterative_21_0.png)]
[] 5.18780517578125 {'trial': {'trial_id': '5b9e9b06-fc51-55eb-9bd0-e9334edbe89f', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 18, 'l2': 32}, 'metadata': {}, 'keys': []}, 'metric': 5.18780517578125, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 18, 'l2': 32}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.18780517578125, 'log_time': datetime.datetime(2022, 1, 17, 20, 17, 19, 6718)}
CPU times: user 3.75 s, sys: 401 ms, total: 4.15 s
Wall time: 1min 4s
[{'trial': {'trial_id': '5b9e9b06-fc51-55eb-9bd0-e9334edbe89f', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 18, 'l2': 32}, 'metadata': {}, 'keys': []}, 'metric': 5.18780517578125, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 18, 'l2': 32}, 'metadata': {}, 'cost': 8.0, 'rung': 3, 'sort_metric': 5.18780517578125, 'log_time': datetime.datetime(2022, 1, 17, 20, 17, 19, 6718)},
 {'trial': {'trial_id': 'aa517e33-1675-5c02-851f-6f0f5fb1993f', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 32, 'l2': 10}, 'metadata': {}, 'keys': []}, 'metric': 5.464710712432861, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 32, 'l2': 10}, 'metadata': {}, 'cost': 8.0, 'rung': 2, 'sort_metric': 5.464710712432861, 'log_time': datetime.datetime(2022, 1, 17, 20, 17, 33, 663062)}]

Asynchronous Successive Halving#

ASHA is an improved version of SHA. The idea is simple but the impact is big. On this Kaggle CPU instance we can’t show the full power of ASHA, but in practice, if you have a decent amount of compute resources, ASHA will outperform SHA and Hyperband on almost every aspect: search speed, cost, and output model performance.

Even on this machine, you can see that ASHA takes a lot less time than SHA while completing more searches.

%%time
warmup = 1

result = suggest_keras_models_by_continuous_asha(
    space.sample(64,0),  # ASHA will use up to 64 samples
    plan=[(warmup+1.0,8),(2.0,4),(4.0,2),(8.0,1)],
    execution_engine=spark,
    execution_engine_conf={"callback":True},
    monitor="rungs",
    top_n=2,
)

result

404/404 [==============================] - 0s 151us/sample - loss: 72.1222 - mae: 6.0137 - val_loss: 72.0536 - val_mae: 6.1482
[figure: rungs monitor output (../../_images/iterative_23_1.png)]
[] 5.4638991355896 {'trial': {'trial_id': 'f748e3d7-7dd1-56ed-8ef8-389c077f0f53', 'params': {'__space__model': '__main__.HousingSpec', 'l1': 16, 'l2': 14}, 'metadata': {}, 'keys': []}, 'metric': 5.4638991355896, 'params': {'__space__model': '__main__.HousingSpec', 'l1': 16, 'l2': 14}, 'metadata': {}, 'cost': 4.0, 'rung': 2, 'sort_metric': 5.4638991355896, 'log_time': datetime.datetime(2022, 1, 17, 20, 18, 1, 960888)}
 32/404 [=>............................] - ETA: 9s - loss: 107.7255 - mae: 7.6795
404/404 [==============================] - 1s 3ms/sample - loss: 82.8418 - mae: 6.9814 - val_loss: 77.8392 - val_mae: 6.6640
Epoch 10/16

 32/404 [=>............................] - ETA: 0s - loss: 89.2248 - mae: 6.8394

Summary#

Successive Halving and Hyperband are great if you don’t have a lot of compute resources. ASHA is great if you have sufficient compute resources.

We have generalized all these algorithms so you can fully customize the tuning plan just by providing arrays. The process is scale agnostic and platform agnostic: as long as you have a distributed framework available, you can easily scale up the search while tracking the progress in real time.