
Paper Summary: "A System for Massively Parallel Hyperparameter Tuning"

Dickson Wu · Jun 23, 2021 · 10 min read


TL;DR - We take SHA (each configuration gets a small budget; they run and get evaluated; we keep the top 1/n and redistribute the budget of the eliminated configurations to the survivors, then repeat) and turn it into ASHA --> by modifying it to scale to massive parallelism through asynchronous promotions.

Why does this paper matter? It's state of the art, it's a general solution (it works for all machine learning models, not just NNs), and it takes advantage of parallelism.

Abstract:

ML models have big hyperparameter spaces and long training times. But we're getting more parallel computing + more demand for ML in production --> this drives the need for scalable hyperparameter optimization.

ASHA is a simple + robust state-of-the-art hyperparameter optimization algorithm. It uses parallelism + aggressive early stopping to achieve its results (even beating PBT).

Introduction:

Modern ML models = super successful, but they're still really sensitive to hyperparameters. Here are 4 trends/reasons why ASHA was developed:

1. Models are growing in complexity. Now we have dozens of hyperparameters, which interact with each other in unknown ways --> producing thousands of different settings to evaluate.

2. Datasets are growing + models are getting more complex = longer training times + more expense. This trend is horrible for testing out hyperparameters.

3. The combination of 1 + 2 means that if we do it sequentially (try out one thing, then use that information to set up the next experiment), it would take years. So the next available option is to evaluate configurations with parallel computing.

4. ML is being used more and more in production (rather than only in R&D). Thus we need mature infrastructure to help us with hyperparameter optimization.

ASHA (standing for Asynchronous Successive Halving Algorithm) tackles all these problems. It's a production-quality hyperparameter optimization algorithm that supports massive parallelism.

ASHA was inspired by the Successive Halving Algorithm (SHA), a theoretically grounded algorithm that uses early stopping + gives more resources to promising runs. Like SHA, ASHA works through tons and tons of hyperparameter configurations (the large-scale regime).

They compared it against other techniques (Fabolas, PBT, BOHB) and SHA/ASHA beat them all. ASHA one-ups SHA by adding the massive parallelization. ASHA is also simple, theoretically grounded, and scales linearly with the number of workers.

Related Work:

Sequential Methods: There are 2 ways they do it: adaptively selecting configurations or adaptively evaluating configurations.

Selection methods = find promising regions of the hyperparameter space, then sample configurations from those regions. But they need past results to find those good regions, which means they aren't good for massive scaling.

Evaluation methods = early-stop the bad configurations and divert resources to the good ones. But they rely on hand-crafted, rule-based criteria to terminate the bad configurations.

SHA + Hyperband don't have that same drawback. SHA sits inside of Hyperband: Hyperband runs multiple SHA brackets with different early-stopping rates, determining what to stop and what to keep running. They do aggressive early stopping, which helps a ton. (ASHA focuses on taking SHA and allowing it to scale up.)

There are also some hybrids between the selection + evaluation methods. SHA/Hyperband > SMAC, while Fabolas sometimes beats SHA and sometimes loses. But when trying to scale Fabolas, it underperforms.

Another hybrid = Hyperband + selection. BOHB combines Bayesian optimization with Hyperband, and another line of work also parallelized SHA --> but both perform worse than ASHA. The same trick of combining with selection methods could also be applied to ASHA, though.

Parallel Methods: 2 methods are talked about in this section:

PBT: It's an evolutionary approach that also does early stopping. But the 2 downsides are that it has no theoretical guarantees, and it's designed for neural networks, so it's not generalizable.

Vizier is a black-box optimization service that Google offers. It also has an early-stopping option, but that only gives a 3x improvement; ASHA provides an order-of-magnitude speedup (10x).

Hyperparameter Optimization Systems: There are tons of these systems out there, offering lots of different algorithms. But the authors are only interested in massive parallelization, since ASHA is specifically suited for it. In the paper, they also provide systems-level optimizations for ASHA that would benefit other hyperparameter optimization algorithms.

Other systems focus on resource management; ASHA focuses on efficiency. Also, other systems do user-specified scaling, while these guys go for automated scaling.

ASHA Algorithm:

First let's dive into SHA, then into ASHA.

SHA = give each configuration a small budget. You let them run and do their thing. Then you evaluate them, keep the top 1/n (e.g., if n were 2, we'd keep the top 50%), and multiply the survivors' budgets by n (2 in this example). The total computation budget per rung stays the same throughout the whole process!

Budget could be epochs, number of training examples, number of random features etc.

The formal algorithm is below:


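Here's a minimal Python sketch of SHA that follows the rundown below (run_then_return_val_loss and top_k echo names from the rundown; get_config and the rest are my stand-ins, so treat this as an approximation of the paper's pseudocode rather than a copy of it):

```python
def top_k(configs, losses, k):
    """Keep the k configurations with the lowest validation loss."""
    ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
    return [config for _, config in ranked[:k]]


def successive_halving(get_config, run_then_return_val_loss, n, r, R, eta, s=0):
    """Rough sketch of SHA. get_config samples a random configuration;
    run_then_return_val_loss(config, resource) trains it and returns the
    validation loss. eta is the reduction factor the post calls "n"."""
    # 1. s_max = how many times we can multiply r by eta before exceeding R
    s_max, budget = 0, r
    while budget * eta <= R:
        budget *= eta
        s_max += 1
    # 2. make sure at least one configuration survives to the full resource R
    assert n >= eta ** (s_max - s)
    # 3. T = the population of configurations we're training
    T = [get_config() for _ in range(n)]
    for i in range(s_max - s + 1):
        n_i = int(n * eta ** (-i))          # 4. configurations alive at rung i
        r_i = r * eta ** (i + s)            # 5. resource given to each of them
        # 6. L = validation loss of every configuration at this resource
        L = [run_then_return_val_loss(config, r_i) for config in T]
        # 7. keep only the top 1/eta of the population
        T = top_k(T, L, max(1, n_i // eta))
    return T[0]
```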

Rundown of the algorithm:

  1. s_max = the max number of halvings we can do (example: if R = 1000, r = 1, & n = 10, then log_10(1000/1) = 3, so we can go through the halving step 3 times)

  2. Then we just check that n is large enough that at least one configuration will make it all the way to the full resource R

  3. T = just the population of models that we're training

  4. n_i = the number of models in the population right now (used later on in the top_k)

  5. r_i = the resources allocated to each configuration

  6. L = all the losses - we're inputting the parameters (which all come from the population) and the resources for each configuration. It's just going to run it and return the validation loss for each model

  7. T = we update the population. We take the top_k models. Input the current population, losses, and the percent that we're keeping

When we have a lower s-value, the early stopping rate becomes more aggressive (since each configuration gets fewer resources).

Now before moving forward, here's some terminology we need: each SHA run with a different s value is called a bracket. Within a bracket, each promotion level is called a rung.



Figure (a) shows the promotion scheme. We start off with 9 configurations on rung 0. Since the eta value (n) is 3, we keep the top 1/3, leaving 3 configurations on rung 1. Then we cut again, keeping the top 1/3, which leaves 1 remaining.

Figure (b) shows that within each bracket, the total budget remains the same throughout the rungs. But if you increase the s value by 1, we multiply the total budget by eta (n).
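
To see why (a quick back-of-the-envelope check using the n_i and r_i definitions from the rundown above, not a quote from the paper): the budget spent on rung i of a bracket with early-stopping rate s is roughly

$$n_i \cdot r_i \approx \left(n\,\eta^{-i}\right)\left(r\,\eta^{\,i+s}\right) = n\,r\,\eta^{s},$$

which doesn't depend on i, so every rung within a bracket costs about the same; and since it scales as η^s, bumping s up by 1 multiplies that per-rung cost by η.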

How do we parallelize this? A straightforward way would be just adding more machines. But that's slow. The formula for the length of training looks like this:


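Roughly speaking (a hedged reconstruction, treating the leftmost factor as simply the number of rungs for the most aggressive bracket, s = 0; the paper's exact expression may differ):

$$\text{time}_{\text{SHA}} \approx \underbrace{\left(\left\lfloor \log_\eta (R/r) \right\rfloor + 1\right)}_{\text{number of rungs}} \times \text{time}(R)$$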

The whole part on the left is the number of rungs we have. So if we have 3 rungs, it'll take 3 x time(R). But the paper's ASHA runs at time(R).

Another way would be to take all the surviving configurations and spread them out across machines + boot up new brackets whenever existing ones finish. But that's also slow because: we have to wait until every configuration in a rung is done before we can pick the top 1/n (stragglers hold everything up), and adding more brackets doesn't help improve the top 1/n.

Alright! Now let's get into ASHA:

To promote the maximum amount of parallelism, we just make it super asynchronous. So ASHA promotes configurations to the next rung whenever possible (so no need to wait for everyone to finish). If we can't promote anyone, we just add new configurations to the base rung. Inputs are the same as SHA, except no need to input the number of configurations.

The time it takes ASHA to run is quick. The paper works out how many machines are needed, how long it takes to advance a configuration through rung i, and then sums over the rungs to get the total training time. The key is that the total is always less than 2 time(R)!
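
One way to see where that bound comes from (a rough sketch, assuming training time grows roughly linearly in the resource, so time(r·η^i) ≈ (r·η^i / R)·time(R)): a configuration promoted all the way to the top rung trains for a geometric series of resources, and that series is dominated by the final rung:

$$\sum_{i=0}^{\lfloor \log_\eta (R/r) \rfloor} \text{time}\!\left(r\,\eta^{i}\right) \;\approx\; \frac{\eta R - r}{(\eta - 1)\,R}\,\text{time}(R) \;\le\; \frac{\eta}{\eta - 1}\,\text{time}(R) \;\le\; 2\,\text{time}(R) \quad \text{for } \eta \ge 2.$$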

Since Hyperband just runs a collection of SHA brackets, we can do the same thing with ASHA brackets to get an asynchronous Hyperband.

Formally the ASHA algorithm looks like:


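As a rough single-process sketch of that loop (get_job and report are the scheduler pieces sketched after the get_job explanation below; run_then_return_val_loss is assumed to train a configuration for the given resource and return its validation loss; a real deployment would dispatch these jobs to free workers asynchronously):

```python
def asha_driver(get_job, report, run_then_return_val_loss, r, eta, s=0, max_jobs=1000):
    """Single-process stand-in for ASHA's outer loop. Real ASHA hands each job to
    whichever worker is free and handles completions asynchronously as they come
    back; here we just run the jobs inline, one after another."""
    for _ in range(max_jobs):                 # "repeat until desired"
        config, k = get_job()                 # promote if possible, else a new config at rung 0
        r_k = r * eta ** (k + s)              # the resource that rung k gets
        loss = run_then_return_val_loss(config, r_k)
        report(config, k, loss)               # record the loss in rung k (bookkeeping only)
```

Here `config` is whatever identifies a configuration to your training function (the hyperparameters themselves, or an id the training code resolves).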

Explanation:

  1. Repeat this whole function until we want to finish (until desired)

  2. If we have a free worker: we grab a configuration and its rung number from the get_job function, then launch a run that trains it with that rung's resource and returns the validation loss

  3. If we have a job that's completed, we record its validation loss in that configuration's rung (this is just bookkeeping so get_job can consider it for promotion later - we're not updating model weights with backpropagation here)

For the get_job function:


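Continuing the sketch from above, here's one way the rung bookkeeping behind get_job could look (again an approximation under the same assumptions, with the numbered comments matching the explanation below):

```python
class AshaScheduler:
    """Sketch of the rung bookkeeping behind get_job/report - an approximation of
    the paper's get_job procedure, not a verbatim copy."""

    def __init__(self, sample_config, r, R, eta, s=0):
        self.sample_config = sample_config          # e.g. a random-search sampler
        self.eta = eta
        s_max, budget = 0, r                        # number of halvings that fit in [r, R]
        while budget * eta <= R:
            budget *= eta
            s_max += 1
        self.num_rungs = s_max - s + 1
        self.rungs = [dict() for _ in range(self.num_rungs)]    # rung k: config id -> loss
        self.promoted = [set() for _ in range(self.num_rungs)]  # ids already promoted out of rung k
        self.configs = {}                                       # config id -> hyperparameters

    def get_job(self):
        # 1. loop through the rungs, top rung first
        for k in reversed(range(self.num_rungs - 1)):
            rung = self.rungs[k]
            # 2. the top 1/eta candidates of this rung (lowest loss first)
            keep = len(rung) // self.eta
            candidates = sorted(rung, key=rung.get)[:keep]
            # 3. any of them promotable (i.e. not already promoted)?
            promotable = [c for c in candidates if c not in self.promoted[k]]
            if promotable:
                # 4. promote: same configuration, rung number + 1
                self.promoted[k].add(promotable[0])
                return promotable[0], k + 1
        # 5. nothing promotable: grow the base rung with a fresh random configuration
        new_id = len(self.configs)
        self.configs[new_id] = self.sample_config()
        return new_id, 0

    def report(self, config_id, rung, loss):
        """Record a completed job's validation loss in its rung."""
        self.rungs[rung][config_id] = loss
```

Wired to the driver sketch above: get_job/report are the scheduler's bound methods, and run_then_return_val_loss is assumed to map a config id back to its hyperparameters (e.g. via scheduler.configs) before training.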

Explanation:

  1. We loop through the rungs (from the top rung down, so we prefer promoting from higher rungs first):

  2. We check all the top candidates of this rung

  3. Search the candidates to see if any are promotable (in the top 1/n and not already promoted)

  4. If they are then just return the same parameters + increase the rung by 1

  5. Else we just create a random parameter at rung 0 and return it

The trade-off that makes ASHA so fast is that it sometimes promotes things that shouldn't be promoted - configurations that aren't actually in the top 1/n. But as the rungs go up, the chance that a wrongly promoted configuration keeps surviving becomes vanishingly small.

Another cool part of ASHA is that the maximum resource per configuration doesn't have to be fixed up front - it can naturally be grown, so a configuration could in principle keep training indefinitely. SHA didn't allow that easy change-up. ASHA can also use intermediate losses to report the best configuration so far, whereas SHA needs to wait until the bracket is done to find the best.

Empirical Evaluation:

They did some experiments with different hyperparameter optimizers:

This is with only 1 worker (so sequential). ASHA isn't doing quite as well (because of the mis-promotions); SHA and ASHA are almost the same as BOHB.



Now with more workers (25 workers): ASHA is way faster than all the other methods. Towards the end it's comparable to PBT and BOHB - but ASHA is still a more principled and general approach than both.



For neural network tuning specifically (16 workers): ASHA is again faster + obtains better results than the other 2 methods.



For large-scale performance (500 workers): similar-ish results, but ASHA is still faster and gets better final results.



Productionizing ASHA:

The authors were implementing ASHA into a platform and they came across some design decisions:

Usability: ASHA has tons of hyperparameters of its own, which can be cumbersome to use. So they simplified the interface down to only the intuitive parameters and left the other hyperparameters to advanced users.

[they then recommend some pre-set hyperparameters based on previous work + experimentation + go through the stopping mechanism (which is n) and the rationale behind it]

Automatic Scaling of Parallel Training: When we give more resources to models in higher rungs, we end up training them for longer and longer periods. The speedup from training is not linear in the number of GPUs spent - we get diminishing returns.



Additionally, we have to start trading off training speed, latency, and throughput when allocating resources.

Resource Allocation: Other systems use static resource allocation + schedule jobs in a first-in, first-out fashion. Both of these are sub-optimal. [They implement a scheduler that tackles these problems.]

Reproducibility in Distributed Environments: ASHA has some challenges with reproducibility. First, we pause a ton of jobs and start new ones, which is effectively random, so we need good checkpointing to be able to backtrack. And since promotions happen asynchronously, we also have to keep track of the order of promotions.

Conclusion:

ASHA is a production-quality, general hyperparameter tuning algorithm. It's theoretically principled, simple, robust, and can take advantage of massive parallelism. It outperforms the state of the art, and the paper walks through the design decisions needed to put it into production.


