Summary of "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" Paper: Note: This was literally one of the best ML papers I've read. I'd recommend giving the actual paper a read. State of the art + very well written and easy to understand!
Abstract
They noticed that if you scale up depth/width/resolution carefully you can get insane results: faster, more accurate, fewer parameters
They introduce a simple compound scaling coefficient
Intro:
Scaling up CNNs = better. Scaling in depth, width, & resolution. Before, you'd typically only scale up 1 of them - scaling more than one meant lots of manual tuning, and often worse accuracy and efficiency
Traditional = arbitrarily scaling up one of the three. But it turns out the three are related - in fact, the best ratio between them is constant! They call scaling all 3 together with fixed ratios the compound scaling method.
If we're suddenly given more computation resources, we can change up our model accordingly with some ratios
On an intuitive level: the bigger your image, the more layers + the more channels you need. Thus they all scale up together at a constant ratio
This uses fewer parameters, is faster and is more accurate. And it can be transferred to lots of different benchmarks
Related Work:
ConvNet Accuracy: Gains have mostly come from making models bigger - from 6.8M to 145M to 557M parameters to reach state of the art on ImageNet
ConvNet Efficiency: We've gotten really good at squeezing CNNs onto phones, so small models are really efficient. But we don't know how to transfer that efficiency to larger models.
Model Scaling: Increasing width (channels), depth (layers) and image size = better on their own.
Compound Model Scaling:
This is what a CNN looks like:
N = the CNN itself
Circle dot (⊙) = compose the layers/stages together
F^L = the layer F, repeated L times in a stage
X = the input
H,W,C = Height, Width, Channel
To wrap up: a CNN = a bunch of stages of layers, each stage with its own H,W,C, all chained together
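For reference, a sketch of the paper's formulation (paraphrasing its notation - each stage i repeats layer F_i a total of L_i times on an input of shape ⟨H_i, W_i, C_i⟩):

```latex
% A CNN N as a composition of stages (paraphrasing the paper's Eq. 1)
N = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\!\left(X_{\langle H_i, W_i, C_i \rangle}\right)
```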
And this is how we're going to formulate the problem:
The d,r,w = what we're adjusting. d adjusts the number of layers. r adjusts the resolution. w adjusts the number of channels
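Roughly, the scaling problem the paper writes down (paraphrasing its notation - hats mark the fixed baseline architecture, and d, w, r are the knobs we turn):

```latex
% Model scaling as an optimization problem (paraphrasing the paper's Eq. 2)
\begin{aligned}
\max_{d,\,w,\,r} \quad & \mathrm{Accuracy}\big(N(d, w, r)\big) \\
\text{s.t.} \quad & N(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\!\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right) \\
& \mathrm{Memory}(N) \le \text{target memory} \\
& \mathrm{FLOPS}(N) \le \text{target FLOPS}
\end{aligned}
```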
It's really hard to find the optimum values. They're interconnected and they change depending on your resource constraints... So here's the rundown of each one:
Depth: Increasing layers = richer, more complex features - thus more generalizable. Bad because of the vanishing gradient problem, but there are ways of combating it. Also diminishing returns as it grows
Width: Increasing channels = more fine-grained features, and wider networks are easier to train. But very wide, shallow networks struggle to capture higher-level features. Diminishing returns.
Resolution: Increasing image size = lets the CNN capture more fine-grained details. But diminishing returns
The observation: Increasing any 1 of them increases accuracy, but with diminishing returns
But when the researchers started playing around with balancing them against each other, results started getting better
Observation 2 = you have to balance them all at once while scaling
Now we're not going to do it all manually, we'll just define a compound scaling method:
The user defines φ, which depends on available resources. We can find the alpha, beta and gamma values through a "small grid search"
We constrain alpha x beta^2 x gamma^2 ≈ 2 so that total FLOPS scale by roughly 2^φ - easier to set according to your compute budget
We put the ^2 on beta and gamma because doubling the width or resolution roughly 4x's the computation, while doubling the depth only doubles it
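Putting that together, the compound scaling rule (paraphrasing the paper's notation):

```latex
% Compound scaling (paraphrasing the paper's Eq. 3)
\begin{aligned}
\text{depth: } & d = \alpha^{\phi} \qquad \text{width: } w = \beta^{\phi} \qquad \text{resolution: } r = \gamma^{\phi} \\
\text{s.t. } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
```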
EfficientNet Architecture:
Oh and they build their own super efficient baseline model (EfficientNet-B0) to improve the crap out of
We apply the compound scaling method in two steps:
Step 1: Fix φ = 1 (i.e. assume ~2x more resources than the baseline) and find alpha, beta and gamma with a small grid search
Step 2: Fix alpha, beta and gamma, and just increase φ to get a family of bigger models (the different EfficientNet versions)
What's cool is that we find alpha, beta, gamma once on the small baseline model and keep them constant for the bigger models (it's expensive to search for the values directly on bigger models)
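A minimal sketch of step 2, assuming the alpha = 1.2, beta = 1.1, gamma = 1.15 values the paper reports from its grid search (the baseline config and rounding here are made-up for illustration, not the paper's):

```python
# Sketch of compound scaling: scale a baseline stage's depth, width and
# resolution by a user-chosen phi. alpha/beta/gamma are the grid-search
# values reported in the EfficientNet paper; everything else is illustrative.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution multipliers


def compound_scale(phi, base_layers, base_channels, base_resolution):
    """Return (layers, channels, resolution) scaled by compound coefficient phi."""
    layers = round(base_layers * ALPHA ** phi)            # d = alpha^phi
    channels = round(base_channels * BETA ** phi)         # w = beta^phi
    resolution = round(base_resolution * GAMMA ** phi)    # r = gamma^phi
    return layers, channels, resolution


# FLOPS grow roughly like (alpha * beta^2 * gamma^2)^phi ~= 2^phi,
# so each +1 on phi is about a 2x compute budget.
for phi in range(5):
    print(phi, compound_scale(phi, base_layers=3, base_channels=32, base_resolution=224))
```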
Experiments:
They scaled up existing baseline models - crushed them
Used the cool new baseline model they created - crushed ImageNet
Tried it on other datasets - crushed them
And you can transfer learn with it!
Discussion:
They're just better. More accuracy, fewer parameters, faster.
The class activation maps are even better - they focus on more relevant regions of the image!
Conclusion:
Just summarizing the whole thing again