Summary of "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" Paper:

Dickson Wu

May 20, 2021 · Last updated May 27, 2021 · 4 min read

Summary of "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" Paper: Note: This was literally one of the best ML papers I've read. I'd recommend giving the actual paper a read. State of the art + very well written and easy to understand!

Abstract

  • They noticed that if you carefully scale up depth/width/resolution together, you get insane results: faster, more accurate, fewer parameters

  • They introduce a simple compound scaling coefficient

Intro:

  • Scaling up CNNs = better. You can scale depth, width, & resolution. Before, people typically only scaled up 1 of them - doing more than one meant lots of manual tuning + often worse accuracy and efficiency

  • The traditional approach = arbitrarily scaling up one of the three. But it turns out the three are related - and the relationship is a constant ratio! Scaling all 3 together this way is what they call the compound scaling method

  • If we're suddenly given more computational resources, we can scale up our model accordingly using fixed ratios

  • On an intuitive level: the bigger your image, the more layers and the more channels you need. So all three dimensions should scale up together at constant ratios

  • The result uses fewer parameters, is faster, and is more accurate. And it transfers to lots of different benchmarks

Related Work:

  • ConvNet Accuracy: Accuracy has mostly been improved by making models bigger - from 6.8M to 145M to 557M parameters to reach state of the art on ImageNet

  • ConvNet Efficiency: We've squeezed CNNs onto phones, so they can be made really efficient. But we don't know how to transfer that efficiency to larger models

  • Model Scaling: Increasing width (channels), depth (layers), and image size (resolution) each improves accuracy on its own

Compound Model Scaling:

  • This is what a CNN looks like:

  • N = ⨀_{i=1..s} F_i^{L_i}(X_{⟨H_i, W_i, C_i⟩})

    • N = the CNN itself

    • Circle dot (⨀) = the operator that chains the stages/layers together

    • F^L = layer F repeated L times (one stage)

    • X = the input

    • H,W,C = Height, Width, Channel

    • To wrap up: a CNN = a bunch of stages, each with its own H, W, C, all chained together

  • And this is how we're going to formulate the problem:

  • max_{d,w,r} Accuracy(N(d, w, r)), subject to: stage i repeats its layer d·L_i times on an input of size (r·H_i) × (r·W_i) with w·C_i channels, while Memory(N) ≤ target_memory and FLOPS(N) ≤ target_flops

    • The d, r, w = what we're adjusting: d scales the number of layers, r scales the resolution, and w scales the number of channels (see the sketch below)
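Here's a minimal Python sketch of that formulation (my own illustration, not the paper's code): a network is just a list of stages, and d, w, r multiply each stage's layer count, channel count, and input resolution. The baseline stage shapes below are made up.

```python
import math

# A CNN as a list of stages: each stage repeats a layer L_i times on an
# input of shape <H_i, W_i, C_i>. The numbers here are invented.
baseline = [
    # (stage name, layers L_i, height H_i, width W_i, channels C_i)
    ("stage1", 1, 224, 224, 32),
    ("stage2", 2, 112, 112, 16),
    ("stage3", 3, 56, 56, 24),
]

def scale_network(stages, d, w, r):
    """Apply the scaling from the formulation above: d multiplies the
    layer count, w multiplies the channels, r multiplies the resolution."""
    scaled = []
    for name, layers, height, width, channels in stages:
        scaled.append((
            name,
            math.ceil(d * layers),    # depth: more repeated layers
            math.ceil(r * height),    # resolution: bigger feature maps
            math.ceil(r * width),
            math.ceil(w * channels),  # width: more channels
        ))
    return scaled

# e.g. 20% deeper, 10% wider, 15% higher resolution than the baseline
print(scale_network(baseline, d=1.2, w=1.1, r=1.15))
```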

  • It's really hard to find the optimal values - they're interconnected, and they change depending on your resource constraints... So here's the rundown of each one:

    • Depth: More layers = richer, more complex features and better generalization. Downsides: vanishing gradients (though skip connections, batch norm, etc. help combat it) and diminishing returns as the network gets very deep

    • Width: More channels = more fine-grained features, and wider networks are easier to train. But very wide, shallow networks struggle to capture higher-level features. Diminishing returns here too

    • Resolution: Bigger input images = the CNN can capture more fine-grained patterns. But, again, diminishing returns

  • Observation 1: Scaling up any single dimension improves accuracy, but the gains diminish as models get bigger

  • But when the researchers started balancing the dimensions against each other, results got noticeably better

  • Observation 2: to get better accuracy and efficiency, you have to balance width, depth, and resolution together while scaling

  • Now, we're not going to do that balancing manually - we just define a compound scaling method:

  • depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ,  subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1

  • The user picks φ based on the resources available; the α, β, and γ values come from a "small grid search"

  • We constrain α · β² · γ² ≈ 2 so that total FLOPS scale by roughly 2^φ - which makes φ easy to set according to your compute budget

  • The ² goes on β and γ because doubling the width or the resolution quadruples the computation, while doubling the depth only doubles it
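To make the compound coefficient concrete, here's a rough Python sketch (again mine, not the paper's code), plugging in the α = 1.2, β = 1.1, γ = 1.15 values the paper reports from its grid search. Note how the FLOPS factor comes out to roughly 2^φ:

```python
# Values the paper reports from its small grid search on the baseline;
# they satisfy alpha * beta**2 * gamma**2 ≈ 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Turn the user-chosen budget phi into the three scaling factors."""
    d = alpha ** phi   # depth multiplier
    w = beta ** phi    # width (channel) multiplier
    r = gamma ** phi   # resolution multiplier
    # FLOPS grow roughly with d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi ≈ 2^phi
    flops_factor = d * w ** 2 * r ** 2
    return d, w, r, flops_factor

for phi in range(1, 4):
    d, w, r, flops = compound_scale(phi)
    print(f"phi={phi}: d={d:.2f}  w={w:.2f}  r={r:.2f}  ~{flops:.1f}x FLOPS")
```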

EfficientNet Architecture:

  • Oh, and they build their own super efficient baseline model (EfficientNet-B0, found with neural architecture search) to improve the crap out of

  • We use the compound scaling method in two steps:

  • STEP 1: fix φ = 1 (i.e. assume ~2x more resources are available) and do a small grid search for α, β, γ under the constraint α · β² · γ² ≈ 2 - for their baseline they find α = 1.2, β = 1.1, γ = 1.15

  • STEP 2: fix α, β, γ as constants and scale the baseline up with different values of φ

  • We end up with a whole family of models (EfficientNet-B0 up through B7) - each one just uses a bigger φ

  • What's cool is that we find α, β, γ once on the small baseline model and keep them fixed - searching for them directly on bigger models would be far too expensive
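Here's a rough sketch of that two-step recipe (mine, not the paper's code; train_and_evaluate is a hypothetical stand-in for actually training the scaled baseline and measuring its accuracy):

```python
import itertools

def step1_grid_search(train_and_evaluate,
                      candidates=(1.0, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3)):
    """STEP 1: fix phi = 1 (assume ~2x more resources) and grid-search
    alpha, beta, gamma under the constraint alpha * beta^2 * gamma^2 ≈ 2."""
    best, best_acc = None, -1.0
    for a, b, g in itertools.product(candidates, repeat=3):
        if abs(a * b ** 2 * g ** 2 - 2.0) > 0.1:
            continue                 # too far from the ~2x-FLOPS target
        acc = train_and_evaluate(d=a, w=b, r=g)
        if acc > best_acc:
            best, best_acc = (a, b, g), acc
    return best

def step2_scale_family(alpha, beta, gamma, phis=range(1, 8)):
    """STEP 2: freeze alpha, beta, gamma (found once on the small baseline)
    and crank up phi to get the bigger family members (B1, B2, ...)."""
    return [(phi, alpha ** phi, beta ** phi, gamma ** phi) for phi in phis]

# e.g. with the values the paper reports for its baseline:
print(step2_scale_family(1.2, 1.1, 1.15))
```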

Experiments:

  • They scaled up existing baseline models - crushed them

  • Used the cool new baseline model they created - crushed ImageNet

  • Tried it on other datasets - crushed them

  • And you can transfer learn with it!

Discussion:

  • They're just better. More accuracy, fewer parameters, faster.

  • Even the class activation maps are better - the compound-scaled model focuses on more relevant regions with more detail

Conclusion:

  • Just summarizing the whole thing again

