
Summary of "YOLO" paper:

Dickson Wu
May 21, 2021 · Last updated May 27, 2021 · 5 min read

Summary of "YOLO" paper: Note: This is a good easy to read paper too!

Abstract:

  • Before, people modified classifiers to do object detection. But these guys frame detection as regression instead: a single NN predicts bounding boxes + class probabilities in 1 pass.

  • It's hella fast too - good enough for real-time detection. Fewer background false positives, but more prone to localization errors (where exactly to put the boxes). It's also generalizable - it outperforms other methods when applied to new domains.

Introduction:

  • YOLO = You Only Look Once - that's how humans work. Our visual systems are really fast, so let's do the same with computers.

  • Current approaches:

    • Deformable parts models: a sliding-window approach - we literally take chunks of the image and try to classify each one to find the object

    • R-CNN: we plop down proposed boxes, classify them, refine the boxes, take out duplicates, and then rescore the boxes

    • All slow and hard to optimize since they're all made of multiple models

  • YOLO just does everything in 1 pass, in 1 model. It can create the bounding boxes + probabilities! This has some benefits:

    • Super fast (45 fps, or 150 fps for faster versions)

    • 2 times more accurate than other real time systems

    • It also sees the whole image at once, so it has global context - fewer background errors

    • More generalizable

    • But still not as accurate as state of the art models. Not as good at small objects

Unified Detection:

  • So YOLO takes the full image and predicts all the bounding boxes, across all the classes, simultaneously. It reasons over literally everything at once

  • We take the image and break it up into a grid. Each grid cell then predicts whether an object's center falls inside it and creates a confidence score.

  • Each grid cell outputs a bunch of bounding boxes, each one being x, y, w, h, and confidence. x, y = center of the box (relative to the cell). w, h = width and height (relative to the whole image).

  • Confidence:

    • The grid cell predicts the conditional probability that a class is here: Pr(Class_i | Object)

    • Each box has its own confidence score for the object: Pr(Object) × IOU (intersection over union with the ground-truth box)

    • At test time we just multiply them together: Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU

    • End result = a class-specific confidence for every box! (there's a small numeric sketch of this at the end of this section)

  • Some variables:

    • S = the S x S grid we're creating

    • B = number of bounding boxes each grid cell creates

    • C = number of classes that each grid cell predicts

    • So each image maps to an S x S x (B*5 + C) output tensor (see the shape sketch at the end of this section)

  • Network:

    • Regular CNN on the Pascal VOC dataset

    • First conv layers = extract features

    • The linear (fully connected) layers = output the probabilities + coordinates

    • [exact specifications of the network (like layer sizes and stuff) are in the paper] - there's a toy shape-check sketch at the end of this section

  • Training

    • Pretrained the conv layers on ImageNet for a week, then transfer learned it to detection

    • Add in some conv layers + linear layers

    • Final layer = class probabilities + bounding box coordinates

    • Normalize the outputs, then apply the loss (where we down-weight the confidence loss for boxes with no object and up-weight the box coordinate loss, so training doesn't destabilize)

    • We assign responsibility for each object to whichever of the cell's bounding boxes has the highest IOU with the ground truth

    • The loss looks like this (a sum-squared error over everything):

    • Loss = λ_coord Σ_ij 1_ij^obj [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ_coord Σ_ij 1_ij^obj [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²] + Σ_ij 1_ij^obj (C_i − Ĉ_i)² + λ_noobj Σ_ij 1_ij^noobj (C_i − Ĉ_i)² + Σ_i 1_i^obj Σ_c (p_i(c) − p̂_i(c))², where i runs over the S² grid cells, j over each cell's B boxes, and 1^obj picks out the responsible predictor

    • But it basically just takes the coordinate + confidence loss of the responsible bounding box, plus the class loss for grid cells that contain objects (there's a code sketch of this at the end of this section)

    • Then we set some hyperparameters [they go into detail in the paper]

  • Inference

    • The grid tries to divvy objects up between cells, but if an object is near a cell border we might get duplicate detections - which we can clear up using non-maximal suppression (a minimal NMS sketch is at the end of this section)

  • Limitations:

    • Spatial constraints mean we can't have lots of boxes per grid cell. It's also bad at groups of small objects (like flocks of birds)

    • It's also bad at generalizing to weird aspect ratios

    • The loss treats errors in small boxes and large boxes the same, even though a small error hurts a small box's IOU much more - so small objects get localized worse
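
Below is a tiny numeric sketch of the output shape and the test-time score described above. The S, B, C values are the paper's Pascal VOC settings; the probability and IOU numbers are made up purely for illustration.

```python
# Paper settings for Pascal VOC: S = 7, B = 2, C = 20.
S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)      # each image -> a (7, 7, 30) tensor

# Class-specific score for one (hypothetical) box at test time:
box_confidence = 0.8 * 0.7            # Pr(Object) * IOU, made-up numbers
p_class_given_object = 0.9            # the cell's Pr(Class_i | Object)
class_specific_score = p_class_given_object * box_confidence

print(output_shape, round(class_specific_score, 2))  # (7, 7, 30) 0.5
```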
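
And here is a deliberately tiny stand-in for the network, just to check that the shapes line up (448x448x3 in, 7x7x30 out). The real YOLO uses 24 conv layers; this toy stack is an assumption of mine, not the paper's architecture - only the two fully connected layers at the end mirror the paper.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

# Toy conv stack (NOT the paper's 24-layer network) followed by the paper-style
# FC(4096) -> FC(S*S*(B*5+C)) head. Leaky ReLU (slope 0.1) is what the paper uses.
tiny_yolo = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),  # 448 -> 224
    nn.MaxPool2d(2),                                              # 224 -> 112
    nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),                                              # 112 -> 56
    nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),                                              # 56 -> 28
    nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),                                              # 28 -> 14
    nn.Conv2d(128, 256, 3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),                                              # 14 -> 7
    nn.Flatten(),
    nn.Linear(256 * 7 * 7, 4096), nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),                         # linear output layer
)

x = torch.randn(1, 3, 448, 448)
out = tiny_yolo(x).view(1, S, S, B * 5 + C)
print(out.shape)  # torch.Size([1, 7, 7, 30])
```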
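
The training loss above, sketched in code. To keep it short I assume B = 1 box per cell (so no "responsible box" IOU assignment is needed) and tensors laid out as [x, y, w, h, confidence, class probabilities]; the λ values are the ones the paper uses.

```python
import torch

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights from the paper

def yolo_loss(pred, target):
    """Hedged sketch of the sum-squared-error loss for (S, S, 5 + C) tensors, B = 1."""
    obj = target[..., 4] > 0            # cells that actually contain an object
    noobj = ~obj

    # Coordinate loss, only for cells with objects; sqrt on w/h as in the paper
    xy_loss = ((pred[obj][:, :2] - target[obj][:, :2]) ** 2).sum()
    wh_loss = ((pred[obj][:, 2:4].clamp(min=0).sqrt()
                - target[obj][:, 2:4].sqrt()) ** 2).sum()

    # Confidence loss, split into object / no-object cells
    conf_obj = ((pred[..., 4][obj] - target[..., 4][obj]) ** 2).sum()
    conf_noobj = ((pred[..., 4][noobj] - target[..., 4][noobj]) ** 2).sum()

    # Classification loss, only for cells with objects
    cls_loss = ((pred[obj][:, 5:] - target[obj][:, 5:]) ** 2).sum()

    return (LAMBDA_COORD * (xy_loss + wh_loss)
            + conf_obj + LAMBDA_NOOBJ * conf_noobj + cls_loss)
```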
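
Finally, a minimal sketch of the non-maximal suppression step used at inference. It assumes boxes in (x1, y1, x2, y2) corner format and the class-specific scores from earlier; the 0.5 threshold is my choice for illustration, not a value from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping duplicates, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7]))  # [0, 2]
```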

Comparisons:

  • Deformable parts model: no need for lots of separate parts, just 1 CNN. Both do feature extraction, bounding box prediction, non-maximal suppression, and context reasoning. But we're faster and more accurate

  • R-CNN: no need for lots of parts, and much, much faster. YOLO proposes far fewer boxes per image, but there are other similarities

  • Other fast detectors: sped-up R-CNN variants are still too slow for real time, and other work only sped up DPM. YOLO is just super fast

  • Deep MultiBox: MultiBox can't do general object detection on its own (it's not as general as YOLO). But it's still similar in the way it uses a CNN to predict bounding boxes

  • OverFeat: it's good at localization, but not detection performance. It can't reason about global context, and it needs a ton of post-processing

  • MultiGrasp: it's actually the proto-version of YOLO. Very similar, but less complex and solving an easier problem

  • Compared to other real-time prediction systems: it's blazing fast and accurate

  • Compared to other state of the art detectors: It suffers the most in localization errors. But a lot less background errors.

  • And we can combine Fast R-CNN and YOLO. We literally just compare the outputted bounding boxes. It does much better but it's as slow as a Fast R-CNN

  • Worse than state of the art on accuracy, but more generalizable (e.g. it can still detect people in artwork)



Real time detection in the wild:

  • Hooked up to a webcam, YOLO functions like a tracking system - and it works!

Conclusion:

  • YOLO = easy, simple, powerful, fast, generalizable, and close to state-of-the-art accuracy.



ML Paper Collection

A Collection of Summaries of my favourite ML papers!