Summary of the "YOLO" paper. Note: this is a good, easy-to-read paper too!
Abstract:
Before, people repurposed classifiers to do object detection. These guys instead frame detection as a regression problem: 1 NN predicts bounding boxes + class probabilities in 1 pass.
It's hella fast too - good enough for real-time detection. It makes fewer false positives on background, but it's more prone to localization errors (where to put the boxes). It's also generalizable + outperforms other methods when moving to new domains like artwork
Introduction:
YOLO = You Only Look Once, which is how humans work: one glance and we know what's there. Our visual systems are really fast - let's do the same with computers
Current approaches:
Deformable parts models: We literally slide a window over the image and classify each chunk to try and find the object
R-CNN: we plop down proposed boxes, classify them, refine the boxes, take out duplicates, and then rescore the boxes
All slow and hard to optimize since they're made of multiple parts that have to be trained separately
YOLO just does everything in 1 pass, in 1 model. It can create the bounding boxes + probabilities! This has some benefits:
Super fast (45 fps, or 150+ fps for the fast version)
More than 2x the mean average precision of other real-time systems
It also sees the whole image at once, so it has context: fewer background errors
More generalizable
But still not as accurate as state of the art models. Not as good at small objects
Unified Detection:
So YOLO takes the full image and predicts bounding boxes for every class at the same time. It reasons about literally the whole image
We take the image and break it up into an S x S grid. If the center of an object falls in a grid cell, that cell is responsible for detecting it. Each cell predicts whether an object is in there and gives a confidence score.
Each grid cell outputs a bunch of bounding boxes, each made of x, y, w, h, and confidence. x,y = center of the box (relative to the grid cell). w,h = width and height of the box (relative to the whole image).
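To make the grid idea concrete, here's a minimal sketch of how a ground-truth box could be mapped to its grid cell and turned into x, y, w, h targets. The function name `encode_box` and the pixel-based input format are my own assumptions for illustration, not code from the paper.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box (pixel center + pixel size) to its grid cell and targets.

    Hypothetical helper: name and input format are assumptions.
    """
    # Scale the normalized center into grid units (0..S).
    gx = cx / img_w * S
    gy = cy / img_h * S
    # The cell containing the center is the responsible one.
    # Clamp so a center exactly on the right/bottom edge stays in the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # x, y are offsets inside the cell; w, h are fractions of the whole image.
    x = gx - col
    y = gy - row
    return row, col, (x, y, w / img_w, h / img_h)


row, col, target = encode_box(cx=224, cy=112, w=100, h=50, img_w=448, img_h=448)
print(row, col, target)
```

With a 448 x 448 image and S = 7, each cell is 64 pixels wide, so a center at (224, 112) lands in row 1, column 3.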
Confidence:
The grid cell predicts the probability of each class, given that an object is there: Pr(class | object)
Each box has its own confidence score = Pr(object) x IOU (intersection over union) between the predicted box and the ground truth
Then we just multiply them together
End result = class-specific confidence for each box!
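A rough sketch of the two pieces above: an IOU function and the multiplication that gives class-specific scores. I'm assuming boxes as (x1, y1, x2, y2) corners here because it makes the overlap math simpler; the names `iou` and `class_scores` are mine, not the paper's.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if there is one).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_scores(class_probs, box_confidence):
    """Multiply the cell's Pr(class | object) by one box's confidence."""
    return [p * box_confidence for p in class_probs]


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # two 2x2 boxes sharing a 1x1 corner
scores = class_scores([0.5, 0.25], box_confidence=0.5)
print(overlap, scores)
```

At training time the box confidence target is Pr(object) x IOU with the ground truth; at test time the network just outputs that number directly.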
Some variables:
S = the grid size (we split the image into S x S cells)
B = number of bounding boxes each grid cell predicts
C = number of classes each grid cell predicts probabilities for
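These three numbers fix the size of the network's output: every cell predicts B boxes with 5 values each, plus C class probabilities. A tiny sketch (the helper name `output_shape` is my own), using the Pascal VOC settings from the paper, S = 7, B = 2, C = 20:

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: one (B * 5 + C) vector per grid cell."""
    # 5 values per box: x, y, w, h, confidence.
    return (S, S, B * 5 + C)


shape = output_shape(S=7, B=2, C=20)
total_boxes = 7 * 7 * 2  # boxes predicted per image
print(shape, total_boxes)
```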
Network:
Regular CNN, evaluated on the Pascal VOC dataset
First conv layers = extract features
The linear (fully connected) layers = output probabilities + coordinates
[exact specifications of the network (like layer sizes and stuff) are in the paper]
Training:
Pretrained the conv layers on ImageNet for about a week, then transfer learned them to detection
Add in some conv layers + linear layers
Final layer = class probabilities + bounding box coordinates
Normalize the outputs, then compute a sum-squared error loss, where we weight the box coordinate loss up and the confidence loss for boxes with no object down (so all the empty cells don't destabilize the model)
We assign responsibility for an object to the predicted bounding box with the highest IOU with the ground truth
The loss looks like this:
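Written out in LaTeX, as best I can reconstruct it from the paper, so treat the notation as a sketch. The symbol $\mathcal{L}$ is my own label for the total loss. Hats mark the targets, $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ in cell $i$ is responsible for an object, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when cell $i$ contains an object. $C_i$ here is the box confidence, not the number of classes.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The paper uses $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$. The square roots on $w$ and $h$ are there so that the same absolute error counts for more on a small box than on a big one.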
But it basically just takes the box + confidence loss of the responsible bounding box, and the classification loss only for grid cells that actually have an object in them
Then we set some hyperparameters [they go into detail in the paper]
Inference:
The grid tries to divvy up the objects between cells, but big objects or objects near a cell border can get detected by multiple cells. We can clear up those duplicates using non-maximal suppression
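For reference, a minimal sketch of greedy non-maximal suppression: keep the highest-scoring box, drop anything that overlaps it too much, repeat. The corner box format, the function names, and the 0.5 threshold are my assumptions, not values from the paper.

```python
def _iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first.

    iou_threshold=0.5 is an assumed default for illustration.
    """
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it doesn't overlap too much with any kept box.
        if all(_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In practice this would be run per class, so a duplicate "dog" box never suppresses a nearby "cat" box.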
Limitations:
Spatial constraints: each grid cell only predicts a few boxes and a single class, so it's bad at groups of small objects (like flocks of birds)
It's also bad at generalizing to objects with new or weird aspect ratios
The loss treats errors in small boxes the same as errors in big boxes, even though a small error hurts the IOU of a small box way more
Comparisons:
Deformable parts model: No need for lots of parts, just 1 CNN. Both do feature extraction, bounding box prediction, non-maximal suppression, and context reasoning, but DPM uses a separate piece for each while YOLO does it all in one network. And we're faster and more accurate
R-CNN: No need for lots of parts, and much much faster. YOLO proposes far fewer boxes per image (98 vs. about 2000), but there are some similarities
Other fast detectors: Fast and Faster R-CNN = still too slow for real time, and other work only sped up DPM. YOLO is just super fast by design
Deep MultiBox: MultiBox can't do general object detection on its own, it's just one piece of a bigger pipeline. But it's still similar in the way it uses CNNs to predict bounding boxes
OverFeat: It's optimized for localization, not detection performance. It's not good at global context. And it needs a ton of post-processing
MultiGrasp: It's actually the proto-version of YOLO. Very similar grid approach, but for a much simpler task (grasp detection)
Compared to other real-time detection systems: it's blazing fast and accurate
Compared to other state-of-the-art detectors: It suffers the most from localization errors, but makes a lot fewer background errors.
And we can combine Fast R-CNN and YOLO. We literally just compare the outputted bounding boxes: if both models predict a similar box, that prediction gets a boost. It does much better, but it's as slow as Fast R-CNN
Still worse than state of the art, but more generalizable (e.g. detecting people in artwork)
Real time detection in the wild:
Hooked up to a webcam, it functions like a tracking system, and it works!
Conclusion:
YOLO = easy, simple, powerful, fast, generalizable, close to state-of-the-art accuracy.