Paper Summary: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"
Abstract:
State-of-the-art object detection = region proposal algorithms + detection networks. Advances (SPPnet, Fast R-CNN) = speeding up the detection networks, which exposes region proposal computation as the bottleneck
This paper = overcomes that with convolutional features. They introduce a Region Proposal Network (RPN) that shares the conv features with the detection network.
They even upgrade this further and merge the RPN with Fast R-CNN into a single network, with the RPN acting as an "attention" mechanism
Introduction:
Right now the state of the art = R-CNNs. They were computationally expensive, but we could speed them up by sharing convolutions across proposals (proposed in Fast R-CNN)
But the bottleneck we have right now is the proposal step (aka the thing that finds the regions to scan in the first place)
There are several algorithms out there to do it. But they're all slow.
We're going to throw that out and just find the proposals with a CNN instead; more specifically, the proposal network shares its convolutions with the object detection network
When they say "share the convolutions," I think they mean both networks use the same convolutional layers. What they do is take those shared outputs and slap a few extra layers on top to determine the region bounds
This method also helps with generalizability (handling scaled images, for example)
Related Work:
Object Proposals:
Lots of literature on it already.
People group super-pixels (e.g., Selective Search) or use sliding windows
Proposals are treated like an external module, separate from the detector network
Deep Networks for Object Detection:
It's just a classifier
We just use CNNs and linear layers to help us do it
Sharing the computation of convolutions = has been picking up steam recently
Faster R-CNN:
2 modules:
CNN to spit out proposal regions
Fast R-CNN detector
The "attention" part comes from the RPN telling the detector where to look
RPN (Region Proposal Network):
We give it an image, it spits out some rectangular boxes with some "objectness" score
We're sharing layers so we're going to have some shared layers + some RPN specific layers
The specific RPN layers = an additional conv layer that slides over the shared feature map. This outputs to an intermediate layer that feeds 2 sibling heads: box regression + box classification
So each sliding-window position (how the conv layer goes over the image) can predict multiple bounding boxes at once ("anchors"). Each anchor is parameterized by a scale + an aspect ratio.
Now each box will have 4 coordinate outputs (for the box) + 2 score outputs (probability that it is or is not an object)
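As a sketch, the RPN-specific layers described above might look like this in numpy (shapes only: the 1x1 intermediate projection, the random placeholder weights, and the function name are my simplifications, not the paper's actual implementation, which uses a 3x3 sliding conv):

```python
import numpy as np

rng = np.random.default_rng(0)

def rpn_head(feat, k=9, mid=256):
    # feat: shared conv feature map of shape (C, H, W).
    # A 1x1 projection stands in for the paper's 3x3 sliding conv; two
    # sibling 1x1 convs then produce 2k objectness scores and 4k box
    # coordinates per location. Weights are random placeholders.
    C, H, W = feat.shape
    w_mid = rng.standard_normal((mid, C)) * 0.01
    w_cls = rng.standard_normal((2 * k, mid)) * 0.01
    w_reg = rng.standard_normal((4 * k, mid)) * 0.01
    x = np.maximum(0.0, np.einsum("mc,chw->mhw", w_mid, feat))  # ReLU
    scores = np.einsum("sm,mhw->shw", w_cls, x)  # 2k scores per location
    deltas = np.einsum("dm,mhw->dhw", w_reg, x)  # 4k coords per location
    return scores, deltas
```

With k = 9 anchors, every spatial position of the feature map gets 18 score channels and 36 coordinate channels, matching the 2 scores + 4 coordinates per box described above.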
Cool feature of this model = it doesn't care where the objects are; it will still find them (other methods don't guarantee this). This is called the translation-invariant property
To detect bounding boxes at multiple scales, people usually use different-size images (slow) or different-size filters (also slow). But these guys just use anchors of multiple sizes. This way, no extra computation, since our image + our convolutions = all a single size
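A minimal sketch of anchor generation at a single sliding-window position, assuming the paper's 3 scales x 3 aspect ratios (the exact parameterization here, ratio = width/height with area = scale squared, is my reading):

```python
import math

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Generate k = len(scales) * len(ratios) anchors centered at `center`.
    # Assumes ratio = width / height and each anchor has area scale**2
    # (my reading; the paper's exact parameterization may differ).
    # Boxes are returned as (x1, y1, x2, y2).
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)  # w * h = s**2 and w / h = r
            h = s / math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Three scales times three ratios gives the k = 9 anchors per position used in the paper.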
Loss = how close our box is to the ground truth. Close as in the Intersection-over-Union (IoU) of the boxes [plus a ton of math going over it]
When training we just take a sub-sample of the anchors to compute the loss (so the abundant negative anchors don't overwhelm it)
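The IoU measure used to label anchors as positive or negative can be computed like this (a standard implementation, not code from the paper):

```python
def iou(box_a, box_b):
    # Intersection-over-Union of two (x1, y1, x2, y2) boxes.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the boxes match exactly; 0.0 means they don't overlap at all.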
Training them together:
We want them to share the same layers, so we have several ways of doing this
Alternating training:
Train RPN First
RPN Proposals --> Fast R-CNN
This trains the shared convolutions; the Fast R-CNN-tuned network then re-initializes the RPN, and we alternate from there
Approximate joint training:
We merge them into one network and just train the whole thing like it was one
It's easy to implement, but a flaw is that it ignores the derivative with respect to the proposal boxes' coordinates, so the gradient is only approximate
Produces pretty close results, and reduces training time by 25-50%
Non-approximate joint training:
We fix the flaw in the approximate joint training
It's complex and "beyond the scope of this paper"
4 step alternating training:
Train the RPN (initialized from an ImageNet-pretrained model)
Use the RPN's proposals to train a separate Fast R-CNN detector (also ImageNet-initialized)
At this point the two networks don't share conv layers yet. So we use the detector network to re-initialize the RPN, freeze the shared conv layers, and fine-tune only the RPN-specific layers
Now we switch: keeping the shared layers frozen, we fine-tune the detector-specific layers
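The four steps above, summarized as data (my reading of the schedule; "shared" means the backbone conv layers, and the head names are made up):

```python
# Which components get updated at each of the 4 steps. Component and
# init labels are my own shorthand, not names from the paper.
FOUR_STEP_SCHEDULE = [
    {"step": 1, "trains": {"shared", "rpn_head"},      "init": "ImageNet"},
    {"step": 2, "trains": {"shared", "detector_head"}, "init": "ImageNet"},
    {"step": 3, "trains": {"rpn_head"},                "init": "shared convs from step 2"},
    {"step": 4, "trains": {"detector_head"},           "init": "shared convs frozen"},
]
```

The key pattern: the shared conv layers are only updated in steps 1-2; steps 3-4 freeze them so both heads end up on one common backbone.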
[They go over some implementation details and image processing stuff]
Experimentation:
[Also just a bunch of tests and stuff like that. Basically it works really well!]
Conclusion:
[Just does a TL;DR of the paper]