Paper Summary: "Rich feature hierarchies for accurate object detection and semantic segmentation"
Abstract:
Old models for object detection = take lots of little models/features and cobble them together into one complex system --> Performance hasn't improved much in the past few years (as of 2014)
These guys created one simpler model that beats the old ones
Using CNNs + pre-training on labeled images
Introduction:
In 2012, CNNs made their debut to the world by coming 1st on ImageNet --> the authors just transfer that success to object detection
With CNNs we could try the classic sliding-window trick (used for ~2 decades!!) --> But with deep networks it's very hard to do precise localization that way
Instead they pair the CNN with the "recognition using regions" paradigm.
How it works:
Generate ~2000 region proposals on the input image (so it takes the image and breaks it up into little chunks that could be objects, like for a mug: the handle, the round circle part, the walls, etc.)
The CNN will produce a feature vector for each region proposal
An SVM (support vector machine) will then classify each region
That's why this is called R-CNN: Regions with CNN features. (A rough sketch of the whole pipeline is below.)
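A minimal sketch of the three-stage pipeline in plain NumPy. Everything here is a stand-in: `propose_regions` and `extract_features` are hypothetical placeholders for selective search and the CNN forward pass, the SVM weights are random, and the NMS threshold is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_regions(image, n=2000):
    """Hypothetical stand-in for selective search: n random (x1, y1, x2, y2) boxes."""
    h, w = image.shape[:2]
    xs = np.sort(rng.integers(0, w, size=(n, 2)), axis=1)
    ys = np.sort(rng.integers(0, h, size=(n, 2)), axis=1)
    return np.stack([xs[:, 0], ys[:, 0], xs[:, 1] + 1, ys[:, 1] + 1], axis=1)

def extract_features(image, box):
    """Placeholder for the CNN forward pass: warp the region, run it through the
    network, get a 4096-d feature vector. Here it's just random numbers."""
    return rng.normal(size=4096)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.3):
    """Greedy non-maximum suppression: keep the best box, drop overlaps, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        mask = np.array([iou(boxes[i], boxes[j]) < thresh for j in rest], dtype=bool)
        order = rest[mask]
    return keep

image = rng.integers(0, 256, size=(375, 500, 3), dtype=np.uint8)   # dummy image
boxes = propose_regions(image)                                      # 1) ~2000 region proposals
feats = np.stack([extract_features(image, b) for b in boxes])       # 2) one feature vector per region

n_classes = 20                                                      # e.g. the 20 PASCAL VOC classes
W = rng.normal(size=(n_classes, 4096)) * 0.01                       # per-class linear SVM weights
scores = feats @ W.T                                                # 3) class-specific matrix-vector product
keep = nms(boxes, scores[:, 0])                                     # suppress overlapping boxes for class 0
print(len(boxes), "proposals ->", len(keep), "detections for class 0")
```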
Another problem is training data: there just isn't enough labeled detection data to train the network from scratch. Conventionally you'd use unlabeled data for unsupervised pre-training, but they instead pre-train with labeled data (supervised pre-training on ImageNet) and then fine-tune!
R-CNN = efficient too. The CNN features are shared across all classes, so the only class-specific work is a small matrix-vector product (the SVM scores) plus non-maximum suppression, which keeps it fast.
Object detection with R-CNN:
To recap the three parts of the model are:
Region proposals
CNN to produce feature vectors
SVM to classify
Region Proposals:
They just cited some prior work (selective search) and said they'll be using it haha
CNN:
4096-dimensional feature vector output
It's just 5 conv + 2 fully connected layers for the CNN (basically Krizhevsky et al.'s AlexNet architecture) - they just went with an existing, simple model
The regions might be different sizes - so each region is always warped (stretched or squished) to a fixed size (227x227) before going into the CNN
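A quick sketch of that warping step with Pillow; `warp_region` is a made-up helper, and 227x227 is the fixed input size from the paper. (The paper also pads the proposal with a bit of surrounding image context before warping, which this sketch skips.)

```python
import numpy as np
from PIL import Image

def warp_region(image, box, size=227):
    """Crop the proposal out of the image and warp (stretch/squish) it to size x size."""
    x1, y1, x2, y2 = box
    crop = Image.fromarray(image[y1:y2, x1:x2])
    return np.asarray(crop.resize((size, size)))

image = np.random.randint(0, 256, size=(375, 500, 3), dtype=np.uint8)  # dummy image
patch = warp_region(image, (40, 60, 300, 180))
print(patch.shape)  # (227, 227, 3), whatever shape the proposal originally was
```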
At the end: 13s/image on GPU, or 53s/image on CPU
For pre-training: they trained on a huge labeled dataset for classification (ImageNet). The dataset = super general, but they can then fine-tune (specialize) on the smaller object detection dataset
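A hedged PyTorch sketch of that pre-train-then-fine-tune idea, using torchvision's AlexNet as a stand-in for the paper's network (the paper used a Caffe implementation; the 21 outputs, batch size, and fake labels here are assumptions for illustration only).

```python
import torch
import torch.nn as nn
from torchvision import models

# 1) Supervised pre-training: in practice you'd load ImageNet classification weights here,
#    e.g. models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).
cnn = models.alexnet(weights=None)

# 2) Domain-specific fine-tuning: swap the 1000-way ImageNet classifier for a
#    (num_classes + background)-way layer and keep training on warped region proposals.
num_voc_classes = 20
cnn.classifier[6] = nn.Linear(4096, num_voc_classes + 1)

optimizer = torch.optim.SGD(cnn.parameters(), lr=0.001, momentum=0.9)

# One dummy step on a batch of warped 227x227 regions with fake labels, just to show the shapes.
regions = torch.randn(8, 3, 227, 227)
labels = torch.randint(0, num_voc_classes + 1, (8,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(cnn(regions), labels)
loss.backward()
optimizer.step()
```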
Results:
PASCAL VOC = faster + an improvement from 35.1% to 53.7% mAP (vs. a prior method using the same region proposals)
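For reference, mAP is just the mean over classes of each class's average precision (AP). A tiny sketch of how one class's AP comes out of a ranked detection list; the TP/FP flags and ground-truth count are made-up data, and this uses a simple step-wise integration rather than the exact VOC protocol.

```python
import numpy as np

def average_precision(is_true_positive, num_ground_truth):
    """AP: area under the precision-recall curve for one class's ranked detections."""
    tp = np.cumsum(is_true_positive)
    fp = np.cumsum(~is_true_positive)
    precision = tp / (tp + fp)
    recall = tp / num_ground_truth
    # integrate precision over recall (simple step-wise sum)
    return float(np.sum(precision * np.diff(recall, prepend=0.0)))

# detections sorted by score; True = matched a ground-truth box (IoU >= 0.5), False = miss
flags = np.array([True, True, False, True, False, False, True])
print(average_precision(flags, num_ground_truth=5))
```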
Visualization, ablation, and modes of error + The ILSVRC2013 detection dataset + Semantic segmentation
[they go into great length about this stuff. imo it's not that interesting so I just skipped it haha]
Conclusion:
[they re-write the introduction]
We must combine classical computer vision and deep learning to get the best results. They aren't opposing fields. They're perfect partners!