Paper Summary: "Rich feature hierarchies for accurate object detection and semantic segmentation"

Dickson Wu
Jun 8, 2021 · 3 min read

Abstract:

  • Old models for object detection = take lots of little models and cobble them together into one big ensemble --> Performance had barely improved in the few years leading up to this paper (2014)

  • The authors created one simple model which beat the old models by a wide margin

  • Two key ingredients: CNNs applied to region proposals + supervised pre-training on labeled images



Introduction:

  • In 2012, CNNs made their debut to the world by coming 1st on the ImageNet classification challenge --> The paper asks how far that result transfers to object detection

  • With CNNs we can't just use the sliding-window trick (used for 2 decades!!) --> Units deep in the network have very large receptive fields and strides, so it's very hard to do precise localization within the image

  • Instead they go with the "recognition using regions" paradigm.

  • How it works:

    • Generate around 2000 region proposals on the input image (candidate boxes that might contain an object, at lots of different sizes. For a mug, some boxes would cover the whole mug, others just the handle, the round rim, the walls, etc.)

    • A CNN will produce a fixed-length feature vector for each region proposal

    • SVMs (support vector machines), one per class, will then classify each region

  • That's why this is called R-CNN: Regions with CNN features.

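  • Another problem is training data. There just isn't enough labeled detection data to train a big CNN from scratch. Conventionally people pre-train on unlabeled data (unsupervised) and then fine-tune. Here they pre-train on a large labeled classification dataset instead, then fine-tune on the small detection dataset!

  • R-CNN = efficient too. The CNN features are shared across all classes, so the only class-specific operations are small ones (a matrix-vector product + non-maximum suppression).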
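The three steps above (region proposals, CNN features, SVM scoring) can be sketched roughly like this. This is a minimal sketch, not the authors' code: `propose_regions` and `extract_features` are hypothetical stand-ins (random boxes and a random projection of the mean crop color) so the example runs without a trained network, and the SVM weights are random placeholders. Only the overall data flow and the 4096-dimensional feature size come from the paper summary.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURE_DIM = 4096
# Hypothetical stand-in for learned CNN weights: a fixed random projection.
PROJECTION = rng.standard_normal((3, FEATURE_DIM))


def propose_regions(image, n_regions=2000):
    """Stand-in for a region proposal method: random boxes (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    x1 = rng.integers(0, w - 1, size=n_regions)
    y1 = rng.integers(0, h - 1, size=n_regions)
    x2 = rng.integers(x1 + 1, w + 1)  # upper bound is exclusive, so x1 < x2 <= w
    y2 = rng.integers(y1 + 1, h + 1)
    return np.stack([x1, y1, x2, y2], axis=1)


def extract_features(image, box):
    """Stand-in for the CNN: maps one region to a 4096-dimensional vector."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    mean_color = crop.reshape(-1, 3).mean(axis=0)
    return mean_color @ PROJECTION


def detect(image, svm_weights, svm_bias, n_regions=2000):
    """Score every region for every class with one linear SVM per class."""
    boxes = propose_regions(image, n_regions)
    features = np.stack([extract_features(image, box) for box in boxes])
    # All class-specific work is this single matrix product.
    scores = features @ svm_weights.T + svm_bias
    return boxes, scores
```

The feature extraction runs once per region and is shared by every class; adding more classes only adds rows to `svm_weights`, which is one way to read the efficiency point in the last bullet.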



Object detection with R-CNN:

  • To recap, the three parts of the model are:

    • Region proposals

    • CNN to produce feature vectors

    • SVM to classify

  • Region Proposals:

    • They just cited some prior work (selective search) and said they'll be using it haha

  • CNN:

    • 4096-dimensional feature vector output for each region

    • It's just 5 conv + 2 fully connected layers for the CNN. They reused an existing architecture (Krizhevsky et al.) rather than designing a new one

    • The regions might be different sizes, so every region is warped to the fixed input size the CNN expects (stretched or squished)

  • At test time: 13s/image on GPU, or 53s/image on CPU

  • For pre-training: they trained the CNN on a huge classification dataset. That dataset = super general, but they then fine-tune the network on the (much smaller) object detection dataset

  • Results:

    • PASCAL VOC 2010 = Faster + mAP improves from 35.1% (a previous approach using the same region proposals) to 53.7%
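The "stretched or squished" step from the CNN bullets can be sketched as below. This is a minimal sketch under a few assumptions: it uses nearest-neighbor sampling in plain NumPy and skips the context padding the paper adds around each box, and `warp_region` is a hypothetical helper name. The 227 × 227 default is the CNN input size reported in the paper.

```python
import numpy as np


def warp_region(image, box, size=227):
    """Crop box (x1, y1, x2, y2) and warp it to size x size, ignoring aspect ratio."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Nearest-neighbor sampling: map each output row/column to a source index.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols]
```

A wide box and a tall box both come out as the same square shape, so every region can go through the same fixed-size network:

```python
image = np.zeros((300, 400, 3), dtype=np.uint8)
wide = warp_region(image, (0, 0, 400, 50))
tall = warp_region(image, (10, 0, 60, 300))
assert wide.shape == tall.shape == (227, 227, 3)
```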



Visualization, ablation, and modes of error + The ILSVRC2013 detection dataset + Semantic segmentation

  • [they go to great lengths about this stuff. imo it's not that interesting so I just skipped it haha]



Conclusion:

  • [they re-write the introduction]

  • The best results came from combining classical computer vision (bottom-up region proposals) with deep learning (CNNs). They aren't opposing fields. They're natural partners!

