
Summary of "Going Deeper with Convolutions - GoogLeNet"

Bagavan Sivam · Jun 4, 2021 · 6 min read

This is a summary of the paper: Going Deeper with Convolutions - GoogLeNet

Abstract

This paper proposes a deep convolutional neural network architecture, codenamed "Inception," which is competitive with state-of-the-art models in classification and detection. The main trait of this network is the improved utilization of computational resources inside the network: its design allows the depth and width of the network to be increased while keeping the computational budget constant.

One particular incarnation of this architecture, known as GoogLeNet, is a 22-layer deep network whose quality is assessed on classification and detection tasks.

Introduction

Over the last couple of years, image recognition and object detection have been progressing rapidly. This is mainly due to new ideas, algorithms, and architectures (rather than better hardware, datasets, and bigger models).

When it comes to object detection, the biggest gains have come from combining deep architectures with classical computer vision techniques, as in the R-CNN algorithm.

The design of the deep architecture in this paper allows it to perform well in real-world applications and to be trained on large datasets with a lower chance of overfitting. It focuses on the architecture known as "Inception," whose name derives from the Network-in-Network paper.

We need to go deeper

The word "deeper" in this saying has two different meanings for our architecture:

  1. Introducing a new level of organization - the Inception module

  2. Increasing the depth of the network

After training, this model significantly outperformed many state-of-the-art classification and object detection models and architectures.



Related Work

Convolutional Neural Networks (CNNs) CNNs have a standard structure: stacked convolutional layers followed by one or more fully connected layers. This structure has proven very effective on classification challenges, typically with dropout layers added to prevent overfitting. CNNs are also used for localization, object detection, and human pose estimation.
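As a point of reference, here is a minimal sketch of that standard "conv stack + fully connected + dropout" structure. It is written in PyTorch purely for illustration (the paper predates PyTorch), and the layer sizes are assumptions, not values from the paper:

    import torch
    import torch.nn as nn

    # Illustrative "standard" CNN: stacked convolutions, then fully connected
    # layers, with dropout to reduce overfitting. Sizes are assumptions.
    standard_cnn = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(128 * 56 * 56, 1024), nn.ReLU(),
        nn.Dropout(p=0.5),              # dropout against overfitting
        nn.Linear(1024, 1000),          # class scores
    )

    logits = standard_cnn(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)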

The Inception model follows this trend of stacking more layers, resulting in a 22-layer deep model.

Network-in-Network The Network-in-Network approach was proposed to increase the representational power of neural networks. Applied to CNNs, it can be viewed as simply adding 1x1 convolutional layers followed by ReLU activations, which makes it easy to integrate into existing CNN pipelines.

This approach is used heavily in our Inception architecture. However, the additional 1x1 convolutions serve another purpose as well: they are used primarily as dimension-reduction modules to remove computational "bottlenecks".
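To make the dimension-reduction role concrete, here is a hedged sketch (again PyTorch, with channel counts chosen for illustration rather than taken from the paper) of a 1x1 convolution + ReLU placed in front of a more expensive 5x5 convolution:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 28, 28)                # 256-channel feature map

    # 1x1 convolution + ReLU acting as a dimension-reduction module
    reduce = nn.Sequential(nn.Conv2d(256, 64, kernel_size=1), nn.ReLU())
    conv5x5 = nn.Conv2d(64, 128, kernel_size=5, padding=2)

    # The expensive 5x5 convolution now operates on 64 channels instead of 256,
    # removing the computational "bottleneck".
    y = conv5x5(reduce(x))                         # -> shape (1, 128, 28, 28)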

Region-Based Convolutional Neural Networks (R-CNNs) R-CNN is currently the leading approach to object detection. It breaks the overall detection problem down into two subproblems:

  1. Utilize low-level cues - such as colour and superpixel consistency - to propose object locations in a category-agnostic fashion

  2. Use CNN classifiers to identify the object categories at those locations.

Our architecture adopts this idea and explores enhancements in both stages.

Motivation and High-Level Considerations

The straightforward way to improve a model's performance is by increasing its depth and width.

Depth = number of levels in the network
Width = number of units at each level

This is an easy and safe way to train better models; however, it has two main drawbacks:

  1. A bigger size generally means a larger number of parameters - making the network more prone to overfitting.

  2. A dramatic increase in the computational resources needed for training - making it less efficient.

The fundamental way to solve both problems is to replace fully connected architectures with sparsely connected ones. In theory, the optimal network topology can then be constructed layer by layer by analyzing the correlation statistics of the activations in the previous layer and clustering units whose outputs are highly correlated.
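To see why dense scaling is expensive (drawback 1 above) and why sparser connectivity is attractive, here is a quick back-of-the-envelope calculation; it is my own illustration, not a figure from the paper:

    # Parameters of a k x k convolution scale with (input width) x (output width),
    # so uniformly widening two consecutive layers grows their parameter count
    # quadratically, not linearly.
    def conv_params(c_in, c_out, k):
        return k * k * c_in * c_out

    base = conv_params(256, 256, 3)        # 589,824 parameters
    widened = conv_params(512, 512, 3)     # 2,359,296 parameters
    print(widened / base)                  # 4.0 - doubling width quadruples cost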

Architecture Details

The main idea of the architecture is to approximate the optimal local sparse structure of a convolutional vision network using readily available dense components. Some of the main parts of the architecture - in particular, the auxiliary classifiers attached to intermediate layers - include:

  • An average pooling layer with a 5x5 filter size and a stride of 3, resulting in a 4x4x512 output for the (4a) stage and 4x4x528 for the (4d) stage

  • A 1x1 convolution with 128 filters followed by a ReLU activation - for dimension reduction

  • A fully connected layer with 1024 units and a ReLU activation

  • A dropout layer with a 70% ratio of dropped outputs

  • A linear layer with softmax loss as the classifier - predicting the same 1000 classes as the main classifier

Of course, there are many other parts to the architecture; however, this is the basic concept. The full GoogLeNet architecture is pictured in the original paper.
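To make this concrete, below is a minimal PyTorch sketch - my own reconstruction, not the authors' DistBelief code - of one Inception module with 1x1 dimension reduction, together with an auxiliary classifier built from the components listed above. The filter counts follow the paper's Inception (3a) module and auxiliary heads, but treat the sketch as illustrative:

    import torch
    import torch.nn as nn

    class Inception(nn.Module):
        """One Inception module: four parallel branches concatenated by channel."""
        def __init__(self, c_in, c1, c3r, c3, c5r, c5, pool_proj):
            super().__init__()
            self.b1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU())
            self.b2 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), nn.ReLU(),      # 1x1 reduce
                                    nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU())
            self.b3 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), nn.ReLU(),      # 1x1 reduce
                                    nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU())
            self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                    nn.Conv2d(c_in, pool_proj, 1), nn.ReLU())

        def forward(self, x):
            return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

    class AuxClassifier(nn.Module):
        """Auxiliary head: 5x5/3 average pool, 1x1 conv (128 filters), FC-1024,
        70% dropout, then a linear softmax classifier over 1000 classes."""
        def __init__(self, c_in, num_classes=1000):
            super().__init__()
            self.pool = nn.AvgPool2d(5, stride=3)                  # 14x14 -> 4x4
            self.conv = nn.Sequential(nn.Conv2d(c_in, 128, 1), nn.ReLU())
            self.fc = nn.Sequential(nn.Flatten(),
                                    nn.Linear(128 * 4 * 4, 1024), nn.ReLU(),
                                    nn.Dropout(0.7),
                                    nn.Linear(1024, num_classes))

        def forward(self, x):
            return self.fc(self.conv(self.pool(x)))

    # Inception (3a): 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
    block = Inception(192, c1=64, c3r=96, c3=128, c5r=16, c5=32, pool_proj=32)
    out = block(torch.randn(1, 192, 28, 28))                 # -> (1, 256, 28, 28)
    aux = AuxClassifier(512)(torch.randn(1, 512, 14, 14))    # -> (1, 1000)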



Training Methodology

The network was trained using the DistBelief distributed machine learning system, with a modest amount of model and data parallelism. Training used asynchronous stochastic gradient descent with 0.9 momentum.

The image sampling methods changed substantially over the months leading up to the competition; some models were mainly trained on smaller crops of images, while others were trained on larger ones. One recipe that worked well was to sample patches of various sizes, with the patch area distributed evenly between 8% and 100% of the image area.
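This crop-sampling recipe (patches covering 8% to 100% of the image area; the paper additionally constrains the aspect ratio to between 3/4 and 4/3) is essentially what torchvision's RandomResizedCrop does by default. A hedged sketch of such a training pipeline, where the horizontal flip and 224x224 crop size are illustrative choices rather than details stated above:

    from torchvision import transforms

    # Patch sampling similar to the recipe described above: crop area uniform in
    # [8%, 100%] of the image, aspect ratio in [3/4, 4/3], resized to 224x224.
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
        transforms.RandomHorizontalFlip(),     # illustrative extra augmentation
        transforms.ToTensor(),
    ])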



Experiments/Results

Seven versions of the same GoogLeNet model were trained independently, and predictions were made with an ensemble of them. During testing, a more aggressive cropping approach was adopted.
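A minimal sketch - assumed for illustration, not the authors' evaluation code - of how such an ensemble prediction can be formed by averaging the softmax outputs of the seven independently trained models:

    import torch
    import torch.nn.functional as F

    def ensemble_predict(models, image_batch):
        """Average the class probabilities produced by each model in the ensemble."""
        probs = [F.softmax(m(image_batch), dim=1) for m in models]
        return torch.stack(probs).mean(dim=0)   # averaged (batch, num_classes)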

A comparison of different models in the ILSVRC 2014 classification challenge:



A comparison of different models in the ILSVRC 2014 detection challenge:





Conclusion

The results yield solid evidence that approximating the expected optimal sparse structure with readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase in computational requirements compared to shallower and less wide networks.

The detection results for our model were still competitive even though the model didn't utilize context or perform bounding box regression.

Similar result quality could likely be achieved by other networks; however, they would need to be significantly deeper and wider, and therefore more expensive. This shows that moving towards sparser architectures is a feasible and useful idea for many real-world applications.



Those were some of the key concepts/points covered in this paper.

