
By Vihar Kurama and Shaoni Mukherjee

Deep learning has completely transformed the way computers view and recognize images. Deep learning architectures help machines classify images, predict objects in an image, and even generate realistic visuals. The base of these powerful deep learning models lies in dense neural network architectures, which are designed for optimal performance, accuracy, and efficiency. In this article, we will explore the three most widely used Convolutional Neural Networks: ResNet, InceptionV3, and SqueezeNet. We will learn how they work, their key innovations, and their impact on deep learning applications.
ResNet: ResNet, also known as Residual Networks, was proposed by Microsoft in 2015 and brought a breakthrough to address the vanishing/exploding gradient issue in deep learning networks. Traditional deep learning models struggled with training as they became increasingly dense due to gradient degradation, making it extremely difficult to learn effectively as the number of layers increased. ResNet introduced skip connections (residual connections) that allow the gradient to bypass multiple layers and form residual blocks, making it possible to train extremely deep networks like ResNet-50 and ResNet-101 without significant performance loss. InceptionV3: Google developed Inceptionv3, an improved version of the Inception architecture, in 2016. The architecture is designed to optimize computation using parallel convolutional filters of different sizes.
As deep neural networks are both time-consuming to train and prone to overfitting, a team at Microsoft introduced a residual learning framework to improve the training of networks that are substantially deeper than those used previously. This research was published in the Deep Residual Learning for Image Recognition 2015 paper. And so, the famous ResNet (short for “Residual Network”) was born. When training deep networks, there comes a point where an increase in depth causes accuracy to saturate and then degrade rapidly. This is called the “degradation problem.” It highlights that not all neural network architectures are equally easy to optimize. ResNet uses a technique called “residual mapping” to combat this issue. Instead of hoping that every few stacked layers directly fit a desired underlying mapping, the Residual Network explicitly lets these layers fit a residual mapping. Below is the building block of a Residual network.
The formulation of F(x)+x can be realized by feedforward neural networks with shortcut connections.
ResNets can address many problems. They are easy to optimize and achieve higher accuracy when the depth of the network increases, producing better results than previous networks. ResNet was first trained and tested on ImageNet’s over 1.2 million training images belonging to 1000 different classes.
Compared to the conventional neural network architectures, ResNets are relatively easy to understand. Below is the image of a VGG network, a plain 34-layer neural network, and a 34-layer residual neural network. In the plain network, the layers have the same number of filters for the same output feature map. If the size of output features is halved, the number of filters is doubled, making the training process more complex.
Meanwhile, in the Residual Neural Network, as we can see, there are far fewer filters and lower complexity during the training with respect to VGG. A shortcut connection is added that turns the network into its counterpart residual version. This shortcut connection performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no additional parameter. The projection shortcut is mathematically represented as F(x{W}+x), which is used to match dimensions computed by 1×1 convolutions.
The table below shows the layers and parameters of the different ResNet architectures.
Each ResNet block is either two layers deep (used in small networks like ResNet 18 or 34) or 3 layers deep (ResNet 50, 101, or 152).
The samples from the ImageNet dataset are re-scaled to 224 × 224 and are normalized by a per-pixel mean subtraction. Stochastic gradient descent is used for optimization with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error increases, and the models are trained up to 60 × 104 iterations. The weight decay and momentum are set to 0.0001 and 0.9, respectively. Dropout layers are not used.
ResNet performs extremely well with deeper architectures. Below is an image showing the error rate of two 18- and 34-layer neural networks. The graph on the left shows plain networks, while the graph on the right shows their ResNet equivalents. The thin red curve in the image represents the training error, and the bold curve represents the validation error.
Below is the table showing the Top-1 error (%, 10-crop testing) on ImageNet validation.
ResNet has played a significant role in defining the field of deep learning as we know it today.
Below are a few important links if you’re interested in implementing a ResNet yourself:
The Wide Residual Network recently improved the original Deep Residual Networks. Rather than relying on increasing the depth of a network to improve its accuracy, it was shown that a network could be made shallower and wider without compromising its performance. This ideology was presented in the paper Wide Residual Networks, which was published in 2016 (and updated in 2017 by Sergey Zagoruyko and Nikos Komodakis.
Deep Residual Networks have shown remarkable accuracies that have aided in performing tasks like image recognition, among many others, with relative ease. Nonetheless, a deep network still faces the challenges of network degradation and exploding or vanishing gradients. A deep residual network doesn’t guarantee the usage of all residual blocks either; a few could be skipped, or only a few make it to the larger contributing blocks (i.e., extract useful information). This problem could be solved by disabling random blocks–this is a dropout in a broader sense. Taking a cue from this idea, the authors of Wide ResNet have proved that a wide residual network can perform even better than a deep one.
A Wide ResNet has a group of ResNet blocks stacked together, each following the BatchNormalization-ReLU-Conv structure. This structure is depicted as follows:
Five groups comprise a wide ResNet. The block here refers to the residual block B(3, 3). Conv1 remains intact in any network, whereas conv2, conv3, and conv4 vary according to k, which defines the width. An average-pool layer and a classification layer succeed the convolutional layers.
The following variable measures could be tweaked to come up with various representations of the residual blocks: Convolution type: B(3, 3) shown above has 3 × 3 convolutional layers in the residual block. Other possibilities, such as integrating 3 × 3 convolutional layers with 1 × 1 convolutions, could also be explored.
Wide ResNet was trained on CIFAR-10. The following metrics resulted in the lowest error rates:
The following table compares the complexity and performance of Wide ResNet with several other models, including the original ResNet, on both CIFAR-10 and CIFAR-100:
Below are a few important links for implementing Wide ResNet yourself:
Inception v3 mainly focuses on burning less computational power by modifying the previous Inception architectures. This idea was proposed in the paper Rethinking the Inception Architecture for Computer Vision, published in 2015. It was co-authored by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jonathon Shlens.
In comparison to VGGNet, Inception Networks (GoogLeNet/Inception v1) have proved to be more computationally efficient, both in terms of the number of parameters generated by the network and the economical cost incurred (memory and other resources). If any changes are to be made to an Inception Network, care needs to be taken to make sure that the computational advantages aren’t lost. Thus, the adaptation of an Inception network for different use cases turns out to be a problem due to the uncertainty of the new network’s efficiency. In an Inception v3 model, several techniques for optimizing the network have been put suggested to loosen the constraints for easier model adaptation. The techniques include factorized convolutions, regularization, dimension reduction, and parallelized computations.
The architecture of an Inception v3 network is progressively built, step-by-step, as explained below:
In the middle we see a 3x3 convolution, and below a fully-connected layer. Since both 3x3 convolutions can share weights among themselves, the number of computations can be reduced. 3. Asymmetric convolutions: A 3 × 3 convolution could be replaced by a 1 × 3 convolution followed by a 3 × 1 convolution. If a 3 × 3 convolution is replaced by a 2 × 2 convolution, the parameters would be slightly higher than the asymmetric convolution proposed.
 4. Auxiliary classifier: An auxiliary classifier is a small CNN inserted between layers during training, and the loss incurred is added to the main network loss. In GoogLeNet, auxiliary classifiers were used for a deeper network, whereas in Inception v3, an auxiliary classifier acts as a regularizer.
4. Auxiliary classifier: An auxiliary classifier is a small CNN inserted between layers during training, and the loss incurred is added to the main network loss. In GoogLeNet, auxiliary classifiers were used for a deeper network, whereas in Inception v3, an auxiliary classifier acts as a regularizer.
5. Grid size reduction: Grid size reduction is usually done by pooling operations. However, to combat the bottlenecks of computational cost, a more efficient technique is proposed:
All the above concepts are consolidated into the final architecture.
Inception v3 was trained on ImageNet and compared with other contemporary models, as shown below.
As shown in the table, when augmented with an auxiliary classifier, factorization of convolutions, RMSProp, and Label Smoothing, Inception v3 can achieve the lowest error rates compared to its contemporaries.
Below are a few relevant links if you’re looking to implement Inception v3 yourself:
SqueezeNet is a smaller network designed as a more compact replacement for AlexNet. It has almost 50x fewer parameters than AlexNet, yet it performs 3x faster. Researchers at DeepScale, The University of California, Berkeley, and Stanford University in 2016 proposed this architecture. It was first published in their paper titled SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.
Below are the key ideas behind SqueezeNet:
The SqueezeNet architecture comprises “squeeze” and “expand” layers. A squeeze convolutional layer has only a 1 × 1 filter. As shown below, these are fed into an expand layer with a mix of 1 × 1 and 3 × 3 convolution filters.
The paper’s authors use the term “fire module” to describe a combination of a squeeze and expand layers.
An input image is first sent into a standalone convolutional layer. This layer is followed by 8 “fire modules” named “fire2-9”, according to Strategy One above. The illustration of the resulting SqueezeNet is shown below.
From left to right: SqueezeNet, SqueezeNet with simple bypass, and SqueezeNet with complex bypass.
Following Strategy Two, the filters per fire module are increased with a “simple bypass.” Lastly, SqueezeNet performs max-pooling with a stride of 2 after layers conv1, fire4, fire8, and conv10. According to Strategy Three, pooling is given a relatively late placement, resulting in SqueezeNet with a “complex bypass” (the rightmost architecture in the image above.)
Below is an image showing how SqueezeNet compares with the original AlexNet.
As we can observe, the weights of the compressed model for AlexNet were 240MB and achieved 80.3% accuracy. Meanwhile, a Deep Compression SqueezeNet consumes 0.47MB of memory and achieves the same performance.
Below are the details of other parameters used in the network:
SqueezeNet makes the deployment process easier due to its small size. Initially this network was implemented in Caffe, but the model has since gained in popularity and has been adopted to many different platforms.
Below are a few relevant links for implementing SqueezeNet on your own, or further investigating the original implementation:
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!
Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.
Full documentation for every DigitalOcean product.
The Wave has everything you need to know about building a business, from raising funding to marketing your product.
Stay up to date by signing up for DigitalOcean’s Infrastructure as a Newsletter.
New accounts only. By submitting your email you agree to our Privacy Policy
Scale up as you grow — whether you're running one virtual machine or ten thousand.
Sign up and get $200 in credit for your first 60 days with DigitalOcean.*
*This promotional offer applies to new accounts only.