By Oreolorun Olu-Ipinlaye and Shaoni Mukherjee
Convolutional Neural Networks (CNNs) have become the backbone of modern computer vision, powering applications from image classification and object detection to medical imaging and self-driving cars. At the heart of CNNs lie not only convolutional layers but also pooling layers, which play a crucial role in how these networks learn to recognize patterns effectively.
Pooling can be thought of as a data reduction step. After convolutions extract detailed feature maps from an image, pooling compresses this information into a smaller, more manageable form. By doing so, it helps the model focus on the most important patterns, such as edges, textures, or shapes, while removing less critical details. This makes training more efficient, reduces overfitting, and ensures that the model can recognize objects even if they appear in slightly different positions or scales within an image.
There are several types of pooling, such as Max Pooling, which preserves the strongest features, Average Pooling, which smooths out representations, and Global Pooling, often used in the final layers of CNNs. Pooling not only improves computational efficiency but also introduces translation invariance, allowing CNNs to generalize better to real-world data.
In this article, we’ll break down how pooling works, explore different pooling techniques, discuss their advantages and limitations, and look at how modern CNN architectures are adapting or even replacing traditional pooling strategies. By the end, you’ll understand why pooling remains a fundamental concept in deep learning, even as architectures evolve toward more sophisticated alternatives.
To understand pooling in Convolutional Neural Networks (CNNs), we should be familiar with the following concepts:
- The convolution operation and how filters/kernels slide over an image
- Feature maps produced by convolutional layers
- Basic Python, along with libraries such as NumPy, OpenCV, and PyTorch (all used in the code examples below)
Having a solid grasp of these fundamentals will help you understand the role of pooling in reducing spatial dimensions, enhancing feature extraction, and increasing the efficiency of CNNs.
Similar to convolution, the pooling process also utilizes a filter/kernel, albeit one without any weights (think of it as an empty window). Pooling essentially involves sliding this filter over sequential patches of the image and processing the pixels caught in the kernel in some way; mechanically, it is the same as a convolution operation.
In deep learning frameworks, there exists a not-too-popular yet highly fundamental parameter that dictates the behavior of convolution and pooling classes; more generally, it controls the behavior of any class that employs a sliding window. That parameter is termed 'stride.' The term 'sliding window' is used because scanning over an image with a filter resembles sliding a small window over the image's pixels.
The stride parameter determines how much a filter is shifted in either dimension when performing sliding window operations like convolution and pooling.
In the image above, filters are slid along both dim 0 (horizontal) and dim 1 (vertical) of the (6, 6) image. When stride=1, the filter shifts by one pixel; when stride=2, by two pixels; and when stride=3, by three pixels. This has an interesting effect when generating a new image via a sliding-window process: a stride of 2 in both dimensions produces an image that is essentially half the size of the original, a stride of 3 produces an image a third of the size, and so on.
When stride > 1, a representation that is a fraction of the size of its reference image is produced.
When performing pooling operations, it is important to note that, by default, the stride is set equal to the size of the filter. For instance, if a (2, 2) filter is to be used, the stride defaults to a value of 2.
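A quick PyTorch sketch of both behaviors (PyTorch being one framework that follows this convention) is shown below:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)  # a single-channel (6, 6) "image"

# a (2, 2) window with stride=1 shifts one pixel at a time -> (5, 5) output
print(nn.MaxPool2d(kernel_size=2, stride=1)(x).shape)  # torch.Size([1, 1, 5, 5])

# with stride=2, the window jumps two pixels -> (3, 3), half the original size
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)  # torch.Size([1, 1, 3, 3])

# with stride left unset, it defaults to the kernel size, giving the same result
print(nn.MaxPool2d(kernel_size=2)(x).shape)            # torch.Size([1, 1, 3, 3])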
There are two main types of pooling operations used in CNNs: Max Pooling and Average Pooling. Global variants of these two operations (Global Max Pooling and Global Average Pooling) also exist, but they are outside the scope of this particular article.
Max pooling entails scanning over an image using a filter and at each instance returning the maximum pixel value caught within the filter as a pixel of its own in a new image.
From the illustration, an empty (2, 2) filter is slid over a (4, 4) image with a stride of 2 as discussed in the section above. The maximum pixel value at every instance is returned as a distinct pixel of its own to form a new image. The resulting image is said to be a max-pooled representation of the original image (Note that the resulting image is half the size of the original image due to a default stride of 2 as discussed in the previous section).
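To see this concretely, here is a minimal PyTorch sketch using an arbitrary (4, 4) tensor (the values here are made up, not those from the illustration). Note that F.max_pool2d defaults its stride to the kernel size, matching the behavior discussed earlier:

import torch
import torch.nn.functional as F

# an arbitrary (4, 4) "image"; each (2, 2) patch contributes its largest pixel
image = torch.tensor([[1., 3., 2., 1.],
                      [4., 8., 6., 5.],
                      [3., 1., 1., 0.],
                      [2., 4., 2., 9.]]).reshape(1, 1, 4, 4)

# stride defaults to kernel_size, so the output is half the input size
print(F.max_pool2d(image, kernel_size=2))
# tensor([[[[8., 6.],
#           [4., 9.]]]])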
Just like Max Pooling, an empty filter is also slid over the image, but in this case, the average/mean value of all the pixels caught in the filter is returned to form an average-pooled representation of the original image, as illustrated below.
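For comparison, here is the same arbitrary (4, 4) tensor from the sketch above, this time average pooled, so the two outputs can be compared directly:

import torch
import torch.nn.functional as F

image = torch.tensor([[1., 3., 2., 1.],
                      [4., 8., 6., 5.],
                      [3., 1., 1., 0.],
                      [2., 4., 2., 9.]]).reshape(1, 1, 4, 4)

# each (2, 2) patch now contributes its mean rather than its maximum
print(F.avg_pool2d(image, kernel_size=2))
# tensor([[[[4.0000, 3.5000],
#           [2.5000, 3.0000]]]])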
From the illustrations (and the sketches above), one can clearly see that pixel values are much larger in the max-pooled representation than in the average-pooled one. In simpler terms, representations resulting from max pooling are often sharper than those derived from average pooling.
Convolutional Neural Networks extract features, such as edges, from an image via the process of convolution. These extracted features are termed feature maps. Pooling then acts on these feature maps, serving as a kind of principal component analysis (to be quite liberal with that concept) by looking through the feature maps and producing a small-sized summary in a process called down-sampling.
In less technical terms, pooling generates small-sized images that retain all the essential attributes (pixels) of a reference image. For example, one could produce a (25, 25) pixel image of a car that retains all the general details and makeup of a (400, 400) reference image by iteratively pooling 4 times with a (2, 2) kernel. This works because strides greater than 1 produce representations that are a fraction of the size of the original image.
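A small sketch of this size arithmetic, using a random tensor as a stand-in for the (400, 400) image:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 400, 400)  # stand-in for a (400, 400) grayscale image
for i in range(4):
    x = F.max_pool2d(x, kernel_size=2)  # halves each spatial dimension
    print(f'after pooling round {i + 1}: {tuple(x.shape[2:])}')
# after pooling round 1: (200, 200)
# after pooling round 2: (100, 100)
# after pooling round 3: (50, 50)
# after pooling round 4: (25, 25)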
Going back to CNNs, as convolution layers get deeper, the number of feature maps (representations resulting from convolution) increases. If the feature maps remained the same size as the image provided to the network, computation speed would be severely hampered due to the large volume of data present, particularly during training. By progressively down-sampling these feature maps, the amount of data in the network is effectively kept in check even as the feature maps increase in number. This means the network always has a reasonable amount of data to deal with, without losing any of the essential features extracted by the previous convolution layer, resulting in faster computation.
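To make this concrete, below is a toy convolution stack (an illustrative architecture, not one from any particular paper) in which the number of feature maps grows while pooling keeps the spatial dimensions, and hence the data volume, in check:

import torch
import torch.nn as nn

# channels (feature maps) grow layer by layer while pooling shrinks the grid
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(1, 1, 64, 64)
for layer in net:
    x = layer(x)
    print(f'{layer.__class__.__name__:10s} -> {tuple(x.shape)}')
# Conv2d     -> (1, 8, 64, 64)
# ReLU       -> (1, 8, 64, 64)
# MaxPool2d  -> (1, 8, 32, 32)
# Conv2d     -> (1, 16, 32, 32)
# ReLU       -> (1, 16, 32, 32)
# MaxPool2d  -> (1, 16, 16, 16)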
Another effect of pooling is that it allows Convolutional Neural Networks to be more robust as they become translation invariant. This means the network will be able to extract features from an object of interest regardless of the object’s position in an image (more on this in a future article).
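A minimal sketch of this effect: the same bright "feature" shifted by one pixel within a pooling window still produces an identical pooled output:

import torch
import torch.nn.functional as F

# the same "feature" (a bright pixel) at two nearby positions
a = torch.tensor([0., 9., 0., 0.]).reshape(1, 1, 1, 4)
b = torch.tensor([9., 0., 0., 0.]).reshape(1, 1, 1, 4)

# after (1, 2) max pooling, both inputs yield the same representation
print(F.max_pool2d(a, kernel_size=(1, 2)))  # tensor([[[[9., 0.]]]])
print(F.max_pool2d(b, kernel_size=(1, 2)))  # tensor([[[[9., 0.]]]])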
In this section, we will use manually written pooling functions to visualize the pooling process and better understand what actually goes on. Two functions are provided, one for max pooling and the other for average pooling. Using these functions, we will attempt to pool the image of size (446, 450) pixels below.
# import these dependencies
import cv2
import torch
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
from tqdm import tqdm
Don’t forget to import these dependencies
def max_pool(image_path, kernel_size=2, visualize=False, title=''):
    """
    This function replicates the max pooling process
    """
    # if a 2D array is passed in, use it directly; otherwise read image from file
    if type(image_path) is np.ndarray and len(image_path.shape) == 2:
        image = image_path
    else:
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # creating an empty array to hold the pooled pixels
    pooled = np.zeros((image.shape[0]//kernel_size,
                       image.shape[1]//kernel_size))

    # instantiating row counter
    k = -1
    # max pooling
    for i in tqdm(range(0, image.shape[0], kernel_size)):
        k += 1
        # instantiating column counter
        l = -1
        if k == pooled.shape[0]:
            break
        for j in range(0, image.shape[1], kernel_size):
            l += 1
            if l == pooled.shape[1]:
                break
            try:
                # the largest pixel in each patch becomes a pixel of its own
                pooled[k, l] = (image[i:(i+kernel_size),
                                      j:(j+kernel_size)]).max()
            except ValueError:
                pass

    if visualize:
        # displaying results
        figure, axes = plt.subplots(1, 2, dpi=120)
        plt.suptitle(title)
        axes[0].imshow(image, cmap='gray')
        axes[0].set_title('reference image')
        axes[1].imshow(pooled, cmap='gray')
        axes[1].set_title('maxpooled')
    return pooled
Max Pooling Function shown above.
The function above replicates the max pooling process. Using the function, let’s attempt to max pool the reference image using a (2, 2) kernel.
max_pool('image.jpg', 2, visualize=True)
Looking at the number lines on each axis, it is clear that the image has reduced in size but kept all of its details intact. It's almost like the process extracted the most salient pixels and produced a summarized representation half the size of the reference image (half because a (2, 2) kernel was used).
The function below allows for the visualization of several iterations of the max pooling process.
def visualize_pooling(image_path, iterations, kernel=2):
    """
    This function helps to visualize several
    iterations of the pooling process
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # creating a list to hold each pooled representation
    pools = []
    pools.append(image)

    # performing pooling
    for iteration in range(iterations):
        pool = max_pool(pools[-1], kernel)
        pools.append(pool)

    # visualization
    fig, axis = plt.subplots(1, len(pools), dpi=700)
    for i in range(len(pools)):
        axis[i].imshow(pools[i], cmap='gray')
        axis[i].set_title(f'{pools[i].shape}', fontsize=5)
        axis[i].axis('off')
Using this function, we can visualize 3 generations of max-pooled representation using a (2, 2) filter as seen below. The image goes from a size of (446, 450) pixels to a size of (55, 56) pixels (essentially a 1.5% summary), whilst maintaining its general makeup.
visualize_pooling('image.jpg', 3)
The effects of using a larger (3, 3) kernel are seen below. As expected, the reference image reduces to 1/3 of its preceding size at every iteration. By the third iteration, a pixelated (16, 16) down-sampled representation is produced (a 0.1% summary). Although pixelated, the overall idea of the image is still somewhat maintained.
visualize_pooling('image.jpg', 3, kernel=3)
To imitate what the max pooling process might look like in a Convolutional Neural Network, let's run a couple of iterations over vertical edges detected in the image using a Prewitt operator, as sketched below.
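Here is a minimal sketch of that experiment; the specific vertical Prewitt kernel and the two pooling iterations are assumptions, since only the operator itself is named above:

import cv2
import numpy as np

# vertical Prewitt kernel: responds strongly to vertical edges
prewitt_x = np.array([[-1., 0., 1.],
                      [-1., 0., 1.],
                      [-1., 0., 1.]], dtype=np.float32)

image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)
edges = np.abs(cv2.filter2D(image, -1, prewitt_x))  # vertical edge map

# max pool the edge map a couple of times using the function defined above
pooled = max_pool(edges, 2, visualize=True, title='edges, pooled once')
pooled = max_pool(pooled, 2, visualize=True, title='edges, pooled twice')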
def average_pool(image_path, kernel_size=2, visualize=False, title=''):
    """
    This function replicates the average pooling process
    """
    # if a 2D array is passed in, use it directly; otherwise read image from file
    if type(image_path) is np.ndarray and len(image_path.shape) == 2:
        image = image_path
    else:
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # creating an empty array to hold the pooled pixels
    pooled = np.zeros((image.shape[0]//kernel_size,
                       image.shape[1]//kernel_size))

    # instantiating row counter
    k = -1
    # average pooling
    for i in tqdm(range(0, image.shape[0], kernel_size)):
        k += 1
        # instantiating column counter
        l = -1
        if k == pooled.shape[0]:
            break
        for j in range(0, image.shape[1], kernel_size):
            l += 1
            if l == pooled.shape[1]:
                break
            try:
                # the mean of each patch becomes a pixel of its own
                pooled[k, l] = (image[i:(i+kernel_size),
                                      j:(j+kernel_size)]).mean()
            except ValueError:
                pass

    if visualize:
        # displaying results
        figure, axes = plt.subplots(1, 2, dpi=120)
        plt.suptitle(title)
        axes[0].imshow(image, cmap='gray')
        axes[0].set_title('reference image')
        axes[1].imshow(pooled, cmap='gray')
        axes[1].set_title('averagepooled')
    return pooled
Average pooling function shown above.
The function above replicates the average pooling process. Note that the code is identical to the max pooling function, except that the mean() method is applied as the kernel slides over the image. An average-pooled representation of our reference image is visualized below.
average_pool('image.jpg', 2, visualize=True)
Similar to max pooling, the image has been reduced to half its size while keeping its most important attributes. This is quite interesting because, unlike max pooling, average pooling does not directly reuse pixels from the reference image; rather, it combines them, essentially creating new attributes (pixels), yet the details of the reference image are preserved.
Let’s see how the average pooling process progresses through 3 iterations using the visualization function below.
def visualize_avg_pooling(image_path, iterations, kernel=2):
    """
    This function helps to visualize several
    iterations of the pooling process
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # creating a list to hold each pooled representation
    pools = []
    pools.append(image)

    # performing pooling
    for iteration in range(iterations):
        pool = average_pool(pools[-1], kernel)
        pools.append(pool)

    # visualization
    fig, axis = plt.subplots(1, len(pools), dpi=700)
    for i in range(len(pools)):
        axis[i].imshow(pools[i], cmap='gray')
        axis[i].set_title(f'{pools[i].shape}', fontsize=5)
        axis[i].axis('off')
Using this function, we can visualize 3 generations of average-pooled representation using a (2, 2) filter as seen below. The image goes from a size of (446, 450) pixels to a size of (55, 56) pixels (essentially a 1.5% summary), whilst maintaining its general makeup.
visualize_avg_pooling('image.jpg', 3)
Average pooling using a (3, 3) kernel yields the following result. As expected, image size is reduced to 1/3 of its preceding value at each iteration. Heavy pixelation sets in by the third iteration, just as in max pooling, but the overall attributes of the image remain reasonably intact.
visualize_avg_pooling('image.jpg', 3, kernel=3)
Running (2, 2) average pooling over vertical edges detected using a Prewitt operator produces the results below. Just as in max pooling, the image features (edges) become more pronounced with progressive average pooling.
After learning about both the max and average pooling processes, one naturally wonders which is superior for computer vision applications. The truth is, arguments can be made either way.
On one hand, since max pooling selects the highest pixel values caught in a kernel, it produces a much sharper representation.
On the other hand, an argument can be made in favor of average pooling: that it produces more generalized feature maps. Consider our reference image of size (446, 450): when pooled with a (2, 2) kernel, its pooled representation has size (223, 225), containing just 25% of the pixels in the reference image. Because max pooling essentially selects pixels, some argue that it results in a loss of data, which might be detrimental to the network's performance. Instead of selecting pixels, average pooling combines them into one by computing their mean value; therefore, some believe that average pooling simply compresses pixels by 75% rather than explicitly removing them, yielding more generalized feature maps and thereby doing a better job of combating overfitting.
What side of the divide do I belong to? Personally, I believe max pooling’s ability to further highlight edges in feature maps gives it an edge in computer vision/deep learning applications, hence why it is more popular.
Q1: Why is pooling necessary in CNNs?
Pooling reduces the spatial dimensions of feature maps, which:
- Keeps the amount of data in the network manageable, speeding up computation and training
- Helps combat overfitting
- Introduces a degree of translation invariance, making the network more robust
Q2: How do Max Pooling and Average Pooling differ?
Max Pooling returns the largest pixel value caught in each window, producing sharper representations that emphasize the strongest features. Average Pooling returns the mean of each window, producing smoother, more generalized representations.
Q3: Can CNNs work without pooling layers?
Yes, modern architectures sometimes replace traditional pooling with:
- Strided convolutions, which learn their own down-sampling
- Adaptive pooling layers, which target a specified output size
- Global Average Pooling in the final layers, in place of flattening
Q4: How are pooling strategies evolving in modern CNN architectures?
Pooling is still widely used, but some modern CNN architectures (like ResNet and Inception) rely more on strided convolutions or adaptive pooling layers. These methods give more control over feature reduction while preserving critical spatial details.
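A brief PyTorch illustration of both alternatives:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# a strided convolution: learnable down-sampling in place of pooling
strided = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)           # torch.Size([1, 16, 16, 16])

# adaptive pooling: you specify the output size, the framework picks the window
adaptive = nn.AdaptiveAvgPool2d((1, 1))
print(adaptive(x).shape)          # torch.Size([1, 16, 1, 1])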
Q5: What are the downsides of excessive pooling?
Too much pooling can lead to:
- Loss of fine-grained spatial detail that later layers cannot recover
- Feature maps that become too small to carry useful information
- Degraded performance on tasks that depend on precise localization
In this article, we’ve developed an intuition of what pooling entails in Convolutional Neural Networks. We’ve looked at two main types of pooling and the difference in the pooled representations produced by each one. For all the talk about pooling in CNNs, bear in mind that most architectures these days tend to favor strided convolution layers over pooling layers for down-sampling, as they reduce the network’s complexity. Regardless, pooling remains an essential component of Convolutional Neural Networks. If you’re looking to experiment with CNN architectures, train models, or scale deep learning workloads, you can leverage DigitalOcean Gradient™ AI GPU Droplets for accelerated compute. They provide the performance needed for tasks like image recognition, object detection, and other computer vision applications.