Introduction to Convolutional Neural Networks

Let’s learn about convolutional neural networks. I’m assuming you’re mostly familiar with what CNNs are, and this tutorial is meant to refresh your memory on concepts you might have missed out on or forgotten.

Core Concepts

Kernels and Filters

In a CNN, kernels (or filters) are small weight matrices that slide over the input. They compute a dot product between the kernel's weights and the local input patches they cover. They eventually learn to extract things like edges, textures, etc. The same kernel is applied to all spatial locations (this is called shared weights), which makes CNNs more parameter-efficient than feedforward networks. The larger the kernel, the more expensive it is computationally. You can apply multiple kernels per layer to learn multiple features and channels.

Stride

Stride controls how far the kernel moves at each step. A larger stride results in fewer steps and acts as a form of downsampling.

Padding

Padding inserts zeros around the input to preserve its size. Common types are:

Valid: No padding.
Constant: Adding a fixed number of zeros around the input.
Same: Adding enough padding to keep the output size the same as the input size.

The formula for the output size is:

$$ \text{Output Size} = \bigg\lfloor \frac{(\text{Input Size} - \text{Kernel Size} + 2 \times \text{Padding})}{\text{Stride}} \bigg\rfloor + 1 $$

Receptive Field

The receptive field of a neuron is the region of the input image that can influence that neuron’s value. How much of the original input does this neuron get to see? It determines how much context the model has. A small receptive field sees local details, like edges, while a large receptive field sees global structures, like shapes and objects.

With stride, you increase the spacing between the receptive fields of adjacent neurons, not the size of the receptive field itself; each step just jumps farther. Stride controls the resolution and indirectly spreads the influence of each neuron over a wider input area.

Padding lets the kernel reach the edge of the input. It doesn’t directly impact the receptive field of a single layer. However, without padding the input shrinks. Padding preserves the size, so when more layers are stacked, the receptive field grows. It allows for deeper models without the input shrinking too fast.

Calculating the Receptive Field

Here’s how you can calculate the receptive field. Let’s define for layer i:

RF[i]: Receptive field size at layer i.
K[i]: Kernel size at layer i.
S[i]: Stride at layer i.
J[i]: Jump (the step size in terms of input pixels).

For the input layer, RF[0]=1 (each pixel sees itself) and J[0]=1 (the input step size). Think of the jump as how many input pixels separate two adjacent outputs in the current layer. When a layer aggregates across k inputs, and each of those inputs is already spread apart, the receptive field grows faster than just k.

At each layer i, we update the receptive field and jump as follows:

$$ \text{RF}[i] = \text{RF}[i-1] + (K[i]-1) \times J[i-1] $$ $$ J[i] = J[i-1] \times S[i] $$

The term K[i]-1 is how many new "jumps" you’re adding, and J[i-1] is how far apart those inputs are in the input space.

Example:

Assume for a layer:

RF[1] = 3 (from a previous convolution)
J[1] = 2 (Layer 1 outputs are 2 input pixels apart)
K = 3 (the next layer's kernel)

Then the receptive field for the next layer is:

$$ \text{RF}[2] = 3 + (3-1) \times 2 = 7 $$

You're adding 2 jumps of 2 pixels each because the kernel sees 3 inputs. The jump tells us how stretched the receptive field becomes. The more stride you use, the more your output is coarsely spread across the input. So, when a new kernel sees k inputs, the total receptive field includes the jumps between those inputs.

In vanilla CNNs with small kernels, the receptive field grows linearly, and you might need dozens of layers to reach a large enough receptive field. Even if theory says the receptive field is large, in practice, most of the signal comes from a much smaller region. This is called the effective receptive field, and it’s usually smaller and Gaussian-shaped. There are a number of techniques we can apply to make sure the receptive field grows fast, which will be covered over the course of this tutorial.

Advanced Techniques & Implementations

Convolution in Practice

Let’s take a look at how kernels are applied in a convolution layer using NumPy. We’re sliding the kernel over the input, where i goes down the rows and j goes across the columns.


import numpy as np

def conv2d(input, kernel, stride=1, padding=0):
    H_in, W_in = input.shape
    K = kernel.shape[0]  # Assume square kernel for simplicity

    # Pad input
    if padding > 0:
        input = np.pad(input, ((padding, padding), (padding, padding)), mode='constant')

    H_out = (input.shape[0] - K) // stride + 1
    W_out = (input.shape[1] - K) // stride + 1

    output = np.zeros((H_out, W_out))

    for i in range(H_out):  # Move down the rows
        for j in range(W_out):  # Move across the columns
            i_start = i * stride
            j_start = j * stride
            patch = input[i_start:i_start + K, j_start:j_start + K]
            output[i, j] = np.sum(patch * kernel)

    return output

Now, consider this example: let’s say you want to get a receptive field of 5. You can get that with one 5x5 kernel layer. Or, you can get it with two 3x3 kernels with a stride and padding of 1. The two 3x3 kernels would actually have fewer parameters and can add more non-linearity, increasing expressiveness. For shallow models, a 5x5 might benefit from wide kernels, but even then, a dilated 3x3 can do the same task.

Dilation

Dilation means inserting spaces between kernel elements when sliding over the input, so the kernel reaches further without increasing its size or parameter counts. Dilation is like using a "zoomed-out" kernel. A regular 3x3 kernel covers a 3x3 patch. If the dilation is 2, we’re inserting 1 zero between each kernel element, so now it covers a 5x5 region even though the kernel is still 3x3.

The effective kernel size is:

$$ \text{Effective Kernel Size} = K + (K - 1) \times (\text{Dilation} - 1) $$

This method skips over input positions, increasing the receptive field without adding parameters. The receptive field with dilation (D[i]) is calculated as:

$$ \text{RF}[i] = \text{RF}[i-1] + (K[i]-1) \times J[i-1] \times D[i] $$

So, if dilation seems great, why not use it all the time? There’s a limit to how far you can push it. This is called the gridding effect or blind spots. A large dilation causes the kernel to skip over much of the input. Some input positions are never touched by the neuron. The same is true for stride. Stride widens the spacing between sampled input positions, especially when combined with dilation, and can cause blind spots. It’s important to use a stride of 1 in early layers to avoid this and periodically reset to a standard convolution to restore coverage.

An example of effective kernel coverage with a dilation of 4:

[ x   0   0   0   x   0   0   0   x ]

We’re only looking at inputs 0, 4, and 8. If you stack multiple layers with the same dilation, each new layer sees a sparse subset of the previous sparse subset. So, if layer 1 sees inputs 0, 4, 8, 12, then layer 2 sees the same spots at multiples of 4, and its receptive field might cover 0, 16, 32, completely missing inputs like 1, 2, 3... They will never influence the output. Hence, you should always vary dilation across layers.

Use smaller dilation in early layers so it densely covers the input. Mix with non-dilated convolutions and use parallel branches with different dilation rates, such as in Atrous Spatial Pyramid Pooling (ASPP). ASPP is a CNN module designed to capture multi-scale context using parallel dilated convolutions. Small dilation captures fine and local detail, while larger dilation captures global and semantic context. Together, they create richer features without increasing depth.

Pooling and Downsampling

Pooling is a form of downsampling achieved by sliding a window over the input and taking a statistic of that window. It doesn’t learn; it’s hard-coded.

Max-Pooling: Keeps only the maximum value in the window. It is translationally invariant, keeps strong activations, ignores noise, and highlights dominant signals. However, max-pooling can lose spatial detail.
Average Pooling: Takes the average of the values. It smooths features, reduces noise, preserves low-frequency info, and keeps more information from the input. For audio and time-series signals where signal energy is more meaningful, average pooling may be more useful.
Global Pooling: Reduces the entire spatial or temporal dimension to one value per channel. You get one average or max per channel. Global pooling removes dimensions, reduces parameters, can replace fully-connected layers, and helps reduce overfitting.
Adaptive Pooling: Adjusts the window size and stride automatically so that the output has a target shape, no matter the input size.

A regular convolution with a stride > 1 also causes downsampling, just like pooling. It’s a learnable kernel versus the hardcoded pooling, although it is not as translation-invariant. Strided convolution may lose spatial sharpness and can cause aliasing if not tuned carefully. Aliasing happens when you downsample a signal without removing high-frequency components first, causing those components to be misrepresented as incorrect lower-frequency artifacts.

There are anti-aliasing kernels like BlurPool that apply a fixed blur kernel before downsampling. This gives the convolution a chance to extract features before reducing resolution and also helps the layer be more shift-invariant. A model is shift or translation invariant if shifting the input slightly does not change the output much. CNNs can lose this property due to operations like strided convolution, max-pooling, and zero-padding, which adds artificial borders and causes asymmetry. Sometimes using reflect or replicate padding is better than zero-padding.

Specialized Convolution Types

1x1 Convolution

A 1x1 convolution is similar to a linear layer. It applies a single learned weight to each spatial position but across all input channels. It sees one pixel location at a time across all C_in input channels and applies C_out different linear combinations (dot products) to the vector of channels. It’s like applying a linear layer with a weight shape of [C_out, C_in] to each pixel independently.

A 1x1 convolution preserves the spatial grid, as weights are shared across all locations. It mixes channel features at each pixel, whereas a linear layer would collapse the feature maps to 1D. Neither expands the receptive field. A 1x1 convolution can often be more efficient via convolution engines than matrix multiplication over flattened inputs.

Let’s talk about this weight-sharing impact. For each pixel (h,w), the input is x[:, h, w] = [R, G, B]. Then a 1x1 convolution output is:

$$ \text{output}[:, h, w] = W \cdot x[:, h, w] $$

It applies the same weight matrix W to each pixel's [R, G, B] vector to create new channels:

$$ \text{out}_1(h,w) = w_{11}R + w_{12}G + w_{13}B $$ $$ \text{out}_2(h,w) = w_{21}R + w_{22}G + w_{23}B $$

It mixes those three channels linearly using the same weights at every location. It’s not a convolution in the spatial sense; it’s applying a linear layer to the channel vector for each pixel, which is why it’s called a pointwise operation. In contrast, a linear layer flattens the input (e.g., to [1, 3 x H x W]) and applies a large weight matrix to get the output channels. It mixes all pixels and all channels together into a single vector. This is typically okay for the final classification layer, but for tasks requiring spatial awareness, pointwise convolution is preferred.

In summary, a 1x1 convolution performs channel mixing at each spatial location, reduces or expands the dimensionality of channels without touching H x W, and acts as a learned, spatially-aware MLP over the channels.

Depthwise Separable Convolution

A regular convolution mixes spatial information across channels. Each output channel is a weighted sum of all input channels. Depthwise convolution decouples this: it applies one filter per input channel independently, with no channel mixing.

In PyTorch, if you set groups = in_channels, you get one filter per channel. Each channel gets its own kernel, and there is no mixing.


def depthwise_conv2d(x, kernels, padding=1):
    """
    x: shape [C, H, W]
    kernels: shape [C, kH, kW]
    returns: shape [C, H, W]
    """
    C, H, W = x.shape
    kH, kW = kernels.shape[1:]
    padded = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    out = np.zeros_like(x)

    for c in range(C):
        for i in range(H):
            for j in range(W):
                patch = padded[c, i:i + kH, j:j + kW]
                out[c, i, j] = np.sum(patch * kernels[c])

    return out

If depthwise convolution doesn’t do channel mixing, what does? A pointwise convolution makes a great pair. And that’s called depthwise separable convolution. The depthwise convolution outputs [B, C_in, H, W], and the pointwise convolution takes that and applies a 1x1 kernel across all channels to produce [B, C_out, H, W].

The total number of parameters for C_in=32, C_out=64, and kernel_size=3x3:

Standard Convolution: 32 × 64 × 3 × 3 = 18,432
Depthwise Separable: (32 × 3 × 3) + (32 × 64 × 1 × 1) = 288 + 2,048 = 2,336 (about 9x fewer parameters!)


import torch.nn as nn

class SeparableConv2D(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, padding):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

Batch Normalization

We haven’t talked about a really important block that convolution layers tend to employ: batch normalization. BatchNorm normalizes the activation for each feature/channel across a batch, keeping the distribution stable during training. This typically helps with speeding up convergence, training deeper networks, and allowing higher learning rates. It computes one mean and variance over the batch and spatial dimensions for each channel. It then scales and shifts the normalized activations with learnable parameters.


class BatchNorm2d:
    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.num_channels = num_channels
        self.gamma = np.ones((num_channels,), dtype=np.float32) # Learnable scale
        self.beta = np.zeros((num_channels,), dtype=np.float32)  # Learnable shift
        self.running_mean = np.zeros((num_channels,), dtype=np.float32) # For inference
        self.running_var = np.ones((num_channels,), dtype=np.float32)  # For inference
        self.training = True

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        # x shape: [B, C, H, W]
        B, C, H, W = x.shape

        if self.training:
            # Compute mean and var over (B, H, W) for each channel
            x_reshaped = x.transpose(1, 0, 2, 3).reshape(C, -1)  # [C, B*H*W]
            mean = x_reshaped.mean(axis=1)
            var = x_reshaped.var(axis=1)

            # Update running stats
            self.running_mean = (self.momentum * mean + (1 - self.momentum) * self.running_mean)
            self.running_var = (self.momentum * var + (1 - self.momentum) * self.running_var)
        else:
            # Use running stats in eval mode
            mean = self.running_mean
            var = self.running_var

        # Normalize
        x_norm = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + self.eps)
        out = self.gamma[None, :, None, None] * x_norm + self.beta[None, :, None, None]
        return out

CNNs pass activations through many layers. Without normalization, the distribution of activations drifts (internal covariate shift), and later layers get unstable input. BatchNorm keeps the activations centered and scaled, making training smoother and more predictable. A typical convolution block has BatchNorm right after the convolution layer and before the activation function.


self.block = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, padding=padding, bias=False),
    nn.BatchNorm2d(out_ch),
    nn.ReLU(inplace=True),
    nn.Dropout2d(p=dropout_prob)
)

Temporal Modeling (TCNs)

Now let’s talk about applying CNNs to temporal modeling. You can use Conv1D over time to model sequential dependencies. By applying a filter over the time dimension, it learns temporal patterns like motion, trends, and signal rhythms. Each output time step sees a window of size K from the past or future.

In a causal task, you can’t use future inputs to predict the current output. A standard Conv1D is non-causal. To fix this, you must control padding carefully. For a kernel of size 3 where you want the output at time t to depend on [t-2, t-1, t], you pad only on the left by kernel_size - 1.


import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(
            in_ch, out_ch, kernel_size=kernel_size,
            padding=0, dilation=dilation
        )

    def forward(self, x):
        # x: [B, C, T]
        x = F.pad(x, (self.pad, 0))  # Pad left only
        return self.conv(x)

The next step is to use dilation to allow for exponentially larger receptive fields with fewer layers, letting you capture long-term dependencies. With L layers, kernel size k, and dilation doubling at each layer, the receptive field becomes:

$$ \text{Receptive Field} = 2^L \times (k-1) + 1 $$

If dilation increases too fast, some intermediate time steps are skipped, and the TCN may miss finer-grained dependencies. Stacking blocks with repeated dilation cycles can help.

TCNs also need residual connections to work reliably. With a deep stack of dilated convolutions, gradients can vanish. With a residual connection, the model learns modifications on top of the input. It’s a shortcut that lets the input skip one or more layers.


class TCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.conv1 = CausalConv1d(in_ch, out_ch, kernel_size, dilation)
        self.relu1 = nn.ReLU()
        self.drop1 = nn.Dropout(dropout)
        self.conv2 = CausalConv1d(out_ch, out_ch, kernel_size, dilation)
        self.relu2 = nn.ReLU()
        self.drop2 = nn.Dropout(dropout)
        # 1x1 conv to match dimensions if needed
        self.downsample = nn.Conv1d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        out = self.conv1(x)
        out = self.relu1(out)
        out = self.drop1(out)
        out = self.conv2(out)
        out = self.relu2(out)
        out = self.drop2(out)
        res = self.downsample(x)
        return out + res

For multi-sensor data (like EEG), you can first apply a Conv2D to extract temporal patterns shared across all channels. This is done by treating the channels C as height, so you get an input of x: [B, 1, C, T]. A kernel of size (C, k) covers all channels and a temporal window.

A 2D TCN is a Conv2D with dilation and causal padding applied across the temporal axis. The convolution layer slides across [C, T] causally. Stacking these layers with increasing dilation expands the receptive field.


class Conv2DTemporalTCN(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=(1, kernel_size),  # (no spatial span, only temporal)
            dilation=(1, dilation),
            padding=(0, (kernel_size - 1) * dilation),  # causal padding only in time
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.elu = nn.ELU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        x: [B, in_channels, C, T]
        returns: [B, out_channels, C, T]
        """
        x = self.conv(x)
        x = x[..., :x.shape[-1]]  # Trim future steps for causality
        x = self.bn(x)
        x = self.elu(x)
        x = self.dropout(x)
        return x

Key CNN Architectures

Now, with this summary, let’s try to understand a couple of very well-known CNN architectures.

AlexNet

AlexNet made CNNs famous. It introduced ReLU, overlapping max-pooling, and made heavy use of strided convolution and pooling. With overlapping max-pooling, the goal was to reduce spatial resolution to make training feasible while keeping more information than non-overlapping pooling.

VGG

VGG is an architecture that stacks many 3x3 convolution layers with max-pooling to keep the receptive field large while extracting key features. This is a form of smart compression and a signal booster for important patterns.

ResNet

ResNet (Residual Network) uses skip connections to allow gradients to flow through very deep networks. Instead of learning H(x), the model learns the change from the input x, F(x) = H(x) - x, so H(x) = F(x) + x. It’s often easier for the model to learn what needs to be added to x than a full transformation of x. Some variants use a bottleneck block to reduce the number of channels before a 3x3 convolution and then expand it back to the original size with a 1x1 convolution.

MobileNet

MobileNet was designed for mobile and embedded devices. It introduced depthwise separable convolutions to drastically reduce parameters and compute. MobileNetV2 uses a clever twist on ResNet's skip connection called inverted residuals. Instead of shrinking, processing, and expanding, it expands, processes, and then shrinks. It also uses global average pooling instead of a fully-connected layer for classification.

EfficientNet

EfficientNet is an extremely powerful CNN built on the ideas of ResNet and MobileNet, using neural architecture search and compound scaling to achieve state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPs. Instead of scaling depth, width, or resolution individually, EfficientNet scales them together with a fixed ratio. It uses a block similar to MobileNetV2's but adds a Squeeze-and-Excitation (SE) block. The SE block provides channel-wise attention, allowing the network to reweigh each feature map based on its importance.


class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # -> [B, C, 1, 1]
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid()
        )

    def forward(self, x):
        B, C, H, W = x.shape
        s = self.pool(x).view(B, C)      # [B, C]
        w = self.fc(s).view(B, C, 1, 1)  # [B, C, 1, 1]
        return x * w                      # Channel-wise scale

The SE block is placed inside the MBConv block, right after the depthwise convolution and before the final projection layer.


class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expansion=6, kernel_size=3, stride=1, se_reduction=4):
        super().__init__()
        mid_ch = in_ch * expansion
        self.use_residual = (in_ch == out_ch) and (stride == 1)

        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU()
        ) if expansion != 1 else nn.Identity()

        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride, padding=kernel_size // 2,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU()
        )

        self.se = SEBlock(mid_ch, reduction=se_reduction)

        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch)
        )

    def forward(self, x):
        identity = x
        x = self.expand(x)
        x = self.depthwise(x)
        x = self.se(x)
        x = self.project(x)
        if self.use_residual:
            x = x + identity
        return x

Inception

The design of CNNs can feel like trial and error. The Inception architecture's answer is: instead of picking one kernel size, let’s compute multiple receptive fields in parallel and let the network figure out what matters. It has four branches:

Branch 1: A 1x1 convolution.
Branch 2: A 1x1 convolution followed by a 3x3 convolution.
Branch 3: A 1x1 convolution followed by a 5x5 convolution.
Branch 4: A 3x3 max-pooling layer followed by a 1x1 convolution.

The channels from all branches are then concatenated. Inception also uses factorized convolutions (e.g., using two 3x3s instead of one 5x5). Newer versions combine Inception blocks with residual connections.

Inception Architecture Layout:


[Input: 224x224 RGB]
      ↓
[Stem Block]
      ↓
[Inception(3a-3b)] → MaxPool
      ↓
[Inception(4a-4e)] → MaxPool
      ↓
[Inception(5a-5b)]
      ↓
[GlobalAvgPool + Dropout + FC]
      ↓
[Output: 1000 classes]

We have learned about the basics of CNNs: kernels, dilation, and more. We learned about the receptive field and how to calculate it. We learned about depthwise, pointwise, and separable convolutions. We learned about pooling and strided convolution. We learned about batch normalization and temporal/causal convolution blocks. And we looked at a number of popular CNN architectures and how they employ these techniques.