1D Convolutions for Time-Series Data

Stride, Receptive Field, and Aliasing

With a 1D conv layer, stride is how many steps you move the filter at each slide along the sequence. A stride of 1 is max resolution; every neighboring point contributes. Anything greater than 1 produces fewer output points, performing downsampling. The receptive field is how much of the input each output sees. A larger stride doesn’t change the receptive field of a single filter—that’s set by the kernel size—but it changes the spacing of the outputs. As you stack layers, striding effectively makes the receptive field grow faster because each deeper unit corresponds to bigger chunks of the original signal.

A stride > 1 produces fewer output points, so there are fewer multiplications, which lowers latency and memory cost. That’s why striding is often used instead of pooling for downsampling. Like any downsampling, however, it risks aliasing.

At a sampling rate $f_s$, any frequency content above $f_s/2$ (the Nyquist frequency) folds back into the lower band; that’s aliasing. We typically low-pass filter to attenuate those high frequencies before downsampling.

If you stride by $s$, you downsample by a factor of $s$. If the original sampling rate is 1 kHz, then after a stride of 2 the effective sampling rate is 500 Hz ($f_s/2$), and the new Nyquist frequency is 250 Hz ($f_s/4$). If the input contains content above 250 Hz, it will alias unless that content is attenuated before the downsampling. In a CNN, the convolutional filter can act as a crude low-pass filter if it learns averaging-like weights, but this is not guaranteed.

In the early layers of a CNN, we use a stride of 1 to preserve fine timing. As we go deeper, we can stride more to capture a larger context without exploding the compute cost. For sensory data, the 1D conv layers act as learned temporal filters that extract features from the raw signal.


Padding and Causality

When we apply a convolution, the filter slides across the input, and near the edges it would hang off the ends unless you pad. With a filter of size $k$ and no padding, $L_{out} = L_{in} - k + 1$.

With zero-padding, typically 'same' padding, you pad both sides with zeros so the output length matches the input length at a stride of 1. Without it, the output shrinks and you might lose information at the edge.

Putting it together:

$$L_{out} = \frac{(L_{in} + 2P - k)}{s} + 1$$
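As a quick sanity check with illustrative values, we can compare the formula against what `nn.Conv1d` actually produces; the helper below also includes dilation $d$ and reduces to the formula above when $d = 1$:

import torch
import torch.nn as nn

def conv1d_out_len(L_in, k, s=1, p=0, d=1):
    # Effective kernel span with dilation is (k - 1) * d + 1; PyTorch floors the division.
    return (L_in + 2 * p - d * (k - 1) - 1) // s + 1

# Illustrative values: 1000 samples, kernel 7, stride 2, padding 3
L_in, k, s, p = 1000, 7, 2, 3
print(conv1d_out_len(L_in, k, s, p))                 # 500
conv = nn.Conv1d(1, 1, kernel_size=k, stride=s, padding=p)
print(conv(torch.randn(1, 1, L_in)).shape[-1])       # 500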

Causal Padding

Typical 'same' padding looks both backward and forward in time. With causal padding, the filter looks only at the current and past inputs: it pads only the left side. With real-time inference, you can’t peek into the future, and causal convolution ensures that predictions at time $t$ depend only on data up to time $t$. If you train with future samples, you might get better performance on an offline batch than you will see in real time.

In PyTorch/TensorFlow, `padding='same'` pads the input so the kernel is centered at each output position, which means the output at time $t$ already peeks at future samples. If you instead put $k-1$ zeros only on the left of the input, each output looks only backward.
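A minimal sketch of causal padding in PyTorch (the `CausalConv1d` wrapper is my own illustration, not a library module): left-pad by $(k-1)\cdot d$ and use no built-in padding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Pad (k - 1) * d zeros on the left only, so the output at t sees inputs <= t
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size=kernel_size,
                              dilation=dilation, padding=0)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # (left, right) padding on the time axis
        return self.conv(x)

x = torch.randn(2, 8, 100)
print(CausalConv1d(8, 16, kernel_size=5, dilation=2)(x).shape)   # torch.Size([2, 16, 100])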


Dilated Causal Convolutions and Receptive Field

Now, to make causal convolution more powerful, we can use dilation: the kernel taps are spaced $d$ samples apart instead of being adjacent. When the dilation is 1, the receptive field of a single layer is just the kernel size. With a dilation > 1, we can look farther back in time, because the total span of the kernel becomes $(k-1)d + 1$.

Now let’s stack two causal dilated conv layers. In the first layer, you usually don’t want to dilate, because dilating right at the input introduces blind spots (samples no tap ever touches). When you stack a second layer, each of its outputs sees the full receptive field of every first-layer output it reads from, so the receptive field compounds.

Let’s define jump: the effective stride in raw input units. Jump is how far apart two adjacent outputs at layer $l$ are in terms of raw input indices. At the input, each step is one sample. At each layer:

$$ \text{jump}_l = \text{jump}_{l-1} \cdot s_l $$

Because a stride at layer $l$ skips positions in layer $l-1$, it multiplies the step between adjacent outputs in raw input units. Dilation does not change the jump: it spaces out the taps inside the kernel, so it widens the span the kernel covers and instead enters the receptive field recursion:

$$ RF_l = RF_{l-1} + (k_l - 1) \cdot d_l \cdot \text{jump}_{l-1} $$
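A small helper (an illustrative function of my own, not a library call) that applies these two recursions layer by layer:

def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, ordered from the input up."""
    rf, jump = 1, 1
    for k, s, d in layers:
        rf = rf + (k - 1) * d * jump   # RF_l = RF_{l-1} + (k_l - 1) * d_l * jump_{l-1}
        jump = jump * s                # jump_l = jump_{l-1} * s_l
    return rf, jump

# Two causal layers: k=3 with no dilation, then k=3 with dilation 2
print(receptive_field([(3, 1, 1), (3, 1, 2)]))   # (7, 1)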

Example: Designing for a 300ms Receptive Field

Let’s consider sensory data coming in at a 1kHz sampling rate. Let’s say we know that in our trial window, the activity corresponding to our label is about 300 ms long. So, we would want our model’s receptive field to see the entire structure.

We want to design a CNN whose receptive field (RF) is greater than 300 samples (300 ms at 1 kHz). We will mix standard convolutions, striding, and dilation. We start at the input with a receptive field of 1 and a jump of 1, then apply the recursions above layer by layer, as in the sketch below.
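One possible stack that clears 300 samples, shown purely to illustrate the bookkeeping (the exact kernels, strides, and dilations are assumptions, not a prescription), using the `receptive_field` helper above:

# (kernel, stride, dilation) per layer, ordered from the input up
layers = [
    (7, 1, 1),   # RF 7,   jump 1  - fine temporal filters, no downsampling
    (5, 2, 1),   # RF 11,  jump 2  - first downsample
    (5, 2, 1),   # RF 19,  jump 4
    (3, 1, 2),   # RF 35,  jump 4  - dilations widen context cheaply
    (3, 1, 4),   # RF 67,  jump 4
    (3, 1, 8),   # RF 131, jump 4
    (3, 1, 16),  # RF 259, jump 4
    (3, 1, 32),  # RF 515, jump 4  - > 300 samples = 300 ms at 1 kHz
]
print(receptive_field(layers))   # (515, 4)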


Design Considerations for Sensory Data CNNs

We have to make some model decisions.

Pooling and Final Layers

Pooling is also a downsampling operation. It summarizes a local region into a single value. Each pooled output corresponds to a wider chunk of the input, increasing the receptive field of later layers. Max pooling is robust to timing jitter, and average pooling smooths out noise.

Both pooling and striding reduce the time resolution, but a strided convolution downsamples with a learnable filter, whereas pooling is a fixed function. In the early layers, keep stride 1 and skip pooling; in the mid layers, you can start to compress time.

The final layers of a CNN often use Global Average Pooling (GAP) across time before classification. This squeezes out the sequence dimension and gives you one feature per channel, so GAP produces a fixed-size representation regardless of input length. It also provides translation invariance in time: exactly where in the window a pattern occurs matters less than the fact that it occurs somewhere. Instead of flattening a $C \times T$ tensor into a huge vector, pooling collapses it to just $C$ values, which keeps the classification head from exploding in size.
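A minimal GAP classification head as a sketch (the channel count and number of classes below are placeholders):

import torch
import torch.nn as nn

class GAPHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):          # x: (batch, channels, time), any time length
        x = x.mean(dim=-1)         # global average pool over time -> (batch, channels)
        return self.fc(x)          # (batch, num_classes)

print(GAPHead(64, 5)(torch.randn(2, 64, 250)).shape)   # torch.Size([2, 5])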

Pooling is not the only way; you can learn a weight per time step and let the model focus on the most informative parts of the signal.

$$ Z = \sum_t \alpha_t h_t \quad \text{and} \quad \alpha_t = \text{softmax}_t(w^\top h_t) $$
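A sketch of this attention pooling, assuming the simplest scoring function (a single linear projection to one score per time step):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)          # w^T h_t -> one scalar score per step

    def forward(self, x):                            # x: (batch, channels, time)
        h = x.transpose(1, 2)                        # (batch, time, channels)
        alpha = torch.softmax(self.score(h), dim=1)  # softmax over the time dimension
        return (alpha * h).sum(dim=1)                # weighted sum -> (batch, channels)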

We can also feed the CNN features into an LSTM or Transformer to model temporal dependencies, then take the last hidden state or apply attention pooling on top. This can be stronger when the label depends on the temporal dynamics. And instead of treating all channels equally, you can use Squeeze-and-Excitation to learn channel weights before pooling.


Normalization for Time-Series Data

Now let’s talk about normalization. BatchNorm (BN) is the default in image CNNs; it stabilizes training and reduces internal covariate shift. For time-series signals with long windows, GPU memory is tight, so batch sizes are often small. And if the signals are non-stationary and drift over time, BatchNorm can wash out useful information. Let’s unpack this a bit.

For images (N, C, H, W), BN computes the mean/variance per channel across the batch N and the spatial dimensions H, W. For a 1D conv layer and a time-series signal (N, C, T), BN computes per channel across the batch and time. If a batch contains both weak- and strong-amplitude trials, BN maps them onto the same normalized scale, so the absolute amplitude difference is gone; the scale cue that one burst was stronger than another is lost. For images we usually don’t care whether a picture is brighter or darker, but for time-series signals amplitude usually carries information. And at test time, BatchNorm relies on running statistics collected during training; if input arrives one trial at a time and the signal has drifted, those running stats may no longer match the data.

There are other forms of normalization, such as LayerNorm, InstanceNorm, and GroupNorm, which compute statistics per sample rather than across the batch.
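A small sketch of which axes each variant normalizes over for a (N, C, T) tensor (sizes are illustrative; GroupNorm with one group behaves like a LayerNorm over channels and time):

import torch
import torch.nn as nn

x = torch.randn(8, 16, 1000)                       # (batch, channels, time)

bn = nn.BatchNorm1d(16)                            # per channel, across batch AND time
gn = nn.GroupNorm(num_groups=1, num_channels=16)   # per sample, across channels and time
inorm = nn.InstanceNorm1d(16)                      # per sample and per channel, across time only

print(bn(x).shape, gn(x).shape, inorm(x).shape)    # all torch.Size([8, 16, 1000])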


Depthwise Separable Convolutions in 1D

The number of parameters in a standard convolution is $k \times C_{in} \times C_{out}$. Each filter mixes time and channels jointly.

Depthwise separable convolution breaks it into two cheaper steps by setting groups to the number of input channels.

  1. A depthwise convolution applies one filter per input channel independently. Each filter has a size of $k \times 1$. The number of parameters is $k \times C_{in}$. It captures temporal patterns within each channel but not cross-channel mixing.
  2. It is followed by a pointwise convolution (a 1x1 convolution). It mixes across channels and captures cross-channel correlations. The number of parameters is $C_{in} \times C_{out}$.

The total number of parameters is $k \times C_{in} + C_{in} \times C_{out}$. The depthwise part learns temporal filters per channel, and then the pointwise part learns how channels interact.


import torch
import torch.nn as nn

class SeparableConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        
        # Depthwise: one conv per channel (groups=in_channels)
        self.depthwise = nn.Conv1d(
            in_channels, 
            in_channels, 
            kernel_size=kernel_size, 
            stride=stride, 
            padding=padding, 
            groups=in_channels,  # <- key!
            bias=False
        )
        
        # Pointwise: 1x1 conv to mix channels
        self.pointwise = nn.Conv1d(
            in_channels, 
            out_channels, 
            kernel_size=1, 
            bias=False
        )

    def forward(self, x):
        x = self.depthwise(x)  # per-channel temporal filtering
        x = self.pointwise(x)  # cross-channel mixing
        return x
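
A quick check of the parameter savings, using the class above with illustrative sizes ($k = 5$, $C_{in} = 64$, $C_{out} = 128$):

def num_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv1d(64, 128, kernel_size=5, bias=False)
separable = SeparableConv1d(64, 128, kernel_size=5)

print(num_params(standard))    # 5 * 64 * 128      = 40960
print(num_params(separable))   # 5 * 64 + 64 * 128 = 8512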
    

Advanced Architectures for Temporal Signals

Time-Depth Separable (TDS) Block

A Time-Depth Separable (TDS) block is inspired by depthwise separable convolution and is tailored for temporal signals. It has three main steps:

  1. A time-channel separable convolution (a large convolution along time applied independently to each channel).
  2. A pointwise convolution that mixes across channels, expanding and then projecting the feature dimensionality back down.
  3. A residual connection with normalization and dropout to avoid vanishing gradients.

It’s basically a temporal pattern extractor + channel mixer + residual. LayerNorm is used because it normalizes per sample across channels, not across the batch; it respects the time-series structure and doesn’t erase trial-level amplitude differences like BN would.


import torch.nn.functional as F

class TDSBlock(nn.Module):
    """
    Time-Depth Separable (TDS) block for temporal signals (EMG/IMU/Audio).
    """
    def __init__(self, channels, kernel_size=5, dropout=0.1):
        super().__init__()
        
        self.depthwise = nn.Conv1d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=kernel_size,
            groups=channels,
            padding=kernel_size // 2,
            bias=False
        )
        
        self.pointwise1 = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.pointwise2 = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        
        self.norm = nn.LayerNorm(channels)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, channels, time)
        residual = x
        
        out = self.depthwise(x)
        out = F.relu(out)
        
        out = self.pointwise1(out)
        out = F.relu(out)
        out = self.pointwise2(out)
        
        out = self.dropout(out)
        out = out + residual
        
        # LayerNorm expects (batch, time, channels)
        out = out.transpose(1, 2)
        out = self.norm(out)
        out = out.transpose(1, 2)
        
        return out
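
A quick shape check with illustrative sizes:

block = TDSBlock(channels=32, kernel_size=5, dropout=0.1)
x = torch.randn(4, 32, 500)          # (batch, channels, time)
print(block(x).shape)                # torch.Size([4, 32, 500]) - time length preserved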
    

Multi-Scale TDS Architectures 📈

You can use Multi-Scale TDS blocks: TDS blocks running in parallel at increasing temporal scales, obtained through downsampling and dilation. The network can then capture both fine-grained bursts and long-range temporal structure. This gives it multiple receptive field sizes: fine-scale branches with no downsampling, and coarse-scale branches that cover hundreds of milliseconds, so the model gets a temporal-pyramid view of the signal. Using downsampling factors of $2^s$ gives exponentially increasing context.

We downsample each branch with average pooling; the averaging acts as a crude low-pass filter, which reduces aliasing. Max pooling would throw away too many fine details and is also prone to aliasing. What makes each branch different in the multi-scale design isn’t the TDS block itself but the input resolution it operates on, set by the pooling.

Pooling operates on the last dimension, so in the code below it is applied along time. We then upsample each branch’s output back to the original time resolution so the branches can be fused (summed) and remain compatible with residual connections.


class MultiScaleTDS(nn.Module):
    """
    Multiscale TDS design (s = 0..5).
    """
    def __init__(self, channels, kernel_size=5, num_scales=6, dropout=0.1):
        super().__init__()
        
        self.scales = nn.ModuleList()
        for s in range(num_scales):
            block = TDSBlock(channels, kernel_size=kernel_size, dropout=dropout)
            self.scales.append(block)
        
        self.num_scales = num_scales

    def forward(self, x):
        # x: (batch, channels, time)
        outputs = []
        T = x.size(-1)
        
        for s, block in enumerate(self.scales):
            # Downsample by 2^s
            pooled = F.avg_pool1d(x, kernel_size=2**s, stride=2**s, ceil_mode=True)
            
            # Apply TDS block
            out = block(pooled)
            
            # Upsample back to original time length
            out = F.interpolate(out, size=T, mode="linear", align_corners=False)
            
            outputs.append(out)
        
        # Fuse multiscale outputs
        return sum(outputs) / self.num_scales
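
A quick check that the multiscale module preserves the time length (sizes are illustrative):

model = MultiScaleTDS(channels=32, kernel_size=5, num_scales=6)
x = torch.randn(4, 32, 1000)         # (batch, channels, time), e.g. 1 s at 1 kHz
print(model(x).shape)                # torch.Size([4, 32, 1000])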