Introduction to Transformers

Older sequence models like RNNs process one token at a time, so they struggle with long-range dependencies and cannot parallelize well over the sequence. CNNs look at a local neighborhood, so many stacked layers are needed to capture long-range relations.

Transformers replace recurrence and convolutions with self-attention, a mechanism where every token can directly interact with every other token in a single step. There are two main parts inside a Transformer: an encoder and a decoder. Both contain two primary sub-layers: a multi-head self-attention layer and a position-wise feed-forward network (the decoder adds a third sub-layer, cross-attention, which we will get to later).

These parts also employ residual connections, layer normalization, and positional encodings to function effectively.


The Self-Attention Mechanism

Self-attention is the heart of the Transformer. Suppose you have an input sequence of length $n$ with an embedding dimension of $d_{model}$. We typically map each discrete token to a continuous vector using a learned embedding matrix, which is essentially a big lookup table.

For example, if you have a vocabulary of 30,000 tokens and you choose an embedding size of 512, then the embedding matrix will have a shape of $30,000 \times 512$. When you feed a word to the model, it looks up the corresponding row in this matrix, and that becomes the embedding vector of size 512 for that word.

The input sequence has a length of $n$ (the number of tokens), and each token has an embedding dimension of $d_{model}$. After the embedding lookup, the input tensor $X$ has a shape of $n \times d_{model}$.
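To make the shapes concrete, here is a minimal sketch of the lookup using `torch.nn.Embedding` (the vocabulary size, embedding size, and token ids are just the illustrative numbers from above):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30_000, 512               # illustrative sizes from the example above
embedding = nn.Embedding(vocab_size, d_model)   # the big lookup table: 30,000 x 512

token_ids = torch.tensor([17, 4289, 55, 910])   # a toy sequence of n = 4 token ids
X = embedding(token_ids)                        # shape: (n, d_model) = (4, 512)
print(X.shape)                                  # torch.Size([4, 512])
```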

Query, Key, and Value

First, you project each input embedding into three distinct vectors: a query ($Q$), a key ($K$), and a value ($V$).

We obtain these vectors by multiplying the input embedding matrix $X$ by learned weight matrices:

$$ Q = X W_Q $$ $$ K = X W_K $$ $$ V = X W_V $$

These weight matrices project the input embeddings from dimension $d_{model}$ into a dimension $d_k$. The Query and Key vectors must always have the same dimension, $d_k$, because their dot product is computed later. The Value vector typically has the same dimension, but some implementations allow it to be different. At this point, our $Q$, $K$, and $V$ matrices each have a shape of $n \times d_k$.

Attention Score Calculation

Next, we compute the attention scores. This is done by taking the dot product of every query with every key. The resulting score between token $i$ and token $j$ determines how much token $i$ should attend to token $j$.

$$ \text{score}(i,j) = Q_i \cdot K_j $$

This is vectorized to compute the entire $n \times n$ score matrix at once:

$$ \text{Scores} = Q K^T $$

The dot product is used as a measure of similarity. Since $Q$ and $K$ are in the same vector space, their dot product measures both magnitude and directional alignment. If a query and a key point in the same direction, the model considers them relevant.

We then scale these scores by dividing by $\sqrt{d_k}$. Let’s understand why. If you assume the components of $Q$ and $K$ are normally distributed with a mean of 0 and a variance of 1, then their dot product will have a mean of 0 and a variance of $d_k$. The magnitude of this dot product would then be on the order of $\sqrt{d_k}$. By dividing by $\sqrt{d_k}$, we are normalizing the scores to have a variance of 1, which helps stabilize training. Without this scaling, the scores would grow with the dimension $d_k$.

We want these scores to behave like attention weights—they should be non-negative and sum to 1 to form a probability distribution. That’s what the softmax function does. For very large scores (i.e., if we did not normalize them), the softmax output would collapse to a one-hot distribution where one weight is 1 and all others are 0. This would cause the query to "latch on" to a single token, ignoring all other context. This not only kills the attention mechanism's ability to softly mix information but also introduces vanishing gradients, as small changes in extreme values barely change the output, preventing the network from adjusting the attention weights. The normalization by $\sqrt{d_k}$ is therefore crucial for keeping the softmax in a smooth, trainable zone.

The final attention weights, which we can call $\alpha$, are calculated as:

$$ \text{Attention Weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) $$

This gives us an $n \times n$ matrix.

Weighted Sum and Final Projection

Now, we use our attention weights to combine the Value vectors. For each token $i$, the output is a weighted sum of all other value vectors:

$$ \text{Output}_i = \sum_j \alpha_{ij} V_j $$

The entire output matrix is computed as $(\text{Attention Weights}) \times V$. The output has the shape $n \times d_k$.

Finally, a learned linear projection matrix, $W_O$, is applied. This projection re-mixes the features after the attention step. Without it, the output would just be a weighted average of the values.

$$ \text{Final Output} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V W_O $$

The matrix $W_O$ (with shape $d_k \times d_{model}$) adapts the output back to the model’s main feature space, $d_{model}$, so that it can be passed to subsequent feed-forward blocks consistently.
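Here is a minimal single-head sketch of everything so far, from the projections through the output projection $W_O$ (sizes are illustrative, and the weights are untrained random matrices):

```python
import math
import torch
import torch.nn as nn

n, d_model, d_k = 4, 512, 64                  # illustrative sizes
X = torch.randn(n, d_model)                   # token embeddings (plus positional encodings)

W_Q = nn.Linear(d_model, d_k, bias=False)     # learned projection matrices
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)
W_O = nn.Linear(d_k, d_model, bias=False)     # maps back to d_model

Q, K, V = W_Q(X), W_K(X), W_V(X)              # each has shape (n, d_k)
scores = Q @ K.T / math.sqrt(d_k)             # (n, n) scaled dot-product scores
weights = torch.softmax(scores, dim=-1)       # each row sums to 1
out = W_O(weights @ V)                        # (n, d_k) -> (n, d_model)
print(out.shape)                              # torch.Size([4, 512])
```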


Multi-Head Attention

So far, for each token, we have learned a single representation of size $d_k$, meaning the model learns one type of relationship at a time. We can introduce multiple heads to allow the model to learn multiple types of relationships in parallel. Each head projects $Q$, $K$, and $V$ into a different learned subspace, enabling the model to look at the sequence in different ways simultaneously.

We split the model dimension, $d_{model}$, into $h$ smaller chunks (the heads), each with dimension $d_k = d_{model} / h$. For instance, we can take a 512-dimensional embedding and split it into 8 heads, each with 64 dimensions. Each head then runs its own scaled dot-product attention on its respective chunk.

Instead of one set of $W_Q, W_K, W_V$, we now learn separate projection matrices for each head $i$:

$$ Q^i = X W_Q^i $$ $$ K^i = X W_K^i $$ $$ V^i = X W_V^i $$

Each head $i$ computes its own attention output, $Z^i$, with shape $n \times d_k$. After each head has independently analyzed the sequence through its own "lens," we concatenate the outputs:

$$ Z = \text{concat}(Z^1, Z^2, \dots, Z^h) $$

The resulting matrix $Z$ has a shape of $n \times (h \times d_k)$, which is equivalent to $n \times d_{model}$. We then apply a final learned projection matrix $W_O$ (with shape $d_{model} \times d_{model}$) to re-mix the information across all the heads, allowing the model to combine the different views each head produced.

$$ O = Z W_O $$
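A compact sketch of multi-head attention follows. Like most implementations, it uses one fused projection plus a reshape instead of $h$ separate weight matrices, which is mathematically equivalent; sizes are illustrative:

```python
import math
import torch
import torch.nn as nn

n, d_model, h = 4, 512, 8
d_k = d_model // h                                   # 64 dimensions per head

W_QKV = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V projections
W_O = nn.Linear(d_model, d_model, bias=False)        # re-mixes information across heads

X = torch.randn(n, d_model)
Q, K, V = W_QKV(X).chunk(3, dim=-1)                  # each (n, d_model)

# Reshape to (h, n, d_k) so each head attends in its own subspace
Q, K, V = (t.view(n, h, d_k).transpose(0, 1) for t in (Q, K, V))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (h, n, n)
Z = torch.softmax(scores, dim=-1) @ V                # (h, n, d_k)
Z = Z.transpose(0, 1).reshape(n, d_model)            # concat heads -> (n, d_model)
O = W_O(Z)                                           # (n, d_model)
print(O.shape)                                       # torch.Size([4, 512])
```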

Masking in Attention

Masking restricts which tokens the attention mechanism is allowed to look at. There are two main types of masking.

Padding Mask

A padding mask is typically applied when batching sequences where some are shorter than others. We pad these sequences with dummy tokens to make them the same shape, but we don’t want the model to attend to these dummy tokens. So, we mask these positions by setting their attention scores to $-\infty$ before applying the softmax.

After computing the attention scores $S = QK^T / \sqrt{d_k}$, we apply a mask matrix $M$:

$$ S' = S + M $$

where $M_{ij} = 0$ for tokens that are allowed to attend and $-\infty$ for those that are blocked.

For the padding mask, we want to mask out keys, not queries: if key $j$ is padding, every query $i$ should ignore it. We also need loss masking, so the model isn't penalized for its predictions at padded positions; the difference between the two, along with a "listeners and speakers" intuition, is spelled out in the Loss vs. Attention Mask section below.

Causal (Look-Ahead) Mask

There is also a causal, or look-ahead mask. In decoder models, when generating the next token, we don’t want the model to peek at future tokens. For a position $t$, we block attention to all positions $> t$ to enforce autoregressive behavior. This is achieved by adding an upper-triangular matrix of $-\infty$ values (where the diagonal is 0) to the score matrix.

$$ M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases} $$

The mask is an $n \times n$ matrix with $-\infty$ strictly above the diagonal and 0 on and below it, broadcast across the batch. Each row corresponds to a query at position $i$, which can only attend to columns where $j \le i$: at position 1 a token can only attend to itself, at position 2 it can attend to token 1 and itself, at position 3 to tokens 1 through 3, and so on.

As with the padding mask, we add $M$ to the scores $S = QK^T / \sqrt{d_k}$ (shape $n \times n$) before applying the softmax, so that $S' = S + M$ drives the blocked positions to zero weight.
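In code, this mask is a single `torch.triu` call (a minimal sketch; `n` is whatever the current sequence length is):

```python
import torch

n = 5
# -inf strictly above the diagonal blocks future positions; 0 elsewhere allows them
causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

scores = torch.randn(n, n)                 # stand-in for QK^T / sqrt(d_k)
weights = torch.softmax(scores + causal_mask, dim=-1)
print(weights[0])                          # row 0 puts all of its weight on position 0
```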

Loss vs. Attention Mask

For a padding mask (a sequence-level mask), suppose we have a batch size of $B$. Some tokens are real and some are padding. We want to mask out some keys, not queries. That means if key $j$ is a padding token, all queries $i$ should ignore it. So, the masking needs to be along the key dimension.

$$ M_{bij} = \begin{cases} 0 & \text{if key position } j \text{ is real} \\ -\infty & \text{if key position } j \text{ is padding} \end{cases} \quad (\text{shape: } B \times 1 \times n) $$

The 1 in the middle dimension means the same mask applies to all queries $i$. It gets broadcast during the addition: $(B \times 1 \times n) + (B \times n \times n) \to (B \times n \times n)$.

Broadcasting always confuses me, so let’s discuss it again. Broadcasting compares shapes element-wise from the right. If the dimensions match or one of them is 1, they’re compatible. The dimension of size 1 is stretched to match the other. The mask only told us which keys (columns) are real or padding. So, we apply the same restriction to every query row by repeating the mask along that dimension.
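A toy demonstration of that broadcast (the batch size, sequence length, and pad id are made up for illustration):

```python
import torch

B, n = 2, 4
pad_id = 0                                          # hypothetical padding token id
token_ids = torch.tensor([[11, 42,  7, 0],          # last position is padding
                          [23,  5,  0, 0]])         # last two positions are padding

key_is_pad = token_ids.eq(pad_id)                   # (B, n): True where the key is padding
mask = torch.zeros(B, 1, n).masked_fill(key_is_pad.unsqueeze(1), float("-inf"))

scores = torch.randn(B, n, n)                       # stand-in for QK^T / sqrt(d_k)
weights = torch.softmax(scores + mask, dim=-1)      # (B,1,n) broadcasts over the query dim
print(weights[1])                                   # columns 2 and 3 are 0 for every query
```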

What about a padding query token? Say we use categorical cross-entropy for next-token prediction. We don't want to penalize the model for predicting something at <pad> positions, so we zero out the contribution of those positions in the loss. Key padding has to be masked in attention because a padded key would otherwise get integrated into the context vector of real tokens. Query padding is less of a problem: whatever the model computes at that position is ignored, because padding positions are masked in the loss function as well. We mainly just don't want the model wasting computation on padding.

It’s kind of like queries are the "listeners" and keys are the "speakers." If queries hear non-sense, that’s fine because we would never ask for their opinion anyway. But, if a padding token is a "speaker," that’s bad because a real listener might listen to it and get confused. We need both attention masking and loss masking.


The Encoder-Decoder Architecture

So far, we have our input sequence of shape $n \times d_{model}$ with positional encodings added. We compute multi-head attention (MHA) and get a tensor of shape $n \times d_{model}$, the same shape as the input.

The Encoder Block

Then in our encoder block, we have an "Add & Norm" step, which uses a residual connection. It adds the input $X$ to the attention output and applies LayerNorm to stabilize it.

$$ H_1 = \text{LayerNorm}(X + \text{MHA}(X)) \quad (\text{shape: } n \times d_{model}) $$

A 2-layer MLP, called a Position-wise Feed-Forward Network (FFN), is then applied independently to each position.

$$ \text{FFN}(x) = \text{max}(0, xW_1 + b_1)W_2 + b_2 $$

This expands the dimension from $d_{model}$ to maybe $4 \times d_{model}$ and then projects it back down to $d_{model}$ while preserving the sequence length. Then we add and norm again with another residual connection.

$$ H_2 = \text{LayerNorm}(H_1 + \text{FFN}(H_1)) $$

Why do this? Deep networks suffer from vanishing gradients. The residual path provides a shortcut for gradients to flow backward, making optimization easier and allowing the whole block to safely act like an identity function at the start of training. The non-linear transformation is more expressive in a higher dimension, so the expansion allows the model to process richer combinations of features before compressing them back. We always try to hold on to the original information using residual connections.
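Here is a minimal sketch of one encoder block in the Post-LN form written above, using PyTorch's built-in `nn.MultiheadAttention`; the hyperparameters ($d_{model} = 512$, 8 heads, $d_{ff} = 2048$) are just common illustrative choices:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, n, d_model)
        attn_out, _ = self.mha(x, x, x)      # self-attention: Q = K = V = x
        h1 = self.norm1(x + attn_out)        # Add & Norm around attention
        h2 = self.norm2(h1 + self.ffn(h1))   # Add & Norm around the FFN
        return h2                            # same shape as the input

x = torch.randn(2, 10, 512)
print(EncoderBlock()(x).shape)               # torch.Size([2, 10, 512])
```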

Pre-LN vs. Post-LN

Now, you might have heard of Pre-LN and Post-LN. In the original Vaswani paper, LayerNorm was applied *after* the residual connection was added (Post-LN, as shown above). In modern practice, LayerNorm is often applied *before* the sublayer, and the residual is added afterward (Pre-LN). This keeps the residual path clean and helps gradients remain stable.

$$ H = X + \text{MHA}(\text{LayerNorm}(X)) $$

This ordering is important. If the input $X$ grows large, we don't want it to drown out the MHA output, so we normalize what goes into the sublayer rather than the sum that comes out. This keeps the relative balance between the residual and the sublayer stable. So, Pre-Layer Norm buys us two things: a clean residual path and strong, stable gradient flow.

With a clean residual path $H = X + f(X)$, the gradient is:

$$ \frac{dH}{dX} = I + \frac{df(X)}{dX} $$

Where $I$ is the identity matrix. This is the magic of residuals. With Post-Layer Norm, $H = \text{LayerNorm}(X + f(X))$, the gradient has to go through the LayerNorm's Jacobian, which complicates the gradient flow.
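For comparison, the Pre-LN version only changes the forward pass; this sketch reuses the sub-modules of the `EncoderBlock` defined in the earlier snippet:

```python
import torch

class PreLNBlock(EncoderBlock):                 # reuses the sub-modules defined above
    def forward(self, x):
        h = self.norm1(x)
        x = x + self.mha(h, h, h)[0]            # clean residual: x + f(LayerNorm(x))
        x = x + self.ffn(self.norm2(x))
        return x

print(PreLNBlock()(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```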

Why LayerNorm?

Here, $X$ is our $n \times d_{model}$ tensor, and LayerNorm is applied over the feature dimension ($d_{model}$) for each token. So, why not use something like BatchNorm? Why LayerNorm?

In most sequence problems, sequences have varying lengths. BatchNorm assumes consistent positions to average over across the batch, but LayerNorm doesn't care about that; it works token by token. With big Transformer models, BatchNorm statistics are only reliable if batch sizes are large, but due to memory constraints we typically use smaller batches, making them noisy. Furthermore, autoregressive models generate one token at a time at inference, so a batch size of 1 leaves BatchNorm with essentially no batch statistics to compute.

LayerNorm computes the mean and standard deviation of a single token's feature vector. There is no mixing across tokens or the batch.

$$ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta $$
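A quick sanity check that the statistics really are computed per token, over the feature dimension only:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 10, 512)          # (batch, n, d_model)
ln = nn.LayerNorm(512)               # normalizes over the last dimension only
y = ln(x)
print(y.mean(dim=-1)[0, :3])         # ~0 for each token independently
print(y.std(dim=-1)[0, :3])          # ~1 for each token independently
```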

The Decoder Block

The decoder block is similar to the encoder block but uses causal masking in its first attention layer. It ensures each position can only attend to itself and earlier tokens. Crucially, we also need to add information from the encoder block to the decoder block using cross-attention.

The output of the encoder is a sequence of contextualized embeddings of shape $n_{source} \times d_{model}$. The input to the decoder is the target sequence (e.g., the Farsi words in a translation task). At training time, we shift the ground-truth target sequence one step to the right; at inference, we autoregressively feed in the tokens generated so far.

Inside the decoder block, we first apply masked self-attention on the target sequence input, $Y_{in}$.

$$ H_1 = Y_{in} + \text{MHA}_{\text{causal}}(\text{LayerNorm}(Y_{in})) $$

This hidden state $H_1$ is now contextualized using only past target tokens. Now comes cross-attention. The **queries** come from the decoder, representing what the target tokens need. The **keys and values** come from the encoder, representing the information saved from the source sentence.

Say the encoder processes an English sentence. The decoder query for the word "pishi" (Farsi for kitten) asks which English words are relevant to this translation step. The model compares the query with the encoder keys and pulls in the value for the word "cat" into the decoder’s hidden state.

We take $H_1$ from the decoder and use it to form the queries for the cross-attention layer. We take the encoder output, $H_{enc}$, which is fixed across all decoder layers, to form the keys and values.

The score matrix $QK^T$ will have the shape $n_{target} \times n_{source}$. Basically, for each decoder position (query), the attention weights across all encoder positions (keys) sum to 1. Each target query decides how to distribute its attention over the encoder output.

Each decoder query pulls information from all encoder values, weighted by relevance. This is followed by another Add & Norm and the FFN.

$$ H_2 = H_1 + \text{MHA}_{\text{cross}}(Q=\text{LayerNorm}(H_1), K=H_{enc}, V=H_{enc}) $$ $$ H_3 = H_2 + \text{FFN}(\text{LayerNorm}(H_2)) $$

This is the output of one decoder block. If we had multiple decoder blocks, they would use the previous block’s output as input, but all decoder blocks would use the same encoder output $H_{enc}$ as their keys and values.

The encoder output is the memory of the source sequence. The decoder needs consistent access to this memory. What changes are the decoder's queries, which are progressively refined.

If you’re like me, you might ask why we use $H_1$ as the decoder’s queries and not wait until after the FFN. One explanation is that $H_1$ has already integrated information about the past target tokens via self-attention. The FFN only adds a position-wise MLP; it doesn’t add any new context across tokens. Therefore, it doesn’t make the queries any more informative about what to attend to in the encoder. That’s why $H_1$ is the right place to ask questions of our source. Once this context is added, we can then refine the token’s representation locally with the FFN.
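Putting the pieces together, here is a sketch of one Pre-LN decoder block matching the equations above (again with `nn.MultiheadAttention`; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, h_enc):            # y: (B, n_target, d), h_enc: (B, n_source, d)
        n_tgt = y.size(1)
        causal = torch.triu(torch.full((n_tgt, n_tgt), float("-inf")), diagonal=1)
        h = self.norm1(y)
        y = y + self.self_attn(h, h, h, attn_mask=causal)[0]   # masked self-attention
        q = self.norm2(y)
        y = y + self.cross_attn(q, h_enc, h_enc)[0]            # Q from decoder, K/V from encoder
        return y + self.ffn(self.norm3(y))

out = DecoderBlock()(torch.randn(2, 7, 512), torch.randn(2, 12, 512))
print(out.shape)                                               # torch.Size([2, 7, 512])
```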


Positional Encoding

So, we skipped a pretty important step: positional encoding. Attention treats its input as a set, not a sequence: without position information, self-attention is permutation equivariant, meaning that shuffling the input tokens simply shuffles the outputs the same way. But the whole point of a sequence model is that ordering matters.

$$ \text{Attention}(\text{shuffle}(X)) = \text{shuffle}(\text{Attention}(X)) $$

Once we get our token embeddings of size $d_{model}$, we add a positional vector $P(t)$ to each token that encodes its position $t$. This way, the token knows both its semantic meaning and its place in the sequence. In the original paper, they used fixed sinusoidal encodings. For each position $t$ and dimension $i$:

$$ P_{t,2i} = \sin(t / 10000^{2i/d_{model}}) $$ $$ P_{t,2i+1} = \cos(t / 10000^{2i/d_{model}}) $$

This encodes positions so that the relative offset between $t$ and $t + k$ is easy to recover. Let's unpack this. We want each position $t$ to be represented as a unique vector, but also in a way that captures relative offsets. Let's review some trigonometry.

The term $\sin(\omega t)$ means the function repeats every $2\pi$. The periodicity is inversely proportional to the frequency $\omega$.

$$ T = \frac{2\pi}{\omega} $$

A larger $\omega$ means shorter periods (faster oscillation), and a smaller $\omega$ means slower oscillation. In the original paper, they defined the frequency for each dimension i as:

$$ \omega_i = \frac{1}{10000^{2i / d_{model}}} $$

When $i$ is 0, we get the fastest oscillation: $\omega_0$ is 1 and the period is $2\pi$. As $i$ gets larger, $\omega_i$ gets smaller, resulting in a longer period. Each dimension of a token's positional vector at position $t$ gets a sinusoid of a different frequency, so every position is encoded in multiple frequency bands and receives a unique, multi-frequency signature. We use a sine/cosine pair to encode each frequency as a point on a circle; that avoids ambiguity and lets us derive relative positions using trigonometric identities. In effect, the high-frequency dimensions capture local order and the low-frequency dimensions capture long-range order.
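A minimal sketch of these fixed encodings, vectorized over positions and dimension pairs (sizes are illustrative):

```python
import torch

def sinusoidal_positional_encoding(n, d_model):
    t = torch.arange(n, dtype=torch.float32).unsqueeze(1)       # positions, shape (n, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)    # even dims 0, 2, 4, ...
    omega = 1.0 / (10000 ** (two_i / d_model))                  # one frequency per sin/cos pair
    P = torch.zeros(n, d_model)
    P[:, 0::2] = torch.sin(t * omega)                           # sine on even dimensions
    P[:, 1::2] = torch.cos(t * omega)                           # cosine on odd dimensions
    return P

P = sinusoidal_positional_encoding(100, 512)
X = torch.randn(100, 512)                                       # token embeddings
X = X + P                                                       # add position information
```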

Output Layer and Weight Tying

So, let's take a look at the input and output of the Transformer. At the input, we took discrete tokens and mapped them to $d_{model}$-dimensional vectors using an embedding matrix $E$ of shape $|V| \times d_{model}$, where $|V|$ is the vocabulary size.

Now, at the decoder's output, after the last block, we have hidden states of shape $n_{target} \times d_{model}$. To predict the next token, we need logits over the vocabulary.

$$ Z = H_{dec} W_{out} + b $$

where $W_{out}$ is the output projection matrix of shape $d_{model} \times |V|$. This is exactly the transposed shape of the input embedding matrix, so instead of learning two separate matrices, we can set $W_{out} = E^T$. This technique, called weight tying, has consistently improved the perplexity of language models.

In a translation task, you don't want to tie the output projection to the encoder's embedding matrix (a different language), but you can tie it to the decoder's input embedding, since both are over the same language.
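In PyTorch, weight tying is usually a one-line parameter sharing between the embedding and the output projection; a sketch, with the shapes from the discussion above:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30_000, 512
embedding = nn.Embedding(vocab_size, d_model)         # E, shape (|V|, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # produces logits over the vocabulary

# nn.Linear applies its weight transposed, so sharing the parameter realizes W_out = E^T
lm_head.weight = embedding.weight

h_dec = torch.randn(7, d_model)                       # decoder hidden states (n_target, d_model)
logits = lm_head(h_dec)                               # (n_target, |V|)
```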

Computational Complexity

Let’s talk complexity.

For RNNs

The core computation is:

$$ h_t = f(W_h h_{t-1} + W_x x_t) $$

where $h_t$ is the hidden state at time $t$ (a vector of dimension $d$), $x_t$ is the input at time $t$, $W_h$ is the $d \times d$ hidden-to-hidden weight matrix, and $W_x$ is the input-to-hidden weight matrix.

The complexities of the matrix multiplications are $O(d^2)$ for $W_h h_{t-1}$ and, assuming the input dimension is on the order of $d$, $O(d^2)$ for $W_x x_t$.

The cost for each hidden state is $O(d^2)$. For each time step across a sequence of length $n$, the total cost is $O(n \cdot d^2)$, and memory for storing hidden states is $O(n \cdot d)$. This is linear over the sequence length, but it doesn’t parallelize.

For CNNs

In a 1D convolution over sequences with a kernel size of $k$ and hidden dimension $d$, for each of the $n$ positions, you have to compute the weighted sum of $k$ neighboring positions. Each token has $d_{in}$ features, so one window has $k \cdot d_{in}$ features in total. The filter is a weight vector of size $k \cdot d_{in}$. To produce $d_{out}$ channels at position $t$, we need $d_{out}$ such filters. So, the weight matrix is $d_{out} \times (k \cdot d_{in})$. The output vector at position $t$ is calculated as $W \times \text{window}(t)$, which has a complexity of $O(k \cdot d_{in} \cdot d_{out})$.

So the cost per position is $O(k \cdot d_{in} \cdot d_{out})$, and with $d_{in} = d_{out} = d$ the cost per layer is $O(n \cdot k \cdot d^2)$. This is limited by the receptive field; you need depth for a global view.

For Transformers

In attention, the expensive step is computing the score matrix.

Let’s look at matrix multiplication complexity quickly. If A is $m \times d$ and B is $d \times n$, then C = AB is $m \times n$. For every element $C_{ij}$, we take the dot product between row $i$ of A and column $j$ of B, where $d$ is the length of each dot product. There are $m \times n$ elements, so the total cost is $O(m \cdot n \cdot d)$.

In attention, computing Q, K, and V has a complexity of roughly $O(n \cdot d_{model} \cdot d_k)$ for each, since X is $n \times d_{model}$ and the weight matrices are $d_{model} \times d_k$.

The score matrix is $S = Q K^T$. Multiplying Q ($n \times d_k$) by $K^T$ ($d_k \times n$) results in an $n \times n$ matrix. Each of the $n$ queries must dot product with all $n$ keys. For $h$ heads, the complexity is $O(h \cdot n^2 \cdot d_k)$, with $O(n^2)$ memory for storing the score matrix for each head.

Applying the $n \times n$ score matrix to V (an $n \times d_v$ matrix) is $O(n^2 \cdot d_v)$. So for short sequences and large hidden sizes, the complexity is dominated by the feed-forward layers ($O(n \cdot d^2)$), but for long sequences, it’s quadratic in the sequence length ($O(n^2 \cdot d)$). That’s why attention is generally considered expensive; $n^2$ grows quickly. This global receptive field comes with a cost, and it’s a fundamental bottleneck.
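A quick back-of-the-envelope comparison of the two terms (the numbers are illustrative):

```python
# Rough per-layer cost terms (multiply-adds), ignoring constants:
d = 512
for n in (128, 512, 2048, 8192):
    attention = n * n * d        # score matrix + weighted sum, O(n^2 * d)
    feed_forward = n * d * d     # projection / FFN-style terms, O(n * d^2)
    print(f"n={n:5d}  attention={attention:.2e}  feed-forward={feed_forward:.2e}")
# The quadratic term overtakes the linear-in-n term once n exceeds d.
```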


Architectural Variants and Streaming Models

Decoder-Only Transformers

The decoder-only Transformer, which is now the architecture behind modern large language models like GPT and LLaMA, throws away the encoder and cross-attention. This is perfect for autoregressive next-token prediction. For conditional generation across modalities you still want an encoder-decoder, but for text generation, decoder-only models typically outperform encoder-decoder models, partly because they are simpler to scale up.

Some people actually use a decoder-only model for translation by simply putting the source sentence first, a <SEP> token as a boundary, and the target after it. In a decoder-only model, everything is just context; source and target are not separate inputs. As long as you mark the boundary, the model learns what's source and what's target.

Streaming Transformers with Windowed Attention

Let’s now switch gears a bit. Consider streamable models like online ASR or real-time translation. They can’t see the entire sequence at once, so instead of a full self-attention, you use a restricted window. In a normal attention model, attention is global—each query token attends to all past and future tokens. With windowed attention, a token only attends to nearby tokens in a fixed range, and it is also causal because we can’t look into the future.

Say you have 10 ms frames and you want the model to make predictions in streaming chunks every 500 ms. We might want each query frame to look back at the last 50 frames. The **attention window** is how many past frames each frame is allowed to see. So, if we feed our attention mechanism 100 frames but only want it to look back 50 frames, we restrict how many keys a query can see using a sliding window. This is done by masking the attention matrix, setting the score outside the window to $-\infty$, and forcing the model to ignore keys/values beyond the window.


```python
import torch

n, window_size = 100, 50   # illustrative: 100 frames fed in, each may look back 50

# Create window mask (causal + limited): -inf blocks attention, 0 allows it
mask = torch.full((n, n), float("-inf"))
for i in range(n):
    start = max(0, i - window_size + 1)
    mask[i, start:i + 1] = 0   # allow only the last `window_size` frames
```
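
Applying the mask follows the same pattern as before: add it to the scores before the softmax. This sketch reuses `n` and `mask` from the block above, with made-up query/key/value tensors:

```python
import math
import torch

d_k = 64
Q = torch.randn(n, d_k)                    # per-frame queries, keys, and values for this chunk
K = torch.randn(n, d_k)
V = torch.randn(n, d_k)

scores = Q @ K.T / math.sqrt(d_k) + mask   # out-of-window scores become -inf
weights = torch.softmax(scores, dim=-1)    # each frame attends to at most the last 50 frames
out = weights @ V
```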

You might ask, why not feed the model overlapping windows? Why use masking with a sliding window? With overlapping windows, tokens are recomputed multiple times, and the model has no way of knowing the previous context.

This notion of context might seem a bit fuzzy. Context is the information a token carries from other tokens it has attended to; it’s the weighted sum of the value vectors. Imagine at layer 1, you look 10 frames back, so each output token is a blend of 10 input tokens. Now, in layer 2, the enriched tokens from layer 1 can add their context. Your token at position 19 isn’t just token 19 anymore; it’s a representation of tokens 10 through 19 from the previous layer. And since token 10 from the previous layer saw tokens 0 through 9, the context has propagated. It’s kind of like the receptive field growing in each layer; by stacking enough layers, you grow the attention span.

If you were to just send 10 frames at a time in an overlapping fashion, the issue is that the new chunk has no memory of the past. There’s no connection between the chunks because the context gets reset at each chunk boundary. The sliding window with masking allows for the context to propagate forward through the overlap, even if the window size itself is small.

The choice of window size is a balance between accuracy, latency, and compute. You want a large enough window to disambiguate context, but you have to wait longer to fill up a larger window, and a longer window means more memory and FLOPs.

Say you’re feeding your model 1 second of audio data (100 frames at 10 ms per frame), and you care about phonemes, which are about 20 to 50 ms long (about 2 to 5 frames per phoneme). How do you choose the attention window? We would want to integrate the whole phoneme span and maybe a little extra because the boundaries can be blurry, so you might choose 10 frames as your context window. You don’t need the entire 100 frames, because then you would be modeling at the word-level context, which increases computation and latency. So, think about the natural dependencies in your data when you select the attention window in a streaming fashion. How long is the smallest unit you need to capture? A phoneme, a gesture, a word, a spike? Convert that to milliseconds and figure out how many frames/tokens it needs to be. You can increase the window if validation accuracy keeps improving and stop if latency and compute become unacceptable.

And that's it! Bye bye.