---
title: "Understanding Transformer Architecture: A Deep Dive"
description: "An in-depth technical exploration of transformer models, the architecture behind modern NLP breakthroughs."
author: "HEC IA Technical Team"
pubDatetime: 2026-01-28T10:00:00Z
tags: ["transformers", "nlp", "deep-learning", "architecture"]
difficulty: "advanced"
readingTime: "25 min"
featured: true
---

## Introduction

Transformers have revolutionized natural language processing since their introduction in the landmark paper "Attention is All You Need" (Vaswani et al., 2017). This deep dive explores the architecture, mechanisms, and implementations that make transformers so powerful.

## The Attention Mechanism

At the heart of transformers lies the self-attention mechanism, which lets the model weigh the relevance of every other token in the sequence when encoding a given token.

### Self-Attention Formula

The self-attention mechanism computes:

```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

Where:

- Q (Query): What we're looking for
- K (Key): What we're matching against
- V (Value): The actual content we want to retrieve
- d_k: Dimension of the key vectors, used to scale the dot products so the softmax does not saturate

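To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention. The function name and the optional `mask` argument are illustrative assumptions, not something prescribed above:

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V for inputs shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query against every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from the attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Each output row is a weighted average of the value vectors
    return torch.matmul(weights, v)
```

With `q`, `k`, and `v` of shape `(batch, seq_len, d_k)`, the output has the same shape as `v`.
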
## Multi-Head Attention

Instead of performing a single attention function, multi-head attention runs multiple attention operations in parallel:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # A single projection produces Q, K and V in one matrix multiply
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Project and split into Q, K, V, one set per head
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                        # each: (batch, seq, heads, head_dim)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # each: (batch, heads, seq, head_dim)

        # Scaled dot-product attention within each head
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = F.softmax(scores, dim=-1)

        # Apply the attention weights to the values and merge the heads
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(out)
```

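A quick shape check, with arbitrary example dimensions:

```python
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
print(mha(x).shape)           # torch.Size([2, 10, 512])
```
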
## Positional Encoding

Since self-attention has no inherent notion of sequence order, we add positional encodings to the token embeddings:

```python
def positional_encoding(seq_len, d_model):
    position = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine

    return pe
```

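A brief usage sketch, assuming token embeddings of shape `(batch, seq_len, d_model)`; the random tensor below is only a stand-in for real embeddings:

```python
d_model = 512
embeddings = torch.randn(2, 10, d_model)               # stand-in for token embeddings
pe = positional_encoding(seq_len=10, d_model=d_model)
x = embeddings + pe.unsqueeze(0)                       # broadcast across the batch dimension
```
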
## Feed-Forward Networks

Each transformer layer includes a position-wise feed-forward network, applied independently at every position:

```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))
```

## Complete Transformer Block

Putting it all together:

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out = self.attention(x)
        x = self.norm1(x + self.dropout1(attn_out))

        # Feed-forward with residual connection
        ff_out = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_out))

        return x
```

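As a rough sketch of how these blocks compose into an encoder, reusing the `TransformerBlock` and `positional_encoding` defined above. The class name, the embedding layer, and the hyperparameter defaults are illustrative assumptions, not part of the original architecture description:

```python
class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, d_ff=2048,
                 num_layers=6, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Precomputed sinusoidal table, stored as a buffer so it follows the module's device
        self.register_buffer("pe", positional_encoding(max_len, d_model))
        self.layers = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of vocabulary indices
        x = self.embed(token_ids) + self.pe[: token_ids.size(1)]
        for layer in self.layers:
            x = layer(x)
        return x
```

For example, `TinyEncoder(vocab_size=1000)(torch.randint(0, 1000, (2, 10)))` returns a tensor of shape `(2, 10, 512)`.
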
## Training Considerations

### Learning Rate Scheduling

Transformers typically use warm-up scheduling: the learning rate grows linearly for the first `warmup_steps` steps and then decays proportionally to the inverse square root of the step number:

```python
def lr_schedule(step, d_model, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    arg1 = step ** -0.5
    arg2 = step * (warmup_steps ** -1.5)
    return d_model ** -0.5 * min(arg1, arg2)
```

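One way to wire this into training is via `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, so the schedule's output becomes the actual learning rate. The optimizer settings below follow the Adam configuration commonly used with transformers, but the wiring itself is only a sketch:

```python
model = TransformerBlock(d_model=512, num_heads=8, d_ff=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_schedule(step, d_model=512)
)

# Inside the training loop, advance the schedule once per optimizer step:
#   optimizer.step()
#   scheduler.step()
```
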
### Label Smoothing

To prevent the model from becoming overconfident, label smoothing moves a small amount of probability mass from the correct class onto all classes:

```python
class LabelSmoothingLoss(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, pred, target):
        n_class = pred.size(1)
        # Smoothed target distribution: 1 - smoothing on the true class,
        # plus smoothing / n_class spread over every class
        one_hot = torch.zeros_like(pred).scatter(1, target.unsqueeze(1), 1)
        one_hot = one_hot * (1 - self.smoothing) + self.smoothing / n_class
        log_prob = F.log_softmax(pred, dim=1)
        return -(one_hot * log_prob).sum(dim=1).mean()
```

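A quick usage check, with arbitrary batch size and class count:

```python
criterion = LabelSmoothingLoss(smoothing=0.1)
logits = torch.randn(4, 10)            # (batch, num_classes)
targets = torch.randint(0, 10, (4,))
loss = criterion(logits, targets)      # scalar tensor
```
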
## Practical Applications

Transformers excel at:

1. **Machine Translation**: Translating between languages
2. **Text Summarization**: Generating concise summaries
3. **Question Answering**: Understanding context to answer queries
4. **Text Generation**: Creating coherent text (GPT models)
5. **Sentiment Analysis**: Understanding emotional tone

## Optimization Tips

1. **Gradient Checkpointing**: Recompute activations in the backward pass to save memory during training
2. **Mixed Precision Training**: Use FP16 for faster computation and lower memory use (see the sketch after this list)
3. **Efficient Attention**: Implement sparse or local attention for long sequences
4. **Model Parallelism**: Distribute layers across GPUs

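A minimal sketch of a mixed precision training step with `torch.cuda.amp`; the tiny linear model and random batches below are placeholders standing in for a real transformer and dataloader:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(10):                                    # stand-in for a real dataloader
    inputs = torch.randn(32, 512, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    # autocast runs the forward pass in FP16 where it is numerically safe
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Scale the loss to avoid FP16 gradient underflow, then step and rescale
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
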
## Conclusion

Transformers represent a paradigm shift in sequence modeling. Understanding their architecture is crucial for anyone working in modern NLP and beyond.

## References

- Vaswani et al. (2017) - "Attention is All You Need"
- Devlin et al. (2018) - "BERT: Pre-training of Deep Bidirectional Transformers"
- Brown et al. (2020) - "Language Models are Few-Shot Learners" (GPT-3)

## Further Reading

- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- [Transformers from Scratch](https://peterbloem.nl/blog/transformers)