title: Understanding Transformer Architecture: A Deep Dive
description: An in-depth technical exploration of transformer models, the architecture behind modern NLP breakthroughs.
author: HEC IA Technical Team
pubDatetime: 2026-01-28T10:00:00Z
tags: transformers, nlp, deep-learning, architecture
difficulty: advanced
readingTime: 25 min
featured: true

Introduction

Transformers have revolutionized natural language processing since their introduction in the landmark paper "Attention is All You Need" (Vaswani et al., 2017). This deep dive explores the architecture, mechanisms, and implementations that make transformers so powerful.

The Attention Mechanism

At the heart of transformers lies the self-attention mechanism, which allows the model to weigh the importance of different words in a sequence.

Self-Attention Formula

The self-attention mechanism computes:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

Where:

  • Q (Query): What we're looking for
  • K (Key): What we're matching against
  • V (Value): The actual content we want to retrieve
  • d_k: Dimension of the key vectors (for scaling)

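To make the formula concrete, here is a minimal single-head sketch in PyTorch (the function name and the toy tensor sizes are illustrative assumptions, not part of the original post):

import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k); works for one head or a batch of heads
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ v                                   # (..., seq_len, d_k)

# Toy example: a sequence of 4 tokens with d_k = 8
q = k = v = torch.randn(4, 8)
out = scaled_dot_product_attention(q, k, v)              # shape (4, 8)
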
Multi-Head Attention

Instead of performing a single attention function, multi-head attention projects the queries, keys, and values num_heads times and runs the attention operations in parallel, letting each head attend to information from a different representation subspace:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Project and split into Q, K, V, moving the head axis before the
        # sequence axis: each is (batch, num_heads, seq_len, head_dim)
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)

        # Scaled dot-product attention within each head
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = F.softmax(scores, dim=-1)

        # Apply attention to values, then merge the heads back into d_model
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(out)

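A quick shape check (batch and sequence sizes chosen only for illustration):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model)
print(mha(x).shape)            # torch.Size([2, 10, 512])
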
Positional Encoding

Since self-attention is permutation-invariant, transformers have no inherent notion of sequence order, so we add positional encodings to the input embeddings:

def positional_encoding(seq_len, d_model):
    # Sinusoidal encodings from the original paper:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    position = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

    return pe

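In practice the encoding is added to the token embeddings before the first transformer layer. A minimal sketch (the vocabulary size and the nn.Embedding lookup are assumptions for illustration):

vocab_size, seq_len, d_model = 10000, 128, 512
embed = nn.Embedding(vocab_size, d_model)

tokens = torch.randint(0, vocab_size, (2, seq_len))           # (batch, seq_len)
x = embed(tokens) + positional_encoding(seq_len, d_model)     # broadcasts over the batch
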
Feed-Forward Networks

Each transformer layer includes a position-wise feed-forward network:

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

Complete Transformer Block

Putting it all together:

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out = self.attention(x)
        x = self.norm1(x + self.dropout1(attn_out))

        # Feed-forward with residual connection
        ff_out = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_out))

        return x

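A small encoder can then be assembled by stacking these blocks. The sketch below reuses the embedding and positional-encoding snippets above and borrows the base hyperparameters of the original paper (d_model=512, 8 heads, d_ff=2048, 6 layers); it omits masking and the decoder, so treat it as an illustration rather than a full implementation:

d_model, num_heads, d_ff, num_layers = 512, 8, 2048, 6

encoder = nn.Sequential(
    *[TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
)

x = embed(tokens) + positional_encoding(seq_len, d_model)
out = encoder(x)    # (batch, seq_len, d_model)
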
Training Considerations

Learning Rate Scheduling

Transformers are typically trained with a learning-rate warm-up followed by inverse-square-root decay, as in the original paper:

def lr_schedule(step, d_model, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # step is 1-indexed; step 0 would divide by zero
    arg1 = step ** -0.5
    arg2 = step * (warmup_steps ** -1.5)
    return d_model ** -0.5 * min(arg1, arg2)

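One way to plug this schedule into training is a LambdaLR wrapper with the optimizer's base learning rate set to 1.0, so the function above supplies the absolute rate. The Adam hyperparameters below follow the original paper; the rest is a sketch that assumes the encoder defined earlier:

optimizer = torch.optim.Adam(encoder.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_schedule(step + 1, d_model=512)
)
# Call scheduler.step() once per optimization step, after optimizer.step()
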
Label Smoothing

To prevent the model from becoming overconfident, the one-hot targets are smoothed toward a uniform distribution over the classes:

class LabelSmoothingLoss(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, pred, target):
        # pred: (batch, n_class) logits; target: (batch,) class indices
        n_class = pred.size(1)
        one_hot = torch.zeros_like(pred).scatter(1, target.unsqueeze(1), 1)
        # Mix the one-hot targets with a uniform distribution over the classes
        one_hot = one_hot * (1 - self.smoothing) + self.smoothing / n_class
        log_prob = F.log_softmax(pred, dim=1)
        return -(one_hot * log_prob).sum(dim=1).mean()

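Usage mirrors nn.CrossEntropyLoss on raw logits (toy sizes for illustration):

criterion = LabelSmoothingLoss(smoothing=0.1)
logits = torch.randn(4, 10000)               # (batch, vocab_size)
targets = torch.randint(0, 10000, (4,))      # gold token indices
loss = criterion(logits, targets)
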
Practical Applications

Transformers excel at:

  1. Machine Translation: Translating between languages
  2. Text Summarization: Generating concise summaries
  3. Question Answering: Understanding context to answer queries
  4. Text Generation: Creating coherent text (GPT models)
  5. Sentiment Analysis: Understanding emotional tone

Optimization Tips

  1. Gradient Checkpointing: Save memory by recomputing activations during the backward pass instead of storing them
  2. Mixed Precision Training: Use FP16/BF16 for faster computation and a smaller memory footprint (tips 1 and 2 are combined in the sketch after this list)
  3. Efficient Attention: Implement sparse or local attention for long sequences, where full attention scales quadratically with sequence length
  4. Model Parallelism: Distribute layers across GPUs when the model does not fit on a single device

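As a hedged sketch of the first two tips, the training step below combines gradient checkpointing with automatic mixed precision via torch.cuda.amp; the model.blocks and model.output_proj attributes, the criterion, and the batch variables are hypothetical stand-ins for whatever your model defines:

from torch.utils.checkpoint import checkpoint

scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, criterion, x, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                    # forward pass in reduced precision
        for block in model.blocks:                     # hypothetical list of TransformerBlocks
            x = checkpoint(block, x)                   # recompute activations during backward
        logits = model.output_proj(x)                  # hypothetical projection to the vocabulary
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()                      # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                             # unscales gradients, then steps
    scaler.update()
    return loss.item()
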
Conclusion

Transformers represent a paradigm shift in sequence modeling. Understanding their architecture is crucial for anyone working in modern NLP and beyond.

References

  • Vaswani et al. (2017) - "Attention is All You Need"
  • Devlin et al. (2018) - "BERT: Pre-training of Deep Bidirectional Transformers"
  • Brown et al. (2020) - "Language Models are Few-Shot Learners" (GPT-3)

Further Reading