What is the self-attention mechanism in the Transformer architecture?

The self-attention mechanism is a key component of the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It lets each position in a sequence weigh every other position when computing its own representation, so the model can selectively focus on the most relevant parts of the input. Here's a simplified breakdown:

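Scaled Dot-Product Attention: The core of self-attention is scaled dot-product attention. Given query, key, and value matrices $Q$, $K$, and $V$, the output is computed as: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ where $d_k$ is the dimension of the keys and $K^T$ denotes the transpose of $K$. The softmax is applied row-wise, turning the scaled scores $QK^T/\sqrt{d_k}$ into attention weights that sum to 1 for each query; the output is the corresponding weighted sum of the values. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.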
Multi-Head Attention: Instead of a single attention head, the Transformer runs multiple attention heads in parallel, allowing the model to attend to information from different representation subspaces at different positions. Each head has its own learned query, key, and value projections, and the head outputs are concatenated and projected again: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, $h$ is the number of heads, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learned projection matrices of head $i$, and $W^O$ is a learned output projection matrix.
Self-Attention: In self-attention, the query, key, and value are all derived from the same input. Because every position attends directly to every other position, the model can capture dependencies between any two positions in the input sequence. Formally, given an input $X$, the self-attention mechanism is defined as: $\text{SelfAttention}(X) = \text{MultiHead}(X, X, X)$.
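
The three pieces above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single unbatched sequence, no masking, no biases, and random weights, and the function names, the parameter names (`W_q`, `W_k`, `W_v`, `W_o`), and the toy sizes are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability; the result is unchanged.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) scaled similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # output has shape (n_q, d_v)


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v are lists holding one projection matrix per head
    # (hypothetical layout chosen for readability, not efficiency).
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)[0]
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs along the feature axis, then project.
    return np.concatenate(heads, axis=-1) @ W_o


def self_attention(X, W_q, W_k, W_v, W_o):
    # Query, key, and value are all derived from the same input X.
    return multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)


# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))

out = self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (4, 8)
```

The output has one row per input position and the same width as the input, which is what lets attention layers be stacked. Practical implementations typically fuse the per-head projections into single matrices and add batching and masking, which this sketch leaves out.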