What is the 'Attention' mechanism in the Transformer architecture?

In the Transformer architecture, the Attention mechanism is a key component that allows the model to selectively focus on different parts of the input sequence. It's what enables Transformers to capture long-range dependencies and understand context. Here's a simplified explanation:

Scaled Dot-Product Attention: This is the basic attention mechanism used in the Transformer. It takes Queries (Q), Keys (K), and Values (V) as inputs, each a matrix whose rows are vectors derived from the input data. The attention scores are the dot products of each Query with all the Keys, divided by the square root of the key dimension $d_k$, and then passed through a softmax function so that each Query's weights sum to 1. The output for each Query is the weighted sum of the Values under those weights.

The formula for scaled dot-product attention is: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
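
As a rough illustration of the formula above, here is a minimal NumPy sketch. The function names, the random inputs, and the shapes (3 queries, 5 keys/values, $d_k = 4$) are assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Dot product of every query with every key, divided by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    # Softmax over the key axis turns each row of scores into weights.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights

# Illustrative random inputs (assumed shapes, not real model data).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # one output vector per query
print(weights.shape)  # one weight per (query, key) pair
```

Here `weights[i, j]` is how much query `i` attends to key `j`, and each row of `weights` sums to 1.
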

Multi-Head Attention: This extends scaled dot-product attention by running several attention operations ("heads") in parallel, so the model can attend to different parts of the input, in different learned subspaces, at the same time. It is used in both the encoder and the decoder of the Transformer. Each head has its own learnable weight matrices that project the Query, Key, and Value; the heads' outputs are concatenated and passed through a final learned projection.

The formula for multi-head attention is: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where $\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$ and $W^O$, $W^Q_i$, $W^K_i$, and $W^V_i$ are learnable weight matrices.
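
The multi-head formula can be sketched the same way. This is a minimal NumPy example under stated assumptions: the sizes (`seq_len = 6`, `d_model = 8`, `h = 2`), the choice `d_k = d_model // h`, and the use of self-attention (Q = K = V = X) are illustrative, and random matrices stand in for weights that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (h, d_model, d_k), one projection matrix per head.
    # W_o: (h * d_k, d_model), the output projection.
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)  # head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the heads along the feature axis, then apply W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Assumed sizes for illustration only.
seq_len, d_model, h = 6, 8, 2
d_k = d_model // h

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))    # stand-in for input embeddings
W_q = rng.normal(size=(h, d_model, d_k))   # random stand-ins for learned weights
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
print(out.shape)
```

Looping over heads keeps the code close to the formula; practical implementations usually compute all heads in one batched matrix multiplication instead.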