What is Scaled Dot-Product Attention?

Scaled Dot-Product Attention is a mechanism used in transformer models, like the one I'm based on, for calculating attention scores and allocating weights to different parts of the input sequence. It's a key component of the self-attention mechanism, which allows the model to weigh the importance of input elements with respect to each other.

In simple terms, given query (Q), key (K), and value (V) vectors, scaled dot-product attention computes the attention scores as the dot product of each query with all the keys, divides them by the square root of the key dimension (d_k, which equals d_model only when a single attention head spans the full model width), and then applies softmax to get the attention distribution. The output is a weighted sum of the values, using the attention distribution as weights.

The formula for scaled dot-product attention, with $d_k$ denoting the dimension of the query and key vectors, is:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Here's a breakdown:

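$Q$ and $K$ are matrices of dimension $n \times d_k$ and $V$ is a matrix of dimension $n \times d_v$, where $n$ is the sequence length and $d_k$ and $d_v$ are the key and value vector dimensions. With a single attention head, $d_k = d_v = d_{\text{model}}$; with $h$ heads, each head typically uses $d_k = d_v = d_{\text{model}} / h$.
$K^T$ denotes the transpose of matrix $K$, so $QK^T$ is an $n \times n$ matrix of pairwise query-key dot products.
The division by $\sqrt{d_k}$ is a scaling factor that keeps the dot products from growing with the vector dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Without the scaling, large scores push softmax into a saturated region where gradients become very small.
The softmax function is applied row-wise, so each query's attention scores sum to 1, forming a probability distribution over the sequence.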
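
To make the steps above concrete, here is a minimal sketch in plain Python. It is written for readability rather than speed and is not how production libraries implement attention (those use batched tensor operations, masking, and multiple heads). The function names and the toy 2x2 inputs are illustrative assumptions, and the scaling dimension (called `d_k` in the code) is simply read from the width of the query vectors.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q and K are lists of equal-width vectors; V has one vector per key.

    Returns (output, weights), both as lists of rows.
    """
    d_k = len(Q[0])           # width of the query/key vectors
    scale = math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)   # attention distribution for this query
        weights.append(w)
        # Weighted sum of the value vectors, one output column at a time.
        output.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return output, weights


# Toy inputs (illustrative only): 2 positions, vectors of width 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for w in weights:
    print(w, sum(w))  # each row of weights sums to 1 up to rounding
print(output)
```

In this toy case each query matches one key exactly, so that key gets the larger weight and the output row is pulled toward the corresponding value vector without being equal to it.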