What is a residual connection?

In the Transformer architecture, a residual connection (also known as a skip connection) adds a sublayer's input directly to that sublayer's output. The addition gives the gradient a direct path back to earlier layers during backpropagation, which mitigates the vanishing gradient problem. Here's how it works:

  1. Input: Let's denote the input to a layer as x.
  2. Processing: The layer applies some function F to x, producing F(x). This could be a combination of operations like linear transformations and activation functions; in a Transformer, F is a self-attention or feed-forward sublayer.
  3. Residual Connection: Instead of just using F(x) as the output, we add the input x to it. Mathematically, this is represented as y = F(x) + x, where y is the output.
  4. Layer Normalization: After the addition, layer normalization (the "Norm" in "Add & Norm") is applied. It normalizes each position's activations across the feature dimension to zero mean and unit variance, then applies a learned scale and shift, which stabilizes the learning process.
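The four steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `sublayer` is a hypothetical stand-in for F (a real Transformer would use attention or a feed-forward network), and `layer_norm` omits the learned scale and shift.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance.

    Simplification: no learned scale/shift parameters.
    """
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def sublayer(x):
    """Hypothetical stand-in for F(x), chosen only for illustration."""
    return [0.5 * a + 0.1 for a in x]

def add_and_norm(x, f):
    fx = f(x)                            # step 2: compute F(x)
    y = [a + b for a, b in zip(fx, x)]   # step 3: residual add, y = F(x) + x
    return layer_norm(y)                 # step 4: layer normalization

x = [1.0, 2.0, 3.0, 4.0]                 # step 1: input to the layer
out = add_and_norm(x, sublayer)
print(out)
```

Note that the addition requires F(x) to have the same shape as x, which is why Transformer sublayers keep the model dimension unchanged.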

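Because y = F(x) + x, the derivative of y with respect to x is F'(x) + 1. The constant 1 means the gradient can flow directly from the output back to the input even when F'(x) is very small, which helps prevent the vanishing gradient problem. The residual connection also makes an identity mapping easy to learn: if F(x) is driven toward zero, the layer simply passes its input through, so adding a layer need not degrade what earlier layers computed. This can help with training stability.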
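To make the gradient argument concrete, here is a toy sketch that compares a stack of scalar layers with and without skip connections. It assumes each layer is simply F_i(x) = w_i * x (a hypothetical choice made for illustration, not how a real Transformer sublayer behaves) and ignores layer normalization.

```python
def chain_gradient(weights, residual):
    """Derivative of the final output with respect to the input
    for a stack of scalar layers F_i(x) = w_i * x."""
    grad = 1.0
    for w in weights:
        # Plain layer:    d(w*x)/dx     = w
        # Residual layer: d(w*x + x)/dx = w + 1
        grad *= (w + 1.0) if residual else w
    return grad

weights = [0.1] * 20  # 20 layers, each with a small local derivative

plain = chain_gradient(weights, residual=False)
skip = chain_gradient(weights, residual=True)

print(f"without residual connections: {plain:.3e}")
print(f"with residual connections:    {skip:.3e}")
```

Without the skip path the gradient is a product of twenty small factors and shrinks toward zero; with it, each factor is at least 1 in this setup, so the signal survives. Setting every weight to zero gives a gradient of exactly 1 in the residual case, which is the identity mapping described above.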
