Self-Attention

Advanced · Architectures
Last updated: 2025-01-15

What is Self-Attention?


Self-attention is an attention mechanism where a sequence attends to itself, allowing each element to gather information from all other elements in the same sequence. Unlike attention mechanisms that connect two different sequences (like encoder-decoder attention), self-attention computes relationships within a single sequence, enabling the model to capture dependencies, identify relevant context, and build richer representations that incorporate information from across the entire input.


The mechanism computes three vectors (query, key, value) for each position in the sequence, then takes dot products between queries and keys, scales them, and normalizes them with a softmax to obtain attention weights that specify how much each position should attend to every other position. These weights are applied to the values and combined to produce the output representation for each position. Crucially, all positions are processed in parallel, making self-attention computationally efficient on modern hardware despite computing all pairwise interactions.
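The sketch below illustrates this computation in plain NumPy for a single attention head: the sequence is projected into queries, keys, and values, scaled query-key dot products are normalized with a softmax, and the resulting weights combine the values. The dimension names (d_model, d_k), shapes, and random projection matrices are illustrative assumptions, not details from a specific model.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) input embeddings; W_q, W_k, W_v: (d_model, d_k) projections.
    Q = X @ W_q   # one query vector per position
    K = X @ W_k   # one key vector per position
    V = X @ W_v   # one value vector per position

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # all pairwise query-key dot products, scaled

    # Softmax over each row: how strongly position i attends to every position j.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V                # weighted combination of values per position

# Toy usage: 4 tokens with 8-dimensional embeddings (sizes chosen only for illustration).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)   # (4, 8): one context-aware representation per position

Because every step is a matrix operation over the whole sequence, all positions are handled at once, which is the parallelism noted above.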


Self-attention is the core component of transformer architectures that power modern LLMs. It enables models to identify which parts of the input are relevant to each token, capture long-range dependencies without the sequential processing limitations of RNNs, and build context-aware representations. Multiple layers of self-attention allow the model to capture increasingly abstract relationships and patterns. Understanding self-attention is fundamental to understanding how transformers process language and why they've become the dominant architecture for natural language processing.


Related Terms