What is Multi-Head Attention?
Multi-head attention is a mechanism in transformer architectures that runs multiple attention operations in parallel, each learning to focus on different aspects of the input. Rather than using a single attention mechanism, multi-head attention splits the embedding dimension into multiple "heads," processes each head through separate attention operations, then concatenates and projects the results. This allows the model to simultaneously attend to different positions and represent different types of relationships.
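To make the split-attend-concatenate-project flow concrete, here is a minimal NumPy sketch of the mechanism. The function and weight names (`multi_head_attention`, `W_q`, `W_k`, `W_v`, `W_o`) are illustrative choices for this example, not any particular library's API.

```python
# Minimal sketch of multi-head attention in NumPy (illustrative, not a
# production implementation).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads             # split the embedding dim across heads

    # Project to queries, keys, values, then reshape into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)

    # Scaled dot-product attention, computed independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads_out = weights @ V                                # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the output projection
    concat = heads_out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights
```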
Each attention head learns its own query, key, and value transformations and computes attention independently. Different heads often specialize in capturing different patterns: some might focus on syntactic relationships, others on semantic associations, and still others on positional patterns. By running multiple heads in parallel and combining their outputs, the model builds richer representations that capture several aspects of the input at once, improving its ability to model complex relationships.
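Running the sketch above with random, untrained weights already shows the key structural point: each head produces its own attention map over the sequence. In a trained model, these per-head maps are where the specialization described above shows up. The sizes here are arbitrary.

```python
# Toy usage of the sketch above; random weights stand in for learned projections,
# so the attention patterns themselves are not meaningful.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10

x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)    # (10, 64)    -- same shape as the input
print(weights.shape)   # (8, 10, 10) -- one attention map per head
```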
Multi-head attention is a fundamental component of transformer models that power modern LLMs. The number of heads is a key architectural parameter, with models typically using 8-96 heads depending on model size. The parallel nature of multi-head attention also contributes to transformers' computational efficiency, as all heads can be processed simultaneously on modern hardware. Understanding multi-head attention is essential for grasping how transformers process and represent information.
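For comparison, frameworks expose the head count directly as a constructor argument; for example, PyTorch's `torch.nn.MultiheadAttention` takes `embed_dim` and `num_heads`, and requires that the embedding dimension divide evenly by the number of heads. The sizes below are arbitrary, and by default the returned attention weights are averaged across heads.

```python
# Self-attention with PyTorch's built-in module; num_heads is the
# architectural parameter discussed above.
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 16, embed_dim)   # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)    # self-attention: query = key = value

print(out.shape)            # torch.Size([1, 16, 512])
print(attn_weights.shape)   # torch.Size([1, 16, 16]) -- weights averaged over heads
```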