Florian explains the benefits of Grouped-query attention (GQA) in autoregressive decoding. GQA is a mechanism that balances the speed of Multi-Query attention (MQA) with the quality of Multi-Head attention (MHA). Despite being a newcomer, GQA has gained popularity among models like Llama2 and Mistral 7B. Read the full blog for free on Medium.
source update: Grouped-Query Attention(GQA) Explained – Towards AI