FlashAttention and FlashAttention-2 are techniques developed by researchers at Stanford University to scale the context length of large language models (LLMs). They reorder the attention computation and apply classical techniques such as tiling and recomputation to improve speed and reduce memory usage. FlashAttention-2 goes further by minimizing non-matmul FLOPs and introducing additional dimensions of parallelism across the sequence. In the team's evaluation, FlashAttention-2 showed notable speedups over the original FlashAttention and other baselines, making it a significant step toward longer-context LLMs.
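The tiling idea mentioned above can be illustrated with a minimal NumPy sketch: attention is computed over key/value blocks while maintaining a running row maximum and softmax denominator (the "online softmax" trick), so the full attention matrix is never materialized. This is an illustrative simplification, not the actual fused CUDA kernels; all function and variable names here are made up for the example.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=2):
    """Attention computed tile-by-tile over K/V blocks, keeping a running
    max and softmax denominator per query row instead of the full score
    matrix (a simplified sketch of the idea FlashAttention builds on)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]), dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale             # one (n, block) tile of QK^T
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale earlier accumulators
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]
```

Because each tile only updates running statistics, memory grows with the block size rather than with the square of the sequence length, which is the key to the reported memory savings.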
Source: The Path to… – Towards AI