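Dr. Mandar Karhade, MD, PhD, wrote an article about a clever way of using a GPU-CPU hybrid inference engine to achieve impressive speeds on a local machine. The article discusses PowerInfer, which splits the inference workload between the CPU and the GPU. PowerInfer exploits the high locality inherent in LLM inference: a small set of "hot" neurons is activated for most inputs, while the majority of "cold" neurons activate only occasionally. The hot neurons are preloaded onto the GPU, and the cold ones are computed on the CPU. Evaluation shows it attains an average token generation rate of 13.20 tokens/s on a single NVIDIA RTX 4090 GPU. The full blog can be found on Medium.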
source update: 11x Speed up LLaMA II Inference On a Local GPU – Towards AI
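
To make the locality idea concrete, here is a minimal Python sketch of splitting neurons between GPU and CPU. It is not PowerInfer's code or API. The function names, the synthetic power-law activation counts, and the GPU capacity of 200 neurons are all assumptions chosen for illustration. It only shows why keeping a small, frequently activated set of neurons on the GPU can cover most of the work.

```python
# Illustrative sketch only: NOT PowerInfer's implementation or API.
# All names and numbers below are hypothetical.

def zipf_activation_counts(num_neurons, exponent=1.0, scale=10_000):
    """Synthetic activation counts that follow a power law (assumed, not measured)."""
    return [scale / (rank ** exponent) for rank in range(1, num_neurons + 1)]


def split_hot_cold(activation_counts, gpu_capacity):
    """Place the most frequently activated neurons on the GPU, the rest on the CPU."""
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = set(ranked[:gpu_capacity])   # would be preloaded onto the GPU
    cold = set(ranked[gpu_capacity:])  # would be computed on the CPU
    return hot, cold


def gpu_coverage(activation_counts, hot):
    """Fraction of all activations that land on GPU-resident neurons."""
    total = sum(activation_counts)
    return sum(activation_counts[i] for i in hot) / total


counts = zipf_activation_counts(num_neurons=1000)
hot, cold = split_hot_cold(counts, gpu_capacity=200)
coverage = gpu_coverage(counts, hot)
print(f"{len(hot)} of {len(counts)} neurons on the GPU cover {coverage:.0%} of activations")
```

With a skewed distribution like this one, a minority of neurons accounts for most activations, which is the property a hot/cold split relies on. In a real system the counts would come from profiling the model, not from a formula.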