High-Speed Inference with llama.cpp and Vicuna on CPU

For inference with large language models, we may think that we need a very big GPU or that it can't run on consumer hardware. This is rarely the case.

Nowadays, we have many tricks and frameworks at our disposal, such as device mapping or QLoRA, that make inference possible at home, even for very large language models.

And now, thanks to Georgi Gerganov, we don't even need a GPU. Georgi Gerganov is well known for implementing high-performance inference in plain C++. With the help of many contributors, he has implemented inference for LLaMA and other models.

All these implementations are optimized to run without a GPU. In other words, you just need enough CPU RAM to load the models; your CPU then takes care of the inference.

In this blog post, I show how to set up llama.cpp on your computer in a few simple steps. I focus on Vicuna, a chat model behaving like ChatGPT, but I also show how to run llama.cpp for other language models. After reading this post, you should have a state-of-the-art chatbot running on your computer.

Read more: https://medium.com/towards-artificial-intelligence/high-speed-inference-with-llama-cpp-and-vicuna-on-cpu-136d28e7887b
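The full post walks through the llama.cpp command-line setup itself. As a quick taste of what CPU-only inference with a Vicuna model can look like, here is a minimal sketch using the llama-cpp-python bindings, a separate Python wrapper around llama.cpp that is not covered in the post. The model path, quantization level, and prompt template below are placeholder assumptions: adjust them to match the model file you actually download.

```python
# Minimal CPU-only inference sketch with llama-cpp-python (pip install llama-cpp-python).
# Assumptions: a quantized Vicuna model file has already been downloaded to ./models/,
# and the "### Human / ### Assistant" prompt template matches that model version.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/vicuna-7b.q4_0.bin",  # placeholder path to the quantized model
    n_ctx=2048,   # context window in tokens
    n_threads=8,  # number of CPU threads used for inference
)

prompt = "### Human: Explain in one sentence why this runs without a GPU.\n### Assistant:"
output = llm(prompt, max_tokens=128, stop=["### Human:"])
print(output["choices"][0]["text"].strip())
```

The same idea applies to the plain llama.cpp binaries described in the post: once the quantized weights fit in CPU RAM, the inference loop runs entirely on the CPU threads you allocate.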