High-Speed Inference with llama.cpp and Vicuna on CPU
<p>When it comes to inference with large language models, we may think that we need a very big GPU, or that it simply cannot be done on consumer hardware. This is rarely the case.</p>
<p>Nowadays, we have many tricks and frameworks at our disposal, such as device mapping or QLoRA, that make inference possible at home, even for very large language models.</p>
<p>And now, thanks to Georgi Gerganov, we don’t even need a GPU. Georgi Gerganov is well known for implementing high-performance inference in plain C++.</p>
<p>With the help of many contributors, he has implemented inference for LLaMA and other models in plain C++.</p>
<p>All these implementations are optimized to run without a GPU: you just need enough CPU RAM to load the model, and your CPU takes care of the inference.</p>
<p>In this blog post, I show how to set up llama.cpp on your computer in a few simple steps. I focus on Vicuna, a chat model that behaves like ChatGPT, but I also show how to run llama.cpp with other language models. After reading this post, you should have a state-of-the-art chatbot running on your computer.</p>
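<p>As a preview, here is a minimal sketch of what CPU-only inference with a quantized Vicuna model can look like. It uses the llama-cpp-python bindings rather than the llama.cpp command-line tools, and the model file name, prompt template, and parameters are placeholders that depend on the quantized weights you download.</p>
<pre><code># Minimal sketch: CPU inference with a quantized Vicuna model via the
# llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

# The model path is a placeholder; point it at the quantized Vicuna file
# you converted or downloaded for llama.cpp.
llm = Llama(
    model_path="./models/ggml-vicuna-7b-q4_0.bin",
    n_ctx=2048,    # context window size
    n_threads=8,   # CPU threads used for inference
)

# The prompt template below assumes an older Vicuna format; adjust it to
# match the version of the model you are running.
prompt = "### Human: Explain what llama.cpp is in one sentence.\n### Assistant:"
output = llm(prompt, max_tokens=128, stop=["### Human:"])
print(output["choices"][0]["text"])
</code></pre>
<p>Because everything runs on the CPU, the only hard requirement is enough RAM to hold the quantized model.</p>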
<p><a href="https://medium.com/towards-artificial-intelligence/high-speed-inference-with-llama-cpp-and-vicuna-on-cpu-136d28e7887b"><strong>Read More</strong></a></p>