Run Llama 2 on Your CPU with Rust
<p>A new <a href="https://github.com/srush/llama2.rs" rel="noopener ugc nofollow" target="_blank">one-file Rust implementation of Llama 2</a> is now available thanks to Sasha Rush. It’s a Rust port of Karpathy’s <a href="https://github.com/karpathy/llama2.c" rel="noopener ugc nofollow" target="_blank">llama2.c</a>. It already supports the following features:</p>
<ul>
<li>Support for 4-bit GPTQ quantization</li>
<li>SIMD support for fast CPU inference</li>
<li>Support for Grouped Query Attention (needed for big Llamas)</li>
<li>Memory mapping, which loads the 70B model instantly (see the sketch after this list)</li>
<li>Static size checks, no raw pointers</li>
</ul>
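<p>To make the last two points concrete, here is a minimal Rust sketch (not the project’s actual code) of how memory mapping and compile-time size checks combine: the OS pages the weights in lazily, so even a huge checkpoint “loads” instantly, and const generics turn shape mismatches into type errors instead of pointer bugs. The memmap2 dependency and the file path are assumptions for illustration.</p>
<pre><code>use memmap2::Mmap; // assumed dependency: memmap2
use std::fs::File;

/// Hypothetical layer weights with compile-time dimensions:
/// a shape mismatch becomes a type error, not a runtime bug.
#[repr(C)]
#[allow(dead_code)]
struct Linear<const IN: usize, const OUT: usize> {
    w: [[f32; IN]; OUT],
}

fn main() -> std::io::Result<()> {
    // Hypothetical checkpoint path.
    let file = File::open("llama2-70b.bin")?;
    // Safety: the file must not be mutated while mapped.
    // The OS pages the weights in on demand, so this returns
    // immediately even for a multi-gigabyte file.
    let map = unsafe { Mmap::map(&file)? };
    println!("mapped {} bytes without reading them", map.len());
    Ok(())
}
</code></pre>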
<p>While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.</p>
<p>You can learn more about GPTQ for Llama 2 in “Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer: Llama 2 but 75% smaller” on kaitchup.substack.com.</p>
<p>Sasha claimed on X (formerly Twitter) that he could run the 70B version of Llama 2 using only his laptop’s CPU. It is, of course, very slow (5 tokens/min). With an Intel i9, the speed climbs to 1 token/sec.</p>
<p>If you know Rust, I recommend reading the code. It offers many ideas for dealing efficiently with the quantization and dequantization of LLMs.</p>
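<p>As a taste of what that involves, here is a simplified sketch of group-wise 4-bit quantization and dequantization, in the spirit of GPTQ-style storage but not the repository’s actual scheme: each group of weights shares one f32 scale, and two 4-bit codes pack into each byte. The group size of 32 is an assumption.</p>
<pre><code>/// Assumed group size; real GPTQ configurations vary.
const GROUP: usize = 32;

/// One quantized group: a shared scale plus packed 4-bit codes.
struct QuantGroup {
    scale: f32,
    packed: [u8; GROUP / 2], // two 4-bit codes per byte
}

fn quantize(weights: &[f32; GROUP]) -> QuantGroup {
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Signed 4-bit range is -8..=7; guard against an all-zero group.
    let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
    let q = |w: f32| ((w / scale).round().clamp(-8.0, 7.0) as i8 & 0x0F) as u8;
    let mut packed = [0u8; GROUP / 2];
    for (i, pair) in weights.chunks(2).enumerate() {
        packed[i] = q(pair[0]) | (q(pair[1]) << 4);
    }
    QuantGroup { scale, packed }
}

fn dequantize(g: &QuantGroup) -> [f32; GROUP] {
    let mut out = [0.0f32; GROUP];
    for (i, &byte) in g.packed.iter().enumerate() {
        let lo = ((byte << 4) as i8) >> 4; // sign-extend the low nibble
        let hi = (byte as i8) >> 4;        // arithmetic shift keeps the sign
        out[2 * i] = lo as f32 * g.scale;
        out[2 * i + 1] = hi as f32 * g.scale;
    }
    out
}
</code></pre>
<p>In a fast implementation, the dequantization would typically be fused into SIMD matrix-vector kernels rather than materializing full f32 arrays; that is the kind of optimization the SIMD bullet above points at.</p>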
<p>Read the original article on Medium: <a href="https://medium.com/@bnjmn_marie/run-llama-2-on-your-cpu-with-rust-33ccb74752c3">Run Llama 2 on Your CPU with Rust</a>.</p>