Run Llama 2 on Your CPU with Rust
<p>A new <a href="https://github.com/srush/llama2.rs" rel="noopener ugc nofollow" target="_blank">one-file Rust implementation of Llama 2</a> is now available thanks to Sasha Rush. It’s a Rust port of Karpathy’s <a href="https://github.com/karpathy/llama2.c" rel="noopener ugc nofollow" target="_blank">llama2.c</a>. It already supports the following features:</p>
<ul>
<li>Support for 4-bit GPTQ quantization</li>
<li>SIMD support for fast CPU inference</li>
<li>Support for Grouped Query Attention (needed for big Llamas)</li>
<li>Memory mapping, which loads the 70B model instantly (see the sketch after this list)</li>
<li>Static size checks, no raw pointers</li>
</ul>
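<p>To make the last two points concrete, here is a minimal Rust sketch (not the project’s actual code) of how memory mapping and compile-time size checks combine: the OS pages the weights in lazily, so even a huge checkpoint “loads” instantly, and const generics turn shape mismatches into type errors instead of pointer bugs. The memmap2 dependency and the file path are assumptions for illustration.</p>
<pre><code>use memmap2::Mmap; // assumed dependency: memmap2
use std::fs::File;

/// Hypothetical layer weights with compile-time dimensions:
/// a shape mismatch becomes a type error, not a runtime bug.
#[repr(C)]
#[allow(dead_code)]
struct Linear<const IN: usize, const OUT: usize> {
    w: [[f32; IN]; OUT],
}

fn main() -> std::io::Result<()> {
    // Hypothetical checkpoint path.
    let file = File::open("llama2-70b.bin")?;
    // Safety: the file must not be mutated while mapped.
    // The OS pages the weights in on demand, so this returns
    // immediately even for a multi-gigabyte file.
    let map = unsafe { Mmap::map(&file)? };
    println!("mapped {} bytes without reading them", map.len());
    Ok(())
}
</code></pre>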
<p>While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.</p>
<p>You can learn more about GPTQ for Llama 2 in “Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer: Llama 2 but 75% smaller” on kaitchup.substack.com.</p>
<p>Sasha claimed on X (formerly Twitter) that he could run the 70B version of Llama 2 using only his laptop’s CPU. It is, of course, very slow (5 tokens/min). With an Intel i9, the speed climbs to 1 token/sec.</p>
<p>If you know Rust, I recommend reading the code. It offers many ideas for dealing efficiently with the quantization and dequantization of LLMs.</p>
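<p>As a taste of what that involves, here is a simplified sketch of group-wise 4-bit quantization and dequantization, in the spirit of GPTQ-style storage but not the repository’s actual scheme: each group of weights shares one f32 scale, and two 4-bit codes pack into each byte. The group size of 32 is an assumption.</p>
<pre><code>/// Assumed group size; real GPTQ configurations vary.
const GROUP: usize = 32;

/// One quantized group: a shared scale plus packed 4-bit codes.
struct QuantGroup {
    scale: f32,
    packed: [u8; GROUP / 2], // two 4-bit codes per byte
}

fn quantize(weights: &[f32; GROUP]) -> QuantGroup {
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Signed 4-bit range is -8..=7; guard against an all-zero group.
    let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
    let q = |w: f32| ((w / scale).round().clamp(-8.0, 7.0) as i8 & 0x0F) as u8;
    let mut packed = [0u8; GROUP / 2];
    for (i, pair) in weights.chunks(2).enumerate() {
        packed[i] = q(pair[0]) | (q(pair[1]) << 4);
    }
    QuantGroup { scale, packed }
}

fn dequantize(g: &QuantGroup) -> [f32; GROUP] {
    let mut out = [0.0f32; GROUP];
    for (i, &byte) in g.packed.iter().enumerate() {
        let lo = ((byte << 4) as i8) >> 4; // sign-extend the low nibble
        let hi = (byte as i8) >> 4;        // arithmetic shift keeps the sign
        out[2 * i] = lo as f32 * g.scale;
        out[2 * i + 1] = hi as f32 * g.scale;
    }
    out
}
</code></pre>
<p>In a fast implementation, the dequantization would typically be fused into SIMD matrix-vector kernels rather than materializing full f32 arrays; that is the kind of optimization the SIMD bullet above points at.</p>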
<p>Read the original article on Medium: <a href="https://medium.com/@bnjmn_marie/run-llama-2-on-your-cpu-with-rust-33ccb74752c3">Run Llama 2 on Your CPU with Rust</a>.</p>