GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2
<p>As large language models (LLMs) have grown larger, with more and more parameters, new techniques to reduce their memory usage have been proposed.</p>
<p>One of the most effective methods to reduce the model size in memory is <strong>quantization</strong>. You can see quantization as a compression technique for LLMs. In practice, the main goal of quantization is to lower the precision of the LLM’s weights, typically from 16-bit to 8-bit, 4-bit, or even 3-bit, with minimal performance degradation.</p>
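<p>To make this concrete, here is a minimal sketch of the idea behind quantization (naive absmax rounding, not GPTQ or bitsandbytes themselves): each weight is rescaled into the 8-bit integer range, and a single scale factor is kept to approximately recover the original values. The function names below are mine, for illustration only.</p>
<pre><code>import torch

def absmax_quantize(weights: torch.Tensor):
    # Map the largest absolute weight to the int8 range [-127, 127]
    scale = 127 / weights.abs().max()
    quantized = (scale * weights).round().to(torch.int8)
    return quantized, scale

def absmax_dequantize(quantized: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original full-precision weights
    return quantized.to(torch.float32) / scale

weights = torch.randn(4, 4)
q, scale = absmax_quantize(weights)
print(q)                            # int8 weights, 4x smaller than fp32
print(absmax_dequantize(q, scale))  # close to, but not exactly, the original
</code></pre>
<p>The small rounding error introduced here is the "performance degradation" that both GPTQ and bitsandbytes try to keep minimal, each in its own way.</p>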
<p>There are two popular quantization methods for LLMs: GPTQ and bitsandbytes.</p>
<p>In this article, I discuss the main differences between these two approaches. Each has its own advantages and disadvantages that make it suitable for different use cases. I present a comparison of their memory usage and inference speed using Llama 2, and I also discuss their performance based on experiments from previous work.</p>
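<p>As a reference point for the comparison, this is roughly how bitsandbytes quantization is used with Hugging Face Transformers: the model is quantized on the fly while it is loaded. This is only a sketch, assuming recent versions of transformers, accelerate, and bitsandbytes, and access to the gated Llama 2 repository; the exact settings used for the benchmarks may differ.</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint (gated repository)

# 4-bit NF4 quantization applied on the fly at loading time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # dispatch layers to the available GPU(s)
)
</code></pre>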
<p><em>Note: If you want to know more about quantization, I recommend reading this excellent introduction by <a href="https://medium.com/u/dc89da634938?source=post_page-----f79bc03046dc--------------------------------" rel="noopener" target="_blank">Maxime Labonne</a>: <a href="https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c?source=post_page-----f79bc03046dc--------------------------------" rel="noopener follow" target="_blank">Introduction to Weight Quantization: Reducing the size of Large Language Models with 8-bit quantization</a>.</em></p>
<h1>GPTQ: Post-training quantization for lightweight storage and fast inference</h1>
<p><a href="https://arxiv.org/abs/2210.17323" rel="noopener ugc nofollow" target="_blank">GPTQ (Frantar et al., 2023)</a> was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size.</p>
<p><a href="https://towardsdatascience.com/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc"><strong>Website</strong></a></p>