GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2
<p>As large language models (LLMs) have grown larger, with more and more parameters, new techniques to reduce their memory usage have been proposed.</p>
<p>One of the most effective methods to reduce the model size in memory is <strong>quantization</strong>. You can see quantization as a compression technique for LLMs. In practice, the main goal of quantization is to lower the precision of the LLM’s weights, typically from 16-bit to 8-bit, 4-bit, or even 3-bit, with minimal performance degradation.</p>
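<p>To make this concrete, here is a minimal sketch of the idea behind quantization (naive absmax rounding, not GPTQ or bitsandbytes themselves): each weight is rescaled into the 8-bit integer range, and a single scale factor is kept to approximately recover the original values. The function names below are mine, for illustration only.</p>
<pre><code>import torch

def absmax_quantize(weights: torch.Tensor):
    # Map the largest absolute weight to the int8 range [-127, 127]
    scale = 127 / weights.abs().max()
    quantized = (scale * weights).round().to(torch.int8)
    return quantized, scale

def absmax_dequantize(quantized: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original full-precision weights
    return quantized.to(torch.float32) / scale

weights = torch.randn(4, 4)
q, scale = absmax_quantize(weights)
print(q)                            # int8 weights, 4x smaller than fp32
print(absmax_dequantize(q, scale))  # close to, but not exactly, the original
</code></pre>
<p>The small rounding error introduced here is the "performance degradation" that both GPTQ and bitsandbytes try to keep minimal, each in its own way.</p>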
<p>There are two popular quantization methods for LLMs: GPTQ and bitsandbytes.</p>
<p>In this article, I discuss the main differences between these two approaches. Each has its own advantages and disadvantages that make it suitable for different use cases. I present a comparison of their memory usage and inference speed using Llama 2, and I also discuss their performance based on experiments from previous work.</p>
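<p>As a reference point for the comparison, this is roughly how bitsandbytes quantization is used with Hugging Face Transformers: the model is quantized on the fly while it is loaded. This is only a sketch, assuming recent versions of transformers, accelerate, and bitsandbytes, and access to the gated Llama 2 repository; the exact settings used for the benchmarks may differ.</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint (gated repository)

# 4-bit NF4 quantization applied on the fly at loading time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # dispatch layers to the available GPU(s)
)
</code></pre>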
<p><em>Note: If you want to know more about quantization, I recommend reading this excellent introduction by <a href="https://medium.com/u/dc89da634938?source=post_page-----f79bc03046dc--------------------------------" rel="noopener" target="_blank">Maxime Labonne</a>: <a href="https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c?source=post_page-----f79bc03046dc--------------------------------" rel="noopener follow" target="_blank">Introduction to Weight Quantization: Reducing the size of Large Language Models with 8-bit quantization</a>.</em></p>
<h1>GPTQ: Post-training quantization for lightweight storage and fast inference</h1>
<p><a href="https://arxiv.org/abs/2210.17323" rel="noopener ugc nofollow" target="_blank">GPTQ (Frantar et al., 2023)</a> was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size.</p>
<p><a href="https://towardsdatascience.com/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc"><strong>Website</strong></a></p>