4-bit Quantization with GPTQ
<p>Recent advancements in weight quantization allow us to run massive large language models on consumer hardware, like a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like <a href="https://arxiv.org/abs/2210.17323" rel="noopener ugc nofollow" target="_blank">GPTQ</a>, <a href="https://github.com/ggerganov/ggml" rel="noopener ugc nofollow" target="_blank">GGML</a>, and <a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes" rel="noopener ugc nofollow" target="_blank">NF4</a>.</p>
<p>In the <a href="https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c" rel="noopener">previous article</a>, we introduced naïve 8-bit quantization techniques and the excellent LLM.int8(). In this article, we will explore the popular <strong>GPTQ algorithm</strong> to understand how it works and implement it using the <a href="https://github.com/PanQiWei/AutoGPTQ" rel="noopener ugc nofollow" target="_blank">AutoGPTQ</a> library.</p>
<p>You can find the code on <a href="https://colab.research.google.com/drive/1lSvVDaRgqQp_mWK_jC9gydz6_-y6Aq4A?usp=sharing" rel="noopener ugc nofollow" target="_blank">Google Colab</a> and GitHub.</p>
<h1>Optimal Brain Quantization</h1>
<p>Let’s start by introducing the problem we’re trying to solve. For every layer ℓ in the network, we want to find a quantized version <strong>Ŵₗ</strong> of the original weights <strong>Wₗ</strong>. This is called the <strong>layer-wise compression problem</strong>. More specifically, to minimize performance degradation, we want the outputs (<strong>Ŵₗ</strong><strong>Xₗ</strong>) of these new weights to be as close as possible to the original ones (<strong>Wₗ</strong><strong>Xₗ</strong>). In other words, we want to find:</p>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:630/1*02PIN1yLtRRKiQlAJbg5ZQ.png" style="height:56px; width:700px" /></p>
<p>Different approaches have been proposed to solve this problem, but we’re interested in the <strong>Optimal Brain Quantizer</strong> (OBQ) framework here.</p>
<p><a href="https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34">Click Here</a></p>