4-bit Quantization with GPTQ
<p>Recent advancements in weight quantization allow us to run massive large language models on consumer hardware, like a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like <a href="https://arxiv.org/abs/2210.17323" rel="noopener ugc nofollow" target="_blank">GPTQ</a>, <a href="https://github.com/ggerganov/ggml" rel="noopener ugc nofollow" target="_blank">GGML</a>, and <a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes" rel="noopener ugc nofollow" target="_blank">NF4</a>.</p>
<p>In the <a href="https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c" rel="noopener">previous article</a>, we introduced naïve 8-bit quantization techniques and the excellent LLM.int8(). In this article, we will explore the popular <strong>GPTQ algorithm</strong> to understand how it works and implement it using the <a href="https://github.com/PanQiWei/AutoGPTQ" rel="noopener ugc nofollow" target="_blank">AutoGPTQ</a> library.</p>
<p>You can find the code on <a href="https://colab.research.google.com/drive/1lSvVDaRgqQp_mWK_jC9gydz6_-y6Aq4A?usp=sharing" rel="noopener ugc nofollow" target="_blank">Google Colab</a> and GitHub.</p>
<h1>Optimal Brain Quantization</h1>
<p>Let’s start by introducing the problem we’re trying to solve. For every layer ℓ in the network, we want to find a quantized version <strong>Ŵₗ</strong> of the original weights <strong>Wₗ</strong>. This is called the <strong>layer-wise compression problem</strong>. More specifically, to minimize performance degradation, we want the outputs (<strong>Ŵₗ</strong><strong>Xₗ</strong>) of these new weights to be as close as possible to the original ones (<strong>Wₗ</strong><strong>Xₗ</strong>). In other words, we want to find:</p>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:630/1*02PIN1yLtRRKiQlAJbg5ZQ.png" style="height:56px; width:700px" /></p>
<p>Different approaches have been proposed to solve this problem, but we’re interested in the <strong>Optimal Brain Quantizer</strong> (OBQ) framework here.</p>
<p><a href="https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34">Click Here</a></p>