4-bit Quantization with GPTQ

Recent advances in weight quantization allow us to run massive language models on consumer hardware, such as a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like [GPTQ](https://arxiv.org/abs/2210.17323), [GGML](https://github.com/ggerganov/ggml), and [NF4](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In the [previous article](https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c), we introduced naïve 8-bit quantization techniques and the excellent LLM.int8(). In this article, we will explore the popular **GPTQ algorithm** to understand how it works, and implement it using the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library.

You can find the code on [Google Colab](https://colab.research.google.com/drive/1lSvVDaRgqQp_mWK_jC9gydz6_-y6Aq4A?usp=sharing) and GitHub.

# Optimal Brain Quantization

Let's start by introducing the problem we're trying to solve. For every layer ℓ in the network, we want to find a quantized version **Ŵₗ** of the original weights **Wₗ**. This is called the **layer-wise compression problem**. More specifically, to minimize performance degradation, we want the outputs (**Ŵₗ Xₗ**) of these new weights to be as close as possible to the original ones (**Wₗ Xₗ**). In other words, we want to find:

$$\hat{\mathbf{W}}_\ell^{*} = \operatorname*{arg\,min}_{\hat{\mathbf{W}}_\ell} \; \big\lVert \mathbf{W}_\ell \mathbf{X}_\ell - \hat{\mathbf{W}}_\ell \mathbf{X}_\ell \big\rVert_2^2$$

Different approaches have been proposed to solve this problem, but here we're interested in the **Optimal Brain Quantizer** (OBQ) framework.
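To make this layer-wise objective concrete, here is a minimal PyTorch sketch. It is not part of GPTQ or AutoGPTQ; the `naive_quantize` helper and the toy tensor shapes are illustrative assumptions. It quantizes a layer's weights with simple round-to-nearest (absmax) 4-bit quantization and measures the squared output error ‖WₗXₗ − ŴₗXₗ‖₂² that we are trying to minimize:

```python
import torch

def naive_quantize(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Illustrative round-to-nearest quantization with a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit integers
    scale = W.abs().max() / qmax            # one scale for the whole tensor
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return W_q * scale                      # dequantized weights Ŵ

torch.manual_seed(0)

# Toy layer: 64 output features, 128 input features, 16 calibration inputs
W = torch.randn(64, 128)
X = torch.randn(128, 16)

W_hat = naive_quantize(W, bits=4)

# Squared output error that the layer-wise compression problem minimizes
error = ((W @ X - W_hat @ X) ** 2).sum()
print(f"||WX - ŴX||² = {error.item():.4f}")
```

Naive rounding only keeps each weight close to its original value; OBQ and GPTQ instead choose the quantized weights so that this output error on real inputs stays small.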