Introduction to Weight Quantization
<p>Large Language Models (LLMs) are known for their extensive computational requirements. Typically, the size of a model is calculated by multiplying the <strong>number of parameters</strong> by the <strong>precision</strong> of the data type used to store them. However, to save memory, weights can be stored using lower-precision data types through a process known as quantization.</p>
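<p>As a back-of-the-envelope sketch, assuming roughly 124 million parameters for the smallest GPT-2 checkpoint, the memory footprint at a few common precisions works out as follows:</p>
<pre><code>
# Back-of-the-envelope model size: number of parameters x bytes per parameter.
# The 124M figure for GPT-2 (small) is an approximation used for illustration.
NUM_PARAMS = 124_000_000

BYTES_PER_PARAM = {
    "FP32": 4,  # full precision
    "FP16": 2,  # half precision
    "INT8": 1,  # 8-bit integer quantization
}

for dtype, nbytes in BYTES_PER_PARAM.items():
    size_mb = NUM_PARAMS * nbytes / 1024**2
    print(f"{dtype}: ~{size_mb:.0f} MB")
</code></pre>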
<p>We distinguish two main families of weight quantization techniques in the literature:</p>
<ul>
<li><strong>Post-Training Quantization</strong> (PTQ) is a straightforward technique where the weights of an already trained model are converted to lower precision without any retraining. Although easy to implement, PTQ can lead to performance degradation (see the sketch after this list).</li>
<li><strong>Quantization-Aware Training</strong> (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.</li>
</ul>
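<p>To make the PTQ idea concrete, here is a minimal sketch of one common naïve approach, 8-bit absolute-maximum (absmax) quantization of a single weight tensor; the helper names and the use of PyTorch are assumptions made for illustration:</p>
<pre><code>
import torch

def absmax_quantize(weights: torch.Tensor):
    # Scale factor maps the largest absolute weight onto the INT8 limit (127).
    scale = 127 / torch.max(torch.abs(weights))
    # Round to the nearest integer and store as 8-bit values.
    quantized = (scale * weights).round().to(torch.int8)
    return quantized, scale

def absmax_dequantize(quantized: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original FP32 weights.
    return quantized.to(torch.float32) / scale

weights = torch.randn(4, 4)          # stand-in for a trained weight matrix
q, scale = absmax_quantize(weights)
print(weights)
print(absmax_dequantize(q, scale))   # close to, but not exactly, the original
</code></pre>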
<p>In this article, we focus on PTQ to reduce the precision of our parameters. To get a good intuition, we will apply both naïve and more sophisticated techniques to a toy example using a GPT-2 model.</p>
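<p>As a minimal sketch of that setup, assuming the Hugging Face <code>transformers</code> library, the GPT-2 model can be loaded and its default (FP32) footprint inspected like this:</p>
<pre><code>
from transformers import AutoModelForCausalLM

model_id = "gpt2"  # smallest GPT-2 checkpoint (~124M parameters)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(f"Parameters: {model.num_parameters():,}")
# get_memory_footprint() reports the model's size in bytes (FP32 weights by default).
print(f"Memory footprint: {model.get_memory_footprint() / 1024**2:.0f} MB")
</code></pre>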
<p>The entire code is freely available on Google Colab and GitHub; the full walkthrough is published on <a href="https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c">Towards Data Science</a>.</p>