Introduction to Weight Quantization

Large Language Models (LLMs) are known for their extensive computational requirements. Typically, the size of a model is calculated by multiplying the number of parameters (**size**) by the precision of these values (**data type**). However, to save memory, weights can be stored using lower-precision data types through a process known as quantization.

We distinguish two main families of weight quantization techniques in the literature:

- **Post-Training Quantization** (PTQ) is a straightforward technique in which the weights of an already trained model are converted to lower precision without any retraining. Although easy to implement, PTQ can degrade model performance.
- **Quantization-Aware Training** (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.

In this article, we focus on PTQ to reduce the precision of our parameters. To build a good intuition, we will apply both naïve and more sophisticated techniques to a toy example using a GPT-2 model, as sketched below.

The entire code is freely available on Google Colab and GitHub. Full article: https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c
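To make the "parameters × data type" size calculation and the naïve PTQ idea concrete, here is a minimal sketch assuming PyTorch and the Hugging Face `transformers` library. It loads GPT-2, estimates its memory footprint in FP32 versus INT8, and applies a simple absmax round-to-nearest quantization to one weight tensor; the `absmax_quantize` helper is an illustrative example of a naïve technique, not necessarily the exact code used later in the article.

```python
import torch
from transformers import AutoModelForCausalLM

# Load GPT-2 (~124M parameters) in full precision (FP32)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Model size = number of parameters x bytes per value
num_params = sum(p.numel() for p in model.parameters())
fp32_size_mb = num_params * 4 / 1024**2  # FP32 uses 4 bytes per value
int8_size_mb = num_params * 1 / 1024**2  # INT8 uses 1 byte per value
print(f"Parameters: {num_params:,}")
print(f"FP32 size: {fp32_size_mb:.1f} MB | INT8 size: {int8_size_mb:.1f} MB")

def absmax_quantize(weights: torch.Tensor):
    """Naïve PTQ: scale weights into the INT8 range [-127, 127] and round."""
    scale = 127 / torch.max(torch.abs(weights))
    quantized = (scale * weights).round().to(torch.int8)
    dequantized = quantized.float() / scale  # map back for error measurement
    return quantized, dequantized

# Quantize a single attention weight matrix as a toy example
w = model.transformer.h[0].attn.c_attn.weight.data
w_q, w_dq = absmax_quantize(w)
print("Max absolute quantization error:", (w - w_dq).abs().max().item())
```

Storing the weights as INT8 cuts the memory footprint roughly by a factor of four compared with FP32, at the cost of the small rounding error reported above; the rest of the article examines how naïve and more sophisticated PTQ schemes manage that trade-off.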