Tag: Quantization

Quantization Method to Use for LLMs

As large language models (LLMs) have grown larger, with more and more parameters, new techniques to reduce their memory usage have been proposed. One of the most effective ways to shrink a model's footprint in memory is quantization. You can think of quantization as a compression technique for LLMs. In...
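To make the "compression" framing concrete, here is a minimal sketch of naive absmax quantization of a weight tensor to 8-bit integers. It is an illustration of the general idea only, not a method from the article, and all names are invented for the example:

```python
import numpy as np

# Toy float32 "weights" standing in for an LLM weight matrix.
weights = np.random.randn(4, 4).astype(np.float32)

# Absmax quantization: map the range [-max|w|, +max|w|] onto int8.
scale = 127.0 / np.max(np.abs(weights))
quantized = np.round(weights * scale).astype(np.int8)   # 1 byte per value
dequantized = quantized.astype(np.float32) / scale      # lossy reconstruction

print("max absolute error:", np.max(np.abs(weights - dequantized)))
print("storage: 4 bytes -> 1 byte per parameter")
```

The compression is lossy: the reconstruction error is bounded by half a quantization step, which is why more careful schemes (per-group scales, outlier handling) exist.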

Introduction to Weight Quantization

Large Language Models (LLMs) are known for their extensive computational requirements. Typically, the size of a model is calculated by multiplying the number of parameters by the precision of those values (the data type). However, to save memory, weights can be stored using lower-precision data t...
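That size calculation is easy to make concrete. The sketch below (illustrative numbers, not taken from the article) computes the memory footprint of a hypothetical 7B-parameter model at several precisions:

```python
# Memory footprint = number of parameters x bytes per parameter.
PARAMS = 7e9  # e.g., a 7B-parameter model (assumption for illustration)

for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{dtype}: {gigabytes:.1f} GB")

# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```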

Boosting PyTorch Inference on CPU: From Post-Training Quantization to Multithreading

Welcome to another edition of “The Kaggle Blueprints”, where we analyze the winning solutions of Kaggle competitions for lessons we can apply to our own data science projects. This edition reviews the techniques and approaches from the “BirdCLEF 2023”...
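For context on the two levers named in the title, here is a hedged sketch of post-training dynamic quantization and thread-count control in stock PyTorch. It is not the competition code; the toy model is an assumption made for the example:

```python
import torch
import torch.nn as nn

# Toy network standing in for an inference model (assumption, not the BirdCLEF model).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: nn.Linear weights become int8;
# activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Multithreading: control intra-op parallelism for CPU inference.
torch.set_num_threads(4)

with torch.inference_mode():
    out = quantized_model(torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 10])
```

Dynamic quantization needs no calibration data, which makes it the lowest-effort starting point before trying static quantization.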

4-bit Quantization with GPTQ

Recent advancements in weight quantization allow us to run massive large language models on consumer hardware, like a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like GPTQ, GGML, and NF4. I...
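As one concrete route to 4-bit loading on consumer hardware, here is a hedged sketch using the Hugging Face transformers integration with bitsandbytes NF4. The model ID is a placeholder, and note that GPTQ itself uses a different toolchain (e.g., AutoGPTQ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit loading via bitsandbytes; requires a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-7b-hf"     # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```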