Efficient Deep Learning: Unleashing the Power of Model Compression

<h2>Introduction</h2>
<p>When a Machine Learning model is deployed to production, it often has to meet requirements that were not taken into account during the prototyping phase. For example, the model in production will have to handle many requests from different users of the product, so you will want to optimize, for instance, latency and/or throughput.</p>
<ul>
<li><strong>Latency</strong>: the time it takes for a single task to get done, like how long it takes a webpage to load after you click a link. It is the waiting time between starting something and seeing the result.</li>
<li><strong>Throughput</strong>: how many requests a system can handle in a given amount of time.</li>
</ul>
<p>This means that the Machine Learning model has to be very fast at making its predictions, and there are various techniques for increasing the speed of model inference; let's look at the most important ones in this article. A small timing sketch illustrating both metrics is included at the end of this post.</p>
<h2>Model Compression</h2>
<p>Some techniques aim to make <strong>models smaller</strong>, which is why they are called <strong>model compression</strong> techniques, while others focus on making models <strong>faster at inference</strong> and thus fall under the field of <strong>model optimization</strong>.<br /> However, making models smaller often also improves inference speed, so the line separating these two fields of study is quite blurred.</p>
<h2>Low Rank Factorization</h2>
<p>This is the first method we will look at, and it is being studied intensively: many papers concerning it have come out recently. A minimal sketch of the idea is also included at the end of this post.</p>
<p><a href="https://towardsdatascience.com/efficient-deep-learning-unleashing-the-power-of-model-compression-7b5ea37d4d06"><strong>Read the full article</strong></a></p>
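<p>To make the latency and throughput definitions above concrete, here is a minimal timing sketch. The <code>predict</code> function, the dummy requests, and their sizes are illustrative assumptions, not part of the original article; any real model's inference call could take their place.</p>
<pre><code>
# Minimal sketch: measuring average latency and throughput for a
# hypothetical predict() function. Everything here is illustrative.
import time

def predict(batch):
    # Stand-in for a real model's inference call (hypothetical).
    return [x * 2 for x in batch]

requests = [[i] for i in range(1000)]  # 1000 dummy single-item requests

latencies = []
start = time.perf_counter()
for request in requests:
    t0 = time.perf_counter()
    predict(request)
    latencies.append(time.perf_counter() - t0)
total_time = time.perf_counter() - start

avg_latency_ms = 1000 * sum(latencies) / len(latencies)  # waiting time per request
throughput_rps = len(requests) / total_time              # requests handled per second

print(f"average latency: {avg_latency_ms:.4f} ms")
print(f"throughput: {throughput_rps:.0f} requests/s")
</code></pre>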
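<p>And here is a minimal sketch of what low rank factorization looks like in practice: a dense weight matrix is approximated by the product of two much thinner matrices obtained with a truncated SVD. The matrix shape and target rank below are assumptions chosen for illustration; the article itself does not prescribe them.</p>
<pre><code>
# Minimal sketch: low-rank factorization of a dense layer's weights with NumPy.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))   # original dense weight matrix (illustrative size)
rank = 32                             # target rank, a tunable hyperparameter

# Truncated SVD: keep only the top-`rank` singular values and vectors.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]            # shape (512, rank)
B = Vt[:rank, :]                      # shape (rank, 256)

# One large matmul x @ W is replaced by two smaller ones: (x @ A) @ B.
x = rng.standard_normal((8, 512))     # a dummy batch of activations
relative_error = np.linalg.norm(x @ W - (x @ A) @ B) / np.linalg.norm(x @ W)

params_before = W.size
params_after = A.size + B.size
print(f"parameters: {params_before} before, {params_after} after factorization")
print(f"relative output error: {relative_error:.3f}")
</code></pre>
<p>The factorized layer stores far fewer parameters and does less work per forward pass, at the cost of a small approximation error controlled by the chosen rank.</p>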
Tags: Deep Learning