<h1>Lessons Learned while doing Machine Learning at scale using Python and Google Cloud</h1>
<p>I recently worked on a number of projects implementing machine learning pipelines at scale while optimizing them to save costs. Here are my tips to make your ML projects time- and cost-efficient.</p>
<p><strong>1. Perform Feature Generation in the Database Query itself.</strong></p>
<p>This is probably the most important piece of advice: perform your feature generation in the query stage of your machine learning pipeline. 90% of your features can be generated in SQL, Mongo, BigQuery, or whatever database you use. Because databases and query languages are optimized for these operations, this removes the overhead of performing heavy transformations in-memory.</p>
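<p>A minimal sketch of the idea, using SQLite from the standard library as a stand-in for BigQuery (the table and column names are illustrative); the aggregated features are computed entirely by the database, not in Python memory:</p>
<pre><code>import sqlite3

# Feature generation pushed into the SQL query itself.
# SQLite stands in for BigQuery here; the same pattern applies to BigQuery SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INTEGER, amount REAL, created_at TEXT);
INSERT INTO orders VALUES
  (1, 10.0, '2023-01-01'),
  (1, 30.0, '2023-01-05'),
  (2, 5.0,  '2023-01-02');
""")

# Per-user aggregate features (order count, total and average spend)
# are computed by the database engine before anything reaches Python.
features = conn.execute("""
SELECT user_id,
       COUNT(*)    AS n_orders,
       SUM(amount) AS total_spend,
       AVG(amount) AS avg_spend
FROM orders
GROUP BY user_id
ORDER BY user_id
""").fetchall()

print(features)  # [(1, 2, 40.0, 20.0), (2, 1, 5.0, 5.0)]
</code></pre>
<p>The Python process only ever sees the finished feature rows, one per user, rather than the raw order log.</p>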
<p><strong>2. Quality over Quantity</strong></p>
<p>Drop columns and dataset copies that are no longer required in the process. I have a term for this: “variable hoarding”. I have noticed in code reviews that developers keep saving copies in different variables instead of doing in-place operations. A rule of thumb for a single process: a duplicate of a dataset should only exist if there is a specific purpose for it, and a copy that has served its purpose should be deleted immediately. Dataset copies increase memory overhead, reduce speed, and drive up the cost of processing time on the cloud.</p>
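<p>A small pandas sketch of the rule (column names are illustrative): prefer in-place operations over new variables, and delete any genuinely needed copy as soon as it has served its purpose:</p>
<pre><code>import pandas as pd

# Avoid "variable hoarding": operate in place and free intermediates
# once they have served their purpose.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "amount": [10.0, 5.0, 7.5],
    "debug_notes": ["a", "b", "c"],  # column not needed for training
})

# Instead of df_clean = df.drop(columns=["debug_notes"]) — which creates
# a second full copy bound to a new variable — drop on the existing frame.
df.drop(columns=["debug_notes"], inplace=True)

# If a temporary copy is genuinely needed, release it as soon as it is done.
df_scaled = df.copy()
df_scaled["amount"] = df_scaled["amount"] / df_scaled["amount"].max()
df["amount_scaled"] = df_scaled["amount"]
del df_scaled  # the duplicate has served its purpose

print(list(df.columns))  # ['user_id', 'amount', 'amount_scaled']
</code></pre>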
<p><strong>3. Use a Model Versioning System</strong></p>
<p>Most of you will be familiar with the concept of versioning, but I still want to reiterate that model versioning helps not just in pushing ML models to production but also in splitting the training operation into separate batches. Dividing your dataset into smaller batches and incrementally retraining the same model produces the same output as feeding it a billion training data points at once and keeping the process alive for hours.</p>
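<p>A toy sketch of batched training with versioned checkpoints (the model, data, and file names are all illustrative): a simple gradient-descent model for y = 2x is updated one batch at a time, and a versioned copy is saved after each batch so training can resume from any version instead of one long-lived job:</p>
<pre><code>import os
import pickle
import tempfile

def sgd_step(w, batch, lr=0.05):
    """One pass of gradient descent for the model y = w * x over a batch."""
    for x, y in batch:
        w -= lr * 2 * (w * x - y) * x
    return w

# Toy data drawn from y = 2x, split into small batches.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0] * 10]
batches = [data[i:i + 8] for i in range(0, len(data), 8)]

ckpt_dir = tempfile.mkdtemp()
w = 0.0
for version, batch in enumerate(batches, start=1):
    w = sgd_step(w, batch)
    # Versioned checkpoint after each batch: the next run can pick up
    # from model_v{version}.pkl rather than retraining from scratch.
    with open(os.path.join(ckpt_dir, f"model_v{version}.pkl"), "wb") as f:
        pickle.dump(w, f)

print(round(w, 2))  # 2.0
</code></pre>
<p>Each checkpoint file is a resumable snapshot, so a billion-point dataset becomes many short jobs instead of one multi-hour process.</p>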
<p><strong>4. Setting up read and load operations is more time-consuming than train/predict operations.</strong></p>
<p>Yes, you read that correctly. Coding the read and load operations is more time-consuming than the training and prediction pipelines themselves. Defining your read and load architecture before you start implementing your machine learning pipeline is very important. Parallelized and concurrent read/load architectures help the pipeline run smoothly and save your results efficiently.</p>
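<p>One way to sketch a parallel read stage is with a thread pool, since reads are I/O-bound; <code>read_shard</code> here is a hypothetical stand-in for a real fetch (a GCS blob, a BigQuery result page, a file):</p>
<pre><code>from concurrent.futures import ThreadPoolExecutor

def read_shard(shard_id):
    # Stand-in for an I/O-bound read; in a real pipeline this would
    # fetch bytes over the network or from disk.
    return [shard_id * 10 + i for i in range(3)]

shard_ids = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map runs reads concurrently but preserves input order,
    # so results line up with shard_ids.
    shards = list(pool.map(read_shard, shard_ids))

rows = [row for shard in shards for row in shard]
print(rows)  # [0, 1, 2, 10, 11, 12, 20, 21, 22, 30, 31, 32]
</code></pre>
<p>The same shape works for the load side: write each result shard from its own worker instead of serializing everything through one writer.</p>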
<p><a href="https://medium.com/@rohanaswani41/lessons-learned-while-doing-machine-learning-at-scale-using-python-and-google-cloud-5968936af26c">Website</a></p>