Sklearn Pipelines for the Modern ML Engineer: 9 Techniques You Can’t Ignore

<p>Today, this is what I am selling:</p>
<pre>awesome_pipeline.fit(X, y)</pre>
<p><code>awesome_pipeline</code> may look like just another variable, but here is what it does to poor <code>X</code> and <code>y</code> under the hood:</p>
<ol>
<li>Automatically separates the numerical and categorical features of <code>X</code>.</li>
<li>Imputes missing values in the numeric features.</li>
<li>Log-transforms the skewed numeric features while normalizing the rest.</li>
<li>Imputes missing values in the categorical features and one-hot encodes them.</li>
<li>Normalizes the target array <code>y</code> for good measure.</li>
</ol>
<p>Apart from collapsing almost 100 lines of unreadable code into a single line, <code>awesome_pipeline</code> can now be dropped into cross-validators or hyperparameter tuners, guarding your code from data leakage and making everything reproducible, modular, and headache-free.</p>
<p>Let&rsquo;s see how to build the thing.</p>
<h2>0. Estimators vs. transformers</h2>
<p>First, let&rsquo;s get the terminology out of the way.</p>
<p>A transformer in Sklearn is any class that accepts the features of a dataset, applies a transformation to them, and returns them. It implements the <code>fit</code>, <code>transform</code>, and <code>fit_transform</code> methods.</p>
<p>An example is <code>QuantileTransformer</code>, which maps a numeric feature onto a uniform distribution by default, or onto a normal distribution when you pass <code>output_distribution="normal"</code>. Because it operates on ranks rather than raw values, it is especially useful for features with outliers.</p>
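<p>The five steps above can be sketched with <code>ColumnTransformer</code> and <code>TransformedTargetRegressor</code>. The toy column names (<code>income</code>, <code>age</code>, <code>city</code>), the imputation strategies, and the <code>Ridge</code> model are my own illustrative assumptions, not prescriptions from the article:</p>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Steps 1-3: numeric features are imputed, then the skewed ones are
# log-transformed and the rest are standardized.
skewed_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
])
other_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Step 4: categorical features are imputed with the mode, then one-hot encoded.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# The ColumnTransformer routes each column group to its own sub-pipeline.
preprocess = ColumnTransformer([
    ("skewed", skewed_numeric, ["income"]),
    ("numeric", other_numeric, ["age"]),
    ("categorical", categorical, ["city"]),
])

# Step 5: TransformedTargetRegressor scales y before fitting the model
# and inverts the scaling at prediction time.
awesome_pipeline = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("model", Ridge())]),
    transformer=StandardScaler(),
)

# Tiny toy dataset with missing values in every column type.
X = pd.DataFrame({
    "income": [40e3, 55e3, np.nan, 1e6, 48e3, 52e3],
    "age": [25.0, 32.0, 40.0, np.nan, 29.0, 35.0],
    "city": ["NY", "LA", None, "NY", "SF", "LA"],
})
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

awesome_pipeline.fit(X, y)
preds = awesome_pipeline.predict(X)
```

<p>Because all preprocessing lives inside the pipeline, a cross-validator refits the imputers, scalers, and encoder on each training fold only, which is exactly what prevents data leakage.</p>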
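<p>A quick sketch of the transformer interface in action (the log-normal toy data is my own assumption, not from the article): <code>QuantileTransformer</code> is fit on a skewed feature and returns an approximately Gaussian one.</p>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Heavily right-skewed toy feature with large outliers.
x = rng.lognormal(mean=0.0, sigma=2.0, size=(1_000, 1))

# The default output distribution is uniform; request a normal one instead.
# n_quantiles is lowered so it does not exceed the number of samples.
qt = QuantileTransformer(
    output_distribution="normal", n_quantiles=100, random_state=0
)
x_gauss = qt.fit_transform(x)  # roughly zero mean, unit variance
```

<p>Like any transformer, it exposes <code>fit</code>, <code>transform</code>, and <code>fit_transform</code>, so it can be dropped straight into a pipeline step.</p>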