Today, this is what I am selling:
```python
awesome_pipeline.fit(X, y)
```
awesome_pipeline may look like just another variable, but here is what it does to poor X and y under the hood:
- Automatically isolates the numerical and categorical features of X.
- Imputes missing values in numeric features.
- Log-transforms skewed features while normalizing the rest.
- Imputes missing values in categorical features and one-hot encodes them.
- Normalizes the target array y for good measure.
Apart from collapsing almost 100 lines' worth of unreadable code into a single line, awesome_pipeline can now be dropped into cross-validators or hyperparameter tuners, guarding your code from data leakage and making everything reproducible, modular, and headache-free.
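As a taste of that, here is a minimal sketch of plugging the pipeline into cross-validation (assuming awesome_pipeline has been built, which we will do below):

```python
from sklearn.model_selection import cross_val_score

# Each fold re-fits the entire pipeline on the training split only,
# so imputation and scaling statistics never leak from the validation split.
scores = cross_val_score(awesome_pipeline, X, y, cv=5)
print(scores.mean())
```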
Let’s see how to build the thing.
0. Estimators vs transformers
First, let’s get the terminology out of the way.
A transformer in Sklearn is any object that accepts the features of a dataset, applies a transformation to them, and returns the result. It exposes fit, transform, and fit_transform methods.
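StandardScaler is a familiar example of this interface; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                   # learns the column mean and std
X_scaled = scaler.transform(X_toy)  # applies the learned scaling

# Or both steps in one call:
X_scaled = scaler.fit_transform(X_toy)
```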
An example is the QuantileTransformer, which maps numeric features onto a uniform or normal distribution using their quantiles. Because it works with ranks rather than raw values, it is especially useful for features with outliers.
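A minimal sketch (note that the normal output has to be requested explicitly, since the default output distribution is uniform):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# A skewed feature with one extreme outlier
x_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(output_distribution="normal", n_quantiles=5)
x_gaussian = qt.fit_transform(x_toy)
```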