Sklearn Pipelines for the Modern ML Engineer: 9 Techniques You Can't Ignore

Today, this is what I am selling:

awesome_pipeline.fit(X, y)

awesome_pipeline may look like just another variable, but here is what it does to poor X and y under the hood:

  1. Automatically isolates numerical and categorical features of X.
  2. Imputes missing values in numeric features.
  3. Log-transforms skewed features while normalizing the rest.
  4. Imputes missing values in categorical features and one-hot encodes them.
  5. Normalizes the target array y for good measure.

Apart from collapsing almost 100 lines of unreadable code into a single line, awesome_pipeline can now be dropped into cross-validators or hyperparameter tuners, guarding your code against data leakage and making everything reproducible, modular, and headache-free.
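The five steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the toy columns, the choice of Ridge as the model, and the simplification of log-transforming all numeric features instead of only the skewed ones.

```python
# Hypothetical sketch of the pipeline described above; column names,
# the model (Ridge), and the toy data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import (ColumnTransformer, TransformedTargetRegressor,
                             make_column_selector)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   StandardScaler)

# Steps 2-3: impute numeric features, log-transform, normalize.
# (For brevity, all numeric columns are log-transformed here,
# not just the skewed ones.)
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),  # log1p is safe at zero
    ("scale", StandardScaler()),
])

# Step 4: impute categorical features and one-hot encode them.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Step 1: route columns by dtype into the right branch.
preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
])

# Step 5: normalize the target y via TransformedTargetRegressor.
awesome_pipeline = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("model", Ridge())]),
    transformer=StandardScaler(),
)

# Toy data with missing values in both branches.
X = pd.DataFrame({
    "age": [25, np.nan, 40, 33],
    "income": [50_000.0, 62_000.0, np.nan, 48_000.0],
    "city": ["NY", "LA", None, "NY"],
})
y = np.array([200.0, 310.0, 180.0, 240.0])

awesome_pipeline.fit(X, y)
preds = awesome_pipeline.predict(X)
```

Because the whole thing is one estimator, `cross_val_score(awesome_pipeline, X, y)` re-fits every preprocessing step inside each fold, which is exactly what prevents leakage.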

Let’s see how to build the thing.

0. Estimators vs transformers

First, let’s get the terminology out of the way.

A transformer in Sklearn is any class that takes the features of a dataset, applies a transformation, and returns them. It exposes fit, transform, and fit_transform methods.

An example is the QuantileTransformer, which maps numeric input(s) to a uniform distribution by default, or to a normal distribution when output_distribution="normal" is set. It is especially useful for features with outliers.
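A quick demonstration on a skewed feature; the toy data below is made up for illustration.

```python
# Demo of QuantileTransformer on a skewed feature with outliers
# (the synthetic data is an illustrative assumption).
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Exponentially distributed feature plus a few extreme outliers
x = np.concatenate(
    [rng.exponential(1.0, 995), [50, 80, 120, 300, 1000]]
).reshape(-1, 1)

# output_distribution="normal" maps the feature onto a standard
# normal; the default would map it onto a uniform distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=500)
x_gauss = qt.fit_transform(x)

print(x_gauss.mean(), x_gauss.std())  # roughly 0 and 1
```

Because the mapping is rank-based, the outliers end up in the tails of the normal curve instead of stretching the whole scale.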