CatBoost Regression: Break It Down For Me

CatBoost, short for Categorical Boosting, is a powerful machine learning algorithm that excels at handling categorical features and producing accurate predictions. Traditionally, dealing with categorical data is pretty tricky: it requires one-hot encoding, label encoding, or some other preprocessing technique that can distort the data’s inherent structure. To tackle this issue, CatBoost employs its own built-in encoding system called Ordered Target Encoding.
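
To make the contrast concrete, here is a minimal sketch of the two conventional encodings in plain Python. The genre values are made up for illustration. Label encoding assigns arbitrary integers that imply an order the genres do not have, while one-hot encoding adds one column per distinct genre.

```python
# Hypothetical genre values, used only to illustrate the two encodings.
genres = ["Mystery", "Romance", "Fantasy", "Mystery"]

# Label encoding: map each distinct genre to an integer (alphabetical order here).
labels = {g: i for i, g in enumerate(sorted(set(genres)))}
label_encoded = [labels[g] for g in genres]

# One-hot encoding: one 0/1 column per distinct genre.
columns = sorted(set(genres))
one_hot = [[1 if g == c else 0 for c in columns] for g in genres]

print(label_encoded)
print(one_hot)
```

With label encoding, Romance ends up "greater than" Fantasy purely because of alphabetical order, which is the kind of distortion an ordered target encoding tries to avoid.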

Let’s see how CatBoost works in practice by building a model to predict how someone might rate the book Murder, She Texted based on their average book rating on Goodreads and their favorite genre.

We asked 6 people to rate Murder, She Texted and collected the other relevant information about them.

This is our current training dataset, which we will use to train (duh) the model.

Step 1: Shuffle the Dataset and Encode the Categorical Data Using Ordered Target Encoding

The way we preprocess categorical data is central to the CatBoost algorithm. In this case, we only have one categorical column: Favorite Genre. This column is encoded (aka converted to a number), and the way it is done varies depending on whether it is a Regression or Classification problem. Since we are dealing with a Regression problem (because the variable we want to predict, Murder, She Texted Rating, is continuous), we follow these steps.
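
As a rough illustration of the idea, the sketch below shuffles a tiny made-up dataset and replaces each genre with a running statistic computed only from the rows that come before it, so a row's own target never leaks into its encoding. The rows, the `prior` and `weight` values, and the function name `ordered_target_encode` are all assumptions for illustration. The formula is the general ordered target statistic, and CatBoost's exact regression procedure may differ in its details.

```python
import random

def ordered_target_encode(categories, targets, prior=0.5, weight=1.0):
    """Encode each category using only the targets of earlier rows.

    encoded_i = (sum of earlier targets with the same category + weight * prior)
                / (number of earlier rows with the same category + weight)
    """
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded.append((s + weight * prior) / (n + weight))
        # Update the running statistics only after encoding the current row.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

# Hypothetical rows (favorite genre, rating); not the article's data.
rows = [("Mystery", 4.0), ("Romance", 2.0), ("Mystery", 5.0),
        ("Fantasy", 3.0), ("Romance", 1.0), ("Mystery", 3.0)]

rng = random.Random(0)  # fixed seed so the shuffle is repeatable
rng.shuffle(rows)

genres = [g for g, _ in rows]
ratings = [r for _, r in rows]
encoded_genres = ordered_target_encode(genres, ratings)
for genre, rating, enc in zip(genres, ratings, encoded_genres):
    print(f"{genre:8s} rating={rating:.1f} encoded={enc:.3f}")
```

The first time a genre appears it has no history, so its encoding falls back to the prior. Later rows with the same genre get values that reflect the ratings seen so far, which is why the shuffle order matters.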

