CatBoost Regression: Break It Down For Me

CatBoost, short for Categorical Boosting, is a powerful gradient boosting algorithm that excels at handling categorical features and producing accurate predictions. Traditionally, dealing with categorical data is pretty tricky, requiring one-hot encoding, label encoding, or some other preprocessing technique that can distort the data’s inherent structure. To tackle this issue, CatBoost employs its own built-in encoding system called Ordered Target Encoding.

Let’s see how CatBoost works in practice by building a model to predict how someone might rate the book Murder, She Texted based on their average book rating on <a href="https://www.goodreads.com/" rel="noopener ugc nofollow" target="_blank">Goodreads</a> and their favorite genre. We asked 6 people to rate Murder, She Texted and collected the other relevant information about them.

<img alt="" src="https://miro.medium.com/v2/resize:fit:630/1*EcjFdW44ZI2e6HN9GE0mXw.png" style="height:571px; width:700px" />

This is our current training dataset, which we will use to train (duh) the model.

<h2>Step 1: Shuffle the Dataset and Encode the Categorical Data Using Ordered Target Encoding</h2>

The way we preprocess categorical data is central to the CatBoost algorithm. In this case, we only have one categorical column: Favorite Genre. This column is encoded (aka converted to a number), and the way it is done varies depending on whether it is a Regression or Classification problem. Since we are dealing with a Regression problem (because the variable we want to predict, Murder, She Texted Rating, is continuous), we follow these steps.

<a href="https://towardsdatascience.com/catboost-regression-break-it-down-for-me-16ed8c6c1eca">Read More</a>
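
To make the idea behind Step 1 a little more concrete, here is a minimal sketch of ordered target encoding in plain Python. It is an illustration under stated assumptions, not the article's exact regression procedure and not CatBoost's internal implementation, which differs in its details. The function name <code>ordered_target_encode</code>, the <code>prior</code> and <code>weight</code> parameters, and the toy genres and ratings are all hypothetical; they are not the dataset shown in the image above. The key point it tries to show is that each row is encoded using only the target values of rows that come before it in the shuffled order, so a row never sees its own target.

```python
import random


def ordered_target_encode(categories, targets, prior, weight=1.0, seed=0):
    """Encode each row from the targets of earlier rows in a shuffled order.

    Assumed formula (a common textbook form, not necessarily CatBoost's):
        (sum of earlier targets in the same category + prior * weight)
        / (number of earlier rows in the same category + weight)
    """
    # Shuffle the row order so the encoding does not depend on how
    # the data happened to be sorted.
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Only rows visited before this one contribute to its encoding.
        encoded[i] = (seen_sum + prior * weight) / (seen_count + weight)
        # After encoding, this row's target becomes visible to later rows.
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded, order


# Hypothetical toy data, made up for illustration only.
genres = ["Mystery", "Romance", "Mystery", "Fantasy", "Romance", "Mystery"]
ratings = [4.0, 2.5, 3.5, 3.0, 2.0, 4.5]
prior = sum(ratings) / len(ratings)  # assumption: use the overall mean as the prior

encoded, order = ordered_target_encode(genres, ratings, prior)
for genre, value in zip(genres, encoded):
    print(f"{genre:8s} -> {value:.3f}")
```

The first row visited for any genre falls back to the prior, because no earlier rows with that genre exist yet. Later rows with the same genre drift toward the ratings seen so far. When training with the library itself, this encoding is handled internally for the columns you mark as categorical.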