CatBoost Regression: Break It Down For Me

<p>CatBoost, short for Categorical Boosting, is a gradient boosting algorithm that excels at handling categorical features and producing accurate predictions. Traditionally, dealing with categorical data is tricky: it requires one-hot encoding, label encoding, or some other preprocessing technique that can distort the data&rsquo;s inherent structure. To tackle this issue, CatBoost employs its own built-in encoding system called&nbsp;<strong>Ordered Target Encoding</strong>.</p> <p>Let&rsquo;s see how CatBoost works in practice by building a model to predict how someone might rate the book&nbsp;<em>Murder, She Texted</em>&nbsp;based on their average book rating on&nbsp;<a href="https://www.goodreads.com/" rel="noopener ugc nofollow" target="_blank">Goodreads</a>&nbsp;and their favorite genre.</p> <p>We asked six people to rate&nbsp;<em>Murder, She Texted</em>&nbsp;and collected the other relevant information about them.</p> <p><img alt="" src="https://miro.medium.com/v2/resize:fit:630/1*EcjFdW44ZI2e6HN9GE0mXw.png" style="height:571px; width:700px" /></p> <p>This is our training dataset, which we will use to train the model.</p> <h2>Step 1: Shuffle the Dataset and Encode the Categorical Data Using&nbsp;<strong>Ordered Target Encoding</strong></h2> <p>The way we preprocess categorical data is central to the CatBoost algorithm. In this case, we have only one categorical column,&nbsp;<em>Favorite Genre</em>. This column is encoded (that is, converted to a numeric value), and the way this is done depends on whether the problem is regression or classification. Since we are dealing with a regression problem (the variable we want to predict,&nbsp;<em>Murder, She Texted Rating</em>, is continuous), we use the following steps.</p> <p><a href="https://towardsdatascience.com/catboost-regression-break-it-down-for-me-16ed8c6c1eca">Read More</a></p>
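<p>To make the idea behind Step 1 concrete, here is a minimal sketch of an ordered target statistic in plain Python. It is a simplified illustration of the general technique, not CatBoost&rsquo;s exact implementation and not necessarily the precise procedure for regression targets. The function name <code>ordered_target_encode</code>, the <code>prior</code> and <code>weight</code> parameters, and the toy genres and ratings are all assumptions made up for this example; they are not the dataset pictured above.</p>

```python
import random


def ordered_target_encode(categories, targets, prior, weight=1.0, seed=0):
    """Encode each row's category using only rows that precede it
    in a random ordering of the dataset (hypothetical helper)."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # shuffle the dataset first

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * len(categories)

    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of the targets of *earlier* rows with this category.
        # The row's own target is never used, which limits target leakage.
        encoded[i] = (s + weight * prior) / (n + weight)
        # Only now does the current row become visible to later rows.
        sums[cat] = s + targets[i]
        counts[cat] = n + 1

    return encoded, order


# Toy values invented for illustration only.
genres = ["Mystery", "Romance", "Mystery", "Sci-Fi", "Romance", "Mystery"]
ratings = [4.0, 2.5, 3.5, 5.0, 3.0, 4.5]
prior = sum(ratings) / len(ratings)  # assumed prior: the overall mean rating

encoded, order = ordered_target_encode(genres, ratings, prior)
for i in order:
    print(genres[i], ratings[i], round(encoded[i], 3))
```

<p>Because each row sees only the rows that come before it in the shuffled order, the first occurrence of every genre falls back to the prior, and later occurrences move toward the mean rating observed so far for that genre.</p>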