CatBoost Regression: Break It Down For Me
<p>CatBoost, short for Categorical Boosting, is a gradient boosting algorithm that excels at handling categorical features. Traditionally, categorical data is tricky to work with: it requires one-hot encoding, label encoding, or some other preprocessing technique that can distort the data’s inherent structure. To tackle this issue, CatBoost employs its own built-in encoding system called <strong>Ordered Target Encoding</strong>.</p>
<p>Let’s see how CatBoost works in practice by building a model to predict how someone might rate the book <em>Murder, She Texted</em> based on their average book rating on <a href="https://www.goodreads.com/" rel="noopener ugc nofollow" target="_blank">Goodreads</a> and their favorite genre.</p>
<p>We asked 6 people to rate <em>Murder, She Texted</em> and collected the other relevant information about them.</p>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:630/1*EcjFdW44ZI2e6HN9GE0mXw.png" style="height:571px; width:700px" /></p>
<p>This is our current training dataset, which we will use to train (duh) the model.</p>
<h2>Step 1: Shuffle the Dataset and Encode the Categorical Data Using <strong>Ordered Target Encoding</strong></h2>
<p>The way we preprocess categorical data is central to the CatBoost algorithm. In this case, we only have one categorical column: <em>Favorite Genre</em>. This column is encoded (aka converted to a number), and the way it is done varies depending on whether it is a Regression or Classification problem. Since we are dealing with a Regression problem (because the variable we want to predict, <em>Murder, She Texted Rating</em>, is continuous), we proceed as follows.</p>
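<p>To make the idea concrete, here is a minimal sketch of an ordered target statistic in plain Python. It is a simplification, not CatBoost's actual implementation: the function name <code>ordered_target_encode</code>, the <code>prior</code> value of 0.05, and the toy genres and ratings are all assumptions made for illustration (they are not the values from the table above), and CatBoost's own handling of regression targets may differ in its details. The point it shows is that each row is encoded using only the rows that come before it in the shuffled order, so a row's own target never leaks into its encoding.</p>

```python
import random


def ordered_target_encode(categories, targets, prior=0.05, seed=0):
    """Encode each row's category from the targets of EARLIER rows only.

    Simplified illustration: encoding = (sum of earlier targets for the
    same category + prior) / (number of earlier rows with that category + 1).
    The prior value and this exact formula are assumptions for the sketch.
    """
    n = len(categories)

    # Step 1: shuffle the row order (seeded so the sketch is reproducible).
    order = list(range(n))
    random.Random(seed).shuffle(order)

    running_sum = {}    # category -> sum of targets seen so far
    running_count = {}  # category -> number of rows seen so far
    encoded = [0.0] * n

    # Step 2: walk the rows in shuffled order, encoding before updating.
    for idx in order:
        cat = categories[idx]
        s = running_sum.get(cat, 0.0)
        c = running_count.get(cat, 0)
        encoded[idx] = (s + prior) / (c + 1)

        # Only now does this row's target become visible to later rows.
        running_sum[cat] = s + targets[idx]
        running_count[cat] = c + 1

    return encoded, order


# Hypothetical toy data, NOT the article's dataset.
genres = ["Mystery", "Romance", "Mystery", "Fantasy", "Romance", "Mystery"]
ratings = [4.0, 2.5, 3.5, 3.0, 2.0, 5.0]

encoded, order = ordered_target_encode(genres, ratings)
for idx in order:
    print(genres[idx], ratings[idx], round(encoded[idx], 3))
```

<p>In this sketch, the first time a genre appears in the shuffled order it has no history, so its encoding falls back to the prior alone. Later rows with the same genre get values that drift toward the average rating of the rows already seen.</p>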
<p><a href="https://towardsdatascience.com/catboost-regression-break-it-down-for-me-16ed8c6c1eca">Read More</a></p>