From Data to Clusters: When is Your Clustering Good Enough?

With unsupervised cluster analysis, we can group observations with similar patterns and reveal (hidden) trends in the data. Cluster evaluation methods help to determine the clustering tendency, the quality of the clustering, and the optimal number of clusters. In this blog, we will dive into cluster evaluation methods, learn how to interpret them, and see how to select the appropriate clustering method for your use case. We will start with the fundamentals of clustering and the evaluation methods that are used to assess the quality of clusters, including popular techniques such as the Silhouette score, the Davies-Bouldin index, and the Derivative method. Using toy example data sets, we will investigate the strengths and limitations of each evaluation method, with practical insights on how to interpret their results. For all analyses, the <a href="https://erdogant.github.io/clusteval/" rel="noopener ugc nofollow" target="_blank">clusteval library</a> is used.

<h1>Unsupervised Clustering.</h1>

With unsupervised clustering, we aim to determine “natural” or “data-driven” groups in the data without using a priori knowledge about labels or categories. The challenge is that different unsupervised clustering methods produce different partitionings of the samples, and thus different groupings, because each method implicitly imposes a structure on the data. Thus the question arises: what is a “good” clustering? Figure 1A depicts a set of samples in a 2-dimensional space. How many clusters do you see? I would state that there are two clusters, without using any label information. Why? Because the distances between the dots within each group are small, and the “gap” between the two groups of dots is relatively large.
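The intuition of small distances within a group and a larger gap between groups is what the Silhouette score puts into a number. The block below is a minimal from-scratch sketch of that idea, not the clusteval implementation: the helpers `make_blobs` and `silhouette_score` are illustrative local functions, and the blob centers, spread, and sample counts are assumptions chosen only to mimic a two-group toy set like the one described above.

```python
import math
import random


def make_blobs(centers, n_per_blob=30, spread=0.5, seed=0):
    """Generate 2-D toy points scattered around each center (assumed data)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_blob):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least 2 clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated groups of dots.
points, blob_labels = make_blobs(centers=[(0.0, 0.0), (6.0, 6.0)])

# A deliberately poor labelling: split by index parity, ignoring the geometry.
parity_labels = [i % 2 for i in range(len(points))]

good = silhouette_score(points, blob_labels)
poor = silhouette_score(points, parity_labels)
print(f"silhouette, grouping that follows the gap: {good:.3f}")
print(f"silhouette, arbitrary grouping:            {poor:.3f}")
```

A score close to 1 means that points sit much closer to their own group than to the nearest other group, while a score around 0 means the grouping does not reflect any gap in the data. The same samples can therefore score very differently depending on how they are partitioned, which is why an evaluation measure is needed to compare clusterings.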