K-Means Clustering Analysis

AI Code for Business


Sometimes a single holistic insight from a dataset is not enough, particularly when the data is large and contains a mix of similar and dissimilar observations. One reliable way to address this is to group observations that share common characteristics, and that is where clustering analysis comes in. In this project, we will walk through a simple k-means cluster analysis and show how it can drive business decisions. But first, let’s take a brief detour to understand what clustering analysis actually means.

What is Clustering Analysis?

As the name implies, clustering analysis is the grouping of homogeneous data points that occur close together within a larger heterogeneous dataset, based on specific characteristics or features. In other words, it sorts different data points into similar categories. This way, useful patterns can be identified in vast, unstructured datasets and organized into structured form. Aside from grouping data into sensible chunks, clustering is also an exploratory data analysis tool for uncovering underlying relationships within a dataset. It is an unsupervised machine learning technique, meaning it requires no labeled data before it can be used.
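To make this concrete, here is a minimal sketch of that "no labels required" idea, using scikit-learn and synthetic data of our own (nothing here comes from this article's dataset):

import numpy as np
from sklearn.cluster import KMeans

# Two synthetic "blobs" of 2-D points -- no labels are provided.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),  # group near the origin
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),  # group near (5, 5)
])

# K-means assigns each point to one of two clusters on its own.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # roughly twenty 0s then twenty 1s (cluster ids may swap)

The algorithm recovers the two groups purely from the geometry of the points, which is exactly what makes it useful on unlabeled business data.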

Benefits of Clustering Analysis for Small Businesses

There are several ways small and medium-sized businesses can use clustering algorithms to drive growth. The customer segmentation walkthrough below is one concrete example: grouping customers by annual income and spending behavior reveals which segments deserve focused marketing attention and where new opportunities may lie.

Challenges of K-Means Clustering Analysis

Challenges in clustering analysis usually come from the type of data available and the business problem to be solved. For instance, the k-means algorithm works only with numerical data. If the features to be clustered are categorical, a different algorithm entirely (such as k-modes) has to be used, and when the data is a mix of categorical and numerical variables, the algorithms become more complex and require extra handling and care.
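As an illustration of the categorical case, here is a hedged sketch using the third-party kmodes package, which is not used elsewhere in this article (assume it is installed with pip install kmodes):

from kmodes.kmodes import KModes  # third-party package: pip install kmodes

# Purely categorical features -- k-means' distance math does not apply here.
customers = [
    ['female', 'urban', 'credit'],
    ['male', 'rural', 'cash'],
    ['female', 'urban', 'credit'],
    ['male', 'urban', 'cash'],
]

# k-modes replaces cluster means with modes and Euclidean distance with a
# simple matching dissimilarity, so it handles categories directly.
km = KModes(n_clusters=2, init='Huang', n_init=5, random_state=42)
print(km.fit_predict(customers))

The same package also provides KPrototypes for datasets that mix numerical and categorical features.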

Another known challenge is that k-means clustering is less effective when the clusters are of varying sizes and density. In such cases, generalizing k-means, for example with a Gaussian mixture model, is usually a better approach. Outliers can also drag a cluster's centroid and distort the grouping or, alternatively, force the model to create a separate cluster that may not be meaningful.
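As a sketch of that generalization (our own illustration, not part of this article's walkthrough), scikit-learn's GaussianMixture can be dropped in almost anywhere KMeans is used:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# A tight cluster and a wide, sparse one -- the kind of shape k-means mishandles.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=2.0, size=(50, 2)),
])

# Each mixture component learns its own covariance, so clusters of different
# sizes and densities are modeled explicitly rather than assumed equal.
gmm = GaussianMixture(n_components=2, random_state=42)
labels = gmm.fit_predict(X)
print(np.bincount(labels))  # points per cluster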

Implementation of K-means Clustering Analysis in Python

The dataset for this tutorial (downloaded in Cell 2 below) contains basic information on 200 mall customers: age, gender, annual income, and spending score. We are going to cluster customers by annual income and spending score and compare spending behavior across the resulting clusters. Since our features of interest are numerical, we can employ k-means, a non-hierarchical clustering method.


Follow the steps below and run the code in the Colab notebook linked here. (To run the code, click the round ▶️ next to each cell.)

Cell 1: Imports the Python libraries needed.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

Cell 2: Downloads a copy of the dataset. (The URL is quoted so the shell does not try to interpret the ? in it.)

!wget "https://docs.google.com/spreadsheets/d/e/2PACX-1vQL5Lfw_iiwJAvCkhWbFr7jfSJmU6CavaLtiAV6rD80NQaZmpZJQPjlNGFx_hIBzYsN2fweJ8euOZJx/pub?output=csv" -O Mall_Customers.csv

Cell 3: Reads the dataset into a pandas DataFrame and extracts the two features we will cluster on.

dataset = pd.read_csv('Mall_Customers.csv')
# Columns 3 and 4 hold annual income and spending score.
X = dataset.iloc[:, [3, 4]].values

Cell 4: Uses the elbow method to find the optimal number of clusters and plots the results. WCSS (within-cluster sum of squares) measures how tightly each cluster is packed; the "elbow" where the curve flattens suggests a good number of clusters.

wcss = []
for i in range(1, 11):
    # Fit k-means for each candidate number of clusters.
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this fit

plt.figure(figsize=(12,8))
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method', weight = "bold")
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

Cell 5: Performs the k-means analysis using the optimal number of clusters derived above (five, where the elbow appears).

kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)  # cluster label (0-4) for each customer
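As an optional sanity check (not part of the original notebook), the silhouette score gives a second opinion on the choice of k; values closer to 1 indicate better-separated clusters, and you can compare the score across candidate values of k:

from sklearn.metrics import silhouette_score

# Average silhouette across all points for the k = 5 fit above.
print(silhouette_score(X, y_kmeans))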

Cell 6: Visualizes the clusters around their respective centroids with a scatterplot, using a different color for each cluster.

plt.figure(figsize=(12,8))
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 80, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 80, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 80, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 80, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 80, c = 'magenta', label = 'Cluster 5')
# Mark each cluster's centroid with a larger black point.
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 200, c = 'black', label = 'Centroids')
plt.title('Clusters of Customers', weight = "bold")
plt.xlabel('Annual Income ($1000)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()


Conclusion

A quick analysis reveals two clusters with high spending scores: one with lower and one with higher overall annual incomes. This finding points to potential areas of opportunity and a direction on where to dig deeper. While this is a simple case of clustering, the approach can be adapted to different use cases: create clusters from a large dataset, then use the result as guidance to define the next step.
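One way to dig deeper is to profile each cluster against the original columns. A minimal sketch follows; the column names are assumed from the commonly published layout of Mall_Customers.csv and should be checked against your copy:

# Attach the cluster labels from Cell 5 to the original DataFrame.
dataset['Cluster'] = y_kmeans

# Average age, income, and spending score per cluster.
# NOTE: column names assumed from the usual Mall_Customers.csv layout.
print(dataset.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean())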

Check out the other articles to see more applications and related code on machine learning. If you need support or would like to find out more, get in touch via the contact link.