Clustering Analysis
AI Code for Business
Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In short, it groups objects on the basis of their similarity and dissimilarity.
Why is Clustering in Artificial Intelligence Important?
The primary use of clustering in machine learning is to extract useful inferences from large, unstructured data sets. When you are working with large amounts of unstructured data, organizing that data is the logical first step toward making sense of it, and clustering helps us do that. Clustering and classification allow you to take a sweeping glance at your data and then form logical structures based on what you find there. This indicates where to dig deeper for the nuts-and-bolts analysis. Clustering is a significant component of data science, and it plays an important role in building better machine learning techniques.
Benefits for Business
Marketing
In marketing, clustering can be used to identify distinct customer groups from existing customer data. Customers in the same group can then be offered similar discounts, offers, promo codes, and so on.
Real Estate
Clustering can be used to understand and divide property locations based on value and importance. Clustering algorithms can process the data and identify groups of properties on the basis of probable price.
Library and Bookstore Management
Libraries and bookstores can use clustering to better manage their book databases. This facilitates book ordering and general operations.
Document Analysis
Often, there is a need to group research texts and documents according to similarity. Where labels are missing, or manually labelling large amounts of data is not feasible, clustering algorithms can process the text and group it into different themes and categories.
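As a rough illustration of the document-grouping idea, the sketch below turns four made-up sentences into TF-IDF vectors and clusters them with k-means. The sentences, the choice of TF-IDF as the text representation, and all variable names are assumptions made for this example only; they are not part of the mall dataset used later.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up documents: two about football, two about the stock market.
docs = [
    "the team scored a goal in the football match",
    "football team wins the match with a late goal",
    "stock market shares fell as prices dropped",
    "investors buy shares as the stock market rises",
]

# Convert the text to TF-IDF vectors, ignoring common English stop words.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Group the vectors into two clusters; no labels are supplied.
text_model = KMeans(n_clusters=2, n_init=10, random_state=0)
doc_labels = text_model.fit_predict(vectors)

for label, doc in zip(doc_labels, docs):
    print(label, doc)
```

The cluster numbers themselves carry no meaning; what matters is that documents sharing vocabulary end up with the same number.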
Example
In this example, we take a mall's customer data and perform visualization and cluster analysis. This yields a better understanding and categorization of the customers, and gives sales and marketing a direction for planning their strategy.
Follow the steps below and run the code on the Colab notebook linked here. (To run the code, click on the round ▶️ button next to each cell.)
Cell 1: Imports the Python libraries needed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
Cell 2: Grab a copy of the dataset and load this into a pandas dataframe. Display the top and bottom 5 rows of the dataframe.
!wget "https://docs.google.com/spreadsheets/d/e/2PACX-1vQL5Lfw_iiwJAvCkhWbFr7jfSJmU6CavaLtiAV6rD80NQaZmpZJQPjlNGFx_hIBzYsN2fweJ8euOZJx/pub?output=csv" -O Mall_Customers.csv
df = pd.read_csv("Mall_Customers.csv")
df
Cell 3: Computes summary statistics for the values in each numeric column and displays them in a table.
df.describe()
Cell 4: A copy of the dataframe is made, and the CustomerID column is dropped from the copy because it carries no information for the analysis. The original data stays undisturbed and is kept as a backup.
df_copy = df.copy(deep=True)
df_copy.drop('CustomerID', axis=1, inplace=True)
Cell 5: Using the seaborn visualization library, display a count bar plot of the gender in the data.
sns.countplot(x='Gender', data=df_copy)
plt.xlabel('Gender')
plt.ylabel('Count')
Observation and Analysis: There are ~10% more female than male customers.
Cell 6: Using the matplotlib visualization library, display a histogram of the age in the data. We use 10 bins (buckets). Feel free to change this number to visualize with different bin sizes.
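To put a number on an observation like this rather than reading it off the bar chart, the share of each gender can be computed directly. The sketch below is a minimal example; the helper name gender_share and the five-row frame are made up for illustration, and in the notebook you would pass df_copy instead.

```python
import pandas as pd

def gender_share(frame: pd.DataFrame, column: str = "Gender") -> pd.Series:
    """Return the percentage of rows for each distinct value in `column`."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)

# Tiny made-up frame, only to show the call; it is not the mall data.
toy = pd.DataFrame({"Gender": ["Female", "Female", "Male", "Female", "Male"]})
shares = gender_share(toy)
print(shares)
```

Reporting the shares side by side avoids the ambiguity between "10 percentage points more" and "10% more in relative terms".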
plt.hist(x=df_copy['Age'], bins=10, orientation='vertical', color='red')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
Observation and Analysis: The top 3 customer age groups are 15-22 years, 30-40 years and 45-50 years.
Cell 7: Using seaborn, draw a scatter plot of age against spending score, colored by gender, and a joint plot of the same two variables.
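The age groups can also be tabulated instead of read from the histogram. The following sketch counts customers per fixed-width age band with pd.cut; the helper name age_band_counts, the 10-year band width, and the sample ages are assumptions for illustration. Note that these fixed decade bands are not the same as the 10 equal-width bins matplotlib derives from the data range.

```python
import pandas as pd

def age_band_counts(ages: pd.Series, width: int = 10) -> pd.Series:
    """Count values per fixed-width band, e.g. [20, 30), [30, 40)."""
    start = int(ages.min() // width * width)
    stop = int(ages.max() // width * width + width)
    edges = list(range(start, stop + 1, width))
    # right=False makes each band include its lower edge and exclude its upper edge.
    bands = pd.cut(ages, bins=edges, right=False)
    return bands.value_counts().sort_index()

# Made-up ages, only to show the call; in the notebook pass df_copy['Age'].
toy_ages = pd.Series([19, 21, 23, 31, 35, 36, 38, 47, 49, 64])
counts = age_band_counts(toy_ages)
print(counts)
print("Largest band:", counts.idxmax())
```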
sns.scatterplot(data=df_copy, x='Age', y='Spending Score (1-100)', hue='Gender')
sns.jointplot(data=df_copy, x='Age', y='Spending Score (1-100)')
Observation and Analysis: Per the joint plot, customers whose spending score is above 65 are aged roughly 15-42 years. The scatter plot shows that this high-spending group contains more females than males. Customers with an average spending score span the full age range of 15-75 years, with a roughly even female-to-male ratio.
Cell 8: Using seaborn, draw a scatter plot of annual income against spending score, colored by gender, and a joint plot of the same two variables.
sns.scatterplot(data=df_copy, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender')
sns.jointplot(data=df_copy, x='Annual Income (k$)', y='Spending Score (1-100)')
Observation and Analysis: Five clusters are seen and can be categorized as high income high spending (top right cluster), high income low spending (bottom right cluster), average income average spending (center cluster), low income high spending (top left cluster), and low income low spending (bottom left cluster).
Cell 9: We extract the annual income and spending score data and prepare this to be used for k-means clustering analysis. The first step is to find the optimal number of clusters for this dataset using the elbow method.
X = df_copy.iloc[:, [2,3]]
X.columns
wcss = []
for i in range(1,11):
    kmeans_model = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans_model.fit(X)
    wcss.append(kmeans_model.inertia_)
plt.plot(range(1,11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
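Observation and Analysis: The elbow of the curve suggests that five clusters is a good choice for this dataset.
Cell 10: We standardize the two features, construct the k-means model, fit it and predict a cluster for each customer, and visualize the results. Because the features are standardized first, the plot axes are in standardized units rather than k$ and score points.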
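The WCSS (within-cluster sum of squares) plotted in the elbow chart is the value scikit-learn exposes as inertia_: the sum of squared distances from each sample to its closest cluster center. The sketch below recomputes it by hand on made-up, well-separated points to show what the number measures; the helper name manual_wcss and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def manual_wcss(points: np.ndarray, labels: np.ndarray, centers: np.ndarray) -> float:
    """Sum of squared distances from each point to its assigned cluster center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Made-up 2-D points forming two obvious groups (not the mall data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(20, 2)),
])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(points)
manual = manual_wcss(points, model.labels_, model.cluster_centers_)
print(manual, model.inertia_)
```

Adding clusters generally lowers this value, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.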
scaler = StandardScaler()
X = scaler.fit_transform(X)
kmeans_model = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans_model.fit_predict(X)
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 30, c = 'yellow', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 30, c = 'cyan', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 30, c = 'lightgreen', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 30, c = 'orange', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 30, c = 'red', label = 'Cluster 5')
plt.scatter(x=kmeans_model.cluster_centers_[:, 0], y=kmeans_model.cluster_centers_[:, 1], s=100, c='black', marker='+', label='Cluster Centers')
plt.legend()
plt.title('Clusters of customers')
plt.xlabel('Annual Income (standardized)')
plt.ylabel('Spending Score (standardized)')
plt.show()
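Because Cell 10 standardizes the features before fitting, the cluster centers are expressed in standardized units. One way to read them in the original units is to pass them through the scaler's inverse_transform. The sketch below shows this on six made-up income and spending pairs; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (annual income in k$, spending score) pairs; not the mall data.
raw = np.array(
    [[15, 80], [18, 75], [20, 85], [85, 15], [90, 20], [88, 10]], dtype=float
)

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)

# Map the centers from standardized units back to k$ and score points.
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(centers_original)
```

In the notebook the same call would be scaler.inverse_transform(kmeans_model.cluster_centers_), using the scaler fitted in Cell 10.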
Observation and Recommendations:
a) High Income, High Spending Score (Cluster 5) - Target these customers with new product alerts. As they are loyal customers, this should increase the revenue collected by the mall.
b) High Income, Low Spending Score (Cluster 3) - Target these customers by asking for feedback and increasing advertising, with the aim of converting them into Cluster 5 customers.
c) Average Income, Average Spending Score (Cluster 2) - Target this set of customers by offering lower-cost financing or other attractive buying options.
d) Low Income, High Spending Score (Cluster 1) - Target this group or not, depending on the policy of the mall.
e) Low Income, Low Spending Score (Cluster 4) - Do not target.
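The cluster numbers that k-means assigns are arbitrary identifiers, so it is worth confirming which number corresponds to which income and spending profile before acting on recommendations like these. A minimal sketch, assuming a made-up six-row frame in place of the mall data, attaches the labels to the frame and summarizes each cluster:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up customers; in the notebook use the income and spending columns of df_copy.
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 18, 20, 85, 90, 88],
    "Spending Score (1-100)": [80, 75, 85, 15, 20, 10],
})

features = StandardScaler().fit_transform(toy)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(features)

# Store 1-based cluster numbers so they match a "Cluster 1", "Cluster 2" legend.
toy["Cluster"] = labels + 1

# Per-cluster size and average profile, in the original units.
summary = toy.groupby("Cluster").agg(
    customers=("Annual Income (k$)", "size"),
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
)
print(summary)
```

Reading the mean income and mean spending columns makes it clear which cluster is the "high income, low spending" group, whatever number it happens to receive.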
Conclusion
Finding and extracting actionable insights from data is what machine learning does best. By employing the large and growing set of data science modeling techniques, businesses can reduce sales and marketing expenditure through targeted approaches. Clustering analysis is but one of the many algorithms readily available today. Considering your own business, how could clustering your data help?
Check out the other articles to see more applications and related code on machine learning. If you need support or would like to find out more, get in touch via the contact link.