Data Science Models Prediction on a Banking Dataset

AI Code for Business


There are many freely available data science libraries for the Python programming language, with the premier one being the scikit-learn package. In this article, we will take a banking dataset made available on Kaggle, prepare the data, feed it into a number of machine learning algorithms available in scikit-learn, and create prediction models.

The resulting models can then be used to predict whether a client is likely to sign up for new services.

How can Data Science Models Help Businesses?

Sales and marketing campaigns are still the go-to approach for how businesses reach their customers today. Each campaign and each customer contact requires an investment of time and money with no assurance of a return. Say, for example, a campaign involves sending out 10,000 postcards to households in the target market, and on average the business gets a return of 10 customer contacts from these mailers. What if I told you that it is possible to substantially improve that return by first analyzing and modeling your customer data with machine learning, and then sending out targeted mailers?
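As a hypothetical illustration (the customer IDs and response scores below are made up for the sketch, not taken from the banking dataset), a trained model's predicted probabilities can be used to rank customers so that only the most promising ones receive a mailer:

# Hypothetical illustration: rank customers by a model's predicted
# probability of responding and mail only the top slice.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    'customer_id': range(10_000),
    # stand-in for model.predict_proba(X)[:, 1] on real customer data
    'p_response': rng.beta(1, 200, size=10_000),
})

# Mail the 1,000 customers the model scores highest, instead of all 10,000.
top_1000 = customers.nlargest(1_000, 'p_response')
print(top_1000.head())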

Data Science Models Prediction on a Banking Dataset

For this exercise, we will use this banking dataset from Kaggle. Refer to the linked page for the context, content, and description of the data. We will first perform a quick table visualization of the data, prepare the training and testing datasets, and then derive data science models to address the goal of predicting whether an existing customer is receptive to signing up for an additional banking service.


Follow the steps below and run the code in the Colab notebook linked here. (To run the code, click on the round ▶️ next to each cell.)

Cell 1: Import the Python libraries needed.

import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

Cell 2: Grab a copy of the dataset. Load the train and test data into their respective variables.

!gdown --id 1MPMYq6Jo13CZhZ0YknPUjnSM1EP8-SCn

!unzip banking_dataset_version2.zip


train_data = pd.read_csv('train.csv', sep=';')

test_data = pd.read_csv('test.csv', sep=';')

Cell 3: Visualize the train data, which consists of 42511 customers' information across 17 fields.

train_data

Cell 4: Visualize the test data, which consists of 4251 customers' information across 17 fields.

test_data
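As an optional check that is not part of the original notebook, the class balance of the target column y can be inspected before modeling; in this dataset the responses are heavily skewed toward 'no', which is useful context when judging accuracy later.

# Optional check: distribution of the target column in the train and test data.
print(train_data['y'].value_counts())
print(test_data['y'].value_counts())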

Cell 5: Provide a count of the customers that have education listed as 'unknown' in the train and test data.

print(train_data.education[train_data['education'] == 'unknown'].count())

print(test_data.education[test_data['education'] == 'unknown'].count())

Cell 6: Proceed with a minor cleanup of the datasets by removing customers in both train and test data that have 'unknown' listed as the value for education.

train_data = train_data[train_data['education'] != 'unknown']

test_data = test_data[test_data['education'] != 'unknown']

Cell 7: Define a function which converts the default, housing, loan, and y columns into categorical values of 0 and 1, corresponding to no and yes. It then converts the primary, secondary, and tertiary values under education, and the single, married, and divorced values under marital, to categorical values 0, 1, and 2. Proceed with the categorical conversion of the train and test data using the defined function.

def data_categorical(data):

    data.y = pd.Categorical(data.y).codes

    data.default = pd.Categorical(data.default).codes

    data.housing = pd.Categorical(data.housing).codes

    data.loan = pd.Categorical(data.loan).codes


    data.education.replace(['primary', 'secondary', 'tertiary'], [0, 1, 2], inplace=True)

    data.marital.replace(['single', 'married', 'divorced'], [0, 1, 2], inplace=True)


data_categorical(train_data)

data_categorical(test_data)    
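As an optional sanity check (not in the original notebook), a quick look at the converted columns confirms that they now hold numeric codes:

# Optional sanity check: the converted columns should now be numeric codes.
print(train_data[['marital', 'education', 'default', 'housing', 'loan', 'y']].head())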

Cell 8: Ready the train and test variables by assigning the corresponding independent (X) and dependent (y) columns.

X_train = train_data[['age', 'marital', 'education', 'default', 'balance', 'housing', 'loan']]

y_train = train_data['y']

X_test = test_data[['age', 'marital', 'education', 'default', 'balance', 'housing', 'loan']]

y_test = test_data['y']

Cell 9: Define the first model, build and fit it to the train data, and use it to predict on the test data. Print out the accuracy of the logistic regression algorithm on this dataset.

logmodel = LogisticRegression()

logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)


print(metrics.accuracy_score(y_test, predictions))

Cell 10: Define the second model, build and fit it to the train data, and use it to predict on the test data. Print out the accuracy of the k-nearest neighbors algorithm on this dataset.

knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train, y_train)

predictions = knn.predict(X_test)


print(metrics.accuracy_score(y_test, predictions))
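One point worth noting: k-nearest neighbors is a distance-based algorithm, so features on very different scales (such as balance versus the 0/1 indicator columns) can dominate the distance calculation. Below is a sketch, not part of the original notebook, of how the same model could be wrapped with standard scaling using scikit-learn's Pipeline; the resulting accuracy may differ from the unscaled version above.

# Sketch: scale the features before k-nearest neighbors using a pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_scaled.fit(X_train, y_train)
predictions = knn_scaled.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))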

Cell 11: Define the third model, build and fit it to the train data, and use it to predict on the test data. Print out the accuracy of the decision tree algorithm on this dataset.

DT = DecisionTreeClassifier()

DT.fit(X_train, y_train)

predictions = DT.predict(X_test)


print(metrics.accuracy_score(y_test, predictions))

Cell 12: Define the fourth and last model, build and fit it to the train data, and use it to predict on the test data. Print out the accuracy of the random forest algorithm on this dataset.

rfc = RandomForestClassifier(n_estimators=600)

rfc.fit(X_train, y_train)

predictions = rfc.predict(X_test)


print(metrics.accuracy_score(y_test, predictions))
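To connect this back to the targeted-mailer idea from the start of the article, a fitted model such as the random forest can also output a probability for each customer rather than just a yes/no label. The snippet below is a sketch of how those probabilities could be used to pick out, say, the 100 test customers most likely to respond (the choice of 100 is arbitrary here):

# Sketch: rank test customers by predicted probability of signing up (class 1)
# and keep the 100 most promising ones for a targeted campaign.
probabilities = rfc.predict_proba(X_test)[:, 1]
ranked = X_test.copy()
ranked['p_signup'] = probabilities
top_customers = ranked.sort_values('p_signup', ascending=False).head(100)
print(top_customers.head())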


The random forest model slightly edged out the decision tree algorithm in yielding the highest accuracy. Out of every 100 customers in the test data, the model was able to accurately predict whether 98 of them would be amenable if approached for a new banking service.
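Because far more customers in this dataset say no than yes, plain accuracy can hide how well a model identifies the customers who would actually sign up. As an optional follow-up, scikit-learn's confusion matrix and classification report break the random forest predictions down per class:

# Optional follow-up: per-class breakdown of the random forest predictions.
print(metrics.confusion_matrix(y_test, predictions))
print(metrics.classification_report(y_test, predictions))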


Conclusion

Extracting hidden value and new insights from data is what machine learning can do for businesses today. Sales and marketing campaigns can achieve higher precision and effectiveness through the analysis and modeling of customer data, leading to higher rates of return on investment and better customer engagement.

Check out the other articles to see more applications and related code on machine learning. If you need support and would like to find out more, get in touch via the contact link.