This project demonstrates the use of the K-Modes clustering algorithm to segment customers of a Portuguese banking institution based on various categorical attributes. The data used in this demonstration is from the UCI Machine Learning Repository, specifically the Bank Marketing dataset.
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The dataset can be found here.
The dataset contains several attributes, but for this demonstration, we focus only on the categorical attributes:
- `age`: (numeric, converted to categorical)
- `job`: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- `marital`: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- `education`: (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- `default`: has credit in default? (categorical: 'no','yes','unknown')
- `housing`: has housing loan? (categorical: 'no','yes','unknown')
- `loan`: has personal loan? (categorical: 'no','yes','unknown')
- `contact`: contact communication type (categorical: 'cellular','telephone')
- `month`: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- `day_of_week`: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- `poutcome`: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
First, we load the dataset and explore its structure to understand the attributes and their data types.
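We check the dataset for null values and find none, so no imputation is needed. Note that missing information is encoded as the explicit category 'unknown' in several attributes rather than as nulls, so it takes part in the clustering like any other category.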
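The loading and null-check steps can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the helper name `load_and_inspect` is hypothetical, the file is assumed to be comma-separated, and a three-row in-memory sample stands in for `data/bankmarketing.csv` so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for data/bankmarketing.csv (assumed comma-separated).
SAMPLE_CSV = """age,job,marital,housing
56,housemaid,married,no
37,services,married,yes
40,admin.,single,unknown
"""


def load_and_inspect(source):
    """Read the CSV and return the frame together with per-column null counts."""
    df = pd.read_csv(source)
    null_counts = df.isnull().sum()
    return df, null_counts


# With the real file this would be: load_and_inspect("data/bankmarketing.csv")
bank, null_counts = load_and_inspect(io.StringIO(SAMPLE_CSV))

print(bank.dtypes)   # column names and inferred data types
print(null_counts)   # number of nulls per column
```

pandas does not treat the string 'unknown' as a null by default, so such values appear as an ordinary category in the counts above.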
The `age` attribute, which is numeric, is converted into categorical bins, because K-Modes operates on categorical data.
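One way to bin a numeric column is `pandas.cut`. The sketch below is illustrative only: the bin edges, the labels, and the `age_bin` column name are assumptions, and the notebook may use different ones.

```python
import pandas as pd

# Hypothetical ages standing in for the dataset's age column.
ages = pd.DataFrame({"age": [19, 25, 34, 47, 58, 71]})

# Assumed edges and labels. pd.cut uses right-closed intervals by default,
# so the first bin covers (0, 20], the second (20, 30], and so on.
AGE_EDGES = [0, 20, 30, 40, 50, 60, 120]
AGE_LABELS = ["0-20", "21-30", "31-40", "41-50", "51-60", "61+"]

ages["age_bin"] = pd.cut(ages["age"], bins=AGE_EDGES, labels=AGE_LABELS)
print(ages)
```

The result is a pandas categorical column, which can then be treated like the other categorical attributes.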
We use label encoding to convert each categorical variable into integer codes. K-Modes compares these codes only for equality, so their numeric order carries no meaning.
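A minimal sketch of per-column label encoding with scikit-learn's `LabelEncoder` is shown below. The sample frame and the `encoders` dictionary are illustrative assumptions; keeping one fitted encoder per column makes it possible to map codes back to the original labels when interpreting clusters.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two of the categorical attributes.
customers = pd.DataFrame({
    "job": ["admin.", "services", "admin.", "student"],
    "marital": ["married", "single", "married", "single"],
})

encoders = {}
encoded = customers.copy()
for column in customers.columns:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    encoded[column] = encoder.fit_transform(customers[column])
    encoders[column] = encoder  # kept so codes can be decoded later

print(encoded)
print(encoders["job"].inverse_transform(encoded["job"]))
```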
We use two different initialization methods for K-Modes clustering: "Cao" and "Huang". The K-Modes algorithm is applied to segment the customers into different clusters.
We use the cost function to choose the number of clusters: we plot the cost against the number of clusters and pick the "elbow", the point beyond which adding more clusters yields only a marginal reduction in cost.
We analyze the resulting clusters by visualizing the distribution of attributes within each cluster. This helps in understanding the characteristics of each cluster.
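Besides plots, a row-normalized cross-tabulation gives a compact numeric view of how an attribute is distributed within each cluster. The sketch below uses a hypothetical frame whose `cluster` column stands in for the K-Modes predictions.

```python
import pandas as pd

# Hypothetical clustered customers; in the project the cluster column
# would hold the labels predicted by K-Modes.
clustered = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "job": ["admin.", "admin.", "services", "student", "student", "admin."],
})

# normalize="index" turns counts into shares that sum to 1 within each cluster.
job_share = pd.crosstab(clustered["cluster"], clustered["job"], normalize="index")
print(job_share.round(2))
```

Each row of `job_share` then reads as the job mix of one cluster, which makes dominant categories easy to spot.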
The clusters are visualized using count plots for various attributes, segmented by the predicted clusters. This helps in identifying the unique characteristics of each cluster, which can be useful for targeted marketing strategies.
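The count plots described above could be produced along these lines with seaborn. The sample frame, the selected attributes, and the figure size are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical clustered customers standing in for the real predictions.
clustered = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "job": ["admin.", "admin.", "services", "student", "student", "admin."],
    "marital": ["married", "married", "single", "single", "single", "married"],
})
# String labels so seaborn treats the clusters as discrete groups.
clustered["cluster"] = clustered["cluster"].astype(str)

attributes = ["job", "marital"]
fig, axes = plt.subplots(1, len(attributes), figsize=(10, 4))
for ax, attribute in zip(axes, attributes):
    # One bar per category value, split by predicted cluster.
    sns.countplot(data=clustered, x=attribute, hue="cluster", ax=ax)
    ax.set_title(f"{attribute} by cluster")
fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```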
The K-Modes clustering algorithm segments the customers based on their categorical attributes. The bank can use this segmentation to tailor its marketing campaigns to different customer segments, potentially increasing the effectiveness of its marketing efforts.
- Clone the repository.
- Place the `bankmarketing.csv` file in the `data/` directory.
- Open the Jupyter notebook `bank_customer_clustering.ipynb` and run the cells to perform clustering and visualize the results.
- Alternatively, run the `k_modes_clustering.py` script to execute the clustering process.
- Python 3.x
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- kmodes