Market Segmentation with Clustering - Lab

Introduction

In this lab, you'll use your knowledge of clustering to perform market segmentation on a real-world dataset!

Objectives

In this lab you will:

  • Use clustering to create and interpret market segmentation on real-world data

Getting Started

In this lab, you're going to work with the Wholesale customers dataset from the UCI Machine Learning datasets repository. This dataset contains data on wholesale purchasing information from real businesses. These businesses range from small cafes and hotels to grocery stores and other retailers.

Here's the data dictionary for this dataset:

FRESH: Annual spending on fresh products, such as fruits and vegetables
MILK: Annual spending on milk and dairy products
GROCERY: Annual spending on grocery products
FROZEN: Annual spending on frozen products
DETERGENTS_PAPER: Annual spending on detergents, cleaning supplies, and paper products
DELICATESSEN: Annual spending on meats and delicatessen products
CHANNEL: Type of customer. 1 = Hotel/Restaurant/Cafe, 2 = Retailer. (This is what we'll use clustering to predict)
REGION: Region of Portugal that the customer is located in. (This column will be dropped)

One benefit of working with this dataset for segmentation practice is that we have the ground-truth labels for the market segment each customer belongs to. For this reason, we'll borrow some methodology from supervised learning and store these labels separately, so that we can use them afterward to check how well our clustering segmentation actually performed.

Let's get started by importing everything we'll need.

In the cell below:

  • Import pandas, numpy, and matplotlib.pyplot, and set the standard alias for each.
  • Use numpy to set a random seed of 0.
  • Set all matplotlib visualizations to appear inline.
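One way to fill in this cell, as a sketch (the %matplotlib inline magic assumes you're working in a Jupyter notebook):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# set the random seed for reproducibility
np.random.seed(0)
%matplotlib inline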

Now, let's load our data and inspect it. You'll find the data stored in 'wholesale_customers_data.csv'.

In the cell below, load the data into a DataFrame and then display the first five rows to ensure everything loaded correctly.

raw_df = None
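A minimal sketch, assuming 'wholesale_customers_data.csv' sits in the same directory as the notebook:

raw_df = pd.read_csv('wholesale_customers_data.csv')
raw_df.head()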

Now, let's go ahead and store the 'Channel' column in a separate variable and then drop both the 'Channel' and 'Region' columns. Then, display the first five rows of the new DataFrame to ensure everything worked correctly.

channels = None
df = None
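One possible way to fill in this cell (it assumes the columns are named 'Channel' and 'Region', as in the data dictionary above):

channels = raw_df['Channel']
df = raw_df.drop(columns=['Channel', 'Region'])
df.head()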

Now, let's get right down to it and begin our clustering analysis.

In the cell below:

  • Import KMeans from sklearn.cluster, and then create an instance of it. Set the number of clusters to 2
  • Fit it to the data (df)
  • Get the predictions from the clustering algorithm and store them in cluster_preds
k_means = None

cluster_preds = None
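A sketch of one possible solution:

from sklearn.cluster import KMeans

# fit k-means with 2 clusters on the unscaled data, then predict a cluster for each customer
k_means = KMeans(n_clusters=2)
k_means.fit(df)
cluster_preds = k_means.predict(df)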

Now, use some metrics to check the performance of the clustering. You'll use calinski_harabasz_score() and adjusted_rand_score(), both of which can be found in sklearn.metrics.

In the cell below, import these scoring functions.

Now, start with the CH score to get the variance ratio (the ratio of between-cluster dispersion to within-cluster dispersion).
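For example (this assumes scikit-learn 0.20 or newer, where the function is spelled calinski_harabasz_score):

from sklearn.metrics import calinski_harabasz_score, adjusted_rand_score

# variance ratio: between-cluster dispersion divided by within-cluster dispersion
calinski_harabasz_score(df, cluster_preds)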

Although you don't have any other numbers to compare this to, this is a pretty low score, suggesting that the clusters aren't great.

Since you actually have ground-truth labels, in this case you can use adjusted_rand_score() to check how well the clustering performed. The adjusted Rand score compares two clusterings, and the ground-truth channel labels can themselves be treated as a clustering. This will tell us how similar the predicted clusters are to the actual channels.

Adjusted Rand score is bounded between -1 and 1. A score close to 1 shows that the clusters are almost identical. A score close to 0 means that predictions are essentially random, while a score close to -1 means that the predictions are pathologically bad, since they are worse than random chance.

In the cell below, call adjusted_rand_score() and pass in channels and cluster_preds to see how well your first iteration of clustering performed.
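For example:

# compare the predicted clusters against the ground-truth channel labels
adjusted_rand_score(channels, cluster_preds)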

According to these results, the clusterings were essentially no better than random chance. Let's see if you can improve this.

Scaling our dataset

Recall that k-means clustering is heavily affected by scaling. Since the clustering algorithm is distance-based, this makes sense. Let's use StandardScaler to scale our dataset and then try our clustering again and see if the results are different.

In the cells below:

  • Import and instantiate StandardScaler and use it to transform the dataset
  • Instantiate and fit k-means to this scaled data, and then use it to predict clusters
  • Calculate the adjusted Rand score for these new predictions
scaler = None
scaled_df = None
scaled_k_means = None

scaled_preds = None
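A sketch of one possible approach:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)   # returns a NumPy array of z-scored features

scaled_k_means = KMeans(n_clusters=2)
scaled_preds = scaled_k_means.fit_predict(scaled_df)

adjusted_rand_score(channels, scaled_preds)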

That's a big improvement! Although it's not perfect, we can see that scaling our data had a significant effect on the quality of our clusters.

Incorporating PCA

Since clustering algorithms are distance-based, dimensionality has a definite effect on their performance. The greater the dimensionality of the dataset, the larger the volume of space the clusters can occupy, and the harder it becomes for a distance-based algorithm to separate them. Let's try using Principal Component Analysis to transform our data and see if this affects the performance of our clustering algorithm.

Since you've already seen PCA in a previous section, we will let you figure this out by yourself.

In the cells below:

  • Import PCA from the appropriate module in sklearn
  • Create a PCA instance and use it to transform our scaled data
  • Investigate the explained variance ratio for each Principal Component. Consider dropping certain components to reduce dimensionality if you feel it is worth the loss of information
  • Create a new KMeans object, fit it to our PCA-transformed data, and check the adjusted Rand score of the predictions it makes.

NOTE: Your overall goal here is to get the highest possible adjusted Rand score. Don't be afraid to change parameters and rerun things to see how it changes.
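A sketch of one possible approach (n_components=4 is just a starting point, not the known best value, and the variable names pca_df and pca_preds are purely illustrative):

from sklearn.decomposition import PCA

pca = PCA(n_components=4)              # try different numbers of components
pca_df = pca.fit_transform(scaled_df)  # apply PCA to the scaled data
print(pca.explained_variance_ratio_)   # how much variance each component retains

pca_k_means = KMeans(n_clusters=2)
pca_preds = pca_k_means.fit_predict(pca_df)
adjusted_rand_score(channels, pca_preds)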

Question: What was the highest adjusted Rand score you achieved? Interpret this score and determine the overall quality of the clustering. Did PCA affect the performance overall? How many principal components resulted in the best overall clustering performance? Why do you think this is?

Write your answer below this line:


Optional (Level up)

Hierarchical Agglomerative Clustering

Now that we've tried doing market segmentation with k-means clustering, let's end this lab by trying it with HAC!

In the cells below, use agglomerative clustering to make cluster predictions on the datasets we've created and see how HAC's performance compares to k-means' performance.

NOTE: Don't just try HAC on the PCA-transformed dataset -- compare its performance on the scaled and unscaled datasets as well!
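A sketch of one way to run the comparison (it reuses df, scaled_df, and pca_df from the sketches above, so those names are assumptions):

from sklearn.cluster import AgglomerativeClustering

# try HAC on each version of the data and score it against the ground-truth channels
for name, data in [('unscaled', df), ('scaled', scaled_df), ('PCA-transformed', pca_df)]:
    hac = AgglomerativeClustering(n_clusters=2)
    hac_preds = hac.fit_predict(data)
    print(name, adjusted_rand_score(channels, hac_preds))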

Summary

In this lab, you used your knowledge of clustering to perform market segmentation on a real-world dataset. You started with a clustering analysis that performed poorly, and then implemented changes to iteratively improve its performance!
