Clustering Algorithms

We study frequently used clustering algorithms in this exercise.

Objectives

To study:

Python openml
Data preparation
Clustering algorithms
Performance measures
Hyper-parameter tuning
joblib library
Data visualisation

Branch

Make sure you know your branch of this module.
We continue to improve the modules based on your feedback and submitted outputs. Therefore, we create a branch for you when we assign the module.
Go through the README.md of your branch.

ARE YOU READING THE RIGHT BRANCH?

If in any doubt, contact DataDisca.

This code is hosted in a private repository to regulate access. After completing the module, you can host your work in an open repository under the MIT license.

Please help us by reporting all types of errors.

License

This code is hosted in a private repository to regulate access. You can share your data and code under the MIT license.

Datasets:

Dataset	Instances	Classes	URL
iris	150	3	https://www.openml.org/d/61
wine	178	3	https://www.openml.org/d/187
glass	214	6	https://www.openml.org/d/41
haberman	306	2	https://www.openml.org/d/43
libras_move	360	15	https://www.openml.org/d/299
satellite_image	6435	6	https://www.openml.org/d/294
isolet	7797	26	https://www.openml.org/d/300
nursery	12960	5	https://www.openml.org/d/26
gas-drift-different-concentrations	13910	6	https://www.openml.org/d/1477
MagicTelescope	19020	2	https://www.openml.org/d/1120
letter	20000	26	https://www.openml.org/d/6
covertype	581012	7	https://www.openml.org/d/150

Instructions

Follow the steps given below.

Step 1

Study the following algorithms.

K-Means
Agglomerative
DBSCAN
OPTICS
Gaussian mixtures
Affinity propagation
Mean-shift
Spectral
Ward hierarchical
BIRCH
Self-organising maps

At the end of the exercise, you should be able to answer the following questions.

What are the important parameters in each algorithm?
How and why do those parameters affect the results of the respective algorithms?
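As a small illustration of how a single parameter can change the outcome, the sketch below runs scikit-learn's DBSCAN on hand-made toy points (not one of the datasets above) with three values of eps. The toy data, the helper count_clusters and the chosen eps values are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated groups of five points each (toy data).
group = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
X = np.vstack([group, group + 10.0])


def count_clusters(labels):
    """Number of clusters found; DBSCAN marks noise points with -1."""
    return len(set(labels.tolist()) - {-1})


n_clusters_by_eps = {}
for eps in (0.1, 1.5, 20.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clusters_by_eps[eps] = count_clusters(labels)

print(n_clusters_by_eps)
```

With a very small eps no point has enough neighbours, so every point is labelled as noise. With a moderate eps the two groups are recovered, and with a very large eps they merge into one cluster.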

Step 2

Follow the steps given below to write your code.

In your code, download a dataset using the Python openml package.
Prepare the data:
1. Identify the data types: boolean/categorical, ordinal and numeric in this case, although many other types exist.
2. Transform categorical variables to numeric as necessary.
3. Min-max normalise.
Write joblib code to walk through the parameter grid.
Record f1_score, adjusted_rand_score, silhouette_score and execution time for each combination of the parameters identified in Step 1.
Save the results to CSV files.
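The steps above could be sketched roughly as follows, using K-Means as the example algorithm. This is a minimal sketch under several assumptions, not a reference solution. The function names (load_openml_dataset, prepare, majority_vote_labels, evaluate), the toy data and the output file name are all made up for illustration. One-hot encoding is only one possible treatment of categorical columns (ordinal columns may be better served by an ordered integer encoding). Because cluster labels are arbitrary, f1_score needs some mapping from clusters to classes; the majority-vote mapping used here is an assumption, and other mappings are possible. load_openml_dataset is defined but not called, so the sketch runs without network access.

```python
import time

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, f1_score, silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import MinMaxScaler


def load_openml_dataset(dataset_id):
    """Download features and target from OpenML (needs network access)."""
    import openml  # imported lazily so the rest of the sketch runs offline

    dataset = openml.datasets.get_dataset(dataset_id)
    X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
    return X, y


def prepare(X):
    """One-hot encode non-numeric columns, then min-max scale to [0, 1]."""
    return MinMaxScaler().fit_transform(pd.get_dummies(X, dtype=float))


def majority_vote_labels(y_true, cluster_labels):
    """Assumption: relabel each cluster with its most frequent true class."""
    y_true = np.asarray(y_true)
    mapped = np.empty_like(y_true)
    for cluster in np.unique(cluster_labels):
        mask = cluster_labels == cluster
        classes, counts = np.unique(y_true[mask], return_counts=True)
        mapped[mask] = classes[np.argmax(counts)]
    return mapped


def evaluate(X, y, params):
    """Fit one parameter combination; return its scores and fit time."""
    start = time.perf_counter()
    labels = KMeans(**params).fit_predict(X)
    elapsed = time.perf_counter() - start
    n_found = len(np.unique(labels))
    # silhouette_score needs between 2 and n_samples - 1 distinct labels.
    silhouette = np.nan
    if 1 < n_found < len(X):
        silhouette = silhouette_score(X, labels)
    y_mapped = majority_vote_labels(y, labels)
    return {
        **params,
        "f1_score": f1_score(y, y_mapped, average="macro"),
        "adjusted_rand_score": adjusted_rand_score(y, labels),
        "silhouette_score": silhouette,
        "execution_time_s": elapsed,
    }


# Toy stand-in for an OpenML dataset so the sketch runs offline.
X_raw = pd.DataFrame({
    "length": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "colour": ["red", "red", "red", "blue", "blue", "blue"],
})
y = np.array([0, 0, 0, 1, 1, 1])
X = prepare(X_raw)

grid = ParameterGrid(
    {"n_clusters": [2, 3], "n_init": [10], "random_state": [0]}
)
# Threads keep the sketch easy to run anywhere; joblib's default
# process-based backend may suit heavy CPU-bound fits better.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(evaluate)(X, y, params) for params in grid
)
pd.DataFrame(results).to_csv("kmeans_toy_results.csv", index=False)
```

For a real run, the toy frame would be replaced by something like X_raw, y = load_openml_dataset(61) for iris, and KMeans by the estimator and parameter grid under study.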

Step 3

Execute your code over all the algorithms and all the datasets.
Save your results to CSV files.
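One way to organise the full run is to enumerate every dataset/algorithm pair up front, as sketched below. The OpenML IDs are read off the URLs in the Datasets table. The short algorithm names, the result_path helper and the results/ file naming scheme are assumptions made for this example.

```python
from itertools import product

# OpenML dataset IDs, read off the URLs in the Datasets table.
DATASETS = {
    "iris": 61,
    "wine": 187,
    "glass": 41,
    "haberman": 43,
    "libras_move": 299,
    "satellite_image": 294,
    "isolet": 300,
    "nursery": 26,
    "gas-drift-different-concentrations": 1477,
    "MagicTelescope": 1120,
    "letter": 6,
    "covertype": 150,
}

# Short names for the algorithms listed in Step 1 (the names are arbitrary).
ALGORITHMS = [
    "kmeans", "agglomerative", "dbscan", "optics", "gaussian_mixture",
    "affinity_propagation", "mean_shift", "spectral", "ward", "birch", "som",
]


def result_path(dataset, algorithm):
    """Hypothetical naming scheme: one CSV per dataset/algorithm pair."""
    return f"results/{dataset}__{algorithm}.csv"


jobs = [
    (dataset, DATASETS[dataset], algorithm, result_path(dataset, algorithm))
    for dataset, algorithm in product(DATASETS, ALGORITHMS)
]
print(len(jobs), jobs[0])
```

Keeping one CSV per pair makes it easy to rerun a single failed job. Note that algorithms which build a full pairwise matrix (for example affinity propagation, spectral and agglomerative clustering) may not be feasible on the largest datasets such as covertype without subsampling.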

Step 4

Create Tableau Dashboards or Plotly visualisations to analyse your results.
With visualisations:
1. For each dataset, compare and contrast the results produced by each algorithm under its optimal parameter settings.
2. For each algorithm, show how f1_score, adjusted_rand_score and silhouette_score vary with the parameter values.
3. Discuss the execution times of the algorithms and their parameter settings.
What are the most important parameters in each algorithm?
Create a presentation, a PDF document or a Jupyter Notebook explaining the theory and applications of f1_score, adjusted_rand_score and silhouette_score.
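For the comparison under optimal parameter settings, the best row per dataset and algorithm can be picked out of the results before plotting. The sketch below assumes the results have dataset, algorithm, params and adjusted_rand_score columns; these column names are hypothetical, and the numbers are placeholders made up for illustration, not real measurements.

```python
import pandas as pd

# Placeholder rows in the shape of the saved results (values are made up).
results = pd.DataFrame({
    "dataset": ["iris", "iris", "iris", "iris"],
    "algorithm": ["kmeans", "kmeans", "dbscan", "dbscan"],
    "params": ["n_clusters=2", "n_clusters=3", "eps=0.1", "eps=0.2"],
    "adjusted_rand_score": [0.54, 0.72, 0.30, 0.52],
})

# Index of the best-scoring row within each dataset/algorithm group.
grouped = results.groupby(["dataset", "algorithm"])
best_idx = grouped["adjusted_rand_score"].idxmax()
best = results.loc[best_idx].reset_index(drop=True)
print(best)
```

The resulting best frame holds one row per dataset/algorithm pair and could then feed a Plotly chart or a Tableau data source. Choosing adjusted_rand_score as the selection criterion is itself a design choice; selecting by f1_score or silhouette_score may give a different "optimal" setting.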

Quality Standard of Your Work

Code should follow the PEP 8 standard.
Host your code on your GitHub in a public or private repository as you prefer.

If it is a public repository, send the link for us to evaluate.
If it is a private repository, share (view only) with our GitHub usernames. Please contact us for them.

Send us a notification to start the evaluation. We evaluate your code for your technical progress.

Sponsor

DataDisca Pty Ltd, Melbourne, Australia

https://www.datadisca.com
