We study frequently used clustering algorithms in this exercise.
To study:
- Python openml
- Data preparation
- Clustering algorithms
- Performance measures
- Hyper-parameter tuning
- joblib library
- Data visualisation
Make sure you know your branch of this module.
We continue to improve the modules based on your feedback and submitted outputs.
Therefore, we create a branch for you when we assign the module.
Go through the README.md of your branch.
ARE YOU READING THE RIGHT BRANCH?
If there is any doubt, contact DataDisca.
This code is hosted in a private repository to regulate access. After completing the module, you can host your work, including your data and code, in an open repository under the MIT license.
Please help us by reporting all types of errors.
Dataset | Instances | Classes | Missing Values | URL |
---|---|---|---|---|
iris | 150 | 3 | 0 | https://www.openml.org/d/61 |
wine | 178 | 3 | 0 | https://www.openml.org/d/187 |
glass | 214 | 6 | 0 | https://www.openml.org/d/41 |
haberman | 306 | 2 | 0 | https://www.openml.org/d/43 |
libras_move | 360 | 15 | 0 | https://www.openml.org/d/299 |
satellite_image | 6435 | 6 | 0 | https://www.openml.org/d/294 |
isolet | 7797 | 26 | 0 | https://www.openml.org/d/300 |
nursery | 12960 | 5 | 0 | https://www.openml.org/d/26 |
gas-drift-different-concentrations | 13910 | 6 | 0 | https://www.openml.org/d/1477 |
MagicTelescope | 19020 | 2 | 0 | https://www.openml.org/d/1120 |
letter | 20000 | 26 | 0 | https://www.openml.org/d/6 |
covertype | 581012 | 7 | 0 | https://www.openml.org/d/150 |
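As a minimal sketch, the dataset IDs in the table above can be kept in one mapping and passed to the `openml` package. The helper name `load_dataset` is hypothetical, and the call needs the third-party `openml` package and network access. The exact container types returned by `get_data` depend on the installed `openml` version.

```python
# OpenML dataset IDs, copied from the URLs in the table above.
OPENML_IDS = {
    "iris": 61,
    "wine": 187,
    "glass": 41,
    "haberman": 43,
    "libras_move": 299,
    "satellite_image": 294,
    "isolet": 300,
    "nursery": 26,
    "gas-drift-different-concentrations": 1477,
    "MagicTelescope": 1120,
    "letter": 6,
    "covertype": 150,
}


def load_dataset(name):
    """Hypothetical helper: download one dataset by its name in OPENML_IDS.

    Returns the features, the target column and a per-column list of
    booleans marking which features OpenML flags as categorical.
    """
    import openml  # imported lazily so the ID table is usable without it

    dataset = openml.datasets.get_dataset(OPENML_IDS[name])
    features, target, is_categorical, _names = dataset.get_data(
        target=dataset.default_target_attribute
    )
    return features, target, is_categorical
```

For example, `load_dataset("iris")` would fetch dataset 61. The categorical indicator is a convenient starting point for the data type identification step.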
Follow the steps given below.
Study the following algorithms.
- K-Means
- Agglomerative
- DBSCAN
- OPTICS
- Gaussian mixtures
- Affinity propagation
- Mean-shift
- Spectral
- Ward hierarchical
- BIRCH
- Self-organising maps
At the end of the exercise, you should be able to answer the following questions.
- What are the important parameters in each algorithm?
- How and why do those parameters affect the results of the respective algorithms?
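One way to keep track of the parameters in question is to write them down as a grid per algorithm and expand each grid into individual combinations. The sketch below assumes scikit-learn's estimator argument names (`n_clusters`, `init`, `eps`, `min_samples`, `linkage`). The values are illustrative placeholders, not recommended settings, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Illustrative grids: keys follow scikit-learn argument names,
# values are assumptions chosen only to demonstrate the idea.
PARAM_GRIDS = {
    "KMeans": {"n_clusters": [2, 3, 5], "init": ["k-means++", "random"]},
    "DBSCAN": {"eps": [0.1, 0.3, 0.5], "min_samples": [3, 5, 10]},
    "AgglomerativeClustering": {
        "n_clusters": [2, 3, 5],
        "linkage": ["ward", "average", "complete"],
    },
}


def expand_grid(grid):
    """Return every parameter combination in a grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


dbscan_combos = expand_grid(PARAM_GRIDS["DBSCAN"])
kmeans_combos = expand_grid(PARAM_GRIDS["KMeans"])
```

Each dict in `dbscan_combos` can be unpacked into an estimator constructor, so varying one key at a time shows how that parameter alone changes the result.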
Follow the steps given below to write your code.
- In your code, download a dataset using the Python `openml` package
- Prepare data
  - Identify the data types: boolean/categorical, ordinal and numeric in this case. There can be many other types as well.
  - Transform categorical variables to numeric as necessary
  - Min-max normalise
- Write joblib code to walk through the parameter grid
- Record `f1_score`, `adjusted_rand_score`, `silhouette_score` and execution time against each parameter combination identified in Step 1.
- Save the results to CSV files.
- Execute your code over all the algorithms and all the datasets.
- Save your results to CSV files.
- Create Tableau Dashboards or Plotly visualisations to analyse your results.
- With visualisations:
  - for each dataset, compare and contrast the results produced by each algorithm under optimal parameter settings
  - for each given algorithm, show how `f1_score`, `adjusted_rand_score` and `silhouette_score` vary with the different parameter values
- Discuss the execution times of the algorithms and their parameter settings.
- What are the most important parameters in each algorithm?
- Create a presentation, a PDF document or a Jupyter Notebook explaining the theory and applications of `f1_score`, `adjusted_rand_score` and `silhouette_score`.
- Code should follow the PEP 8 standard
- Host your code on GitHub in a public or private repository, as you prefer.
  - If it is a public repository, send us the link to evaluate.
  - If it is a private repository, share it (view only) with our GitHub usernames. Please contact us for them.
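The coding steps listed above (min-max normalisation, walking the parameter combinations with joblib, timing each run and saving the results as CSV) could be sketched roughly as follows. This is an outline under stated assumptions, not a reference solution: `run_one` is a hypothetical worker whose body is a placeholder for fitting a real estimator and computing `f1_score`, `adjusted_rand_score` and `silhouette_score`, and the code falls back to a plain loop when joblib is not installed.

```python
import csv
import io
import time

try:
    from joblib import Parallel, delayed  # third-party
    HAVE_JOBLIB = True
except ImportError:
    HAVE_JOBLIB = False  # the sketch then runs sequentially


def min_max_normalise(rows):
    """Scale each column of a list of rows to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def run_one(algorithm, params, data):
    """Hypothetical worker: time one run and return one result row."""
    start = time.perf_counter()
    n_rows = len(data)  # placeholder: fit the estimator and compute scores here
    seconds = time.perf_counter() - start
    return {"algorithm": algorithm, **params, "n_rows": n_rows, "seconds": seconds}


def walk_grid(algorithm, combos, data, n_jobs=2):
    """Evaluate every parameter combination, in parallel when joblib is available."""
    if HAVE_JOBLIB:
        # prefer="threads" keeps this sketch free of pickling concerns;
        # drop it to use joblib's default process-based backend.
        return Parallel(n_jobs=n_jobs, prefer="threads")(
            delayed(run_one)(algorithm, params, data) for params in combos
        )
    return [run_one(algorithm, params, data) for params in combos]


def to_csv(rows):
    """Serialise result rows to CSV text; write it to a file as needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


data = min_max_normalise([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
combos = [
    {"n_clusters": 2, "init": "random"},
    {"n_clusters": 3, "init": "random"},
]
results = walk_grid("KMeans", combos, data)
csv_text = to_csv(results)
```

joblib returns results in the order the combinations were submitted, so each CSV row lines up with its parameter set without extra bookkeeping.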
Send us a notification to start the evaluation. We evaluate your code for your technical progress.
DataDisca Pty Ltd, Melbourne, Australia