Code Monkey home page Code Monkey logo

clustering-penguin-species's Introduction

Clustering-Penguin-Species

Use unsupervised learning skills to reduce dimensionality and identify clusters in the antarctic penguin dataset

Artwork by @allison_horst

source: @allison_horst

Read Further at: https://github.com/allisonhorst/penguins

The dataset consists of 5 columns.

  • culmen_length_mm: culmen length (mm)
  • culmen_depth_mm: culmen depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: penguin sex

Sample rows from the dataset:

culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 39.1 18.7 181.0 3750.0 MALE
1 39.5 17.4 186.0 3800.0 FEMALE
2 40.3 18.0 195.0 3250.0 FEMALE
3 NaN NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0 FEMALE

While the dataset lacks labeled instances, it presents an opportunity to apply unsupervised learning techniques to cluster the instances and identify potential groups or classes within the dataset. It is given that there are three species of penguins native to the region: Adelie, Chinstrap, and Gentoo, so the task is to leverage my data science skills and apply unsupervised clustering algorithms to help identify the underlying groups or clusters present in the dataset!

Project Approach

1. Loading and examining the dataset

  • create a pandas DataFrame and examine "data/penguins.csv" for data types and missing values.
  • store the DataFrame in df_penguins variable.

2. Dealing with null values and outliers

  • using the information gained in the previous step, identify outliers and null values and remove them.

3. Perform preprocessing steps on the dataset to create dummy variables

  • create dummy variables for the available categorical feature in the dataset, then drop the original column.

4. Perform preprocessing steps on the dataset - scaling

  • utilize an available preprocessing function to standardize the features in the dataset and prepare it for the unsupervised learning algorithms.
  • create a preprocessed DataFrame for the PCA process.

5. Perform PCA

  • perform PCA(), without specifying the number of components, to determine the explained variance ratio versus the number of principal components.
  • detect the number of components that have more than 10% explained variance ratio.
  • finally, create a variable named n_components to store the optimal number of components determined by the analysis, and run the PCA while setting n_components.

6. Detect the optimal number of clusters for k-means clustering

  • perform Elbow analysis to determine the optimal number of clusters for this dataset.
  • store the optimal number of clusters in the n_clusters variable.

7. Run the k-means clustering algorithm

  • using the optimal number of clusters obtained from the previous step, run the k-means clustering algorithm once more on the preprocessed data.

8. Create a final statistical DataFrame for each cluster

  • create a final characteristic DataFrame for each cluster using the groupby method and mean function only on numeric columns.

clustering-penguin-species's People

Contributors

yashuv avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.