Principal Component Analysis in scikit-learn - Lab

Introduction

Now that you've seen a brief introduction to PCA, it's time to try implementing the algorithm on your own.

Objectives

You will be able to:

  • Perform PCA in Python and scikit-learn using the Iris dataset
  • Measure the impact of PCA on the accuracy of classification algorithms
  • Plot the decision boundary of different classification experiments to visually inspect their performance

Iris Dataset

To practice PCA, you'll take a look at the iris dataset. Run the cell below to load it.

from sklearn import datasets
import pandas as pd
 
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.get('target')
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  Target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

In a minute, you'll perform PCA and visualize the dataset's principal components. Before that, it's helpful to get a little more context regarding the data you'll be working with. Run the cell below to visualize the pairwise feature plots. Notice how the target labels are largely separable by individual features, most clearly by the petal measurements.

import matplotlib.pyplot as plt
%matplotlib inline

pd.plotting.scatter_matrix(df, figsize=(10,10));

# Create features and Target dataset


# Your code here 

# Standardize the features


# Your code here 
   sepal length  sepal width  petal length  petal width
0     -0.900681     1.032057     -1.341272    -1.312977
1     -1.143017    -0.124958     -1.341272    -1.312977
2     -1.385353     0.337848     -1.398138    -1.312977
3     -1.506521     0.106445     -1.284407    -1.312977
4     -1.021849     1.263460     -1.341272    -1.312977
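
One possible way to complete the two cells above is sketched here. It assumes the `Target` column from the loading cell and uses scikit-learn's `StandardScaler`; the names `features`, `target`, `short_names`, and `features_std` are illustrative, and the shortened column names are chosen only to match the output shown. Exact values can differ slightly between scikit-learn versions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.target

# Separate the four measurements from the class label
features = df.drop('Target', axis=1)
target = df['Target']

# Rescale every feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(features)

# Assumed column names, shortened to match the table above
short_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
features_std = pd.DataFrame(scaled, columns=short_names)
features_std.head()
```

Standardizing first matters because PCA is driven by variance, so a feature measured on a larger scale would otherwise dominate the components.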

PCA Projection to 2D Space

Now it's time to perform PCA! Project the original four-dimensional data into two dimensions. The new components are the two orthogonal directions of greatest variance in the data.

  • Initialize an instance of PCA from scikit-learn with 2 components
  • Fit the model to the data
  • Extract the first 2 principal components from the trained model

# Run the PCA algorithm


# Your code here 
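
A minimal sketch of these steps, assuming the features are standardized with `StandardScaler` as in the previous section. The variable names `X`, `pca`, and `principal_components` are illustrative, not part of the lab.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized Iris features, rebuilt here so the sketch runs on its own
X = StandardScaler().fit_transform(datasets.load_iris().data)

# Keep the two directions of greatest variance
pca = PCA(n_components=2)

# fit_transform fits the model and projects the data in one call
principal_components = pca.fit_transform(X)

print(principal_components.shape)  # (150, 2)
```

The fitted `pca.components_` array holds the two projection directions, one row per component, expressed in the original four-feature space.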

To visualize the components, it is useful to also look at the target associated with each observation. As such, append the target (flower name) to the principal components in a pandas DataFrame.

# Create a new dataset from principal components 


# Your code here 
         PC1        PC2       target
0  -2.264542   0.505704  Iris-setosa
1  -2.086426  -0.655405  Iris-setosa
2  -2.367950  -0.318477  Iris-setosa
3  -2.304197  -0.575368  Iris-setosa
4  -2.388777   0.674767  Iris-setosa
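
A sketch of one way to build such a DataFrame, assuming the data comes from `datasets.load_iris()`. Note that scikit-learn's `target_names` are `setosa`, `versicolor`, and `virginica`, so the labels will not carry the `Iris-` prefix shown in the output above, and the component values may differ slightly by library version. The name `pc_df` is illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
components = PCA(n_components=2).fit_transform(X)

# One column per principal component
pc_df = pd.DataFrame(components, columns=['PC1', 'PC2'])

# Map the integer labels (0, 1, 2) to species names
pc_df['target'] = iris.target_names[iris.target]
pc_df.head()
```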

Great, you now have two dimensions, reduced from four, alongside the target variable, the flower name.

Visualize Principal Components

Using the target data, we can visualize the principal components according to the class distribution.

  • Create a scatter plot of the principal components, color coding the examples by class

# Principal Components scatter plot


# Your code here 
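
One way to draw this plot is sketched below, assuming matplotlib and the standardized two-component projection used in this lab. The figure size, transparency, and title are arbitrary choices, not requirements.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
components = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data))

fig, ax = plt.subplots(figsize=(8, 8))

# One scatter call per class so each species gets its own color and legend entry
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(components[mask, 0], components[mask, 1], label=name, alpha=0.7)

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Iris projected onto two principal components')
ax.legend()
plt.show()
```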

Explained Variance

You can see above that the three classes in the dataset are fairly well separable. As such, this compressed representation of the data is probably sufficient for the classification task at hand. Compare the variance in the overall dataset to that captured by your two principal components.

# Calculate the variance explained by principal components


# Your code here 
Variance of each component: [0.72770452 0.23030523]

 Total Variance Explained: 95.8
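
A sketch of how these numbers can be obtained, assuming a `PCA` model fitted on the standardized features. The fitted attribute `explained_variance_ratio_` gives the fraction of total variance captured by each retained component. Recent scikit-learn releases ship a slightly corrected copy of the Iris data, so the printed values may differ a little from those above.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each retained component
ratios = pca.explained_variance_ratio_

print('Variance of each component:', ratios)
print('Total Variance Explained:', round(ratios.sum() * 100, 2))
```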

As you should see, these first two principal components account for the vast majority of the overall variance in the dataset. This is indicative of the total information encapsulated in the compressed representation compared to the original encoding.

Compare Performance of a Classifier with PCA

Since the first two principal components explain about 96% of the variance in the data, it is interesting to consider how a classifier trained on the compressed version would compare to one trained on the original dataset.

  • Run a KNeighborsClassifier to classify the Iris dataset
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Classify the complete Iris dataset

# Your code here 
Accuracy: 1.0
Time Taken: 0.0017656260024523363
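
The steps above could be sketched as follows. This assumes the raw, unscaled features and the default `KNeighborsClassifier` settings, neither of which the lab specifies, and uses `time.perf_counter` for timing. The accuracy and timing you see may therefore differ from the output shown.

```python
import time

from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

start = time.perf_counter()

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9)

# Default settings (n_neighbors=5) are an assumption
model = KNeighborsClassifier()
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

elapsed = time.perf_counter() - start

print('Accuracy:', accuracy)
print('Time Taken:', elapsed)
```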

Great, so you can see that the classifier reaches 100% accuracy on the test set in the time shown. Remember that the time taken may vary from run to run, depending on the load on your CPU and the number of processes running on your machine.

Now repeat the above process for the dataset made from the principal components.

  • Run a KNeighborsClassifier to classify the Iris dataset with principal components
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Run the classifier on PCA'd data


# Your code here 
Accuracy: 0.9666666666666667
Time Taken: 0.00035927799763157964
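
A sketch of the same experiment on the two principal components, again assuming default `KNeighborsClassifier` settings. It follows the lab in fitting the scaler and PCA on the full dataset before splitting; in practice these are usually fitted on the training split only, so that no information from the test set reaches the model.

```python
import time

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Two principal components of the standardized features
X_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data))

start = time.perf_counter()

# Same 80/20 split and seed as for the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.2, random_state=9)

model = KNeighborsClassifier()
model.fit(X_train, y_train)
pca_accuracy = accuracy_score(y_test, model.predict(X_test))

pca_elapsed = time.perf_counter() - start

print('Accuracy:', pca_accuracy)
print('Time Taken:', pca_elapsed)
```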

While some accuracy is lost in this representation, the measured time dropped from about 1.8 ms to about 0.36 ms, roughly a fivefold reduction. In more complex cases, PCA can even improve the accuracy of some machine learning tasks. In particular, PCA can be useful to reduce overfitting.

Summary

In this lab, you applied PCA to the popular Iris dataset. You looked at the performance of a simple classifier and the impact of PCA on it. From here, you'll continue to explore PCA at more fundamental levels.
