Code Monkey home page Code Monkey logo

dsc-3-34-03-curse-of-dimensionality-simulation-lab-online-ds-sp-000's Introduction

Curse of Dimensionality Simulation - Lab

Introduction

The Curse of Dimensionality as we saw earlier, is one of the key problems in multivariate machine learning with high dimensional data. It appears in many different forms, but all of them have the same net form and source: the fact that points in high-dimensional space are highly sparse, and this leads to overfitting. In this lab, we shall try to simulate this problem see how it manifests itself in s simple experimental setup.

Note: You are advised to read the additional document on hypercubes referenced the previous lesson before attempting these simulations.

Objectives

You will be able to:

  • Understand how dimensionality effects the probability space and the accuracy of each observation
  • Simulate the effect of distance between observations as dimensionality increases
  • Simulate the effect of dimensionality on the volume of hypercubes.

Dimensionality vs. Probability Space

Let us consider probability space for one variable represented by the unit interval (0, 1). Imagine drawing ten samples along that interval. On such a 1D interval (a straight line), each sample would represent 10% of the probability space. You could imagine a straight line , divided into ten sections.

Now consider a second feature defined on another orthogonal (perpendicular) (0, 1) interval, also being represented by ten samples. We now have 10 $(x1 , x2)$ points on a plane defined by the orthogonal x1 , x2 lines, and representing a new probability space. This is shown below:

Now we have a new feature space that has $10 X 10 = 100$ area units, so each of the ten points now represents only 1% of the probability space. We now need require 100 points, in order for each point to represent the same 10% of the probability space that was represented by 10 points in only one dimension. Now, although there are 100 $(x1 , x2)$ points, neither axis has more accuracy than previous 10 on a single axis, because there are 10 values of x1 required for the 10 different values of x2 necessary to represent the new probability plane.

Taking this further,consider adding a third dimension creating a probability space represented by a cube with ten units on a side, and 1000 probability units within. We need to stack 10 of the previous $(x1 , x2)$ planes to construct the new $(x1 , x2 , x3)$ cube, and of course 10 times the number of samples per axis to maintain the former level of accuracy, or probability coverage. This is shown below:

We can conclude that $10^n$ samples would be required for a n-dimension problem. This is the "curse of dimensionality" .In practice when the samples are taken at random, it appears as though there are more samples, but this isn't the case in terms of probability space coverage which is related to the density of points, not the number of observations per axis.

Dimensionality vs. Euclidean Distance

Let's simulate the above behavior with two points only which lie at 0 and 1 on a single dimension, at a unit length from each other. Suppose we introduce a second axis of "data", again distributed a unit distance away. Now we have two points, (0,0) and (1,1). But the distance between the points has grown to $\sqrt{2}$ according to Euclidean distance $$d(x,y) = \sqrt{(x_1 - y_1)^2 +.....(X_N - y_N)^2}$$

if we add a third dimension, the two points (0, 0, 0) and (1, 1, 1) will be $\sqrt{3}$ away, and so on. By the time we reach 10 dimensions, the two points are now a distance of 3 away from one another: thrice as far as they were when the data was 1-dimensional:

Simulate the above scanario and plot the output as a line plot with number of dimensions on x-axis and euclidean distance at Y-axis as shown in the output

# Your code here 
[<matplotlib.lines.Line2D at 0x11fae2630>]

png

Dimensionality vs. Volume of a Hypercube

Let's formalize it a bit more. Consider a p -dimensional hypercube with unit volume. Suppose that we have $n$ data points uniformally distributed inside the hypercube. The volume (area) of a square is $e^2$ , where $e$ is the edge length; the volume of a cube is $e^3$ , and so on . Hence the volume of a p -dimensional hypercube is given by:

$$Vp=1=e^p$$

Let $r$ be the ratio of points inside the cube which are within some neighborhood. To capture an $r$ with maximum points in the data, we need to grow a cube which takes up $r$ of the unit cube's volume. Since the length of an edge on the cube is simply 1, we have:

$$r=e^p$$

Expressed in terms of $e_p$ , the edge length necessary to fill a p-dimensional -hypercube, we have that:

$$e_p(r)=r^{1/p}$$

This means, in order to capture 10% (0.1) of the samples in 2 dimensions, we need $e_2(.1)=(.1)^{1/2}=0.31$.

But in 10-d space we need $e_10(.1)=(.1)^1/10=0.8$ , or 80 percent of the volume of the cube.

Use $e_p(r)=r^{1/p}$ to calculate percentage of volume covered for p = [1,..,10]. Produce the output as shown below:

# Your code here 
<matplotlib.legend.Legend at 0x11f95d2b0>

png

Observation 1

We see that as $p$ grows, the proportion of volume of the hypercube that we need ($r$) also grows.

What starts out needing just 10% of the volume grows to needing almost 80% of the volume as we see in the plot above.

Observation 2

The growth is uneven across the percentage $r$ we want to capture.

To capture 80% of the points, we start out needing 80% of the volume and end up needing ~95% of it. This is more than we needed to start with, but a significantly slower rate of growth than the 10%-to-80% growth to get 10% of r.

Summary

In this lesson, we looked at curse of dimensionality and saw how this issues manifests itself as the number of dimensions grow. We ran a couple of simulations based on the developed intuitions to conclude the two observations given at the end.

dsc-3-34-03-curse-of-dimensionality-simulation-lab-online-ds-sp-000's People

Contributors

shakeelraja avatar loredirick avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.