Curse of Dimensionality Simulation - Lab
Introduction
The Curse of Dimensionality, as we saw earlier, is one of the key problems in machine learning with high-dimensional data. It appears in many different forms, but all of them share the same source: points in high-dimensional space are highly sparse, and this sparsity leads to overfitting. In this lab, we shall simulate this problem and see how it manifests itself in a simple experimental setup.
Note: You are advised to read the additional document on hypercubes referenced in the previous lesson before attempting these simulations.
Objectives
You will be able to:
- Understand how dimensionality affects the probability space and the share of it that each observation covers
- Simulate the effect of distance between observations as dimensionality increases
- Simulate the effect of dimensionality on the volume of hypercubes
Dimensionality vs. Probability Space
Let us consider the probability space for one variable, represented by the unit interval (0, 1), and imagine drawing ten samples along that interval. On such a 1D interval (a straight line), each sample represents 10% of the probability space: you can picture a straight line divided into ten sections.
Now consider a second feature defined on another orthogonal (perpendicular) (0, 1) interval, also represented by ten samples. We now have $10 \times 10 = 100$ probability units in the resulting unit square.
Now we have a new feature space that has 100 units, and each sample now represents only 1% of the probability space; to keep one sample per probability unit, we would need 100 samples instead of 10.
Taking this further, consider adding a third dimension, creating a probability space represented by a cube with ten units on a side and 1000 probability units within. We need to stack 10 of the previous squares to construct the cube, and each sample now represents a mere 0.1% of the space.
We can conclude that the number of samples required to maintain the same sampling density grows exponentially with the number of dimensions $p$: in this example, as $10^p$.
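To make this concrete, here is a minimal sketch (plain Python, no extra libraries) that tabulates how the number of probability units, and hence the share of the space that one sample covers, changes as dimensions are added:

```python
# Each axis is divided into 10 sections, so a p-dimensional grid
# contains 10**p probability units
for p in range(1, 4):
    n_units = 10 ** p
    coverage = 100 / n_units  # percentage of the space one sample represents
    print(f"{p}D: {n_units} units; one sample covers {coverage}% of the space")
```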
Dimensionality vs. Euclidean Distance
Let's simulate the above behavior with just two points, which lie at 0 and 1 on a single dimension, a unit length from each other. Suppose we introduce a second axis of "data", again distributed a unit distance away. Now we have two points, (0, 0) and (1, 1), but the distance between the points has grown to $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.414$.
If we add a third dimension, the two points (0, 0, 0) and (1, 1, 1) will be $\sqrt{1^2 + 1^2 + 1^2} = \sqrt{3} \approx 1.732$ units apart. In general, the opposite corners of a $p$-dimensional unit hypercube lie $\sqrt{p}$ apart, so this distance grows without bound as dimensions are added.
Simulate the above scenario and plot the output as a line plot with the number of dimensions on the x-axis and the Euclidean distance on the y-axis, as shown in the output.
# Your code here
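One possible way to complete the cell above is sketched below, assuming NumPy and Matplotlib are available; the range of 1 to 100 dimensions is an assumption, since the original output image is not shown:

```python
# Sketch (assumed setup): the distance between (0,...,0) and (1,...,1)
# in p dimensions is sqrt(1^2 + ... + 1^2) = sqrt(p)
import numpy as np
import matplotlib.pyplot as plt

dims = np.arange(1, 101)    # assumed range of dimensions
distances = np.sqrt(dims)   # Euclidean distance between the two corners

plt.plot(dims, distances)
plt.xlabel('Number of dimensions')
plt.ylabel('Euclidean distance')
plt.show()
```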
(Expected output: a line plot of Euclidean distance against the number of dimensions, growing as $\sqrt{p}$.)
Dimensionality vs. Volume of a Hypercube
Let's formalize this a bit more. Consider a $p$-dimensional hypercube with unit volume. Suppose that we have a set of observations uniformly distributed within its volume.
Let $r$ be the fraction of the observations that we want to capture inside a smaller, hypercubical neighborhood.
Expressed in terms of volume, since the observations are uniformly distributed, the neighborhood must enclose a fraction $r$ of the unit volume. A sub-cube with edge length $e$ has volume $e^p$, so setting $e^p = r$ gives the required edge length $e_p(r) = r^{1/p}$.
This means that in order to capture 10% (0.1) of the samples in 2 dimensions, we need $e_2(0.1) = 0.1^{1/2} \approx 0.32$, i.e., an edge covering about 32% of the range of each feature.
But in 10-d space we need $e_{10}(0.1) = 0.1^{1/10} \approx 0.8$, i.e., almost 80% of the range of each feature, just to capture 10% of the samples.
Use $e_p(r)=r^{1/p}$ to calculate the edge length required to capture a fraction $r$ of the samples for $p = 1, \ldots, 10$. Produce the output as shown below:
# Your code here
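One possible solution sketch follows, assuming NumPy and Matplotlib. It plots $e_p(r)$ against $p$ for a few values of $r$; the exact set of curves in the original output is not recoverable, so the chosen fractions are an assumption:

```python
# Sketch: edge length e_p(r) = r**(1/p) needed to capture a fraction r
# of uniformly distributed samples in a p-dimensional unit hypercube
import numpy as np
import matplotlib.pyplot as plt

p = np.arange(1, 11)
for r in [0.1, 0.3, 0.5, 0.8]:   # assumed fractions to illustrate the trend
    plt.plot(p, r ** (1 / p), label=f'r = {r}')

plt.xlabel('Number of dimensions (p)')
plt.ylabel('Edge length $e_p(r)$')
plt.legend()
plt.show()
```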
(Expected output: a plot of $e_p(r)$ curves with a legend.)
Observation 1
We see that as $p$ grows, the edge length $e_p(r)$ of the neighborhood needed to capture a fixed fraction $r$ of the samples also grows.
Note that $e_p(r)$ is the length per side, not the volume: the sub-cube always encloses the fraction $r$ of the volume, but its sides must stretch further along every axis. What starts out needing just 10% of each feature's range in one dimension grows to needing almost 80% of it in ten dimensions, as we see in the plot above.
Observation 2
The growth is uneven across the fraction $r$ we want to capture.
To capture 80% of the points, we start out needing 80% of each feature's range and end up needing about 98% of it at $p = 10$. This is more than we needed to start with, but a significantly slower rate of growth than the jump from 10% to roughly 80% required to capture just 10% of the samples.
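As a quick sanity check on both observations, the endpoint values can be computed directly:

```python
# e_p(r) = r**(1/p) evaluated at p = 1 and p = 10 for the two fractions above
for r in (0.1, 0.8):
    print(f"r = {r}: e_1 = {r:.3f}, e_10 = {r ** (1 / 10):.3f}")
# r = 0.1: e_1 = 0.100, e_10 = 0.794
# r = 0.8: e_1 = 0.800, e_10 = 0.978
```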
Summary
In this lesson, we looked at the curse of dimensionality and saw how this issue manifests itself as the number of dimensions grows. We ran a couple of simulations based on the intuitions developed above, arriving at the two observations given at the end.