
# K-means Encoder Examples

The code in this directory contains some examples of auto-encoders based on k-means clustering.

## Random Points

The clustering.r script generates 15-dimensional data points that actually lie on a 3-dimensional curved manifold (surface-ish). These points are produced by generating 3-dimensional points and sending them through a randomly generated neural network with a single layer of hidden nodes. This network simultaneously distorts the original data and projects it into the final 15-dimensional representation.

Since the neural network represents a continuous function from R^3 to R^15, the projected points will have a simple, but obscured structure.

This process is used to generate a training data set as well as a test data set.
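For concreteness, here is a minimal sketch of that generation step. The hidden-layer width, the tanh squashing function, and the sample counts are illustrative assumptions, not necessarily the choices made in clustering.r:

```r
set.seed(42)
d.in <- 3; d.hidden <- 10; d.out <- 15

# random single-hidden-layer network mapping R^3 to R^15
w1 <- matrix(rnorm(d.in * d.hidden), d.in, d.hidden)
w2 <- matrix(rnorm(d.hidden * d.out), d.hidden, d.out)

project <- function(x) {
  # tanh in the hidden layer distorts, w2 projects to 15 dimensions
  tanh(x %*% w1) %*% w2
}

n.train <- 10000; n.test <- 1000
train <- project(matrix(rnorm(n.train * d.in), n.train, d.in))
test  <- project(matrix(rnorm(n.test  * d.in), n.test,  d.in))
```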

After the data points are generated, the training data are clustered and the resulting clusters are used to encode and reconstruct both the training data and the test data. This is done for varying numbers of clusters from 10 to 2000.
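A sketch of that loop might look like the following. Using FNN::get.knnx to find the nearest centroid is an assumption about implementation, not a requirement; any nearest-neighbor search would do:

```r
library(FNN)   # provides get.knnx for nearest-neighbor lookup

reconstruction.error <- function(centers, data) {
  # replace each point by the centroid of its nearest cluster
  nearest <- get.knnx(centers, data, k = 1)$nn.index
  recon <- centers[nearest, , drop = FALSE]
  mean(sqrt(rowSums((data - recon)^2)))   # mean Euclidean error
}

ks <- c(10, 20, 50, 100, 200, 500, 1000, 2000)
errs <- sapply(ks, function(k) {
  fit <- kmeans(train, centers = k, iter.max = 50, nstart = 5)
  c(train = reconstruction.error(fit$centers, train),
    test  = reconstruction.error(fit$centers, test))
})
colnames(errs) <- ks
```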

The plot in points.pdf shows how accurately the points are reconstructed by simply using the centroid of the nearest cluster instead of the original point. The black line shows how well the training data is reconstructed and the red line shows the same for the test data. Since the training data was used to create the clusters, it isn't too surprising that its reconstruction error is lower. In particular, by the time we have 2000 clusters, a number of these clusters will have only a single data point assigned to them, resulting in zero error for that point in the training data. The test data will inevitably not contain exactly that same data point and thus will not have any points that get zero reconstruction error.

Theoretically speaking, if the data points covered some region of 15-dimensional space uniformly, the best k-means clustering of a very large number of data points is closely related to the question of how to pack spheres as tightly as possible. The number of spheres n of a particular radius r that are required to fill a unit cube grows as (1/r)^d, where d is the dimension of the space we are filling; inverting this, the radius shrinks as n^(-1/d). In our case, we are only filling a 3-dimensional manifold, so our error (roughly the radius r) should decrease with the inverse cube root of the number of cluster centroids. This rough relationship can be seen in cube-root.pdf. The deviation from the ideal form for large k has to do with the finite number of data points we have. I don't know the cause of the slight offset from the ideal form.
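The inverse-cube-root prediction can be checked by fitting a power law in log-log space, reusing the `errs` and `ks` values from the sketch above; a fitted slope near -1/3 matches the cube-root law:

```r
# fit log(error) ~ log(k); slope near -1/3 suggests a 3-dimensional manifold
fit <- lm(log(errs["test", ]) ~ log(ks))
coef(fit)[2]
```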

## Time-series

The time-series.r script generates time series data instead of random points. This time series is generated by first creating two relatively low-dimensional random pulses. The time series is then built by stepping forward a bit at a time and adding one or the other pulse to the series. If short steps are taken, the pulses will overlap and the result will be the sum of more than one pulse. If long steps are taken, there will be periods when the output falls to zero because neither pulse has any influence. As with the random points demo, both training and test data are generated.
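A rough sketch of such a generator is below. The pulse shapes, lengths, and step-size distribution are illustrative assumptions rather than the exact choices in time-series.r:

```r
set.seed(1)
pulse.length <- 40

smooth.pulse <- function(len) {
  # random walk tapered to zero at both ends gives a smooth random pulse
  cumsum(rnorm(len)) * sin(pi * seq(0, 1, length.out = len))^2
}
p1 <- smooth.pulse(pulse.length)
p2 <- smooth.pulse(pulse.length)

n <- 5000
x <- numeric(n + pulse.length)
pos <- 1
while (pos + pulse.length - 1 <= n) {
  p <- if (runif(1) < 0.5) p1 else p2            # pick one pulse at random
  span <- pos:(pos + pulse.length - 1)
  x[span] <- x[span] + p
  pos <- pos + sample(10:60, 1)   # short steps overlap, long steps leave gaps
}
x <- x[1:n]
```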

The clustering of the data proceeds by generating a windowed version of the original training time series. These windows are created by taking 32 consecutive samples from the time series and multiplying by a sin^2 windowing function that allows all the windowed values to be added together again to recreate the original time series. Each window steps 16 samples from the previous window so that the windows overlap by half.
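The key property is that sin^2 windows with a half-window hop sum to exactly 1 on interior samples, so overlap-add reconstruction is lossless. A sketch, assuming the 32-sample window and 16-sample hop described above:

```r
window.size <- 32
hop <- 16
w <- sin(pi * ((1:window.size) - 0.5) / window.size)^2   # sin^2 taper

starts <- seq(1, length(x) - window.size + 1, by = hop)
windows <- t(sapply(starts, function(s) x[s:(s + window.size - 1)] * w))
# one windowed 32-sample frame per row; since w[t] + w[t + 16] == 1,
# overlapped frames add back to the original signal
```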

After the windows are generated, they are clustered, again using k-means to do so. These clusters are then normalized to have magnitude 1 and each time series is encoded using the clusters and then decoded using the same clusters. Since the clusters have been normalized, encoding proceeds by computing the dot product of a window with all of the cluster centers and finding the largest dot product. This largest value is remembered along with the corresponding index.
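A sketch of that encoding step, assuming the `windows` matrix from above and an arbitrary choice of 100 clusters:

```r
fit <- kmeans(windows, centers = 100, iter.max = 50)
centers <- fit$centers / sqrt(rowSums(fit$centers^2))   # rows of magnitude 1

scores <- windows %*% t(centers)   # dot products with every centroid at once
index <- max.col(scores)           # best-matching centroid for each window
magnitude <- scores[cbind(seq_len(nrow(scores)), index)]
```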

To decode the resulting index and magnitude back into a time series, the index is used to look up a cluster centroid and the magnitude is used to scale that centroid. The result is added back into the final time series at the appropriate time offset. To compute the reconstruction error, we can simply subtract the original time series from the result of encoding and then decoding, and take the mean absolute value (MAV).
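The decoding step can be sketched like this, reusing `starts`, `centers`, `index`, and `magnitude` from the sketches above:

```r
recon <- numeric(length(x))
for (i in seq_along(starts)) {
  # scale the chosen centroid and overlap-add it at the window's offset
  span <- starts[i]:(starts[i] + window.size - 1)
  recon[span] <- recon[span] + magnitude[i] * centers[index[i], ]
}
mav <- mean(abs(x - recon))   # mean absolute reconstruction error
```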

As with the random points, having more clusters results in lower error, but the time series appears to be somewhat simpler in that error decreases more rapidly with increasing k than for the random points.

## The figures

| Image | Description |
|-------|-------------|
| cube-root.pdf | Shows how closely accuracy versus cluster count follows the theoretical considerations. The fit is very good except when the number of clusters becomes large relative to the number of data points. |
| points.pdf | Shows error versus number of clusters for training and test data in the case of randomly distributed points. The error on the test data is larger than on the training data, but continues to decrease with an increasing number of clusters. This indicates that some over-fitting is happening, but that larger numbers of clusters still improve the fit to new data. |
| time-series.pdf | Shows error versus number of clusters for training and test data in the case of time series data. This is very similar to the case of randomly chosen data points, but the shape of the curve indicates a lower dimension for the embedded data. The difference in shape is hard to see by eye, but it becomes apparent when fitting the data. |
| time-series-reconstruction.pdf | Shows how well the reconstructed signal (red) matches the original test signal (black). The blue line shows the reconstruction error; to avoid obscuring its features, it is offset downward by 1 unit. |

## Running the code

To run the code and regenerate the figures, make sure you have R installed and then run the following commands:

```
$ Rscript clustering.r
$ Rscript time-series.r
```

These scripts should take less than a minute to run. On a Mac, you can then open all of the figures in Preview with:

```
$ open *.pdf
```

## Getting Help

The code should be commented well enough to follow, especially with reference to this README and the associated blog. If you have problems, send me a question.
