Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering

About Research Work

Motivation:

Traditional approaches to clustering high-dimensional data apply dimensionality reduction and then run a classical clustering algorithm such as k-means in the lower-dimensional space.
However, these approaches inherit the limitations of k-means. The result depends heavily on how the cluster centroids are initialized, so clustering results are not repeatable across runs. k-means can also converge to a local optimum, so it does not guarantee an optimal clustering.

Proposed Approach:

An optimal one-dimensional clustering approach based on dimensionality reduction is proposed.
A one-dimensional representation of the high-dimensional data is obtained with Kernel Principal Component Analysis (Kernel PCA), using a suitable kernel function such as the Radial Basis Function (RBF).
This one-dimensional representation is then clustered optimally in polynomial time using a dynamic programming algorithm.
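
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code in this repository: the function names project_to_1d and optimal_1d_clustering are hypothetical, and the clustering criterion (minimum within-cluster sum of squared deviations) is an assumption, since the criterion used by c_thresholding_new.c is not stated here. The projection uses scikit-learn's KernelPCA with an RBF kernel.

```python
import numpy as np
from sklearn.decomposition import KernelPCA


def project_to_1d(X, gamma=None):
    """Project data onto the first kernel principal component (RBF kernel)."""
    kpca = KernelPCA(n_components=1, kernel="rbf", gamma=gamma)
    return kpca.fit_transform(X).ravel()


def optimal_1d_clustering(values, k):
    """Split 1-D values into k contiguous groups with minimum total
    within-cluster sum of squared deviations (assumed criterion).
    Runs in O(k * n^2) time. Requires k <= len(values)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    x = values[order]
    n = len(x)

    # Prefix sums give the cost of any sorted segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def cost(i, j):
        # Sum of squared deviations from the mean for x[i:j] (j exclusive).
        seg_sum = s1[j] - s1[i]
        return (s2[j] - s2[i]) - seg_sum * seg_sum / (j - i)

    # D[c][j]: best cost of splitting the first j sorted points into c clusters.
    # B[c][j]: start index of the last cluster in that best split.
    D = np.full((k + 1, n + 1), np.inf)
    B = np.zeros((k + 1, n + 1), dtype=int)
    D[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = D[c - 1][i] + cost(i, j)
                if candidate < D[c][j]:
                    D[c][j] = candidate
                    B[c][j] = i

    # Backtrack to recover the cluster boundaries.
    labels_sorted = np.empty(n, dtype=int)
    j = n
    for c in range(k, 0, -1):
        i = B[c][j]
        labels_sorted[i:j] = c - 1
        j = i

    # Map labels back to the original (unsorted) order.
    labels = np.empty(n, dtype=int)
    labels[order] = labels_sorted
    return labels, D[k][n]
```

Because an optimal partition of points on a line always consists of contiguous runs of the sorted values, the dynamic program only has to choose the k - 1 cut positions, which is why the result is deterministic and does not depend on any initialization. Pure-Python loops like these would be slow on thousands of samples, so the sketch is meant for small inputs only.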

Contribution:

Developed a program in Python and C that implements the proposed clustering approach.
Demonstrated the advantages of the proposed approach over traditional k-means-based approaches, using the Silhouette score as the measure of clustering quality.
Tested the proposed approach on a real-world high-dimensional dataset and a synthetic two-dimensional dataset.
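
As a rough illustration of the evaluation metric, the sketch below computes the Silhouette score for several k-means runs that differ only in their random initialization. It assumes scikit-learn's silhouette_score and KMeans rather than the silhouette.py and kmeans.py files in this repository, and it uses a toy make_blobs dataset as a stand-in for the real data; the helper name kmeans_silhouette is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in data (assumption): 300 points around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)


def kmeans_silhouette(X, k, seed):
    """Run k-means once from a random initialization and score the result."""
    model = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed)
    labels = model.fit_predict(X)
    return silhouette_score(X, labels)


# One score per initialization seed; the scores may differ between seeds,
# which is the initialization dependence described in the Motivation.
scores = [kmeans_silhouette(X, 4, seed) for seed in range(5)]
for seed, score in enumerate(scores):
    print(f"seed={seed} silhouette={score:.3f}")
```

The Silhouette score lies between -1 and 1, with higher values indicating tighter, better-separated clusters, so the same function can be applied to the labels produced by any clustering method for a like-for-like comparison.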

Programming Languages and Libraries

Programming Languages:

Python
C

Python Libraries:

NumPy
pandas
scikit-learn
Matplotlib
traceback
datetime

Program File Description and Usage

To run the program, execute main.py. It calls the code in the other files as required. A brief description of each program file follows:

paths.py - Specify the dataset, program, and output (result) paths in this file.
main.py - Main Python program to be executed.
c_thresholding_new.c (C program) - Implements optimal one-dimensional clustering.
kmeans.py - Implements the k-means clustering algorithm.
hungarian.py - Implements the Hungarian algorithm for class-cluster assignment.
silhouette.py - Calculates the Silhouette score for the resulting clusters.
gen_moonpairs.py - Generates a specified number of half-moon pairs at random positions and angles in the two-dimensional plane.
dataset.py - Loads and handles datasets.
labelscan.py - Converts non-numeric class categories to numeric labels.
logger.py - Logs important information (such as stack traces from runtime exception handling) and variable values during program execution.
plot_graph.py - Plots "Number of Clusters" (X-axis) against "Silhouette Score" (Y-axis) to compare clustering performance.
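
The class-cluster assignment step mentioned for hungarian.py can be illustrated with a short sketch. It is an assumption-laden example, not the repository's implementation: it relies on SciPy's linear_sum_assignment (SciPy is not in the library list above) and on a hypothetical helper name, match_clusters_to_classes. The idea is to count how many points each cluster shares with each class and then pick the one-to-one pairing that maximizes the total overlap.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clusters_to_classes(true_labels, cluster_labels):
    """Pair each cluster with one class so that total overlap is maximal."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)

    # contingency[i, j] = number of points in cluster i that belong to class j
    contingency = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            contingency[i, j] = np.sum((cluster_labels == c) & (true_labels == t))

    # Solve the assignment problem, maximizing the matched counts.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    accuracy = contingency[rows, cols].sum() / len(true_labels)
    return mapping, accuracy


true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
mapping, accuracy = match_clusters_to_classes(true, pred)
print(mapping, accuracy)
```

Cluster ids produced by a clustering algorithm are arbitrary, so a matching step of this kind is needed before cluster labels can be compared with ground-truth class labels.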

Datasets Used

English Letters.csv - Labeled real-world dataset containing 20,000 samples of the 26 letters of the English alphabet. Each sample is a 16-dimensional encoding, and there are 26 class labels, one per letter. The class labels (letter names) are in the same file.
moons_data.csv - Labeled dataset containing 25 half-moon pairs, that is, 50 clusters corresponding to 50 individual half moons. Any number of half-moon pairs can be generated with gen_moonpairs.py.
groundtruth.csv - Contains the class labels for the data points in moons_data.csv.
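
For readers who want a comparable synthetic dataset without running gen_moonpairs.py, the sketch below shows one way to produce half-moon pairs at random positions and angles. It is not the repository's generator: it assumes scikit-learn's make_moons as the base shape, and the function name generate_moon_pairs and the parameters samples_per_pair, spread, and noise are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_moons


def generate_moon_pairs(n_pairs, samples_per_pair=100, spread=10.0,
                        noise=0.05, seed=0):
    """Return (points, labels) for n_pairs randomly rotated and shifted
    half-moon pairs; each half moon gets its own integer label."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for p in range(n_pairs):
        X, y = make_moons(n_samples=samples_per_pair, noise=noise,
                          random_state=int(rng.integers(1 << 31)))
        # Random rotation angle for this pair.
        theta = rng.uniform(0.0, 2.0 * np.pi)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta), np.cos(theta)]])
        # Random position; the placement area grows with the number of pairs.
        offset = rng.uniform(0.0, spread * n_pairs, size=2)
        points.append(X @ R.T + offset)
        labels.append(y + 2 * p)  # two labels per pair: 2p and 2p + 1
    return np.vstack(points), np.concatenate(labels)


X, y = generate_moon_pairs(25)
print(X.shape, len(np.unique(y)))
```

With 25 pairs this yields 50 distinct labels, matching the structure described for moons_data.csv. The sketch does not prevent pairs from overlapping, so a real generator may need a placement check.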

Contact Information

Nachiket Bhide - [email protected]
Dr. Luis Rueda - [email protected]
