nachiket-bhide / masters-thesis-research-work

This repository contains all program files and datasets used to implement the master's thesis research work on the topic "Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering".

C 34.12% Python 65.88%
machine-learning clustering dimensionality-reduction kpca dynamic-programming validity-indices silhouette-score unsupervised-machine-learning k-means-clustering hungarian-algorithm radial-basis-function principal-component-analysis python numpy pandas scikit-learn matplotlib c optimal

Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering

About Research Work

Motivation:

  1. Traditional approaches for clustering high dimensional data involve dimensionality reduction followed by classical clustering algorithms such as k-means in lower dimensions.
  2. However, approaches based on k-means inherit its drawbacks: results depend heavily on the initialization of the cluster centroids and are therefore not repeatable across runs. k-means can also converge to a local optimum, so it does not guarantee an optimal clustering.

Proposed Approach:

  1. An optimal clustering approach in one dimension based on dimensionality reduction is proposed.
  2. A one-dimensional representation of the high-dimensional data is obtained using Kernel Principal Component Analysis (KPCA) with a suitable kernel function, such as the Radial Basis Function (RBF).
  3. The one-dimensional representation is then clustered optimally in polynomial time using a dynamic programming algorithm.
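
The one-dimensional clustering step can be illustrated with a minimal Python sketch. It is not the code in c_thresholding_new.c: the function name `optimal_1d_clustering` is hypothetical, and the sketch assumes the clustering criterion is the within-cluster sum of squared deviations, which this README does not state. It also assumes the one-dimensional values have already been produced by the KPCA step (for example with scikit-learn's `KernelPCA(n_components=1, kernel="rbf")`). Because an optimal one-dimensional clustering under this criterion consists of contiguous runs of the sorted values, dynamic programming over split positions finds it in O(k n^2) time.

```python
def optimal_1d_clustering(values, k):
    """Split 1-D values into k clusters with minimal within-cluster
    sum of squares (assumed criterion). Returns (labels, total_cost)."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(values)")

    # Work on sorted values; remember original positions for the labels.
    order = sorted(range(n), key=lambda i: values[i])
    xs = [values[i] for i in order]

    # Prefix sums of x and x^2 give the cost of any run in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i, j):
        """Sum of squared deviations of xs[i:j] from its mean."""
        total = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    inf = float("inf")
    # cost[m][j]: best cost of splitting xs[:j] into m clusters.
    # split[m][j]: start index of the last cluster in that best split.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    split[m][j] = i

    # Backtrack from the full problem to recover cluster boundaries.
    labels = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = split[m][j]
        for pos in range(i, j):
            labels[order[pos]] = m - 1
        j = i
    return labels, cost[k][n]


labels, cost = optimal_1d_clustering([1.0, 1.2, 0.9, 10.0, 10.1, 5.0, 5.2], 3)
print(labels, round(cost, 4))
```

Unlike k-means, this procedure has no random initialization, so repeated runs on the same input return the same clustering.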

Contribution:

  1. Developed a program in Python and C that implements the proposed clustering approach.
  2. Demonstrated the advantages of the proposed approach over traditional k-means based approaches, using the Silhouette score as the metric for clustering quality.
  3. Tested the proposed approach on a real-world high-dimensional dataset and a synthetic two-dimensional dataset.
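
For readers unfamiliar with the Silhouette score, the following is a minimal sketch of its standard definition, not the code in silhouette.py. The function name `silhouette_score` and the use of Euclidean distance are assumptions made for illustration. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; the point's score is `(b - a) / max(a, b)`, and the overall score is the mean over all points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance.
    points: sequence of equal-length coordinate tuples.
    labels: cluster label for each point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # Mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette_score(pts, [0, 0, 1, 1]))  # well separated: close to 1
print(silhouette_score(pts, [0, 1, 0, 1]))  # mixed clusters: much lower
```

Scores lie between -1 and 1, and higher values indicate compact, well-separated clusters, which is why the score can be compared across clustering methods and numbers of clusters.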

Programming Languages and Libraries

Programming Languages:

  1. Python
  2. C

Python Libraries:

  1. Numpy
  2. Pandas
  3. scikit-learn
  4. Matplotlib
  5. traceback
  6. datetime

Program file Description and Usage

To execute the program, run main.py. It automatically calls the code in the other files as required. A brief description of the program files follows:

  1. paths.py - Specify the dataset, program, and output (result) paths in this file.
  2. main.py - Main Python program to be executed.
  3. c_thresholding_new.c (C program) - Implements one dimensional optimal clustering.
  4. kmeans.py - Implements k-means clustering algorithm.
  5. hungarian.py - Implements Hungarian algorithm for class-cluster assignment.
  6. silhouette.py - Calculates Silhouette Score for resulting clusters.
  7. gen_moonpairs.py - Generates a specified number of half-moon pairs at random positions and angles in the two-dimensional plane.
  8. dataset.py - Used for loading and handling datasets.
  9. labelscan.py - Converts non-numeric class categories to numeric labels.
  10. logger.py - Logs important information (such as stack traces from runtime exception handling) and variable values during program execution.
  11. plot_graph.py - Plots the graph of "Numbers of Clusters"(X-axis) vs "Silhouette Score"(Y-axis) for comparing clustering performance.
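
As an illustration of what a half-moon pair generator might look like, here is a minimal sketch using only the Python standard library. It is not the code in gen_moonpairs.py: the function name `generate_moon_pairs` and all parameters (`points_per_moon`, `radius`, `noise`, `spread`, `seed`) are hypothetical. Each pair is built from two interleaved half circles, rotated by a random angle and shifted to a random center; each half moon gets its own label, so `n_pairs` pairs yield `2 * n_pairs` clusters.

```python
import math
import random


def generate_moon_pairs(n_pairs, points_per_moon=50, radius=1.0,
                        noise=0.05, spread=20.0, seed=None):
    """Return (points, labels) for n_pairs randomly placed half-moon pairs.
    All parameter names and defaults are illustrative assumptions."""
    rng = random.Random(seed)
    points, labels = [], []
    for pair in range(n_pairs):
        # Random center and orientation for this pair.
        cx, cy = rng.uniform(0.0, spread), rng.uniform(0.0, spread)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for half in range(2):
            for _ in range(points_per_moon):
                t = rng.uniform(0.0, math.pi)
                if half == 0:
                    # Upper half circle.
                    x = radius * math.cos(t)
                    y = radius * math.sin(t)
                else:
                    # Lower half circle, shifted so the two moons interleave.
                    x = radius - radius * math.cos(t)
                    y = 0.5 * radius - radius * math.sin(t)
                # Add Gaussian jitter, then rotate and translate.
                x += rng.gauss(0.0, noise)
                y += rng.gauss(0.0, noise)
                points.append((cx + x * cos_t - y * sin_t,
                               cy + x * sin_t + y * cos_t))
                labels.append(2 * pair + half)
    return points, labels


points, labels = generate_moon_pairs(25, points_per_moon=10, seed=0)
print(len(points), len(set(labels)))
```

Because the centers are drawn independently, pairs in this sketch may overlap; a real generator might reject or reposition overlapping pairs.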

Datasets Used

  1. English Letters.csv - Labeled real-world dataset containing 20,000 samples of the 26 letters of the English alphabet. Each sample is a 16-dimensional encoding, and there are 26 class labels, one per letter. The class labels (letter names) are in the same file.
  2. moons_data.csv - Labeled dataset containing 25 half-moon pairs, that is, 50 clusters corresponding to 50 individual half moons. Any number of half-moon pairs can be generated using the gen_moonpairs.py program.
  3. groundtruth.csv - Contains class labels for the data points in moons_data.csv.

Contact Information

  1. Nachiket Bhide - [email protected]
  2. Dr. Luis Rueda - [email protected]
