
smendowski / data-embedding-and-visualization


Visualization and embedding of large datasets using various Dimensionality Reduction (DR) techniques such as t-SNE, UMAP, PaCMAP and IVHD. Implementation of custom metrics to assess DR quality, with a complete explanation and workflow.

License: MIT License


data-embedding-and-visualization's Introduction

1. Introduction

Visualization makes it easier to understand high-dimensional data and to notice dependencies that are not trivial to capture. It is an integral part of data analysis and of initial data exploration, and also a field of machine learning in its own right. Visualization allows checking whether similar observations form clusters and helps build intuition about the data. Multidimensional and high-dimensional data must first be reduced to at most three dimensions. The relationships in the data are often non-linear, which makes linear methods such as PCA a poor choice with respect to separation quality. Manifold Learning techniques are therefore used to discover the surface (manifold) on which the data is distributed and to make reasonable projections into a space of the desired dimensionality. This project analyzes and visualizes the MNIST, 20 Newsgroups, and RCV Reuters datasets using t-SNE, UMAP, ISOMAP, PaCMAP and IVHD. The motivation is to present the concept of high-dimensional data visualization, assess multiple data embedding techniques, and highlight potential criteria for comparing data separation quality.

2. VISKIT

1. Configuration and Setup

Viskit Repository and README

git clone https://gitlab.com/bminch/viskit.git
docker build -t viskit -f Dockerfile .
docker run -it viskit /bin/bash

2. Graphs

Graphs are required by VisKit. For this project, they can be downloaded either manually or automatically.

source /utils/download_graphs.sh

Graphs location on Google Drive:
mnist_cosine.bin
mnist_euclidean.bin
reuters_cosine.bin
reuters_euclidean.bin
tng_cosine.bin
tng_euclidean.bin

3. Usage documentation

Provide the dataset without labels ({path_to_dataset_file}) and the labels ({path_to_labels_file}) as separate CSV files, together with the graph file ({path_to_graph_file}). The visualization is saved as a text file at the specified path ({path_to_visualization}).

cd /opt/viskit/viskit_offline
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization} 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization}
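
If the data is stored as a single labelled table, it has to be split before it can be passed to viskit_offline. The following is a minimal sketch of such a split using only the Python standard library. The function name split_dataset_and_labels, the absence of a header row, and the label sitting in the last column are assumptions made for illustration; the exact CSV layout expected by viskit_offline is not specified here and should be checked against the VisKit README.

```python
import csv


def split_dataset_and_labels(combined_csv, data_csv, labels_csv, label_column=-1):
    """Split a labelled CSV into a features-only file and a labels-only file.

    Assumptions (hypothetical layout): no header row, one observation per
    row, and exactly one label per row located at `label_column`.
    Returns the number of rows written.
    """
    n_rows = 0
    with open(combined_csv, newline="") as src, \
            open(data_csv, "w", newline="") as data_out, \
            open(labels_csv, "w", newline="") as labels_out:
        data_writer = csv.writer(data_out)
        labels_writer = csv.writer(labels_out)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            # Normalise negative indices such as -1 (last column).
            label_index = label_column % len(row)
            data_writer.writerow(row[:label_index] + row[label_index + 1:])
            labels_writer.writerow([row[label_index]])
            n_rows += 1
    return n_rows
```

For example, split_dataset_and_labels("mnist_with_labels.csv", "./datasets/mnist_data.csv", "./labels/mnist_labels.csv") would produce the two input files used in the usage examples, where mnist_with_labels.csv is a hypothetical combined file.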

4. Usage examples

cd /opt/viskit/viskit_offline
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt

3. Metrics

Metrics are used to assess and compare the quality of dimensionality reduction techniques. Two major aspects are worth including in the assessment: the local and the global quality of separation.
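
To illustrate the difference between the two aspects, here is a minimal sketch with one local and one global measure, written with the standard library only. It is not the repository's implementation: the function names are hypothetical, Euclidean distance is assumed in both spaces, and the brute-force loops are meant for small samples only. Local quality is approximated by the fraction of k nearest neighbours that each point keeps after embedding; global quality by the Spearman rank correlation between all pairwise distances before and after embedding.

```python
import math
from itertools import combinations


def knn_set(points, i, k):
    """Indices of the k nearest neighbours of point i (Euclidean, excluding i)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def knn_preservation(high, low, k):
    """Local quality: mean share of k nearest neighbours kept in the embedding."""
    n = len(high)
    shared = sum(len(knn_set(high, i, k) & knn_set(low, i, k)) for i in range(n))
    return shared / (n * k)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for index in order[start:end + 1]:
            ranks[index] = mean_rank
        start = end + 1
    return ranks


def spearman_distance_correlation(high, low):
    """Global quality: Spearman correlation of pairwise distances in both spaces."""
    pairs = list(combinations(range(len(high)), 2))
    r_high = average_ranks([math.dist(high[i], high[j]) for i, j in pairs])
    r_low = average_ranks([math.dist(low[i], low[j]) for i, j in pairs])
    mean_h = sum(r_high) / len(r_high)
    mean_l = sum(r_low) / len(r_low)
    cov = sum((a - mean_h) * (b - mean_l) for a, b in zip(r_high, r_low))
    var_h = sum((a - mean_h) ** 2 for a in r_high)
    var_l = sum((b - mean_l) ** 2 for b in r_low)
    return cov / math.sqrt(var_h * var_l)
```

Both functions take two equally long lists of coordinate tuples, the original observations and their embedding. An embedding can score well on one measure and poorly on the other, which is why both aspects are assessed.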

Implemented Metrics:

  1. Distance matrix-based metric
  2. Distance matrix-based metric with KMeans optimization
  3. KMeans extension of distance matrix based metric
  4. Trustworthiness-based metric
  5. Spearman correlation-based metric
  6. KNN Gain & DR Quality
  7. Shepard Diagram
  8. Co-ranking matrix-based metric
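
Two of the listed metrics have widely used textbook definitions, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository's notebooks: the function names are hypothetical, distances are Euclidean, ties are broken by index order, and the quadratic loops are suitable for small samples only. The co-ranking matrix counts, for every ordered pair of points, the neighbour rank in the original space against the neighbour rank in the embedding. Trustworthiness penalises points that enter a k-neighbourhood in the embedding without belonging to it in the original space.

```python
import math


def neighbour_ranks(points):
    """rank[i][j] is the position of j among the neighbours of i (nearest = 1)."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for position, j in enumerate(order, start=1):
            rank[i][j] = position
    return rank


def coranking_matrix(high, low):
    """q[a][b] counts pairs with rank a + 1 in the original space and b + 1 in the embedding."""
    n = len(high)
    rank_high, rank_low = neighbour_ranks(high), neighbour_ranks(low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[rank_high[i][j] - 1][rank_low[i][j] - 1] += 1
    return q


def trustworthiness(high, low, k):
    """Trustworthiness in [0, 1]; 1.0 means no false neighbours in the embedding."""
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    rank_high, rank_low = neighbour_ranks(high), neighbour_ranks(low)
    penalty = 0
    for i in range(n):
        for j in range(n):
            # j is a neighbour of i in the embedding but not in the original space.
            if i != j and rank_low[i][j] <= k and rank_high[i][j] > k:
                penalty += rank_high[i][j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

A perfect embedding yields a diagonal co-ranking matrix and a trustworthiness of 1.0. Mass far from the diagonal indicates neighbours that were pulled in or pushed out by the projection.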

4. [Appendix] Introduction to Dimensionality Reduction

Jupyter notebooks that cover basic and advanced topics in the visualization of large datasets and Dimensionality Reduction:

  1. Principal Component Analysis
  2. Swiss roll (roulade) projections using t-SNE and MDS
  3. f-MNIST and MNIST visualizations using t-SNE, UMAP and LargeVis
  4. Neural Networks hidden layers activations embedding

6. Authors

Mateusz Smendowski & Michał Grela
