The data-embedding-and-visualization's intro from smendowski

1. Introduction

Visualization makes it easier to notice and understand dependencies in high-dimensional data that are otherwise hard to capture. It is an integral part of data analysis and of initial data exploration, and also a standalone tool and an active field of machine learning. Visualization makes it possible to check whether similar observations form clusters and to build intuition about the data. For multidimensional and high-dimensional data, the number of dimensions must first be reduced to at most three. The relationships in the data are often non-linear, so linear methods such as PCA tend to give poor separation quality. Manifold Learning techniques are therefore needed to discover the surface (manifold) on which the data is distributed and to project it into a space of the desired dimensionality. This project analyzes and visualizes the MNIST, 20 Newsgroups, and RCV Reuters datasets using t-SNE, UMAP, ISOMAP, PaCMAP, and IVHD. The motivation is to demonstrate high-dimensional data visualization, assess multiple data embedding techniques, and highlight possible criteria for comparing data separation quality.

2. VISKIT

1. Configuration and Setup

Viskit Repository and README

git clone https://gitlab.com/bminch/viskit.git
docker build -t viskit -f Dockerfile .
docker run -it viskit /bin/bash

2. Graphs

Graphs are required by VisKit. For this project, they can be downloaded either manually or automatically.

source /utils/download_graphs.sh

Graphs location on Google Drive:
mnist_cosine.bin
mnist_euclidean.bin
reuters_cosine.bin
reuters_euclidean.bin
tng_cosine.bin
tng_euclidean.bin

3. Usage documentation

Provide the dataset without labels ({path_to_dataset_file}) and the labels ({path_to_labels_file}) as separate CSV files, together with the graph file ({path_to_graph_file}). The visualization is saved as a text file at the specified path ({path_to_visualization}).

cd /opt/viskit/viskit_offline
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization} 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline {path_to_dataset_file} {path_to_labels_file} {path_to_graph_file} {path_to_visualization}
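
VisKit expects the data and the labels in two separate CSV files. When a dataset is stored as a single CSV with the label in one column, the split can be scripted. The following is a minimal Python sketch, not part of the repository: the function name `split_dataset_and_labels` is hypothetical, and it assumes a comma-separated file with no header row and the label in the last column by default. Whether VisKit needs a particular delimiter or header is not stated here, so check the VisKit README before relying on this format.

```python
import csv


def split_dataset_and_labels(combined_csv, data_csv, labels_csv, label_column=-1):
    """Split a CSV holding features and one label column into two CSV files.

    Assumptions (not taken from VisKit): no header row, comma delimiter,
    and the label stored in `label_column` (last column by default).
    Returns the number of rows written.
    """
    rows = 0
    with open(combined_csv, newline="") as src, \
            open(data_csv, "w", newline="") as data_out, \
            open(labels_csv, "w", newline="") as labels_out:
        data_writer = csv.writer(data_out)
        labels_writer = csv.writer(labels_out)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            label_index = label_column % len(row)  # turn -1 into a real index
            features = [v for i, v in enumerate(row) if i != label_index]
            data_writer.writerow(features)
            labels_writer.writerow([row[label_index]])
            rows += 1
    return rows
```

The two output paths can then be passed as {path_to_dataset_file} and {path_to_labels_file}.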

4. Usage examples

cd /opt/viskit/viskit_offline
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt 2500 2 1 1 0 0 0 "force-directed"
./viskit_offline "./datasets/mnist_data.csv" "./labels/mnist_labels.csv" "./graphs/mnist.bin" ./visualization.txt
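
When several datasets or graphs have to be processed, the calls above can be generated from Python instead of being typed by hand. This is a hedged sketch: `build_viskit_command` and `run_viskit` are hypothetical helpers, not part of VisKit. The optional trailing values are passed through unchanged because their meaning is not documented in this README.

```python
import shlex
import subprocess


def build_viskit_command(dataset, labels, graph, output, extra_args=()):
    """Return the argument list for one viskit_offline run.

    extra_args is appended verbatim; no names are assigned to these
    positional values because this README does not describe them.
    """
    return ["./viskit_offline", dataset, labels, graph, output,
            *map(str, extra_args)]


def run_viskit(command, workdir="/opt/viskit/viskit_offline"):
    """Run the command from the VisKit directory (inside the Docker container)."""
    return subprocess.run(command, cwd=workdir, check=True)


cmd = build_viskit_command(
    "./datasets/mnist_data.csv",
    "./labels/mnist_labels.csv",
    "./graphs/mnist.bin",
    "./visualization.txt",
    extra_args=(2500, 2, 1, 1, 0, 0, 0, "force-directed"),
)
print(shlex.join(cmd))  # inspect the command line before running it
```

Calling `run_viskit(cmd)` only works where the `viskit_offline` binary exists, for example inside the container built in the setup step.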

3. Metrics

Metrics are used to assess and compare the quality of dimensionality reduction techniques. Two major aspects are worth including in the assessment: the local and the global quality of separation.

Implemented Metrics:

Distance matrix-based metric
Distance matrix-based metric with KMeans optimization
KMeans extension of distance matrix based metric
Trustworthiness-based metric
Spearman correlation-based metric
KNN Gain & DR Quality
Shepard Diagram
Co-ranking matrix-based metric
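
As an illustration of one entry in the list, trustworthiness penalizes points that appear among the k nearest neighbors in the embedding although they were not among the k nearest neighbors in the original space. The sketch below uses the commonly cited normalization, which requires k < n / 2. It is a plain-Python illustration, not the code used in this repository: the function names are hypothetical, distances are assumed Euclidean, and the brute-force neighbor search only suits small samples.

```python
import math


def _neighbor_order(points):
    """For each point i, the other indices sorted by Euclidean distance from i."""
    n = len(points)
    return [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]


def trustworthiness(high, low, k):
    """Trustworthiness T(k) in [0, 1]; 1 means the embedding adds no false neighbors."""
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    high_order = _neighbor_order(high)
    low_order = _neighbor_order(low)
    penalty = 0
    for i in range(n):
        # 1-based rank of each point j among the neighbors of i in the original space
        rank_high = {j: r for r, j in enumerate(high_order[i], start=1)}
        for j in low_order[i][:k]:
            # neighbor in the embedding that was ranked beyond k originally
            if rank_high[j] > k:
                penalty += rank_high[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

For real datasets, an optimized library routine such as `sklearn.manifold.trustworthiness` is likely a better choice than this loop.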

4. [Appendix] Introduction to Dimensionality Reduction

Jupyter notebooks that cover basic and advanced topics in the visualization of large datasets and dimensionality reduction.

5. Documentation

6. Authors

Mateusz Smendowski & Michał Grela
