Code Monkey home page Code Monkey logo

visualizing-datasets-with-facets's Introduction

Visualizing Machine Learning Datasets using Anaconda & Facets

Facets allows for easy visualization. For using Facets, first clone the git repository:

git clone https://github.com/PAIR-code/facets.git

To use the visualization capabilities, you will have to add an nbextension. Therefore, find the path to the facets-dist directory in the cloned git repo and execute the following line of code:

jupyter nbextension install facets-dist/ --user

In which case 'facets-dist' is the path to the respective folder.

If the above command still does not show the visualizations on the notebook, copy the file called facets-jupyter.html in 'facets/facets-dist' folder your local anaoconda file path '[anaconda_path]/share/jupyter/nbextensions/'. This is a known issue PAIR-code/facets#41

You might need to restart jupyter after this and proceed with the vizualisation. For a more detailed installation guide and updates, have a look at:

https://github.com/PAIR-code/facets

Do also install the protobuf package

conda install protobuf

# Add the facets overview python code to the python path
import sys
# FACETS_PATH is the full path to the python file in the clonde github repo of Facets.
# It should look similar to this: ".../facets/facets_overview/python"
# If you have cloned the facets repo to your current working directory, you can proceed.
# If you have chosen another location, just add it here.

FACETS_PATH = 'facets-master/facets_overview/python'
sys.path.append(FACETS_PATH)
import pandas as pd

train_data = pd.read_csv(
    "train.csv",
    #sep=r'\s*,\s*',
    engine='python',
    na_values="?")

test_data = pd.read_csv(
    "test.csv",
    #sep=r'\s*,\s*',
    engine='python',
    na_values="?")

test_salaries = pd.read_csv(
    "test_salaries.csv",
    #sep=r'\s*,\s*',
    engine='python',
    na_values="?")

test_data = pd.concat([test_salaries, test_data], axis=1)
# Calculate the feature statistics proto from the datasets and stringify it for use in 
# facets overview
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import base64

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train_data},
                                  {'name': 'test', 'table': test_data}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
# Display the facets overview visualization for this data
from IPython.core.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""

html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

png

png

Facets Overview provides a quick understanding of the distribution of values across the features of their datasets. Multiple datasets, such as a training set and a test set, can also be compared on the same visualization.

Common data issues that can hamper machine learning are pushed to the forefront, such as: unexpected feature values, features with high percentages of missing values, features with unbalanced distributions, and feature distribution skew between datasets.

Known Issues

The Facets visualizations currently work only in Chrome browsers

visualizing-datasets-with-facets's People

Contributors

kwokhing avatar

Watchers

 avatar

Forkers

judyxyang

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.