Code Monkey home page Code Monkey logo

datawig's Introduction

DataWig - Imputation for Tables

PyPI version GitHub license GitHub issues Build Status

DataWig learns Machine Learning models to impute missing values in tables.

See our user-guide and extended documentation here.

Installation

CPU

pip3 install datawig

GPU

If you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU bindings. Depending on your version of CUDA, you can do this by running the following:

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).

Running DataWig

The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:

Product Type Description Size Color
Shoe Ideal for Running 12UK Black
SDCards Best SDCard ever ... 8GB Blue
Dress This yellow dress M ?

Quickstart Example

For most use cases, the SimpleImputer class is the best starting point. For convenience there is the function SimpleImputer.complete that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:

import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric() 
# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)

You can also impute values in specific columns only (called output_column below) using values in other columns (called input_columns below). DataWig currently supports imputation of categorical columns and numeric columns.

Imputation of categorical columns

import datawig

df = datawig.utils.generate_df_string( num_samples=200, 
                                       data_column_name='sentences', 
                                       label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'], # column(s) containing information about the column we want to impute
    output_column='label', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

Imputation of numerical columns

import datawig

df = datawig.utils.generate_df_numeric( num_samples=200, 
                                        data_column_name='x', 
                                        label_column_name='y')         
df_train, df_test = datawig.utils.random_split(df)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'], # column(s) containing information about the column we want to impute
    output_column='y', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
             

In order to have more control over the types of models and preprocessings, the Imputer class allows directly specifying all relevant model features and parameters.

For details on usage, refer to the provided examples.

Acknowledgments

Thanks to David Greenberg for the package name.

Building documentation

git clone [email protected]:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html

Executing Tests

Clone the repository from git and set up virtualenv in the root dir of the package:

python3 -m venv venv

Install the package from local sources:

./venv/bin/pip install -e .

Run tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Updating PyPi distribution

Before updating, increment the version in setup.py.

git clone [email protected]:awslabs/datawig.git
cd datawig
# build local distribution for current version
python setup.py sdist
# upload to PyPi
twine upload --skip-existing dist/*

datawig's People

Contributors

tdhd avatar tammor avatar prathik-naidu avatar felixbiessmann avatar andrey-tpt avatar vinit777 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.