holoclean-legacy-deprecated's Introduction

HoloClean: A Machine Learning System for Data Enrichment

HoloClean is built on top of PyTorch and PostgreSQL.

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean allows data practitioners and scientists to save the enormous time they spend in building piecemeal cleaning solutions, and instead, effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.

Installation

HoloClean was tested on Python versions 2.7, 3.6, and 3.7. It requires PostgreSQL version 9.4 or higher.
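
As a quick check before installing, the running interpreter can be compared against the versions listed above. This is a minimal sketch and not part of HoloClean; the names is_tested_python and TESTED_VERSIONS are illustrative assumptions.

```python
import sys

# Versions the README reports as tested. Adjust this tuple (an assumption of
# this sketch) if the project documents other versions.
TESTED_VERSIONS = ((2, 7), (3, 6), (3, 7))


def is_tested_python(version_info=None):
    """Return True if the major.minor version is one the README lists as tested."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in TESTED_VERSIONS
```

Calling is_tested_python() with no argument checks the current interpreter; passing a tuple such as (3, 6) checks a specific version.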

1. Install and configure PostgreSQL

We describe how to install PostgreSQL and configure it for HoloClean (creating a database, a user, and setting the required permissions).

Option 1: Native installation of PostgreSQL

A native installation of PostgreSQL runs faster than docker containers. We explain how to install PostgreSQL then how to configure it for HoloClean use.

a. Installing PostgreSQL

On Ubuntu, install PostgreSQL by running $ apt-get install postgresql postgresql-contrib

For macOS, you can find the installation instructions on https://www.postgresql.org/download/macosx/

b. Setting up PostgreSQL for HoloClean

By default, HoloClean needs a database holo and a user holocleanuser with permissions on it.

  1. Start the PostgreSQL psql console from the terminal using
    $ psql --user <username>. You can omit --user <username> to connect as the current user.

  2. Create a database holo and user holocleanuser

CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;

You can connect to the holo database from the PostgreSQL psql console by running psql -U holocleanuser -W holo.
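
The same connection details can be collected into a single connection URI for client code. The sketch below is only an illustration using the default names from this section; the helper name holo_connection_uri, the host localhost, and the port 5432 are assumptions, and the code targets Python 3 (urllib.parse).

```python
from urllib.parse import quote


def holo_connection_uri(user="holocleanuser", password="abcd1234",
                        host="localhost", port=5432, dbname="holo"):
    """Build a libpq-style connection URI from the README's default values."""
    # Percent-encode the credentials so special characters stay valid in a URI.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, dbname)


print(holo_connection_uri())
# prints postgresql://holocleanuser:abcd1234@localhost:5432/holo
```

A URI of this form is accepted by psql directly, so it can also be used to verify the setup from the terminal.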

HoloClean currently populates the database holo with auxiliary and meta tables. To clear the database, connect as a root user or as holocleanuser to a database other than holo (PostgreSQL cannot drop the database of the current connection) and run

DROP DATABASE holo;
CREATE DATABASE holo;

Option 2: Using Docker

If you are familiar with docker, an easy way to start using HoloClean is to start a PostgreSQL docker container.

To start a PostgreSQL docker container, run the following command:

docker run --name pghc \
    -e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \
    -p 5432:5432 \
    -d postgres:11

which starts a backend server and creates a database with the required permissions.

You can then use docker start pghc and docker stop pghc to start/stop the container.

Note the port number, which may conflict with an existing PostgreSQL server. Read more about this docker image here.
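
To see whether the port is already taken before starting the container, a small probe like the following can help. It is a sketch under the assumption that the existing server listens on the loopback interface; port_in_use is a hypothetical helper name.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something already accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(1.0)
    try:
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()
```

If port_in_use(5432) returns True, one option is to map a different host port in the docker command, for example -p 5433:5432, and connect to that port instead.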

2. Setting up HoloClean

HoloClean runs on Python 2.7 or 3.6+. We recommend running it from within a virtual environment.

Creating a virtual environment for HoloClean

Option 1: Conda Virtual Environment

First, download Anaconda (not miniconda) from this link. Follow the steps for your OS and framework.

Second, create a conda environment (python 2.7 or 3.6+). For example, to create a Python 3.6 conda environment, run:

$ conda create -n hc36 python=3.6

Upon starting/restarting your terminal session, you will need to activate your conda environment by running

$ conda activate hc36

Option 2: Set up a virtual environment using pip and Virtualenv

If you are familiar with virtualenv, you can use it to create a virtual environment.

For Python 3.6, create a new environment with your preferred virtualenv wrapper, for example:

Install virtualenv either by following the instructions here or via pip.

$ pip install virtualenv

Then create a new directory and set up a Python 3.6 virtualenv environment inside it

$ mkdir -p hc36
$ virtualenv --python=python3.6 hc36

where python3.6 is a valid reference to a Python 3.6 executable.

Activate the environment

$ source hc36/bin/activate

Install the required python packages

Note: make sure that the environment is activated throughout the installation process. When you are done, deactivate it using conda deactivate, source deactivate, or deactivate depending on your version.
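
One way to confirm that an environment is active before installing is to ask the interpreter itself. The following is a rough sketch, not a HoloClean utility: active_environment is a hypothetical name, and the conda check relies on the CONDA_DEFAULT_ENV variable that conda sets on activation (it is also set when only the base environment is active).

```python
import os
import sys


def active_environment(prefix=None, base_prefix=None, environ=None):
    """Return 'virtualenv', 'conda', or None for the running interpreter."""
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        # Older virtualenv releases set sys.real_prefix; venv and newer
        # virtualenv releases set sys.base_prefix instead.
        base_prefix = getattr(sys, "real_prefix",
                              getattr(sys, "base_prefix", prefix))
    environ = os.environ if environ is None else environ
    if prefix != base_prefix:
        return "virtualenv"
    if environ.get("CONDA_DEFAULT_ENV"):
        return "conda"
    return None
```

Running active_environment() inside the activated hc36 environment should return a non-None value; None suggests the packages would be installed into the system interpreter.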

In the project root directory, run the following to install the required packages. Note that this command installs the packages within the activated virtual environment.

$ pip install -r requirements.txt

Note for macOS users: you may need to install the Xcode developer tools using xcode-select --install.

Running HoloClean

See the code in examples/holoclean_repair_example.py for a documented usage example of HoloClean.

In order to run the example script, run the following:

$ cd examples
$ ./start_example.sh

Notice that the script sets up the Python path environment to run HoloClean.

holoclean-legacy-deprecated's People

Contributors

aayushshah15, ah89, codelionx, epang080516, gmichalo, ihabilyas, j48zheng, jvonderwell, jw-mcgrath, laferrieren, matrixpachi-w, minafarid

holoclean-legacy-deprecated's Issues

Dataset Class

-clean_data
-not_know data
-feature
-labels

methods
init.model()
discoverdomain()
d.X
d.Y
d.W

Dataset Descriptions

Quite a few tables were added recently with no descriptions in dataset.py. Most are self-explanatory but we should add quick comments to be consistent with what we were doing before.

errordetector.py

  1. Remove sys.add etc from beginning. Just link to holoclean env or whatever is needed.
  2. Run pep8

featurization.py

Fix typos: one example

  1. _query_for_featurization_of_cooncur -> cooccur

Also Class Featurizer should be a template class.
DCFeaturizer
InitValFeaturizer
CooccurFeaturizer

should be different subclasses

We should also have FeaturizationEngine class or something that takes as input a set of featurizers and applies them to the input dataset.
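
The proposed structure could look roughly like the sketch below. Only the class names come from this issue; the featurize method, the dictionary-based dataset, and the feature format are assumptions made for illustration, and two of the subclasses are left as stubs.

```python
from abc import ABC, abstractmethod


class Featurizer(ABC):
    """Template class: every featurizer turns a dataset into feature rows."""

    @abstractmethod
    def featurize(self, dataset):
        """Return a list of (cell_id, feature_name) pairs for the dataset."""


class InitValFeaturizer(Featurizer):
    # Illustrative logic: one feature per cell naming its initial value.
    def featurize(self, dataset):
        return [(cell, "init=%s" % value) for cell, value in sorted(dataset.items())]


class DCFeaturizer(Featurizer):
    # Stub: real logic would derive features from denial constraints.
    def featurize(self, dataset):
        return []


class CooccurFeaturizer(Featurizer):
    # Stub: real logic would derive features from value co-occurrences.
    def featurize(self, dataset):
        return []


class FeaturizationEngine(object):
    """Applies a set of featurizers to a dataset and collects their output."""

    def __init__(self, featurizers):
        self.featurizers = list(featurizers)

    def apply(self, dataset):
        features = []
        for featurizer in self.featurizers:
            features.extend(featurizer.featurize(dataset))
        return features
```

With this layout, adding a new signal means adding one subclass and passing an instance to the engine, without touching the other featurizers.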

  1. Table names are off again. Remove table1 table2 table3 etc. Give them names!

  2. Add comments.

Review dataengine module

Since we have adopted the new structure, could you take a look at dataengine.py and point me to any parts you think might be erroneous?

Thanks in advance

Review README.md for any clarifications or additions needed

Review the Installation and send me errors you encountered, at which step you encountered them and how it was resolved on your machine.
In general, any step that is not clear enough (i.e. you would have/had questions about during installation), let me know.

Badges

Coverage, Travis, Code review

Changes on dataengine.py

  1. Rename "_register_meta_table" to something with a meaning. Remove the word register. We are creating a table.

  2. Add comments to _add_meta method to capture high-level logic

  3. Alignment is off in method query

Holocleansession class

-Member:
Spark Session
Data Engine
Dataset={}

-Constructor:
start spark session
start DataEngine

-Methods:
ingest Dataset (name, handle)
-> ingestor=new
DeletedErrors( d, [detection methods])

Adding flags for API functions

We should have flags (with defaults) for things like

  • verbose/debug mode
  • learning rates
  • training iterations
  • batch size for neural net
  • weight decay
  • momentum
  • domain pruning threshold

Essentially, any parameter to the model for learning should be at the surface level
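
One possible shape for such flags is a table of defaults that callers override by keyword. The flag names and default values below are placeholders chosen for this sketch, not values taken from HoloClean.

```python
# Hypothetical flag names and default values, one per item in the list above.
DEFAULT_FLAGS = {
    "verbose": False,
    "learning_rate": 0.001,
    "training_iterations": 10,
    "batch_size": 32,
    "weight_decay": 0.0,
    "momentum": 0.0,
    "pruning_threshold": 0.5,
}


def resolve_flags(**overrides):
    """Merge caller overrides into the defaults, rejecting unknown flags."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise ValueError("unknown flags: %s" % ", ".join(sorted(unknown)))
    flags = dict(DEFAULT_FLAGS)
    flags.update(overrides)
    return flags
```

For example, resolve_flags(verbose=True, batch_size=64) keeps every other default, while a misspelled flag name fails immediately instead of being silently ignored.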

Changes in dataset.py

  1. biases should be Biases. Consistency

  2. Remove line 27. I have no idea what it does.

  3. So in dataset, the moment I want to access a specific table I need to know its id in the attributes list? This is not good practice. Convert them to simple members initialized to point to a DB table or DB table name, whatever you have. Update all code that calls Dataset.attributes[id] to use the proper member of the dataset instance.

  4. self.table_name[0]=self.dataset_id what is table_name? The code has to be readable
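
Points 3 and 4 could be addressed along the lines of the sketch below, where each table is a named member instead of a position in a list. The member names and the naming scheme for tables are hypothetical; the real ones would come from dataset.py.

```python
class Dataset(object):
    """Sketch: expose each table as a named member, not a list index."""

    def __init__(self, dataset_id):
        self.dataset_id = dataset_id
        # Hypothetical table roles; each member holds a DB table name.
        self.init_table = "%s_init" % dataset_id
        self.clean_cells_table = "%s_clean_cells" % dataset_id
        self.dirty_cells_table = "%s_dirty_cells" % dataset_id
        self.feature_table = "%s_feature" % dataset_id

    def table_names(self):
        """Return all table names, for code that needs to iterate over them."""
        return [self.init_table, self.clean_cells_table,
                self.dirty_cells_table, self.feature_table]
```

Callers then write dataset.feature_table rather than Dataset.attributes[id], so the code states which table it means.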

Database column type

For this issue, look at the denial constraints and set each column's type according to whether the operators applied to it are symmetric or not.
The symmetric operators {=, !=} work for both numbers and strings, even when numbers are stored as TEXT.
The other operators {<, >, >=, <=}, however, require a NUMERICAL type.
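
The rule can be sketched as a small function that receives the operators applied to one column across all denial constraints. The function name column_type is an assumption, and the sketch returns NUMERIC because that is how PostgreSQL spells its numeric type.

```python
SYMMETRIC_OPS = {"=", "!="}
ORDERING_OPS = {"<", ">", "<=", ">="}


def column_type(operators):
    """Pick a column type from the operators applied to that column in the DCs."""
    ops = set(operators)
    unknown = ops - SYMMETRIC_OPS - ORDERING_OPS
    if unknown:
        raise ValueError("unsupported operators: %s" % ", ".join(sorted(unknown)))
    # Any ordering comparison needs numeric semantics; otherwise TEXT suffices.
    return "NUMERIC" if ops & ORDERING_OPS else "TEXT"
```

A column that appears only under = and != stays TEXT, while a single ordering comparison anywhere in the constraints is enough to make it NUMERIC.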

Cleanup Open Issues

Please close anything that does not contain relevant information to start fresh.

DataEngine class

-Member:
jodbc_connector
datasets={}

-Constructor:
open connection

-Methods:
register dataset (datasetname, schema)
[create a table in db]
register dataset(datasetname)
load(data,dataframe)
retrieve(dataframe, sql_query) [return dataframe]

Review The Factor Graph

We just created wrapper.py in the learning folder, and we call it from the _numskull(self) method of the Session class in holoclean.py, which is the best place to start reviewing the code. In the test folder, test.py runs and prints the weights before and after learning at the end, but they are the same.

Required White Space at End of DC Files

Right now if there's not a blank new line at the end of a denial constraint file, it errors out and fails to load the DC's correctly. Not a huge deal, just a little janky. I'll clean it up.
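
A reader that does not depend on the trailing newline might look like this sketch. It assumes one constraint per line and does not reflect the actual parser in dcparser.py; read_constraints is a hypothetical name.

```python
def read_constraints(text):
    """Return the non-empty constraint lines, with or without a trailing newline."""
    # splitlines() handles a missing final newline, and the filter drops
    # blank lines anywhere in the file.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With this approach a file ending in "last constraint" and one ending in "last constraint\n\n" yield the same list.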

Neural Networks for Feature Extraction and Prediction

Statistics

Dataset Hospital Food
# of Clean Cells 1860
# of Dirty Cells 6140

Logistic Regression

Dataset Hospital Food
Precision 0.9879
Recall 0.6727

Baseline

Default Value: 0.9428/0.4136
No cooccur: 0.9882/0.6818
No cooccur and No init(only DC): 0.9897/0.8227
No DC: 0.9641/0.0

+ Embedding Feature

FC1+FC2+FC3+CONCATE: 0.9428/0.4136
FC1+FC3+CONCATE: 0.9428/0.4136
FC1+FC2+FC3+SUM: 0.9879/0.6727

0.9897/0.8227
TRAIN: 64 out of 1860 are not default value in predictions
TEST: 74 out of 6140 are not default value in predictions

Changes on dcparser.py

  1. Add comments
  2. Run pep8 and fix formatting
  3. _dc_to_Sql_condition -> _dc_to_sql_condition lower case sql
