
HoloClean: A Machine Learning System for Data Enrichment

HoloClean is built on top of PyTorch and PostgreSQL.

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean allows data practitioners and scientists to save the enormous time they spend building piecemeal cleaning solutions, and instead effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.

Installation

HoloClean was tested on Python versions 2.7, 3.6, and 3.7. It requires PostgreSQL version 9.4 or higher.

1. Install and configure PostgreSQL

We describe how to install PostgreSQL and configure it for HoloClean (creating a database, a user, and setting the required permissions).

Option 1: Native installation of PostgreSQL

A native installation of PostgreSQL runs faster than a Docker container. We first explain how to install PostgreSQL and then how to configure it for HoloClean.

a. Installing PostgreSQL

On Ubuntu, install PostgreSQL by running $ apt-get install postgresql postgresql-contrib

For macOS, you can find the installation instructions at https://www.postgresql.org/download/macosx/

b. Setting up PostgreSQL for HoloClean

By default, HoloClean needs a database holo and a user holocleanuser with permissions on it.

  1. Start the PostgreSQL psql console from the terminal using
    $ psql --user <username>. You can omit --user <username> to connect as the current OS user.

  2. Create a database holo and user holocleanuser

CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;

You can connect to the holo database from the PostgreSQL psql console by running psql -U holocleanuser -W holo.
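For programmatic access with the same defaults, the settings above correspond to a SQLAlchemy-style connection URL. A minimal sketch (the host and port are assumptions for a local installation; HoloClean's own connection code may differ):

```python
# Build a SQLAlchemy-style Postgres URL from the default HoloClean settings.
# User, password, and database name come from the setup steps above; host and
# port are assumptions for a local installation.
db_user = "holocleanuser"
db_pwd = "abcd1234"
db_host = "localhost"
db_port = 5432
db_name = "holo"

url = "postgresql://{}:{}@{}:{}/{}".format(db_user, db_pwd, db_host, db_port, db_name)
print(url)  # postgresql://holocleanuser:abcd1234@localhost:5432/holo
```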

HoloClean currently populates the database holo with auxiliary and meta tables. To clear the database, connect as a superuser or as holocleanuser and run

DROP DATABASE holo;
CREATE DATABASE holo;

Option 2: Using Docker

If you are familiar with docker, an easy way to start using HoloClean is to start a PostgreSQL docker container.

To start a PostgreSQL docker container, run the following command:

docker run --name pghc \
    -e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \
    -p 5432:5432 \
    -d postgres:11

which starts a backend server and creates a database with the required permissions.

You can then use docker start pghc and docker stop pghc to start/stop the container.

Note that the mapped port (5432) may conflict with an existing local PostgreSQL server; remap it if necessary. Read more about this Docker image here.

2. Setting up HoloClean

HoloClean runs on Python 2.7 or 3.6+. We recommend running it from within a virtual environment.

Creating a virtual environment for HoloClean

Option 1: Conda Virtual Environment

First, download Anaconda (not miniconda) from this link. Follow the steps for your OS and framework.

Second, create a conda environment (python 2.7 or 3.6+). For example, to create a Python 3.6 conda environment, run:

$ conda create -n hc36 python=3.6

Upon starting/restarting your terminal session, you will need to activate your conda environment by running

$ conda activate hc36
Option 2: Set up a virtual environment using pip and Virtualenv

If you are familiar with virtualenv, you can use it to create a virtual environment.

For Python 3.6, create a new environment with your preferred virtualenv wrapper, for example:

Either follow instructions here or install via pip.

$ pip install virtualenv

Then create a new directory and initialize a Python 3.6 virtualenv in it

$ mkdir -p hc36
$ virtualenv --python=python3.6 hc36

where python3.6 is a valid reference to a Python 3.6 executable.

Activate the environment

$ source hc36/bin/activate

Install the required python packages

Note: make sure that the environment is activated throughout the installation process. When you are done, deactivate it using conda deactivate, source deactivate, or deactivate depending on your version.

In the project root directory, run the following to install the required packages. Note that this command installs the packages within the activated virtual environment.

$ pip install -r requirements.txt

Note for macOS Users: you may need to install XCode developer tools using xcode-select --install.

Running HoloClean

See the code in examples/holoclean_repair_example.py for a documented usage of HoloClean.

In order to run the example script, run the following:

$ cd examples
$ ./start_example.sh

Notice that the script sets up the Python path environment to run HoloClean.


holoclean's Issues

Datasets and Constraint files

Hello,

Would it be possible to obtain the Physicians and Foods datasets mentioned in the paper, as well as the denial constraints used for these datasets and for the flight dataset?

Thank you,
Niki

Training and Inferring in different datasets

Hi,

As part of the testing I have been doing with HoloClean, I was trying to figure out the following use case:

Train on a subset of the dataset and then infer/correct wrong cells on the complete dataset.
Loading the new dataset between fit_repair_model() and infer_repairs() would not work as intended, because the compilation step requires the initial observed cells.

Are these assumptions correct?
Thanks a lot

Unable to process large dataset of Food Inspections

Hi,
I am facing a few issues executing HoloClean over the Chicago Food Inspections dataset described in the paper.
However, it turns out that a machine with 32 GB of RAM and a 100 GB SSD is unable to process it; I am noticing memory leaks, as well as problems with query results:

  1. Large data in the pandas dataframe

14:12:20 - [ERROR] - generating aux_table pos_values Traceback (most recent call last): File "/home/ubuntu/hc36/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context cursor, statement, parameters, context File "/home/ubuntu/hc36/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute cursor.execute(statement, parameters) psycopg2.DatabaseError: out of memory for query result

  2. Since the above data frame was not used in subsequent operations, I moved forward by simply bypassing it for this code block. However, after 4 hours of execution, I am now facing an issue where the log just shows Killed.
    21:09:36 - [DEBUG] - Time to execute query: 0.00 secs 21:09:36 - [DEBUG] - featurizing training data... 21:09:43 - [DEBUG] - Time to execute query: 4.46 secs Killed
    It seems torch.cat is blowing up memory here. Running a sample for hospital and performing memory analysis with little data, it turned out the tensors' footprint is 228 MB, which grows further, to 1.4 GB, when combining all tensors.
    This does not scale well if we need to process a larger dataset of 2-4 GB in a day. Any suggestions?
    (memory profiling screenshot omitted)

  3. Also, we noticed that the archived version was compatible with Spark. Is there a specific reason for moving away from Spark? We are planning to use Spark to manage huge datasets with HoloClean.
    Any suggestions?
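The torch.cat memory blow-up mentioned above is a general pattern: concatenating many feature tensors allocates the full result while all the pieces are still alive, roughly doubling peak memory. A common mitigation is to preallocate the output once and fill it slice by slice; below is a minimal sketch of the idea with NumPy standing in for PyTorch (shapes are illustrative, not HoloClean's actual ones):

```python
import numpy as np

# Simulate many per-cell feature chunks that would otherwise be concatenated.
n_chunks, chunk_rows, n_feats = 100, 50, 8
chunks = [np.ones((chunk_rows, n_feats), dtype=np.float32) for _ in range(n_chunks)]

# Naive approach: np.concatenate(chunks) (or torch.cat) allocates the full
# output while every chunk is still referenced, roughly doubling peak memory.

# Preallocate once and copy each chunk into its slice instead.
out = np.empty((n_chunks * chunk_rows, n_feats), dtype=np.float32)
for i, chunk in enumerate(chunks):
    out[i * chunk_rows:(i + 1) * chunk_rows] = chunk
    chunks[i] = None  # drop the reference so the chunk can be freed early
```

The same pattern applies to torch tensors via `torch.empty` and slice assignment, at the cost of knowing the total size up front.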

Correlation for categorical columns

Hi

I was trying HoloClean on a geographical dataset, which has columns like state code, state name, and zip code. I wrote the rule "if two records have the same state code, then the state name should be the same".

But it was not correcting any records at all. Then I looked at the correlation between state code and state name, which was below the default threshold used by HoloClean (0.1), so no repair was made.

To find the correlation between categorical columns, HoloClean first label-encodes them and then checks for correlation, but I believe label encoding is not a good basis for measuring correlation between two categorical columns. We can say that state code and state name are correlated columns, but label encoding cannot capture that reliably.

I tried a statistical test (chi-square) instead, and it gave a strong association between state code and state name, so the errors were corrected.
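For reference, the chi-square statistic for two categorical columns can be computed straight from their contingency table. The following is a minimal pure-Python sketch of that test (not HoloClean's code; the state data is a toy sample):

```python
from collections import Counter

def chi_square(xs, ys):
    """Chi-square statistic of independence for two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    margin_x = Counter(xs)
    margin_y = Counter(ys)
    stat = 0.0
    for (x, y), observed in joint.items():
        expected = margin_x[x] * margin_y[y] / n
        stat += (observed - expected) ** 2 / expected
    # Cells with observed == 0 contribute exactly their expected count.
    for x in margin_x:
        for y in margin_y:
            if (x, y) not in joint:
                stat += margin_x[x] * margin_y[y] / n
    return stat

# Perfectly dependent columns: state code determines state name.
codes = ["CA", "CA", "NY", "NY", "TX", "TX"]
names = ["California", "California", "New York", "New York", "Texas", "Texas"]
print(round(chi_square(codes, names), 6))  # 12.0, the maximum n * (k - 1) for this table
```

Normalizing by `n * (k - 1)` (Cramer's V squared) gives a 0-1 association score that could be compared against a threshold like HoloClean's 0.1.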

I have two questions to ask:

  1. Was there a specific reason for using label encoding to find the correlation between two categorical columns?
  2. Since repairs are done only on columns mentioned in rules, how is HoloClean better than a rule-based approach? We could simply write code to correct errors for two columns, saying that if two state codes are the same, then the state names should be the same (following the concept of minimality).

pos_values df in memory

Hi,

pos_values is used in the get_infer_dataframes() method in repair.py, and the method reads the values via a SQL query, not from the DataFrame. Since pos_values is large, we can free the memory used by the DataFrame, which is kept in memory but never used.

I can add a patch and send a pull request, if that is fine with you.
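The proposed change amounts to reading pos_values with SQL when needed and not keeping a redundant DataFrame alive. A minimal sketch of the pattern, with the standard-library sqlite3 standing in for Postgres (table and column names are illustrative, not HoloClean's exact schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pos_values (_vid_ INTEGER, attribute TEXT, rv_val TEXT)")
conn.executemany(
    "INSERT INTO pos_values VALUES (?, ?, ?)",
    [(1, "state", "CA"), (1, "state", "NY"), (2, "city", "Austin")],
)

# Instead of materializing the whole table in a DataFrame and keeping it
# around, fetch only the rows needed for inference straight from the database.
rows = conn.execute(
    "SELECT _vid_, rv_val FROM pos_values WHERE attribute = ?", ("state",)
).fetchall()
print(rows)  # [(1, 'CA'), (1, 'NY')]
```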

Table cell_distr is either empty or does not exist

While running the example, I keep running into two alternating errors.
After HoloClean is done training the repair model, I run into this error:

[ERROR] - generating aux_table cell_distr
Traceback (most recent call last):
  File "/home/amit/Documents/holoclean_project/randomDB_sim.py", line 55, in <module>
    hc.repair_errors(featurizers)
  File "holoclean_master/holoclean.py", line 327, in repair_errors
    status, infer_time = self.repair_engine.infer_repairs()
  File "holoclean_master/repair/repair.py", line 49, in infer_repairs
    self.ds.generate_aux_table(AuxTables.cell_distr, distr_df, store=True, index_attrs=['_vid_'])
  File "holoclean_master/dataset/dataset.py", line 150, in generate_aux_table
    self.aux_table[aux_table].store_to_db(self.engine.engine)
  File "holoclean_master/dataset/table.py", line 75, in store_to_db
    self.df.to_sql(self.name, db_conn, if_exists=if_exists, index=index, index_label=index_label)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/core/generic.py", line 2663, in to_sql
    method=method,
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/io/sql.py", line 521, in to_sql
    method=method,
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/io/sql.py", line 1316, in to_sql
    table.create()
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/io/sql.py", line 648, in create
    self.pd_sql.drop_table(self.name, self.schema)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/io/sql.py", line 1363, in drop_table
    self.meta.reflect(only=[table_name], schema=schema)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4159, in reflect
    Table(name, self, **reflect_opts)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 469, in __new__
    metadata._remove_table(name, schema)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 67, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 277, in reraise
    raise value
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 464, in __new__
    table._init(name, metadata, *args, **kw)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 552, in _init
    metadata, autoload_with, include_columns, _extend_on=_extend_on
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 574, in _autoload
    _extend_on=_extend_on,
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1596, in run_callable
    return callable_(self, *args, **kwargs)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 415, in reflecttable
    table, include_columns, exclude_columns, **opts
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/sqlalchemy/engine/reflection.py", line 639, in reflecttable
    raise exc.NoSuchTableError(table.name)
sqlalchemy.exc.NoSuchTableError: cell_distr

One solution I tried is to run DROP DATABASE holo; CREATE DATABASE holo; directly after training the repair model, which leads to:

[ERROR] - generating aux_table cell_distr
Traceback (most recent call last):
  File "/home/amit/Documents/holoclean_project/randomDB_sim.py", line 55, in <module>
    hc.repair_errors(featurizers)
  File "holoclean_master/holoclean.py", line 327, in repair_errors
    status, infer_time = self.repair_engine.infer_repairs()
  File "holoclean_master/repair/repair.py", line 49, in infer_repairs
    self.ds.generate_aux_table(AuxTables.cell_distr, distr_df, store=True, index_attrs=['_vid_'])
  File "holoclean_master/dataset/dataset.py", line 152, in generate_aux_table
    self.aux_table[aux_table].create_df_index(index_attrs)
  File "holoclean_master/dataset/table.py", line 90, in create_df_index
    self.df.set_index(attr_list, inplace=True)
  File "/home/amit/anaconda3/envs/hc36/lib/python3.6/site-packages/pandas/core/frame.py", line 4303, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['_vid_'] are in the columns"

Placing a print statement inside create_df_index makes it clear that self.df is empty, so it is difficult to say exactly what is happening.

PostgreSQL < 9.4 not supported

Running HoloClean with a PostgreSQL version < 9.4 generates the following error:

ERROR setting up domain: For table: pos_values (psycopg2.ProgrammingError) syntax error at or near "WITH ORDINALITY"
LINE 1: ...y(regexp_replace(domain,'[{""}]','','gi'),'|||')) WITH ORDIN...
^

This occurs in domain.py, function store_domains (ll. 57-66). The table pos_values is not created during the store_domains call which breaks the code.

The cause of the problem is that WITH ORDINALITY was first introduced in PostgreSQL 9.4; I can only work with 9.2 at the moment, though.

My quick workaround uses the following query instead of the one at l. 64 of domain.py in order to create pos_values (based on this Stack Overflow answer):

query = "select *, row_number() over (partition by vid, cid, tid) as val_id from (select vid, cid, tid, attribute, rv_val from (select *, unnest(string_to_array(regexp_replace(domain,'[{\"\"}]','','gi'),'|||')) as rv_val from %s as placeholder_1) as placeholder_2) as placeholder_3" % AuxTables.cell_domain.name

The SQL part in more readable formatting:

select *, row_number() over (partition by vid, cid, tid) as val_id
from (select vid, cid, tid, attribute, rv_val
      from (select *, unnest(string_to_array(regexp_replace(domain,'[{""}]','','gi'),'|||')) as rv_val
            from %s as placeholder_1) as placeholder_2) as placeholder_3

Compatibility

It's not compatible with Windows; are there plans to add Windows compatibility?

Re-factor use of exception handlers

There are two unfortunate usage patterns of exception handlers in the codebase at the moment:

1. silencing exceptions with status messages

Example and fix.

We catch an exception and then only return the error message in the status variable. This is bad because it silences programming errors (like OutOfRangeError) that should be propagated. Note that the fix consists of two components:

  1. Using logging.error to add additional context information.
  2. Re-raising (not raising another exception) via raise with no arguments. This keeps the stack trace intact for debugging/logs.

2. catching exceptions and raising a wrapped exception

Example and fix.

Similar to 1. above, we should be using logging.error and raise with no arguments to add context and re-raise the exception.

Factor graph models

Hi,

This code base contains a linear model for the repair task, while the paper mentions a factor graph model. Does the linear model achieve the same performance as the factor graph model? If not, do you plan to release the code for the factor graph model as well?

Thanks,
Suresh

Example that does not use postgres

Hi,

Would it be possible for someone to write a simple example that uses HoloClean with a basic CSV, Parquet file, or SQLite database instead of Postgres? A HoloClean example without the Postgres dependency (and, even better, one that initially just uses pytorch-cpu) would make HoloClean feel accessible to a much larger user group.

How to incorporate external knowledge

Hi, this might be a stupid question, but during a project I am working on, I used HoloClean for a data repair task. I found, however, that there is no documentation on how external knowledge (like the example in the paper) can be incorporated. Could you advise, please?

Boolean value TypeError in start_example.sh script

After setting up a Docker container and a conda virtual environment (3.6) with the packages as versioned in requirements.txt, I have been attempting to run the start_example.sh script, but I am running into the attached error when the script tries to execute this query: SELECT t1._tid_ FROM "hospital" as t1 WHERE t1."Sex"='female' AND t1."Relationship"='husband' .

Changing versions also did not resolve the issue, but executing the query directly against the Postgres backend (in the Docker container) did not result in an error.

Any insights on the issue or how to resolve?

(error screenshot omitted)

Doubt about featurizer tensor size

Hi,
According to the code, the tensor is created as tensor[0][init_idx][attr_idx] = 1.0.
Can you please elaborate on why it is not tensor[0][attr_idx][init_idx] = 1.0 (axes 1 and 2 swapped)?

File: repair/featurize/initattfeat.py@12
From what I understand,
if each row in the tensor should correspond to an attribute, that should be axis 1, not 2.
Axis 2 should index the valid values for the attribute the row corresponds to.

Hope that makes sense.
If my thinking is wrong, could you please help me out here?
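To make the two layouts being compared concrete, here is a small sketch with NumPy (sizes and indices are illustrative only, not HoloClean's actual shapes):

```python
import numpy as np

n_attrs, n_vals = 3, 5      # illustrative sizes
init_idx, attr_idx = 4, 2   # index of the initial value, index of the attribute

# Layout used in the code: axis 1 indexes values, axis 2 indexes attributes.
t_code = np.zeros((1, n_vals, n_attrs))
t_code[0][init_idx][attr_idx] = 1.0

# Layout the question proposes: axis 1 indexes attributes, axis 2 indexes values.
t_prop = np.zeros((1, n_attrs, n_vals))
t_prop[0][attr_idx][init_idx] = 1.0

# In the proposed layout, each row of t_prop[0] is one attribute's
# one-hot vector over its candidate values.
print(t_prop[0][attr_idx])  # [0. 0. 0. 0. 1.]
```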

Repairs are no longer being found as the size of a dataset is increased

Hi,

I am trying to clean a few dirty rows with respect to 2 denial constraints:
t1&EQ(t1.Sex,"Female")&EQ(t1.Relationship,"Husband")
t1&EQ(t1.Sex,"Male")&EQ(t1.Relationship,"Wife")

When I use a small sample of my data (20 rows), I get the following correct repairs:
#(11, u'relationship', 'Husband', u'Husband')
#(11, u'sex', 'Female', u'Male')
#(4, u'relationship', 'Husband', u'Husband')
#(4, u'sex', 'Female', u'Male')

When I add some more rows, for example 1100, the values no longer change:
#(4, u'relationship', 'Husband', u'Husband')
#(4, u'sex', 'Female', u'Female')
#(11, u'relationship', 'Husband', u'Husband')
#(11, u'sex', 'Female', u'Female')
#(652, u'relationship', 'Husband', u'Husband')
#(652, u'sex', 'Female', u'Female')
#(689, u'relationship', 'Wife', u'Wife')
#(689, u'sex', 'Male', u'Male')
#(504, u'relationship', 'Husband', u'Husband')
#(504, u'sex', 'Female', u'Female')
#(26, u'relationship', 'Husband', u'Husband')
#(26, u'sex', 'Female', u'Female')
#(1084, u'relationship', 'Husband', u'Husband')
#(1084, u'sex', 'Female', u'Female')
#(703, u'relationship', 'Wife', u'Wife')
#(703, u'sex', 'Male', u'Male')

The same happens when I run on the full dataset, 48842 rows. The data is the same; e.g., the version with 20 rows just contains the first 20 rows from the full set.

My script is based on the example in tests/, and I use default settings (I tried some tweaking, but this did not solve the issue).
Noisy cells are detected correctly, and the generated possible domains contain multiple values.

I do not understand why this happens, or which problem or restriction prevents the repairs from happening on the larger versions of the dataset.

Code and data can be found in my fork: https://github.com/j-r77/holoclean

Wrong active attributes returned if detect_errors is not run before generating the domain

If hc.setup_domain() runs without hc.detect_errors(detectors) beforehand, the active attributes as well as the dk cells for domain generation will be read directly from the Postgres database (AuxTables.dk_cells.name), which may cause a "no field of name XXX" error.

We should state explicitly that when using HoloClean on a dataset, one must
either run error detection or load the dk cells from a specific relation.

Investigate missing values from self.single_stats

It was discovered in #23 (comment) that some values were missing from self.single_stats even though the values are derived in the same attribute column.

An example of a missing value

KeyError: u'surgeryxpatientsxneedingxhairxremovedxfromxthexsurgicalxareaxbeforexsurgery& xwhoxhadxhairxremovedxusingxaxsaferxmethodx(electricxclippersxorxhairxremovalxcreamx\xef\xbf\xbdcxnotxaxrazor)'

This is most likely due to an encoding issue when storing data in Postgres and subsequently retrieving it without decoding it properly.
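To illustrate the suspected failure mode: the \xef\xbf\xbd bytes in the key are the UTF-8 encoding of the Unicode replacement character U+FFFD, which appears when bytes are decoded with a mismatched codec and errors="replace". A minimal, self-contained illustration (general Python, not HoloClean's code):

```python
original = "cream \u2013 not a razor"  # en dash, as in the source data

# If the value travels as cp1252 bytes but is read back as UTF-8 with
# errors="replace", the dash byte is invalid UTF-8 and becomes U+FFFD,
# so the key no longer matches the one in single_stats.
stored = original.encode("cp1252")
retrieved = stored.decode("utf-8", errors="replace")
print("\ufffd" in retrieved)  # True

# U+FFFD re-encoded as UTF-8 is exactly the \xef\xbf\xbd seen in the KeyError.
print("\ufffd".encode("utf-8"))  # b'\xef\xbf\xbd'

# Using one consistent encoding on both sides avoids the mismatch.
roundtrip = original.encode("utf-8").decode("utf-8")
```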

Translation of simple DCs with a constant to SQL queries not working

Hi,

when I implement simple DCs that compare a given attribute to a constant, HoloClean does not generate a syntactically correct SQL query from the DC. I reproduced this behaviour on the iris dataset, as I can't share my data.

The data

The iris dataset and a single - semantically nonsensical, but syntactically valid - denial constraint that looks like the following:

t1&EQ(t1.sepal_length,"6.9")

The error

Error detection according to the DC works fine (the constraint flags 4 cells, which is expected from the data).

However, during the call of repair_errors, the following error occurs:

ProgrammingError: syntax error at or near "AND"
LINE 1: ...= t2._tid_ AND t2.attribute = 'sepal_length' AND  AND t2.rv_...

The query fails to execute due to the "AND AND". Some digging makes me think that the error occurs while setting up the featurizers, more specifically when the tensor for the DC featurizer is created (l. 17 in repair/featurize/featurize.py).

Any ideas on why that might happen? Perhaps a faulty template for the SQL queries?

Branch: best branch for testing purposes

Hi,

I just wanted to ask which branch I should use for testing HoloClean on some datasets.
Currently I am using master; however, I see that hcq-embedding-3 is several commits ahead of master.

Which branch would you suggest I use?

Is there any example code for flight dataset?

I have read the HoloClean paper, and I want to run an example on the flight dataset as my baseline. I found that the flight dataset is somewhat different from the hospital dataset. Besides, I lack the constraint rules for that experiment.

PyTorch multiprocessing pool causes OSError Too Many Files Open on Linux

Currently, using torch.multiprocessing in InitAttFeaturizer fails on Linux but not on macOS.
The reported error is OSError: Too many open files. Similar issues have been reported on the PyTorch GitHub repo:

Documentation for sharing strategies:

From the documentation:

file_descriptor is the default strategy (except for macOS and OS X where it’s not supported).

Error Creating Index due to Bad Chars

Error:

LINE 1: ...ATE INDEX census_0126 ON census_01 (migration_code-change_in...
                                                             ^
 [SQL: 'CREATE INDEX census_0126 ON census_01 (migration_code-change_in_msa)'] (Background on this error at: http://sqlalche.me/e/f405)
For table: census_01 ERROR while creating index for table census_01 on attributes ['migration_code-change_in_msa']

===================

Temporary solution added to my local holoclean/dataset/table.py:36:

import re

bad_chars = re.compile('[^a-zA-Z0-9]+')
renamed_cols = [bad_chars.sub(' ', col).strip().replace(' ', '_') for col in columns]
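As a quick sanity check of the workaround, applying it to the column name from the error above yields an index-safe identifier:

```python
import re

bad_chars = re.compile('[^a-zA-Z0-9]+')

col = "migration_code-change_in_msa"  # the column that broke CREATE INDEX
# Runs of non-alphanumeric characters become single spaces, which are
# then collapsed into underscores after trimming.
safe = bad_chars.sub(' ', col).strip().replace(' ', '_')
print(safe)  # migration_code_change_in_msa
```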
