
reef's Introduction

Reef: Overcoming the Barrier to Labeling Training Data

Code for VLDB 2019 paper Snuba: Automating Weak Supervision to Label Training Data

Reef is an automated system for labeling training data based on a small labeled dataset. Reef utilizes ideas from program synthesis to automatically generate a set of interpretable heuristics that are then used to label unlabeled training data efficiently.

Installation

Reef uses Python 2. The Python package requirements are listed in requirements.txt. If you have Snorkel installed, you can set the corresponding flag to True; otherwise, a simpler version of learning heuristic accuracies is included in this repo as well.

Reef Workflow Overview

The inputs to Reef are the following (a minimal sketch of both appears after this list):

  • A labeled dataset, which contains a numerical feature matrix and a vector of ground truth labels (currently only supports binary classification)
  • An unlabeled dataset, which contains a numerical feature matrix
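For concreteness, the snippet below is a minimal sketch of these two inputs, using the variable names that appear in the tutorial notebook. The shapes, the +1/-1 label convention, and the role of train_ground are illustrative assumptions rather than requirements stated in this README.

```python
import numpy as np

n_features = 10

# Small labeled dataset: numerical feature matrix plus ground-truth labels.
val_primitive_matrix = np.random.rand(100, n_features)   # (num_labeled_points, num_features)
val_ground = np.random.choice([-1, 1], size=100)          # binary labels (+1/-1 convention assumed)

# Larger unlabeled dataset: numerical feature matrix only.
train_primitive_matrix = np.random.rand(5000, n_features)

# In the tutorial, ground truth for the "unlabeled" set is also available and is
# used only to evaluate the generated labels (an assumption; the BaseModel issue
# below notes that train_ground is currently an optional argument).
train_ground = np.random.choice([-1, 1], size=5000)
```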

The overall workflow Reef follows to label training data automatically is encoded in generate_reef_labels.ipynb and in the main file program_synthesis/heuristic_generator.py. A sketch of the notebook's call pattern is shown next, followed by the individual steps it drives.
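The constructor and the run_synthesizer call below appear verbatim in the traceback of the first issue further down this page; run_verifier and find_feedback are assumed method names for the verifier and feedback steps, so treat this as a sketch of the loop rather than the exact notebook code.

```python
# Sketch of the Reef loop from generate_reef_labels.ipynb (not the exact code).
# The HeuristicGenerator constructor and run_synthesizer() call are copied from
# the traceback in the first issue below; run_verifier() and find_feedback()
# are assumed method names for the verifier and feedback steps.
from program_synthesis.heuristic_generator import HeuristicGenerator

hg = HeuristicGenerator(train_primitive_matrix, val_primitive_matrix,
                        val_ground, train_ground, b=0.5)

for _ in range(3):  # the number of iterations here is an arbitrary choice
    # Step 1: synthesize decision-tree heuristics over single features, keep the best 3.
    hg.run_synthesizer(max_cardinality=1, idx=None, keep=3, model='dt')
    # Steps 2-3: prune heuristics and learn their accuracies on the unlabeled data (assumed name).
    hg.run_verifier()
    # Step 4: find low-confidence points to feed back into the synthesizer (assumed name).
    hg.find_feedback()
```

Step by step, the workflow is the following: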

  1. Using the labeled dataset, Reef generates heuristics such as decision trees or small logistic regression models. The synthesis code is in program_synthesis/synthesizer.py.
    1. A heuristic is generated for each possible combination of c features, where c is the cardinality. For example, with c=1 and 10 features, 10 heuristics will be generated.
    2. For each generated heuristic, a beta parameter is calculated; this is the minimum confidence at which the heuristic will assign a label (below it, the heuristic abstains). Beta is chosen by maximizing the F1 score on the labeled dataset (see the sketch after this list).
  2. These heuristics are passed to a pruner that selects the best heuristic by maximizing a combination of its F1 score on the labeled dataset and its diversity, measured by how many points it labels that previously selected heuristics do not.
  3. The selected heuristic and the previously chosen heuristics are then passed to the verifier, which learns accuracies for the heuristics based on the labels they assign to the unlabeled dataset.
  4. Finally, Reef calculates the probabilistic labels the heuristics assign to the labeled dataset and passes data points with low-confidence labels back to the synthesizer. This procedure is repeated iteratively.
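To make the beta selection in step 1.2 concrete, here is a small self-contained illustration (not the repository's synthesizer code): the heuristic abstains on points whose predicted probability of the positive class lies within beta of 0.5, and the beta that maximizes F1 on the labeled dataset is kept. Counting abstained positives against recall is one reasonable reading of "maximizing the F1 score"; the repository may compute it differently.

```python
import numpy as np

def find_beta(probs, y_true, betas=np.linspace(0.0, 0.45, 10)):
    """Illustrative beta selection for one heuristic (step 1.2), not Reef's exact code.

    probs  : the heuristic's estimated P(y = +1) on the labeled dataset
    y_true : ground-truth labels in {-1, +1}
    """
    best_beta, best_f1 = 0.0, -1.0
    for beta in betas:
        preds = np.zeros_like(y_true)            # 0 means "abstain"
        preds[probs >= 0.5 + beta] = 1
        preds[probs <= 0.5 - beta] = -1

        # F1 on the labeled set; abstaining on a true positive hurts recall,
        # so higher coverage is implicitly rewarded.
        tp = float(np.sum((preds == 1) & (y_true == 1)))
        precision = tp / max(np.sum(preds == 1), 1)
        recall = tp / max(np.sum(y_true == 1), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-8)

        if f1 > best_f1:
            best_beta, best_f1 = beta, f1
    return best_beta
```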

Tutorial

The tutorial notebooks are based on a text-based plot classification dataset. We go through generating heuristics with Reef and then train a simple LSTM model to see how an end model trained with Reef labels compares to an end model trained with ground truth training labels.
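In code, the comparison amounts to training the same end model twice, once on Reef-generated labels and once on ground-truth labels. The sketch below reuses the lstm_simple helper and the variable names that appear in the Keras issue's traceback further down; the placeholder values for bs and n and the thresholding of Reef's probabilistic labels at 0.5 are assumptions.

```python
# Sketch of the tutorial's end-model comparison (cf. 2-train_lstm_model.py).
# lstm_simple and its argument order are taken from the Keras issue's traceback
# below; everything else here is illustrative.
from lstm.imdb_lstm import lstm_simple

# train_text, val_text, train_ground, and val_ground come from the tutorial's
# data-loading step; reef_marginals are Reef's probabilistic training labels.
train_reef = (reef_marginals > 0.5).astype(int)  # hard labels from Reef (assumed convention)

bs, n = 64, 5   # placeholder values; see 2-train_lstm_model.py for the actual settings
pred_reef  = lstm_simple(train_text, train_reef,   val_text, val_ground, bs=bs, n=n)
pred_truth = lstm_simple(train_text, train_ground, val_text, val_ground, bs=bs, n=n)
```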

reef's People

Contributors

paroma, vincentschen


reef's Issues

ModuleNotFoundError: No module named 'label_aggregator'

I am trying to run the notebook 'generate_reef_labels.ipynb'.
This is the complete stack trace of the error:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-cf7b385f41e4> in <module>()
----> 1 from program_synthesis.heuristic_generator import HeuristicGenerator
      2 
      3 hg = HeuristicGenerator(train_primitive_matrix, val_primitive_matrix, val_ground, train_ground, b=0.5)
      4 hg.run_synthesizer(max_cardinality=1, idx=None, keep=3, model='dt')

~/work/reef/program_synthesis/heuristic_generator.py in <module>()
      4 
      5 from program_synthesis.synthesizer import Synthesizer
----> 6 from program_synthesis.verifier import Verifier
      7 
      8 class HeuristicGenerator(object):

~/work/reef/program_synthesis/verifier.py in <module>()
      1 import numpy as np
      2 from scipy import sparse
----> 3 from label_aggregator import LabelAggregator
      4 
      5 def odds_to_prob(l):

ModuleNotFoundError: No module named 'label_aggregator'
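A note on the error above: the failing line in program_synthesis/verifier.py uses a Python-2-style implicit relative import, which is not allowed under Python 3. Assuming label_aggregator.py lives in the same program_synthesis/ directory as verifier.py, a likely fix is to make the import explicit:

```python
# In program_synthesis/verifier.py, replace the implicit relative import
# (valid in Python 2 only):
#     from label_aggregator import LabelAggregator
# with an explicit package import (assuming label_aggregator.py sits next to
# verifier.py inside program_synthesis/):
from program_synthesis.label_aggregator import LabelAggregator
```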

BaseModel __init__

Change train_ground so that it is not optional; the code will fail if None is passed in.

Also change the ordering of train_ground and val_ground; the ordering is traditionally the other way around.

def __init__(self, train_primitive_matrix, val_primitive_matrix,

Dataset for Twitter and CDR

Hi @paroma,
Your paper seems interesting!

In the GitHub repository, you have only provided the IMDB dataset as a reference. Could you provide the Twitter and CDR data (with the relevant pre-processing scripts) to replicate the results from the paper?

Thanks

Which Keras version do you use?

(venv) mldl@mldlUB1604:/ub16_prj/reef$ python 2-train_lstm_model.py
/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
DeprecationWarning)
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Traceback (most recent call last):
File "2-train_lstm_model.py", line 54, in
y_pred = lstm_simple(train_text, train_reef, val_text, val_ground, bs=bs, n=n)
File "/home/mldl/ub16_prj/reef/lstm/imdb_lstm.py", line 32, in lstm_simple
model.add(LSTM(100))
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/engine/sequential.py", line 187, in add
output_tensor = layer(self.outputs[0])
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/layers/recurrent.py", line 500, in call
return super(RNN, self).call(inputs, **kwargs)
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 460, in call
output = self.call(inputs, **kwargs)
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/layers/recurrent.py", line 2112, in call
initial_state=initial_state)
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/layers/recurrent.py", line 609, in call
input_length=timesteps)
File "/home/mldl/ub16_prj/VENV_host/py2tf1.0/venv/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2957, in rnn
maximum_iterations=input_length)
TypeError: while_loop() got an unexpected keyword argument 'maximum_iterations'
(venv) mldl@mldlUB1604:/ub16_prj/reef$

Multilabel classifier for images

Hey

I am currently working on a multilabel classifier for images and am curious if we can construct a multilabel classifier using reef. I have a dataset with more than 6 labels.

Looking forward to your response
