awasthiabhijeet / learning-from-rules

Implementation of experiments in the paper "Learning from Rules Generalizing Labeled Exemplars", to appear in ICLR 2020 (https://openreview.net/forum?id=SkeuexBtDr)

License: Apache License 2.0

Shell 14.28% Python 75.83% Perl 9.89%
weakly-supervised-learning iclr2020 high-level-supervision rule-based-nlp rulebasednlp representation-learning data-augmentation weak-supervision weakly-supervised

learning-from-rules's Introduction

LEARNING FROM RULES GENERALIZING LABELED EXEMPLARS (ICLR 2020)

This repository provides an implementation of the experiments in our ICLR 2020 paper:

@inproceedings{
Awasthi2020Learning,
title={Learning from Rules Generalizing Labeled Exemplars},
author={Abhijeet Awasthi and Sabyasachi Ghosh and Rasna Goyal and Sunita Sarawagi},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=SkeuexBtDr}
}

Requirements

This code has been developed with:

  • python 3.6
  • tensorflow 1.12.0
  • numpy 1.17.2
  • snorkel 0.9.1
  • tensorflow_hub 0.7.0

Data Description

We have currently released processed versions of four datasets used in our paper. They can be found in the data/ directory.

Each dataset directory (e.g. data/TREC) contains the following four pickle files:

  • d_processed.p (the d set: labeled data, referred to in the paper as the "L" dataset)
  • U_processed.p (the U set: unlabeled data, referred to in the paper as the "U" dataset)
  • validation_processed.p (validation data)
  • test_processed.p (test data)
  • NOTE: U_processed.p for YOUTUBE and MITR is unavailable on GitHub due to its larger size. You can download the entire data directory from this link.

The following objects are dumped inside each pickle file (a loading sketch follows the list):

  • x : feature representation of instances
    • shape : [num_instances, num_features]
  • l : class labels assigned by rules
    • shape : [num_instances, num_rules]
    • class labels belong to {0, 1, 2, ..., num_classes-1}
    • l[i][j] is the class label assigned by the jth rule to the ith instance
    • if the jth rule does not cover the ith instance, then l[i][j] = num_classes (our convention)
    • in snorkel, the convention is l[i][j] = -1 when the jth rule does not cover the ith instance
  • m : rule coverage mask
    • a binary matrix of shape [num_instances, num_rules]
    • m[i][j] = 1 if the jth rule covers the ith instance
    • m[i][j] = 0 otherwise
  • L : instance labels
    • shape : [num_instances, 1]
    • L[i] = label of the ith instance, if a label is available, i.e. if the instance is from the labeled set d
    • else, L[i] = num_classes if the instance comes from the unlabeled set U
    • class labels belong to {0, 1, 2, ..., num_classes-1}
  • d : a binary matrix of shape [num_instances, 1]
    • d[i] = 1 if the instance belongs to the labeled data (d), d[i] = 0 otherwise
    • d[i] = 1 for all instances in d_processed.p
    • d[i] = 0 for all instances in the other three pickles, {U,validation,test}_processed.p
  • r : a binary matrix of shape [num_instances, num_rules]
    • r[i][j] = 1 if the jth rule was associated with the ith instance
    • a highly sparse matrix
    • r is a zero matrix in all pickles except d_processed.p
    • note that this is different from the rule coverage mask m
    • this matrix defines the coupled (rule, exemplar) pairs
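The snippet below is a minimal sketch of how these pickles might be loaded and sanity-checked. It assumes the six objects were dumped sequentially in the order listed above; the actual loading code in data_feeder_utils.py is authoritative, and load_processed is a hypothetical helper name.

```python
import pickle
import numpy as np

def load_processed(path, num_classes):
    """Hypothetical loader: assumes the six objects were pickled
    one after another in the order listed above (x, l, m, L, d, r)."""
    with open(path, "rb") as f:
        x = pickle.load(f)  # [num_instances, num_features] features
        l = pickle.load(f)  # [num_instances, num_rules] rule labels
        m = pickle.load(f)  # [num_instances, num_rules] coverage mask
        L = pickle.load(f)  # [num_instances, 1] instance labels
        d = pickle.load(f)  # [num_instances, 1] labeled-set indicator
        r = pickle.load(f)  # [num_instances, num_rules] rule-exemplar coupling
    # Per the conventions above, l[i][j] == num_classes exactly where
    # the coverage mask is 0.
    assert np.all((l == num_classes) == (m == 0))
    return x, l, m, L, d, r

x, l, m, L, d, r = load_processed("../../data/TREC/d_processed.p", num_classes=6)
print(x.shape, l.shape, m.shape, L.shape, d.shape, r.shape)
```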

Usage

Run the following from src/hls:

  • For reproducing the numbers in Table 1, Row 1:
    • python3 get_rule_related_statistics.py ../../data/TREC 6 None
    • This also provides the Majority Vote accuracy in Table 2, Column 2 (Question dataset); a minimal sketch of this computation follows this list.
  • For training, saving, and testing a snorkel model:
    • python3 run_snorkel.py ../../data/TREC 6 None
    • RUN THIS BEFORE ANY EXPERIMENT THAT DEPENDS ON SNORKEL LABELS, if a snorkel model is not already saved in the dataset directory.
    • We have released pre-trained snorkel models in each dataset directory under the name "saved_label_model".
  • For reproducing (approximately) the numbers in Table 2, Column 2 (Question dataset):
    • use train_TREC.sh to train models with the different loss functions
    • use test_TREC.sh to test models with the different loss functions
    • the best hyperparameters are already set in these scripts
    • both of the above scripts use TREC.sh
  • For reproducing numbers (approximately) for the other datasets, follow the same steps with TREC replaced by the dataset name.
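For reference, here is a minimal sketch of the majority-vote statistic mentioned above, using the l, m, and L arrays described in the Data Description section. It is only an illustration: get_rule_related_statistics.py may break ties and treat uncovered instances differently.

```python
import numpy as np

def majority_vote_accuracy(l, m, L, num_classes):
    """Accuracy of a majority vote over rule firings (illustrative only)."""
    n = l.shape[0]
    preds = np.full(n, -1)
    for i in range(n):
        fired = l[i][m[i] == 1].astype(int)  # labels of rules covering instance i
        if fired.size:
            # argmax breaks ties toward the smaller class label
            preds[i] = np.bincount(fired, minlength=num_classes).argmax()
    covered = preds != -1  # score only instances covered by at least one rule
    return np.mean(preds[covered] == L.ravel()[covered])
```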

Note:

  • The f network refers to the classification network
  • The w network refers to the rule network

File Description in src/hls

  • analyze_w_predictions.py - Used for diagnostics (Old Precision Vs Denoised Precision in Figure 3)
  • checkpoint.py - Load/Save checkpoints (Uses code from checkmate)
  • config.py - All configuration options go here
  • data_feeders.py - All data handling for training and testing
  • data_feeder_utils.py - Load train/test data from processed pickles
  • data_utils.py - Other utilities related to data processing
  • generalized_cross_entropy_utils.py - Implementation of a noise-tolerant loss function (generalized cross entropy)
  • get_rule_related_statistics.py - For reproducing numbers in Table 1
  • hls_data_types.py - some basic data types used in data_feeders.py
  • hls_model.py - Creates training ops; all the loss functions are defined here
  • hls_test.py - Runs inference using f or w.
    • Inference on f tests the classification network (valid for all the loss functions)
    • Inference on w is used to analyze the denoised rule-precision obtained by w network
    • Inference on w is only meaningful for ImplyLoss and Posterior Reg. method since only these involve a rule (w) network.
  • hls_train.py - Two modes:
    • f_d (simply trains f network on labeled data)
    • f_d_U (used for all other modes, which utilize unlabeled data)
  • learn2reweight_utils.py - utilities for implementing L2R method
  • main.py - entry point
  • metrics_utils.py - utilities for computing metrics
  • networks.py - implementation of f network (classification network) and w network (rule network)
  • pr_utils.py - utilities for implementing Posterior Reg. method
  • run_snorkel.py - Trains, saves, and tests a snorkel label model (see the sketch after this list)
  • snorkel_utils.py - Utility to convert l in our format to l in snorkel's format
  • test_DATASET_NAME.sh - model testing (inference) script
    • e.g. test_TREC.sh runs inference for models trained on the TREC dataset
  • train_DATASET_NAME.sh - model training script
    • e.g. train_TREC.sh trains models on the TREC dataset
  • DATASET_NAME.sh - used by both test_DATASET_NAME.sh and train_DATASET_NAME.sh
  • utils.py - misc. utilities
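The sketch below shows what run_snorkel.py presumably does via snorkel_utils.py: convert our non-coverage marker (num_classes) to snorkel's (-1), then fit a LabelModel. The variable names, file paths, and hyperparameters are illustrative rather than the repo's exact settings; the import path matches snorkel 0.9.1, and load_processed is the hypothetical loader sketched in the Data Description section.

```python
from snorkel.labeling import LabelModel  # import path as of snorkel 0.9.1

def to_snorkel_format(l, num_classes):
    """Map our non-coverage marker (num_classes) to snorkel's (-1)."""
    l = l.copy()
    l[l == num_classes] = -1
    return l

num_classes = 6  # TREC has 6 classes
# l_U: the l matrix from U_processed.p, loaded e.g. with the
# load_processed sketch from the Data Description section.
_, l_U, _, _, _, _ = load_processed("../../data/TREC/U_processed.p", num_classes)

L_train = to_snorkel_format(l_U, num_classes)
label_model = LabelModel(cardinality=num_classes, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=123)  # hyperparameters illustrative
probs = label_model.predict_proba(L_train)        # soft labels over the U set
label_model.save("../../data/TREC/saved_label_model")
```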


learning-from-rules's Issues

suggestions on search space of gamma and q

Hi,

If we want to do a hyperparameter search over the two primary parameters of ImplyLoss, i.e., the weighting factor gamma and the q of the generalized cross-entropy loss, do you have any suggestions on the search space?

Pickle Dump Overwrite Behavior

Edited: I believe I misunderstood some of the code, so I have removed the erroneous description and closed the issue. Thanks!

How to view rules

I want to see what those rules look like; how can I dump them? Thanks.

tabular data/ noisy instances

Hi,
Thanks for sharing your implementation. I have two questions about it:

  1. Does it also work on tabular data?
  2. Is it possible to identify the noisy instances (return the noisy IDs or the clean set)?

Thanks!
