Snorkel MeTaL

v0.5.0

ANNOUNCEMENT

Snorkel MeTaL is being merged into Snorkel v0.9 this summer!

The core Snorkel repo is currently undergoing a major redesign that includes bringing under one roof a number of projects in the Snorkel ecosystem that have previously been posted in separate repositories—Snorkel, Snorkel MeTaL, TANDA, etc. This new version will include:

  • Support for new training data operators: labeling functions (LFs), transformation functions (TFs), and slicing functions (SFs)
  • The matrix-completion-based approach for learning LF accuracies and correlation structure first introduced as part of Snorkel MeTaL
  • Native support for multi-task learning (MTL), transfer learning (TL), and complex model flows (an improved version of the mmtl package in MeTaL)
  • A Snorkel 101 guide that provides a gentle introduction to the technology and API for first-time users
  • A fresh batch of tutorials demonstrating different use cases and integrations
  • A more modular form factor that makes it easier to integrate with other libraries
  • A commitment to stability, with full coverage unit tests, type checking, and doc strings

The new version will be available via pip and conda. The current Snorkel MeTaL repository will remain available, but most development effort will be focused on the primary Snorkel repository, which should also support any workflows currently supported by Snorkel MeTaL.

If you'd like to stay in the loop on the latest news in the Snorkel ecosystem, join the Snorkel Google Group. We'll let you know when the new version is released!

Getting Started

News

3/20 We are excited to have achieved a new state-of-the-art score on the GLUE Benchmark and four of its component tasks using Snorkel MeTaL. Check out the corresponding blog post for an overview of how we did it. The code we used to accomplish this was part of a significant restructuring of multi-task end models in Snorkel MeTaL to make it as easy as possible to perform Massive Multi-Task Learning (MMTL) with supervision at varying levels of granularity and over an arbitrarily large number of tasks. That mmtl package has been released as a part of Snorkel MeTaL v0.5, along with a basic tutorial. Additional tutorials showing more advanced usage (e.g., using a pre-trained BERT network as a shared input module, using multiple label sets, supervising at the token and sentence level simultaneously, etc.) will be released in future minor version updates, though such functionality is already supported.

Stay tuned for other developments in the Snorkel ecosystem at our project landing page: snorkel.stanford.edu.

Motivation

This project builds on Snorkel in an attempt to understand how massively multi-task supervision and learning change the way people program. Multi-task learning (MTL) is an established technique that effectively pools samples by sharing representations across related tasks, leading to better performance with less training data (for a great primer on recent advances, see this survey). However, most existing multi-task systems rely on two or three fixed, hand-labeled training sets. Weak supervision instead opens the floodgates, allowing users to add arbitrarily many weakly supervised tasks. We call this setting massively multi-task learning, and envision models with tens or hundreds of tasks with supervision of widely varying quality. Our goal with the Snorkel MeTaL project is to understand this new regime and the programming model it entails.

More concretely, Snorkel MeTaL is a framework for using multi-task weak supervision (MTS), provided by users in the form of labeling functions applied over unlabeled data, to train multi-task models. Snorkel MeTaL can use the output of labeling functions developed and executed in Snorkel, or take in arbitrary label matrices representing weak supervision from multiple sources of unknown quality, and then use this to train auto-compiled MTL networks.
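As a rough illustration of what "labeling functions applied over unlabeled data" produces, the sketch below builds a small [n,m] label matrix with NumPy and SciPy only. The labeling functions, the toy texts, and the `apply_lfs` helper are hypothetical and not part of the MeTaL API; the sketch also assumes the convention that 0 means "abstain" and classes are numbered 1..k.

```python
import numpy as np
from scipy import sparse

ABSTAIN = 0  # assumed convention: 0 = no vote, classes are 1..k

# Hypothetical labeling functions for a toy task (1 = spam, 2 = not spam)
def lf_contains_link(text):
    return 1 if "http" in text else ABSTAIN

def lf_is_short(text):
    return 2 if len(text.split()) < 4 else ABSTAIN

def lf_mentions_prize(text):
    return 1 if "prize" in text.lower() else ABSTAIN

def apply_lfs(lfs, examples):
    """Return an [n, m] sparse matrix; entry (i, j) is LF j's vote on example i."""
    dense = np.array([[lf(x) for lf in lfs] for x in examples], dtype=np.int64)
    return sparse.csr_matrix(dense)

examples = [
    "Claim your prize at http://example.com now",
    "See you soon",
    "Lunch at noon tomorrow with the whole team",
]
lfs = [lf_contains_link, lf_is_short, lf_mentions_prize]

L_train = apply_lfs(lfs, examples)
print(L_train.toarray())
```

Each row is one data point and each column is one source of weak supervision; rows where every labeling function abstains stay all-zero, which is why a sparse format is a natural fit.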

Snorkel MeTaL uses a new matrix approximation approach to learn the accuracies of diverse sources whose quality is unknown, including sources with arbitrary dependency structures and structured multi-task outputs. This makes it significantly more scalable than our previous approaches.

References

Blog Posts

Q&A

If you are looking for help regarding how to use a particular class or method, the best references are (in order):

  • The docstrings for that class
  • The MeTaL Commandments
  • The corresponding unit tests in tests/
  • The Issues page (We tag issues that might be particularly helpful with the "reference question" label)

Sample Usage

This sample is for a single-task problem. For a multi-task example, see tutorials/Multitask.ipynb.

"""
n = # data points
m = # labeling functions
k = cardinality of the classification task

Load for each split:
L: an [n,m] scipy.sparse label matrix of noisy labels
Y: an n-dim numpy.ndarray of target labels
X: an n-dim iterable (e.g., a list) of end model inputs
"""

# LabelModel and EndModel live in separate modules of the metal package
from metal.label_model import LabelModel
from metal.end_model import EndModel

# Train a label model and generate training labels
label_model = LabelModel(k)
label_model.train_model(L_train)
Y_train_probs = label_model.predict_proba(L_train)

# Train a discriminative end model with the generated labels
# (layer sizes: 1000-dim input, 10 hidden units, 2 output classes)
end_model = EndModel([1000, 10, 2])
end_model.train_model(train_data=(X_train, Y_train_probs), valid_data=(X_dev, Y_dev))

# Evaluate performance
score = end_model.score(data=(X_test, Y_test), metric="accuracy")

Note for Snorkel users: Snorkel MeTaL, even in the single-task case, learns a slightly different label model than Snorkel does (for example, here we learn class-conditional accuracies for each LF), so expect slightly different (hopefully better!) results.
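To make "class-conditional accuracies" concrete, the sketch below computes them empirically for a single labeling function using gold labels. This only illustrates the quantity; it is not how the label model estimates it, since the label model works without gold labels. The function name and the toy arrays are hypothetical, and the 0 = abstain, 1..k label convention is an assumption.

```python
import numpy as np

def class_conditional_accuracies(votes, Y, k):
    """For each true class y in 1..k, return the fraction of the LF's
    non-abstain votes on class-y examples that equal y (NaN if it never votes)."""
    acc = np.full(k, np.nan)
    for y in range(1, k + 1):
        mask = (Y == y) & (votes != 0)  # class-y examples where the LF voted
        if mask.any():
            acc[y - 1] = np.mean(votes[mask] == y)
    return acc

# Toy data: one LF's votes (0 = abstain) and the gold labels
votes = np.array([1, 1, 0, 2, 1, 2])
Y = np.array([1, 1, 1, 2, 2, 2])

acc = class_conditional_accuracies(votes, Y, k=2)

# A single overall accuracy hides the difference between the two classes
voted = votes != 0
overall = np.mean(votes[voted] == Y[voted])
print(acc, overall)
```

In this toy case the labeling function is always right when the true class is 1 but right only two times out of three when the true class is 2, a distinction that one accuracy number per LF cannot express.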

Release Notes

Major changes in v0.5:

  • Introduction of Massive Multi-Task Learning (MMTL) package in metal/mmtl/ with tutorial.
  • Additional logging improvements from v0.4

Major changes in v0.4:

  • Upgrade to PyTorch v1.0
  • Improved control over logging/checkpointing/validation
    • More modular code, separate Logger, Checkpointer, LogWriter classes
    • Support for user-defined metrics for validation/checkpointing
    • Logging frequency can now be based on seconds, examples, batches, or epochs
  • Naming convention change: hard (int) labels -> preds, soft (float) labels -> probs
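The preds/probs naming convention can be illustrated with a small NumPy sketch. The `probs_to_preds` helper below is hypothetical (written for this example, not taken from MeTaL), and it assumes hard labels are numbered 1..k so that column 0 of the probability array corresponds to class 1.

```python
import numpy as np

def probs_to_preds(probs):
    """Convert an [n, k] array of soft (float) labels into n hard (int) labels in 1..k.
    Ties go to the lowest-numbered class, following np.argmax."""
    probs = np.asarray(probs, dtype=float)
    return probs.argmax(axis=1) + 1

# Soft labels ("probs"): one row per example, one column per class
Y_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])

# Hard labels ("preds")
Y_preds = probs_to_preds(Y_probs)
print(Y_preds)
```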

Setup

[1] Install Anaconda (instructions: https://www.anaconda.com/download/)

[2] Clone the repository:

git clone https://github.com/HazyResearch/metal.git
cd metal

[3] Create virtual environment:

conda env create -f environment.yml
source activate metal

[4] Run unit tests:

nosetests

If the tests run successfully, you should see 50+ dots followed by "OK". Check out the tutorials to get familiar with the Snorkel MeTaL codebase!

Or, to use Snorkel MeTaL in another project, install it with pip:

pip install snorkel-metal

Developer Guidelines

First, read the MeTaL Commandments, which describe the major design principles, terminology, and style guidelines for Snorkel MeTaL.

If you are interested in contributing to Snorkel MeTaL (and we wholeheartedly welcome contributions via pull requests!), follow the setup guidelines above, then run the following additional command:

make dev

This will install a few additional tools that help to ensure that any commits or pull requests you submit conform with our established standards. We use the following packages:

  • isort: import standardization
  • black: automatic code formatting
  • flake8: PEP8 linting

After running make dev to install the necessary tools, you can run make check to see whether any changes you've made violate the repo standards, and make fix to automatically fix any violations related to isort/black. Fixes for flake8 violations must be made manually.

GPU Usage

MeTaL supports GPU usage, but does not include this in automatically-run tests; to run these tests, first install the requirements in tests/gpu/requirements.txt, then run:

nosetests tests/gpu

metal's People

Contributors

agnusmaximus, ajratner, bhancock8, chmccreery, danich1, dliangsta, inimino, jason-fries, jay2113853, jdunnmon, nishithbsk, paroma, phiradet, senwu, vincentschen
