Code Monkey home page Code Monkey logo

unnatural_language_processing_replication's Introduction

Midterm Replication Project: Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

'Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data' by Alana Marzoev, Samuel Madden, M. Frans Kaashoek, Michael Cafarella, Jacob Andreas

Introduction

The objective of the paper is to generate and train a model on synthetic text data to perform some natural language processing task. This paper builds off of previous work ('Building a Semantic Parser Overnight') that uses a program and synthetic utterance grammar to generate corresponding canonical synthetic utterances for each program. The contribution of this paper is the development of a projection model to map natural language utterances to sentences in the synthetic text domain. The synthetic sentence can then be used as input to the trained semantic parser.

The combination of the (1) synthetic data generation, (2) trained synthetic data task model, and (3) projection model compose a framework for developing NLP systems without large annotated datasets.

At a high level, the pipeline looks as follows: Pipeline

In this replication, we focus on this synthetic model and projection framework in the context of Semantic Parsing with the Overnight dataset.

Method

To replicate this part of this paper, I broke the process down into three primary components: (1) collect synthetic data, (2) train semantic parser on synthetic utterances and programs, (3) write projection model from NL utterances to synthetic utterances in the grammar.

(1) Collect synthetic data

Data is collected from the Calendar domain of the Overnight dataset.

Download data files from the SEMPRE CodaLab: worksheet Files are collected in the overnight_data folder.

(The SEMPRE full semantic parsing pipeline, which we do not use but is the resulting model from the original Overnight dataset paper, can also be run in an interactive shell by following the instructions

Once the data files for grammar and examples were downloaded, I wrote a dataloader for the original synthetic canonical utterance, natural language paraphrase, and Calendar domain program. Based on whether I want to train on only synthetic->program, paraphrase->program, synthetic/paraphrase->program, I load the corresponding input utterance and output programs accordingly.

(2) Train semantic parser

With the dataloader written, I follow suit of the paper and train a LSTM encoder-decoder model with an embedding size of 256 and a hidden size of 1024. This model is the seq2seq semantic parser from utterance to program in the Calendar domain.

Graphs from the training procedure on training the models on synthetic data, NL paraphrases, and both are shown below: Training Performance Test Performance

(3) Write projection model

In the paper, for all datasets in the semantic parsing Overnight dataset, the authors use the "flat" projection model , which is to select the "closest" synthetic utterance to an input natural language utterance by using sentence embeddings (embed) and searching over all synthetic utterances in a set of utterances to find the one with the closest cosine similarity : Projection

Therefore, we do the same, using SentenceTransformers's 'all-MiniLM-L12-v2' model and sklearn's cosine_similarity metric. The paper is unclear on how they obtain the set of synthetic utterances , so I tried a couple techniques, all documented below:

  • save all synthetic utterances from training into
  • save all synthetic utterances from training and test into
  • save all synthetic utterances from training and programmatically augment each example with argument flips of matching type, according to the grammar files

Results

Model Train data Program Accuracy (w test) (w augment) (w test & augment) Synth Utterance Accuracy (w test) (w augment) (w test & augment) Paper Accuracy
adagrad-8eps synthetic 0.0 0.0 0.0 0.0 0.4166 0.4226 0.3988 0.3988 0.32
adam-20eps synthetic 0.4167 0.4226 -- -- 0.4167 0.4226 -- -- 0.32
adam-8eps synthetic 0.4167 0.4226 0.3988 0.3988 0.4167 0.4226 0.3988 0.3988 0.32
adam-15-real real 0.0119 0.0119 0.0119 0.0119 0.4167 0.4226 0.3988 0.3988 --
adam-15-both both 0.0 0.0 0.0 0.0 0.4167 0.4226 0.3988 0.3988 --
adam-15-real real (no projection) 0.5298 -- -- -- -- -- -- -- 0.27
adam-15-both both (no projection) 0.0179 -- -- -- -- -- -- -- 0.13
adam-30-both both (no projection) 0.0119 -- -- -- -- -- -- -- 0.13

My results follow the same general model performance trend: high performance with the synthetic data and projection model ("synthetic + projection") and the real data with no projection model ("real"), and low performance with both real and synthetic data with no projection model ("both"). However, the results did vary somewhat significantly. I discuss some theories and experiments around the variance below:

  • Performance of my "synthetic + projection" model outperformed that of the paper. I believe this may have to do with the smaller set of synthetic utterances I was projecting onto, as my seq2seq model was fit to the set of synthetic utterances I mapped onto, and my program accuracy matched my synthetic utterance match accuracy. To test this, I artificially "augmented" the set of synthetic utterances: for each synthetic utterance, create additional synthetic utterances by swapping out arguments for other arguments of the same semantic type from the grammar. I noticed that performance dropped with augmentation, which suggest support for my theory about performance increase due to overfitting and bias with a smaller set of synthetic utterances.
  • Performance of my "real" model outperformed their "real" model by almost double the accuracy. I assume this is either a bug in my code or there was some inconsistency in model, number of training epochs, data set, or other.
  • Performance of my "both" model underperformed their "both" model by almost 10x less. I assume this is either a bug in my code or there was some inconsistency in model, number of training epochs, data set, or other.

How to replicate

Setup

conda create --name synthetic python=3.8
conda activate synthetic

conda install pytorch
conda install -c conda-forge wandb 
pip install -U sentence-transformers

Run

First, we need to train a semantic parser from the synthetic utterances to programs in the Overnight grammar. Customize the optimizer, number of epochs, batch size, and model filename as desired:

python seq2seq.py \
        --optimizer adam \
        --epochs 8 \
        --batch_size 8 \
        --save_model --train_model \
        --model_filename 'seq2seq_adam_8'

Once the (synthetic) semantic parser has been trained, we pre-pend a projection model. The projection model projects natural language inputs to utterances in the synthetic grammar, which can be fed as input to the trained semantic parser. The following command loads the trained semantic parsing model, runs a projection model with the semantic parsing model on all natural language paraphrases from the test dataset, and calculates the exact-match accuracy of the results.

python baseline.py \
        --batch_size 8 \
        --model_filename 'seq2seq_adam_8' \
        --output_file 'output/seq2seq_adam_8_projection.txt'

An interactive version of the interface can be run with the following command:

python baseline_demo.py \
        --model_filename 'seq2seq_adam_8'
        --output_file 'output/interactive_projection.txt'

(Unused, but interesting) Run Stanford Semantic Parsing Model (Trained on Overnight dataset)

Download Apache Ant ANT_HOME=/Users/celinelee/Downloads/apache-ant-1.9.16 or wherever you uncompressed the download to. PATH=${ANT_HOME}/bin:$PATH

Install Java JDK.

git clone https://github.com/percyliang/sempre
cd sempre
ruby ./pull-dependencies core
ant core
ruby ./run @mode=simple

Useful: https://github.com/percyliang/sempre/blob/master/TUTORIAL.md

unnatural_language_processing_replication's People

Contributors

celine-lee avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.