
idiomify's Introduction

Exploring the Efficacy of Idiomify: How Effective is GPT-3 for Teaching Idioms to EFL Writers?

m-1-3 v3.0.2

some examples of idiomify with GPT-3

Research Questions

We aimed to answer the overarching question: How effective is Idiomify in helping EFL writers learn idioms? To answer this, we explored the efficacy of Idiomify in terms of four factors:

  1. Appropriateness. How appropriate are Idiomify’s suggestions?
  2. Learning idioms. Do learners learn the meaning of idioms as they revise their drafts with Idiomify?
  3. Learners’ perceptions. What about Idiomify do learners find helpful or unhelpful?
  4. Transparency of idioms. Do the answers to 1 and 2 differ depending on how transparent learners find the idioms that Idiomify suggests?

idiomify's People

Contributors

eubinecto


idiomify's Issues

Implementing `m-1-1` - the first baseline

Final goal

The goal is to implement the literal -> idiomatic style transfer from the paper in question!
image

Why?

In order to see the advantage we could get from integrating the notion of transparency,
we first need a baseline that does not care about the notion of transparency of idioms.
And since this is just a simple experiment, we only care about at most nine idioms.

To-do's

  • build a mini dataset: idioms
  • build a mini dataset: literal2idiom
  • BART - SRCBuilder
  • BART - TGTBuilder
  • revise & test the data module
  • implement the alpha baseline
  • run model training on ainize
  • implement Idiomifier
  • finish main_infer and complete the infer pipeline
  • deploy issue_1 to streamlit

`d-1-4` : Preprocess PIE dataset to build NER labels for Idiomify task

Why?

We need to transform the dataset into an NER-compatible format.

How?

We could make use of the labels that we already have:

like these
image
image

But why do some labels include O in between B and I, like the one above?

Someone has asked the same question, and someone else came up with exactly the answer I was looking for:
image

So it is because some of the words within a phrase may not be part of the entity itself.

And this is the case with idiomifiable sentences:
image

It should not be:

  • ('watched', 'B'), ('the', 'I'), ('workers', 'I'), ('at', 'I'), ('his', 'I'), ('office', 'I'), ('carefully', 'I')

because it is only watched and carefully that, in isolation, correspond to the meaning of keep an eye on; the words in between are not part of the idiom.
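
To make the intended labelling concrete, here is a minimal sketch of the IOB tags for that example (the token and label lists are written out by hand here, not taken from the repo's code):

```python
# "watched ... carefully" is what "keep an eye on" replaces; the object in
# between ("the workers at his office") is not part of the idiom, so it stays O.
tokens = ["watched", "the", "workers", "at", "his", "office", "carefully"]
labels = ["B",       "O",   "O",       "O",  "O",   "O",      "I"]
```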

To-do's

  • we don't need annotate anymore
  • explore_fetch_pie_replace_labels.py
  • change main_upload_literal2idiomatic to preprocess the labels

Choose the idioms to collect data for

What?

For the time being, aim for up to 50 examples.

But just how many examples do we need per class?

Start with five examples for each idiom. So, we collect 10 idioms to start with.

What idioms should we train our model with?

We should keep to some conditions:

  1. Idiomify should be practical; the idioms it suggests should be ones that are widely used in English-speaking cultures.
  2. Idiomify should output idioms with a varying degree of transparency. Why? Because one of the research questions asks us to explore how students behave on encountering opaque versus transparent idioms.
Liu (2003) has already compiled a set of the most frequently used idioms
image

As we can see from the compiled list, most of the most frequently used idioms are transparent ones. Rarely do you see opaque idioms like kick the bucket in the top 100.

Liu (2003) points out that this pattern has been observed in multiple studies in the past
image

So, what's the takeaway?

Condition 1 and condition 2 are not fully compatible. You can't be practical and accommodating at the same time.
So you have to compromise practicality a little by embracing a few opaque idioms.

Then.. what should be the ratio?

Let's go for this, as of right now

  • 90% = practical, reasonably transparent idioms
  • 10% = impractical, but interestingly opaque idioms

How?

Just how will we collect exemplar contexts? We need expert examples for this. It is a bit risky to rely on GPT-3 to self-generate the dataset.

But is that really so? Using GPT-3 might be more fitting than expert examples
image

Because, by definition, auto-regressive inference predicts the most probable tokens given the past. You can see from the above that GPT-3 is capable of generating the most 'exemplar' contexts for come up with, and this is indeed a good example of come up with. They are collocation-rich, in a way.

`entities:d-1-4`: define the entities

Why?

We need to encode each label (B's and I's and O) with a unique integer. For example, what we need is something like this:

B/beat around the bush
I/beat around the bush
B/ballpark figure
I/ballpark figure
...
O

We will need to define this as a list, so that each label gets a unique integer on enumeration.
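
A minimal sketch of what that could look like (the two idioms are just the ones from the example above; the variable names are illustrative, not the repo's actual code):

```python
# build a B/ and I/ label per idiom, plus a single O label
idioms = ["beat around the bush", "ballpark figure"]

entities = [f"{prefix}/{idiom}" for idiom in idioms for prefix in ("B", "I")]
entities.append("O")

# each label gets a unique integer on enumeration
label2id = {label: i for i, label in enumerate(entities)}
# {'B/beat around the bush': 0, 'I/beat around the bush': 1,
#  'B/ballpark figure': 2, 'I/ballpark figure': 3, 'O': 4}
```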

To-do's

  • update idioms to d-1-4
  • add main_upload_entities.py to upload entities:d-1-4

Why does fine-tuning perform worse on the same data?

Why?

As you can see below, my first attempt is not as good as I expected it to be:
image
image
image

But we know that with few-shot prompt design, the performance is generally great with only a few examples (~10).

Why is this?

Idiomifier as an NER tagger

How?

First of all, let's try this with the baseline approach - just a simple linear layer on top. Performance is not what matters right now.

Could you use BART for this? Yes, you could, but BART is an auto-regressive model - it cannot refer to the future when processing the past. BERT would be a better choice than BART.
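
For reference, a minimal sketch of that baseline - BERT with a token-classification head (i.e. a linear layer on top) - using huggingface transformers; the checkpoint name and toy label set are assumptions, not the repo's actual Idiomifier:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

entities = ["B/beat around the bush", "I/beat around the bush", "O"]  # toy label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(entities)
)

inputs = tokenizer("You are beating around the bush again.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1).squeeze(0)    # one label id per wordpiece
print([entities[i] for i in pred_ids])         # random until fine-tuned on the PIE labels
```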

To-do's

  • delete tokenizer-related fetchers and paths
  • change the builders: InputsBuilder
    • we must make sure that the tokens are not split any further than they are now
  • explore_inputs_builder
  • change the builders: LabelsBuilder
  • explore_labels_builder
  • rewrite Idiomifier to learn NER with BERT

Chronicles

m-1-x models 🔰 (Seq2Seq with BART)

m-1-x versions are primarily meant as a demonstration, or a pilot, of the tools I'll be building. The 1 means that the architecture does not change from that of vanilla BART. These models do not regard idioms as single entities.

m-1-1

The very first baseline of Idiomify. This model is trained only on the first 146 entries of the PIE dataset.

m-1-2

The scaled-up version of the previous model. No significant change has been made; it is just that m-1-2 is now trained on all entries of the PIE dataset (train=0.8). This is also the first version that is deployed to the web via streamlit & huggingface.

m-1-3

you have to search every single word to see where the change is!
image

This is rather inconvenient. We need some way of telling the user "here is the part that has been changed".
m-1-3 is a new version that does exactly that. It is trained on the same dataset as the previous version, but two special tokens are now added before and after idioms: <idiom> & </idiom>

now it looks much better!
image
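
A rough sketch of how the two markers could be registered with the tokenizer and the model (assuming the vanilla facebook/bart-base checkpoint rather than the repo's exact config):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# register <idiom> and </idiom> as special tokens so they are never split
tokenizer.add_special_tokens({"additional_special_tokens": ["<idiom>", "</idiom>"]})
# grow the embedding table to make room for the two new tokens
model.resize_token_embeddings(len(tokenizer))
```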

m-1-4 - just don't split the sentences

Why am I going back to using BART? Because it may not be absolutely terrible yet.

  • #27
  • experiment - does it perform better than before (m-1-3)?
  • experiment - how does it compare against GPT-3 (v3.0.1)?
  • my guess - GPT-3 would probably perform worse due to false negatives.

m-1-5 - just don't include the special tokens and treat this as a simple seq2seq problem

Why is only one idiom suggested? Could it be because of the special tokens?

  • #26
  • use difflib to highlight what has been changed (see the sketch after this list)
  • experiment - does it perform better than before (m-1-4 & m-1-3)?
  • experiment - does it perform better than GPT-3?
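
For the difflib item above, a minimal sketch of word-level change highlighting (the helper name and the ** markers are just illustrative):

```python
import difflib

def highlight_changes(src: str, idiomified: str) -> str:
    """Wrap the spans of the idiomified text that differ from the source in ** markers."""
    src_words, out_words = src.split(), idiomified.split()
    matcher = difflib.SequenceMatcher(None, src_words, out_words)
    pieces = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if j1 == j2:  # a pure deletion: nothing to show on the output side
            continue
        chunk = " ".join(out_words[j1:j2])
        pieces.append(chunk if tag == "equal" else f"**{chunk}**")
    return " ".join(pieces)

print(highlight_changes(
    "He revealed the secret by mistake.",
    "He let the cat out of the bag by mistake.",
))
# -> He **let** the **cat out of the bag** by mistake.
```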

m-2-x models 🏷 (NER with BERT)

m-1-x models demonstrated some potential, but not without problems. Some of these problems stem from the nature of a seq2seq approach to solving Idiomify.

  1. The model occasionally distorts the input sentence. We hope that the model will learn to "copy", and it indeed does, but we can never be entirely sure about this with a seq2seq approach. In an NER approach, you can be entirely sure that the source sentence is preserved because we get to label the sentence rather than transform the sentence.
  2. Normalising variations of idioms into their lemma is like squaring a peg in a round hole. Yeah, you could do something like
    You were <idiom> beat around the bush </idiom> when I first interviewed you last time, where beating around the bush is the correct form of the idiom. If recommending a normalised form is what you want to do at the end of the day, then the task is more of an NER task than a seq2seq task, where each idiom is a named entity.

So, what could be better is an NER system rather than a translation system. Granted, it does not explicitly "idiomify" sentences, but it can recommend which idioms to use for which parts of the sentence. I'm not sure if this will turn out to perform better than seq2seq, but one thing we can guarantee for sure is that NER won't distort the source sentence.

m-2-1

This is the first version of m-2-x models. As for the labels, we just follow the IOB convention.

  • #12
  • #11
  • #15
    • as for the tokenizer, we just use the pre-trained one. We need no additional tokens

v3.0 Idiomify with GPT-3

TL;DR - use GPT-3 rather than BERT.

But why a sudden switch from NER with BERT to seq2seq with GPT-3? This is for the following two reasons:

First, the few-shot performance of GPT-3 is surprisingly better than I thought. Just have a look at the example below.

an example of few-shot Idiomify. The proof is in the pudding!
image

Woah, and that is a result I got with only a handful of carefully curated examples, which is perfectly doable within a few hours. Yes, GPT-3 is expensive, and I'd never use it if I were in industry; the RoI of a GPT-3-based application would be stupidly low unless you charged customers 100 dollars a month. But hey, I'm an academic; I only need it to work on a few dozen personal statements. It's okay to stop being an NLP engineer for a few months and just embrace the world of prompt engineering, especially if the performance gain is big enough to justify the price.

But then what would be the point of your research, you may ask. Surely, merely presenting a use case of GPT-3 is by no means research in the field of NLP; it is just another interesting NLP project, because it does not improve the inductive bias of anything. Then what justifies my switch to GPT-3? Technically, I am not an NLP researcher; I'm an SLA researcher. That is, the aim of my research should be (and frankly, should have been) coming up with and justifying better ways of teaching a second language to EFL learners.

And that is the second reason for the switch to GPT-3. The top priority of my research should not be designing a better inductive bias. Rather, I should just use the best tools out there to build the feedback system as soon as possible, and focus on asking the right questions and answering them with scientific methods.

So, here are the two reasons, re-iterated:

  1. The Idiomify performance of GPT-3 is far better than I expected
  2. Suggesting better inductive bias is not my top priority

And so it begins, the world of prompt engineering.
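
To make the prompt-engineering step concrete, here is a sketch of a few-shot Idiomify prompt using the pre-v1 openai client; the model name, example pairs, and wording are my assumptions, not the exact prompt used in the repo:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # or load it from an environment variable

prompt = (
    "Rewrite the sentence with a suitable idiom.\n"
    "Original: He revealed the secret by mistake.\n"
    "Idiomified: He let the cat out of the bag by mistake.\n"
    "Original: I will support you whatever happens.\n"
    "Idiomified: I will stand by you whatever happens.\n"
    "Original: She finished the essay at the very last moment.\n"
    "Idiomified:"
)

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=60,
    temperature=0.3,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```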

To-do's

  • #16
  • deploy v3.0 to streamlit cloud

v3.0.1 - prompt design with a password check

The fine-tuning approach does not seem to work very well, for reasons I don't yet understand. But I must come up with a complete version by this Friday, so I should have a back-up plan.

This version is a minor upgrade of v3.0: it keeps the v3.0 prompt design, but adds the password check from v3.1. I'm doing this just in case I end up going back to this prompt design for my research.

  • #24
  • tag the version
  • deploy v3.0.1

v3.0.2 - pay-your-own-request version

Rather than allowing access only to those who know the master key, it is better to open the web app to anyone but ask them to register their own API key. Since OpenAI gives away 30 dollars' worth of API credit, that should cover enough requests as far as my research participants are concerned.

v3.0.3 - preparing for automating the research

  • a script for generating a fake alias for each participant
  • detailed instructions for signing up to OpenAI
  • a script for auto-generating Cloze tests (just recalling the definitions in Korean)
  • deploy!

v3.1 fine-tune Davinci with more quality examples

Approaching this with few-shot learning is not sustainable, as the API calls are just too expensive. You must fine-tune a model to build this successfully.

They say aim for up to 500 examples.
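
A sketch of how the training file could be prepared - OpenAI's (pre-v1) fine-tuning expects JSONL lines with a prompt and a completion; the separator, stop token, and file name here are assumptions:

```python
import json

pairs = [
    ("He revealed the secret by mistake.",
     "He let the cat out of the bag by mistake."),
    ("I will support you whatever happens.",
     "I will stand by you whatever happens."),
    # ... up to ~500 curated pairs
]

with open("literal2idiomatic.jsonl", "w") as f:
    for literal, idiomatic in pairs:
        f.write(json.dumps({
            "prompt": literal + "\n\n###\n\n",       # fixed separator ending the prompt
            "completion": " " + idiomatic + " END",  # leading space + stop sequence
        }) + "\n")

# then kick off the job with the pre-v1 CLI, e.g.:
#   openai api fine_tunes.create -t literal2idiomatic.jsonl -m davinci
```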

  • update readme.md
  • #20
  • #21
  • #18
  • #22
  • #23
  • experiment (1) - does v3.1 Idiomify more than one phrase, given a long paragraph?
  • experiment (2) - does v3.1 Idiomify give more natural suggestions? (Does it no longer "square a peg in a hole"?)
  • deploy v3.1

some time in the next version

you might want to evaluate your fine-tuned model with an extrinsic measure

  • evaluate the model with PPL (intrinsic measure)

fine-tune Davinci-002

Why?

image

Once you fine-tune a model, you'll be billed only for the tokens you actually send, without having to include the few-shot examples in the prompt every time. If that is the case, I should just build a dataset and fine-tune my model.

To-do's

  • save the OpenAI API secret locally using streamlit.secrets (see the sketch after this list)
  • [ ]
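
A minimal sketch of the streamlit.secrets item above, assuming the key is stored under the name OPENAI_API_KEY in .streamlit/secrets.toml:

```python
# .streamlit/secrets.toml would contain a single line like:
#   OPENAI_API_KEY = "sk-..."
import openai
import streamlit as st

openai.api_key = st.secrets["OPENAI_API_KEY"]
```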

Implementing `m-1-2` - testset, metrics, deploy

What?

Train Seq2Seq on the whole dataset.
Submit a random paragraph, and let Idiomifier idiomify the whole paragraph. See what we get from this.

Why?

We want to get a taste of what Idiomifier-as-a-rewriter could do. After seeing the results on a random paragraph,
I may get a sense of what the direction of my dissertation should be.

To-do's

  • dataset - do a train / test split (borrowing from the wisdomify code should do)
  • change the datamodule (no need to keep supporting the previous model; just change it!)
  • pipeline - batch processing (see the sketch after this list)
  • once a whole paragraph can be processed, run git tag -a tag012!
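
A rough sketch of the paragraph-level batch processing mentioned above, assuming a fine-tuned BART seq2seq checkpoint (the model id and the naive sentence split are illustrative only):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

CKPT = "eubinecto/idiomify-m-1-2"  # hypothetical model id, for illustration
tokenizer = BartTokenizer.from_pretrained(CKPT)
model = BartForConditionalGeneration.from_pretrained(CKPT)

paragraph = "I was very nervous at first. But my mentor supported me the whole time."
sentences = [s.strip() + "." for s in paragraph.split(".") if s.strip()]  # naive split

# batch-process every sentence in one forward pass
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=64)
idiomified = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(" ".join(idiomified))
```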

remove the special tokens

Why?

Maybe we don't need the <idiom> & </idiom> special tokens at all. The only function they have at the moment is highlighting the parts that have changed.
