
idiomify's Introduction

Exploring the Efficacy of Idiomify: How Effective is GPT-3 for Teaching Idioms to EFL Writers?

m-1-3 v3.0.2

some examples of idiomify with GPT-3

Research Questions

We aimed to answer the overarching question: How effective is Idiomify in helping EFL writers learn idioms? To answer this, we explored the efficacy of Idiomify in terms of four factors:

  1. Appropriateness. How appropriate are Idiomify’s suggestions?
  2. Learning idioms. Do learners learn the meaning of idioms as they revise their drafts with Idiomify?
  3. Learners’ perceptions. What about Idiomify do learners find helpful or unhelpful?
  4. Transparency of idioms. Do the answers to 1 and 2 differ depending on how transparent learners find the idioms that Idiomify suggests?

idiomify's People

Contributors

eubinecto


idiomify's Issues

Implementing `m-1-1` - the first baseline

Final goal

The goal is to implement the literal -> idiomatic style transfer from the paper in question!
image

Why?

In order to see the advantage we could get from integrating the notion of transparency,
we first need a baseline that does not care about the notion of transparency of idioms.
And since this is just a simple experiment, we only care about at most nine idioms.

To-do's

  • build a mini dataset: idioms
  • build a mini dataset: literal2idiom
  • BART - SRCBuilder
  • BART - TGTBuilder
  • revise & test the data module
  • implement the alpha baseline
  • run model training on ainize
  • implement Idiomifier
  • finish main_infer and complete the infer pipeline
  • deploy issue_1 to streamlit

`d-1-4` : Preprocess PIE dataset to build NER labels for Idiomify task

Why?

We need to transform the dataset into an NER-compatible format.

How?

We could make use of the labels that we already have:

like these
image
image

But why do some labels include O in between B and I, like the one above?

Someone has asked the same question, and someone else came up with exactly the answer I was looking for:
image

So it is because some of the words within a phrase may not be part of the entity itself.

And this is the case with idiomifiable sentences:
image

It should not be:

  • ('watched', 'B'), ('the', 'I'), ('workers', 'I'), ('at', 'I'), ('his', 'I'), ('office', 'I'), ('carefully', 'I')

because it is only watched and carefully that, in isolation, correspond to the meaning of keep an eye on; the words in between are not part of the idiom.
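
To make the intended labelling concrete, here is a minimal sketch of the IOB tags for that example (the token and label lists are written out by hand here, not taken from the repo's code):

```python
# "watched ... carefully" is what "keep an eye on" replaces; the object in
# between ("the workers at his office") is not part of the idiom, so it stays O.
tokens = ["watched", "the", "workers", "at", "his", "office", "carefully"]
labels = ["B",       "O",   "O",       "O",  "O",   "O",      "I"]
```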

To-do's

  • we don't need annotate anymore
  • explore_fetch_pie_replace_labels.py
  • change main_upload_literal2idiomatic to preprocess the labels

Choose the idioms to collect data for

What?

For the time being, aim for up to 50 examples.

But just how many examples do we need per class?

Start with five examples for each idiom. So, we collect 10 idioms to start with.

What idioms should we train our model with?

We should keep to some conditions:

  1. Idiomify should be practical; the idioms it suggests should be ones that are widely used in English-speaking cultures.
  2. Idiomify should output idioms with a varying degree of transparency. Why? Because one of the research questions asks us to explore how students behave on encountering opaque versus transparent idioms.
Liu (2003) has already compiled a set of the most frequently used idioms
image

As we can see from the compiled list, most of the most frequently used idioms are transparent ones. Rarely do you see opaque idioms like kick the bucket in the top 100.

Liu (2003) points out that this pattern has been observed in multiple studies in the past
image

So, what's the takeaway?

Condition 1 and condition 2 are not fully compatible. You can't be practical and accommodating at the same time.
So you have to compromise practicality a little by embracing a few opaque idioms.

Then.. what should be the ratio?

Let's go for this, as of right now

  • 90% = practical, reasonably transparent idioms
  • 10% = impractical, but interestingly opaque idioms

How?

Just how will we collect exemplar contexts? We need expert examples for this. It is a bit risky to rely on GPT-3 to self-generate the dataset.

But is that really so? Using GPT-3 might be more fitting than expert examples
image

Because, by definition, auto-regressive inference predicts the most probable tokens given the past. You can see from the above that GPT-3 is capable of generating the most 'exemplar' contexts for come up with, and this is indeed a good example of come up with. They are collocation-rich, in a way.

`entities:d-1-4`: define the entities

Why?

We need to encode each label (B's and I's and O) with a unique integer. For example, what we need is something like this:

B/beat around the bush
I/beat around the bush
B/ballpark figure
I/ballpark figure
...
O

We will need to define this as a list, so that each label gets a unique integer on enumeration.
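
A minimal sketch of what that could look like (the two idioms are just the ones from the example above; the variable names are illustrative, not the repo's actual code):

```python
# build a B/ and I/ label per idiom, plus a single O label
idioms = ["beat around the bush", "ballpark figure"]

entities = [f"{prefix}/{idiom}" for idiom in idioms for prefix in ("B", "I")]
entities.append("O")

# each label gets a unique integer on enumeration
label2id = {label: i for i, label in enumerate(entities)}
# {'B/beat around the bush': 0, 'I/beat around the bush': 1,
#  'B/ballpark figure': 2, 'I/ballpark figure': 3, 'O': 4}
```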

To-do's

  • update idioms to d-1-4
  • add main_upload_entities.py to upload entities:d-1-4

Why does fine-tuning perform worse on the same data?

Why?

As you can see below, my first attempt is not as good as I expected it to be:
image
image
image

But we know that with few-shot prompt design, the performance is generally great with only a few examples (~10).

Why is this?

Idiomifier as an NER tagger

How?

First of all, let's try this with the baseline approach - just a simple linear layer on top. Performance is not what matters right now.

Could you use BART for this? Yes, you could, but BART is an auto-regressive model - it cannot refer to the future when processing the past. BERT would be a better choice than BART.
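
For reference, a minimal sketch of that baseline - BERT with a token-classification head (i.e. a linear layer on top) - using huggingface transformers; the checkpoint name and toy label set are assumptions, not the repo's actual Idiomifier:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

entities = ["B/beat around the bush", "I/beat around the bush", "O"]  # toy label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(entities)
)

inputs = tokenizer("You are beating around the bush again.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1).squeeze(0)    # one label id per wordpiece
print([entities[i] for i in pred_ids])         # random until fine-tuned on the PIE labels
```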

To-do's

  • delete tokenizer-related fetchers and paths
  • change the builders: InputsBuilder
    • we must make sure that the tokens are not split any further than they are now
  • explore_inputs_builder
  • change the builders: LabelsBuilder
  • explore_labels_builder
  • rewrite Idiomifier to learn NER with BERT

Chronicles

m-1-x models 🔰 (Seq2Seq with BART)

m-1-x versions are primarily meant as a demonstration, or a pilot, of the tools I'll be building. The 1 means that the architecture does not change from that of vanilla BART. These models do not regard idioms as single entities.

m-1-1

The very first baseline of Idiomify. This model is trained only on the first 146 entries of the PIE dataset.

m-1-2

The scaled-up version of the previous model. No significant change has been made; it is just that m-1-2 is now trained on all entries of the PIE dataset (train=0.8). This is also the first version that is deployed to the web via streamlit & huggingface.

m-1-3

you have to search every single word to see where the change is!
image

This is rather inconvenient. We need some way of telling the user "here is the part that has been changed".
m-1-3 is a new version that does exactly that. It is trained on the same dataset as the previous version, but two special tokens are now added before and after idioms: <idiom> & </idiom>

now it looks much better!
image
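
A rough sketch of how the two markers could be registered with the tokenizer and the model (assuming the vanilla facebook/bart-base checkpoint rather than the repo's exact config):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# register <idiom> and </idiom> as special tokens so they are never split
tokenizer.add_special_tokens({"additional_special_tokens": ["<idiom>", "</idiom>"]})
# grow the embedding table to make room for the two new tokens
model.resize_token_embeddings(len(tokenizer))
```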

m-1-4 - just don't split the sentences

Why am I going back to using BART? Because it may not be absolutely terrible yet.

  • #27
  • experiment - does it perform better than before (m-1-3)?
  • experiment - how does it compare against GPT-3 (v3.0.1)?
  • my guess - GPT-3 would probably perform worse due to false negatives.

m-1-5 - just don't include the special tokens and treat this as a simple seq2seq problem

Why is only one idiom suggested? Could it be because of the special tokens?

  • #26
  • use difflib to highlight what has been changed (see the sketch after this list)
  • experiment - does it perform better than before (m-1-4 & m-1-3)?
  • experiment - does it perform better than GPT-3?
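
For the difflib item above, a minimal sketch of word-level change highlighting (the helper name and the ** markers are just illustrative):

```python
import difflib

def highlight_changes(src: str, idiomified: str) -> str:
    """Wrap the spans of the idiomified text that differ from the source in ** markers."""
    src_words, out_words = src.split(), idiomified.split()
    matcher = difflib.SequenceMatcher(None, src_words, out_words)
    pieces = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if j1 == j2:  # a pure deletion: nothing to show on the output side
            continue
        chunk = " ".join(out_words[j1:j2])
        pieces.append(chunk if tag == "equal" else f"**{chunk}**")
    return " ".join(pieces)

print(highlight_changes(
    "He revealed the secret by mistake.",
    "He let the cat out of the bag by mistake.",
))
# -> He **let** the **cat out of the bag** by mistake.
```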

m-2-x models 🏷 (NER with BERT)

m-1-x models demonstrated some potential, but not without problems. Some of these problems stem from the nature of a seq2seq approach to solving Idiomify.

  1. The model occasionally distorts the input sentence. We hope that the model will learn to "copy", and it indeed does, but we can never be entirely sure about this with a seq2seq approach. In an NER approach, you can be entirely sure that the source sentence is preserved because we get to label the sentence rather than transform the sentence.
  2. Normalising variations of idioms into their lemma is like squaring a peg in a round hole. Yeah, you could do something like
    You were <idiom> beat around the bush </idiom> when I first interviewed you last time, where beating around the bush is the correct form of the idiom. If recommending a normalised form is what you want to do at the end of the day, then the task is more of an NER task than a seq2seq task, where each idiom is a named entity.

So, what could be better is an NER system rather than a translation system. Granted, it does not explicitly "idiomify" sentences, but it can recommend which idioms to use for which parts of the sentence. I'm not sure if this will turn out to perform better than seq2seq, but one thing we can guarantee for sure is that NER won't distort the source sentence.

m-2-1

This is the first version of m-2-x models. As for the labels, we just follow the IOB convention.

  • #12
  • #11
  • #15
    • as for the tokenizer, we just use the pre-trained one. We need no additional tokens

v3.0 Idiomify with GPT-3

TL;DR - use GPT-3 rather than BERT.

But why a sudden switch from NER with BERT to seq2seq with GPT-3? This is for the following two reasons:

First, the few-shot performance of GPT-3 is surprisingly better than I thought. Just have a look at the example below.

an example of few-shot Idiomify. The proof is in the pudding!
image

Woah, and that is a result I got with only a handful of carefully curated examples, which is perfectly doable within a few hours. Yes, GPT-3 is expensive, and I'd never use it if I were in industry; the RoI of a GPT-3-based application would be stupidly low unless you charged customers 100 dollars a month. But hey, I'm an academic; I only need it to work on a few dozen personal statements. It's okay to stop being an NLP engineer for a few months and just embrace the world of prompt engineering, especially if the performance gain is big enough to justify the price.

But then what would be the point of your research, you may ask. Surely, merely presenting a use case of GPT-3 is by no means research in the field of NLP; it is just another interesting NLP project, because it does not improve the inductive bias of anything. Then what justifies my switch to GPT-3? Technically, I am not an NLP researcher; I'm an SLA researcher. That is, the aim of my research should be (and frankly, should have been) coming up with and justifying better ways of teaching a second language to EFL learners.

And that is the second reason for the switch to GPT-3. The top priority of my research should not be designing a better inductive bias. Rather, I should just use the best tools out there to build the feedback system as soon as possible, and focus on asking the right questions and answering them with scientific methods.

So, here are the two reasons, re-iterated:

  1. The Idiomify performance of GPT-3 is far better than I expected
  2. Suggesting better inductive bias is not my top priority

And so it begins, the world of prompt engineering.
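
To make the prompt-engineering step concrete, here is a sketch of a few-shot Idiomify prompt using the pre-v1 openai client; the model name, example pairs, and wording are my assumptions, not the exact prompt used in the repo:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # or load it from an environment variable

prompt = (
    "Rewrite the sentence with a suitable idiom.\n"
    "Original: He revealed the secret by mistake.\n"
    "Idiomified: He let the cat out of the bag by mistake.\n"
    "Original: I will support you whatever happens.\n"
    "Idiomified: I will stand by you whatever happens.\n"
    "Original: She finished the essay at the very last moment.\n"
    "Idiomified:"
)

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=60,
    temperature=0.3,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```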

To-do's

  • #16
  • deploy v3.0 to streamlit cloud

v3.0.1 - prompt design with a password check

The fine-tuning approach does not seem to work very well, for reasons I don't yet understand. But I must come up with a complete version by this Friday, so I should have a back-up plan.

This version is a minor upgrade of v3.0: it keeps the v3.0 prompt design, but adds the password check from v3.1. I'm doing this just in case I end up going back to this prompt design for my research.

  • #24
  • tag the version
  • deploy v3.0.1

v3.0.2 - pay-your-own-request version

Rather than allowing access only to those who know the master key, it is better to open the web app to anyone but ask them to register their own API key. Since OpenAI gives away 30 dollars' worth of API credit, that should cover enough requests as far as my research participants are concerned.

v3.0.3 - preparing for automating the research

  • a script for generating a fake alias for each participant
  • detailed instructions for signing up to OpenAI
  • a script for auto-generating Cloze tests (just recalling the definitions in Korean)
  • deploy!

v3.1 fine-tune Davinci with more quality examples

Approaching this with few-shot learning is not sustainable, as the API calls are just too expensive. You must fine-tune a model to build this successfully.

They say aim for up to 500 examples.
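
A sketch of how the training file could be prepared - OpenAI's (pre-v1) fine-tuning expects JSONL lines with a prompt and a completion; the separator, stop token, and file name here are assumptions:

```python
import json

pairs = [
    ("He revealed the secret by mistake.",
     "He let the cat out of the bag by mistake."),
    ("I will support you whatever happens.",
     "I will stand by you whatever happens."),
    # ... up to ~500 curated pairs
]

with open("literal2idiomatic.jsonl", "w") as f:
    for literal, idiomatic in pairs:
        f.write(json.dumps({
            "prompt": literal + "\n\n###\n\n",       # fixed separator ending the prompt
            "completion": " " + idiomatic + " END",  # leading space + stop sequence
        }) + "\n")

# then kick off the job with the pre-v1 CLI, e.g.:
#   openai api fine_tunes.create -t literal2idiomatic.jsonl -m davinci
```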

  • update readme.md
  • #20
  • #21
  • #18
  • #22
  • #23
  • experiment (1) - does v3.1 Idiomify more than one phrase, given a long paragraph?
  • experiment (2) - does v3.1 Idiomify give more natural suggestions? (Does it no longer "square a peg in a hole"?)
  • deploy v3.1

some time in the next version

you might want to evaluate your fine-tuned model with an extrinsic measure

  • evaluate the model with PPL (intrinsic measure)

fine-tune Davinci-002

Why?

image

Once you fine-tune a model, you'll be billed only for the tokens you actually send, without having to include the few-shot examples in the prompt every time. If that is the case, I should just build a dataset and fine-tune my model.

To-do's

  • save the OpenAI API secret locally using streamlit.secrets (see the sketch after this list)
  • [ ]
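
A minimal sketch of the streamlit.secrets item above, assuming the key is stored under the name OPENAI_API_KEY in .streamlit/secrets.toml:

```python
# .streamlit/secrets.toml would contain a single line like:
#   OPENAI_API_KEY = "sk-..."
import openai
import streamlit as st

openai.api_key = st.secrets["OPENAI_API_KEY"]
```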

Implementing `m-1-2` - testset, metrics, deploy

What?

Train Seq2Seq on the whole dataset.
Submit a random paragraph, and let Idiomifier idiomify the whole paragraph. See what we get from this.

Why?

We want to get a taste of what Idiomifier-as-a-rewriter could do. After seeing the results on a random paragraph,
I may get a sense of what the direction of my dissertation should be.

To-do's

  • dataset - do a train / test split (borrowing from the wisdomify code should do)
  • change the datamodule (no need to keep supporting the previous model; just change it!)
  • pipeline - batch processing (see the sketch after this list)
  • once a whole paragraph can be processed, run git tag -a tag012!
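
A rough sketch of the paragraph-level batch processing mentioned above, assuming a fine-tuned BART seq2seq checkpoint (the model id and the naive sentence split are illustrative only):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

CKPT = "eubinecto/idiomify-m-1-2"  # hypothetical model id, for illustration
tokenizer = BartTokenizer.from_pretrained(CKPT)
model = BartForConditionalGeneration.from_pretrained(CKPT)

paragraph = "I was very nervous at first. But my mentor supported me the whole time."
sentences = [s.strip() + "." for s in paragraph.split(".") if s.strip()]  # naive split

# batch-process every sentence in one forward pass
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=64)
idiomified = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(" ".join(idiomified))
```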

remove the special tokens

Why?

Maybe we don't need the <idiom> & </idiom> special tokens at all. The only function they have at the moment is highlighting the parts that have changed.
