
metaicl's Introduction

MetaICL: Learning to Learn In Context

This includes an original implementation of "MetaICL: Learning to Learn In Context" by Sewon Min, Mike Lewis, Luke Zettlemoyer and Hannaneh Hajishirzi.

Check out our demo at qa.cs.washington.edu:2021!

This README mainly describes how to reproduce MetaICL and Channel MetaICL from the paper, but it also describes how to reproduce our baselines, including Multi-task zero-shot and various raw LM methods. All methods used in the paper are available in this repo (please see the table below).

Updates on 02/25/2022: Our code has been updated with minor changes: (1) better preprocessing for the poem_sentiment and superglue-copa datasets, and (2) some added utilities (specify --dataset for a single-dataset experiment or for a custom dataset).

Updates on 01/10/2022: Our code and checkpoints have been updated with better preprocessing (using newlines instead of spaces, and removing BOS and EOS), which improves performance by 1--4%. If you downloaded checkpoints prior to 01/10/2022, make sure to re-download them and use the updated code. Stay tuned for the updated paper with more details and updated results. You can find a brief summary of the updated results in the Results section of this README.

For any questions about the paper or the code, please contact the first author (email) or open an issue.

If you find our code or paper useful, please cite the paper:

@inproceedings{ min2022metaicl,
  title={ Meta{ICL}: Learning to Learn In Context },
  author={ Min, Sewon and Lewis, Mike and Zettlemoyer, Luke and Hajishirzi, Hannaneh },
  booktitle={ NAACL-HLT },
  year={ 2022 }
}

Content

  1. Installation
  2. Quick Start
  3. Data
  4. Training
  5. Inference
  6. Results
  7. Downloading Checkpoints

Installation

These are installation guidelines mainly for running baselines. Requirements for the data are provided here. All code is tested with Python 3.8.

pip install torch==1.9.0
pip install git+https://github.com/huggingface/transformers.git@c37573806ab3526dd805c49cbe2489ad4d68a9d7

To train the model, we use an 8-bit optimizer and mixed precision, which significantly save memory. To use them, run the following commands (skip this if you will only run inference using the released checkpoints):

# For 8-bit optimization: see https://github.com/facebookresearch/bitsandbytes for more details
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda102 # modify based on your CUDA version

# For mixed precision training: see https://github.com/NVIDIA/apex for more details
# make sure your nvcc is working (e.g. `nvcc --version`)
cd .. # move outside of this project directory
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../MetaICL # come back to this project directory

Quick Start

This is an example with the financial_phrasebank dataset.

First, prepare a list of training examples

train_data = [{"input": INPUT_1, "output": OUTPUT_1},
              {"input": INPUT_2, "output": OUTPUT_2},
              ...
              {"input": INPUT_K, "output": OUTPUT_K}]

If you prefer, you can download our training data by running python -m utils.download --demo_data and then load the downloaded file as follows.

with open("data/financial_phrasebank/financial_phrasebank_16_100_train.jsonl", "r") as f:
    train_data = []
    for line in f:
        train_data.append(json.loads(line))

Then, you can use our model as follows.

from metaicl.data import MetaICLData
from metaicl.model import MetaICLModel

# Load the model
data = MetaICLData(method="channel", max_length=1024, max_length_per_example=256)
model = MetaICLModel()
model.load("channel-metaicl")
model.cuda()
model.eval()

# Make a prediction for `input1`
input1 = "Both operating profit and net sales for the six-month period increased as compared to the corresponding period in 2007."
data.tensorize(train_data, [input1], options=["positive", "neutral", "negative"])
prediction = model.do_predict(data)[0]
print (prediction) # positive

# Make another prediction for `input2`
input2 = "The deal will have no significant effect on the acquiring company's equity ratio."
data.tensorize(train_data, [input2], options=["positive", "neutral", "negative"])
prediction = model.do_predict(data)[0]
print (prediction) # neutral
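
As a sketch, you can also tensorize several test inputs in one call. This assumes do_predict returns one prediction per tensorized input, which the [0] indexing above suggests, but you may want to verify it against the repo.

# Sketch: predict several inputs at once (assumes `do_predict` returns
# one prediction per input passed to `tensorize`)
inputs = [input1, input2]
data.tensorize(train_data, inputs, options=["positive", "neutral", "negative"])
predictions = model.do_predict(data)
for inp, pred in zip(inputs, predictions):
    print(inp, "->", pred)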

Data

As described in the paper, we use a collection of 142 tasks taken from CrossFit and UnifiedQA. We experiment with seven different settings, where there is no overlap in meta-training and target tasks. Download/Preprocessing guidelines are here.

Setting name | alias (for command) | # meta-train tasks | # meta-train examples | # target tasks
High Resource → Low Resource | hr_to_lr | 61 | 819,200 | 26
Classification → Classification | class_to_class | 43 | 384,022 | 20
Non-Classification → Classification | non_class_to_class | 37 | 368,768 | 20
QA → QA | qa_to_qa | 37 | 486,143 | 22
Non-QA → QA | non_qa_to_qa | 33 | 521,342 | 22
Non-NLI → NLI | non_nli_to_nli | 55 | 463,579 | 8
Non-Paraphrase Detection → Paraphrase Detection | non_paraphrase_to_paraphrase | 59 | 496,106 | 4

To run experiments for each setting, use "alias (for command)" for commands in the Training section and the Inference section.

None of the settings above uses templates/instructions. If you want to use the instruction version, as in the ablations in the paper, use the settings in the following table.

Setting name | alias (for command) | # instructions / meta-train task | # meta-train tasks | # meta-train examples | # target tasks
High Resource → Low Resource without instructions | hr_to_lr_noinst | 0 | 32 | 492,655 | 12
High Resource → Low Resource with instructions (1 per task) | hr_to_lr_inst | 1 | 32 | 492,655 | 12
High Resource → Low Resource with instructions (all) | hr_to_lr_inst_all | 8.3 | 32 | 492,655 | 12

If you use these data resources, please make sure to cite CrossFit and UnifiedQA.

@inproceedings{ ye2021crossfit,
    title={ {C}ross{F}it: A Few-shot Learning Challenge for Cross-task Generalization in NLP },
    author={ Ye, Qinyuan and Lin, Bill Yuchen and Ren, Xiang },
    booktitle={ EMNLP },
    year={ 2021 }
}
@inproceedings{ khashabi2020unifiedqa,
    title={ {U}nified{QA}: Crossing Format Boundaries With a Single QA System },
    author={ Khashabi, Daniel and Min, Sewon and Khot, Tushar and Sabharwal, Ashish and Tafjord, Oyvind and Clark, Peter and Hajishirzi, Hannaneh },
    booktitle={ Findings of EMNLP },
    year={ 2020 }
}

If you use the instruction version, please make sure to cite the T0 paper.

@inproceedings{ sanh2022multitask,
    title={ Multitask Prompted Training Enables Zero-Shot Task Generalization },
    author={ Victor Sanh and Albert Webson and Colin Raffel and Stephen H. Bach and Lintang Sutawika and Zaid Alyafeai and Antoine Chaffin and Arnaud Stiegler and Teven Le Scao and Arun Raja and Manan Dey and M Saiful Bari and Canwen Xu and Urmish Thakker and Shanya Sharma and Eliza Szczechla and Taewoon Kim and Gunjan Chhablani and Nihal Nayak and Debajyoti Datta and Jonathan Chang and Mike Tian-Jian Jiang and Han Wang and Matteo Manica and Sheng Shen and Zheng Xin Yong and Harshit Pandey and Rachel Bawden and Thomas Wang and Trishala Neeraj and Jos Rozen and Abheesht Sharma and Andrea Santilli and Thibault Fevry and Jason Alan Fries and Ryan Teehan and Stella Biderman and Leo Gao and Tali Bers and Thomas Wolf and Alexander M. Rush },
    booktitle={ ICLR },
    year={ 2022 }
}

How to Download and Preprocess

The code is modified from the original CrossFit repo. First, install requirements:

pip install datasets==1.4.0 wget

Warning: we found that datasets==1.4.0 is not compatible with the Transformers version we use for training and inference. Please use a separate environment for data preprocessing and for model training/inference.

cd preprocess
# preprocess from crossfit
python _build_gym.py --build --n_proc=40 --do_test
python _build_gym.py --build --n_proc=40 --do_train # skip if you won't run training yourself
# preprocess from unifiedqa
python unifiedqa.py --do_train --do_test # skip `--do_train` if you won't run training yourself

By default, preprocessed data is saved at data/.

Additional flags:

  • train_k: number of examples per task for meta-training tasks (16384 by default)
  • test_k: number of examples per task for target tasks (16 by default)

If you want to use values different from the defaults, simply add the flag, e.g., python _build_gym.py --build --n_proc=40 --do_test --test_k 4.

Process instruction version

The instruction version is for settings using instructions. We use instructions from BigScience PromptSource. First, fetch instructions (prompts) from PromptSource by doing the following.

# assuming you are still inside `preprocess` directory
cd ../.. # go outside of your project directory
git clone https://github.com/bigscience-workshop/promptsource.git
cd promptsource
git checkout 4e67a38d9642bde222cb90e36e8a66fd6e4a861a
mv promptsource ../MetaICL/preprocess/ # move promptsource directory under `preprocess` directory
cd ../MetaICL/preprocess # come back to `preprocess` directory
pip install pandas jinja2 "pyyaml>=5"

Note that this is a workaround that does not use pip to install the promptsource package, because promptsource requires Python<=3.7 while the rest of this repo uses Python 3.8. If promptsource starts supporting Python 3.8, please install the package following the guidelines in the original repo.

Then, download the data via:

python _build_gym.py --build --n_proc=20 --do_test --inst
python _build_gym.py --build --n_proc=20 --do_train --inst # skip if you won't run training yourself

Training

First, run the following command to tensorize the text data and save it.

python train.py \
  --task $task --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel \
  --do_tensorize --n_gpu 8 --n_process 40
  • --task: name of the setting, e.g., hr_to_lr, class_to_class, non_class_to_class, etc.
  • --k: # of examples per meta-training task
  • --test_k: # of examples to be used at inference
  • --seed: data seed for training data
  • --method: direct / channel
  • --n_gpu: the number of gpus you will use for training
  • --n_process: the number of processes used for preprocessing

Then, run the following command to train the model.

python -m torch.distributed.launch --nproc_per_node=8 train.py \
  --task $task --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 \
  --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/$task
  • --fp16: for mixed precision training
  • --optimization 8bit-adam: for 8-bit approximations for Adam optimizer
  • --batch_size: batch size per GPU; we use 1, so that the global batch size is 8
  • --num_training_steps: number of training steps; 30000 by default
  • --log_file: you can optionally specify this to save logs as a text file

Training takes around 4.5 hours.

If you want to train the Multi-task zero-shot model, which is one of our baselines in the paper, you can use similar commands for both tensorizing and training, but without --use_demonstrations and --test_k. Training takes around 3 hours.

Inference

python test.py --task $task --k 16 --split test --seed 100 --test_batch_size 16 \
    --method {channel|direct} --use_demonstrations \
    --out_dir checkpoints/metaicl/$task \
    --global_step 30000

Instead of specifying --global_step, you can specify --checkpoint with a path to the checkpoint if you want to use a checkpoint stored somewhere else (for example, if you have downloaded the released checkpoints and want to use them). You must specify exactly one of --checkpoint and --global_step.

  • --seed: seed for training data you will use at inference
  • --test_batch_size: batch size for inference; you can use 16 with a 32GB GPU
  • --unseen_domain_only: specify if you would like to run inference on unseen domain only
  • --log_file: similar to training, specify the path to the file where you want to save logs

If you want to run inference for Multi-task zero-shot baseline, you can use a similar command but without --use_demonstrations and --k. For this baseline, you can use --test_batch_size 64 with a 32GB GPU.

If you want to run the raw LM baselines in the paper, you do not need to specify --checkpoint or --global_step. Instead, specify --do_zeroshot, and then:

  • For 0-shot, run the command with --method direct
  • For PMI 0-shot, run the command with --is_null, and then run it again with --use_calibration (for both, with --method direct)
  • For Channel 0-shot, run the command with --method channel
  • For In-context/PMI In-context/Channel In-context, do the same as above but always add --use_demonstrations

You can use the same --out_dir for all raw LM baselines if you are using the same GPT-2 model, e.g., checkpoints/raw-gpt2-large.

Results

Here is the summary of key results. Full results can be found in the paper.

Method | hr_to_lr | class_to_class | non_class_to_class | qa_to_qa | non_qa_to_qa | non_nli_to_nli | non_paraphrase_to_paraphrase
0-shot | 34.8 | 34.2 | 34.2 | 40.2 | 40.2 | 25.5 | 34.2
PMI 0-shot | 35.1 | 33.8 | 33.8 | 40.2 | 40.2 | 27.9 | 39.2
Channel 0-shot | 36.5 | 37.3 | 37.3 | 38.7 | 38.7 | 33.9 | 39.5
In-context | 38.2/35.3 | 37.4/33.9 | 37.4/33.9 | 40.1/38.7 | 40.1/38.7 | 34.0/28.3 | 33.7/33.1
PMI In-context | 39.2/33.7 | 38.8/30.0 | 38.8/30.0 | 40.3/38.8 | 40.3/38.8 | 33.0/28.0 | 38.6/33.4
Channel In-context | 43.1/38.5 | 46.3/40.3 | 46.3/40.3 | 40.8/38.1 | 40.8/38.1 | 39.9/34.8 | 45.4/40.9
Multi-task 0-shot | 35.6 | 37.3 | 36.8 | 45.7 | 36.0 | 40.7 | 30.6
Channel Multi-task 0-shot | 38.8 | 40.9 | 42.2 | 42.1 | 36.4 | 36.8 | 35.1
MetaICL | 43.3/41.7 | 43.4/39.9 | 38.1/31.8 | 46.0/44.8 | 38.5/36.8 | 49.0/44.8 | 33.1/33.1
Channel MetaICL | 49.1/46.8 | 50.7/48.0 | 50.6/48.1 | 44.9/43.5 | 41.9/40.5 | 54.6/51.9 | 52.2/50.3

The two numbers are computed by taking the average and the worst-case performance per task (macro-F1 for classification tasks and accuracy for the others) over five different seeds, and then taking the macro-average over tasks (one number is reported for 0-shot models since they do not depend on the seed). Bold indicates the best average result.
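
To make the aggregation concrete, here is a minimal sketch with made-up per-task scores (illustration only, not the repo's evaluation code):

import numpy as np

# scores[task] = per-task metric (macro-F1 or accuracy) for each of the five data seeds;
# the numbers below are made up for illustration
scores = {
    "task_a": [41.0, 43.5, 40.2, 44.1, 42.0],
    "task_b": [55.3, 52.1, 54.0, 53.8, 51.9],
}

avg_per_task = [np.mean(s) for s in scores.values()]    # average over seeds, per task
worst_per_task = [np.min(s) for s in scores.values()]   # worst case over seeds, per task

# macro-average over tasks gives the "average/worst-case" pair reported per setting
print(f"{np.mean(avg_per_task):.1f}/{np.mean(worst_per_task):.1f}")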

Updates on 01/10/2022: The results are updated from the previous version with a better preprocessing. Updates are not reflected in the paper yet---stay tuned!

Downloading Checkpoints

You can run the inference script by specifying --checkpoint {model_name}, and the script will automatically download the corresponding checkpoint under the checkpoints/ directory. {model_name} can be one of the following (see the usage sketch after this list):

  • {metaicl|channel-metaicl|multitask-zero|channel-multitask-zero}: corresponding method trained in the hr_to_lr setting
  • {metaicl|channel-metaicl|multitask-zero|channel-multitask-zero}-instruction: corresponding method trained in the hr_to_lr_inst_all setting
  • {metaicl|channel-metaicl|multitask-zero|channel-multitask-zero}/{setting_name}: corresponding method trained in the corresponding setting (for setting_name, see the Table in the data section)
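
As a usage sketch, assuming model.load accepts the same names as --checkpoint (the Quick Start above uses channel-metaicl this way), loading the Channel MetaICL checkpoint trained in the hr_to_lr_inst_all setting might look like:

from metaicl.data import MetaICLData
from metaicl.model import MetaICLModel

data = MetaICLData(method="channel", max_length=1024, max_length_per_example=256)
model = MetaICLModel()
model.load("channel-metaicl-instruction")  # Channel MetaICL trained in hr_to_lr_inst_all
model.cuda()
model.eval()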

Alternatively, you can download all checkpoints via:

python -m utils.download --checkpoints --setting all --method all

If you want to download only one of the settings, specify --setting {setting_name} (using the "alias for command" from the setting table above). If you want to download only one of the methods, specify --method {method_name}, where method_name is one of metaicl, channel-metaicl, multitask-zero, channel-multitask-zero.

Simply reproducing all results in the paper

You can use the following commands (based on a 32GB GPU):

# raw LM zero-shot baselines (0-shot, PMI 0-shot, Channel 0-shot)
bash reproduce.sh {setting_name} {zero|pmi-zero|channel-zero} 100 64

# raw LM in-context baselines (in-context, PMI in-context, Channel in-context)
bash reproduce.sh {setting_name} {ic|pmi-ic|channel-ic} 100,13,21,42,87 16

# Multi-task 0-shot baselines
bash reproduce.sh {setting_name} {multitask-zero|channel-multitask-zero} 100 64

# MetaICL
bash reproduce.sh {setting_name} {metaicl|channel-metaicl} 100,13,21,42,87 16

License

MetaICL is CC-BY-NC 4.0 licensed.


metaicl's Issues

Max length of context

Thank you for your great work! It's very inspiring.

I would like your help clarifying one point. The input to the model is x_1, y_1, ..., x_k, y_k, x_{k+1}, where k = 16 in your experimental settings, and the maximum length of the model is 1024. My question is: how can 16 examples from datasets with long texts fit into 1024 tokens?

Consider XSum (used in the HR → LR setting), where the average document length is 431 and the average summary length is 23, so one example has an average length of roughly 450. If the long context is truncated, the maximum length of 1024 affords fewer than 3 examples on average, which is far below 16.

Given the maximum length limit and the experimental settings, I assume truncating is the only available solution. I just would like to know what you did in your experiments and whether you have any suggestions on how to resolve this issue.

Thank you very much in advance!

Data Downloading error

Unable to download the provided demo dataset. Can you please share a link to download the demo data?

TypeError: can't pickle _thread.RLock objects

Hi there,

After doing all the installation, data downloading, and preprocessing, I ran the first command of the training step:

python train.py \
  --task $task --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel \
  --do_tensorize --n_gpu 8 --n_process 40

and it gives me the following error. Would you help me with this?

Traceback (most recent call last):
File "train.py", line 140, in
main(logger, args)
File "train.py", line 60, in main
metaicl_data.tensorize_for_training(train_data, keyword=args.task, seed=args.seed)
File "/home/monajati/main/fewshot_vl/MetaICL/metaicl/data.py", line 410, in tensorize_for_training
for out in p.imap_unordered(self._tensorize_for_training, sharded_inputs):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
raise value
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 210, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects

Inference with multi-GPU

Is there any way to run inference with multiple GPUs, since some larger models on HuggingFace cannot be loaded on a single GPU?

Setting k for pre-processing train data

How can we change the number of training samples produced by the preprocessing script? It seems many of the files (e.g. ade_effect.py, anli.py, etc.) have k hardcoded and thus do not produce the number of examples passed in via the arguments.

Thank you!

Checksum error in datasets==1.4.0

Was getting checksum errors when downloading data for a few datasets in _build_gym.py. Seems to be this error: huggingface/datasets#3787

I've been using the latest version, datasets==2.4.0, and it seems to be working alright for the few runs that I did.

Inference time

Hi, I would like to know how long inference on a single task (e.g. metaicl or channel-metaicl on non_qa_to_qa) should take. I followed the provided command, but a single config took hours on a V100. I would like to know whether this is abnormal. Thanks.

Confusion with classification when having multi tokens

Hello, I am confused about the classification method when the options include multi-token labels.

Let's assume that the classification task has two answer options, [favor, against], and the input is "I do not have an opinion about this move".

If we assume a prompt template as

Input: I do not have an opinion about this move
Output: 

what I understand is that MetaICL calculates the loss of

Input: I do not have an opinion about this move
Output: favor

and

Input: I do not have an opinion about this move
Output: against

and then averages the losses over the token positions where 'favor' and 'against' are located, and compares the two average losses.
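
For concreteness, here is a minimal sketch of the scoring procedure described above, using plain GPT-2 and average per-token cross-entropy over the option tokens (illustration only, not the repo's implementation):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Input: I do not have an opinion about this move\nOutput:"

def avg_option_loss(option):
    # average cross-entropy over the option tokens only
    prompt_ids = tokenizer.encode(prompt)
    option_ids = tokenizer.encode(" " + option)
    input_ids = torch.tensor([prompt_ids + option_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position t predict the token at position t+1
    option_logits = logits[0, len(prompt_ids) - 1 : -1]
    losses = torch.nn.functional.cross_entropy(
        option_logits, torch.tensor(option_ids), reduction="none"
    )
    return losses.mean().item()

scores = {opt: avg_option_loss(opt) for opt in ["favor", "against"]}
print(min(scores, key=scores.get))  # option with the lowest average loss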
However, I am wondering whether it is a fair classification method.
To illustrate, let's say that 'favor' is tokenized as ['fa', 'vor'] and 'against' is tokenized as ['against'].

Then I think the loss for 'vor' could be heavily affected by the preceding 'fa' and may be significantly small.

This can lead to a smaller average loss and give 'favor' an advantage over 'against'.

I am curious about your comments.

Best regards,
Wookje Han

Can you release the checkpoints for the smaller models?

Hi,

to facilitate research with fewer resources, could you please release the checkpoints for the smaller model sizes mentioned in Appendix C.2 of the paper? It seems I was only able to download the checkpoints for the 774M-parameter model.

Thank you!

CHILD PROCESS FAILED WITH NO ERROR_FILE

Hi There,
any idea why I'm getting this error when running the training script:
python -m torch.distributed.run --nproc_per_node=8 train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/hr_to_lr

after successfully running the following script for tensorizing:
python train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel --do_tensorize --n_gpu 4 --n_process 40

my log file:


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 17057 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in
main()
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
run(args)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


        train.py FAILED            

=======================================
Root Cause:
[0]:
time: 2022-02-04_20:50:02
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 17057)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2022-02-04_20:50:02
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 17058)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2022-02-04_20:50:02
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 17059)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[3]:
time: 2022-02-04_20:50:02
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 17060)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[4]:
time: 2022-02-04_20:50:02
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 17061)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[5]:
time: 2022-02-04_20:50:02
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 17062)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[6]:
time: 2022-02-04_20:50:02
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 17063)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[7]:
time: 2022-02-04_20:50:02
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 17064)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Why using [SEP] token while using gpt2 tokenizer?

Hi, I have a question about data preprocessing.

I noticed that the code uses [SEP] when preprocessing datasets such as superglue-rte or wic.

However, I am wondering why MetaICL chooses the [SEP] token when it uses the GPT-2 tokenizer.

The reason I find this strange is that, to my understanding, the GPT-2 tokenizer doesn't handle the [SEP] token.

It just tokenizes it to [, SE, P, ].

Would you please explain the reason for using the [SEP] token?

Thanks in advance.
Best regards,
Wookje Han

Questions around the data preprocessing

Hi again! I would like to run the training procedure with my own custom datasets, but I'm finding the data setup quite confusing.

In particular, I'm trying to understand the preprocessing done to generate the files in the MetaICL/data directory. Since I am not using HuggingFace datasets, I think the easiest route for me is to adapt unifiedqa.py to take my own input and output the right format.

However, looking at the files that have been generated in my MetaICL/data directory, I see a lot of files and I do not understand how they are used:

$ tree MetaICL/data
data
├── ade_corpus_v2-classification
│   ├── ade_corpus_v2-classification_16_100_dev.jsonl
│   ├── ade_corpus_v2-classification_16_100_test.jsonl
│   ├── ade_corpus_v2-classification_16_100_train.jsonl
│   ├── ade_corpus_v2-classification_16_13_dev.jsonl
│   ├── ade_corpus_v2-classification_16_13_test.jsonl
│   ├── ade_corpus_v2-classification_16_13_train.jsonl
│   ├── ade_corpus_v2-classification_16_21_dev.jsonl
│   ├── ade_corpus_v2-classification_16_21_test.jsonl
│   ├── ade_corpus_v2-classification_16_21_train.jsonl
│   ├── ade_corpus_v2-classification_16384_100_dev.jsonl
│   ├── ade_corpus_v2-classification_16384_100_train.jsonl
│   ├── ade_corpus_v2-classification_16_42_dev.jsonl
│   ├── ade_corpus_v2-classification_16_42_test.jsonl
│   ├── ade_corpus_v2-classification_16_42_train.jsonl
│   ├── ade_corpus_v2-classification_16_87_dev.jsonl
│   ├── ade_corpus_v2-classification_16_87_test.jsonl
│   └── ade_corpus_v2-classification_16_87_train.jsonl
├── ade_corpus_v2-dosage
│   ├── ade_corpus_v2-dosage_16_100_dev.jsonl
...

I understand that the files are named {task}_{k}_{seed}_{split}.jsonl, but I am confused about how these files are used and which of them are used during training / testing.
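
(For reference, a hypothetical sketch of the file paths a custom task would have under the {task}_{k}_{seed}_{split}.jsonl convention; the role comments are my best guess based on the observations in the questions below.)

import os

task = "my_task"  # hypothetical custom task name
expected = [
    (16384, 100, "train"),  # appears to be the meta-training file (see question 1 below)
    (16, 100, "train"),     # 16 lines: presumably the demonstrations used at inference
    (16, 100, "test"),      # target-task evaluation examples
]
for k, seed, split in expected:
    print(os.path.join("data", task, f"{task}_{k}_{seed}_{split}.jsonl"))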

My main questions are:

  1. Can you please explain how each of those files is used during training and testing?

  2. Why do you generate so many files, instead of simply 3 files for train, dev and test?

In case it's not covered in the general explanation, I also have some additional questions from looking through the code:

  1. With the default setup, it seems like only *_16384_100_train.jsonl is used during training. So if I want to train on a custom dataset, I can just put my file in data/my_task/my_task_16384_100_train.jsonl without any of the other files, and that should be enough to run the train procedure?

  2. The *_16_{seed}_train.jsonl files are always 16 lines long, whereas *_16_{seed}_test.jsonl files are always much longer. Why?

  3. As far as I can tell, the *_dev.jsonl files are never used?

  4. Is there something special about seed=100? Looking at this and this.

Thank you very much in advance!

Using MetaICL for Unconstrained Generation

Currently, the "options" field is required in any test set .json(l) for the data to load without throwing exceptions:

assert len(dp["options"])>=2, dp

While the original use-case is to evaluate on multiple-choice or classification problems only, I'd also be interested in evaluating the non-QA models on datasets like SQuAD or NaturalQuestions, which are not multiple-choice. It would also be interesting to observe what the model generates for conditional generation tasks in general.

I'd be willing to implement this in a PR if you think this feature would be useful!

MetaICL still needs to update its parameters during meta-training?

In the Abstract of the paper, it says "with no parameter updates or task-specific templates".

I thought this project was a new method for "prompt tuning" through meta-learning, aiming to provide a better prompt/instruction than regular in-context learning. But in the code, in model.do_train(), the model's parameters are updated by backpropagation (loss.backward()). Is it still a type of fine-tuning?

If I change my base LM to something huge like GPT-3 176B, it costs too much.

Error when data tensorize

Hello,

Thanks a lot for your code.

I found that during the data tensorizing process, I encountered the following error:

ValueError: 'a' cannot be empty unless no samples are taken

I assume that the code in data.py in line 246

return [r] + _draw_random(tot, n-1, exclude_indices | set([r]))

causes the problem.

Error in `_build_gym.py` when using `--do_train`

Hi, I faced an error when running

python _build_gym.py --build --n_proc=40 --do_train # skip if you won't run training yourself

as described in the README. The error is:

Traceback (most recent call last):
  File "onestop_english.py", line 62, in <module>
    main()
  File "onestop_english.py", line 59, in main
    train, dev, test = dataset.generate_k_shot_data(k=16, seed=seed, path="../data/")
  File "/home/jc11431/git/MetaICL/preprocess/fewshot_gym_dataset.py", line 171, in generate_k_shot_data
    self.save(path, k, seed, k_shot_train, k_shot_dev, k_shot_test)
  File "/home/jc11431/git/MetaICL/preprocess/fewshot_gym_dataset.py", line 109, in save
    self.write(k_shot_test, prefix + "_test.jsonl")
  File "/home/jc11431/git/MetaICL/preprocess/fewshot_gym_dataset.py", line 115, in write
    fout.write(line+"\n")
TypeError: can only concatenate tuple (not "str") to tuple

and this error is repeated for a large number of the tasks.

Basically, in fewshot_gym_dataset.py it seems like --do_train does not completely skip writing the test data, but it does skip the processing step, so the writer fails in some cases when it tries to write k_shot_test, which has not been processed into a string.

I was able to fix this by changing

                self.write(k_shot_train, prefix + "_train.jsonl")
                self.write(k_shot_dev, prefix + "_dev.jsonl")
                self.write(k_shot_test, prefix + "_test.jsonl")

to

                self.write(k_shot_train, prefix + "_train.jsonl")
                if do_test:
                    self.write(k_shot_dev, prefix + "_dev.jsonl")
                    self.write(k_shot_test, prefix + "_test.jsonl")

but I'm not sure if that's your intended behavior. (My intuitive understanding is that --do_test should also skip anything to do with train examples, and vice versa.)

Additional information on `run_model` method from the `MetaICLModel` class

Hi @shmsw25

thank you very much for publishing the MetaICL code base.

I'm currently going through the code to understand it better and I stumbled over the following lines in the MetaICLModel class of the run_model method:
https://github.com/facebookresearch/MetaICL/blob/main/metaicl/model.py#L273 (removal of the last element)
https://github.com/facebookresearch/MetaICL/blob/main/metaicl/model.py#L277:L278 (removal of the first element)

I'm not sure why the last and the first elements are removed; one guess would be that it is due to the addition of newlines or spaces in the _prepro_each_datapoint function (https://github.com/facebookresearch/MetaICL/blob/main/metaicl/data.py#L116)?

(Another guess is that it is related to EOS and BOS tokens, but those are not used in the MetaICLData class because in the prepro_sentence_pair_single function they are commented out https://github.com/facebookresearch/MetaICL/blob/main/metaicl/data.py#L457:L460 )

Maybe I'm missing something important to understand the run_model method?

Reproducibility of PMI methods?

Hi, I tried running reproduce.sh exactly as described at https://github.com/JunShern/MetaICL#simply-reproducing-all-results-in-the-paper, and so far the results I get exactly match the reported results in the README, except for the pmi-zero and pmi-ic settings.

From the README:

Method | hr_to_lr | class_to_class | non_class_to_class | qa_to_qa | non_qa_to_qa | non_nli_to_nli | non_paraphrase_to_paraphrase
zero | 34.9 | 34.2 | 34.2 | 40.4 | 40.4 | 25.5 | 34.2
pmi-zero | 34.8 | 33.2 | 33.2 | 40.4 | 40.4 | 27.9 | 39.2
channel-zero | 36.8 | 37.2 | 37.2 | 39.2 | 39.2 | 33.9 | 39.5
ic | 38.2 | 37.4 | 37.4 | 40.2 | 40.2 | 34 | 33.7
pmi-ic | 38.9 | 38.3 | 38.3 | 40.5 | 40.5 | 33 | 38.6

The results I get:

Method | hr_to_lr | class_to_class | non_class_to_class | qa_to_qa | non_qa_to_qa | non_nli_to_nli | non_paraphrase_to_paraphrase
zero | 34.9 | 34.2 | 34.2 | 40.4 | 40.4 | 25.5 | 34.2
pmi-zero | 31.3 | 24.1 | 24.1 | 36.4 | 36.4 | 26 | 33.1
channel-zero | 36.8 | 37.2 | 37.2 | 39.2 | 39.2 | 33.9 | 39.5
ic | 38.2 | 37.4 | 37.4 | 40.2 | 40.2 | 34 | 33.7
pmi-ic | 37.6 | 35.8 | -- | -- | -- | 31.7 | 32.9

(Ignore the -- above, I have not gotten results for those yet.)

If you compare the tables, you will see that all the rows are identical except for pmi-zero and pmi-ic. In particular, if you compare class_to_class and non_class_to_class settings for pmi-zero, the difference is as much as 9%.

Is this expected / Do you have any guess why this happens?

Using MetaICL for multi-label classification

I may have missed this in the paper, but from my understanding, the setup used for MetaICL doesn't lend itself too well for multi-label classification, where you can classify a given input as multiple labels (for example picture of cat -> mammal, cat, animal, given options mammal, cat, animal, reptile, plant, etc.).

I say this because I imagine that evaluating the log likelihood of every option combination is unfeasible. You would have to evaluate [mammal], [mammal, cat], [mammal, cat, animal], [reptile], [reptile, cat], etc. You can binarise this to get 2^N completion options for a problem with N labels, which for N > 10 is already in the thousands, running into prompt length limits.

My question is, have the authors considered a way to circumvent this issue? Is this even an issue or is my understanding incorrect? What are your thoughts?

Thank you.

How are the options shown to the model, if at all?

I apologise for additional clarification requests, but this is not immediately clear from the paper or the code. For multiple-choice tasks, how are the options shown to the model, if at all, particularly during evaluation?

Making use of @JunShern's pseudocode from #7 as a reference to the internals of MetaICL:

for task in eval_tasks:
  for seed in [100, 13, 21, 42, 87]:

    # Randomly sample a 16-shot context
    random.seed(seed)
    k_shot_context = random.choice(task['train'], size=16) # list of 16 (x,y) pairs
    
    # Evaluate with the context on all test examples
    for x, y in task['test']:
      prompt = str(k_shot_context) + str(x)
      y_pred = model.generate(prompt)
      score = calc_score(y, y_pred)

How are the available options (which for multiple-choice are unique to a given x) shown to the model? Are they concatenated to the input x? Otherwise generating a prediction for y seems very difficult. This also applies to the k-shot examples, right?

For example, consider this multiple choice example:

x | y | options
"Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine." | 2 | ["1: Ian", "2: Dennis"]
  1. ignoring the k_shot_context for simplicity, do we have that str(x) is the following?

    Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.
    1: Ian
    2: Dennis
    

    Or is str(x) just:

    Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.
    

    ?

  2. If we were to use the provided example as one of the shots in our k_shot context, would it look like

    Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.
    1: Ian
    2: Dennis
    2
    

    or like

    Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.
    2
    

I imagine in both 1. and 2., it's the former case (i.e. we concatenate the options to the input), because otherwise the model faces the very difficult task of guessing what the options are and then generating the correct option. But I may be wrong, I tried looking at the provided code and could not see any evidence of option concatenation, so I am confused.

Thank you!

Confusion about the initial evaluation results of the GPT2-LARGE model

Hello, I have a question. I used the gpt2-large pretrained language model to evaluate unseen_domain_test in-context; the config file is configs/class_to_class.json, the method is direct, and k=16. I found the in-context F1 result is 38.63 (the paper's result is 30.6). From the paper and the config file, we can see that the unseen_domain_test datasets include "poem_sentiment", "climate_fever", "medical_questions_pairs", and "financial_phrasebank", all with 2 or 3 categories, so even a random pick should score at least 33.3 (1/3). Why does the paper report 30.6? A little confused.
