The evaluation from bigscience-workshop

Convert validation code to work with Megatron as well as huggingface

Talk to Stas in engineering if you need help with this.

Add CrowS-Pairs to Full Benchmark

Refactor task template to merge multilingual.json and english.json

(per question raised about slide 6 at the evaluation meeting on 9/1).

As the repository gets larger, we will eventually need to decide on a code convention. Though preferences may vary, it's probably safe to stick to the setup in transformers, which is black + isort.

We could create a simplified Makefile and do something like

.PHONY: style

style:
	black .
	isort --profile=black .

Alternatively, create a .pre-commit-config.yaml.

repos:
  - repo: https://github.com/psf/black
    rev: 21.7b0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.9.3
    hooks:
      - id: isort
        args: ["--profile", "black"]

Create toy tasks/dummy code for prompt-based evals

Use to specify the API that all others will follow
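
One way such a toy task could pin down a shared API is sketched below. This is only an illustration: `PromptTask`, `ToyCopyTask`, `echo_model`, and the `examples`/`evaluate` method names are hypothetical and do not come from the repository. The sketch assumes a model can be treated as a plain text-in/text-out callable.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class PromptTask(ABC):
    """Hypothetical base class; the name and methods are illustrative only."""

    @abstractmethod
    def examples(self) -> List[Dict[str, str]]:
        """Return a list of {"prompt": ..., "target": ...} dicts."""

    def evaluate(self, generate: Callable[[str], str]) -> Dict[str, float]:
        """Score a text-in/text-out callable with exact-match accuracy."""
        data = self.examples()
        correct = sum(
            generate(ex["prompt"]).strip() == ex["target"] for ex in data
        )
        return {"exact_match": correct / len(data)}


class ToyCopyTask(PromptTask):
    """Dummy task: the model should repeat the word after 'Repeat:'."""

    def examples(self) -> List[Dict[str, str]]:
        words = ["apple", "banana", "cherry"]
        return [{"prompt": f"Repeat: {w}", "target": w} for w in words]


def echo_model(prompt: str) -> str:
    """Stand-in for a real model: returns the last word of the prompt."""
    return prompt.split()[-1]


result = ToyCopyTask().evaluate(echo_model)
```

Under this sketch, a new task would only need to subclass the base class and supply its examples; the scoring loop stays shared.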

Wrap evaluation benchmark using HF-trainer

This might sound like a bit of restructuring, but for the sake of future compatibility, I propose the following:

Move to the Hugging Face Trainer: this will let the repo automatically pick up DeepSpeed support and the other features specific to the transformers library.
We don't have to reinvent the wheel. With the Trainer, we only need to implement the following pieces for each task:
-- data_loader
-- DataCollator
-- compute_metrics
-- predictions (if needed)
If we later want to fine-tune our full model, we won't have to change much at the surface level.
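
To make the per-task pieces concrete, here is a rough sketch of what a collator and a `compute_metrics` callback might look like. It is written in plain Python so it runs without any dependencies; the function name `collate` and the `pad_id` parameter are hypothetical, and a real collator for the Trainer would return tensors rather than lists. The sketch assumes `compute_metrics` receives something that unpacks into predictions and labels.

```python
from typing import Dict, List, Sequence, Tuple


def collate(batch: List[Dict[str, List[int]]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every 'input_ids' list in the batch to the longest one."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        ids = item["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def compute_metrics(eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]) -> Dict[str, float]:
    """Accuracy from per-class scores; argmax picks the predicted label."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}


# Toy inputs standing in for a tokenized batch and model outputs
batch = collate([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
metrics = compute_metrics(([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0]))
```

Each task would then differ mainly in its dataset and metric function, while the training and evaluation loop comes from the library.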

I would love to take some responsibility if needed. Let me know. @jaketae @tianjianjiang @wilsonyhlee

Add GEM Response Generation to Full Benchmark

Response generation in Schema-Guided Dialog (including shuffle challenge set)

Add SuperGLUE Tasks to Full Benchmark

Setup testing

#56 set up a basic unit test, but we still have to decide what kinds of tests we want to run. This is especially important because GitHub workflows do not have GPU support, so even a simple benchmark run will take a non-trivial amount of time to complete. The proposal is to come up with ways to make the tests modular and reasonably fast.
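
As one possible direction, the sketch below separates fast CPU-only tests, which exercise metric logic against a stub model, from slow tests that are skipped unless explicitly enabled. Everything here is an assumption for illustration: `FakeModel`, `exact_match`, and the `RUN_SLOW` environment variable are hypothetical names, not existing settings in this repository.

```python
import os
import unittest


def exact_match(predictions, targets):
    """Fraction of predictions equal to their target after stripping."""
    pairs = list(zip(predictions, targets))
    return sum(p.strip() == t.strip() for p, t in pairs) / len(pairs)


class FakeModel:
    """CPU-only stub that answers from a lookup table instead of a network."""

    def __init__(self, answers):
        self.answers = answers

    def generate(self, prompt):
        return self.answers.get(prompt, "")


class FastTests(unittest.TestCase):
    def test_metric_with_stub_model(self):
        model = FakeModel({"2+2=": "4", "capital of France?": "Paris"})
        prompts = ["2+2=", "capital of France?", "unknown"]
        preds = [model.generate(p) for p in prompts]
        self.assertAlmostEqual(exact_match(preds, ["4", "Paris", "x"]), 2 / 3)


# RUN_SLOW is a hypothetical switch for this sketch, not an existing repo setting.
@unittest.skipUnless(os.environ.get("RUN_SLOW") == "1", "slow test; set RUN_SLOW=1")
class SlowTests(unittest.TestCase):
    def test_full_benchmark_placeholder(self):
        self.skipTest("placeholder for a real benchmark run")


# Run only the fast suite, as a CI workflow without GPUs might
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FastTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a split like this, the default workflow would only run the fast suite, and the slow suite could be triggered manually or on a machine with a GPU.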

Add Edge Probing Suite to Full Benchmark

Add POS Tagging with UD to Full Benchmark

Add LAMA to Full Benchmark

Add BioASQ to Full Benchmark

use to test generalization to unseen domain; maybe use FLEX?

Add GEM Wikilingua to Full Benchmark

all 18 languages

Adding TyDi QA to simple_benchmark

Per request by the modeling WG, we'd like to create a simple evaluation benchmark.

This issue tackles adding the TyDi QA dataset to the benchmark.

Add GEM E2E to Full Benchmark

E2E NLG (+shuffle challenge set)

Add HANS to Full Benchmark

use to test generalization to unseen task; maybe use FLEX?

Add WikiANN to Full Benchmark

Add MasakhaNER to Full Benchmark

benchmark mt5 on tydiqa prompting setup

Adding WMT to simple_benchmark

Per request by the modeling WG, we'd like to create a simple evaluation benchmark.

This issue tackles adding the WMT dataset to the benchmark.

Add PIQA to validation set

Add GEM ToTTo to Full Benchmark

Create toy tasks/dummy code for fine-tuning evals

Add TyDiQA to Full Benchmark

Add HuffPost Text Classification to Full Benchmark

use to test generalization to unseen labels; maybe use FLEX?

Add ANLI to Full Benchmark

use to test generalization to unseen task; maybe use FLEX?

Add DiaBLa to Full Benchmark

Add Flores 101 to Full Benchmark (as LM tasks, not MT tasks)

Add GEM XSum to Full Benchmark

including COVID+bfp02 challenge sets

Add Jigsaw Toxicity Classification to Full Benchmark

translate validation prompts into all training languages

Add LinCE Testbed to Full Benchmark

Add GEM WebNLG to Full Benchmark

Russian/English (+shuffle/numbers challenge sets)

Start overleaf for benchmark tech report

Add QA-SRL to Full Benchmark

Add QASPER to Full Benchmark

use to test generalization to unseen domain; maybe use FLEX?

Add GEM MLSum to Full Benchmark

Spanish/German (including COVID challenge sets)

Add MNLI to Full Benchmark

Coordinate with whoever is working on SuperGLUE; we only need to include MNLI once. However, NLI will be held out from model training (whereas the other SuperGLUE tasks will not), so MNLI results must be interpreted differently from the other SuperGLUE tasks.
