

dynabench's People

Contributors

adinawilliams, anandrajaram21, ciroye, douwekiela, entilzha, ewadkins, fzchriha, gwenzek, ishita-0112, kokrui, ktirumalafb, maxbartolo, remg1997, tristanthrush, vontell, zpapakipos


dynabench's Issues

Global user leaderboard

Add a "Users" top nav link, which goes to a page that shows a table of all users, sorted by vMER (and maybe their badges?).

Tag-based filtering of contexts and examples

cc @easonnie

We need to be able to add tags to contexts (when importing them) and examples (when storing them). We then need to be able to filter by these tags, i.e. when getting a new context we should be able to specify the desired tag, and when getting a new example to validate we should be able to specify the desired tag. A rough schema sketch is included after the suggestion below.

Suggestion:

  1. Add tags table
  2. Add table linking tags to contexts and tags to examples
  3. Add filter endpoints to context and examples controllers
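A minimal sketch of what these additions could look like, assuming a SQLAlchemy-style model layer; the table, column, and model names below are illustrative, not the actual dynabench schema:

from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Context(Base):
    # stand-in for the existing contexts model, reduced to two columns
    __tablename__ = "contexts"
    id = Column(Integer, primary_key=True)
    context = Column(Text)

class Tag(Base):
    __tablename__ = "tags"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), unique=True, nullable=False)

class ContextTag(Base):
    # link table: one row per (context, tag) pair
    __tablename__ = "context_tags"
    id = Column(Integer, primary_key=True)
    cid = Column(Integer, ForeignKey("contexts.id"), nullable=False)
    tid = Column(Integer, ForeignKey("tags.id"), nullable=False)

class ExampleTag(Base):
    # link table: one row per (example, tag) pair; the examples table exists elsewhere
    __tablename__ = "example_tags"
    id = Column(Integer, primary_key=True)
    eid = Column(Integer, ForeignKey("examples.id"), nullable=False)
    tid = Column(Integer, ForeignKey("tags.id"), nullable=False)

def contexts_with_tag(session, tag_name):
    # the kind of query a filter endpoint in the contexts controller could run
    return (
        session.query(Context)
        .join(ContextTag, ContextTag.cid == Context.id)
        .join(Tag, Tag.id == ContextTag.tid)
        .filter(Tag.name == tag_name)
    )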

Add devops scripts to codebase

We're currently using a custom and hacky shell script written by contractors. Add this here so we can clean it up later.

The tag filtering does not seem to work right now.

I tried the following API request: https://api.dynabench.org/contexts/1/4?tags=whatever. The server still returns a context:

{
  "id": 76472,
  "r_realid": 12,
  "context": "How to start free running You need to start off slow and small. During your first few months it is recommended that you condition your body every day and practice small techniques repetitively to build muscle coordination and confidence. Performing some form of calisthenics & a bit of weight training goes a long way.",
  "metadata_json": null,
  "total_used": null,
  "last_used": null
}

I can also retrieve an example with https://api.dynabench.org/examples/1/4?tags=whatever.

However, I expected a 500 response, per bottle.abort(500, f"No contexts available ({round.id})").
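For reference, a minimal reproduction using the requests library (the URL and tag value are the ones from the report above):

import requests

# A tag that should not match any context; per the report, the API still
# returns a context instead of a 500 "No contexts available" error.
resp = requests.get(
    "https://api.dynabench.org/contexts/1/4",
    params={"tags": "whatever"},
)
print(resp.status_code)
print(resp.json())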

Make sure we store model predictions

People upload their model predictions now, which we score in api/controllers/model.py, but we do not actually store their uploaded labels. We should do that: add a field to the scores table, and in api/models/score.py's bulk_create, make sure the actual predictions are stored as well.
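A self-contained sketch of the idea; the row and field names here (e.g. raw_predictions) are hypothetical, not the actual scores schema:

import json

def build_score_rows(model_id, round_scores):
    """Hypothetical helper: turn per-round results into rows for the scores
    table, keeping the uploaded predictions instead of discarding them."""
    rows = []
    for rs in round_scores:
        rows.append({
            "mid": model_id,
            "r_realid": rs["round_id"],
            "perf": rs["accuracy"],
            # new field: the raw uploaded predictions, serialized to JSON
            "raw_predictions": json.dumps(rs["predictions"]),
        })
    return rows

rows = build_score_rows(7, [{"round_id": 1, "accuracy": 0.83, "predictions": ["pos", "neg"]}])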

Script to process Turker data

Write a process_data.py script that takes an export.json file as input and interfaces with the local mephisto to make sure workers are paid and get bonuses where applicable. A skeleton is sketched after the steps below.

Steps:

  1. Check that the example comes from mturk
  2. Map the example to the appropriate HIT
  3. Check if you have enough information to decide whether to approve/reject/bonus the HIT (e.g. are all examples verified)
  4. Approve/reject the HIT and give bonus if applicable
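A skeleton of what process_data.py could look like. The mephisto interaction is left as clearly labelled stubs, and the export field names are assumptions about the export format:

import argparse
import json

def find_hit_for_example(example):
    # stub: look up the HIT for this example in the local mephisto database
    raise NotImplementedError

def ready_to_decide(example):
    # stub: e.g. check that all validations for the example are in
    raise NotImplementedError

def approve_reject_bonus(hit, example):
    # stub: approve/reject the HIT via mephisto and pay a bonus if applicable
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("export_file")
    args = parser.parse_args()

    with open(args.export_file, "r", encoding="utf-8") as f:
        examples = json.load(f)

    for ex in examples:
        metadata = ex.get("metadata_json") or {}
        if isinstance(metadata, str):
            metadata = json.loads(metadata)
        # Step 1: only process examples that came from mturk
        if not metadata.get("annotator_id"):
            continue
        # Step 2: map the example to the appropriate HIT
        hit = find_hit_for_example(ex)
        # Step 3: skip if we cannot decide yet
        if not ready_to_decide(ex):
            continue
        # Step 4: approve/reject and bonus
        approve_reject_bonus(hit, ex)

if __name__ == "__main__":
    main()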

Update annotators/README.md

Update the README to reflect the latest changes and to make sure this is as easy for people as we can possibly make it.

Badges

We only have some basic badges working now; the rest should be implemented too.

We need a python script that runs as a cronjob periodically (every N minutes), that will check if any badge conditions are met, and assign badges/add notifications accordingly.
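A minimal sketch of such a checker; the badge names, conditions, and DB helpers are all hypothetical stand-ins:

from datetime import datetime

# Hypothetical registry mapping a badge name to a predicate over a user record.
BADGE_CONDITIONS = {
    "FIRST_FOOLING_EXAMPLE": lambda user: user["fooling_count"] >= 1,
    "HUNDRED_EXAMPLES": lambda user: user["total_count"] >= 100,
}

def check_badges(users, existing_badges, award_badge, notify):
    """users: list of user dicts; existing_badges: set of (uid, badge_name);
    award_badge / notify: callables that write to the DB (hypothetical)."""
    for user in users:
        for name, condition in BADGE_CONDITIONS.items():
            if (user["id"], name) in existing_badges:
                continue
            if condition(user):
                award_badge(user["id"], name, datetime.utcnow())
                notify(user["id"], f"You earned the {name} badge!")

# Run from cron, e.g. every 15 minutes:
# */15 * * * * python badge_cron.py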

Add filters for task owner validation interface

Right now the task owner validation interface only shows verified_flagged=True examples. Add different content filters as a dropdown underneath the validation interface that allows owners to say what data they want to view/override:

  1. Flagged once
  2. Flagged
  3. High disagreement
  4. All

Include anonymized annotator ids for validations

For every validation that we store in the validations table, also store a unique anon_id. This value is DIFFERENT from the example anon_id (e.g. add some string after the secret in the hashing function to make it unique).

Add a list of anon_val_ids to the example export, which are not mappable back to anon_ids; the anon_val_ids should be combinable with the validation labels (so either in the same order, or both are given as a list of tuples).
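A sketch of how the two id spaces could be kept separate; the hashing scheme and secret handling here are illustrative, not the actual dynabench implementation:

import hashlib

def make_anon_id(secret, uid, suffix=""):
    # Appending a suffix to the secret yields ids that are stable per user
    # within one id space but not mappable across spaces.
    return hashlib.sha256(f"{secret}{suffix}{uid}".encode("utf-8")).hexdigest()

example_anon_id = make_anon_id("SECRET", 42)
anon_val_id = make_anon_id("SECRET", 42, suffix="-validation")
assert example_anon_id != anon_val_id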

cc @bvidgen, @ZeerakW

Get validations for entries that don't fool the model

At the moment (at least for hate speech), entries are only validated if model_wrong == TRUE (i.e. the model has been 'tricked'). At present, we get up to 5 validations for each entry.

We would like to also have validations for cases where the model correctly classified the content. We see this as the most efficient use of annotator time:

  1. If validators #1 and #2 agree with the original labels of the entry, then it does not need any further validation.
  2. If either validator #1 or #2 disagrees with the original entry, then it needs all 5 validations.

Alternatively, you could set it so that if just validator #1 agrees with the original labels of the entry then there is no further validation, and/or cap it at a maximum of 3 validations. This would probably be a smarter use of people's time.
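The proposed stopping rule, sketched as a function (the thresholds follow the discussion above and are easy to adjust):

def needs_more_validations(validation_labels, original_label,
                           max_validations=5, early_agree=2):
    """Return True if the example should be sent to another validator.
    validation_labels: labels given by validators so far, in order."""
    if len(validation_labels) >= max_validations:
        return False
    first = validation_labels[:early_agree]
    # If the first `early_agree` validators all agree with the original
    # label, stop early; otherwise collect the full set of validations.
    if len(first) == early_agree and all(label == original_label for label in first):
        return False
    return True

The alternative above corresponds to early_agree=1 and/or max_validations=3.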

@ZeerakW

Change model prediction upload form to reflect model cards

We want to capture many more details about models than we do now. Look into model cards (https://arxiv.org/abs/1810.03993) for this. Store things like number of parameters, inference time, etc.

Updates with notes from our discussion:

  • Change "Description" to "Summary"
  • Add fields:
    -- # Parameters
    -- Language(s)
    -- License(s)
  • Model card template:
    -- Model Details
    -- Intended Use, Caveats and Recommendations
    -- Data (Train and Dev data)
    -- Additional Information
    -- Ethical and Fairness Considerations
  • On the model display page, show performance first
  • No need for inference time information just yet
  • Modal explaining how to find out a model's number of parameters

Examples subpage and export

Users should be able to export all of the examples they generated on the platform. On the user profile, add an "Examples" tab/subpage. On that subpage, have a table that lists all tasks and displays the user's stats on those tasks (which we can more easily compute on the fly once we have the validations table). For each task, also have an "Export" button that allows them to export their own data for that task (taking care not to accidentally expose fields they shouldn't have access to), in the same way a task owner can export the data for a given task.

Add Education webpage

Add web page for education with top nav link, some description and three links: one to the slide deck, one to the video lecture and one to a "practical.zip". In that zip file, put a README for teachers, and a notebook for them that helps them handle students' export files (cf #107).

Provide breakdown of metrics by example tags on leaderboard

We should be able to provide scores for different example categories/tags on the per-round leaderboards. For example, QA round 1 is actually three datasets, D(BiDAF), D(BERT), and D(RoBERTa), collected with different models in the loop. We want to be able to provide a breakdown of scores on each, as well as the overall score (a sketch of the aggregation follows the list below):

  • test examples should have an additional attribute tags with a list of example-relevant tags e.g. tags: ['D(BiDAF)']
  • add a metadata_json TEXT field to the scores db table
  • update helpers.py validate_prediction() to also aggregate scores (for each round) based on the example tags and add this breakdown to score_obj
  • update score.py bulk_create() to log the breakdown to the database
  • update round leaderboards to be full-width and display these aggregate scores in separate columns, along with the round overall score
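A sketch of the per-tag aggregation step; plain accuracy is used here for illustration (QA would use F1/EM), and the example/prediction structure is assumed:

from collections import defaultdict

def score_breakdown_by_tag(examples, predictions):
    """examples: list of dicts with 'id', 'label' and a 'tags' list;
    predictions: dict mapping example id -> predicted label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        hit = int(predictions[ex["id"]] == ex["label"])
        for tag in (ex.get("tags") or []) + ["overall"]:  # e.g. ['D(BiDAF)']
            correct[tag] += hit
            total[tag] += 1
    return {tag: correct[tag] / total[tag] for tag in total}

The resulting dict is the kind of breakdown that could go into the new scores metadata_json field and be rendered as extra columns on the round leaderboard.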

Sort user leaderboard by model fooling

Right now we don't sort by vMER but by total number of examples in the user leaderboard on TaskPage. Sort by number of model-fooling examples instead.

Gentler intro to first-timers

Feedback from first-time users is that the platform is not easy to use. We tried some onboarding of the interface, but that didn't really do the trick. We should improve task-specific instructions, and when people get started, tell them in the CreateInterface that they can view the instructions via a link that opens up the details.

We should then encourage all task owners to update their instructions to make them more specific.

Feature request: Task owner context import

It would be super nice if task owners could import contexts (and tags) via the web interface. I would like to see a proposal for what that would look like and then make it a priority to add it, because having to do this myself adds a lot of overhead (the alternative is giving everyone DB access, which is also not desirable). This will be needed for our "anyone can add a task" timeline anyway.

Tokens table for JWT

We need to allow multiple JWT tokens to be active at the same time and/or allow people to be logged in on multiple windows at the same time. We'll probably need a tokens table for this, but it would be good to work this out into a detailed proposal of changes before we start moving.
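As a starting point for that proposal, a minimal sketch of a tokens table, again in a SQLAlchemy style with illustrative column names:

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Token(Base):
    # One row per active token, so a user can stay logged in
    # from several windows/devices at the same time.
    __tablename__ = "tokens"
    id = Column(Integer, primary_key=True)
    uid = Column(Integer, ForeignKey("users.id"), nullable=False)
    token = Column(String(255), unique=True, nullable=False)
    created = Column(DateTime, nullable=False)
    expires = Column(DateTime, nullable=False)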

Fix validation backend for Turkers

  • In the create method of the validation model, make it so that if uid == 'turk', don't pass it.

  • Add a database migration that alters the table to make uid nullable for validations.

  • Store the annotator_id in the metadata_json for turkers

  • In PUT validations/[eid], make sure people don't validate their own examples, by checking the annotator_id in the metadata_json field of the example and of the validation (see the sketch after this list)

  • Divyansh can do a filter within the ApiService getRandomExample function if we're in turk mode, so that it makes sure we re-sample if we retrieved an example that we generated.
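A sketch of that self-validation check; the annotator_id key is the one mentioned above, and the controller wiring in the comment is illustrative:

import json

def is_own_example(example_metadata_json, validation_metadata_json):
    """Both arguments are raw metadata_json strings as stored in the DB."""
    ex_md = json.loads(example_metadata_json or "{}")
    val_md = json.loads(validation_metadata_json or "{}")
    annotator = val_md.get("annotator_id")
    return annotator is not None and annotator == ex_md.get("annotator_id")

# In the PUT validations/[eid] handler, something along the lines of:
# if is_own_example(example.metadata_json, incoming_metadata_json):
#     bottle.abort(403, "Cannot validate your own example")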

cc @dkaushik96

Issues with JSON export

There are a few issues with JSON export:

  1. Some content isn't output as UTF-8, which creates issues when reading the JSON file in Python. This can be reproduced by exporting data for tid 10.
  2. Exporting data from dynabench currently throws an error as well (two screenshots of the error, from 2020-11-15, are attached to the original issue).
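For the first issue, the fix is probably to be explicit about the encoding on both the write and read side; a minimal sketch:

import json

# When writing the export, keep non-ASCII content as real UTF-8 characters
# instead of relying on the platform default encoding.
data = [{"text": "naïve café example"}]
with open("export.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# When reading it back in Python:
with open("export.json", "r", encoding="utf-8") as f:
    data = json.load(f)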

Update hate speech datasets on prod

  • Update the test files (on prod1, prod2 and dev)
  • Double check the code (you can upload predictions and they're scored using the right files)
  • You can add a special tag to examples to capture R2a/R2b etc. (take a look at QA)
  • Upload some model predictions for the best models in the paper to get some models up on the leaderboard

cc @bvidgen

DevOps improvements on prod

  1. Run the cron job
  2. Logging in prod doesn't show the correct URLs; can we fix that?
  3. Before we start redeployment from bootstrap.sh, temporarily replace with something that says "we will be right back"
  4. Remove python 2.7 and default to python3
  5. We are not handling frontend errors in prod (rather, we show a white page; see https://stackoverflow.com/questions/49925345/why-does-an-error-result-in-a-blank-screen-instead-of-a-message). We should fix this with error boundaries.
  6. Automatic periodical database backups via a cron job (daily?)

Inspiration button

In the create interface, it would be fun to have an "Inspiration" button (with a lightbulb as an icon?). When you click on that button, you'll be shown model-fooling examples for the current task, to give you inspiration.

Add task owner admin interface and task-specific settings

  • Add a task owner gear icon
  • Add a task owner admin modal
  • Move the Export button to that modal
  • Add a settings_json field to the tasks table. Start with two settings:
    -- The number of required validations (set this to 3 as default)
    -- Whether or not we validate non-model-fooling examples (task-dependent)

When get_example fetches examples to validate, it should serve all model-fooling examples first and, if the settings allow it, then move on to non-fooling examples.
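A sketch of how the settings could drive selection in get_example; the settings keys and helper name are illustrative:

import json
import random

DEFAULT_SETTINGS = {
    "required_validations": 3,
    "validate_non_fooling": False,
}

def pick_example_to_validate(task_settings_json, fooling_pool, non_fooling_pool):
    """fooling_pool / non_fooling_pool: candidate examples that still need
    validations (already filtered by required_validations)."""
    settings = {**DEFAULT_SETTINGS, **json.loads(task_settings_json or "{}")}
    if fooling_pool:
        return random.choice(fooling_pool)
    if settings["validate_non_fooling"] and non_fooling_pool:
        return random.choice(non_fooling_pool)
    return None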

Clean up create and validate interface

We should create the right abstractions to turn the create and validate interfaces into reusable components, so that we can easily 1) use them in MTurk settings as well; 2) customize them as desired; and 3) share components between creation and validation.

UI issue with overlays on user profile metrics

I think it's very unintuitive from a UI/UX perspective to have an overlay trigger for an entire div without user feedback.

Please make the overlay trigger only wrap around the right hand side on the user profile metrics/error rates (both in your own profile and other users' profiles). Give the user feedback by showing a cursor:pointer when they hover over the metric.

Turn validation actions into radio buttons

  • In the white container area, have three radio buttons: 1. correct 2. incorrect 3. flag.

  • If you select 2, show dropdown with correct label
    -- for QA, tell them "select correct answer; if it's not there flag it"
    -- for HS/Sent, if they select this, tell them that the correct label is X according to them

  • If you select 3, show text input for explaining flag

  • In the gray area, have two buttons: Submit (make this blue, btn-primary) and the current Skip and load new button

Validations are not done in balanced way

Bug report from Bertie: it seems that some examples have 0 validations and others have 5. The annotators have recently been in validation mode, rather than creation mode, so that feels off.

Fix NLI labels

Entailing/entailment/entailed? What is the right way to describe the NLI labels?
