

dynabench's People

Contributors

adinawilliams, anandrajaram21, ciroye, douwekiela, entilzha, ewadkins, fzchriha, gwenzek, ishita-0112, kokrui, ktirumalafb, maxbartolo, remg1997, tristanthrush, vontell, zpapakipos


dynabench's Issues

Global user leaderboard

Add a "Users" top nav link, which goes to a page that shows a table of all users, sorted by vMER (and maybe their badges?).

Tag-based filtering of contexts and examples

cc @easonnie

We need to be able to add tags to contexts (when importing them) and examples (when storing them). We then need to be able to filter by these tags, i.e. when getting a new context we should be able to specify the desired tag, and when getting a new example to validate we should be able to specify the desired tag. A rough schema sketch is included after the suggestion below.

Suggestion:

  1. Add tags table
  2. Add table linking tags to contexts and tags to examples
  3. Add filter endpoints to context and examples controllers
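A minimal sketch of what these additions could look like, assuming a SQLAlchemy-style model layer; the table, column, and model names below are illustrative, not the actual dynabench schema:

from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Context(Base):
    # stand-in for the existing contexts model, reduced to two columns
    __tablename__ = "contexts"
    id = Column(Integer, primary_key=True)
    context = Column(Text)

class Tag(Base):
    __tablename__ = "tags"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), unique=True, nullable=False)

class ContextTag(Base):
    # link table: one row per (context, tag) pair
    __tablename__ = "context_tags"
    id = Column(Integer, primary_key=True)
    cid = Column(Integer, ForeignKey("contexts.id"), nullable=False)
    tid = Column(Integer, ForeignKey("tags.id"), nullable=False)

class ExampleTag(Base):
    # link table: one row per (example, tag) pair; the examples table exists elsewhere
    __tablename__ = "example_tags"
    id = Column(Integer, primary_key=True)
    eid = Column(Integer, ForeignKey("examples.id"), nullable=False)
    tid = Column(Integer, ForeignKey("tags.id"), nullable=False)

def contexts_with_tag(session, tag_name):
    # the kind of query a filter endpoint in the contexts controller could run
    return (
        session.query(Context)
        .join(ContextTag, ContextTag.cid == Context.id)
        .join(Tag, Tag.id == ContextTag.tid)
        .filter(Tag.name == tag_name)
    )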

Add devops scripts to codebase

We're currently using a custom and hacky shell script written by contractors. Add this here so we can clean it up later.

The tag filtering does not seem to work right now.

I tried the following API request: https://api.dynabench.org/contexts/1/4?tags=whatever. The server still returns a context:

{
  "id": 76472,
  "r_realid": 12,
  "context": "How to start free running You need to start off slow and small. During your first few months it is recommended that you condition your body every day and practice small techniques repetitively to build muscle coordination and confidence. Performing some form of calisthenics & a bit of weight training goes a long way.",
  "metadata_json": null,
  "total_used": null,
  "last_used": null
}

I can also retrieve an example with https://api.dynabench.org/examples/1/4?tags=whatever.

However, I expected a 500 response, per bottle.abort(500, f"No contexts available ({round.id})").
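For reference, a minimal reproduction using the requests library (the URL and tag value are the ones from the report above):

import requests

# A tag that should not match any context; per the report, the API still
# returns a context instead of a 500 "No contexts available" error.
resp = requests.get(
    "https://api.dynabench.org/contexts/1/4",
    params={"tags": "whatever"},
)
print(resp.status_code)
print(resp.json())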

Make sure we store model predictions

People upload their model predictions now, which we score in api/controllers/model.py, but we do not actually store their uploaded labels. We should do that: add a field to the scores table, and in api/models/score.py's bulk_create, make sure the actual predictions are stored as well.
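A self-contained sketch of the idea; the row and field names here (e.g. raw_predictions) are hypothetical, not the actual scores schema:

import json

def build_score_rows(model_id, round_scores):
    """Hypothetical helper: turn per-round results into rows for the scores
    table, keeping the uploaded predictions instead of discarding them."""
    rows = []
    for rs in round_scores:
        rows.append({
            "mid": model_id,
            "r_realid": rs["round_id"],
            "perf": rs["accuracy"],
            # new field: the raw uploaded predictions, serialized to JSON
            "raw_predictions": json.dumps(rs["predictions"]),
        })
    return rows

rows = build_score_rows(7, [{"round_id": 1, "accuracy": 0.83, "predictions": ["pos", "neg"]}])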

Script to process Turker data

Write a process_data.py script that takes an export.json file as input and interfaces with the local mephisto to make sure workers are paid and get bonuses where applicable. A skeleton is sketched after the steps below.

Steps:

  1. Check that the example comes from mturk
  2. Map the example to the appropriate HIT
  3. Check if you have enough information to decide whether to approve/reject/bonus the HIT (e.g. are all examples verified)
  4. Approve/reject the HIT and give bonus if applicable
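A skeleton of what process_data.py could look like. The mephisto interaction is left as clearly labelled stubs, and the export field names are assumptions about the export format:

import argparse
import json

def find_hit_for_example(example):
    # stub: look up the HIT for this example in the local mephisto database
    raise NotImplementedError

def ready_to_decide(example):
    # stub: e.g. check that all validations for the example are in
    raise NotImplementedError

def approve_reject_bonus(hit, example):
    # stub: approve/reject the HIT via mephisto and pay a bonus if applicable
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("export_file")
    args = parser.parse_args()

    with open(args.export_file, "r", encoding="utf-8") as f:
        examples = json.load(f)

    for ex in examples:
        metadata = ex.get("metadata_json") or {}
        if isinstance(metadata, str):
            metadata = json.loads(metadata)
        # Step 1: only process examples that came from mturk
        if not metadata.get("annotator_id"):
            continue
        # Step 2: map the example to the appropriate HIT
        hit = find_hit_for_example(ex)
        # Step 3: skip if we cannot decide yet
        if not ready_to_decide(ex):
            continue
        # Step 4: approve/reject and bonus
        approve_reject_bonus(hit, ex)

if __name__ == "__main__":
    main()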

Update annotators/README.md

Update the README to reflect the latest changes and to make sure this is as easy for people as we can possibly make it.

Badges

We only have some basic badges working now; the rest should be implemented too.

We need a python script that runs as a cronjob periodically (every N minutes), that will check if any badge conditions are met, and assign badges/add notifications accordingly.
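A minimal sketch of such a checker; the badge names, conditions, and DB helpers are all hypothetical stand-ins:

from datetime import datetime

# Hypothetical registry mapping a badge name to a predicate over a user record.
BADGE_CONDITIONS = {
    "FIRST_FOOLING_EXAMPLE": lambda user: user["fooling_count"] >= 1,
    "HUNDRED_EXAMPLES": lambda user: user["total_count"] >= 100,
}

def check_badges(users, existing_badges, award_badge, notify):
    """users: list of user dicts; existing_badges: set of (uid, badge_name);
    award_badge / notify: callables that write to the DB (hypothetical)."""
    for user in users:
        for name, condition in BADGE_CONDITIONS.items():
            if (user["id"], name) in existing_badges:
                continue
            if condition(user):
                award_badge(user["id"], name, datetime.utcnow())
                notify(user["id"], f"You earned the {name} badge!")

# Run from cron, e.g. every 15 minutes:
# */15 * * * * python badge_cron.py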

Add filters for task owner validation interface

Right now the task owner validation interface only shows verified_flagged=True examples. Add different content filters as a dropdown underneath the validation interface that allows owners to say what data they want to view/override:

  1. Flagged once
  2. Flagged
  3. High disagreement
  4. All

Include anonymized annotator ids for validations

For every validation that we store in the validations table, also store a unique anon_id. This value is DIFFERENT from the example anon_id (e.g. add some string after the secret in the hashing function to make it unique).

Add a list of anon_val_ids to the example export, which are not mappable back to anon_ids; the anon_val_ids should be combinable with the validation labels (so either in the same order, or both are given as a list of tuples).
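A sketch of how the two id spaces could be kept separate; the hashing scheme and secret handling here are illustrative, not the actual dynabench implementation:

import hashlib

def make_anon_id(secret, uid, suffix=""):
    # Appending a suffix to the secret yields ids that are stable per user
    # within one id space but not mappable across spaces.
    return hashlib.sha256(f"{secret}{suffix}{uid}".encode("utf-8")).hexdigest()

example_anon_id = make_anon_id("SECRET", 42)
anon_val_id = make_anon_id("SECRET", 42, suffix="-validation")
assert example_anon_id != anon_val_id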

cc @bvidgen, @ZeerakW

Get validations for entries that don't fool the model

At the moment (at least for hate speech), entries are only validated if model_wrong == TRUE (i.e. the model has been 'tricked'). At present, we get up to 5 validations for each entry.

We would like to also have validations for cases where the model correctly classified the content. We see this as the most efficient use of annotator time:

  1. If validators #1 and #2 agree with the original labels of the entry, then it does not need any further validation.
  2. If either validator #1 or #2 disagrees with the original entry, then it needs all 5 validations.

Alternatively, you could set it so that if just validator #1 agrees with the original labels of the entry then there is no further validation, and/or cap it at a maximum of 3 validations. This would probably be a smarter use of people's time.
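The proposed stopping rule, sketched as a function (the thresholds follow the discussion above and are easy to adjust):

def needs_more_validations(validation_labels, original_label,
                           max_validations=5, early_agree=2):
    """Return True if the example should be sent to another validator.
    validation_labels: labels given by validators so far, in order."""
    if len(validation_labels) >= max_validations:
        return False
    first = validation_labels[:early_agree]
    # If the first `early_agree` validators all agree with the original
    # label, stop early; otherwise collect the full set of validations.
    if len(first) == early_agree and all(label == original_label for label in first):
        return False
    return True

The alternative above corresponds to early_agree=1 and/or max_validations=3.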

@ZeerakW

Change model prediction upload form to reflect model cards

We want to capture many more details about models than we do now. Look into model cards (https://arxiv.org/abs/1810.03993) for this. Store things like number of parameters, inference time, etc.

Updates with notes from our discussion:

  • Change "Description" to "Summary"
  • Add fields:
    -- # Parameters
    -- Language(s)
    -- License(s)
  • Model card template:
    -- Model Details
    -- Intended Use, Caveats and Recommendations
    -- Data (Train and Dev data)
    -- Additional Information
    -- Ethical and Fairness Considerations
  • On the model display page, show performance first
  • No need for inference time information just yet
  • Modal explaining how to find out a model's number of parameters

Examples subpage and export

Users should be able to export all of the examples they generated on the platform. On the user profile, add an "Examples" tab/subpage. On that subpage, have a table that lists all tasks and displays the user's stats on those tasks (which we can more easily compute on the fly once we have the validations table). For each task, also have an "Export" button that allows them to export their own data for that task (taking care not to accidentally expose fields they shouldn't have access to), in the same way a task owner can export the data for a given task.

Add Education webpage

Add web page for education with top nav link, some description and three links: one to the slide deck, one to the video lecture and one to a "practical.zip". In that zip file, put a README for teachers, and a notebook for them that helps them handle students' export files (cf #107).

Provide breakdown of metrics by example tags on leaderboard

We should be able to provide scores for different example categories/tags on the per-round leaderboards. For example, QA round 1 is actually three datasets, D(BiDAF), D(BERT), and D(RoBERTa), collected with different models in the loop. We want to be able to provide a breakdown of scores on each, as well as the overall score (a sketch of the aggregation follows the list below):

  • test examples should have an additional attribute tags with a list of example-relevant tags e.g. tags: ['D(BiDAF)']
  • add a metadata_json TEXT field to the scores db table
  • update helpers.py validate_prediction() to also aggregate scores (for each round) based on the example tags and add this breakdown to score_obj
  • update score.py bulk_create() to log the breakdown to the database
  • update round leaderboards to be full-width and display these aggregate scores in separate columns, along with the round overall score
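A sketch of the per-tag aggregation step; plain accuracy is used here for illustration (QA would use F1/EM), and the example/prediction structure is assumed:

from collections import defaultdict

def score_breakdown_by_tag(examples, predictions):
    """examples: list of dicts with 'id', 'label' and a 'tags' list;
    predictions: dict mapping example id -> predicted label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        hit = int(predictions[ex["id"]] == ex["label"])
        for tag in (ex.get("tags") or []) + ["overall"]:  # e.g. ['D(BiDAF)']
            correct[tag] += hit
            total[tag] += 1
    return {tag: correct[tag] / total[tag] for tag in total}

The resulting dict is the kind of breakdown that could go into the new scores metadata_json field and be rendered as extra columns on the round leaderboard.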

Sort user leaderboard by model fooling

Right now we don't sort by vMER but by total number of examples in the user leaderboard on TaskPage. Sort by number of model-fooling examples instead.

Gentler intro to first-timers

Feedback from first-time users is that the platform is not easy to use. We tried some onboarding of the interface, but that didn't really do the trick. We should improve task-specific instructions, and when people get started, tell them in the CreateInterface that they can view the instructions via a link that opens up the details.

We should then encourage all task owners to update their instructions to make them more specific.

Feature request: Task owner context import

It would be super nice if task owners could import contexts (and tags) via the web interface. I would like to see a proposal for what that would look like and then make it a priority to add it, because having to do this myself adds a lot of overhead (the alternative is giving everyone DB access, which is also not desirable). This will be needed for our "anyone can add a task" timeline anyway.

Tokens table for JWT

We need to allow multiple JWT tokens to be active at the same time and/or allow people to be logged in on multiple windows at the same time. We'll probably need a tokens table for this, but it would be good to work this out into a detailed proposal of changes before we start moving.
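As a starting point for that proposal, a minimal sketch of a tokens table, again in a SQLAlchemy style with illustrative column names:

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Token(Base):
    # One row per active token, so a user can stay logged in
    # from several windows/devices at the same time.
    __tablename__ = "tokens"
    id = Column(Integer, primary_key=True)
    uid = Column(Integer, ForeignKey("users.id"), nullable=False)
    token = Column(String(255), unique=True, nullable=False)
    created = Column(DateTime, nullable=False)
    expires = Column(DateTime, nullable=False)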

Fix validation backend for Turkers

  • In the create method of the validation model, make it so that if uid == 'turk', don't pass it.

  • Add a database migration that alters the table to make uid nullable for validations.

  • Store the annotator_id in the metadata_json for turkers

  • In PUT validations/[eid], make sure people don't validate their own examples, by checking the annotator_id in the metadata_json field of the example and of the validation (see the sketch after this list)

  • Divyansh can do a filter within the ApiService getRandomExample function if we're in turk mode, so that it makes sure we re-sample if we retrieved an example that we generated.
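A sketch of that self-validation check; the annotator_id key is the one mentioned above, and the controller wiring in the comment is illustrative:

import json

def is_own_example(example_metadata_json, validation_metadata_json):
    """Both arguments are raw metadata_json strings as stored in the DB."""
    ex_md = json.loads(example_metadata_json or "{}")
    val_md = json.loads(validation_metadata_json or "{}")
    annotator = val_md.get("annotator_id")
    return annotator is not None and annotator == ex_md.get("annotator_id")

# In the PUT validations/[eid] handler, something along the lines of:
# if is_own_example(example.metadata_json, incoming_metadata_json):
#     bottle.abort(403, "Cannot validate your own example")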

cc @dkaushik96

Issues with JSON export

There are a few issues with JSON export:

  1. Some content isn't output as UTF-8, which creates issues when reading the JSON file in Python. This can be reproduced by exporting data for tid 10.
  2. Exporting data from dynabench currently throws an error as well (two screenshots of the error, from 2020-11-15, are attached to the original issue).
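For the first issue, the fix is probably to be explicit about the encoding on both the write and read side; a minimal sketch:

import json

# When writing the export, keep non-ASCII content as real UTF-8 characters
# instead of relying on the platform default encoding.
data = [{"text": "naïve café example"}]
with open("export.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# When reading it back in Python:
with open("export.json", "r", encoding="utf-8") as f:
    data = json.load(f)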

Update hate speech datasets on prod

  • Update the test files (on prod1, prod2 and dev)
  • Double check the code (you can upload predictions and they're scored using the right files)
  • You can add a special tag to examples to capture R2a/R2b etc. (take a look at QA)
  • Upload some model predictions for the best models in the paper to get some models up on the leaderboard

cc @bvidgen

DevOps improvements on prod

  1. Run the cron job
  2. Logging in prod doesn't show the correct URLs; can we fix that?
  3. Before we start redeployment from bootstrap.sh, temporarily replace with something that says "we will be right back"
  4. Remove python 2.7 and default to python3
  5. We are not handling frontend errors in prod (rather, we show a white page; see https://stackoverflow.com/questions/49925345/why-does-an-error-result-in-a-blank-screen-instead-of-a-message). We should fix this with error boundaries.
  6. Automatic periodical database backups via a cron job (daily?)

Inspiration button

In the create interface, it would be fun to have an "Inspiration" button (with a lightbulb as an icon?). When you click on that button, you'll be shown model-fooling examples for the current task, to give you inspiration.

Add task owner admin interface and task-specific settings

  • Add a task owner gear icon
  • Add a task owner admin modal
  • Move the Export button to that modal
  • Add a settings_json field to the tasks table. Start with two settings:
    -- The number of required validations (set this to 3 as default)
    -- Whether or not we validate non-model-fooling examples (task-dependent)

When get_example fetches examples to validate, it should serve all model-fooling examples first and, if the settings allow it, then move on to non-fooling examples.
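A sketch of how the settings could drive selection in get_example; the settings keys and helper name are illustrative:

import json
import random

DEFAULT_SETTINGS = {
    "required_validations": 3,
    "validate_non_fooling": False,
}

def pick_example_to_validate(task_settings_json, fooling_pool, non_fooling_pool):
    """fooling_pool / non_fooling_pool: candidate examples that still need
    validations (already filtered by required_validations)."""
    settings = {**DEFAULT_SETTINGS, **json.loads(task_settings_json or "{}")}
    if fooling_pool:
        return random.choice(fooling_pool)
    if settings["validate_non_fooling"] and non_fooling_pool:
        return random.choice(non_fooling_pool)
    return None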

Clean up create and validate interface

We should create the right abstractions to turn the create and validate interfaces into reusable components, so that we can easily 1) use them in MTurk settings as well; 2) customize them as desired; and 3) share components between creation and validation.

UI issue with overlays on user profile metrics

I think it's very unintuitive from a UI/UX perspective to have an overlay trigger for an entire div without user feedback.

Please make the overlay trigger only wrap around the right hand side on the user profile metrics/error rates (both in your own profile and other users' profiles). Give the user feedback by showing a cursor:pointer when they hover over the metric.

Turn validation actions into radio buttons

  • In the white container area, have three radio buttons: 1. correct 2. incorrect 3. flag.

  • If you select 2, show dropdown with correct label
    -- for QA, tell them "select correct answer; if it's not there flag it"
    -- for HS/Sent, if they select this, tell them that the correct label is X according to them

  • If you select 3, show text input for explaining flag

  • In the gray area, have two buttons: Submit (make this blue, btn-primary) and the current Skip and load new button

Validations are not done in balanced way

Bug report from Bertie: it seems that some examples have 0 validations and others have 5. The annotators have recently been in validation mode, rather than creation mode, so that feels off.

Fix NLI labels

Entailing/entailment/entailed? What is the right way to describe the NLI labels?
