
stanford-crfm / helm


Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).

Home Page: https://crfm.stanford.edu/helm

License: Apache License 2.0

Languages: Python 92.95%, JavaScript 2.42%, HTML 0.36%, CSS 0.08%, Shell 0.12%, TypeScript 4.06%


helm's Issues

use snake_case in Python

Currently, we are using camelCase in the backend to make life easier for the JavaScript frontend, but this makes the casing inconsistent in the backend. We should just do the conversion.

The API should be in snake_case. Most of this is in settings, so the JavaScript side can just pass everything through.
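
A minimal sketch of the conversion, assuming a small hypothetical helper applied at the API boundary (the names to_snake_case and convert_keys are illustrative, not existing code):

import re

def to_snake_case(name: str) -> str:
    """Convert a camelCase key to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def convert_keys(obj):
    """Recursively convert dictionary keys so the API speaks snake_case."""
    if isinstance(obj, dict):
        return {to_snake_case(k): convert_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_keys(v) for v in obj]
    return obj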

set up CI

For now, just run the unit tests.

Create a PyPI package

So users can just do pip install crfm-benchmarking and run benchmarking without downloading the code.
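
A minimal setup.py sketch, assuming the package name crfm-benchmarking from this issue and the existing src/ layout (the version and Python requirement are placeholders):

from setuptools import setup, find_packages

setup(
    name="crfm-benchmarking",            # name taken from this issue
    version="0.1.0",                     # placeholder
    package_dir={"": "src"},             # assumes the code lives under src/
    packages=find_packages(where="src"),
    python_requires=">=3.8",             # placeholder
)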

Support controlled randomness in requests

Right now, queries are deterministic because of caching.

Add a `random: <string>` field to the request so that we can get the best of both worlds. This string would have to be propagated down into the cache key, which is currently based on the raw request and which we can't change.

Also, we would ideally support this feature in a backward-compatible way so we don't blow away the existing cache. Not sure what the best way to do this is.
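
A rough sketch of one possible approach, assuming the cache key is derived from the raw request dict (the helper name and field handling are illustrative, not the actual implementation):

from typing import Optional

def cache_key(raw_request: dict, random: Optional[str] = None) -> dict:
    # Requests without the new field keep the old key, so existing
    # cache entries remain valid (backward compatible).
    if random is None:
        return dict(raw_request)
    # Otherwise mix the random string into the key so repeated queries
    # with different `random` values get distinct cache entries.
    return {**raw_request, "random": random}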

estimate the number of tokens accurately for the quota

We need to know the number of tokens in Service::make_request to keep track of the quotas accurately. We need to figure out how this is actually computed, since each service presumably uses a different tokenizer (Jurassic, for example, has larger tokens).
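
A hedged sketch of counting tokens with the GPT-2 tokenizer from Hugging Face transformers; each service would need its own tokenizer, so for services like Jurassic this is only an approximation:

from transformers import GPT2TokenizerFast

_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def estimate_num_tokens(prompt: str, max_tokens: int) -> int:
    """Rough upper bound on the tokens a request consumes:
    prompt tokens plus the requested completion length."""
    return len(_tokenizer.encode(prompt)) + max_tokens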

output more information when benchmarking

When benchmarking, print out

  • the Instances of a Scenario
  • the RequestStates that are produced by the Adapter
  • the Metrics and the resulting Stats.

Output:

  • Summary statistics to stdout using htrack and hlog (this is basically fine)
  • Human-readable output to a text file.
  • Serializable information to a JSON file.

Format for text file:

Scenario ${scenario.name}
${scenario.description}
Tags: ${scenario.tags}
${n} instances

------- Instance ${i}/${total}: ${instance.tags}
Input: <input>
Reference(tags): ${reference.text}
...
Reference(tags): ${reference.text}

Where to store these files:

  • Each RunSpec should be given a path (e.g., reasoning/mmlu/chemistry), which is a new field of RunSpec.
  • Output to scenario.{txt,json}, scenario_state.{txt,json}, metrics.{txt,json} in that path (a minimal sketch is below).
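
A minimal sketch of writing these files, assuming a helper that receives the run path plus already-rendered text and JSON-serializable objects (the function name is illustrative):

import json
import os

def write_run_outputs(run_path: str, name: str, text: str, obj) -> None:
    """Write the human-readable and JSON forms of one artifact
    (e.g. name='scenario') into the RunSpec's output path."""
    os.makedirs(run_path, exist_ok=True)
    with open(os.path.join(run_path, f"{name}.txt"), "w") as f:
        f.write(text)
    with open(os.path.join(run_path, f"{name}.json"), "w") as f:
        json.dump(obj, f, indent=2)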

make execution multithreaded

Create a thread pool with some parallelism in Executor, so that we can process the examples in parallel.

        # TODO: make a thread pool to process all of these in parallel up to a certain number
        def render_instance(instance: Instance, pred_output: Optional[str]) -> str:
            tags_str = ",".join(instance.tags)
            gold_output: Optional[str] = None
            if instance.first_correct_reference is not None:
                gold_output = instance.first_correct_reference.output

            if request_state.output_mapping is not None and pred_output is not None:
                pred_output = request_state.output_mapping.get(pred_output.strip())
            correct_str = "CORRECT" if gold_output == pred_output else "WRONG"
            return (
                f'[{tags_str}] "{instance.input[:100]}" => "{gold_output}", predicted "{pred_output}" [{correct_str}]'
            )
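
A minimal sketch with concurrent.futures, assuming a process_one callable that issues the request for a single RequestState and returns the updated state (the names and the default parallelism are illustrative):

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def process_all(request_states: List, process_one: Callable, parallelism: int = 4) -> List:
    """Process request states in parallel, up to `parallelism` at a time,
    preserving the order of the results."""
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        return list(executor.map(process_one, request_states))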

reproduce numbers on MMLU

It's technically implemented, but we need to make sure that we're reproducing the numbers from the paper. Make sure all the decoding parameters are the same and that the eval set is the same. There is also some truncation logic that we should take a look at.

Don't include all the dev packages in setup.py

def get_requirements(path: str):
    # TODO: don't include all the dev packages
    requirements = []
    with open(path) as f:
        for line in f:
            if not line.startswith('-r'):
                requirements.append(line.strip())
    return requirements

Verify that the package builds correctly.
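
One possible fix, sketched under the assumption that runtime and dev dependencies are split between requirements.txt and a hypothetical requirements-dev.txt, with only the former going into install_requires:

from typing import List

def get_runtime_requirements(path: str) -> List[str]:
    """Read only concrete package specifiers; '-r' include lines (which pull
    in dev requirements) and blank lines are skipped."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith('-r')]

# In setup(), something like:
#   install_requires=get_runtime_requirements("requirements.txt"),
#   extras_require={"dev": get_runtime_requirements("requirements-dev.txt")},  # hypothetical file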

Leverage a thread pool to process instances in Executor

        # TODO: make a thread pool to process all of these in parallel up to a certain number
        def render_instance(instance: Instance, pred_output: Optional[str]) -> str:
            tags_str = ",".join(instance.tags)
            gold_output: Optional[str] = None
            if instance.first_correct_reference is not None:
                gold_output = instance.first_correct_reference.output

            if request_state.output_mapping is not None and pred_output is not None:
                pred_output = request_state.output_mapping.get(pred_output.strip())
            correct_str = "CORRECT" if gold_output == pred_output else "WRONG"
            return (
                f'[{tags_str}] "{instance.input[:100]}" => "{gold_output}", predicted "{pred_output}" [{correct_str}]'
            )

create `RemoteService` to allow for programmatic access

Create a RemoteService class that has the same signature as Service (for the functions used by Server). This can be used client-side. Then demo.py should be:

auth = Authentication(api_key="crfm")
service = RemoteService("http://crfm-models.stanford.edu")

# Make a request
request = Request(prompt="Life is like a box of")
print(service.make_request(auth, request))

# Modify account
account = service.get_account(auth)
account.description = "Updated"
service.update_account(auth, account)

Under the hood, it will make all the necessary REST calls.
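
A rough sketch of RemoteService using the requests library; the endpoint paths and payload shapes below are assumptions, not the actual REST API:

import requests

class RemoteService:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def make_request(self, auth, request):
        # Endpoint path and payload format are illustrative.
        response = requests.post(
            f"{self.base_url}/api/request",
            json={"auth": vars(auth), "request": vars(request)},
        )
        response.raise_for_status()
        return response.json()

    def get_account(self, auth):
        response = requests.get(f"{self.base_url}/api/account", params={"auth": auth.api_key})
        response.raise_for_status()
        return response.json()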

support adding users / editing quotas

Create the following functions (with corresponding REST endpoints) for managing accounts:

  • Service.create_account(auth): creates a new account with a random API key and returns that Account
  • Service.get_accounts(auth): returns a list of Account
  • Service.update_account(auth, account): updates the account given by account.api_key with the fields in account.

Note that these methods should look at auth and the corresponding is_admin flag and make sure that the following permissions are enforced:

  • If is_admin = True, then anything can be changed EXCEPT the used field in any Usage object (because this field is constantly changing, we don't want to overwrite it).
  • Otherwise, users can only view and edit their own account (EXCEPT api_key and usages).
    Any disallowed changes are just discarded silently. This allows the client to have a simple, uniform interface for updating an Account (see the sketch below).
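
A hedged sketch of the permission filtering in update_account, assuming Account is a dataclass; treating the whole usages field as protected is a simplification of "only the used field":

from dataclasses import replace

PROTECTED_FIELDS = {"usages"}           # the system itself maintains Usage.used
NON_ADMIN_READONLY_FIELDS = {"api_key"}

def apply_account_update(requester, current, update):
    """Return `current` with only the changes `requester` is allowed to make;
    disallowed changes are silently discarded."""
    if not requester.is_admin and requester.api_key != current.api_key:
        return current  # non-admins can only touch their own account
    allowed = {}
    for field_name, value in vars(update).items():
        if field_name in PROTECTED_FIELDS:
            continue
        if field_name in NON_ADMIN_READONLY_FIELDS and not requester.is_admin:
            continue
        allowed[field_name] = value
    return replace(current, **allowed)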

support any Hugging Face model

We want to be able to run queries against GPT-2 and GPT-J.

Two options:

  1. Write a separate server that allows us to spin up Hugging Face servers locally.
  2. Try to use Hugging Face servers directly (https://api-inference.huggingface.co/docs/python/html/quicktour.html#api-options-and-parameters)

Let's start with option 2 since it's simpler (we might want to do option 1 later). Either way, we need to write a subclass of Client to support whichever server is hosting the HF model.

Note: the existing HuggingFaceClient implementation is broken; it tries to load the model directly, which is not a good idea.
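
A minimal sketch of option 2 against the hosted Inference API (the request shape follows the linked docs; the Client-subclass interface here is only assumed):

import requests

class HuggingFaceInferenceClient:
    """Sketch of a Client subclass that calls the hosted Inference API
    instead of loading the model in-process."""

    API_URL = "https://api-inference.huggingface.co/models/{model}"

    def __init__(self, api_token: str):
        self.headers = {"Authorization": f"Bearer {api_token}"}

    def make_request(self, model: str, prompt: str, max_tokens: int):
        payload = {
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_tokens},  # parameter name per the text-generation docs
        }
        response = requests.post(self.API_URL.format(model=model), headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()

The model argument would be the Hub id, e.g. "gpt2" or "EleutherAI/gpt-j-6B" for the models mentioned above.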

Add pytest flag for slow tests in test_service.py

How to: https://www.py4u.net/discuss/204728

# TODO: put a flag on this so that it's easy to use pytest to still run these slow tests
@pytest.mark.skip(reason="Requires production")
def test_prod_continue():
    # Test that we're continuing
    prompt = "Paris is the capital of"
    for model in prod_models:
        request = Request(prompt=prompt, model=model, max_tokens=1, num_completions=1, temperature=0)
        helper_prod_test_service(request, " France")


@pytest.mark.skip(reason="Requires production")
def test_prod_echo():
    # If we're echoing the prompt, make sure we're getting the same thing back
    prompt = "I like pickles."
    for model in prod_models:
        request = Request(prompt=prompt, model=model, max_tokens=0, num_completions=1, echo_prompt=True)
        helper_prod_test_service(request, prompt)
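
A sketch following the standard pytest recipe (a --runslow command-line option in conftest.py plus a slow marker), as an alternative to the unconditional skips above:

# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False, help="run slow tests that hit production")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)

The production tests would then be decorated with @pytest.mark.slow instead of @pytest.mark.skip and run via pytest --runslow.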

Renew Perspective API key by 7/30/2022

The current API key we are using in production was created with the hai-gcp-models account and allows 200 queries per second:

"Thanks for sharing how you're using Perspective and contacting us about a quota increase. We reviewed your request and are happy to grant your project 200 queries per second until 7/30/2022."

When renewing, fill out the form with your Stanford email address. For GCP ID, put down hai-gcp-models.

presentation framework v1

Executing benchmark-run on the cluster machines should write all the scenarios, scenario states, and metrics to JSON files for each run specification. Display results as a text file for now.

off-by-one error in codex

In the web interface, try the third example query (or anything with Codex). The last token of the prompt seems to be repeated as the first token of the completion.

The first log probability and top choices are None

The first log probability and top choices are None, so we skip them. See openai_client.py:

            # The API can return None for the first token's logprob and top_logprobs;
            # fall back to 0 and an empty dict in that case.
            for text, logprob, top_logprobs in zip(
                raw_data["tokens"], raw_data["token_logprobs"], raw_data["top_logprobs"]
            ):
                tokens.append(Token(text=text, logprob=logprob or 0, top_logprobs=dict(top_logprobs or {})))

Sort the predicted outputs

In basic_metrics.py:

        # TODO: Sort the predictions, or take them from the top tokens of the first completion
        preds = [completion.text.strip() for completion in request_state.result.completions]
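
A hedged sketch of the first option, assuming each completion exposes its tokens with per-token logprobs (as in the Token objects above), so completions can be ordered from most to least likely:

        # Sort completions by total log probability (highest first) before
        # taking their texts.
        sorted_completions = sorted(
            request_state.result.completions,
            key=lambda completion: sum(token.logprob for token in completion.tokens),
            reverse=True,
        )
        preds = [completion.text.strip() for completion in sorted_completions]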

Add Twitter AAE

Used in the Gopher paper for measuring perplexity.

January 25 update: The LM pipeline has been implemented in #79. TODO: Implement the data filtering process that extracts tweets of different demographics from the raw data.

Implement evaluate_references in basic_metrics.py

def evaluate_references(
    self, adapter_spec: AdapterSpec, reference_request_states: List[RequestState]
) -> List[Stat]:
    """
    Setup: for each reference, we have a model score (log probability) and whether it's correct.
    We define the following metrics:
    - correct_rank: if we sort references by their logprobs, what is the ranking of the first correct reference.
    """
    # TODO
    return []
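
A rough sketch of the correct_rank computation, assuming we can extract a (logprob, is_correct) pair for each reference from its RequestState (that extraction is not shown):

def correct_rank(scored_references):
    """`scored_references` is a list of (logprob, is_correct) pairs, one per
    reference. Returns the 1-based rank of the first correct reference after
    sorting by logprob (highest first), or None if no reference is correct."""
    ranked = sorted(scored_references, key=lambda pair: pair[0], reverse=True)
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            return rank
    return None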

Update User model and authentication

Make Authentication use an api_key instead of a username and password. The API keys will be randomly generated and assigned to people. The User isn't so much tied to an individual person as to a use case. Add a description field, and maybe make emails a list of emails. Also add group and is_admin.

Need to update the web interface too.
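
A hedged sketch of what the updated models might look like as dataclasses; the field names follow the description above, and the defaults are assumptions:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Authentication:
    api_key: str  # replaces username/password

@dataclass
class Account:
    api_key: str                                     # randomly generated and handed out
    description: str = ""                            # describes the use case, not a person
    emails: List[str] = field(default_factory=list)
    group: str = "default"                           # assumed default
    is_admin: bool = False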

organize files into directories

Currently, everything is in src.

  • Move the clients, users, and server code to a separate service directory for managing access to the APIs.
  • Split out models.py into different files. Put common files in the src directory.
  • Create a separate benchmark directory for scenarios, metrics, etc.
