
stanford-crfm / helm


Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).

Home Page: https://crfm.stanford.edu/helm

License: Apache License 2.0

Languages: Python 92.95%, JavaScript 2.42%, HTML 0.36%, CSS 0.08%, Shell 0.12%, TypeScript 4.06%


helm's Issues

use snake_case in Python

Currently, we are using camelCase in the backend to make life easier for the JavaScript frontend, but this makes the casing inconsistent in the backend. We should just do the conversion.

The API should be in snake_case. Most of this is in settings, so the JavaScript side can just pass everything through.
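
A minimal sketch of the conversion, assuming a small hypothetical helper applied at the API boundary (the names to_snake_case and convert_keys are illustrative, not existing code):

import re

def to_snake_case(name: str) -> str:
    """Convert a camelCase key to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def convert_keys(obj):
    """Recursively convert dictionary keys so the API speaks snake_case."""
    if isinstance(obj, dict):
        return {to_snake_case(k): convert_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_keys(v) for v in obj]
    return obj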

set up CI

For now, just run the unit tests.

Create a PyPI package

So users can just do pip install crfm-benchmarking and run benchmarking without downloading the code.
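
A minimal setup.py sketch, assuming the package name crfm-benchmarking from this issue and the existing src/ layout (the version and Python requirement are placeholders):

from setuptools import setup, find_packages

setup(
    name="crfm-benchmarking",            # name taken from this issue
    version="0.1.0",                     # placeholder
    package_dir={"": "src"},             # assumes the code lives under src/
    packages=find_packages(where="src"),
    python_requires=">=3.8",             # placeholder
)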

Support controlled randomness in requests

Right now, queries are deterministic because of caching.

Add a `random: <string>` field to the request so that we can get the best of both worlds. This string would have to be propagated down into the cache key, which is currently based on the raw request and which we can't change.

Also, we would ideally support this feature in a backward-compatible way so we don't blow away the existing cache. Not sure what the best way to do this is.
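
A rough sketch of one possible approach, assuming the cache key is derived from the raw request dict (the helper name and field handling are illustrative, not the actual implementation):

from typing import Optional

def cache_key(raw_request: dict, random: Optional[str] = None) -> dict:
    # Requests without the new field keep the old key, so existing
    # cache entries remain valid (backward compatible).
    if random is None:
        return dict(raw_request)
    # Otherwise mix the random string into the key so repeated queries
    # with different `random` values get distinct cache entries.
    return {**raw_request, "random": random}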

estimate the number of tokens accurately for the quota

We need to know the number of tokens in Service::make_request to keep track of the quotas accurately. We need to figure out how this is actually computed, since each service presumably uses a different tokenizer (Jurassic, for example, has larger tokens).
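
A hedged sketch of counting tokens with the GPT-2 tokenizer from Hugging Face transformers; each service would need its own tokenizer, so for services like Jurassic this is only an approximation:

from transformers import GPT2TokenizerFast

_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def estimate_num_tokens(prompt: str, max_tokens: int) -> int:
    """Rough upper bound on the tokens a request consumes:
    prompt tokens plus the requested completion length."""
    return len(_tokenizer.encode(prompt)) + max_tokens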

output more information when benchmarking

When benchmarking, print out

  • the Instances of a Scenario
  • the RequestStates that are produced by the Adapter
  • the Metrics and the resulting Stats.

Output:

  • Summary statistics to stdout using htrack and hlog (this is basically fine)
  • Human-readable output to a text file.
  • Serializable information to a JSON file.

Format for text file:

Scenario ${scenario.name}
${scenario.description}
Tags: ${scenario.tags}
${n} instances

------- Instance ${i}/${total}: ${instance.tags}
Input: <input>
Reference(tags): ${reference.text}
...
Reference(tags): ${reference.text}

Where to store these files:

  • Each RunSpec should be given a path (e.g., reasoning/mmlu/chemistry), which is a new field of RunSpec.
  • Output to scenario.{txt,json}, scenario_state.{txt,json}, metrics.{txt,json} in that path (a minimal sketch is below).
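
A minimal sketch of writing these files, assuming a helper that receives the run path plus already-rendered text and JSON-serializable objects (the function name is illustrative):

import json
import os

def write_run_outputs(run_path: str, name: str, text: str, obj) -> None:
    """Write the human-readable and JSON forms of one artifact
    (e.g. name='scenario') into the RunSpec's output path."""
    os.makedirs(run_path, exist_ok=True)
    with open(os.path.join(run_path, f"{name}.txt"), "w") as f:
        f.write(text)
    with open(os.path.join(run_path, f"{name}.json"), "w") as f:
        json.dump(obj, f, indent=2)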

make execution multithreaded

Create a thread pool with some parallelism in Executor, so that we can process the examples in parallel.

        # TODO: make a thread pool to process all of these in parallel up to a certain number
        def render_instance(instance: Instance, pred_output: Optional[str]) -> str:
            tags_str = ",".join(instance.tags)
            gold_output: Optional[str] = None
            if instance.first_correct_reference is not None:
                gold_output = instance.first_correct_reference.output

            if request_state.output_mapping is not None and pred_output is not None:
                pred_output = request_state.output_mapping.get(pred_output.strip())
            correct_str = "CORRECT" if gold_output == pred_output else "WRONG"
            return (
                f'[{tags_str}] "{instance.input[:100]}" => "{gold_output}", predicted "{pred_output}" [{correct_str}]'
            )
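
A minimal sketch with concurrent.futures, assuming a process_one callable that issues the request for a single RequestState and returns the updated state (the names and the default parallelism are illustrative):

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def process_all(request_states: List, process_one: Callable, parallelism: int = 4) -> List:
    """Process request states in parallel, up to `parallelism` at a time,
    preserving the order of the results."""
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        return list(executor.map(process_one, request_states))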

reproduce numbers on MMLU

It's technically implemented, but we need to make sure that we're reproducing the numbers from the paper. Make sure all the decoding parameters are the same and that the eval set is the same. There is also some truncation logic that we should take a look at.

Don't include all the dev packages in setup.py

def get_requirements(path: str):
    # TODO: don't include all the dev packages
    requirements = []
    with open(path) as f:
        for line in f:
            if not line.startswith('-r'):
                requirements.append(line.strip())
    return requirements

Verify that the package builds correctly.
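
One possible fix, sketched under the assumption that runtime and dev dependencies are split between requirements.txt and a hypothetical requirements-dev.txt, with only the former going into install_requires:

from typing import List

def get_runtime_requirements(path: str) -> List[str]:
    """Read only concrete package specifiers; '-r' include lines (which pull
    in dev requirements) and blank lines are skipped."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith('-r')]

# In setup(), something like:
#   install_requires=get_runtime_requirements("requirements.txt"),
#   extras_require={"dev": get_runtime_requirements("requirements-dev.txt")},  # hypothetical file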

Leverage a thread pool to process instances in Executor

        # TODO: make a thread pool to process all of these in parallel up to a certain number
        def render_instance(instance: Instance, pred_output: Optional[str]) -> str:
            tags_str = ",".join(instance.tags)
            gold_output: Optional[str] = None
            if instance.first_correct_reference is not None:
                gold_output = instance.first_correct_reference.output

            if request_state.output_mapping is not None and pred_output is not None:
                pred_output = request_state.output_mapping.get(pred_output.strip())
            correct_str = "CORRECT" if gold_output == pred_output else "WRONG"
            return (
                f'[{tags_str}] "{instance.input[:100]}" => "{gold_output}", predicted "{pred_output}" [{correct_str}]'
            )

create `RemoteService` to allow for programmatic access

Create a RemoteService class that has the same signature as Service (for the functions used by Server). This can be used client-side. Then demo.py should be:

auth = Authentication(api_key="crfm")
service = RemoteService("http://crfm-models.stanford.edu")

# Make a request
request = Request(prompt="Life is like a box of")
print(service.make_request(auth, request))

# Modify account
account = service.get_account(auth)
account.description = "Updated"
service.update_account(auth, account)

Under the hood, it will make all the necessary REST calls.
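
A rough sketch of RemoteService using the requests library; the endpoint paths and payload shapes below are assumptions, not the actual REST API:

import requests

class RemoteService:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def make_request(self, auth, request):
        # Endpoint path and payload format are illustrative.
        response = requests.post(
            f"{self.base_url}/api/request",
            json={"auth": vars(auth), "request": vars(request)},
        )
        response.raise_for_status()
        return response.json()

    def get_account(self, auth):
        response = requests.get(f"{self.base_url}/api/account", params={"auth": auth.api_key})
        response.raise_for_status()
        return response.json()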

support adding users / editing quotas

Create the following functions (with corresponding REST endpoints) for managing accounts:

  • Service.create_account(auth): creates a new account with a random API key and returns that Account
  • Service.get_accounts(auth): returns a list of Account
  • Service.update_account(auth, account): updates the account given by account.api_key with the fields in account.

Note that these methods should look at auth and the corresponding is_admin flag and make sure that the following permissions are enforced:

  • If is_admin = True, then anything can be changed EXCEPT the used field in any Usage object (because this field is constantly changing, we don't want to overwrite it).
  • Otherwise, users can only view and edit their own account (EXCEPT api_key and usages).
    Any disallowed changes are just discarded silently. This allows the client to have a simple, uniform interface for updating an Account (see the sketch below).
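
A hedged sketch of the permission filtering in update_account, assuming Account is a dataclass; treating the whole usages field as protected is a simplification of "only the used field":

from dataclasses import replace

PROTECTED_FIELDS = {"usages"}           # the system itself maintains Usage.used
NON_ADMIN_READONLY_FIELDS = {"api_key"}

def apply_account_update(requester, current, update):
    """Return `current` with only the changes `requester` is allowed to make;
    disallowed changes are silently discarded."""
    if not requester.is_admin and requester.api_key != current.api_key:
        return current  # non-admins can only touch their own account
    allowed = {}
    for field_name, value in vars(update).items():
        if field_name in PROTECTED_FIELDS:
            continue
        if field_name in NON_ADMIN_READONLY_FIELDS and not requester.is_admin:
            continue
        allowed[field_name] = value
    return replace(current, **allowed)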

support any Hugging Face model

We want to be able to run queries against GPT-2 and GPT-J.

Two options:

  1. Write a separate server that allows us to spin up Hugging Face servers locally.
  2. Try to use Hugging Face servers directly (https://api-inference.huggingface.co/docs/python/html/quicktour.html#api-options-and-parameters)

Let's start with option 2 since it's simpler (we might want to do option 1 later). Either way, we need to write a subclass of Client to support whichever server is hosting the HF model.

Note: the existing HuggingFaceClient implementation is broken; it tries to load the model directly, which is not a good idea.
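
A minimal sketch of option 2 against the hosted Inference API (the request shape follows the linked docs; the Client-subclass interface here is only assumed):

import requests

class HuggingFaceInferenceClient:
    """Sketch of a Client subclass that calls the hosted Inference API
    instead of loading the model in-process."""

    API_URL = "https://api-inference.huggingface.co/models/{model}"

    def __init__(self, api_token: str):
        self.headers = {"Authorization": f"Bearer {api_token}"}

    def make_request(self, model: str, prompt: str, max_tokens: int):
        payload = {
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_tokens},  # parameter name per the text-generation docs
        }
        response = requests.post(self.API_URL.format(model=model), headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()

The model argument would be the Hub id, e.g. "gpt2" or "EleutherAI/gpt-j-6B" for the models mentioned above.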

Add pytest flag for slow tests in test_service.py

How to: https://www.py4u.net/discuss/204728

# TODO: put a flag on this so that it's easy to use pytest to still run these slow tests
@pytest.mark.skip(reason="Requires production")
def test_prod_continue():
    # Test that we're continuing
    prompt = "Paris is the capital of"
    for model in prod_models:
        request = Request(prompt=prompt, model=model, max_tokens=1, num_completions=1, temperature=0)
        helper_prod_test_service(request, " France")


@pytest.mark.skip(reason="Requires production")
def test_prod_echo():
    # If we're echoing the prompt, make sure we're getting the same thing back
    prompt = "I like pickles."
    for model in prod_models:
        request = Request(prompt=prompt, model=model, max_tokens=0, num_completions=1, echo_prompt=True)
        helper_prod_test_service(request, prompt)
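
A sketch following the standard pytest recipe (a --runslow command-line option in conftest.py plus a slow marker), as an alternative to the unconditional skips above:

# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False, help="run slow tests that hit production")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)

The production tests would then be decorated with @pytest.mark.slow instead of @pytest.mark.skip and run via pytest --runslow.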

Renew Perspective API key by 7/30/2022

The current API key we are using in production was created with the hai-gcp-models account and allows 200 queries per second:

"Thanks for sharing how you're using Perspective and contacting us about a quota increase. We reviewed your request and are happy to grant your project 200 queries per second until 7/30/2022."

When renewing, fill out the form with your Stanford email address. For GCP ID, put down hai-gcp-models.

presentation framework v1

Executing benchmark-run on the cluster machines should write all the scenarios, scenario states, and metrics to JSON files for each run specification. Display results as a text file for now.

off-by-one error in codex

In the web interface, try the third example query (or anything with Codex). The last token of the prompt seems to be repeated as the first token of the completion.

The first log probability and top choices are None

The first log probability and top choices are None, so we skip them. See openai_client.py:

            # The API can return None for the first token's logprob and top_logprobs;
            # fall back to 0 and an empty dict in that case.
            for text, logprob, top_logprobs in zip(
                raw_data["tokens"], raw_data["token_logprobs"], raw_data["top_logprobs"]
            ):
                tokens.append(Token(text=text, logprob=logprob or 0, top_logprobs=dict(top_logprobs or {})))

Sort the predicted outputs

In basic_metrics.py:

        # TODO: Sort the predictions, or take them from the top tokens of the first completion
        preds = [completion.text.strip() for completion in request_state.result.completions]
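
A hedged sketch of the first option, assuming each completion exposes its tokens with per-token logprobs (as in the Token objects above), so completions can be ordered from most to least likely:

        # Sort completions by total log probability (highest first) before
        # taking their texts.
        sorted_completions = sorted(
            request_state.result.completions,
            key=lambda completion: sum(token.logprob for token in completion.tokens),
            reverse=True,
        )
        preds = [completion.text.strip() for completion in sorted_completions]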

Add Twitter AAE

Used in the Gopher paper for measuring perplexity.

January 25 update: The LM pipeline has been implemented in #79. TODO: Implement the data filtering process that extracts tweets of different demographics from the raw data.

Implement evaluate_references in basic_metrics.py

def evaluate_references(
    self, adapter_spec: AdapterSpec, reference_request_states: List[RequestState]
) -> List[Stat]:
    """
    Setup: for each reference, we have a model score (log probability) and whether it's correct.
    We define the following metrics:
    - correct_rank: if we sort references by their logprobs, what is the ranking of the first correct reference.
    """
    # TODO
    return []
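
A rough sketch of the correct_rank computation, assuming we can extract a (logprob, is_correct) pair for each reference from its RequestState (that extraction is not shown):

def correct_rank(scored_references):
    """`scored_references` is a list of (logprob, is_correct) pairs, one per
    reference. Returns the 1-based rank of the first correct reference after
    sorting by logprob (highest first), or None if no reference is correct."""
    ranked = sorted(scored_references, key=lambda pair: pair[0], reverse=True)
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            return rank
    return None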

Update User model and authentication

Make Authentication use an api_key instead of a username and password. The API keys will be randomly generated and assigned to people. The User isn't so much tied to an individual person as to a use case. Add a description field, and maybe make emails a list of emails. Also add group and is_admin.

Need to update the web interface too.
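
A hedged sketch of what the updated models might look like as dataclasses; the field names follow the description above, and the defaults are assumptions:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Authentication:
    api_key: str  # replaces username/password

@dataclass
class Account:
    api_key: str                                     # randomly generated and handed out
    description: str = ""                            # describes the use case, not a person
    emails: List[str] = field(default_factory=list)
    group: str = "default"                           # assumed default
    is_admin: bool = False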

organize files into directories

Currently, everything is in src.

  • Move the clients, users, and server code to a separate service directory for managing access to the APIs.
  • Split out models.py into different files. Put common files in the src directory.
  • Create a separate benchmark directory for scenarios, metrics, etc.
