
LLMs-as-a-judge: Exploring limitations and capabilities

Caution: this is an ongoing experiment. The README is updated continuously with new results. I'm actively looking for suggestions!

Evaluating LLMs in an open-ended scenario is difficult. There's a growing consensus that existing benchmarks are lacking, and seasoned practitioners prefer to vibe check models themselves. I've resorted to anecdotal evaluations from developers and researchers I trust, with Chatbot Arena being an excellent complement. The motivation behind this repo is the increasingly popular method of using strong LLMs as judges of other models. This method has been around for a few months, with models such as JudgeLM and, more recently, MT-Bench.

You may or may not have seen this thread. According to the authors of the tweet at Arize AI, using LLMs-as-a-judge warrants serious caution, specifically with regard to numeric score evaluations. It seems that LLMs are very poor at handling continuous ranges, which becomes glaringly obvious when prompting them to evaluate X on a scale from 1 to 10. This repo is a living document of experiments attempting to understand and capture the jagged frontier of this problem. Recent work has established a strong correlation between MT-Bench and human judgment (Arena Elo), meaning that LLMs are capable of being judges, so what's going on here?

Key Findings

TBD.

Full Experiment

Below are the full details and results.

Methodology

Due to cost constraints, I'll initially focus on the spelling/misspelling task described in the tweets. I'm slightly worried that the quantitative nature of this task will contaminate the insights of this experiment, but we'll see. I welcome a more full-fledged analysis of this phenomenon; my results should be taken with a grain of salt given the limited scope of the experiment.

Spelling Dataset

I've generated a spelling, or rather misspelling, dataset (not sure which name is more appropriate) from the essays of Paul Graham. This choice was mostly out of convenience, as I've used the dataset before when pressure testing context windows. I extract a context of 3,000 words from the essays and insert spelling errors into random words based on the desired misspelling ratio. In pseudocode:

INPUT: context, misspell_ratio

words = split context into words
misspell_count = number of words to misspell, based on misspell_ratio

FOR word IN sample(words, misspell_count)
    IF length(word) > 3
        remove a random character from word
    ELSE
        add a random character to word
END FOR

The complete code is readily available as a notebook.
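
For reference, here is a minimal Python sketch of the procedure above. The function name, whitespace tokenization, and seeding are my own assumptions; the notebook is the authoritative implementation.

import random

def misspell(context, misspell_ratio, seed=0):
    # Tokenize on whitespace and corrupt a random sample of words.
    rng = random.Random(seed)
    words = context.split()
    misspell_count = int(len(words) * misspell_ratio)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for i in rng.sample(range(len(words)), misspell_count):
        word = words[i]
        if len(word) > 3:
            # Long word: drop a random character.
            j = rng.randrange(len(word))
            words[i] = word[:j] + word[j + 1:]
        else:
            # Short word: insert a random character.
            j = rng.randrange(len(word) + 1)
            words[i] = word[:j] + rng.choice(alphabet) + word[j:]
    return " ".join(words)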

LLM Evaluator

Given the generated dataset, we prompt LLMs to evaluate the number of misspelled words in a context using different scoring templates. We're using the following APIs:

GPT-4: gpt-4-0125-preview

GPT-3.5: gpt-3.5-turbo-1106

at temperature = 0.
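
As a rough sketch of how the judge is invoked (the plumbing below is my own assumption, using the openai Python client, not the repo's exact code):

from openai import OpenAI

client = OpenAI()

def judge(prompt, model="gpt-4-0125-preview"):
    # Send a filled-in scoring template to the judge model, deterministically.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content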

Results

Test 1. Let's confirm that LLMs struggle to handle numeric ranges in a zero-shot setting. We prompt GPT-3.5 and GPT-4 with a numeric scoring template ranging from 0 to 10.
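
For illustration, the numeric template could read something like this (hypothetical wording, not the exact prompt used in the notebook):

You will be shown a text passage. Rate how misspelled it is on a scale
from 0 (no misspelled words) to 10 (every word is misspelled).
Respond with a single integer and nothing else.

[Passage]
{context}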

As expected, both misjudge severely.


Test 2. What happens if we reverse the scoring range? Now, a score of 10 represents a perfectly spelt document.

This doesn't seem to make much of a difference.


Test 3. If we were to believe the hypothesis from Arize, we may see improvements if we avoid a scoring rubric and instead use 'labeled grades'. In this case I decided to move down to a 5-point grading scale.
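
A labeled-grade template might read roughly as follows (the labels are illustrative, not the repo's exact wording):

Grade the spelling quality of the passage below with one of these labels:
"terrible", "poor", "fair", "good", "excellent".
Respond with the label only.

[Passage]
{context}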

Perhaps slight improvements? Difficult to say honestly. I'm not impressed.


Test 4. What about zero-shot Chain-of-Thought?
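
In practice this amounts to appending a think-out-loud instruction to the scoring prompt, roughly along these lines (my paraphrase):

First, reason step by step about how many of the words in the passage
appear misspelled. Then, on the final line, give your verdict in the
format "Score: <0-10>".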

gpt-3.5 devolved into gibberish for two of the prompts. As expected, gpt-4 sees improvement when prompted to think out loud. Notice how it gets very hesitant to assign a score of 10.


Test 5. As suggested by the author of Prometheus, mapping each score to its own explanation likely improves the LLM's ability to grade across the entire numeric range. This, combined with CoT, results in:

Continued improvements for gpt-4. It's still very reluctant to assign the boundary scores 0 and 10.
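
For reference, the kind of score-to-explanation mapping I mean looks roughly like this (compressed and illustrative, not the exact rubric):

Score 0: the passage contains no misspelled words.
Score 1: roughly 10% of the words are misspelled.
Score 2: roughly 20% of the words are misspelled.
(... one line per score ...)
Score 10: essentially every word in the passage is misspelled.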


Test 6. After reading more about MT-Bench I decided to test an alternate approach, using pairwise comparisons as opposed to isolated scoring. Normally, establishing a full ranking this way would require O(n log n) comparisons, but because we already know the order I figured we'd just test the hardest cases: comparing 0% misspelling vs 10% misspelling, 10% vs 20%, and so on, for a total of 10 comparisons. Note that I used zero-shot CoT here as well.
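
A pairwise prompt might look roughly like this (hypothetical wording):

You are given two passages, A and B. Decide which passage contains
fewer misspelled words. Think step by step, then answer with
"Verdict: A" or "Verdict: B" on the final line.

[Passage A]
{context_a}

[Passage B]
{context_b}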

My hypothesis was that GPT-4 would excel in a scenario where it gets to compare two texts inside its context window, but I was wrong. To my surprise, this really didn't improve things at all. Sure, these are the hardest of all possible comparisons, but all in all this is still a straightforward task. Maybe the quantitative aspects of this task are just inherently very difficult for LLMs. Hmm, perhaps I need to find a better proxy task...

Discussion (free-form, continuously updated)

MT Bench

(31/1) I've been going through the internals of MT-Bench and was very surprised to find that they simply ask GPT-4 to score outputs on a scale of 1-10. They do supply alternative grading options, such as pairwise comparison against a baseline, but the recommended option is the numeric one. The judgment prompt is also unexpectedly simple:

Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: [rating], for example: "Rating: 5".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]
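
The numeric verdict then presumably gets parsed back out of the judge's free-text explanation, along the lines of (my own sketch, not MT-Bench's actual parsing code):

import re

def parse_rating(judgment):
    # Pull the last "Rating: N" out of the judge's explanation.
    matches = re.findall(r"Rating:\s*(\d+)", judgment)
    return int(matches[-1]) if matches else None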

If one is to believe that this is all there is to judging in MT-Bench, then I'm beginning to question the use of the misspelling task as a proxy task...

Experiments

(2/2) I'm keen on making GPT-4 judge the misspelled texts through pairwise comparison as opposed to isolated scoring. This is one of the alternative judgment methods in MT-Bench (although they do recommend isolated scoring), and I suspect it is more suitable for this task. The CoT + full mapping results are definitely an improvement, but I still think there's work to be done. The drawback of pairwise scoring is, of course, that in practice you need significantly more API calls to establish a full ranking.
