lmsys-compare

This repository provides scripts for analyzing raw data from the LMSYS Chatbot Arena Leaderboard. A link to the raw data is usually available in the notebook linked from that page (currently https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH).

Currently there is one tool, lmsys-compare.py, which compares the skills of models using a logistic regression framework, akin to a 1D Item Response Theory (IRT) model. The model is written out more explicitly in the file irt1.stan.
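
For intuition, a minimal sketch of this kind of pairwise logistic (Bradley-Terry-style) likelihood might look as follows in Python/JAX. The actual specification is in irt1.stan; the names here (skill, winner_idx, loser_idx) are illustrative, not the script's API:

```python
# A minimal sketch of a pairwise logistic ("1D IRT") likelihood, assuming
# battles have been reduced to (winner, loser) index pairs and ignoring ties.
import jax.numpy as jnp
from jax.nn import log_sigmoid

def neg_log_likelihood(skill, winner_idx, loser_idx):
    # P(winner beats loser) = sigmoid(skill[winner] - skill[loser])
    diff = skill[winner_idx] - skill[loser_idx]
    return -jnp.sum(log_sigmoid(diff))
```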

The tool takes a JSON of raw data (clean_battle_*) as input and outputs a (Polars) data frame of relative model skills, including posterior uncertainty of the estimates. There are also various options, the most useful of which are probably the ability to specify a language with -l (like Swedish or English), and the possibility to restrict to code or non-code prompts only. Note that the latest data files from LMSYS carry numerous other covariates, including prompt lengths and types, that the script does not extract from the JSON, but which would be easy to extract and use to further refine the comparison to a specific subset of prompts (see the sketch after the option list). Options:

  -h, --help            show this help message and exit
  --nmodels NMODELS, -n NMODELS
                        Number of models with most data to include (default: all)
  --language_regex LANGUAGE_REGEX, -l LANGUAGE_REGEX
  --code                Restrict to code prompts only
  --no-code             Restrict to non-code prompts only
  --models MODELS, -m MODELS
                        Models to report (regex, default: '')
  --min_mse MIN_MSE, -e MIN_MSE
                        Only report models with MSE of skill below this value.
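
If you want to slice the data along such covariates yourself, something like the following Polars snippet is a reasonable starting point. The column names (language, is_code, model_a, model_b, winner) are assumptions about the clean_battle schema, not taken from the script; check df.columns for your file:

```python
# Hedged example: load the raw battles and keep English coding prompts only.
# Column names are assumptions about the clean_battle_* schema; verify them
# with df.columns before relying on this.
import polars as pl

df = pl.read_json("clean_battle_20240519_public.json")
subset = df.filter(
    (pl.col("language") == "English") & pl.col("is_code")
).select(["model_a", "model_b", "winner"])
print(subset.head())
```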

So calling, for example, python3 lmsys-compare.py clean_battle_20240519_public.json --code -m 'gpt-4o|opus|gpt-3.5|llama-3' should give you a table like the one below:

Input: clean_battle_20240519_public.json
shape: (8, 5)
┌────────────────────────┬───────┬─────┬───────────┬───────────┐
│ model                  ┆ count ┆ id  ┆ skill     ┆ MSE_skill │
│ ---                    ┆ ---   ┆ --- ┆ ---       ┆ ---       │
│ str                    ┆ u32   ┆ u32 ┆ f64       ┆ f64       │
╞════════════════════════╪═══════╪═════╪═══════════╪═══════════╡
│ gpt-4o-2024-05-13      ┆ 1627  ┆ 39  ┆ 2.072369  ┆ 0.061247  │
│ claude-3-opus-20240229 ┆ 8760  ┆ 10  ┆ 1.629636  ┆ 0.032876  │
│ llama-3-70b-instruct   ┆ 8769  ┆ 47  ┆ 1.093747  ┆ 0.032902  │
│ llama-3-8b-instruct    ┆ 5412  ┆ 48  ┆ 0.557968  ┆ 0.037353  │
│ gpt-3.5-turbo-0125     ┆ 3472  ┆ 30  ┆ 0.496733  ┆ 0.041498  │
│ gpt-3.5-turbo-0613     ┆ 1579  ┆ 32  ┆ 0.466514  ┆ 0.055872  │
│ gpt-3.5-turbo-1106     ┆ 721   ┆ 33  ┆ 0.278692  ┆ 0.081789  │
│ codellama-34b-instruct ┆ 281   ┆ 13  ┆ -0.421149 ┆ 0.130257  │
└────────────────────────┴───────┴─────┴───────────┴───────────┘

It shows that in coding, GPT-4o is now the best model (of those chosen for the comparison, and in fact of all models). Reported differences in skills are actually log-odds of winning the blind comparison, so for example GPT-4o has odds of roughly exp(2.07-1.63)≈1.55 (against 1.0) of beating Opus on a "random" coding prompt. (What users actually prompt at the Chatbot Arena is a good question.)
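
The arithmetic is just exponentiating the skill difference; with the unrounded values from the table:

```python
# Worked example: skill difference -> odds -> win probability,
# using the values reported in the table above.
import math

skill_gpt4o, skill_opus = 2.072369, 1.629636
odds = math.exp(skill_gpt4o - skill_opus)   # ≈ 1.557
p_win = odds / (1 + odds)                   # ≈ 0.609
print(f"odds {odds:.3f}, P(win) {p_win:.3f}")
```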

The IRT model is implemented in JAX, and the estimation errors are computed with a Laplace approximation. In quick experiments the errors were even slightly (5–10%) too conservative; in general you shouldn't take them to be absolutely accurate. Under the usual Gaussian approximation (which seems to hold quite well here), the 95% confidence interval for a skill is the usual ±1.96*MSE_skill.
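
A sketch of the Laplace step, assuming the neg_log_likelihood from the earlier snippet and a fitted mode skill_hat (the actual script may parameterize or regularize differently): the posterior covariance is approximated by the inverse Hessian at the mode, and the per-model errors are the square roots of its diagonal.

```python
# Laplace-approximation standard errors in JAX. Note that without pinning
# one skill (or otherwise fixing the additive constant) the Hessian of the
# plain pairwise likelihood is singular; this sketch assumes that has been
# handled, e.g. by fixing one model's skill to zero.
import jax
import jax.numpy as jnp

def laplace_se(skill_hat, winner_idx, loser_idx):
    H = jax.hessian(neg_log_likelihood)(skill_hat, winner_idx, loser_idx)
    cov = jnp.linalg.inv(H)          # posterior covariance at the mode
    return jnp.sqrt(jnp.diag(cov))   # per-model standard errors
```

The 95% interval for each model is then skill_hat ± 1.96 times the corresponding error, matching the formula above.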

Note that the skill values are unique only up to an additive constant, so they are meaningful only in comparison with each other.
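
The invariance is easy to check: shifting all skills by the same constant leaves every pairwise prediction unchanged.

```python
# Only skill differences matter: a constant shift changes no predicted odds.
import math

s_a, s_b, c = 2.07, 1.63, 100.0
assert math.isclose(math.exp(s_a - s_b), math.exp((s_a + c) - (s_b + c)))
```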
