An In-depth Look at Gemini's Language Abilities

Repository for the paper "An In-depth Look at Gemini's Language Abilities" by CMU, Zeno, and BerriAI LiteLLM.

In this paper, we do an in-depth exploration of Google Gemini's language abilities, making two contributions:

  • We provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results.
  • We take a closer look at the results, identifying areas where one of the two model classes excels.

Results

We perform this analysis over 10 datasets testing a variety of language abilities, including reasoning, answering knowledge-based questions, solving math problems, translating between languages, generating code, and acting as instruction-following agents. From this analysis, we find that (as of this writing on December 18th, 2023):

  • Gemini's Pro model achieved accuracy comparable to, but slightly below, that of the current version of OpenAI's GPT 3.5 Turbo on all English tasks, while showing a superior ability to translate into other languages.
  • Gemini fails at mathematical reasoning with many digits, is sensitive to multiple-choice answer ordering, and shows other weaknesses.
  • Gemini demonstrates comparably high performance in areas such as generation into non-English languages, handling longer and more complex reasoning chains, and word sorting/rearrangement problems.

The overall results table can be found below:

| Task | Dataset | Gemini Pro | GPT 3.5 Turbo | GPT 4 Turbo | Mixtral |
|---|---|---|---|---|---|
| Knowledge-based QA | MMLU (5-shot) | 65.22 | 67.75 | 80.48 | 68.81 |
| | MMLU (CoT) | 62.09 | 70.07 | 78.95 | 59.57 |
| Reasoning | BIG-Bench-Hard | 67.53 | 71.02 | 83.90 | 60.76 |
| Mathematics | GSM8K | 76.42 | 78.01 | 92.72 | 71.65 |
| | SVAMP | 81.10 | 82.30 | 92.60 | 81.60 |
| | ASDIV | 85.31 | 89.07 | 92.75 | 83.16 |
| | MAWPS | 96.50 | 98.00 | 98.67 | 96.00 |
| Code Generation | HumanEval | 59.76 | 74.39 | 76.83 | 45.12 |
| | ODEX | 39.86 | 52.62 | 45.79 | 40.55 |
| Machine Translation | FLORES (5-shot) Unblocked | 56.14 | 55.78 | 57.15 | 44.27 |
| | FLORES (5-shot) All | 22.83 | 43.12 | 51.63 | 33.45 |
| Web Agents | WebArena | 7.12 | 8.87 | 14.90 | 1.39 |
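
As a quick check on the first finding, the Gemini Pro and GPT 3.5 Turbo columns of the table above can be compared row by row. The following is an illustrative sketch, not code from this repository: the scores are copied from the table, and the names `SCORES`, `gemini_minus_gpt35`, and `gemini_wins` are hypothetical.

```python
# Scores copied from the results table: (Gemini Pro, GPT 3.5 Turbo).
SCORES = {
    "MMLU (5-shot)": (65.22, 67.75),
    "MMLU (CoT)": (62.09, 70.07),
    "BIG-Bench-Hard": (67.53, 71.02),
    "GSM8K": (76.42, 78.01),
    "SVAMP": (81.10, 82.30),
    "ASDIV": (85.31, 89.07),
    "MAWPS": (96.50, 98.00),
    "HumanEval": (59.76, 74.39),
    "ODEX": (39.86, 52.62),
    "FLORES (5-shot) Unblocked": (56.14, 55.78),
    "FLORES (5-shot) All": (22.83, 43.12),
    "WebArena": (7.12, 8.87),
}


def gemini_minus_gpt35(scores):
    """Return {dataset: Gemini Pro score minus GPT 3.5 Turbo score}, rounded to 2 decimals."""
    return {name: round(gemini - gpt35, 2) for name, (gemini, gpt35) in scores.items()}


def gemini_wins(scores):
    """Return the datasets on which Gemini Pro scores strictly higher than GPT 3.5 Turbo."""
    return [name for name, gap in gemini_minus_gpt35(scores).items() if gap > 0]


if __name__ == "__main__":
    # Print one signed gap per dataset, in table order.
    for name, gap in gemini_minus_gpt35(SCORES).items():
        print(f"{name:28s} {gap:+.2f}")
```

With these numbers, the only row where Gemini Pro scores higher than GPT 3.5 Turbo is FLORES (5-shot) Unblocked.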

You can find more details on results from each task, and comprehensive analysis at each of the below links:

File Structure

  • /outputs/{dataset}/{model}: contains the outputs of the systems, separated by dataset and model
  • /benchmarking/{dataset}: contains the code for benchmarking, separated by dataset
  • /visualization: contains the code for visualization, possibly separated by task type
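
The layout above can be expressed as a small path helper. This is a minimal sketch that assumes only the directory pattern listed here; the function names and the example dataset and model names (`gsm8k`, `gemini-pro`) are hypothetical and may not match the actual directory names in the repository.

```python
from pathlib import Path


def output_dir(repo_root, dataset, model):
    """Build the path /outputs/{dataset}/{model} under the given repository root."""
    return Path(repo_root) / "outputs" / dataset / model


def benchmarking_dir(repo_root, dataset):
    """Build the path /benchmarking/{dataset} under the given repository root."""
    return Path(repo_root) / "benchmarking" / dataset


if __name__ == "__main__":
    # Hypothetical dataset and model names, used only to show the pattern.
    print(output_dir(".", "gsm8k", "gemini-pro").as_posix())
    print(benchmarking_dir(".", "gsm8k").as_posix())
```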

Setup

Create a .env file in the root of the repository with your Zeno API key:

ZENO_API_KEY=your_api_key

This is loaded by dotenv in the visualization files.
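
A minimal sketch of how a script might pick up this key, assuming the `python-dotenv` package is installed. The helper names `get_zeno_api_key` and `load_env_and_get_key` are hypothetical, and the visualization files in this repository may load the key differently.

```python
import os


def get_zeno_api_key(environ=os.environ):
    """Return ZENO_API_KEY from the given mapping, or raise a clear error if it is missing."""
    key = environ.get("ZENO_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ZENO_API_KEY is not set; add it to the .env file in the repository root."
        )
    return key


def load_env_and_get_key():
    """Load a .env file with python-dotenv, then read the key from the environment."""
    # Imported here so get_zeno_api_key stays usable without python-dotenv installed.
    from dotenv import load_dotenv

    # Searches for a .env file and loads its entries into the process environment.
    load_dotenv()
    return get_zeno_api_key()
```

Failing early with an explicit message avoids a less obvious authentication error later in the visualization code.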