aflah02 / humans-v-s-llm-benchmarks

LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs), but they have limitations of their own. This interactive tool turns popular LLM benchmarks into a quiz game, offering a hands-on way to explore and understand them.

Python 57.98% Jupyter Notebook 42.02%

LLM Benchmark Quiz Game

Purpose

The primary goal of this tool is to give users a hands-on experience in which they test their own knowledge while gaining a deeper understanding of the challenges and limitations of LLM benchmarks. Playing the quiz helps users appreciate the nuances involved in evaluating LLMs and how well these models perform on diverse tasks.

Featured Benchmarks

The chosen benchmarks are those used to evaluate LLMs on the Open LLM Leaderboard. Here is a brief overview of each:

ARC: A set of grade-school science questions.
HellaSwag: A test of commonsense inference that is easy for humans (~95% accuracy) but challenging for state-of-the-art models.
MMLU: A multitask accuracy test covering 57 diverse tasks, including mathematics, US history, computer science, law, and more.
TruthfulQA: A test to measure a model's tendency to reproduce falsehoods commonly found online.
WinoGrande: An adversarial Winograd benchmark at scale, focusing on commonsense reasoning.
GSM8k: Diverse grade-school math word problems that assess a model's ability to carry out multi-step mathematical reasoning.

How to Use

Hosted Preview -

Go to https://play-with-llm-benchmarks.streamlit.app/ for the full experience.

Local Development -

Clone the repo, then run streamlit run Main.py to start the quiz game for the selected benchmarks. Answer questions drawn from these benchmarks and measure your own performance.
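To make "measure your own performance" concrete, here is a minimal sketch of how per-benchmark accuracy could be computed for a quiz session. It is not the code in Main.py: the QuizItem class, the accuracy_by_benchmark function, and the placeholder questions are all hypothetical names and data invented for illustration, and it assumes every question is multiple choice with a single correct option.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class QuizItem:
    """One multiple-choice question (hypothetical structure, not the repo's)."""
    benchmark: str            # e.g. "ARC" or "MMLU"
    question: str
    choices: Tuple[str, ...]
    answer_index: int         # index into `choices` of the correct option


def accuracy_by_benchmark(
    items: Sequence[QuizItem], responses: Sequence[int]
) -> Dict[str, float]:
    """Return the fraction of correct responses for each benchmark."""
    if len(items) != len(responses):
        raise ValueError("need exactly one response per item")
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item, chosen in zip(items, responses):
        total[item.benchmark] = total.get(item.benchmark, 0) + 1
        if chosen == item.answer_index:
            correct[item.benchmark] = correct.get(item.benchmark, 0) + 1
    return {name: correct.get(name, 0) / count for name, count in total.items()}


# Placeholder items, invented for this example (not taken from any benchmark).
items = [
    QuizItem("ARC", "placeholder question 1", ("A", "B"), 0),
    QuizItem("ARC", "placeholder question 2", ("A", "B"), 1),
    QuizItem("MMLU", "placeholder question 3", ("A", "B"), 1),
]
responses = [0, 0, 1]  # the option index the player picked for each item

print(accuracy_by_benchmark(items, responses))  # {'ARC': 0.5, 'MMLU': 1.0}
```

A per-benchmark score like this can then be set against published figures, such as the ~95% human accuracy quoted for HellaSwag above. Free-form benchmarks such as GSM8k would need a different comparison (for example, matching the final numeric answer) rather than an option index.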

Feel free to contribute, report issues, or suggest improvements to enhance the overall experience. Happy quizzing!
