stitch's Introduction

What is STITCH?

Work in progress, come back in a few days!

STITCH (Small Tweaks Impacting Task Completion Handling) is a small dataset and mini-framework for running experiments on how seemingly innocuous prompt changes affect LLM reading comprehension. The aim is to provide numbers that guide our understanding of prompt engineering, show how models use context, and help make Q/A prompts less alchemical in nature!

What are we evaluating exactly?

How to run STITCH benchmarking?

STITCH uses a .yaml file to define which models, prompt formats and datasets you wish to evaluate. See default_run.yaml for an example, or stay tuned until this section is updated.
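As a rough illustration of what such a run definition might expand to, the sketch below enumerates every model, prompt format and dataset combination from an in-memory configuration. The key names (`models`, `prompt_formats`, `datasets`), the placeholder model and format names, and the `expand_runs` helper are assumptions made for illustration only; they are not taken from default_run.yaml, which remains the reference.

```python
from itertools import product

# Hypothetical run definition: key names and values are illustrative
# assumptions, not the actual schema of default_run.yaml.
run_config = {
    "models": ["model-a", "model-b"],
    "prompt_formats": ["format-1", "format-2"],
    "datasets": ["bsard", "biomrc", "proxima"],
}


def expand_runs(config):
    """Return one dict per (model, prompt format, dataset) combination."""
    return [
        {"model": m, "prompt_format": p, "dataset": d}
        for m, p, d in product(
            config["models"], config["prompt_formats"], config["datasets"]
        )
    ]


runs = expand_runs(run_config)
print(len(runs))  # 2 models x 2 formats x 3 datasets = 12
```

Listing the full grid up front makes it easy to see how quickly the number of evaluations grows as models or prompt formats are added.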

The STITCH dataset

STITCH is composed of three main subsets and two "control" subsets. The control subsets contain questions that so-called frontier LLMs, such as GPT-4 and Claude 3 Opus, can answer without any context. The three main subsets come from different domains: bsard covers Belgian law (in French), biomrc covers biomedical research papers, and proxima consists of auto-generated academic reports on the future colonisation of Proxima Centauri b and its technical and social implications.

Each of these subsets is composed of 54 questions, and each aims to evaluate something slightly different:

  • bsard requires both reasoning on non-English (French) documents and combining multiple relevant documents to answer a question.
  • biomrc requires reasoning on a single mid-sized relevant passage to answer a tricky question about experimental settings.
  • proxima requires reading a long document to find the answer in the relevant section.

| Name | Lang | Relevant Document Type | Num entries | Known to frontier LLMs | Domain | Source |
| --- | --- | --- | --- | --- | --- | --- |
| bsard | 🇫🇷 | multiple short relevant docs | 54 | ❌ | Legal | maastrichtlawtech/bsard |
| bsard_control | 🇫🇷 | multiple short relevant docs | 54 | ✅ | Legal | maastrichtlawtech/bsard |
| biomrc | 🇬🇧/🇺🇸 | single short relevant doc | 54 | ❌ | Biomedical | biomrc |
| biomrc_control | 🇬🇧/🇺🇸 | single short relevant doc | 54 | ✅ | Biomedical | biomrc |
| proxima | 🇬🇧/🇺🇸 | long document | 54 | ❌ | Sci-fi | Synthetic |
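
For scripting against the subsets, the listing above can be mirrored as a small in-code registry. This is only a sketch: the `Subset` class and its field names are hypothetical and not part of STITCH, while the values are copied from the listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Hypothetical record describing one STITCH subset."""
    name: str
    lang: str
    num_entries: int
    known_to_frontier_llms: bool
    domain: str


# Values copied from the subset listing; the structure is an assumption.
SUBSETS = [
    Subset("bsard", "fr", 54, False, "Legal"),
    Subset("bsard_control", "fr", 54, True, "Legal"),
    Subset("biomrc", "en", 54, False, "Biomedical"),
    Subset("biomrc_control", "en", 54, True, "Biomedical"),
    Subset("proxima", "en", 54, False, "Sci-fi"),
]

# Split main subsets from controls using the "known to frontier LLMs" flag.
main = [s.name for s in SUBSETS if not s.known_to_frontier_llms]
control = [s.name for s in SUBSETS if s.known_to_frontier_llms]
total_questions = sum(s.num_entries for s in SUBSETS)

print(main, control, total_questions)
```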

Results

TBD
