Ongoing, come back in a few days!
STITCH (Small Tweaks Impacting Task Completion Handling) is a small dataset and mini-framework to run experiments on how seemingly innocuous prompt changes can impact LLM reading comprehension. The aim is to provide some numbers to guide our understanding of prompt engineering, clarify how models use context, and help make Q/A prompts less alchemical in nature!
STITCH uses a .yaml file to define which models, prompt formats and datasets you wish to evaluate. You may check `default_run.yaml` for an example YAML file, or stay tuned until this section is updated.
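
As a rough illustration of what such a config implies, here is a minimal sketch that expands a list of models, prompt formats and datasets into individual runs. The key names (`models`, `prompt_formats`, `datasets`), the prompt-format labels and the model identifiers are hypothetical and are not taken from `default_run.yaml`. Running every combination is also an assumption about how experiments are scheduled.

```python
from itertools import product

# Hypothetical config, written as a plain dict for illustration.
# Key names, model identifiers and prompt-format labels are assumptions,
# not the actual schema of default_run.yaml.
config = {
    "models": ["gpt-4", "claude-3-opus"],
    "prompt_formats": ["question_first", "context_first"],
    "datasets": ["bsard", "biomrc", "proxima"],
}


def expand_runs(cfg):
    """Return one (model, prompt_format, dataset) tuple per combination."""
    return list(product(cfg["models"], cfg["prompt_formats"], cfg["datasets"]))


runs = expand_runs(config)
for model, prompt_format, dataset in runs:
    print(f"{model} | {prompt_format} | {dataset}")
```

With two models, two prompt formats and three datasets, this yields 12 runs, which is why even a small config can produce a sizeable experiment grid.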
STITCH is composed of three main subsets and two "control" subsets, which are made up of questions that so-called frontier LLMs, such as GPT-4 and Claude 3 Opus, can answer without the need for context. The three main subsets come from various domains: `bsard` concerns Belgian law (in French), `biomrc` covers biomedical research papers, and `proxima` consists of auto-generated academic reports on the future colonisation of Proxima Centauri b and its technical and social implications.
Each of these subsets is composed of 54 questions, and they all aim to evaluate slightly different things:
- `bsard` requires both reasoning on non-English (French) documents and combining multiple relevant documents to answer a question.
- `biomrc` requires reasoning on a single mid-sized relevant passage to answer a tricky question about experimental settings.
- `proxima` requires reading a long document to find the answer in the relevant section.

Name | Lang | Relevant Document Type | Num entries | Known to frontier LLMs | Domain | Source |
---|---|---|---|---|---|---|
bsard | 🇫🇷 | multiple short relevant docs | 54 | ❌ | Legal | maastrichtlawtech/bsard |
bsard_control | 🇫🇷 | multiple short relevant docs | 54 | ✅ | Legal | maastrichtlawtech/bsard |
biomrc | 🇬🇧/🇺🇸 | single short relevant doc | 54 | ❌ | Biomedical | biomrc |
biomrc_control | 🇬🇧/🇺🇸 | single short relevant doc | 54 | ✅ | Biomedical | biomrc |
proxima | 🇬🇧/🇺🇸 | long document | 54 | ❌ | Sci-fi | Synthetic |
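
For scripting against these subsets, the table can be mirrored in a small data structure. The sketch below is only an illustration: the `Subset` class and its field names are hypothetical and not part of STITCH, and it assumes that control subsets can be recognised by the `_control` suffix used in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Hypothetical record mirroring one row of the subset table."""
    name: str
    lang: str
    doc_type: str
    num_entries: int
    domain: str

    @property
    def is_control(self) -> bool:
        # Assumption: control subsets are identified by their name suffix.
        return self.name.endswith("_control")


SUBSETS = [
    Subset("bsard", "fr", "multiple short relevant docs", 54, "Legal"),
    Subset("bsard_control", "fr", "multiple short relevant docs", 54, "Legal"),
    Subset("biomrc", "en", "single short relevant doc", 54, "Biomedical"),
    Subset("biomrc_control", "en", "single short relevant doc", 54, "Biomedical"),
    Subset("proxima", "en", "long document", 54, "Sci-fi"),
]

main_names = [s.name for s in SUBSETS if not s.is_control]
control_names = [s.name for s in SUBSETS if s.is_control]
total_questions = sum(s.num_entries for s in SUBSETS)

print("main:", main_names)
print("control:", control_names)
print("total questions:", total_questions)
```

Separating main and control subsets this way makes it easy to compare scores on questions that need the provided context against those that frontier LLMs can already answer without it.
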
TBD