HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

This repository contains the data and code for the paper "HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation."

📖 Usage

Downloading the Dataset

The HalluDial dataset can be downloaded from here. After downloading, please extract the contents to the data directory. The file structure should look like this:

HalluDial
├── data
│   ├── spontaneous
│   │   ├── spontaneous_train.json
│   │   └── ...
│   └── induced
│       ├── induced_train.json
│       └── ...
└── ...
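As a quick sanity check after extraction, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the list of file names is assumed from the tree above and the loading examples in this README.

```python
from pathlib import Path

# File names assumed from the tree above and the loading examples in this README.
EXPECTED_FILES = [
    "spontaneous/spontaneous_train.json",
    "spontaneous/spontaneous_test.json",
    "induced/induced_train.json",
    "induced/induced_test.json",
]


def missing_files(data_dir="data"):
    """Return the expected dataset files that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("data")` from the repository root returns an empty list when all four files are in place.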

Documentation and Examples

The HalluDial dataset covers two hallucination scenarios, spontaneous and induced, which correspond to two different data construction processes. Each scenario is split into a training set and a test set, and each split is provided as a JSON file named <scenario>_train.json or <scenario>_test.json (for example, spontaneous_train.json). The splits are sized as follows:

Spontaneous Hallucination Scenario

Split    # Samples
train    55071
test     36714
total    91785

Induced Hallucination Scenario

Split    # Samples
train    33042
test     22029
total    55071
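The split sizes listed above can double as an integrity check once the data is loaded. The sketch below is illustrative only: `EXPECTED_SIZES` and `check_sizes` are hypothetical names, and `dataset` is assumed to be any mapping from split name to a sized collection of samples.

```python
# Split sizes copied from the tables in this README.
EXPECTED_SIZES = {
    "spontaneous": {"train": 55071, "test": 36714},
    "induced": {"train": 33042, "test": 22029},
}


def check_sizes(dataset, scenario):
    """Return {split: (actual, expected)} for every split whose size differs."""
    mismatches = {}
    for split, expected in EXPECTED_SIZES[scenario].items():
        actual = len(dataset[split])
        if actual != expected:
            mismatches[split] = (actual, expected)
    return mismatches
```

An empty result means both splits of the scenario have the documented number of samples.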

Dataset Structure

Each JSON file contains a list of dialogues, where each dialogue is represented as a dictionary. Here is an example dialogue:

{
    "dialogue_id": 0,
    "knowledge": "Use by a wider audience only came in 1995 when restrictions on the use of the Internet to carry commercial traffic were lifted.",
    "dialogue_history": "[Human]: Can you imagine the world without internet access? [Assistant]: Yeah, but once the access to the internet was a rare thing. do you remember? [Human]: I do. What else can you tell me ?",
    "turn": 1,
    "response": "Oh, the internet was widely accessed at homes and businesses across the globe by the late 1980s, actually.",
    "target": "Yes. The hallucination here lies in the claim that the internet was \"widely accessed at homes and businesses across the globe by the late 1980s.\" In reality, broad use of the internet did not occur until 1995 when restrictions were lifted allowing commercial traffic."
}

where

  • dialogue_id: a unique identifier for the dialogue. Matching dialogue IDs indicate that the knowledge and the dialogue history are the same.
  • knowledge: the knowledge provided for the dialogue.
  • dialogue_history: the dialogue history.
  • turn: the turn in the dialogue.
  • response: the response generated by the model.
  • target: the hallucination evaluation result, including the results of hallucination detection, hallucination localization, and rationale provision.
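The fields above are plain strings, so a little parsing helps when working with individual turns or with the detection verdict. The following is a minimal sketch with hypothetical helper names. It assumes that speakers are tagged as `[Human]:` and `[Assistant]:` and that `target` opens with a "Yes" or "No" verdict, as in the example record; other samples may not follow that format exactly, in which case `detection_label` returns `None`.

```python
import re

# Speaker tags as they appear in the example dialogue_history string.
_TURN_PATTERN = re.compile(r"\[(Human|Assistant)\]:\s*")


def split_history(dialogue_history):
    """Split a dialogue_history string into (speaker, utterance) pairs."""
    parts = _TURN_PATTERN.split(dialogue_history)
    # With one capture group, re.split yields [prefix, speaker, text, speaker, text, ...].
    speakers, texts = parts[1::2], parts[2::2]
    return [(speaker, text.strip()) for speaker, text in zip(speakers, texts)]


def detection_label(target):
    """Return True for a leading 'Yes', False for a leading 'No', else None (assumed format)."""
    match = re.match(r"\s*(yes|no)\b", target, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

For the example record, `split_history` yields three turns (Human, Assistant, Human) and `detection_label` returns `True`.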

Loading the Dataset

The data can be loaded the same way as any other JSON file. For example, in Python:

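import json

def load_json(path):
    # Use a context manager so the file handle is closed after reading.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Each scenario maps a split name to the list of samples in that file.
spontaneous_dataset = {
    "train": load_json("data/spontaneous/spontaneous_train.json"),
    "test": load_json("data/spontaneous/spontaneous_test.json"),
}

induced_dataset = {
    "train": load_json("data/induced/induced_train.json"),
    "test": load_json("data/induced/induced_test.json"),
}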

However, it can be easier to work with the dataset using the HuggingFace Datasets library:

# pip install datasets
from datasets import load_dataset

spontaneous_dataset = load_dataset("FlagEval/HalluDial", "spontaneous")
induced_dataset = load_dataset("FlagEval/HalluDial", "induced")
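Since matching dialogue IDs mark samples that share the same knowledge and dialogue history, it can be convenient to group a loaded split by `dialogue_id`. This is a minimal sketch with a hypothetical helper name; it assumes only that each sample behaves like a dictionary with a `dialogue_id` key, which should hold for either loading route shown above.

```python
from collections import defaultdict


def group_by_dialogue(samples):
    """Map each dialogue_id to the list of samples that share it."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["dialogue_id"]].append(sample)
    return dict(groups)
```

For instance, `group_by_dialogue(spontaneous_dataset["train"])` gathers all responses that were evaluated against the same knowledge and dialogue history.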

Evaluating

We provide example scripts to conduct meta-evaluations of the hallucination evaluation ability of Llama-2 on the HalluDial dataset.

Installation

To install the required dependencies, run:

pip install -r requirements.txt

Hallucination Detection

To evaluate the hallucination detection performance, run:

sh example/eval_detect.sh
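For readers who want to score their own detector outputs, binary detection is commonly summarized with accuracy, precision, recall, and F1. The sketch below is a generic illustration and is not taken from `example/eval_detect.sh`, whose exact metrics and input format are not described here. The function name is hypothetical, and `gold` and `pred` are assumed to be equal-length lists of booleans where `True` means "hallucinated".

```python
def detection_metrics(gold, pred):
    """Accuracy, precision, recall and F1 with 'hallucinated' (True) as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)        # hallucination correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)    # faithful response wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)    # hallucination missed
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Undefined ratios (for example, precision when nothing is predicted positive) are reported as 0.0 here; that is a design choice of this sketch, not a convention stated by the benchmark.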

Hallucination Localization and Rationale Provision

To evaluate the hallucination localization and rationale provision performance, run:

sh example/eval_rationale.sh

📝 Citing

If you use the HalluDial dataset in your work, please consider citing our paper:

@article{luo2024halludial,
  title={HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation},
  author={Luo, Wen and Shen, Tianshu and Li, Wei and Peng, Guangyue and Xuan, Richeng and Wang, Houfeng and Yang, Xi},
  journal={arXiv e-prints},
  pages={arXiv--2406},
  year={2024}
}
