
factcheck-gpt's Introduction

Factcheck-GPT

Fact-checking the Output of Generative Large Language Models in both Annotation and Evaluation.

Code License: Apache 2.0

Table of Contents

  • Pipeline
  • Get Started
  • Dataset
  • Annotation Tool
  • FactBench
  • Baselines
  • Citation

Pipeline

This work defines a framework for document-level fact-checking, spanning decomposition, decontextualisation, and check-worthiness identification, through evidence retrieval and stance detection, to editing the response to fix hallucinated information.

[Figure: the Factcheck-GPT fact-checking pipeline]
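
Conceptually, these stages chain into a single document-level check. The toy sketch below only illustrates that flow; every function in it is a trivial stand-in rather than the repository's implementation (the real entry point is pipeline.check_document, shown under Get Started):

# Toy stand-ins for the pipeline stages described above -- NOT the repository's implementation.

def decompose(doc: str) -> list[str]:
    # Stand-in for decomposition: naive sentence split into candidate claims.
    return [s.strip() for s in doc.split(".") if s.strip()]

def is_checkworthy(claim: str) -> bool:
    # Stand-in for check-worthiness identification: skip obvious opinions.
    return not claim.lower().startswith(("i think", "in my opinion"))

def retrieve_evidence(claim: str) -> list[str]:
    # Stand-in for evidence retrieval (web search in the real pipeline).
    return []

def detect_stance(evidence: list[str], claim: str) -> str:
    # Stand-in for stance detection: support / partial support / refute / irrelevant.
    return "irrelevant" if not evidence else "support"

def check_document_toy(doc: str) -> list[tuple[str, str]]:
    # Chain the stages and report a stance for each check-worthy claim;
    # decontextualisation and response editing are omitted for brevity.
    results = []
    for claim in decompose(doc):
        if is_checkworthy(claim):
            results.append((claim, detect_stance(retrieve_evidence(claim), claim)))
    return results

print(check_document_toy("MBZUAI ranks 19 globally in AI. I think it is a great place."))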

Get Started

Steps to check the factuality of a document. See reproduce_tutorial.ipynb for details on the individual subtasks.

import sys
sys.path.append("./src")  # make the modules under src/ importable
from pipeline import check_document, check_documents

doc = "MBZUAI ranks 19 globally in its areas of specialization – AI in 2023."  # document to check
label, log = check_document(doc, model="gpt-3.5-turbo-0613")  # returns a factuality label and a log
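
To check several documents at once, check_documents (imported above) can presumably be called in a similar way; the snippet below is only a sketch, and its exact signature and return format are assumptions:

# Hypothetical batch usage; the argument names mirror check_document and are assumptions.
docs = [
    "MBZUAI was established in Abu Dhabi.",
    "The Eiffel Tower is located in Berlin.",
]
results = check_documents(docs, model="gpt-3.5-turbo-0613")  # assumed to yield one result per document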

Dataset

We construct a dataset, Factcheck-GPT, with human-annotated factuality labels for 94 ChatGPT responses.

Statistics

Statistics over the three question sources are shown below:

Title

Claim analysis (a loading sketch follows below):

  • Whether raters can determine the factuality of a claim based on the automatically collected evidence (Yes/No).

  • Whether the evidence supports the claim (CP: completely support, PS: partially support, RE: refute, IR: irrelevant).

  • Whether the claim needs to be corrected. NA (17) refers to 16 opinion claims + 1 non-claim.

[Table: claim-level analysis results]
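
As a quick sanity check on the released annotations, the benchmark file can be loaded with pandas. The file name factcheck-GPT-benchmark.jsonl and the response_factuality field appear in the issue discussion further down; treat the exact schema here as an assumption:

import pandas as pd

# Load the human-annotated benchmark (94 ChatGPT responses);
# file and field names may differ in the current release.
df = pd.read_json("factcheck-GPT-benchmark.jsonl", lines=True)

# Distribution of response-level factuality labels (True / False / NA).
print(df["response_factuality"].value_counts(dropna=False))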

Annotation Tool

We release the annotation tool in this repository.

FactBench

We further gather three other human-annotated datasets (FacTool-KB, FELM-WK, HaluEval) used to evaluate the effectiveness of automatic fact-checkers, resulting in FactBench with 4,835 examples.

Baselines

We define five subtasks and report results for Subtask 4 (verification) below. See the notebook reproduce_tutorial.ipynb for details.

Subtasks

  • Subtask 1. Sentence Check-worthiness: Given a sentence, identify whether it contains a factual statement (Yes/No).

  • Subtask 2. Claim Check-worthiness: Given a claim, classify it as factual, opinion, not a claim, or other.

  • Subtask 3. Stance Detection: Given an (evidence passage, claim) pair, judge the stance of the evidence towards the claim: whether it supports, partially supports, refutes, or is irrelevant to the claim.

  • Subtask 4. Claim Verification: Given a claim without gold evidence, determine whether it is factually true or false; if false, revise it (a minimal prompting sketch follows this list).

  • Subtask 5. Edit False Response: Given a list of true claims and the original response, edit the response to correct the factual errors while preserving the linguistic features and style of the original.
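
As an illustration of Subtask 4, the sketch below asks an OpenAI chat model to verify a single claim. It is a minimal sketch only: the prompt wording and output handling are assumptions, not the prompts used in the repository (see reproduce_tutorial.ipynb for the actual implementation).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_claim(claim: str, model: str = "gpt-3.5-turbo-0613") -> str:
    # Illustrative prompt -- not the repository's prompt.
    prompt = (
        "Determine whether the following claim is factually TRUE or FALSE. "
        "If it is FALSE, provide a corrected version.\n"
        f"Claim: {claim}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(verify_claim("MBZUAI ranks 19 globally in its areas of specialization - AI in 2023."))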

Verification Results

[Table: verification results]

Citation

Factcheck-GPT is described in the following arXiv paper:

@article{Wang2023FactcheckGPTEF,
  title={Factcheck-GPT: End-to-End Fine-Grained Document-Level 
         Fact-Checking and Correction of LLM Output},
  author={Yuxia Wang and 
          Revanth Gangi Reddy and 
          Zain Muhammad Mujahid and 
          Arnav Arora and 
          Aleksandr Rubashevskii and 
          Jiahui Geng and 
          Osama Mohammed Afzal and 
          Liangming Pan and 
          Nadav Borenstein and 
          Aditya Pillai and 
          Isabelle Augenstein and 
          Iryna Gurevych and 
          Preslav Nakov},
  journal={ArXiv},
  year={2023},
  volume={abs/2311.09000},
}


factcheck-gpt's Issues

Human-annotated atomic claims?

Hi, I have some trouble understanding the claims inside Factbench.jsonl and factcheck-gpt-benchmark.jsonl. Are the claims present the human-annotated ones? Same question for "decontext" and "revised-decontext". Are the latter revised in the sense that the factuality is corrected? Are the claims generated by ChatGPT also present somewhere?

Class Balance Presented in Paper Question

Hi! I would like to ask some questions to clarify your paper's results and to check whether you have made any updates since publishing it.

  1. Question:
    You said in the paper:

"How many examples are factually incorrect? 61 examples contain factual errors and 31 examples are factually correct. Specifically, 53 examples contain false claims, and 19 examples contain claims for which annotators could not obtain related evidence from the Internet to prove the correctness of the statement."

Opening factcheck-GPT-benchmark.jsonl (94 records), I can see these class counts:

response_factuality
False    68
True     24
NA        2
Name: count, dtype: int64

Opening subtask4_claim_factuality.jsonl (661 records), I can see these class counts:

 true                   472
 false                  159
 not_enough_evidence     30
 Name: count, dtype: int64

Could you tell me, did you update your benchmark?

  2. Question:
    In Table 5 (Verification results) you report that PRECISION for the constant-TRUE prediction is 0.81. That would mean 81% of the data is labeled TRUE, since a constant classifier's precision equals the positive-class fraction. How did you get this number?

[Screenshot of Table 5 from the paper]

Code for Evaluating Atomic Facts

Hello!
Is it possible to obtain the code you used to calculate the normalized edit distance, n-gram distance, and word overlap between the human-created atomic facts and the ones created by the model?
Thanks!
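
For readers looking for a starting point, below is a minimal sketch of how such similarity metrics are commonly computed (normalized edit distance and word overlap). It is not the code used in the paper, and the exact normalization choices there may differ.

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    # Edit distance divided by the longer string's length (one common normalization).
    return edit_distance(a, b) / max(len(a), len(b), 1)

def word_overlap(a: str, b: str) -> float:
    # Jaccard overlap of word sets (one common definition of word overlap).
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

print(normalized_edit_distance("MBZUAI ranks 19 globally", "MBZUAI ranks 19th globally"))
print(word_overlap("MBZUAI ranks 19 globally", "MBZUAI ranks 19th globally"))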
