
ErrorAnalysis Prompt for MT Evaluation in ChatGPT

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT. (Full report)

This repository releases the test sets and the scores produced by text-davinci-003 and ChatGPT, for replication of the study.

Abstract

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks such as machine translation, question answering, text summarization, and natural language understanding. Recent research (Kocmi and Federmann, 2023) has shown that utilizing ChatGPT for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conducted an investigation into several prompting methods. Our results indicate that by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2022), a new prompting method called Error Analysis Prompting, LLMs like ChatGPT can generate human-like MT evaluations at both the system and segment level. Additionally, we discovered some limitations of ChatGPT as an MT evaluator, such as unstable scoring and biases when provided with multiple translations in a single query. Our findings aim to provide a preliminary experience for appropriately evaluating translation quality on ChatGPT while offering a variety of tricks in designing prompts for in-context learning. We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both accuracy and reliability of metrics.

Data and Evaluations

For each language pair, we divide the segments from the WMT20 test set into four groups based on the number of tokens they contain (15-24, 25-34, 35-44, 45-54). We randomly sample 10 segments from each group, forming a new dataset of 40 segments. We use the Multidimensional Quality Metrics (MQM) annotations as the human evaluation. The test data and its corresponding evaluation scores can be found in "./data".
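For reference, the stratified sampling described above can be sketched as follows. The whitespace tokenization and the `segments` input are illustrative assumptions, not the exact scripts used to build the released data.

```python
import random

# Token-count buckets from the description above.
BUCKETS = [(15, 24), (25, 34), (35, 44), (45, 54)]
SAMPLES_PER_BUCKET = 10

def sample_segments(segments, seed=42):
    """Stratified sample: 10 segments per token-count bucket, 40 in total.

    `segments` is assumed to be a list of source strings for one language pair;
    whitespace splitting stands in for whatever tokenizer was actually used.
    """
    rng = random.Random(seed)
    sampled = []
    for low, high in BUCKETS:
        bucket = [s for s in segments if low <= len(s.split()) <= high]
        sampled.extend(rng.sample(bucket, SAMPLES_PER_BUCKET))
    return sampled
```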

The task statistics are shown as follows:

[Image: task statistics]

An overview of Error Analysis Prompting

An overview of our error analysis prompting. Detailed prompt contexts can be found in "./prompts".

[Image: overview of Error Analysis Prompting]
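To make the workflow above concrete, here is a minimal sketch of a two-step error-analysis query using the OpenAI Python SDK (v1+). The model name, prompt wording, and example sentences are simplified stand-ins; the actual prompt contexts used in the study are the ones in "./prompts".

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SOURCE = "Der Hund jagt die Katze im Garten."
TRANSLATION = "The dog chases the cat in the garden."

# Step 1: ask the model to identify errors as an itemized list.
messages = [{
    "role": "user",
    "content": (
        f"Source: {SOURCE}\nTranslation: {TRANSLATION}\n"
        "List the translation errors as bullet points, "
        "marking each one as major or minor."
    ),
}]
reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
errors = reply.choices[0].message.content

# Step 2: ask for a score in a separate turn, conditioned on the identified errors.
messages += [
    {"role": "assistant", "content": errors},
    {"role": "user", "content": "Based on these errors, count the major and minor "
                                "errors and give a final quality score."},
]
reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(reply.choices[0].message.content)
```

Keeping error identification and scoring in separate turns corresponds to the two-step instruction discussed in the findings below.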

Results and Findings

  1. 🙂 Our EA Prompting outperforms standard prompting at the segment level, achieving human-like evaluations at both the system level and segment level.

    System- and segment-level performance on our test set:

[Image: system- and segment-level results]
  2. 🤔 When designing prompts, itemized responses work better than lengthy, detailed explanations of errors. Moreover, splitting the instruction into two steps, identifying errors and scoring the translation, can improve evaluation stability.

    A comparison of different prompt designs and their prompt contexts:

[Image: comparison of prompt designs]
[Image: prompt contexts]
  3. 😐 The performance gain from EA prompting on text-davinci-003 appears in the zero-shot setting rather than the few-shot setting, which indicates that the settings need to be adjusted when utilizing other GPT models.
  4. ❗ Despite its good performance, we show that ChatGPT is NOT a stable evaluator and may score the same translation differently (a simple way to probe this is sketched after this list).
[Image: example of unstable scoring]
  5. ❗ It is NOT advisable to combine multiple translations into a single query input, as ChatGPT shows a preference for the translations presented earlier.
[Image: multiple translations in a single query]
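To illustrate the stability issue in finding 4, one way to probe it is to re-score the same source-translation pair several times and compare the results. The model name, prompt, and score parsing below are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import re
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_score(source: str, translation: str) -> float:
    """Single scoring query; returns the first number found in the reply."""
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # even at temperature 0, repeated scores may differ
        messages=[{
            "role": "user",
            "content": (
                f"Source: {source}\nTranslation: {translation}\n"
                "Score this translation from 0 to 100. Reply with the number only."
            ),
        }],
    )
    # Naive parse for a sketch; a real script should handle non-numeric replies.
    return float(re.search(r"\d+(\.\d+)?", reply.choices[0].message.content).group())

# Re-score the same pair several times to gauge stability.
scores = [query_score("Der Hund jagt die Katze.", "The dog chases the cat.")
          for _ in range(5)]
print(f"mean={statistics.mean(scores):.1f}  stdev={statistics.stdev(scores):.1f}")
```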

Please refer to our full report & arXiv preprint for more details.

Citation

If you find this work helpful, please consider citing as follows:

@article{Lu2023EAPrompt,
  title={Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT},
  author={Lu, Qingyu and Qiu, Baopu and Ding, Liang and Zhang, Kanjian and Kocmi, Tom and Tao, Dacheng},
  journal={arXiv preprint},
  url={https://arxiv.org/pdf/2303.13809.pdf},
  year={2023}
}
