Code Monkey home page Code Monkey logo

kodialogbench's Introduction

KoDialogBench

This is the official repository for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" accepted at LREC-COLING 2024.

Data

KoDialogBench is a benchmark designed to assess the conversational capabilities of language models in Korean language. To this end, we collected native Korean dialogues on daily topics from public sources (e.g., AI Hub), or translated dialogues from other languages such as English and Chinese. We then structured these conversations into diverse test datasets, spanning from dialogue comprehension to response selection tasks. This benchmark consists of 21 test sets, encompassing various aspects of open-domain colloquial dialogues (e.g., topic, emotion, dialog act).

We uploaded the datasets on ๐Ÿค—Hugging Face Hub.

Sources

We collected native Korean dialogues from AI Hub:

  • K-SNS stands for Korean SNS (ํ•œ๊ตญ์–ด SNS)
  • K-TDD stands for Thematic Daily Dialogues (์ฃผ์ œ๋ณ„ ํ…์ŠคํŠธ ์ผ์ƒ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ)
  • K-ED stands for Emotional Dialogues (๊ฐ์„ฑ ๋Œ€ํ™” ๋ง๋ญ‰์น˜)
  • K-DS stands for Dialogue Summary (ํ•œ๊ตญ์–ด ๋Œ€ํ™” ์š”์•ฝ)

We translated public datasets from other languages:

Statistics

The dataset has 82,962 examples in total.

Task Subtask Source # Options # Examples
Dialogue Comprehension Topic Classification K-SNS 6 1200
Dialogue Comprehension Topic Classification K-TDD 19 1900
Dialogue Comprehension Topic Classification SocialDial 4 400
Dialogue Comprehension Emotion Recognition K-ED 6 1200
Dialogue Comprehension Emotion Recognition DailyDialog 5 470
Dialogue Comprehension Emotion Recognition Empathetic Dialogues 2 2000
Dialogue Comprehension Relation Classification SocialDial (Distance) 4 524
Dialogue Comprehension Relation Classification SocialDial (Relation) 3 330
Dialogue Comprehension Location Classification SocialDial 4 376
Dialogue Comprehension Dialog Act Classification K-TDD 4 520
Dialogue Comprehension Dialog Act Classification DailyDialog 4 1000
Dialogue Comprehension Fact Identification K-DS 4 1200
Dialogue Comprehension Fact Identification PersonaChat 4 1000
Dialogue Comprehension Fact Identification Empathetic Dialogues 4 2394
Response Selection K-SNS 5 10295
Response Selection K-TDD 5 10616
Response Selection K-ED 5 17818
Response Selection PersonaChat 5 7801
Response Selection DailyDialog 5 6740
Response Selection Empathetic Dialogues 5 7941
Response Selection SocialDial 5 7237

Usage

lm-evaluation-harness is used for zero-shot and few-shot evaluation.

TODO: merge the KoDialogBench task to lm-evaluation-harness

Installation

Install lm-eval first before cloning this repo.

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
python -m venv venv
pip install -e .
pip install -e ".[multilingual]"
pip install sentencepiece

Task registration

After cloning this repo, copy task configs to lm-eval

cp -r kodialogbench ../lm-evaluation-harness/lm_eval/tasks

Evaluation

You can evaluate the subsets using the following arguments to --tasks:

  • kodialogbench_dc: 14 dialogue comprehension tasks
  • kodialogbench_rs: 7 response selection tasks
  • kodialogbench_dc_topic: 3 topic classification tasks
  • kodialogbench_dc_emotion: 3 emotion classification tasks
  • kodialogbench_dc_relation: 2 relation classification tasks
  • kodialogbench_dc_dialog_act: 2 dialog act classification tasks
  • kodialogbench_dc_fact: 3 fact identification tasks
lm_eval --model hf \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kodialogbench \
    --device cuda:0 \
    --batch_size auto \
    --num_fewshot 0

If you want to change prompts, modify doc_to_text functions in utils.py.

Limitations

Our benchmark may suffer from a chronic problem of benchmark contamination. Due to the scarcity of Korean language resources, there is a possibility that the held-out sources utilized to construct the benchmark might overlap with training data used for some language models.

Ethics Statement

Our benchmark dataset is designed to assess capabilities related to various situations and aspects of conversations in Korean language. To achieve this, we utilized conversational content from publicly available datasets from various sources, either without modification or with translation if necessary. During this process, there is a possibility that harmful content or inappropriate biases existing in the original data may have been conveyed, or may have arisen due to limitations of translation tools. We reject any form of violence, discrimination, or offensive language, and our benchmark dataset and experimental results does not represent such values. If any harmful content or privacy infringement is identified within the dataset, we kindly request immediate notification to the authors. In the event of such cases being reported, we will apply the highest ethical standards and take appropriate actions.

Citation

@misc{jang2024kodialogbench,
      title={KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark}, 
      author={Seongbo Jang and Seonghyeon Lee and Hwanjo Yu},
      year={2024},
      eprint={2402.17377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Point of Contact

Seongbo Jang

kodialogbench's People

Contributors

sb-jang avatar

Stargazers

Jaeyoon Jung avatar Kim Kang Min avatar Yohan Na avatar ์ด์ƒ๋ฏผ avatar Jaehoon Lee avatar Myungchul Shin avatar Jihoon Lee avatar seoyeon avatar Jiwung Hyun avatar Byeongchang Kim avatar Hyunwoo Kim avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.