
cped's Introduction

This repository provides the implementation details for the paper:
CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI

For more information, please refer to our paper.

The dataset is also available on luge.ai: https://www.luge.ai/#/luge/dataDetail?id=41

We construct a dataset named CPED from 40 Chinese TV shows. CPED contains multisource knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other annotations. The table below compares CPED with other common conversation datasets.

  • We build a multi-turn Chinese Personalized and Emotional Dialogue dataset called CPED. To the best of our knowledge, CPED is the first Chinese personalized and emotional dialogue dataset. CPED contains 12K dialogues and 133K utterances with multimodal context, so it can be used for both complex dialogue understanding and human-like conversation generation.
  • CPED is annotated with 3 character attributes (name, gender, and age), Big Five personality traits, 2 types of dynamic emotional information (sentiment and emotion), and dialogue acts (DAs). The personality traits and emotions can be used as prior external knowledge for open-domain conversation generation, giving the conversation system stronger personification capabilities.
  • We propose three tasks for CPED: personality recognition in conversations (PRC), emotion recognition in conversations (ERC), and personalized and emotional conversation (PEC). A set of experiments verifies the importance of using personalities and emotions as prior external knowledge for conversation generation.

[Table image: comparison of CPED with other common conversation datasets]

To help dialogue systems learn emotional and personalized expression, we provide the annotation labels listed in the following table.

| Annotation | Labels | Num. |
|---|---|---|
| Sentiment | positive, neutral, and negative | 3 |
| Emotion | happy, grateful, relaxed, other-positive, neutral, angry, sad, feared, depressed, disgusted, astonished, worried and other-negative | 13 |
| Gender | male, female, and unknown | 3 |
| Age group | children, teenager, young, middle-aged, elderly and unknown | 6 |
| Big Five | high, low, and unknown | 3 |
| DA | greeting (g), question (q), answer (ans), statement-opinion (sv), statement-non-opinion (sd), apology (fa), command (c), agreement/acceptance (aa), disagreement (dag), acknowledge (a), appreciation (ba), interjection (ij), conventional-closing (fc), thanking (ft), quotation (^q), reject (rj), irony (ir), comfort (cf) and other (oth) | 19 |
| Scene | home, office, school, mall, hospital, restaurant, sports-venue, entertainment-venue, car, outdoor and other-scene | 11 |
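For model training, label sets like the ones in the table above are usually mapped to integer ids. The snippet below is a minimal sketch of such a mapping, not code from this repository: the names `LABELS`, `LABEL_TO_ID`, and `encode`, as well as the dictionary keys, are hypothetical, and only the label strings themselves are taken from the table (DAs are represented by their abbreviations).

```python
# Label vocabularies transcribed from the annotation table.
# LABELS, LABEL_TO_ID, encode and the field keys are illustrative
# assumptions, not identifiers from the CPED repository.
LABELS = {
    "sentiment": ["positive", "neutral", "negative"],
    "emotion": [
        "happy", "grateful", "relaxed", "other-positive", "neutral",
        "angry", "sad", "feared", "depressed", "disgusted",
        "astonished", "worried", "other-negative",
    ],
    "gender": ["male", "female", "unknown"],
    "age_group": ["children", "teenager", "young", "middle-aged",
                  "elderly", "unknown"],
    "big_five": ["high", "low", "unknown"],
    "da": [
        "g", "q", "ans", "sv", "sd", "fa", "c", "aa", "dag", "a",
        "ba", "ij", "fc", "ft", "^q", "rj", "ir", "cf", "oth",
    ],
    "scene": [
        "home", "office", "school", "mall", "hospital", "restaurant",
        "sports-venue", "entertainment-venue", "car", "outdoor",
        "other-scene",
    ],
}

# One label -> id dictionary per annotation field, in table order.
LABEL_TO_ID = {
    field: {label: i for i, label in enumerate(values)}
    for field, values in LABELS.items()
}


def encode(field, label):
    """Return the integer id of `label` in `field` (KeyError if unknown)."""
    return LABEL_TO_ID[field][label]


if __name__ == "__main__":
    # Print the vocabulary size of each field to compare with the "Num." column.
    for field, values in LABELS.items():
        print(field, len(values))
```

The id order simply follows the order of the table; any fixed order works as long as training and evaluation share it.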

The distributions of gender, age group, sentiment, emotion, and DA in the CPED dataset are shown in the following figure.

The statistics of CPED are listed in the following table.

| Statistics | Train | Dev | Test |
|---|---|---|---|
| # of modalities | (v,a,t) | (v,a,t) | (v,a,t) |
| # of TV plays | 26 | 5 | 9 |
| # of dialogues | 8,086 | 934 | 2,815 |
| # of utterances | 94,187 | 11,137 | 27,438 |
| # of speakers | 273 | 38 | 81 |
| Avg. # utt. per dial. | 11.6 | 11.9 | 9.7 |
| Max # utt. per dial. | 75 | 31 | 34 |
| Avg. # of emot. per dial. | 2.8 | 3.4 | 3.2 |
| Avg. # of DAs per dial. | 3.6 | 3.7 | 3.2 |
| Avg. utt. length | 8.3 | 8.2 | 8.3 |
| Max utt. length | 127 | 42 | 45 |
| Avg. duration of an utterance | 2.1s | 2.12s | 2.21s |
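The per-split counts in the statistics table can be cross-checked against the headline figures quoted earlier (40 TV shows, about 12K dialogues, about 133K utterances). The following is a small sketch of that arithmetic; the names `SPLITS`, `totals`, and `avg_utterances_per_dialogue` are made up for illustration, and the numbers are copied from the table.

```python
# Per-split counts copied from the statistics table.
# All names here are illustrative, not part of the CPED code base.
SPLITS = {
    "train": {"tv_plays": 26, "dialogues": 8086, "utterances": 94187},
    "dev": {"tv_plays": 5, "dialogues": 934, "utterances": 11137},
    "test": {"tv_plays": 9, "dialogues": 2815, "utterances": 27438},
}


def totals(splits):
    """Sum every count over all splits."""
    keys = next(iter(splits.values())).keys()
    return {key: sum(split[key] for split in splits.values()) for key in keys}


def avg_utterances_per_dialogue(split):
    """Average dialogue length in utterances, rounded to one decimal."""
    return round(split["utterances"] / split["dialogues"], 1)


if __name__ == "__main__":
    print(totals(SPLITS))
    for name, split in SPLITS.items():
        print(name, avg_utterances_per_dialogue(split))
```

The totals come to 40 TV plays, 11,835 dialogues, and 132,762 utterances, and the per-split averages reproduce the 11.6, 11.9, and 9.7 reported in the table.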

CPED supports the evaluation of both conversational cognitive tasks and conversation generation tasks, e.g., speaker modeling, personality recognition in conversations, emotion recognition in conversations, DA recognition in conversations, emotion prediction for response, emotional conversation generation, personalized conversation generation, and empathetic conversation. Because it is multimodal, CPED can also be applied to multimodal personality or emotion recognition and to multimodal conversation generation. It is intended to help advance research on cognitive intelligence.
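We introduce three tasks in this project: personality recognition in conversations (PRC), emotion recognition in conversations (ERC), and personalized and emotional conversation (PEC).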

You can create the Python virtual environment with the following bash script:

conda create -n py38 python=3.8
conda activate py38
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install tensorflow==2.2.0
pip install transformers==4.18.0
python -m pip install paddlepaddle-gpu==2.3.0 -i https://mirror.baidu.com/pypi/simple
pip install pytorch-ignite==0.4.8
pip install notebook
pip install pandas
pip install chardet
pip install matplotlib==3.5.2
python -m pip install paddlenlp -i https://mirrors.aliyun.com/pypi/simple/
python -m pip install ppasr -i https://mirrors.aliyun.com/pypi/simple/ -U
pip install nltk
pip install bert-score

The versions of the main packages used are as follows:

python=3.8
torch==1.9.0+cu102 
torchvision==0.10.0+cu102 
torchaudio==0.9.0
tensorflow==2.2.0
tensorboard==2.2.2
transformers==4.18.0
paddlepaddle-gpu==2.3.0
paddlenlp==2.3.2
pytorch-ignite==0.4.8
matplotlib==3.5.2
notebook==6.4.11
pandas==1.4.2
chardet==4.0.0
nltk==3.7
bert-score==0.3.11
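After installation, it can be useful to confirm that the environment matches the pinned versions. The sketch below uses the standard-library `importlib.metadata` module (available from Python 3.8) for this; it is not part of the repository, and `PINNED` and `check_versions` are hypothetical names. The keys are assumed to be the PyPI distribution names used in the install commands above.

```python
from importlib import metadata

# Pins copied from the version list above; keys are assumed to be the
# PyPI distribution names. PINNED and check_versions are illustrative.
PINNED = {
    "torch": "1.9.0+cu102",
    "torchvision": "0.10.0+cu102",
    "torchaudio": "0.9.0",
    "tensorflow": "2.2.0",
    "tensorboard": "2.2.2",
    "transformers": "4.18.0",
    "paddlepaddle-gpu": "2.3.0",
    "paddlenlp": "2.3.2",
    "pytorch-ignite": "0.4.8",
    "matplotlib": "3.5.2",
    "notebook": "6.4.11",
    "pandas": "1.4.2",
    "chardet": "4.0.0",
    "nltk": "3.7",
    "bert-score": "0.3.11",
}


def check_versions(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at exactly the listed version; the comparison is a plain string match, so a different CUDA suffix on `torch` would be reported as a mismatch.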

Please cite our paper if you use CPED or this project:

@article{chen2022cped,
	title={{CPED}: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI},
	author={Yirong Chen and Weiquan Fan and Xiaofen Xing and Jianxin Pang and Minlie Huang and Wenjing Han and Qianfeng Tie and Xiangmin Xu},
	journal={arXiv preprint arXiv:2205.14727},
	year={2022},
	url={https://arxiv.org/abs/2205.14727}
}

Engineering Research Center of Ministry of Education on Human Body Perception
