smartyfh / multiwoz2.4

MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset

License: MIT License

task-oriented-dialogue dialogue-state-tracking noisy-label-learning

multiwoz2.4's Introduction

MultiWOZ 2.4

This is the dataset described in the paper: MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. Fanghua Ye, Jarana Manotumruksa, Emine Yilmaz. SIGDIAL 2022. [paper]

MultiWOZ 2.4 is a refined version of MultiWOZ 2.1. Specifically, we carefully rectified (almost) all the annotation errors in the validation set and test set. We keep the training set intact.

MultiWOZ 2.4 shares exactly the same format as MultiWOZ 2.1, so existing models built upon MultiWOZ 2.1 can be run on MultiWOZ 2.4 with little effort.

Data Preprocessing

MultiWOZ 2.4 can be preprocessed by the script create_data.py

❱❱❱ python3 create_data.py

or simply by the script split.py

❱❱❱ python3 split.py

Benchmark Results

We tested nine SOTA dialogue state tracking models on MultiWOZ 2.4. All of the chosen models achieve much higher joint goal accuracy than on MultiWOZ 2.1, benefiting from the improved test set.

Model	MultiWOZ 2.1	MultiWOZ 2.4
SUMBT	49.01%	61.86%
CHAN	53.38%	68.25%
STAR	56.36%	73.62%
TRADE	45.60%	55.05%
PIN	48.40%	58.92%
SOM-DST	51.24%	66.78%
SimpleTOD	51.75%	57.18%
SAVN	54.86%	60.55%
TripPy	55.18%	64.75%
IC-DST (GPT3)	50.65%	62.43%
Seq2Seq	54.4%	67.10%
TripPy-R	55.99%	69.87%
D3ST (XXL)	57.8%	75.90%
ASSIST (STAR)	-	79.41%
MetaASSIST (STAR)	-	80.10%
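
Joint goal accuracy, the metric reported in the table above, counts a turn as correct only when every slot value in the predicted dialogue state matches the reference state. The following is a minimal sketch of that computation, not the evaluation code of any model listed here. It assumes each state is a flat dictionary from slot name to value; the function name `joint_goal_accuracy` and the toy states are hypothetical.

```python
from typing import Dict, List

# Assumed representation: slot name -> value, e.g. {"hotel-area": "north"}.
State = Dict[str, str]


def joint_goal_accuracy(predictions: List[State], references: List[State]) -> float:
    """Return the fraction of turns whose predicted state equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A turn is correct only if the whole state matches, not just some slots.
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references)


# Hypothetical toy data: three turns, the second has one wrong slot value.
refs = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
preds = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
print(f"{joint_goal_accuracy(preds, refs):.2%}")
```

In this toy example two of the three turns match exactly. Because a single wrong slot makes the whole turn count as incorrect, annotation errors in the test set can lower the measured accuracy even when a model's prediction is right.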

multiwoz2.4's Issues

which results of CHAN-DST should we use?

I notice that CHAN-DST and STAR are both in your git repo. In the STAR paper you reran CHAN and reported a joint goal accuracy of 53.38% on MultiWOZ 2.1, whereas the original CHAN paper reported 58.55% on MultiWOZ 2.1. Which one is official?

Has the system_acts.json file been renamed?

Hello. Thanks for your awesome work.

However, I have one problem: I can't find system_acts.json.
Has the system_acts.json file been renamed to dialog_acts.json by any chance?
I'm opening this issue to make sure it's the same file.

E2E Modeling

Hi, nice work!
Would this refinement of belief states affect the performance of E2E modeling, in particular the inform rate and success rate?

Could you please provide the results of STAR on MultiWOZ 2.4?

Thanks for your very good work; it is helpful.
However, when I reran STAR on MultiWOZ 2.4 I got 72.39%, which is lower than the result reported in the paper. I think this may be caused by differences in the hardware environment. Could you please upload the results (e.g. exp.txt) for a fair comparison?

Add benchmark results from new work on leaderboard?

First, thanks for the nice work!

There are some new papers on your dataset that are reporting new results. Can we update your benchmark leaderboard by pull requests? (just like the leaderboard in the original MultiWOZ repo)

Pipe ("|") separated slot values

Hi there,

I am using this version of the dataset and I have noticed that for 436 dialogue turns there are slot values that are separated by a pipe ("|"). For example

Why is this the case?
It appears the correct one is always the first when splitting on the "|" character; is it correct to do this?

Thanks a lot in advance for your time!
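
The two ways of handling a pipe-separated value raised in the question above can be sketched as follows. This is only an illustration: it assumes that "|" separates alternative surface forms of one slot value, which this page does not confirm, and the helper names and the sample value are hypothetical.

```python
from typing import List


def split_alternatives(value: str) -> List[str]:
    """Split a slot value on "|", strip whitespace, and drop empty pieces."""
    return [part.strip() for part in value.split("|") if part.strip()]


def first_alternative(value: str) -> str:
    """Keep only the first alternative, as proposed in the question above."""
    parts = split_alternatives(value)
    return parts[0] if parts else value


def matches_any(prediction: str, annotated: str) -> bool:
    """Treat a prediction as correct if it equals any annotated alternative.

    Assumption: every alternative is an acceptable answer.
    """
    return prediction.strip() in split_alternatives(annotated)


# Hypothetical value, not copied from the dataset.
annotated = "guest house|guesthouse"
print(first_alternative(annotated))
print(matches_any("guesthouse", annotated))
```

Keeping only the first alternative gives a single label per slot, while matching against any alternative is more lenient at evaluation time. Which behaviour is appropriate depends on how the annotations are meant to be read.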

In the paper, the joint goal accuracy for TripPy on MWZ2.4 is 59.62%; however, in the README table, the accuracy is 64.75%. Which one is valid? Thank you!

Ask about experimental results in paper

Thanks for your meaningful work! 👍🏻

There are a few questions I would like to ask about the performance recorded in the paper.

When I train the existing models on MW2.4, I cannot reproduce the performance reported in the paper. Do you have any plans to upload the checkpoint files of the models?
(Experimental models: TRADE, SOM-DST)
Also, in your experiments, were each model's parameters used as is, or was additional parameter tuning performed?

Thank you.