
multiwoz2.4's Introduction

MultiWOZ 2.4

This is the dataset described in the paper: MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. Fanghua Ye, Jarana Manotumruksa, Emine Yilmaz. SIGDIAL 2022. [paper]

MultiWOZ 2.4 is a refined version of MultiWOZ 2.1. Specifically, we carefully rectified (almost) all the annotation errors in the validation set and test set. We keep the training set intact.

MultiWOZ 2.4 uses exactly the same format as MultiWOZ 2.1, so existing models built for MultiWOZ 2.1 can be run on MultiWOZ 2.4 with little effort.

Data Preprocessing

MultiWOZ 2.4 can be preprocessed by the script create_data.py

❱❱❱ python3 create_data.py

or simply by the script split.py

❱❱❱ python3 split.py

Benchmark Results

We test the performance of nine SOTA dialogue state tracking models on MultiWOZ 2.4. All the chosen models demonstrate much higher performance (joint goal accuracy), benefiting from the improved test set.

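| Model             | MultiWOZ 2.1 | MultiWOZ 2.4 |
|-------------------|--------------|--------------|
| SUMBT             | 49.01%       | 61.86%       |
| CHAN              | 53.38%       | 68.25%       |
| STAR              | 56.36%       | 73.62%       |
| TRADE             | 45.60%       | 55.05%       |
| PIN               | 48.40%       | 58.92%       |
| SOM-DST           | 51.24%       | 66.78%       |
| SimpleTOD         | 51.75%       | 57.18%       |
| SAVN              | 54.86%       | 60.55%       |
| TripPy            | 55.18%       | 64.75%       |
| IC-DST (GPT3)     | 50.65%       | 62.43%       |
| Seq2Seq           | 54.4%        | 67.10%       |
| TripPy-R          | 55.99%       | 69.87%       |
| D3ST (XXL)        | 57.8%        | 75.90%       |
| ASSIST (STAR)     | -            | 79.41%       |
| MetaASSIST (STAR) | -            | 80.10%       |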
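
For readers unfamiliar with the metric, here is a minimal sketch of how joint goal accuracy is commonly computed: a turn counts as correct only if the entire predicted dialogue state matches the reference state. The function name, the flat `slot -> value` dictionary layout, and the example slot names are assumptions made for illustration; they are not taken from this repository's scripts, and individual models may normalise values differently before comparing.

```python
from typing import Dict, List

# Assumed representation: one flat {slot: value} dict per dialogue turn.
State = Dict[str, str]


def joint_goal_accuracy(predictions: List[State], references: List[State]) -> float:
    """Return the fraction of turns whose predicted state equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A turn is correct only if every slot-value pair matches (no partial credit).
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references)


# Hypothetical two-turn example: the second turn has one wrong slot value.
refs = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
preds = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
]
print(joint_goal_accuracy(preds, refs))  # 0.5
```

Because a single wrong slot invalidates the whole turn, annotation errors in the test set can lower this score sharply, which is consistent with the gap between the two columns above.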

multiwoz2.4's People

Contributors

smartyfh, yushi-hu


multiwoz2.4's Issues

which results of CHAN-DST should we use?

I notice that CHAN-DST and STAR are both in your git repo. In the STAR paper you reran CHAN and reported a joint goal accuracy of 53.38% on MultiWOZ 2.1, whereas the original CHAN paper reported 58.55% on MultiWOZ 2.1. Which one is official?

Has the system_acts.json file been renamed?

Hello. Thanks for your awesome work.

However, I have one problem: I can't find system_acts.json.
Did the system_acts.json file get renamed to dialog_acts.json by any chance?
I'm opening an issue to make sure it's the same file.

E2E Modeling

Hi, nice work!
Would this refinement of the belief states affect performance on E2E modeling, in particular the inform rate and success rate?

Could you please provide the results of STAR on MultiWOZ 2.4?

Thanks for your very good work; it is helpful.
However, when I rerun STAR on MultiWOZ 2.4 I get 72.39%, which is lower than the result reported in the paper. I think the difference may be caused by the hardware environment. Could you please upload the results (such as exp.txt) for a fair comparison?

Add benchmark results from new work on leaderboard?

First, thanks for the nice work!

There are some new papers on your dataset that are reporting new results. Can we update your benchmark leaderboard by pull requests? (just like the leaderboard in the original MultiWOZ repo)

Pipe ("|") separated slot values

Hi there,

I am using this version of the dataset and I have noticed that for 436 dialogue turns there are slot values that are separated by a pipe ("|"). For example:

[Screenshot 2022-11-08 at 16 49 40]

  • Why is this the case?
  • It appears the correct one is always the first when splitting on the "|" character; is it correct to do this?

Thanks a lot in advance for your time!
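
The two readings raised in this question can be sketched as follows. This is only an illustration under stated assumptions: the thread does not confirm whether every pipe-separated value is an acceptable alternative or whether only the first one is canonical, so the helper supports both. The function names and the example label are hypothetical and do not come from the dataset's scripts.

```python
from typing import List


def acceptable_values(label: str) -> List[str]:
    """Split a pipe-separated label into its individual, whitespace-trimmed values."""
    return [value.strip() for value in label.split("|") if value.strip()]


def value_matches(prediction: str, label: str, first_only: bool = False) -> bool:
    """Check a predicted slot value against a possibly pipe-separated label.

    first_only=False assumes any listed value is acceptable;
    first_only=True assumes only the first listed value is correct.
    """
    candidates = acceptable_values(label)
    if first_only:
        candidates = candidates[:1]
    return prediction.strip() in candidates


# Hypothetical label, written for illustration only.
label = "the acorn guest house|acorn guest house"
print(value_matches("acorn guest house", label))                   # True
print(value_matches("acorn guest house", label, first_only=True))  # False
```

Which policy is appropriate depends on how the maintainers intend the pipe to be read, so results computed under the two settings should not be compared directly.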

`dontcare` values

Hi,

When the slot value is dontcare, should we expect a model to predict dontcare, or should we assume that any model prediction is ok? E.g. is it ok for a model to predict hotel=guest house when the ground truth is hotel=dontcare?

preprocessing details?

Thank you for this effort!

What do create_data.py and split.py do exactly? Why are they interchangeable?

Which TripPy result is valid?

Hi,

Thanks for your hard work.

In the paper, the joint goal accuracy for TripPy on MWZ2.4 is 59.62%; however, in the README table, the accuracy is 64.75%. Which one is valid? Thank you!

Ask about experimental results in paper

Thanks for your meaningful work! 👍🏻

There are a few questions I would like to ask about the performance recorded in the paper.

  1. When I train the existing models on MW2.4, I cannot reproduce the performance reported in the paper. Do you have any plans to upload the model checkpoint files?
    (Experimental models: TRADE, SOM-DST)

  2. For the experiments, were each model's original parameters used as is, or was additional parameter tuning performed?

Thank you.
