
mts-dialog's Introduction

Introduction

This repository contains the data and source code for the EACL 2023 paper: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters

- An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. 
- Asma Ben Abacha, Wen-wai Yim, Yadan Fan and Thomas Lin. 
- EACL, May 3-5, 2023, Dubrovnik, Croatia. 

    @inproceedings{mts-dialog,
        title = "An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters",
        author = "Ben Abacha, Asma  and
          Yim, Wen-wai  and
          Fan, Yadan  and
          Lin, Thomas",
        booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
        month = may,
        year = "2023",
        address = "Dubrovnik, Croatia",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.eacl-main.168",
        pages = "2291--2302"
    }

Datasets, Code & Annotations

Main Dataset

The MTS-Dialog dataset is a new collection of 1.7k short doctor-patient conversations and corresponding summaries (section headers and contents).
  • The training set consists of 1,201 pairs of conversations and associated summaries.

  • The validation set consists of 100 pairs of conversations and their summaries.

  • MTS-Dialog includes 2 test sets; each test set consists of 200 conversations and associated section headers and contents.
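As a quick sanity check on the "1.7k" figure, the split sizes listed above can be summed. This is a minimal sketch that uses only the counts stated in this README; the split names `test_1` and `test_2` are hypothetical labels, and the sketch assumes the four splits do not overlap.

```python
# Split sizes as stated in this README (not read from the data files).
SPLIT_SIZES = {
    "train": 1201,
    "validation": 100,
    "test_1": 200,  # hypothetical label for the first test set
    "test_2": 200,  # hypothetical label for the second test set
}


def total_pairs(split_sizes):
    """Return the total number of conversation-summary pairs across splits."""
    return sum(split_sizes.values())


total = total_pairs(SPLIT_SIZES)
print(total)  # 1701, i.e. roughly 1.7k
```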

The full list of normalized section headers:

    1. fam/sochx [FAMILY HISTORY/SOCIAL HISTORY]
    2. genhx [HISTORY OF PRESENT ILLNESS]
    3. pastmedicalhx [PAST MEDICAL HISTORY]
    4. cc [CHIEF COMPLAINT]
    5. pastsurgical [PAST SURGICAL HISTORY]
    6. allergy
    7. ros [REVIEW OF SYSTEMS]
    8. medications
    9. assessment
    10. exam
    11. diagnosis
    12. disposition
    13. plan
    14. edcourse [EMERGENCY DEPARTMENT COURSE]
    15. immunizations
    16. imaging
    17. gynhx [GYNECOLOGIC HISTORY]
    18. procedures
    19. other_history
    20. labs
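The list above can be kept as a small lookup table when processing the section headers. The sketch below is illustrative only: the names `NORMALIZED_HEADERS`, `HEADER_EXPANSIONS`, and `expand_header` are hypothetical and are not part of the released code. Headers listed without an expansion are returned unchanged.

```python
# The 20 normalized section headers, in the order listed in this README.
NORMALIZED_HEADERS = [
    "fam/sochx", "genhx", "pastmedicalhx", "cc", "pastsurgical",
    "allergy", "ros", "medications", "assessment", "exam",
    "diagnosis", "disposition", "plan", "edcourse", "immunizations",
    "imaging", "gynhx", "procedures", "other_history", "labs",
]

# Expansions given in this README for the abbreviated headers only.
HEADER_EXPANSIONS = {
    "fam/sochx": "FAMILY HISTORY/SOCIAL HISTORY",
    "genhx": "HISTORY OF PRESENT ILLNESS",
    "pastmedicalhx": "PAST MEDICAL HISTORY",
    "cc": "CHIEF COMPLAINT",
    "pastsurgical": "PAST SURGICAL HISTORY",
    "ros": "REVIEW OF SYSTEMS",
    "edcourse": "EMERGENCY DEPARTMENT COURSE",
    "gynhx": "GYNECOLOGIC HISTORY",
}


def expand_header(header):
    """Return the expanded name of a normalized header, or the header itself
    when the README gives no expansion. Raises ValueError for unknown headers."""
    key = header.strip().lower()
    if key not in NORMALIZED_HEADERS:
        raise ValueError(f"unknown section header: {header!r}")
    return HEADER_EXPANSIONS.get(key, key)


print(expand_header("genhx"))  # HISTORY OF PRESENT ILLNESS
print(expand_header("labs"))   # labs
```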

Augmented dataset

The augmented dataset consists of 3.6k pairs of medical conversations and associated summaries, created from the original 1.2k training pairs via back-translation through two languages, French and Spanish, as described in the paper (cf. Section 4.2).

We provide the full augmented training set that we used in the experiments, as well as the separate datasets created using the French and Spanish translation models.
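The general idea of back-translation can be sketched as follows. This is not the authors' implementation: the translator functions are toy stand-ins for real translation models, and the sketch assumes that both the conversation and the summary are back-translated and that the originals are kept, which is consistent with 1.2k pairs growing to 3.6k but is not confirmed by this README. See Section 4.2 of the paper for the actual procedure.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]             # (conversation, summary)
Translator = Callable[[str], str]  # hypothetical translation function


def back_translate(text: str, to_pivot: Translator, to_english: Translator) -> str:
    """Translate English text into a pivot language and back to English."""
    return to_english(to_pivot(text))


def augment(pairs: List[Pair],
            pivots: Dict[str, Tuple[Translator, Translator]]) -> List[Pair]:
    """Return the original pairs plus one back-translated copy per pivot language."""
    augmented = list(pairs)
    for to_pivot, to_english in pivots.values():
        for conversation, summary in pairs:
            augmented.append((
                back_translate(conversation, to_pivot, to_english),
                back_translate(summary, to_pivot, to_english),
            ))
    return augmented


# Toy word-swap "translators" that only mimic the paraphrasing effect.
toy_pivots = {
    "fr": (lambda t: t.replace("pain", "douleur"),
           lambda t: t.replace("douleur", "ache")),
    "es": (lambda t: t.replace("pain", "dolor"),
           lambda t: t.replace("dolor", "soreness")),
}
toy_pairs = [("Doctor: Any pain? Patient: Yes, chest pain.",
              "Patient reports chest pain.")]

augmented = augment(toy_pairs, toy_pivots)
print(len(augmented))  # 3: the original pair plus one copy per pivot language
```

Under these assumptions, two pivot languages triple the size of the training set.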

Source Code

Source code for summarizing doctor-patient conversations and automatically generating clinical notes.

Manual Scores for Correlation Study

  • Manual fact-based scores for the evaluation of 400 automatic summaries generated using four summarization models from the validation set of 100 conversations and notes.

  • The factual precision, recall, and F1 scores, the hallucination and omission rates, and the Levenshtein edit distance are computed from the fact-based manual counts and corrections.

  • We used the manual scores to evaluate the performance of several evaluation metrics (e.g., ROUGE, BERTScore, and BLEURT) by computing the Pearson's correlation coefficients between the automatic and manual scores, as described in the paper (cf. Section 5.2 and Section 5.3).

  • We provide all the data needed to perform this correlation study on other evaluation metrics.
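To repeat the correlation study with another metric, the core computation is Pearson's correlation coefficient between per-summary automatic scores and the corresponding manual scores. The sketch below implements the standard formula in plain Python; the function name and the toy score lists are illustrative assumptions and do not come from the released data or scripts.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two paired scores are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: one manual score and one automatic score per summary.
manual_scores = [0.20, 0.50, 0.70, 0.90]
automatic_scores = [0.25, 0.45, 0.80, 0.85]
print(round(pearson(automatic_scores, manual_scores), 3))
```

A coefficient close to 1 would indicate that the automatic metric ranks summaries much like the manual scores do; a value near 0 would indicate little linear agreement.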

Challenges & Evaluation Scripts

License

Contact

- Asma Ben Abacha (abenabacha at microsoft dot com)
- Wen-wai Yim (yimwenwai at microsoft dot com)


mts-dialog's Issues

Consultation About the data split

Hello, I have been following this paper recently, but I have a question about the data split, which differs between the GitHub repository and the paper.

The GitHub README says:

The training set consists of 1,201 pairs of conversations and associated summaries.

The validation set consists of 100 pairs of conversations and their summaries.

MTS-Dialog includes 2 test sets; each test set consists of 200 conversations and associated section headers and 

But in the paper:

We use a test set of 100 conversations and notes, randomly selected from the MTS-DIALOG dataset. The remaining pairs are used for training (1,201 pairs) and validation (400 pairs).

The training set appears to be identical, but the validation and test sets are completely different. Could you please specify the details of the data split method?
