p2mcq's Introduction

Towards Process-Oriented, Modular, and Versatile Question Generation that Meets Educational Needs

Codebase and pre-trained models for the NAACL-2022 submission "Towards Process-Oriented, Modular, and Versatile Question Generation that Meets Educational Needs" by Xu Wang, Simin Fan, Jessica Houghton, and Lu Wang.


P2MCQ Dataset

The P2MCQ dataset archives 160 multiple-choice 307 questions with 629 question options in total (197 correct answers and 432 incorrect answers, or distractors) from the HCI-101 course. The dataset can be downloaded here.

Data Preprocessing

Set up

# Suggested: create a virtual environment
conda create -n p2mcq python=3.8
conda activate p2mcq

# Requirement
pip install -r requirements.txt

Parsing PDF Document

To preprocess a PDF document, we first use scipdf-parser to parse the PDF into sections of plain text.

The parser depends on GROBID, so make sure a GROBID server is running in the background before you process your own data. Install the parser and start the server with the following commands:

pip install git+https://github.com/titipata/scipdf_parser

git clone https://github.com/titipata/scipdf_parser.git

bash scipdf_parser/serve_grobid.sh

To process your own PDF document, run:

python ./Data/preprocessing.py --pdf_path <path2pdf_doc> --save_path <path to save processed data> --save_format <save format, default as csv>

The pdf_path can be either a path on your local file system or a publicly accessible link (e.g. https://arxiv.org/pdf/1908.08345.pdf).
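As a rough illustration of the step above, the sketch below checks whether a pdf_path is a link or a local path and flattens a parsed article into CSV text. It is not the code in preprocessing.py. The helper names (is_remote_pdf, sections_to_csv, parse_and_export) are hypothetical, and the "sections" / "heading" / "text" keys are an assumption about the dictionary that scipdf.parse_pdf_to_dict returns.

```python
import csv
import io
from urllib.parse import urlparse


def is_remote_pdf(pdf_path: str) -> bool:
    """Return True when pdf_path looks like an http(s) link rather than a local path."""
    return urlparse(pdf_path).scheme in ("http", "https")


def sections_to_csv(article: dict) -> str:
    """Flatten a parsed article into CSV text with one row per section.

    Assumption: `article` holds a "sections" list of dicts with "heading"
    and "text" keys, as scipdf.parse_pdf_to_dict is expected to produce.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["heading", "text"])
    for section in article.get("sections", []):
        # Collapse internal whitespace so each section stays on one CSV row.
        text = " ".join(str(section.get("text", "")).split())
        writer.writerow([section.get("heading", ""), text])
    return buffer.getvalue()


def parse_and_export(pdf_path: str) -> str:
    """Parse a PDF with scipdf (needs a running GROBID server) and return CSV text."""
    import scipdf  # imported lazily so the helpers above work without it

    return sections_to_csv(scipdf.parse_pdf_to_dict(pdf_path))
```

With GROBID running, calling parse_and_export("https://arxiv.org/pdf/1908.08345.pdf") should return the section table as a string that can be written to disk.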

Task1. Make input for Neural-based Sentence Selection

We follow the extractive summarization methodology introduced by Liu and Lapata (2019) to select salient sentences from the given paragraph.

python ./Data/task1.py --input_path <path to input passages> --src_write_into <path to save processed input> --tgt_path <path to target summary (not required)> --tgt_write_into <path to save processed target>

Modularized Automatic Models

We propose a list of off-the-shelf and fine-tuned models that modularize the end-to-end MCQ generation process. The subtasks are [T1-Sentence Selection]; [T2-Abstractive Paragraph Summarization]; [T3-Sentence Simplification]; [T4-Paraphrasing]; [T5-Negation Generation].

| Task | Instruction | Reference |
|---|---|---|
| Sentence Selection (i.e. extractive summarization) | BertSUMEXT | The implementation is based on the original codebase released by Liu and Lapata |
| Abstractive Summarization | BertSUMEXTABS, Bart-HCI | |
| Sentence Simplification | ACCESS, MUSS | The implementation is based on the original codebases (ACCESS, MUSS) released by Martin et al. |
| Paraphrasing | Bart-para-SCI | Finetuned on ParaSCI by Dong et al. |
| Negation | CrossAUG | The implementation is based on the original codebase released by Lee et al. |
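To make the modular idea concrete, here is a minimal sketch of chaining subtasks so that each stage consumes the previous stage's output and every intermediate result stays inspectable. The run_pipeline helper and both stage functions are hypothetical toy placeholders, not the models listed above; in practice each stage would wrap the corresponding model.

```python
from typing import Callable, Dict, List, Tuple

# A stage maps one piece of text to another (assumed interface for this sketch).
Stage = Callable[[str], str]


def run_pipeline(text: str, stages: List[Tuple[str, Stage]]) -> Dict[str, str]:
    """Apply each stage to the previous output and record every intermediate result."""
    outputs: Dict[str, str] = {"input": text}
    current = text
    for name, stage in stages:
        current = stage(current)
        outputs[name] = current
    return outputs


def select_first_sentence(passage: str) -> str:
    """Toy stand-in for sentence selection: keep only the first sentence."""
    return passage.split(". ")[0].rstrip(".") + "."


def naive_negate(sentence: str) -> str:
    """Toy stand-in for negation generation: negate the first ' is '."""
    return sentence.replace(" is ", " is not ", 1)


if __name__ == "__main__":
    result = run_pipeline(
        "GROBID is a parser. It is used by scipdf.",
        [("T1", select_first_sentence), ("T5", naive_negate)],
    )
    for name, text in result.items():
        print(f"{name}: {text}")
```

Because stages are plain callables, one module can be swapped or skipped without touching the others.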

Evaluation

The quality of the generated texts is evaluated with BLEU, ROUGE-1, ROUGE-2 and ROUGE-L scores. Reference texts must be provided.

python ./evaluation.py --input_path <input_filepath (txt)> --pred_path <pred_filepath (txt)> --gold_path <gold_filepath (txt)>
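For readers unfamiliar with the ROUGE metrics named above, the following is a simplified sketch of ROUGE-N and ROUGE-L F1 for a single prediction and reference. It assumes lowercased whitespace tokenization with no stemming, so its numbers may differ from those of evaluation.py or of standard ROUGE packages; the function names are illustrative only.

```python
from collections import Counter
from typing import List


def _ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap: int, pred_total: int, gold_total: int) -> float:
    """Harmonic mean of precision (overlap/pred) and recall (overlap/gold)."""
    if overlap == 0 or pred_total == 0 or gold_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / gold_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(pred: str, gold: str, n: int = 1) -> float:
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred_ngrams = _ngrams(pred.lower().split(), n)
    gold_ngrams = _ngrams(gold.lower().split(), n)
    overlap = sum((pred_ngrams & gold_ngrams).values())
    return _f1(overlap, sum(pred_ngrams.values()), sum(gold_ngrams.values()))


def rouge_l(pred: str, gold: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    # Dynamic-programming table holding LCS lengths of all prefixes.
    lcs = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i, p_tok in enumerate(p, 1):
        for j, g_tok in enumerate(g, 1):
            if p_tok == g_tok:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return _f1(lcs[len(p)][len(g)], len(p), len(g))
```

For example, rouge_n("the cat sat on the mat", "the cat is on the mat", n=2) shares three of five bigrams on each side and so gives 0.6.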

Potential Pitfall

  1. If you see the following error message:

    oserror: libcublas.so.10: cannot open shared object file: no such file or directory

    Check whether your torch and CUDA versions are compatible with each other and with your operating system. You can check your CUDA version by running nvidia-smi.
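A small diagnostic along these lines can collect the relevant versions in one place. This is a sketch rather than part of the repository: cuda_report is a hypothetical helper, and it relies only on torch.__version__, torch.version.cuda, torch.cuda.is_available() and the nvidia-smi command being on the PATH.

```python
import shutil
import subprocess


def cuda_report() -> dict:
    """Collect torch and driver information without crashing when either is missing."""
    report = {
        "torch": None,
        "torch_cuda": None,
        "cuda_available": False,
        "import_error": None,
        "nvidia_smi": None,
    }
    try:
        import torch

        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = bool(torch.cuda.is_available())
    except (ImportError, OSError) as exc:
        # A missing shared library such as libcublas surfaces here as OSError.
        report["import_error"] = str(exc)

    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, timeout=30
            )
            if result.returncode == 0:
                report["nvidia_smi"] = result.stdout
        except (OSError, subprocess.SubprocessError):
            pass  # leave nvidia_smi as None when the tool cannot be run
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If torch_cuda names a CUDA release whose libraries are not installed, or one newer than the driver reported by nvidia-smi supports, installing a torch build that matches the local CUDA setup is the usual remedy.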
