
LALM

This repository contains the source code for our ACL-IJCNLP 2021 Findings paper, Multi-Lingual Question Generation with Language Agnostic Language Model.

Pre-processing

1. Download and process the Wikipedia dumps

First, download the Wikipedia dumps from https://dumps.wikimedia.org/. There are 10 languages used in this paper for pre-training.

Language        Short name  Size
Chinese         zh          1.4G
English         en          14G
Korean          ko          679M
French          fr          4.4G
Hindi           hi          430M
Burmese         bu          208M
German          de          5.8G
Vietnamese      vi          979M
Japanese        ja          2.8G
Chinese Minnan  mi          124M

Note that the number of pre-training languages can be larger than the number of fine-tuning languages.

Download the original wiki dump file containing the full articles, such as jawiki-20200420-pages-articles.xml.bz2, and then use WikiExtractor to extract the paragraphs. For example:

python3 WikiExtractor.py  -b 100m -o raw/ jawiki-20200420-pages-articles.xml.bz2

This extracts the text of the Japanese wiki dump into the raw directory.
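
To process all languages in one go, a loop like the following can be used. This is a minimal sketch: the dump file names and the data/<language>/raw output layout are assumptions, chosen to match the paths read in the next step.

import subprocess

# Hypothetical dump file names -- replace with the dumps you actually downloaded.
DUMPS = {
    'chinese': 'zhwiki-20200420-pages-articles.xml.bz2',
    'english': 'enwiki-20200420-pages-articles.xml.bz2',
    'japanese': 'jawiki-20200420-pages-articles.xml.bz2',
    # ... add the remaining languages here
}

for language, dump_file in DUMPS.items():
    # Extract plain text into data/<language>/raw, which step 2 reads from.
    subprocess.run(
        ['python3', 'WikiExtractor.py', '-b', '100m',
         '-o', 'data/{}/raw/'.format(language), dump_file],
        check=True)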

2. Merge the text into a single file

Next, we merge the wiki text files produced by WikiExtractor into a single file, which is then used to train the SentencePiece tokenizer in step 3. The code we provide is in preprocess/process_wiki_data.py:

from tqdm import tqdm

# get_dir_files, get_file_info and write_lst_to_file are utility functions
# from this repository (a stand-alone sketch of them is given below).

def check_valid_wiki_passage(txt):
    # Skip the <doc ...> / </doc> markers emitted by WikiExtractor
    # and drop lines shorter than 10 characters.
    if txt.startswith('<doc id'):
        return False
    if txt.startswith('</doc>'):
        return False
    if len(txt) < 10:
        return False
    return True


def get_all_wiki_data(language='english'):
    # Merge all extracted files of one language into data/<language>/wiki.all.txt.
    raw_file_paths = get_dir_files('data/{}/raw'.format(language))
    data = []
    for one_file in tqdm(raw_file_paths):
        for line in get_file_info(one_file):
            if check_valid_wiki_passage(line):
                data.append(line.strip())
    output_filename = 'data/{}/wiki.all.txt'.format(language)
    print('dump {} wiki total size is {}'.format(language, len(data)))
    write_lst_to_file(data, output_filename)
    print('{} done!'.format(language))

Here we keep the text between the two XML doc tags and filter out lines shorter than 10 characters.
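
The helpers get_dir_files, get_file_info and write_lst_to_file come from the repository's utility module. If you want to run the snippet stand-alone, minimal versions could look like the following sketch (assumed semantics, not the original implementations):

import os

def get_dir_files(dir_path):
    # List all regular files under a directory, as full paths.
    return [os.path.join(dir_path, f) for f in os.listdir(dir_path)
            if os.path.isfile(os.path.join(dir_path, f))]

def get_file_info(file_path):
    # Yield the lines of a text file.
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line

def write_lst_to_file(lst, filename):
    # Write one item per line.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lst))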

3. Train the SentencePiece tokenizer

For each language, we use Google's SentencePiece library to learn a tokenizer automatically, using the unigram language model. The code is in preprocess/process_wiki_data.py:

import sentencepiece as spm

def train_vocab(language='english', vocab_size=30000):
    sp_path = 'data/{}/wiki.all.txt'.format(language)
    # Note: model_prefix below points to the authors' local path; adjust it
    # to your own output directory before running.
    content = '--input=' + sp_path + ' ' \
                                     '--model_prefix=/search/odin/bingning/data/LALM/language_slot/vocab.my_size --vocab_size=my_size ' \
                                     '--character_coverage=0.9999 ' \
                                     '--num_sub_iterations=2 ' \
                                     '--max_sentencepiece_length=36 ' \
                                     '--model_type=unigram --num_threads=40 --max_sentence_length=15000 ' \
                                     '--input_sentence_size=2000000 '

    # Fill in the my_size / language_slot placeholders.
    content = content.replace('my_size', str(vocab_size))
    content = content.replace('language_slot', language)
    spm.SentencePieceTrainer.Train(content)

where the language parameter is the respective language processed in the previous step. The default vocabulary size is 30,000.
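
For example, the tokenizers for all ten pre-training languages can be trained in one loop. This is a minimal sketch; the directory names follow the data/<language>/wiki.all.txt layout used in the pre-training script below, and the English directory name is an assumption.

LANGUAGES = ['chinese', 'english', 'korean', 'french', 'hindi',
             'burmese', 'german', 'vietnam', 'japanese', 'minnan']

for language in LANGUAGES:
    # Each call trains a 30,000-piece unigram SentencePiece model for one language.
    train_vocab(language=language, vocab_size=30000)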

Pre-training

After processing the original wiki dumps and obtaining the tokenizers, we pre-train LALM.

There are two types of models we can pre-train:

  • LALM_shared, where we do not discriminate between languages, and the low-level module is fed directly to the high-level module.
  • LALM, where we add a language discriminator and adopt adversarial training to make the high-level module more language-agnostic (a conceptual sketch is given below).
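
The following is not the authors' implementation, only a conceptual PyTorch sketch of the second variant, assuming a gradient-reversal-style adversarial setup: a discriminator tries to predict the source language from the representation passed to the high-level module, while the reversed gradient pushes that representation to become language-agnostic. The names GradReverse and LanguageDiscriminator and the loss combination are illustrative assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LanguageDiscriminator(nn.Module):
    """Predicts the source language from the hidden states fed to the high-level module."""
    def __init__(self, n_hidden=768, n_languages=10):
        super().__init__()
        self.classifier = nn.Linear(n_hidden, n_languages)

    def forward(self, hidden, lambd=1.0):
        reversed_hidden = GradReverse.apply(hidden, lambd)
        return self.classifier(reversed_hidden)

# Training sketch: the usual LM loss plus a language-classification loss on the
# reversed features, so the encoder is pushed to hide language identity:
#   disc_logits = discriminator(low_level_output.mean(dim=1))
#   loss = lm_loss + nn.functional.cross_entropy(disc_logits, language_ids)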

Our pre-training code is provided in train/pre_train_lalm.py. You can run the script with:

sh run_pre_train.sh

where run_pre_train.sh is:

#!/bin/bash

python3 -m torch.distributed.launch --nproc_per_node=4 pre_train_lalm.py \
--batch_size=24 \
--max_learning_rate=1e-4 \
--max_length=512 \
--n_embedding=128 \
--n_hidden=768 \
--n_layer=12 \
--n_head=12 \
--reload=False \
--epoch=-1 \
--type=shared \
--zh_path='data/chinese/wiki.all.txt' \
--en_path='data/english/wiki.all.txt' \
--ko_path='data/korean/wiki.all.txt' \
--fr_path='data/french/wiki.all.txt' \
--hi_path='data/hindi/wiki.all.txt' \
--bu_path='data/burmese/wiki.all.txt' \
--de_path='data/german/wiki.all.txt' \
--vi_path='data/vietnam/wiki.all.txt' \
--ja_path='data/japanese/wiki.all.txt' \
--mi_path='data/minnan/wiki.all.txt'

where the reload parameter indicates whether to reload a previously pre-trained model from a checkpoint, and epoch is the number of the checkpoint to load.

type can be either shared or lalm.

We also provide the training curve from our environment, where we used 4 NVIDIA V100 16GB GPUs.

(Figure: pre-training learning curve)

Note that the final average loss is around 3.5 for LALM_shared.

  • You can download our pre-trained models: base and large.

Fine-tuning

1. Preprocess the question generation training data

First, we pre-process the question answering datasets. We transform the data into a list of items, where each item is [context, document_features, questions] and each question is [question, question_features]. The processing scripts are in the preprocess directory; an example for English SQuAD is shown below.

import json

import sentencepiece as spm

# multi_process and dump_file are repository utilities
# (a stand-alone sketch of them is given after this snippet).

sp = spm.SentencePieceProcessor()
sp.load('../data/vocab/english.30000.model')


def one_paragraphs(paragraphs):
    # Tokenize one SQuAD article: keep the raw context, its sentencepiece ids,
    # and the (text, ids) pair of every question asked about it.
    one_data = []
    for paragraph in paragraphs["paragraphs"]:
        context = paragraph['context']
        doc_ids = sp.EncodeAsIds(context)
        questions = []
        for qa in paragraph["qas"]:
            question_text = qa["question"]
            question_ids = sp.EncodeAsIds(question_text)
            questions.append([question_text, question_ids])
        one_data.append([context, doc_ids, questions])
    return one_data


def process(filename):
    with open(filename) as dataset_file:
        dataset_json = json.load(dataset_file)
        dataset = dataset_json['data']
    output = multi_process(one_paragraphs, dataset, num_cores=40)
    output = [y for x in output for y in x]
    if 'train' in filename:
        # For training, flatten into [document ids, question ids] pairs.
        output = [[x[1], y[1]] for x in output for y in x[2]]
    print('{} processed: {} samples'.format(filename, len(output)))
    return output


def get_squad():
    dev = process('../data/qg/english.dev.json')
    dump_file(dev, '../data/qg/dev.en.obj')
    train = process('../data/qg/english.train.json')
    dump_file(train, '../data/qg/train.en.obj')


if __name__ == '__main__':
    get_squad()
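
multi_process and dump_file are utilities from the repository; for a stand-alone run they can be approximated as follows (a sketch under assumed semantics, not the original implementations):

import pickle
from multiprocessing import Pool

def multi_process(func, data, num_cores=40):
    # Apply func to every item of data in parallel.
    with Pool(num_cores) as pool:
        return pool.map(func, data)

def dump_file(obj, filename):
    # Serialize the processed samples with pickle.
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)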

Since we do not hold the license to distribute the data, we list the sources below so you can obtain the data yourself.

Language Training data size Dev data size url Info
English 87,599 2,067 https://rajpurkar.github.io/SQuAD-explorer/ english.dev.json/english.train.json, V1.1
Korean 60,407 964 https://github.com/graykode/KorQuAD-beginner/tree/master/config
French 20,731 768 https://fquad.illuin.tech/
Hindi 4,000 2,555 https://www.cse.iitb.ac.in/~ganesh/HiQuAD/clqg/clqg_data.tar.gz decompress the tar file into the hindi directory.
Chinese 180,000 44,962
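
Other SQuAD-format datasets such as KorQuAD can be processed with the same process() function. The snippet below is a hypothetical example: the file names and the Korean tokenizer model name are assumptions, chosen to mirror the English example above.

def get_korquad():
    # Hypothetical paths -- adjust to where you place the data and the
    # tokenizer model trained in the pre-processing step.
    sp.load('../data/vocab/korean.30000.model')
    dev = process('../data/qg/korean.dev.json')
    dump_file(dev, '../data/qg/dev.ko.obj')
    train = process('../data/qg/korean.train.json')
    dump_file(train, '../data/qg/train.ko.obj')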

2. Fine-tuning

todo

Reference

If you wish to use our data in your research, please cite:

@inproceedings{wangmulti,
  title     = "Multi-Lingual Question Generation with Language Agnostic Language Model",
  author    = "Wang, Bingning and Yao, Ting and Chen, Weipeng and Xu, Jingfang and Wang, Xiaochuan",
  booktitle = "Findings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)",
  month     = aug,
  year      = "2021",
  address   = "Virtual"
}

