Code Monkey home page Code Monkey logo

tqp's Introduction

tqp

Dataset and pretrained model for question paraphrasing in Turkish

Acknowledgement

Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉

Overview

I am releasing a relatively large dataset and a pretrained model for question paraphrasing in Turkish in this repo.

Motivation

Question paraphrasing is an important task in natural language processing with a diverse set of application areas, but Turkish lacks a publically usable dataset for this task. Only this dataset provides 1377 question paraphrases in Turkish, but it is more suitable for paraphrasing detection than a generation task. Therefore, I decided to form and share a larger dataset to be used to train a question paraphrasing model in Turkish.

Usage

You can use thepretrained model through Transformers Library. The following simple code is enough to generate 5 paraphrases for the input question.

from transformers import AutoTokenizer, T5ForConditionalGeneration
model_name = "mys/mt5-small-turkish-question-paraphrasing"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

tokens = tokenizer.encode_plus("Yarın toplantı kaçta başlıyor?", return_tensors='pt')
paraphrases = model.generate(tokens['input_ids'], max_length=128, num_return_sequences=5, num_beams=5)
tokenizer.batch_decode(paraphrases, skip_special_tokens=True)

And the output will be something like:

['Yarın toplantı ne zaman başlıyor?',
 'Yarın toplantı saat kaçta başlıyor?',
 'Yarın toplantı saat kaçta başlar?',
 'Yarın toplantı ne zaman başlayacak?',
 'Yarın toplantı ne zaman başlar?']

Building Dataset

I first downloaded TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages and extracted tr.txt which contains 56277 paraphrasing groups with each one having a variable number of paraphrases. Then, I selected 7384 paraphrasing groups which ends in a question mark (?). The first 84 groups were separated as a development dataset and the remaining 7300 paraphrasing groups were used as the training dataset. By rolling out sentences in each paraphrasing groups, I obtained 4588 pairs in the dev set and 51620 pairs in the train set. You can find them in the data directory in this repo.

Considerations and Future work

A quick review of the dataset revealed that some sentences does not sound natural in Turkish (i.e., probably translated from another language), or some pairs may be questionable from one or more semantic perspectives. An upcoming study should fix these issues and further clean the data, but the dataset will be available as V0.1 until then as the pretrained model seems to be giving satisfactory results.

Citation

If you find this repo useful for your research, please consider citation.

License

Creative Commons Attribution 2.0 Generic (CC BY 2.0)

tqp's People

Contributors

monatis avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

yunti123

tqp's Issues

Fine-tuning script

Nice work! I just wanted to know how did you fine-tune the mt5 model on your turkish paraphrase set as I wanted to do something similar in German. Would you mind sharing your fine-tuning script?

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.