
nlp-data-augmentation's Introduction

NLP Data Augmentation

(Augmenting Textual Data Using NLP Libraries)

“Augmentation” is the process of enlarging something in size or amount. In this article, we work out how to increase the size of a dataset using data augmentation techniques for textual data. Because neural architectures rely on large parallel corpora, synthetically generating data (which is called data augmentation) can be of huge help.

As mentioned in “A Survey of Data Augmentation Approaches for NLP”[b], some of the Data Augmentation Techniques are:

  1. Rule-Based: Easy Data Augmentation (EDA)
  2. Example Interpolation Techniques: MIXUP, SEQ2MIXUP
  3. Model-Based Techniques: Seq2seq, language model, backtranslation, fine-tuning GPT-2, paraphrasing.
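The list above only names MIXUP, so here is a minimal sketch of the idea, assuming the standard mixup formulation: two training examples and their labels are blended with a weight `lam` drawn from a Beta(alpha, alpha) distribution. For text, the inputs are assumed to be fixed-length embedding vectors rather than raw tokens. The function name, the `alpha` default, and the toy vectors are illustrative choices, not taken from the survey.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (embedding, one-hot label) pairs into one synthetic example.

    Assumes x_i and x_j have the same length, and likewise y_i and y_j.
    """
    # Mixing weight in [0, 1]; small alpha keeps most samples close to one parent.
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


# Toy 2-dimensional "embeddings" with one-hot labels (hypothetical data).
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
print(mixed_x, mixed_y, lam)
```

Because the labels are blended with the same weight as the inputs, the synthetic label stays a valid probability distribution.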

Among the rule-based techniques, the most basic and commonly used is Easy Data Augmentation (EDA). The EDA techniques are:

  1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  2. Random Deletion: Randomly remove each word in the sentence with probability p.
  3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
  4. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
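The four operations above can be sketched in plain Python as follows. This is a minimal illustration, not the reference EDA implementation: `STOP_WORDS` and `SYNONYMS` are tiny hypothetical stand-ins for a full stop-word list and a thesaurus such as WordNet, and all function names are chosen here for readability.

```python
import random

# Hypothetical toy resources; a real pipeline would use a full stop-word
# list and a thesaurus such as WordNet.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "of", "and", "to"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "dog": ["hound", "canine"],
}


def _candidates(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS
    ]


def synonym_replacement(words, n, rng):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    out = list(words)
    idxs = _candidates(out)
    rng.shuffle(idxs)
    for i in idxs[:n]:
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out


def random_deletion(words, p, rng):
    """Remove each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Keep one word so the sentence is never empty.
    return kept if kept else [rng.choice(words)]


def random_swap(words, n, rng):
    """Swap the positions of two randomly chosen words, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        idxs = _candidates(out)
        if not idxs:
            break
        synonym = rng.choice(SYNONYMS[out[rng.choice(idxs)].lower()])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


rng = random.Random(0)  # seeded so runs are repeatable
sentence = "the quick dog is happy".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
print(" ".join(random_swap(sentence, 1, rng)))
print(" ".join(random_insertion(sentence, 1, rng)))
```

Each function returns a new list and leaves the input untouched, so several augmented variants can be generated from the same sentence.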

Various Data Augmentation Tasks:

  1. Summarization
  2. Question Answering
  3. Sequence Tagging
  4. Parsing
  5. Grammatical Error Correction
  6. Neural Machine Translation
  7. Data to Text
  8. Dialogue

Various Libraries available:

  1. TextAugment
  2. Augly
  3. NLPAug
  4. Parrot paraphrase
  5. Pegasus paraphrase

Working code for each library can be found here:

Sample Output:

TextAugment

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.


Augly

Facebook recently released the AugLy package as open source. The AugLy library is divided into four sub-libraries, one for each data modality (audio, images, video, and text).

NLPAug

NLPAug is a library for textual augmentation in machine learning experiments. Its goal is to improve deep learning model performance by generating textual data.

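Back translation involves taking the translated version of a text and translating it back into the original language, traditionally by a separate, independent translator who has no knowledge of or contact with the original text. For data augmentation, both steps are typically performed by machine translation models, and the round-trip output serves as a paraphrase of the original.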
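A minimal sketch of that round trip used for augmentation, assuming the two translation steps are supplied as plain functions. In practice these would be machine translation models; here two hypothetical word tables (`EN_FR`, `FR_EN`) stand in so the example runs without any external library.

```python
from typing import Callable, Dict, List

Translator = Callable[[str], str]

# Hypothetical toy word tables standing in for real translation models.
EN_FR: Dict[str, str] = {"the": "la", "big": "grand", "house": "maison"}
FR_EN: Dict[str, str] = {"la": "the", "grand": "large", "maison": "house"}


def word_table_translator(table: Dict[str, str]) -> Translator:
    """Build a word-by-word translator; unknown words pass through unchanged."""
    return lambda text: " ".join(table.get(w, w) for w in text.split())


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into the pivot language and back into the original one."""
    return from_pivot(to_pivot(text))


def augment_with_back_translation(
    texts: List[str], to_pivot: Translator, from_pivot: Translator
) -> List[str]:
    """Return the originals plus every round-trip output that differs from its source."""
    augmented = list(texts)
    for text in texts:
        paraphrase = back_translate(text, to_pivot, from_pivot)
        if paraphrase != text:
            augmented.append(paraphrase)
    return augmented


to_fr = word_table_translator(EN_FR)
to_en = word_table_translator(FR_EN)
print(augment_with_back_translation(["the big house"], to_fr, to_en))
# prints ['the big house', 'the large house']
```

Round-trip outputs identical to their source are dropped because they add no new training example.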

Parrot paraphrase

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Pegasus paraphrase

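PEGASUS is a standard Transformer encoder-decoder. It is pre-trained on large corpora of documents using gap-sentence generation (GSG): important sentences are removed from a document and the model learns to generate them from the remaining text [j].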

REF:

[a] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, Rishemjit Kaur. 2021. Neural Machine Translation for Low-Resource Languages: A Survey.

[b] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP.

[c] EDA from scratch: https://jovian.ai/abdulmajee/eda-data-augmentation-techniques-for-text-nlp

[d] TextAugment https://github.com/dsfsi/textaugment

[e] Augly https://analyticsarora.com/how-to-use-augly-on-image-video-audio-and-text/

[f] nlpaug https://github.com/makcedward/nlpaug

[g] Parrot Paraphraser https://github.com/PrithivirajDamodaran/Parrot_Paraphraser

[h] Pegasus Paraphraser https://huggingface.co/tuner007/pegasus_paraphrase

[i] Improving short text classification through global augmentation methods.

[j] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization https://arxiv.org/abs/1912.08777
