NLP Data Augmentation

(Augmenting Textual Data Using NLP Libraries)

“Augmentation” is the process of enlarging something in size or amount. In this article, we’ll work out how to increase the size of a dataset using data augmentation techniques for textual data. Because neural architectures rely on large parallel corpora, synthetically generating data (which is what data augmentation does) can be of huge help.

As mentioned in “A Survey of Data Augmentation Approaches for NLP”[b], some of the Data Augmentation Techniques are:

Rule-Based Techniques: Easy Data Augmentation (EDA)
Example Interpolation Techniques: MIXUP, SEQ2MIXUP
Model-Based Techniques: Seq2seq, language model, backtranslation, fine-tuning GPT-2, paraphrasing.

Among the rule-based techniques, the most basic and most commonly used is EDA (Easy Data Augmentation). The EDA techniques are:

Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Deletion: Randomly remove each word in the sentence with probability p.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
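
The four EDA operations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the stop-word list and the synonym table are tiny hypothetical stand-ins for what a real setup would take from a resource such as WordNet, and the function names are our own. Keeping one word when random deletion would otherwise empty the sentence is a choice made for this sketch.

```python
import random

# Hypothetical toy resources; a real setup would use a full stop-word list
# and a synonym source such as WordNet.
STOP_WORDS = {"the", "a", "an", "is", "on", "over", "of", "and"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "jumps": ["leaps", "hops"],
    "lazy": ["idle", "sluggish"],
}


def _candidates(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [i for i, w in enumerate(words)
            if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]


def synonym_replacement(words, n, rng=random):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    words = list(words)
    idxs = _candidates(words)
    for i in rng.sample(idxs, min(n, len(idxs))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return words


def random_deletion(words, p, rng=random):
    """Drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Sketch-level choice: never return an empty sentence.
    return kept if kept else [rng.choice(words)]


def random_swap(words, n, rng=random):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_insertion(words, n, rng=random):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    words = list(words)
    for _ in range(n):
        idxs = _candidates(words)
        if not idxs:
            break
        synonym = rng.choice(SYNONYMS[words[rng.choice(idxs)].lower()])
        words.insert(rng.randrange(len(words) + 1), synonym)
    return words


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(synonym_replacement(sentence, n=2, rng=rng)))
    print(" ".join(random_deletion(sentence, p=0.2, rng=rng)))
    print(" ".join(random_swap(sentence, n=2, rng=rng)))
    print(" ".join(random_insertion(sentence, n=2, rng=rng)))
```

Each function returns a new list and leaves the input untouched, so several augmented variants can be generated from the same sentence.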

Various NLP tasks where data augmentation is applied:

Summarization
Question Answering
Sequence Tagging
Parsing
Grammatical Error Correction
Neural Machine Translation
Data to Text
Dialogue

Various Libraries available:

TextAugment
Augly
NLPAug
Parrot paraphrase
Pegasus paraphrase

Working code for each library can be found here:

Sample Output:

TextAugment

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.

Augly

Facebook recently open-sourced the AugLy package. The AugLy library is divided into four sub-libraries, one for each data modality (audio, images, videos and texts).

NLPAug

NLPAug is a library for textual augmentation in machine learning experiments. The goal is improving deep learning model performance by generating textual data.

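One of the techniques NLPAug supports is back translation. Back translation involves translating a text into another language and then translating that version back into the original language, without reference to the original text. The round trip usually returns a sentence with the same meaning but different wording, which can be used as a new training example.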
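
The round trip behind back translation, as it is used for augmentation, can be sketched as follows. This is only an illustration: `back_translate`, `augment`, and the toy word-level dictionaries are hypothetical names made up for this example, and the dictionary lookups stand in for the machine translation models a real pipeline would call.

```python
from typing import Callable, Dict, List

Translator = Callable[[str], str]


def back_translate(text: str, forward: Translator, backward: Translator) -> str:
    """Translate text into a pivot language and back into the source language."""
    return backward(forward(text))


def augment(texts: List[str], forward: Translator, backward: Translator) -> List[str]:
    """Return the originals plus any round-tripped variants that are new."""
    out = list(texts)
    for text in texts:
        variant = back_translate(text, forward, backward)
        if variant not in out:
            out.append(variant)
    return out


def toy_translator(table: Dict[str, str]) -> Translator:
    """Build a word-by-word translator from a lookup table (toy stand-in for an MT model)."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Hypothetical toy dictionaries: two English words map to the same German
# word, so the round trip comes back with different wording.
EN_TO_DE = {"the": "der", "movie": "film", "film": "film",
            "was": "war", "awesome": "toll", "great": "toll"}
DE_TO_EN = {"der": "the", "film": "film", "war": "was", "toll": "great"}

en_to_de = toy_translator(EN_TO_DE)
de_to_en = toy_translator(DE_TO_EN)

if __name__ == "__main__":
    print(back_translate("the movie was awesome", en_to_de, de_to_en))
    # prints: the film was great
```

Passing the translators in as plain callables keeps the sketch independent of any particular translation library.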

Parrot paraphrase

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Pegasus paraphrase

PEGASUS is a standard Transformer encoder-decoder. It is pre-trained on large corpora of documents with gap-sentence generation (GSG): important sentences are masked out of the input document, and the model learns to generate them [j].

References:

[a] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, Rishemjit Kaur. 2021. Neural Machine Translation for Low-Resource Languages: A Survey.

[b] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP.

[c] EDA from scratch: https://jovian.ai/abdulmajee/eda-data-augmentation-techniques-for-text-nlp

[d] TextAugment https://github.com/dsfsi/textaugment

[e] Augly https://analyticsarora.com/how-to-use-augly-on-image-video-audio-and-text/

[f] nlpaug https://github.com/makcedward/nlpaug

[g] Parrot Paraphraser https://github.com/PrithivirajDamodaran/Parrot_Paraphraser

[h] Pegasus Paraphraser https://huggingface.co/tuner007/pegasus_paraphrase

[i] Improving short text classification through global augmentation methods.

[j] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization https://arxiv.org/abs/1912.08777
