
NLP Toolkit

Library containing state-of-the-art models for Natural Language Processing tasks
The purpose of this toolkit is to allow for easy training/inference of state-of-the-art models, for various NLP tasks.
*See the To do list below


Contents

Tasks:

  1. Classification
  2. Automatic Speech Recognition
  3. Text Summarization
  4. Machine Translation
  5. Natural Language Generation
  6. Punctuation Restoration
  7. Named Entity Recognition
  8. Part of Speech Tagging
  9. Unsupervised Style Transfer
  10. Text Clustering
  11. Grammatical Error Correction

Benchmark Results
References


Pre-requisites

torch==1.4.0 ; spacy==2.1.8 ; torchtext==0.4.0 ; seqeval==0.0.12 ; pytorch-nlp==0.4.1
For mixed precision training (--fp16=1), apex must be installed: apex==0.1
For Chinese support in Translation: jieba==0.39
For ASR: librosa==0.7.0 ; soundfile==0.10.2
For Unsupervised Style Transfer: fasttext==0.8.3 ; kenlm (for evaluation only)
For more details, see requirements.txt
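
For a quick setup, the pinned dependencies above can be installed directly, or via requirements.txt (both commands below are equivalent starting points):

pip install torch==1.4.0 spacy==2.1.8 torchtext==0.4.0 seqeval==0.0.12 pytorch-nlp==0.4.1
# or
pip install -r requirements.txt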

** Pre-trained PyTorch models (XLNet, BERT, GPT-2, CTRL, XLMRoBERTa, ALBERT) are courtesy of huggingface (https://github.com/huggingface/pytorch-transformers)
** GAT model adapted from https://github.com/Diego999/pyGAT
** Style-Transformer training codes adapted from https://github.com/fastnlp/style-transformer
** Semsim pre-trained models courtesy of https://github.com/icml-2020-nlp/semsim
** GECToR training & pre-trained models courtesy of https://github.com/grammarly/gector

Package Installation

git clone https://github.com/plkmo/NLP_Toolkit.git
cd NLP_Toolkit
pip install .
python -m spacy download en_core_web_lg

# to uninstall if required to re-install after updates,
# since this repo is still currently in active development
pip uninstall nlptoolkit 

Alternatively, you can just use it as a non-packaged repo after git clone.


1) Classification

The goal of classification is to segregate documents into appropriate classes based on their text content. Currently, the classification toolkit uses the following models:

  1. Text-based Graph Convolution Networks (GCN) (model_no: 0)
  2. Bidirectional Encoder Representations from Transformers (BERT) (model_no: 1)
  3. XLNet (model_no: 2)
  4. Graph Attention Network (GAT) (model_no: 3)
  5. ALBERT (model_no: 4)
  6. XLMRoBERTa (model_no: 5)
  7. Graph Isomorphism Network (GIN) (model_no: 6)

Format of datasets files

The training data (default: train.csv) should be formatted into two columns, 'text' and 'label', with each row corresponding to a document. 'text' contains the raw text and 'label' contains the corresponding label (integers 0, 1, 2, ... depending on the number of classes).

The infer data (default: infer.csv) should contain at least one column, 'text', holding the raw text, with each row corresponding to a document. An optional 'label' column can be added, with the --train_test_split argument set to 1, to use infer.csv as the test set for model verification.
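
For illustration, a minimal sketch of preparing these files with pandas (the example texts and labels below are made up):

import pandas as pd

# training file: 'text' and 'label' columns, labels are integer class ids
train_df = pd.DataFrame({'text': ['This is a good movie.', 'This is a bad movie.'],
                         'label': [1, 0]})
train_df.to_csv('./data/train.csv', index=False)

# infer file: only the 'text' column is required
infer_df = pd.DataFrame({'text': ['This movie was just okay.']})
infer_df.to_csv('./data/infer.csv', index=False)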

  • IMDB datasets for sentiment classification available here.

Running the model

Run classify.py with arguments below.

classify.py [-h] 
	[--train_data TRAIN_DATA (default: "./data/train.csv")] 
	[--infer_data INFER_DATA (default: "./data/infer.csv")]            
	[--max_vocab_len MAX_VOCAB_LEN (default: 7000)]  
	[--hidden_size_1 HIDDEN_SIZE_1 (default: 330)]
	[--hidden_size_2 HIDDEN_SIZE_2 (default: 130)]  
	[--batched BATCHED (default: 0)]  
	[--hidden HIDDEN (default: 8)]
	[--nb_heads NB_HEADS (default: 8)]
	[--tokens_length TOKENS_LENGTH (default: 200)] 
	[--num_classes NUM_CLASSES (default: 2)]
	[--train_test_split TRAIN_TEST_SPLIT (default: 0)]
	[--test_ratio TEST_RATIO (default: 0.1)] 
	[--batch_size BATCH_SIZE (default: 32)]      
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 1)]
	[--max_norm MAX_NORM (default: 1)] 
	[--num_epochs NUM_EPOCHS (default: 1700)] 
	[--lr LR (default: 0.0031)]
	[--use_cuda USE_CUDA]
	[--model_no MODEL_NO (default: 0 (0: GCN, 1: BERT, 2: XLNet, 3: GAT, 4: ALBERT, 5: XLMRoBERTa, 6: GIN))] 
	[--train TRAIN (default:1)]  
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

The script outputs a results.csv file containing the indexes of the documents in infer.csv and their corresponding predicted labels.
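
For example, to fine-tune BERT (model_no: 1) on the default data paths, an invocation along the following lines should work (the argument values here are illustrative):

python classify.py --train_data ./data/train.csv --infer_data ./data/infer.csv --num_classes 2 --batch_size 32 --model_no 1 --train 1 --infer 1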

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.classification.models.BERT.trainer import train_and_fit
from nlptoolkit.classification.models.infer import infer_from_trained

config = Config(task='classification') # loads default argument parameters as above
config.train_data = './data/train.csv' # sets training data path
config.infer_data = './data/infer.csv' # sets infer data path
config.num_classes = 2 # sets number of prediction classes
config.batch_size = 32
config.model_no = 1 # sets BERT model
config.lr = 0.001 # change learning rate
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_from_input()

Sample output:

Type input sentence (Type 'exit' or 'quit' to quit):
This is a good movie.
Predicted class: 1

Type input sentence (Type 'exit' or 'quit' to quit):
This is a bad movie.
Predicted class: 0

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. BERT for IMDB sentiment analysis (includes preprocessed data, vocab, and saved results files)
  2. XLNet for IMDB sentiment analysis (includes preprocessed data, vocab, and saved results files)

2) Automatic Speech Recognition

Automatic Speech Recognition (ASR) aims to convert audio signals into text. This library contains the following models for ASR:

  1. Speech-Transformer (model_no: 0)
  2. Listen-Attend-Spell (LAS) (model_no: 1)

Format of dataset files

The folder containing the dataset should have the following structure: folder/speaker/chapter. Within each chapter subdirectory, the audio files (in .flac format) are named speaker-chapter-file_id (file_id in running order). The transcript .txt file for the files within the chapter should be located in the same chapter subdirectory. In the transcript file, each row should consist of the speaker-chapter-file_id, followed by a space and the transcript.
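
For illustration, a LibriSpeech-style layout (the file names below are hypothetical) would look like:

train-clean-5/19/198/19-198-0000.flac
train-clean-5/19/198/19-198-0001.flac
train-clean-5/19/198/19-198.trans.txt

where each line of 19-198.trans.txt would read, e.g., "19-198-0000" followed by a space and that utterance's transcript.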

Running the model

Run speech.py with arguments below

speech.py [-h] 
	[--folder FOLDER (default: "train-clean-5")] 
	[--level LEVEL (default: "word")]   
	[--use_lg_mels USE_LG_MELS (default: 1)]
	[--use_conv USE_CONV (default: 1)]
	[--n_mels N_MELS (default: 80)]
	[--n_mfcc N_MFCC (default: 13)]
	[--n_fft N_FFT (default: 25)]
	[--hop_length HOP_LENGTH (default: 10)]
	[--max_frame_len MAX_FRAME_LEN (default: 1000)]
	[--d_model D_MODEL (default: 64)]
	[--ff_dim FF_DIM (default: 128)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 4)]
	[--batch_size BATCH_SIZE (default: 30)]
	[--fp16 FP16 (default:1)]  
	[--num_epochs NUM_EPOCHS (default: 8000)] 
	[--lr LR (default: 0.003)]    
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 4)]
	[--max_norm MAX_NORM (default: 1)] 
	[--T_max T_MAX (default: 5000)]  
	[--model_no MODEL_NO (default: 0 (0: Transformer, 1: LAS))]  
	[--train TRAIN (default:1)]  
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

3) Text Summarization

Text summarization aims to distil a paragraph chunk into a few sentences that capture the essential information. This library contains the following models for text summarization:

  1. Convolutional Transformer (model_no: 0)
  2. Seq2Seq (LAS architecture) (model_no: 1)
  3. Semsim (model_no: 2) (for infer only)

Format of dataset files

One .csv file for each text/summary pair. Within each text/summary .csv file, the main text is followed by its summary, with each summary point annotated by an @highlight tag, e.g. example.csv:

Main text here
@highlight

Summary 1

@highlight

Summary 2
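
For illustration, a minimal sketch (the helper name and file path are made up) of splitting such a file into its main text and summary points:

def read_story(path):
    with open(path, encoding='utf-8') as f:
        raw = f.read()
    # everything before the first @highlight is the main text;
    # each subsequent @highlight block is one summary point
    parts = [p.strip() for p in raw.split('@highlight')]
    return parts[0], parts[1:]

text, highlights = read_story('./data/cnn_stories/cnn/stories/example.csv')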

Running the model

Run summarize.py with arguments below

summarize.py [-h] 
	[--data_path DATA_PATH] 
	[--level LEVEL (default: "bpe")]   
	[--bpe_word_ratio BPE_WORD_RATIO (default: 0.7)]
	[--bpe_vocab_size BPE_VOCAB_SIZE (default: 7000)]
	[--max_features_length MAX_FEATURES_LENGTH (default: 200)]
	[--d_model D_MODEL (default: 128)]
	[--ff_dim FF_DIM (default: 128)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 4)]
	[--LAS_embed_dim LAS_EMBED_DIM (default: 128)]
	[--LAS_hidden_size LAS_HIDDEN_SIZE (default: 128)]
	[--batch_size BATCH_SIZE (default: 32)]  
	[--fp16 FP16 (default: 1)]  
	[--num_epochs NUM_EPOCHS (default: 8000)] 
	[--lr LR (default: 0.003)]    
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 4)]
	[--max_norm MAX_NORM (default: 1)] 
	[--T_max T_MAX (default: 5000)]  
	[--model_no MODEL_NO (default: 0 (0: Transformer, 1: LAS, 2: Semsim (infer only)))]  
	[--train TRAIN (default:1)]  
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.summarization.trainer import train_and_fit
from nlptoolkit.summarization.infer import infer_from_trained

config = Config(task='summarization') # loads default argument parameters as above
config.data_path = "./data/cnn_stories/cnn/stories/"
config.batch_size = 32
config.lr = 0.0001 # change learning rate
config.model_no = 0 # set model as Transformer
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_sentence(sent)

Pre-trained models

Download the contents of the downloaded folder into the ./data/ folder.

  1. Semsim

4) Machine Translation

The goal of machine translation is to translate text from one language to another. This library contains the following models to accomplish this:

  1. Transformer (model_no: 0)

Currently supports translation between: English (en), French (fr), Chinese (zh)

Format of dataset files

A source .txt file with each line containing the text/sentence to be translated, and a target .txt file with each line containing the corresponding translated text/sentence.
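
For illustration, a quick sanity check (using the same file paths as the package example below) that the source and target files are line-aligned:

with open('./data/translation/eng_zh/news-commentary-v13.zh-en.en', encoding='utf-8') as f_src, \
     open('./data/translation/eng_zh/news-commentary-v13.zh-en.zh', encoding='utf-8') as f_trg:
    src_lines = f_src.read().splitlines()
    trg_lines = f_trg.read().splitlines()
assert len(src_lines) == len(trg_lines), 'source and target must have the same number of lines'
print(src_lines[0], '->', trg_lines[0])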

Running the model

Run translate.py with arguments below

translate.py [-h]  
	[--src_path SRC_PATH]
	[--trg_path TRG_PATH] 
	[--src_lang SRC_LANG (en, fr, zh)] 
	[--trg_lang TRG_LANG (en, fr, zh)] 
	[--batch_size BATCH_SIZE (default: 50)]
	[--d_model D_MODEL (default: 512)]
	[--ff_dim FF_DIM (default: 2048)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 8)]
	[--max_encoder_len MAX_ENCODER_LEN (default: 80)]
	[--max_decoder_len MAX_DECODER_LEN (default: 80)]	
	[--fp16 FP_16 (default: 1)]
	[--num_epochs NUM_EPOCHS (default: 500)] 
	[--lr LR (default: 0.0001)]    
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 1)]
	[--max_norm MAX_NORM (default: 1)] 
	[--T_max T_MAX (default: 5000)] 
	[--model_no MODEL_NO (default: 0 (0: Transformer))]  
	[--train TRAIN (default:1)]  
	[--evaluate EVALUATE (default:0)]
	[--infer INFER (default: 0)]
	

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.translation.trainer import train_and_fit
from nlptoolkit.translation.infer import infer_from_trained

config = Config(task='translation') # loads default argument parameters as above
config.src_path = './data/translation/eng_zh/news-commentary-v13.zh-en.en' # sets source language data path
config.trg_path = './data/translation/eng_zh/news-commentary-v13.zh-en.zh' # sets target language data path
config.src_lang = 'en' # sets source language
config.trg_lang = 'zh' # sets target language
config.batch_size = 16
config.lr = 0.0001 # change learning rate
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_from_input()

Sample output:

Type input sentence (Type 'exit' or 'quit' to quit):
The reason is simple.
Stepwise-translated:
, 这 也 是 一件 容易 的 。

Final step translated words: 
同样 至少 就是 是 最 容易 的 事情

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. Transformer for English-Chinese translation (includes preprocessed data, vocab, and saved results files)

5) Natural Language Generation

Natural Language Generation (NLG) aims to generate text based on past context. For instance, a chatbot can generate text replies based on the context of chat history. We currently have the following models for NLG:

  1. Generative Pre-trained Transformer 2 (GPT 2) (model_no: 0)
  2. Conditional Transformer Language Model (CTRL) (model_no: 1)
  3. DialoGPT (model_no: 2)

Format of dataset files

  1. Generate free text from GPT 2 pre-trained models
  2. Generate conditional free text from CTRL pre-trained model

Running the model

Run generate.py

generate.py [-h]  
	[--model_no MODEL_NO (0: GPT 2 ; 1: CTRL ; 2: DialoGPT)]

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.generation.infer import infer_from_trained

config = Config(task='generation') # loads default argument parameters as above
config.model_no = 1 # sets model to CTRL
inferer = infer_from_trained(config, tokens_len=70, top_k_beam=3)
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_from_input()

Sample output:

Type your input sentence: 
Questions Q: Who is Lee Kuan Yew? A:
10/24/2019 05:17:58 PM [INFO]: Generating...
Singaporean politician and Prime Minister, and a founding father 
 
 Q: What was the last film to win an Oscar for Best Picture and was directed by:* * * 
 Q: What was a film released in 1956? * 
 A: A Man Named Charlie * 
 A: The Man with a Movie Face 
 Q: Which actor played the role of: The Joker from

Type your input sentence: 
Questions Q: When is Lee Kuan Yew born? A:
10/24/2019 05:18:35 PM [INFO]: Generating...
August 16, 1950 
 A: August 22 
 Q:- How old is Lee Hsiao-ping? 
 A:- 21 years 
 Q: How many children are born each year at the hospital where the hospital is located? How many children have died in the hospital’s history! What is the average age at which children die? A: about 1 per 1000 live births*

6) Punctuation Restoration

Given unpunctuated (and perhaps un-capitalized) text, punctuation restoration aims to restore the punctuation of the text for easier readability. Applications include punctuating raw transcripts from audio speech data etc. Currently supports the following models:

  1. Transformer (PuncTransformer) (model_no: 0)
  2. Bi-LSTM with attention (PuncLSTM) (model_no: 1)

Format of dataset files

Currently only supports the TED talk transcripts format, whereby punctuated text is enclosed in <transcript> tags, e.g. <transcript> "punctuated text" </transcript>. The "punctuated text" is preprocessed and then used for training.
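
For illustration, a minimal sketch (assuming the tags appear literally in the file, as described above) of extracting the punctuated text:

import re

with open('./data/train.tags.en-fr.en', encoding='utf-8') as f:
    raw = f.read()
# pull out everything enclosed in <transcript> ... </transcript> tags
transcripts = re.findall(r'<transcript>(.*?)</transcript>', raw, flags=re.DOTALL)
print(len(transcripts), 'transcripts found')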

  • TED talks dataset can be downloaded here.

Running the model

Run punctuate.py

punctuate.py [-h] 
	[--data_path DATA_PATH] 
	[--level LEVEL (default: "bpe")]   
	[--bpe_word_ratio BPE_WORD_RATIO (default: 0.7)]
	[--bpe_vocab_size BPE_VOCAB_SIZE (default: 7000)]
	[--batch_size BATCH_SIZE (default: 32)]
	[--d_model D_MODEL (default: 512)]
	[--ff_dim FF_DIM (default: 2048)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 8)]
	[--max_encoder_len MAX_ENCODER_LEN (default: 80)]
	[--max_decoder_len MAX_DECODER_LEN (default: 80)]	
	[--LAS_embed_dim LAS_EMBED_DIM (default: 512)]
	[--LAS_hidden_size LAS_HIDDEN_SIZE (default: 512)]
	[--num_epochs NUM_EPOCHS (default: 500)] 
	[--lr LR (default: 0.0005)]    
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 2)]
	[--max_norm MAX_NORM (default: 1.0)] 
	[--T_max T_MAX (default: 5000)] 
	[--model_no MODEL_NO (default: 0 (0: PuncTransformer, 1: PuncLSTM))]  
	[--train TRAIN (default:1)]  
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

Or, if used as a package,

from nlptoolkit.utils.config import Config
from nlptoolkit.punctuation_restoration.trainer import train_and_fit
from nlptoolkit.punctuation_restoration.infer import infer_from_trained

config = Config(task='punctuation_restoration') # loads default argument parameters as above
config.data_path = "./data/train.tags.en-fr.en" # sets training data path
config.batch_size = 32
config.lr = 5e-5 # change learning rate
config.model_no = 1 # sets model to PuncLSTM
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt") # infer from input file
inferer.infer_from_input()

Sample output:

Input sentence to punctuate:
hi how are you
Predicted Label:  Hi. How are you?

Input sentence to punctuate:
this is good thank you very much
Predicted Label:  This is good. Thank you very much.

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. PuncLSTM (includes preprocessed data, vocab, and saved results files)

7) Named Entity Recognition

In Named Entity Recognition (NER), the task is to recognise entities such as persons, organisations and locations. Current models for this task:

  1. BERT (model_no: 0)

Format of dataset files

Dataset format for both train & test follows the Conll2003 dataset format. Specifically, each row in the .txt file has the following format:

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Here, the first column is the word, the second column is the part-of-speech tag (not used), the third column is the syntactic chunk tag (not used), and the fourth column is the NER tag. Only the first and fourth columns are used for this task; the rest are ignored, but placeholders are still required for the second and third columns.
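
For illustration, a minimal sketch (the helper name is made up) of reading words and NER tags from a file in this format:

def read_conll(path):
    sentences, words, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends a sentence
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            cols = line.split()
            words.append(cols[0])             # first column: word
            tags.append(cols[3])              # fourth column: NER tag
    if words:
        sentences.append((words, tags))
    return sentences

sents = read_conll('./data/ner/conll2003/eng.train.txt')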

  • Conll2003 dataset can be downloaded here.

Running the model

Run ner.py

ner.py [-h] 
	[--train_path TRAIN_PATH] 
	[--test_path TEST_PATH]
	[--num_classes NUM_CLASSES]
	[--batch_size BATCH_SIZE]
	[--tokens_length TOKENS_LENGTH]
	[--max_steps MAX_STEPS]
	[--warmup_steps WARMUP_STEPS]
	[--weight_decay WEIGHT_DECAY]
	[--adam_epsilon ADAM_EPSILON]
	[--gradient_acc_steps GRADIENT_ACC_STEPS]
	[--num_epochs NUM_EPOCHS]
	[--lr LR]
	[--model_no MODEL_NO]
	[--model_type MODEL_TYPE]
	[--train TRAIN (default:1)]  
	[--evaluate EVALUATE (default:0)]

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.ner.trainer import train_and_fit
from nlptoolkit.ner.infer import infer_from_trained

config = Config(task='ner') # loads default argument parameters as above
config.train_path = './data/ner/conll2003/eng.train.txt' # sets training data path
config.test_path = './data/ner/conll2003/eng.testa.txt' # sets test data path
config.num_classes = 9 # sets number of NER classes
config.batch_size = 8
config.lr = 5e-5 # change learning rate
config.model_no = 0 # sets model to BERT
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_from_input()

Sample output:

Type input sentence: ('quit' or 'exit' to terminate)
John took a flight from Singapore to China, but stopped by Japan along the way.
Words --- Tags:
john (I-PER) 
took (O) 
a (O) 
flight (O) 
from (O) 
singapore (I-LOC) 
to (O) 
china, (I-LOC) 
but (O) 
stopped (O) 
by (O) 
japan (I-LOC) 
along (O) 
the (O) 
way. (O) 

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. BERT (includes preprocessed data, vocab, and saved results files)

8) POS Tagging

In part-of-speech (POS) tagging, each word in a sentence is assigned a tag that indicates its grammatical role. Current models for this task:

  1. BERT (model_no: 0)

Format of dataset files

Dataset format for both train & test follows the Conll2003 dataset format. Specifically, each row in the .txt file has the following format:

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Here, the first column is the word, the second column is the part-of-speech tag, the third column is the syntactic chunk tag (not used), and the fourth column is the NER tag (not used). Only the first and second columns are used for this task; the rest are ignored, but placeholders are still required for the third and fourth columns.

  • Conll2003 dataset can be downloaded here.

Running the model

Run pos.py

pos.py [-h] 
	[--train_path TRAIN_PATH] 
	[--test_path TEST_PATH]
	[--num_classes NUM_CLASSES]
	[--batch_size BATCH_SIZE]
	[--tokens_length TOKENS_LENGTH]
	[--max_steps MAX_STEPS]
	[--warmup_steps WARMUP_STEPS]
	[--weight_decay WEIGHT_DECAY]
	[--adam_epsilon ADAM_EPSILON]
	[--gradient_acc_steps GRADIENT_ACC_STEPS]
	[--num_epochs NUM_EPOCHS]
	[--lr LR]
	[--model_no MODEL_NO]
	[--model_type MODEL_TYPE]
	[--train TRAIN (default:1)]  
	[--infer INFER (default:1)]

Or if used as a package:

from nlptoolkit.utils.config import Config
from nlptoolkit.pos.trainer import train_and_fit
from nlptoolkit.pos.infer import infer_from_trained

config = Config(task='pos') # loads default argument parameters as above
config.train_path = './data/pos/conll2003/eng.train.txt' # sets training data path
config.test_path = './data/pos/conll2003/eng.testa.txt' # sets test data path
config.num_classes = 45 # sets number of POS tag classes
config.batch_size = 16
config.lr = 5e-5 # change learning rate
config.model_no = 0 # sets model to BERT
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt")
inferer.infer_from_input()

Sample output:

Type input sentence: ('quit' or 'exit' to terminate)
I like to eat chicken.
Words --- Tags:
i (PRP)
like (VB)
to (TO)
eat (VB)
chicken. (NN)

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. BERT (includes preprocessed data, vocab, and saved results files)

9) Unsupervised Style Transfer

In unsupervised style transfer, the task is to convert the style of a sentence into another style while preserving its content. The datasets used are non-parallel, hence the task is unsupervised. Current models for this task:

  1. Style Transformer

Format of dataset files

The training dataset for one style (e.g. negative) should be stored in train.neg, while that for the other style (e.g. positive) should be stored in train.pos. Each file should contain one sentence per line, in the corresponding style, tokenized by spaces.
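
For illustration, the two files might contain lines like (these examples are made up):

train.pos:
the food here is really good .
the staff were friendly and helpful .

train.neg:
the food here is really unclean .
the staff were rude and unhelpful .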

Running the model

Run style_transfer.py

style_transfer.py [-h] 
	[--data_path DATA_PATH] 
	[--num_classes NUM_CLASSES]
	[--max_features_length MAX_FEATURES_LENGTH]
	[--d_model D_MODEL]
	[--num NUM]
	[--n_heads N_HEADS]
	[--batch_size BATCH_SIZE]
	[--lr_F LR_F]
	[--lr_D LR_D]  
	[--gradient_acc_steps GRADIENT_ACC_STEPS]  
	[--num_iters NUM_ITERS]
	[--save_iters SAVE_ITERS]
	[--train TRAIN (default:1)]  
	[--infer INFER (default:1)]
	[--train_from_checkpoint TRAIN_FROM_CHECKPOINT]  
	[--checkpoint_Fpath CHECKPOINT_FPATH]
	[--checkpoint_Dpath CHECKPOINT_DPATH]
	[--checkpoint_config CHECKPOINT_CONFIG]

For inference after training (see style_transfer.py):

inferer.infer_sentence(sent='The food here is really good.', target_style=0)

Sample output:

the food here is really unclean .

Pre-trained models & example dataset

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. Style Transformer (includes dataset & pretrained model)

10) Text Clustering

Current models:

  1. Deep Graph Infomax

Format of dataset files

train.csv, with one column labelled 'text', whose rows contain the text of the documents to be clustered.

Running the model

Run cluster.py

cluster.py [-h] 
	[--train_data]   
	[--window]  
	[--max_vocab_len]  
	[--hidden_size_1]  
	[--batch_size BATCH_SIZE]  
	[--gradient_acc_steps GRADIENT_ACC_STEPS]  
	[--max_norm MAX_NORM]
	[--num_epochs NUM_EPOCHS]  
	[--lr LR]  
	[--model_no MODEL_NO]  
	[--train TRAIN (default:1)]  
	[--infer INFER (default:1)]

Analyze clustering results

from nlptoolkit.clustering.models.DGI.infer import infer_from_trained

inferer = infer_from_trained()
inferer.infer_embeddings() # infer node embeddings from trained model
pca, pca_embeddings = inferer.PCA_analyze(n_components=2) # plot PCA
tsne_embeddings = inferer.plot_TSNE(plot=True) # plot TSNE

# Do Agglomerative clustering on TSNE embeddings
result = inferer.cluster_tsne_embeddings(tsne_embeddings,\
                                         n_start=4, n_stop=30, method='ac', plot=True)
node_clusters = inferer.get_clustered_nodes(result['labels']) # get clustered nodes

11) Grammatical Error Correction

Current models:

  1. GECToR

Running the model

For training & inference, see gec.py for more details on arguments.

gec.py [-h]

Example of inference with a trained model:

inferer.infer_sentence('He has dog')

Sample output:

He has a dog

Pre-trained models

Download and unzip the contents of the downloaded folder into the ./data/ folder.

  1. GECToR (includes pre-trained model)

Benchmark Results

1) Classification (IMDB dataset : 25000 train, 25000 test data points)

Fine-tuned XLNet English Model (12-layer, 768-hidden, 12-heads, 110M parameters)

Fine-tuned BERT English Model (12-layer, 768-hidden, 12-heads, 110M parameters)

4) Machine Translation (English-Chinese: 206K sentence pairs)

Transformer (12-layer, 768-hidden, 12-heads, 110M parameters)

6) Punctuation Restoration (TED dataset)

Punc-LSTM (Embedding dim=512, LSTM hidden size=512)

7) Named Entity Recognition (Conll2003 dataset)

Fine-tuned BERT English Model (uncased, 12-layer, 768-hidden, 12-heads, 110M parameters)

8) POS Tagging (Conll2003 dataset)

Fine-tuned BERT English Model (uncased, 12-layer, 768-hidden, 12-heads, 110M parameters)


References

  1. Attention Is All You Need, Vaswani et al, https://arxiv.org/abs/1706.03762
  2. Graph Convolutional Networks for Text Classification, Liang Yao et al, https://arxiv.org/abs/1809.05679
  3. Speech-Transformer: A No-Recurrence Sequence-To-Sequence Model For Speech Recognition, Linhao Dong et al, https://ieeexplore.ieee.org/document/8462506
  4. Listen, Attend and Spell, William Chan et al, https://arxiv.org/abs/1508.01211
  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al, https://arxiv.org/abs/1810.04805
  6. XLNet: Generalized Autoregressive Pretraining for Language Understanding, Yang et al, https://arxiv.org/abs/1906.08237
  7. Investigating LSTM for punctuation prediction, Xu et al, https://ieeexplore.ieee.org/document/7918492
  8. HuggingFace's Transformers: State-of-the-art Natural Language Processing, Thomas Wolf et al, https://arxiv.org/abs/1910.03771
  9. Graph Attention Networks, Veličković et al, https://arxiv.org/pdf/1710.10903.pdf
  10. Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation, Dai et al, https://arxiv.org/abs/1905.05621
  11. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Zhenzhong Lan et al, https://arxiv.org/abs/1909.11942
  12. Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau et al, https://arxiv.org/pdf/1911.02116.pdf
  13. How Powerful Are Graph Neural Networks?, Keyulu Xu et al, https://arxiv.org/pdf/1810.00826.pdf
  14. Deep Graph Infomax, Veličković et al, https://arxiv.org/abs/1809.10341
  15. Learning by Semantic Similarity Makes Abstractive Summarization Better, Yoon et al, https://arxiv.org/pdf/2002.07767.pdf
  16. GECToR -- Grammatical Error Correction: Tag, Not Rewrite, Kostiantyn Omelianchuk et al, https://arxiv.org/abs/2005.12592

To do list

In order of priority:

  • Include package usage info for classification, ASR, summarization, translation, generation, punctuation_restoration, NER, POS
  • Include benchmark results for classification, ASR, summarization, translation, generation, punctuation_restoration, NER, POS
  • Include pre-trained models + demo based on benchmark datasets for classification, ASR, summarization, translation, generation, punctuation_restoration, NER, POS
  • Include more models for punctuation restoration, translation, NER, POS
  • Clean up style transfer
  • Document clustering

nlp_toolkit's People

Contributors

dependabot[bot], plkmo


nlp_toolkit's Issues

Problem in installing the package

I installed nlptoolkit package through pip. But the following line is repeatedly giving me an error.

 from nlptoolkit.utils.config import Config

I tried upgrading pandas and tqdm as these are shown as some possible fixes here, but I still was not able to resolve the error. Please provide some guidelines to get through this error.

ImportError                               Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py in pandas(tclass, *targs, **tkwargs)

ImportError: cannot import name 'DataFrameGroupBy'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
5 frames
<ipython-input-14-3c9cb5b4622f> in <module>()
      1 # try:
----> 2 from nlptoolkit.utils.config import Config
      3 # except Exception as e:
      4   # print(e)
      5 try:

/usr/local/lib/python3.6/dist-packages/nlptoolkit/__init__.py in <module>()
      1 from . import ASR
----> 2 from . import classification
      3 from . import clustering
      4 from . import generation
      5 from . import ner

/usr/local/lib/python3.6/dist-packages/nlptoolkit/classification/__init__.py in <module>()
----> 1 from . import models

/usr/local/lib/python3.6/dist-packages/nlptoolkit/classification/models/__init__.py in <module>()
      6 from . import XLMRoBERTa
      7 from . import GIN
----> 8 from . import infer

/usr/local/lib/python3.6/dist-packages/nlptoolkit/classification/models/infer.py in <module>()
     12 import logging
     13 
---> 14 tqdm.pandas(desc="prog-bar")
     15 logging.basicConfig(format='%(asctime)s [%(levelname)s]: %(message)s', \
     16                     datefmt='%m/%d/%Y %I:%M:%S %p', level=logging.INFO)

/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py in pandas(tclass, *targs, **tkwargs)

ImportError: cannot import name 'PanelGroupBy'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below._



Got some questions about text GCN.

Hi,

I have some questions regarding text GCN. Do you mind telling me your email address?

My question is: if I have multiple graphs (isolated from each other), is it possible to train with your model?

Cheers,
Derek

Problem while using the punctuation example

Hi,
while using the example for punctuation I got an error message:

Traceback (most recent call last):
  File "punctuation.py", line 10, in <module>
    train_and_fit(config) # starts training with configured parameters
  File "/yyy/NLP_Toolkit/LSTM_env/lib/python3.6/site-packages/nlptoolkit/punctuation_restoration/trainer.py", line 28, in train_and_fit
    df, train_loader, train_length, max_features_length, max_output_len = load_dataloaders(args)
  File "/yyy/NLP_Toolkit/LSTM_env/lib/python3.6/site-packages/nlptoolkit/punctuation_restoration/preprocessing_funcs.py", line 315, in load_dataloaders
    df = create_TED_datasets(args)
  File "/yyy/NLP_Toolkit/LSTM_env/lib/python3.6/site-packages/nlptoolkit/punctuation_restoration/preprocessing_funcs.py", line 306, in create_TED_datasets
    df)
  File "/yyy/NLP_Toolkit/LSTM_env/lib/python3.6/site-packages/nlptoolkit/punctuation_restoration/preprocessing_funcs.py", line 34, in save_as_pickle
    with open(completeName, 'wb') as output:
FileNotFoundError: [Errno 2] No such file or directory: './data/eng.pkl'

The slightly changed sourcecode:

from nlptoolkit.utils.config import Config
from nlptoolkit.punctuation_restoration.trainer import train_and_fit
from nlptoolkit.punctuation_restoration.infer import infer_from_trained

config = Config(task='punctuation_restoration') # loads default argument parameters as above
config.data_path = "train.tags.en-fr.en" # sets training data path
config.batch_size = 32
config.lr = 5e-5 # change learning rate
config.model_no = 1 # sets model to PuncLSTM
train_and_fit(config) # starts training with configured parameters
inferer = infer_from_trained(config) # initiate infer object, which loads the model for inference, after training model
inferer.infer_from_input() # infer from user console input
# inferer.infer_from_file(in_file="./data/input.txt", out_file="./data/output.txt") # infer from input file

issues with punctuate.py

issue with calinski_harabasz_score, cannot be imported
issue with google.cloud, cannot be imported
error message OSError: [E050] Can't find model 'en_core_web_lg'

In addition, add fasttext and kenlm to requirements.

Punctuation restoration for different language

Dear Plkmo you have done some nice piece of work. I was looking at your punctuation restoration task and trained the Lstm model. Now I am trying to train for a different language. Could you give me some instructions on how to do that?

PuncTransformer model not uploaded

Hi.
Great work on this repo.
Can you upload the trained PuncTransformer model for punctuation restoration shown as model_no:0 in the config?

Punctuation Restoration. RuntimeError: shape '[32, 22, 2048]'

Hi. Thanks for the implementation. I tried to train the punctuation restoration model with the default set of parameters but it failed with an error.

Here is the script:

from nlptoolkit.utils.config import Config
from nlptoolkit.punctuation_restoration.trainer import train_and_fit
from nlptoolkit.punctuation_restoration.infer import infer_from_trained

config = Config(task='punctuation_restoration') # loads default argument parameters as above
config.data_path = "./data/train.tags.en-fr.en" # sets training data path
config.batch_size = 32
config.lr = 5e-5 # change learning rate
config.model_no = 1 # sets model to PuncLSTM
train_and_fit(config) # starts training with configured parameters

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vladimir/repo1/punctuation_restoration/NLP_Toolkit/nlptoolkit/punctuation_restoration/trainer.py", line 94, in train_and_fit
    outputs, outputs2 = net(src_input, trg_input, trg2_input)
  File "/home/vladimir/miniconda3/envs/punc_restore/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vladimir/repo1/punctuation_restoration/NLP_Toolkit/nlptoolkit/punctuation_restoration/models/LSTM_attention_model.py", line 314, in forward
    x = self.listener(x)
  File "/home/vladimir/miniconda3/envs/punc_restore/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vladimir/repo1/punctuation_restoration/NLP_Toolkit/nlptoolkit/punctuation_restoration/models/LSTM_attention_model.py", line 34, in forward
    output, _ = self.layer2(output); #print("Listener output2:", output.shape)
  File "/home/vladimir/miniconda3/envs/punc_restore/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vladimir/repo1/punctuation_restoration/NLP_Toolkit/nlptoolkit/punctuation_restoration/models/LSTM_attention_model.py", line 21, in forward
    x = x.contiguous().view(x.shape[0], int(x.shape[1]/2), 2*x.shape[2])
RuntimeError: shape '[32, 22, 2048]' is invalid for input of size 1474560

Dataset md5 hash

5d575762da3d9e628f0494b6ce8abeb9  ./data/train.tags.en-fr.en

Conda environment

name: punc_restore
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2020.1.1=0
  - certifi=2020.4.5.1=py36_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.3=he6710b0_1
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - ncurses=6.2=he6710b0_1
  - openssl=1.1.1g=h7b6447c_0
  - pip=20.0.2=py36_3
  - python=3.6.10=h7579374_2
  - readline=8.0=h7b6447c_0
  - setuptools=47.1.1=py36_0
  - sqlite=3.31.1=h62c20be_1
  - tk=8.6.8=hbc83047_0
  - wheel=0.34.2=py36_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - audioread==2.1.8
    - beautifulsoup4==4.9.1
    - blis==0.2.4
    - boto==2.49.0
    - boto3==1.9.238
    - botocore==1.12.253
    - cffi==1.14.0
    - chardet==3.0.4
    - click==7.1.2
    - cycler==0.10.0
    - cymem==2.0.3
    - dask==2.1.0
    - decorator==4.4.2
    - docutils==0.15.2
    - en-core-web-lg==2.1.0
    - fairseq==0.8.0
    - fastbpe==0.1.0
    - fasttext==0.9.1
    - filelock==3.0.12
    - h5py==2.10.0
    - idna==2.9
    - jieba==0.39
    - jmespath==0.10.0
    - joblib==0.15.1
    - kenlm==0.0.0
    - keras==2.3.1
    - keras-applications==1.0.8
    - keras-preprocessing==1.1.2
    - kiwisolver==1.2.0
    - librosa==0.7.0
    - llvmlite==0.32.1
    - matplotlib==3.1.0
    - murmurhash==1.0.2
    - networkx==2.3
    - nltk==3.5
    - numba==0.49.1
    - numpy==1.16.4
    - pandas==0.24.2
    - pillow==7.1.2
    - plac==0.9.6
    - portalocker==1.7.0
    - preshed==2.0.1
    - pybind11==2.5.0
    - pycparser==2.20
    - pyparsing==2.4.7
    - python-dateutil==2.8.1
    - pytorch-nlp==0.4.1
    - pytz==2020.1
    - pyyaml==5.3.1
    - regex==2019.8.19
    - requests==2.23.0
    - resampy==0.2.2
    - s3transfer==0.2.1
    - sacrebleu==1.4.1
    - sacremoses==0.0.34
    - scikit-learn==0.21.2
    - scipy==1.4.1
    - sentencepiece==0.1.83
    - seqeval==0.0.12
    - six==1.12.0
    - soundfile==0.10.2
    - soupsieve==2.0.1
    - spacy==2.1.8
    - srsly==1.0.2
    - thinc==7.0.8
    - tokenizers==0.7.0
    - toolz==0.10.0
    - torch==1.4.0
    - torchtext==0.4.0
    - torchvision==0.5.0
    - tqdm==4.32.1
    - typing==3.7.4.1
    - urllib3==1.25.9
    - wasabi==0.6.0

Note:
The example in the README has an unexpected ' in the line with config.data_path = "./data/train.tags.en-fr.en", which throws an error, but it is not related to this issue.

Problem installing NLP_Toolkit

Hi there,

I was trying to install your NLP-Toolkit, but ran into difficulties getting the following error:

[screenshot of the installation error]

Apparently, the torchvision version is not correct, but I can't install the version torchvision==0.4.0a0+6b959ee as it doesn't exist, even though this is the one given in the "requirements.txt" file. Do you have any suggestions on how to successfully install your NLP-Toolkit?

Thank you!

Pretrained punctuation model produces mangled output

Hi, when running the pretrained biLSTM for punctuation-restoration, I get the following output from inferer.infer_sentence("hi how are you"):

Predicted Label: .hI. How.

and inferer.infer_from_file("./data/input.txt", out_file="./data/output.txt") produces the following output.txt:

hi how are you,.hI. How.
i am fine thanks,"I, am fine thanks."

Does the pretrained model not work?

Tested on both OSX and Linux

problem in installing package

While installing NLP_Toolkit I am having the following issue:
Collecting sentencepiece==0.1.83
Using cached sentencepiece-0.1.83.tar.gz (497 kB)
ERROR: Command errored out with exit status 1:
command: /root/.local/share/virtualenvs/Trucasing-yvmNDbks/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ra5d94ra/sentencepiece/setup.py'"'"'; file='"'"'/tmp/pip-install-ra5d94ra/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-ra5d94ra/sentencepiece/pip-egg-info
cwd: /tmp/pip-install-ra5d94ra/sentencepiece/
Complete output (7 lines):
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-ra5d94ra/sentencepiece/setup.py", line 29, in
with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
File "/usr/lib/python3.8/codecs.py", line 905, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Running TextGCN processing crashes due to high RAM usage

Hi. I'm exploring the usage of the TextGCN implementation in the toolkit. I saw the sample using Bible Text but decided to explore using the toolkit instead as it is easier to use the package on Google Colab. I managed to clone the repo and import the library. Using the data from the Bible sample, the code runs until the train_and_fit(config) part. My setup is as follows:

config = Config(task='classification') # loads default argument parameters as above
config.train_data = 't_bbe.csv' # sets training data path
config.infer_data = 't_bbe.csv' # sets infer data path
config.num_classes = 66 # sets number of prediction classes
config.batch_size = 32
config.model_no = 0
config.lr = 0.001 # change learning rate
config.num_epochs = 10
config.max_vocab_len = 400

I set the train_data and infer_data to the same csv file first just to see if I could get the model to run, but it seems that I couldn't get through preprocessing. train_and_fit(config) runs up to building document-word edges but then spikes RAM usage to 12 GB and crashes Colab (running with GPU). Output before crashing is as follows:

06/16/2020 07:40:01 PM [INFO]: Loading data...
06/16/2020 07:40:01 PM [INFO]: Building datasets and graph from raw data... Note this will take quite a while...
06/16/2020 07:40:01 PM [INFO]: Preparing data...
06/16/2020 07:40:18 PM [INFO]: Calculating Tf-idf...
06/16/2020 07:40:19 PM [INFO]: Building graph (No. of document, word nodes: 62206, 400)...
06/16/2020 07:40:19 PM [INFO]: Adding document nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Adding word nodes to graph...
06/16/2020 07:40:19 PM [INFO]: Building document-word edges...
100%|██████████| 62206/62206 [04:25<00:00, 234.45it/s]

I initially didn't set a value for max_vocab_len but it couldn't get past 3% on building document-word edges. I limited it to 400 and was able to reach 100% but it crashes after that. I'm afraid setting it lower would essentially remove most of the data.

My actual data has around double the number of documents in the Bible sample so I was wondering if there's a way to minimize RAM consumption to get it to work without needing more than 12Gb of RAM.

--
Edit: I tried using the suggested dataset (IMDB Sentiment Classification) with max vocab of 200 but it crashes during building of adjacency matrix as well.
