prithivirajdamodaran / parrot_paraphraser

A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

License: Apache License 2.0

Python 100.00%

paraphrase-generation paraphrase paraphrased-data nlu slot-filling rasa-nlu intents

parrot_paraphraser's Introduction

Parrot

Parrot is a paraphrase-based utterance augmentation framework, purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Why Parrot?
Getting started
Scope
What makes a paraphraser a good augmentor for NLU? (Details)
- Sample NLU data (Rasa format)
Power of Augmentation - Metrics and Comparison
Current Features
Roadmap
Current Limitations/Known issues
References
Citation

Why Parrot?

Huggingface lists 16 paraphrase generation models (as of this writing), RapidAPI lists 7 freemium and commercial paraphrasers like QuillBot, Rasa has discussed an experimental paraphraser for augmenting text data, Sentence-transformers offers a paraphrase mining utility, and NLPAug offers word-level augmentation with PPDB (a multi-million paraphrase database). While these attempts at paraphrasing are great, there are still some gaps, and paraphrasing is NOT yet a mainstream option for text augmentation in building NLU models. Parrot is a humble attempt to fill some of these gaps.

What is a good paraphrase? Almost all conditioned text generation models are validated on 2 factors: (1) whether the generated text conveys the same meaning as the original context (Adequacy), and (2) whether the text is fluent, grammatically correct English (Fluency). For instance, Neural Machine Translation outputs are tested for Adequacy and Fluency. But a good paraphrase should be adequate and fluent while being as different as possible in its surface lexical form. With respect to this definition, the 3 key metrics that measure the quality of paraphrases are:

Adequacy (Is the meaning preserved adequately?)
Fluency (Is the paraphrase fluent English?)
Diversity (Lexical / Phrasal / Syntactical) (How much has the paraphrase changed the original sentence?)

Parrot offers knobs to control Adequacy, Fluency and Diversity as per your needs.
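
As a rough illustration of the Diversity dimension, the sketch below scores how far a paraphrase moves from the input using a normalised Levenshtein (edit) distance. This is not Parrot's own ranker: the function names `edit_distance` and `lexical_diversity` are hypothetical, and lower-casing both strings before comparing them is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lexical_diversity(original: str, paraphrase: str) -> float:
    """Return a score in [0, 1]: 0 means identical, 1 means fully different."""
    original = original.lower().strip()
    paraphrase = paraphrase.lower().strip()
    longest = max(len(original), len(paraphrase))
    if longest == 0:
        return 0.0
    return edit_distance(original, paraphrase) / longest
```

A score like this only captures surface change; it says nothing about whether the meaning survived, which is why it has to be read together with adequacy and fluency.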

What makes a paraphraser a good augmentor? To train an NLU model we need not just a lot of utterances, but utterances annotated with intents and slots/entities. A typical flow would be:

1. Given an input utterance + input annotations, a good augmentor spits out N output paraphrases while preserving the intent and slots.
2. The output paraphrases are then converted into annotated data using the input annotations that we got in step 1.
3. The annotated data created out of the output paraphrases then makes the training dataset for your NLU model.

But in general, being a generative model, a paraphraser does not guarantee that the slots/entities are preserved. So the ability to generate high-quality paraphrases in a constrained fashion, without trading off the intents and slots for lexical dissimilarity, is what makes a paraphraser a good augmentor. More on this in section 3 below.

Getting started

Install

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

Trying to install for AMD GPUs?

Demo notebook

Quickstart

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

''' 
uncomment to get reproducible paraphrase generations
def random_state(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

random_state(1234)
'''

# Init models (make sure you init ONLY once if you integrate this into your code)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")

phrases = ["Can you recommend some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False)
  for para_phrase in para_phrases:
    print(para_phrase)

----------------------------------------------------------------------
Input_phrase: Can you recommend some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?

Getting syntactic and phrasal diversity/variety in your paraphrases?

You can play with the do_diverse knob (check out the next section for more knobs). Consider this example:

do_diverse = False (default)

------------------------------------------------------------------------------
Input_phrase: How are the new Macbook Pros with M1 chips?
------------------------------------------------------------------------------
'how do you rate the new macbook pros? '
'how are the new macbook pros? '
'how is the new macbook pro doing with new chips? '
'how do you like the new macbook pro m1 chip? '
'what is the use of the new macbook pro m1 chips? '

do_diverse = True

------------------------------------------------------------------------------
Input_phrase: How are the new Macbook Pros with M1 chips?
------------------------------------------------------------------------------
'what do you think about the new macbook pro m1? '
'how is the new macbook pro m1? '
'how are the new macbook pros? '
'what do you think about the new macbook pro m1 chips? '
'how good is the new macbook pro m1 chips? '
'how is the new macbook pro m1 chip? '
'do you like the new macbook pro m1 chips? '
'how are the new macbook pros with m1 chips? '

Other Knobs

para_phrases = parrot.augment(input_phrase=phrase,
                              use_gpu=False,
                              diversity_ranker="levenshtein",
                              do_diverse=False,
                              max_return_phrases=10,
                              max_length=32,
                              adequacy_threshold=0.99,
                              fluency_threshold=0.90)

Scope

In the space of conversational engines, knowledge bots are the ones we ask questions like "When was the Berlin wall torn down?", transactional bots are the ones we give commands like "Turn on the music please", and voice assistants are the ones that can both answer questions and act on our commands. Parrot mainly focuses on augmenting texts typed into or spoken to conversational interfaces for building robust NLU models. (People usually neither type out nor yell out long paragraphs to conversational interfaces, hence the pre-trained model is trained on text samples with a maximum length of 32.)

While Parrot predominantly aims to be a text augmentor for building good NLU models, it can also be used as a pure-play paraphraser.

What makes a paraphraser a good augmentor for NLU? (Details)

To enable automatic training data generation, a paraphraser needs to keep the slots intact. The end-to-end process can then take input utterances, augment them and convert them into an NLU training format such as Goo et al. or Rasa format (as shown below). The data generation process needs to look for the same slots in the output paraphrases to derive their start and end positions (as shown in the JSON below).

Ideally, the above process needs a UI to collect input utterances along with their annotations (intents, slots and slot types), which can then be augmented and converted to training data.

Sample NLU data (Rasa format)

{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "i would like to find a flight from charlotte to las vegas that makes a stop in st. louis",
                "intent": "flight",
                "entities": [
                    {
                        "start": 35,
                        "end": 44,
                        "value": "charlotte",
                        "entity": "fromloc.city_name"
                    },
                    {
                        "start": 48,
                        "end": 57,
                        "value": "las vegas",
                        "entity": "toloc.city_name"
                    },
                    {
                        "start": 79,
                        "end": 88,
                        "value": "st. louis",
                        "entity": "stoploc.city_name"
                    }
                ]
            },
            ...
        ]
    }
}
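
The start and end offsets in the JSON above can be derived by searching each paraphrase for the known slot values. The following is a minimal sketch under stated assumptions, not Parrot's own conversion code: `to_rasa_example` is a hypothetical helper, matching is a case-insensitive substring search, and only the first occurrence of each value is used.

```python
import json


def to_rasa_example(text, intent, slots):
    """Build one Rasa-style example; `slots` maps entity name -> slot value."""
    lowered = text.lower()
    entities = []
    for entity, value in slots.items():
        start = lowered.find(value.lower())
        if start == -1:
            # The paraphrase dropped or altered this slot, so it cannot be annotated.
            return None
        entities.append({"start": start, "end": start + len(value),
                         "value": value, "entity": entity})
    return {"text": text, "intent": intent, "entities": entities}


example = to_rasa_example(
    "i would like to find a flight from charlotte to las vegas that makes a stop in st. louis",
    "flight",
    {"fromloc.city_name": "charlotte",
     "toloc.city_name": "las vegas",
     "stoploc.city_name": "st. louis"},
)
print(json.dumps({"rasa_nlu_data": {"common_examples": [example]}}, indent=4))
```

Returning `None` when a slot value is missing keeps unusable paraphrases out of the training file. A value that occurs twice in the text, or two slots sharing the same value, would need more careful handling than this sketch provides.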

Original: I would like a list of round trip flights between indianapolis and orlando florida for the 27th
Paraphrase useful for augmenting: what are the round trip flights between indianapolis and orlando for the 27th
Paraphrase not-so-useful for augmenting: what are the round trip flights between chicago and orlando for the 27th.
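
The distinction above can be checked mechanically. The sketch below keeps only paraphrases that still contain every slot value. The helper name `preserves_slots` is hypothetical, and the slot values chosen for this example are an assumption, since the source does not list the annotations for this utterance.

```python
def preserves_slots(paraphrase, slot_values):
    """True when every slot value still appears verbatim (case-insensitive)."""
    lowered = paraphrase.lower()
    return all(value.lower() in lowered for value in slot_values)


# Assumed slot values for the example above; the source does not list them.
slot_values = ["indianapolis", "orlando", "27th"]
candidates = [
    "what are the round trip flights between indianapolis and orlando for the 27th",
    "what are the round trip flights between chicago and orlando for the 27th.",
]

# Only the first candidate keeps all three slot values.
usable = [c for c in candidates if preserves_slots(c, slot_values)]
print(usable)
```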

Dataset for paraphrase model

The following datasets were analysed, but the paraphrase generation model prithivida/parrot_paraphraser_on_T5 has been fine-tuned on only some of them.

Power of Augmentation - Metrics and Comparison

Intent Classification task:

Experimental setup: From each dataset, an increasing number of random utterances per intent were taken to form the raw training data. The same data was then augmented with the Parrot paraphraser N times (where N = 10 or 15, depending on the dataset) to form the augmented training data. Models were then trained on both the raw data and the augmented data to compare performance. As this is a multiclass classification task, weighted F1 was used as the metric. The experiment was repeated 4 times for each number of utterances, and F1 was averaged to remove randomness in the trend. I have used 6 prominent NLU datasets from across domains. The charts below reveal that with a "very modest number" of utterances and paraphrase augmentation we can achieve good classification performance on day 1. "Very modest" varies between 4 and 6 utterances per intent in some datasets and between 5 and 7 in others.
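
For readers unfamiliar with the metric, here is a small dependency-free sketch of weighted F1 and of averaging it over repeated runs. It is not the original experiment code: the function names are hypothetical, and the labels in the usage note are toy values made up for illustration.

```python
from collections import Counter
from statistics import mean


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its support in y_true."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * count
    return total / len(y_true)


def averaged_f1(runs):
    """Average weighted F1 over repeated runs; `runs` is a list of (y_true, y_pred)."""
    return mean(weighted_f1(y_true, y_pred) for y_true, y_pred in runs)
```

For example, `weighted_f1(["a", "a", "b"], ["a", "b", "b"])` gives two thirds for both classes and therefore two thirds overall. Calling `averaged_f1` with one `(y_true, y_pred)` pair per repetition mirrors the 4 repeats described above.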

Semantic slot-filling task:

TBD

Current Features

TBD

Roadmap

TBD

Current Limitations/Known issues

The diversity scores are not normalised; each diversity ranker scores paraphrases differently.
Some command-style input phrases generate less adequate paraphrases.

Installation for AMD GPUs

If you're using an AMD GPU and want to use the AMD ROCm Platform, follow the steps below. Note that as of writing, ROCm is only available for Linux users! The steps are tested and verified on Ubuntu 22.04 using a Radeon RX 6650 XT GPU.

Install the dependencies:

git clone https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
cd Parrot_Paraphraser
pip install -r requirements-rocm.txt

After the installation is finished, you can verify your installation by running the following:

python3 -c 'import torch; print(torch.cuda.is_available())' # should print 'True'

If the output of the above command is False, you can try "fooling" the ROCm driver by setting the environment variable HSA_OVERRIDE_GFX_VERSION (as per this issue):

HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 -c 'import torch; print(torch.cuda.is_available())' # should print 'True'
# OR 
export HSA_OVERRIDE_GFX_VERSION=10.3.0
python3 -c 'import torch; print(torch.cuda.is_available())' # should print 'True'

References

TBD

Citation

To cite Parrot in your work, please use the following bibtex reference:

@misc{prithivida2021parrot,
  author       = {Prithiviraj Damodaran},
  title        = {Parrot: Paraphrase generation for NLU.},
  year         = 2021,
  version      = {v1.0}
}

parrot_paraphraser's Issues

Killed while running sample script

Hi, I am trying to run this project on a VPS with 2GB RAM. When I enter the line parrot = Parrot(...), the Python process is killed. Can you help me, please?

Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from parrot import Parrot
>>> import torch
>>> import warnings
>>> warnings.filterwarnings("ignore")
>>> def random_state(seed):
...   torch.manual_seed(seed)
...   if torch.cuda.is_available():
...     torch.cuda.manual_seed_all(seed)
...
>>> random_state(1234)
>>>
>>> parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")
Killed

how to compute adequacy score?

Hi, thank you for open sourcing this.

Could you please let me know where I can find the training code for the adequacy score?
Or is there any way I can re-implement it to calculate the adequacy score?

Thanks,

Failed to build tokenizers ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

error: can't find Rust compiler

If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.

To update pip, run:

  pip install --upgrade pip

and then retry package installation.

If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.

ERROR: Failed building wheel for tokenizers
Successfully built parrot
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Installation issue, torch >=1.6.0, Raspberry Pi OS 64 bit, Raspberry Pi 4b 8 gig

Would love to get this working. Getting two errors - first error here is the last thing the install reports before exiting back to the terminal prompt:

Collecting torch>=1.6.0 (from sentence-transformers->parrot==1.0)
Could not find a version that satisfies the requirement torch>=1.6.0 (from sentence-transformers->parrot==1.0) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
No matching distribution found for torch>=1.6.0 (from sentence-transformers->parrot==1.0)

Tried running the quick start script anyway, get this error:

Traceback (most recent call last):
  File "paraphrase.py", line 1, in <module>
    from parrot import Parrot
ModuleNotFoundError: No module named 'parrot'

Raspberry Pi 4B 8 Gig RAM version, running Raspberry Pi OS 64 Bit.
Apologies if I have not provided sufficient info here - let me know how else I may help to figure out why this won't run on my build. Thanks

Deprecated .egg files when running the install command

When running !pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git, the output is:

Collecting git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
Cloning https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git to c:\users\32067\appdata\local\temp\pip-req-build-aaa3h5n1
Resolved https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git to commit 720a87a
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: transformers in c:\programdata\anaconda3\lib\site-packages (from parrot==1.0) (4.32.1)
Requirement already satisfied: sentencepiece in c:\programdata\anaconda3\lib\site-packages\sentencepiece-0.1.99-py3.11-win-amd64.egg (from parrot==1.0) (0.1.99)
Requirement already satisfied: python-Levenshtein in c:\programdata\anaconda3\lib\site-packages\python_levenshtein-0.23.0-py3.11.egg (from parrot==1.0) (0.23.0)
Requirement already satisfied: sentence-transformers in c:\programdata\anaconda3\lib\site-packages\sentence_transformers-2.2.2-py3.11.egg (from parrot==1.0) (2.2.2)
Requirement already satisfied: fuzzywuzzy in c:\programdata\anaconda3\lib\site-packages\fuzzywuzzy-0.18.0-py3.11.egg (from parrot==1.0) (0.18.0)
Requirement already satisfied: Levenshtein==0.23.0 in c:\programdata\anaconda3\lib\site-packages\levenshtein-0.23.0-py3.11-win-amd64.egg (from python-Levenshtein->parrot==1.0) (0.23.0)
Requirement already satisfied: rapidfuzz<4.0.0,>=3.1.0 in c:\programdata\anaconda3\lib\site-packages\rapidfuzz-3.6.1-py3.11-win-amd64.egg (from Levenshtein==0.23.0->python-Levenshtein->parrot==1.0) (3.6.1)
Requirement already satisfied: tqdm in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (4.65.0)
Requirement already satisfied: torch>=1.6.0 in c:\programdata\anaconda3\lib\site-packages\torch-2.1.2-py3.11-win-amd64.egg (from sentence-transformers->parrot==1.0) (2.1.2)
Requirement already satisfied: torchvision in c:\programdata\anaconda3\lib\site-packages\torchvision-0.16.2-py3.11-win-amd64.egg (from sentence-transformers->parrot==1.0) (0.16.2)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (1.24.3)
Requirement already satisfied: scikit-learn in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (1.3.0)
Requirement already satisfied: scipy in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (1.11.1)
Requirement already satisfied: nltk in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (3.8.1)
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\programdata\anaconda3\lib\site-packages (from sentence-transformers->parrot==1.0) (0.15.1)
Requirement already satisfied: filelock in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (3.9.0)
Requirement already satisfied: packaging>=20.0 in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (23.1)
Requirement already satisfied: pyyaml>=5.1 in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (6.0)
Requirement already satisfied: regex!=2019.12.17 in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (2022.7.9)
Requirement already satisfied: requests in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (2.31.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (0.13.2)
Requirement already satisfied: safetensors>=0.3.1 in c:\programdata\anaconda3\lib\site-packages (from transformers->parrot==1.0) (0.3.2)
Requirement already satisfied: fsspec in c:\programdata\anaconda3\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers->parrot==1.0) (2023.4.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\programdata\anaconda3\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers->parrot==1.0) (4.7.1)
Requirement already satisfied: sympy in c:\programdata\anaconda3\lib\site-packages (from torch>=1.6.0->sentence-transformers->parrot==1.0) (1.11.1)
Requirement already satisfied: networkx in c:\programdata\anaconda3\lib\site-packages (from torch>=1.6.0->sentence-transformers->parrot==1.0) (3.1)
Requirement already satisfied: jinja2 in c:\programdata\anaconda3\lib\site-packages (from torch>=1.6.0->sentence-transformers->parrot==1.0) (3.1.2)
Requirement already satisfied: colorama in c:\programdata\anaconda3\lib\site-packages (from tqdm->sentence-transformers->parrot==1.0) (0.4.6)
Requirement already satisfied: click in c:\programdata\anaconda3\lib\site-packages (from nltk->sentence-transformers->parrot==1.0) (8.0.4)
Requirement already satisfied: joblib in c:\programdata\anaconda3\lib\site-packages (from nltk->sentence-transformers->parrot==1.0) (1.2.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\programdata\anaconda3\lib\site-packages (from requests->transformers->parrot==1.0) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests->transformers->parrot==1.0) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests->transformers->parrot==1.0) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests->transformers->parrot==1.0) (2023.11.17)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn->sentence-transformers->parrot==1.0) (2.2.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\programdata\anaconda3\lib\site-packages (from torchvision->sentence-transformers->parrot==1.0) (10.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in c:\programdata\anaconda3\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers->parrot==1.0) (2.1.1)
Requirement already satisfied: mpmath>=0.19 in c:\programdata\anaconda3\lib\site-packages (from sympy->torch>=1.6.0->sentence-transformers->parrot==1.0) (1.3.0)
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\fuzzywuzzy-0.18.0-py3.11.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\levenshtein-0.23.0-py3.11-win-amd64.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\python_levenshtein-0.23.0-py3.11.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\rapidfuzz-3.6.1-py3.11-win-amd64.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\sentencepiece-0.1.99-py3.11-win-amd64.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\sentence_transformers-2.2.2-py3.11.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\torch-2.1.2-py3.11-win-amd64.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\torchvision-0.16.2-py3.11-win-amd64.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..
Running command git clone --filter=blob:none --quiet https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git 'C:\Users\32067\AppData\Local\Temp\pip-req-build-aaa3h5n1'

No errors occurred when executing the whole program, but the final output is always "No paraphrases returned", even after I lowered the thresholds.
I have already installed these packages, but it seems like there is a problem with them.

Parrot returns very similar description without paraphrasing for some sentences

Hi Prithviraj,

Good Day!

Awesome work on building this library! I tried to use it for a personal project from the fashion domain and here's what I observed:

Have a look at the two sentences above. I have provided the input sentence and the paraphrased sentence obtained using parrot. Except some punctuation and contractions, there's not much that the model is able to do.

Such is the case even for most of the descriptions that I have scraped through fashion retailers. Could you advise how can I use parrot to obtain better paraphrased suggestions please?

Thanks & Regards,
Vinayak Nayak.

error: legacy-install-failure

When trying to install I get this:

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> python-Levenshtein

How can I fix it?

Which T5 model was used for fine-tuning?

Hi,
could you kindly tell which T5 model (T5-small, T5-base, T5-large, T5-3B, ....) the Parrot Project is based on?

Kind regards

TypeError: 'NoneType' object is not iterable

Hi,

Why do I get the error when running the following code?

phrases = ["Can you recommend some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False, do_diverse=True, diversity_ranker="levenshtein")
  for para_phrase in para_phrases:
    print(para_phrase)

Error:

TypeError Traceback (most recent call last)
/home/user/Code/Parrot/main.ipynb Cell 4 in <cell line: 5>()
      8 print("-"*100)
      9 para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False, do_diverse=True, diversity_ranker="levenshtein")
---> 10 for para_phrase in para_phrases:
     11 print(para_phrase)

TypeError: 'NoneType' object is not iterable
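
The traceback suggests that `parrot.augment` returned `None` for this input, presumably because no candidate passed the filters (an assumption; the thread does not confirm the cause). One defensive pattern is to normalise the result to a list before looping. In the sketch below, `augment_or_empty` is a hypothetical wrapper and `fake_augment` is a stand-in used only so the snippet runs without loading the model.

```python
def augment_or_empty(augment_fn, phrase, **kwargs):
    """Call an augment-style function and always return a list."""
    result = augment_fn(input_phrase=phrase, **kwargs)
    return list(result) if result else []


def fake_augment(input_phrase, **kwargs):
    """Stand-in for parrot.augment that finds no paraphrases."""
    return None


for para_phrase in augment_or_empty(fake_augment, "Turn on the music please"):
    print(para_phrase)  # loop body is skipped instead of raising TypeError
```

With a real model, the call would look like `augment_or_empty(parrot.augment, phrase, use_gpu=False)`.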

Parrot Library Package missing

I tried across multiple environments; every single one said the name Parrot does not exist in the parrot module.

The line causing this error:
from parrot import Parrot

Error:
from parrot import Parrot
ImportError: cannot import name 'Parrot' from 'parrot'

What is causing this error?

How to create embeddings vector from input sentence?

Inference Taking too Long

I am running inference on a GPU, yet inference for a single sentence takes around 65 seconds. How can we reduce the inference time?

Is there any paper published?

Hi,

Can you please provide more info on the model that was used to finetune and the metrics achieved. Also is there any paper published for this?

Thanks,
Gokul Raj

Trained the model

Hi,
Can I ask about support for the Arabic language?
Why can't Parrot paraphrase Arabic?

Understanding adequacy metric

Hi, I have been using the filters file from this repo to experiment on evaluating some paraphrases I created using various different models, but I noticed that the adequacy score gives some unexpected results so I was wondering if you could tell me some more about how it was trained?
I noticed that if the paraphrase and the original are the exact same, the adequacy is quite low (around 0.7-0.80). If the paraphrase is shorter or longer than the original, it generally has a much higher score. Ex. Original: "I need to buy a house in the neighborhood" -> Paraphrase: "I need to buy a house" the paraphrase has a score of 0.98. Paraphrase: "I need to buy a house in the neighborhood where I want to live" results in an even higher score of .99 while the paraphrase "I need to buy a house in the neighborhood" (which is the same exact sentence as the original) gets a score of 0.7 and the same sentence with a period at the end gets 0.8.
This makes me think that the adequacy model takes into account how much the new sentence has changed from the original in addition to how well its meaning was preserved in some way.
Since the ReadMe states that adequacy measures whether or not the paraphrase preserves the meaning of the original, it is confusing to me that using the same sentence for original and paraphrase does not get a high score, could you clarify?

Error: You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Hi,

I am facing the error pasted below:

I already have 'sentencepiece' installed.
I also tried using 'use_fast=False' flag in parrot.py file (pasted below), as suggested in many other discussions for the resolution of this error.

However, I am still facing this issue. Can you please help me to resolve it?

Thanks,
Rahul

Any chance it could work on paraphrasing profane words ?

Hi,

Any chance you have a model that can paraphrase sentences with profane words to filter profanity and express that emotion without profane words ?

About this model paper

Excuse me, where can I find your paper on Parrot Paraphraser generation?

ImportError: cannot import name 'Parrot' from partially initialized module 'parrot' (most likely due to a circular import) (D:\python\paraphrase\parrot.py)

I am getting this issue. I have installed Parrot correctly, but it is still showing this error. Please help.

How was the fluency model trained?

As I understand, this is just a BERT with a binary classification head.
If that is the case, then what was the training data for this model?

Unable to use the model due to Huggingface updated API

Hi.
You might be unable to use the Parrot model due to an error like this:
...is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'...

To solve this, I have already created a pull request. Until it is merged, you can open the Parrot library source code in your code editor and update these lines (most probably lines 13 and 14):

self.tokenizer = AutoTokenizer.from_pretrained(model_tag, use_auth_token = <your auth token>)
self.model     = AutoModelForSeq2SeqLM.from_pretrained(model_tag, use_auth_token = <your auth token>)

How to set a language other than English

How can we paraphrase text in another language, say French, for example?

Installation Issues on Windows

It appears there is a broken dependency on python-Levenshtein. You should update your import to point to Levenshtein package:

Installing collected packages: python-Levenshtein
Running setup.py install for python-Levenshtein ... error
ERROR: Command errored out with exit status 1:
command: 'C:\Users\gablanco\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\gablanco\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1\setup.py'"'"'; file='"'"'C:\Users\gablanco\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\gablanco\AppData\Local\Temp\pip-record-8_vcaac_\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\gablanco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Include\python-Levenshtein'
cwd: C:\Users\gablanco\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1
Complete output (28 lines):
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\Levenshtein
copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.9\Levenshtein
copying Levenshtein\__init__.py -> build\lib.win-amd64-3.9\Levenshtein
running egg_info
writing python_Levenshtein.egg-info\PKG-INFO
writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
writing entry points to python_Levenshtein.egg-info\entry_points.txt
writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
writing requirements to python_Levenshtein.egg-info\requires.txt
writing top-level names to python_Levenshtein.egg-info\top_level.txt
reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*pyc' found anywhere in distribution
warning: no previously-included files matching '*so' found anywhere in distribution
warning: no previously-included files matching '.project' found anywhere in distribution
warning: no previously-included files matching '.pydevproject' found anywhere in distribution
adding license file 'COPYING'
writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.9\Levenshtein
copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.9\Levenshtein
running build_ext
building 'Levenshtein._levenshtein' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1\setup.py'"'"'; file='"'"'C:\Users\AppData\Local\Temp\pip-install-3g1ne2jo\python-levenshtein_9f5029b6aae44944bceb3f676daf71a1\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\AppData\Local\Temp\pip-record-8_vcaac\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Include\python-Levenshtein' Check the logs for full command output.
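The log above shows pip compiling python-Levenshtein from source and stopping because no C++ compiler is available. Before changing anything, it can help to see which of the two distributions mentioned in this issue is already installed. The sketch below uses only the standard library; the function name is made up for illustration.

```python
from importlib import metadata


def installed_levenshtein_packages():
    """Map each candidate distribution name to its installed version (or None)."""
    versions = {}
    for name in ("python-Levenshtein", "Levenshtein"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_levenshtein_packages())
```

If both entries are None, either install the build tools named in the error message or, assuming a prebuilt wheel exists for your Python version, install a distribution that does not need compiling.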

Is it possible to keep the capitalization?

All my results are in lowercase. The API test on Huggingface has upper case. How do I enable that? :)
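If the library keeps returning lowercase text, one option is to approximate the original casing afterwards. The sketch below is a heuristic post-processing step, not a Parrot feature: it capitalises sentence starts and restores words that were capitalised mid-sentence in the input (names, acronyms). The function name is an assumption for illustration.

```python
import re

WORD = r"[A-Za-z][A-Za-z'-]*"


def restore_case(original, paraphrase):
    """Re-apply casing from `original` to a lowercase `paraphrase` (heuristic)."""
    cased = {}
    for sentence in re.split(r"(?<=[.!?])\s+", original.strip()):
        words = re.findall(WORD, sentence)
        # Skip the first word: its capital may only reflect sentence position.
        for word in words[1:]:
            if word != word.lower():
                cased[word.lower()] = word

    # Restore known capitalised words wherever they reappear in lowercase.
    text = re.sub(WORD, lambda m: cased.get(m.group(0), m.group(0)), paraphrase)
    # Capitalise the first letter of the text and of each following sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(restore_case("I drive a Ford pickup truck.", "i have a ford pickup truck."))
```

Words that the paraphraser introduces on its own (for example a brand name absent from the input) cannot be recovered this way.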

use_gpu=True Error

In Google Colab.

INSTALLED:
!pip install -qqq git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

MY CODE:

import torch
from parrot import Parrot

def random_state(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

random_state(1234)

parrot_gpu = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=True)

phrases = ['i drive a ford pickup truck.', 'i am very conservative.', 'my family lives down the street from me.',
           'i go to church every sunday.', 'i have three guns and love hunting.']

para_phrases_gpu = parrot_gpu.augment(input_phrase=phrases[0], use_gpu=True, max_return_phrases = 10)

ERROR:

RuntimeError Traceback (most recent call last)
in ()
----> 1 para_phrases_gpu = parrot_gpu.augment(input_phrase=phrases[0], use_gpu=True, max_return_phrases = 10)

/usr/local/lib/python3.7/dist-packages/parrot/parrot.py in augment(self, input_phrase, use_gpu, diversity_ranker, do_diverse, max_return_phrases, max_length, adequacy_threshold, fluency_threshold)
128
129
--> 130 adequacy_filtered_phrases = self.adequacy_score.filter(input_phrase, paraphrases, adequacy_threshold, device )
131 if len(adequacy_filtered_phrases) > 0 :
132 fluency_filtered_phrases = self.fluency_score.filter(adequacy_filtered_phrases, fluency_threshold, device )

/usr/local/lib/python3.7/dist-packages/parrot/filters.py in filter(self, input_phrase, para_phrases, adequacy_threshold, device)
13 x = self.tokenizer(input_phrase, para_phrase, return_tensors='pt', max_length=128, truncation=True)
14 self.adequacy_model = self.adequacy_model.to(device)
---> 15 logits = self.adequacy_model(**x).logits
16 probs = logits.softmax(dim=1)
17 prob_label_is_true = probs[:,1]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1213 output_attentions=output_attentions,
1214 output_hidden_states=output_hidden_states,
-> 1215 return_dict=return_dict,
1216 )
1217 sequence_output = outputs[0]

/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
844 token_type_ids=token_type_ids,
845 inputs_embeds=inputs_embeds,
--> 846 past_key_values_length=past_key_values_length,
847 )
848 encoder_outputs = self.encoder(

/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
126
127 if inputs_embeds is None:
--> 128 inputs_embeds = self.word_embeddings(input_ids)
129 token_type_embeddings = self.token_type_embeddings(token_type_ids)
130

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
158 return F.embedding(
159 input, self.weight, self.padding_idx, self.max_norm,
--> 160 self.norm_type, self.scale_grad_by_freq, self.sparse)
161
162 def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2181 # remove once script supports set_grad_enabled
2182 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 2183 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2184
2185

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
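In the traceback, filters.py moves the adequacy model to `device` on line 14, but the tokenizer output `x` created on line 13 is not visibly moved, which would explain one side being on cuda:0 and the other on cpu. A possible local patch, sketched under that assumption, is to move every input tensor to the same device before calling the model. The helper name is hypothetical.

```python
def move_to_device(batch, device):
    """Return a dict with every tensor of a tokenizer output moved to `device`."""
    return {name: tensor.to(device) for name, tensor in batch.items()}


# Hypothetical use inside the filter shown in the traceback:
#   x = self.tokenizer(input_phrase, para_phrase, return_tensors='pt',
#                      max_length=128, truncation=True)
#   x = move_to_device(x, device)
#   logits = self.adequacy_model(**x).logits
```

The same pattern would apply to the fluency filter if it tokenizes on CPU in the same way.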

Installation Issue on Ubuntu

I am trying to install the library on Ubuntu, but the installation does not go through.

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

I tried installing Parrot_Paraphraser on Kaggle. It installed successfully, but when I tried using it in code it showed: ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

My internet connection is also ON in Kaggle. Can you please check and advise? I have added a screenshot too.

Init models download error

https://www.loom.com/share/06e3fa28b5684761864b3e9460fd5a8a

How to generate in batches?

Hi Prithiviraj, thank you for the great work!
Is it possible to run this model on batches of input sentences so that we can make much better use of the GPU? At the moment, setting use_gpu to True doesn't yield much of a performance gain because we're not parallelizing across input phrases. If I missed something in the source code, please let me know. This would also be worth emphasizing in the documentation, for me and probably for many others who try to paraphrase datasets of 1M+ phrases.
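As a starting point for experimenting, the input list can be chunked and each chunk handed to a batched generation function. The sketch below only covers the chunking. `generate_batch` is a hypothetical callable (list of phrases in, one list of paraphrases per phrase out) that you would have to write yourself, for instance around a padded tokenizer call and a single `model.generate` call. Nothing here is part of Parrot's API.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def paraphrase_in_batches(phrases, generate_batch, batch_size=32):
    """Run the hypothetical `generate_batch` over `phrases`, one chunk at a time."""
    results = []
    for batch in batched(phrases, batch_size):
        results.extend(generate_batch(batch))
    return results
```

Results come back in input order, so they can be zipped with the original phrases afterwards.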

How is the fluency scorer being fine-tuned?

Thanks for the great work, Prithiviraj. I am curious what dataset and procedure you used to fine-tune the fluency scorer (https://huggingface.co/prithivida/parrot_fluency_on_BERT).

Unable to use the model

I followed the instructions in the README, including the huggingface-cli login, and I get the following error:
OSError: There was a specific connection error when trying to load prithivida/parrot_paraphraser_on_T5: <class 'requests.exceptions.HTTPError'>

Also this is what I see on the huggingface repo, Access to model prithivida/parrot_paraphraser_on_T5 is restricted and you are not in the authorized list. Visit https://huggingface.co/prithivida/parrot_paraphraser_on_T5 to ask for access.

Really appreciate it if anyone has any suggestions here. Thanks.

Make it work without removing HTML tags

Hello everyone, I'm using this paraphraser model and it works well, but I would like to paraphrase my text without breaking HTML tags.

Example of the original text:

`In Java, an interface specifies the behavior of a class by providing an abstract type. As one of Java's core concepts, abstraction, polymorphism, and multiple inheritance are supported through this technology. Interfaces are used in Java <b>to achieve abstraction</b>.`

Example of the paraphrased text:

`An interface in Java gives the behavior of a class. One of the core concepts of Java is the use of abstraction, polymorphism, and multiple inheritance. Java uses interface to achieve abstraction.`

As you can see, it removes the HTML tags. Sometimes it doesn't remove them but breaks them, like this: "/li>li>"

Any help is appreciated. I can pay if you can solve this for me.
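One way to approach this, sketched here as an assumption rather than a tested solution, is to keep the tags away from the model entirely: split the text on tags, paraphrase only the text segments, and put the pieces back together. `paraphrase` is a hypothetical callable that takes a string and returns one paraphrased string, for example a wrapper that picks the top candidate from Parrot.

```python
import re

TAG_PATTERN = re.compile(r"(<[^>]+>)")


def paraphrase_preserving_tags(html, paraphrase):
    """Paraphrase only the text between tags; tags are copied through unchanged."""
    pieces = []
    for part in TAG_PATTERN.split(html):
        if TAG_PATTERN.fullmatch(part) or not part.strip():
            pieces.append(part)  # a tag or pure whitespace: keep as-is
        else:
            # Preserve surrounding whitespace so words do not fuse with tags.
            lead = part[:len(part) - len(part.lstrip())]
            trail = part[len(part.rstrip()):]
            pieces.append(lead + paraphrase(part.strip()) + trail)
    return "".join(pieces)


# str.upper stands in for a real paraphraser here.
print(paraphrase_preserving_tags(
    "Interfaces are used in Java <b>to achieve abstraction</b>.", str.upper))
```

The trade-off is that each segment is paraphrased without its neighbours, so a sentence split by an inline tag such as `<b>` may read less naturally than one paraphrased as a whole.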

Is there any way to fine tune this model?

Publish to PyPI

@PrithivirajDamodaran any plans on publishing this to PyPI?

Poor quality for German input

Paraphrase output for German text is essentially not usable. Is there anything more to be considered? From quickly looking into the paper it looks like it should be possible, but I honestly didn't read deeply into it and I'm no NLP expert.

How to install this library with poetry?

Does this library support installation via poetry?

Question about model training

Hello, your work is wonderful. I'd like to create something like this in my native language (Persian).
Could you please let me know how you trained those T5 models?
I have access to translated Quora question pairs, and I think the training process looks like the following:

- filter similar sentences in the dataset
- train a text generation model from sentence 1 to sentence 2, and from sentence 2 to sentence 1
- this model is a text2text generation model

I mean just the training, not including postprocessing. Is this correct or not?
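The recipe proposed in this question can be sketched as a data-preparation step. This illustrates the poster's idea and is not a confirmed description of how the Parrot models were trained. It assumes rows shaped like Quora Question Pairs, `(question1, question2, is_duplicate)`, and a T5-style task prefix; both the row layout and the `"paraphrase: "` prefix are assumptions.

```python
def build_training_pairs(question_pairs, prefix="paraphrase: "):
    """Turn (q1, q2, is_duplicate) rows into (source, target) examples, both directions."""
    examples = []
    for q1, q2, is_duplicate in question_pairs:
        q1, q2 = q1.strip(), q2.strip()
        # Keep only pairs labelled as paraphrases, and drop trivially identical pairs.
        if not is_duplicate or q1.lower() == q2.lower():
            continue
        examples.append((prefix + q1, q2))  # sentence 1 -> sentence 2
        examples.append((prefix + q2, q1))  # sentence 2 -> sentence 1
    return examples


rows = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("Is tea healthy?", "How tall is Everest?", 0),
]
for source, target in build_training_pairs(rows):
    print(source, "->", target)
```

The resulting pairs can then be fed to any text2text fine-tuning loop; the filtering rule is the part most worth tuning for a translated dataset.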

Training The model

Is there any documentation for training this model?

Issue in model output. Model's output changed to NoneType

I used the following code to generate augmented paraphrases. At some point the model's output became NoneType: although I am iterating over my list of questions, it outputs None. I am not sure why, and I could not find any related issues.

import time
ts = time.time()
augmented_questions = []
for question in finance_list:
  para_phrases = parrot.augment(input_phrase=question.lower(), use_gpu=False)
  print(f"Length of para phrased by model are: {len(para_phrases)}")
  for aug in para_phrases:
    print(f"Printing aug in this: {aug}")
    augmented_questions.append(aug)
    print(f"Appended the {aug} with type {type(aug)}")

te = time.time()
print(len(augmented_questions))
print(f"time taken to augment {len(finance_list)} is {te - ts} seconds")

ChatGPT suggested adding a check for NoneType in the output (line 7 below). Does anyone know why I am getting this NoneType from the model output?

import time
ts = time.time()
augmented_questions = []
#19 thy without .lower()
for question in finance_list:
  para_phrases = parrot.augment(input_phrase=question.lower(), use_gpu=False)
  if para_phrases is not None:
    print(f"Length of para phrased by model are: {len(para_phrases)}")
    for aug in para_phrases:
      print(f"Printing aug in this: {aug}")
      augmented_questions.append(aug)
      print(f"Appended the {aug} with type {type(aug)}")

te = time.time()
print(len(augmented_questions))
print(f"time taken to augment {len(finance_list)} is {te - ts} seconds")
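The traceback excerpt in the use_gpu issue earlier on this page shows `augment` branching on `len(adequacy_filtered_phrases) > 0`, which suggests (as an assumption, not confirmed here) that it may return None when no candidate survives the filters. A sketch of a loop that tolerates this and also records which inputs produced nothing follows. `augment` is a hypothetical one-argument callable, for example `lambda q: parrot.augment(input_phrase=q, use_gpu=False)`.

```python
def augment_all(phrases, augment):
    """Collect paraphrases for every phrase, treating None as 'no paraphrases'."""
    augmented, skipped = [], []
    for phrase in phrases:
        result = augment(phrase) or []  # None and [] are both treated as empty
        if not result:
            skipped.append(phrase)  # remember inputs that produced nothing
        augmented.extend(result)
    return augmented, skipped
```

Inspecting `skipped` afterwards shows whether the None results cluster around particular kinds of input, such as very short or very long questions.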

Traceback (most recent call last):
  File "$HOME/src/try-parrot/demo.py", line 19, in <module>
    parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)
  File "$HOME/.local/lib/python3.8/site-packages/parrot/parrot.py", line 10, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_tag, use_auth_token=False)
  File "$HOME/.local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 659, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "$HOME/.local/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5_fast.py", line 133, in __init__
    super().__init__(
  File "$HOME/.local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "$HOME/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "$HOME/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 438, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "$HOME/.local/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 92, in <module>
    _descriptor.EnumValueDescriptor(
  File "$HOME/.local/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
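Workaround 2 from the message above can be applied from inside the script, as long as the variable is set before transformers (and therefore protobuf) is imported. A minimal sketch:

```python
import os

# Must run before transformers/protobuf are imported anywhere in the process.
# setdefault leaves any value you have already exported untouched.
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")

# from parrot import Parrot  # import only after the variable is set
```

As the message itself notes, this switches to pure-Python parsing and will be slower; downgrading protobuf (workaround 1) avoids that cost.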