
gramformer's Introduction

👋 I am Prithivida!

25 Million+ model downloads on 🤗 | Cited in NeurIPS, ICLR, ACL | 3K+ ⭐️ on GitHub.


gramformer's People

Contributors

parisac, prabhkaran, prithivirajdamodaran, shashankdeshpande


gramformer's Issues

Question

Hi, I love what you've created!
I was curious: is there a table of the error classifications you used?
Or is there one already that I just can't find?
Thank you for your work!

How to train Gramformer on non-English languages.

Hey @PrithivirajDamodaran, great work on building Gramformer. I've played with it and the results are amazing.

I work on pushing NLP forward in under-represented languages, so I would humbly like to ask: how do I train Gramformer on non-English sentences?

I checked out your Hugging Face page 'https://huggingface.co/prithivida/grammar_error_correcter'
but couldn't find any resources on how to train Gramformer from scratch. If you could help me train Gramformer on non-English languages, it would really mean a lot to me. Do let me know.

Thanks
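
There is no official non-English training recipe in this repo; below is only a hypothetical minimal sketch of what fine-tuning a multilingual seq2seq checkpoint on (incorrect, correct) sentence pairs could look like. The checkpoint name, the "gec: " task prefix, and the toy pair are assumptions, not Gramformer's actual training setup.

# Hypothetical sketch, not Gramformer's actual training code: fine-tune a
# multilingual seq2seq model on (incorrect, correct) sentence pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # assumption: any multilingual seq2seq checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy parallel data: (incorrect sentence, corrected sentence)
pairs = [("Ich habe zur Schule gegangen", "Ich bin zur Schule gegangen")]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
for incorrect, correct in pairs:
    inputs = tokenizer("gec: " + incorrect, return_tensors="pt", truncation=True)
    labels = tokenizer(correct, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()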

Error loading the tokenizer in transformers==4.4.2

I'm getting an error when initializing the class object, specifically when loading the tokenizer:

In [6]: correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-6-d34dd9c5fe99> in <module>
----> 1 correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)

~/anaconda3/envs/npe/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    414             tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    415             if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 416                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    417             else:
    418                 if tokenizer_class_py is not None:

~/anaconda3/envs/npe/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1703
   1704         return cls._from_pretrained(
-> 1705             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1706         )
   1707

~/anaconda3/envs/npe/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1774         # Instantiate tokenizer.
   1775         try:
-> 1776             tokenizer = cls(*init_inputs, **init_kwargs)
   1777         except OSError:
   1778             raise OSError(

~/anaconda3/envs/npe/lib/python3.6/site-packages/transformers/models/t5/tokenization_t5_fast.py in __init__(self, vocab_file, tokenizer_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, **kwargs)
    134             extra_ids=extra_ids,
    135             additional_special_tokens=additional_special_tokens,
--> 136             **kwargs,
    137         )
    138

~/anaconda3/envs/npe/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
     85         if fast_tokenizer_file is not None and not from_slow:
     86             # We have a serialization from tokenizers which let us directly build the backend
---> 87             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
     88         elif slow_tokenizer is not None:
     89             # We need to convert a slow tokenizer to build the backend

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 1 column 329667

transformers==4.4.2.

The installation instructions don't specify which transformers version this library expects. What is the correct version? Or is it version-independent and the problem is something else?
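
One hedged workaround, assuming the exception comes from a tokenizers version mismatch between the serialized fast tokenizer on the Hub and the locally installed library: either upgrade transformers/tokenizers, or fall back to the slow (Python) tokenizer, roughly like this:

# Sketch of a possible workaround: skip the fast (Rust) tokenizer whose serialized
# file triggers the untagged-enum error, and load the slow Python tokenizer instead
# (requires sentencepiece to be installed).
from transformers import AutoTokenizer

correction_model_tag = "prithivida/grammar_error_correcter_v1"  # assumption: current model tag
correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag, use_fast=False)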

Figma Gramformer Plugin

Figma is used to create a lot of digital interfaces today, so a Gramformer Figma plugin would go a long way. I'd be willing to design the interface for the plugin, but I don't know how to build the plugin itself. I hope someone takes this up. Here is a link to get started: https://www.figma.com/plugin-docs/setup/

Commercial use issue

Hey @PrithivirajDamodaran

The readme states that Gramformer versions above 1.0 are allowed for commercial use. However, this is not currently the case, as the grammar_error_correcter_v1 model has been trained on the non-commercial WI&Locness data, even though the documentation states otherwise:

The grammar_error_correcter_v1 model is actually identical to the previous grammar_error_correcter model, which was trained on the non-commercial WI&Locness data – they have identical weights, which you can verify with this script.
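
(The script referenced above is not reproduced here; the following is only a hypothetical sketch of how such a weight comparison could be done, assuming both tags are still downloadable from the Hub.)

# Hypothetical sketch: check whether two Hub checkpoints have identical weights.
import torch
from transformers import AutoModelForSeq2SeqLM

m1 = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter")
m2 = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter_v1")

s1, s2 = m1.state_dict(), m2.state_dict()
identical = s1.keys() == s2.keys() and all(torch.equal(s1[k], s2[k]) for k in s1)
print("Identical weights:", identical)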

As the models are the same, this means that both models have been trained using the non-commercial WI&Locness data, and the grammar_error_correcter_v1 model along with Gramformer v1.1 and v1.2 should not be allowed for commercial use.

Could you please update the readme to clarify this, or upload a new model that has not been trained using WI&Locness?

Thanks

No module named 'annotated_text' in streamlit_app.py

I was trying to run the Streamlit app, but encountered this error:

ModuleNotFoundError: No module named 'annotated_text'
Traceback:
File "C:\.........\lib\site-packages\streamlit\script_runner.py", line 333, in _run_script
    exec(code, module.__dict__)
File ".........\Gramformer\streamlit_app.py", line 1, in <module>
    from annotated_text import annotated_text

And I couldn't find annotated_text anywhere in this repo.

How can I solve this problem?
Thank you in advance.
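
A likely fix, assuming the app relies on the st-annotated-text package from PyPI (which provides the annotated_text module imported by streamlit_app.py): install it with pip install st-annotated-text. A minimal usage sketch, for use inside a Streamlit app:

# Minimal sketch, assuming `pip install st-annotated-text` has been run.
from annotated_text import annotated_text

# Render text with one highlighted annotation (body, label, background colour).
annotated_text("Matt ", ("like", "grammar error", "#faa"), " fish.")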

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

OSError                                   Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_9376\2706950954.py in <module>
     25
     26
---> 27 gf = Gramformer(models = 1, use_gpu=False) # 1=corrector, 2=detector
     28
     29 influent_sentences = [

~\anaconda3_9\envs\python37\lib\site-packages\gramformer\gramformer.py in __init__(self, models, use_gpu)
      7         import errant
      8         #self.annotator = errant.load('en_core_web_sm')
----> 9         self.annotator = errant.load('en') # en is deprecated from spacy 3.0 onwards
     10
     11         if use_gpu:

~\anaconda3_9\envs\python37\lib\site-packages\errant\__init__.py in load(lang, nlp)
     17
     18     # Load spacy
---> 19     nlp = nlp or spacy.load(lang, disable=["ner"])
     20
     21     # Load language edit merger

~\anaconda3_9\envs\python37\lib\site-packages\spacy\__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning)
---> 30     return util.load_model(name, **overrides)
     31
     32

~\anaconda3_9\envs\python37\lib\site-packages\spacy\util.py in load_model(name, **overrides)
    173     elif hasattr(name, "exists"):  # Path or Path-like to model data
    174         return load_model_from_path(name, **overrides)
--> 175     raise IOError(Errors.E050.format(name=name))
    176
    177

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

pip install is erroring out

I am unable to pip install the package; here is the error:

Collecting git+https://github.com/PrithivirajDamodaran/Gramformer.git@v0.1
Cloning https://github.com/PrithivirajDamodaran/Gramformer.git (to revision v0.1) to c:\users\sumit\appdata\local\temp\pip-req-build-sw54k_0h
ERROR: Error [WinError 2] The system cannot find the file specified while executing command git clone -q https://github.com/PrithivirajDamodaran/Gramformer.git 'C:\Users\Sumit\AppData\Local\Temp\pip-req-build-sw54k_0h'
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

I also tried directly downloading the repo and running the package, but the model is not present at the expected location (correction_model_tag = "prithivida/grammar_error_correcter"). Is there any way to download the pretrained model?
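
The first error just means Git is not installed or not on PATH. For the second question, a hedged sketch of downloading the pretrained weights directly from the Hugging Face Hub without installing this repo (the _v1 tag is an assumption based on later issues; the original tag may no longer resolve):

# Hypothetical sketch: fetch the pretrained checkpoint from the Hub and save it locally.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tag = "prithivida/grammar_error_correcter_v1"  # assumption: current model tag
AutoTokenizer.from_pretrained(tag).save_pretrained("./gramformer_model")
AutoModelForSeq2SeqLM.from_pretrained(tag).save_pretrained("./gramformer_model")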

Gramformer on PyPI?

Hi,

Thanks for the great tool. I'm working on a language server for grammatical error detection (https://github.com/hangyav/textLSP) and have included Gramformer as one option for text analysis. However, I'm not able to list it as a dependency, since PyPI does not accept GitHub links in setup.py, so it has to be installed separately. I was wondering if you are planning to release Gramformer on PyPI? Thanks!

Fix Can't find model "en" error by directly loading en_core_web_sm

OSError: [E941] Can't find model 'en'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead:

nlp = spacy.load("en_core_web_sm")

For more details on the available models, see the models directory: https://spacy.io/models. If you want to create a blank model, use spacy.blank: nlp = spacy.blank("en")
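
Based on the errant.load(lang, nlp) signature visible in the tracebacks above, a hedged sketch of this fix is to load the pipeline by its full name and pass it to ERRANT yourself (or patch gramformer/gramformer.py the same way):

# Sketch of the proposed fix, assuming en_core_web_sm is installed
# (python -m spacy download en_core_web_sm): pass an explicitly loaded
# pipeline to errant.load instead of the deprecated 'en' shortcut.
import spacy
import errant

nlp = spacy.load("en_core_web_sm", disable=["ner"])
annotator = errant.load("en", nlp)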

[Spacy error] Can't find model 'en'

Hello, I have successfully installed Gramformer on my Windows PC, but when I run it, it gives the following error.

Traceback (most recent call last):
  File "main.py", line 27, in <module>
    grammar_correction = Gramformer(models = 1, use_gpu=True)
  File "~~\.conda\envs\nlp-transformer\lib\site-packages\gramformer\gramformer.py", line 8, in __init__
    self.annotator = errant.load('en')
  File "~~\.conda\envs\nlp-transformer\lib\site-packages\errant\__init__.py", line 16, in load
    nlp = nlp or spacy.load(lang, disable=["ner"])
  File "~~\.conda\envs\nlp-transformer\lib\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "~~\.conda\envs\nlp-transformer\lib\site-packages\spacy\util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Output Issue with Gramformer - Script Long Strings vs. List

Hello @PrithivirajDamodaran,

I am reaching out to highlight an observed inconsistency while using the Gramformer library.

Issue Overview:
When utilizing Gramformer with a script provided as a single string, the output appears to be incomplete. However, when the script is presented as a list of sentences, the output is both complete and accurate.

Code Reference:

# Original code with incomplete output
script = "We cleaned all the kitchen... If I'm stressed... and so on."
corrected_sentences = gf.correct(script, max_candidates=1)
print("[Input] ", script)
print("[Corre] ", corrected_sentences)

# Updated code with a list of sentences
script = ["We cleaned all the kitchen...", "If I'm stressed...", ...]
for script_sentence in script:
    corrected_sentences = gf.correct(script_sentence, max_candidates=1)
    print("[Input] ", script_sentence)
    for corrected_sentence in corrected_sentences:
        print("[Corre] ", corrected_sentence)
    print("-" * 100)
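
For longer scripts, a hedged workaround is to split the string into sentences automatically before correcting, for example with NLTK (assuming its punkt data is downloaded and gf is an initialized corrector):

# Hypothetical pre-processing sketch: split a long script into sentences before
# calling gf.correct, since the model handles one sentence at a time.
# Assumes nltk.download("punkt") was run and gf is an initialized Gramformer corrector.
import nltk

script = "We cleaned all the kitchen. If I'm stressed I bake. And so on."
for sentence in nltk.sent_tokenize(script):
    for corrected_sentence in gf.correct(sentence, max_candidates=1):
        print("[Input] ", sentence)
        print("[Corre] ", corrected_sentence)
    print("-" * 100)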

Expected Output:
Consistency is expected in the output, irrespective of whether the input script is provided as a string or a list of sentences.

Additional Information:

  • Python version: 3.9.7
  • Operating System: Windows

I would appreciate your insights and guidance on resolving this matter.

Thank you for your time and assistance.

Best regards,
Rutika.

Use corrector as highlighter

Hi @PrithivirajDamodaran

This is a great framework. Is it possible (for now) to use the corrector model (model=2) for the highlighter (model=1)?
After getting a correction, match it against the input and add a prefix and suffix () around the mismatches?

Thanks
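
There is no built-in way shown here, but a hedged sketch of a diff-based highlighter on top of the corrector could look like the following; the <c> ... </c> markers are placeholders, not Gramformer's actual highlight format.

# Hypothetical diff-based highlighter: wrap words changed by the corrector in markers.
import difflib

def highlight(original: str, corrected: str) -> str:
    orig_tokens, corr_tokens = original.split(), corrected.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, orig_tokens, corr_tokens).get_opcodes():
        if op == "equal":
            out.extend(orig_tokens[i1:i2])
        else:  # replace / insert / delete
            out.append("<c>" + " ".join(corr_tokens[j1:j2]) + "</c>")
    return " ".join(out)

# Example usage (assuming gf is an initialized corrector):
# corrected = next(iter(gf.correct("Matt like fish", max_candidates=1)))
# print(highlight("Matt like fish", corrected))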

Multilingual Grammatical Error Correction

Is Gramformer v1 based on a multilingual model?
I think the T5 model is English-centric, isn't it?
Yet the input just gets copied through unchanged for other languages.

ERRANT should also be usable for the languages covered by mT5 or FLAN-T5.
The GitHub Typo Corpus has a subset of those languages, though only about 3,000 sentences besides English.

Word limit

The model has trouble with long sentences, especially when the words in the sentence are in upper case. It outputs only a truncated sentence, and the rest of the neglected sentence is shown as an error.

Training dataset

Hi Prithiviraj,

Is there any chance you'd be able to release the training dataset you used to train the Gramformer Hugging Face model? I see that there are some details in the Readme on the slices of data that you brought together, but it would be useful to be able to use the same data that you used.

The main reason I'm asking is that I'd like to create a model that can take correct text and add grammatical errors to it. So I was thinking I could take the dataset you used to train Gramformer and invert it to train a model that does the opposite. I can go through the data prep process as you did, but it would definitely be easier if I could reuse yours, and it might also be useful for reproducibility for others.

Module not found Error

I am facing this issue. It would be a great help if you could tell me what I am doing wrong here.

(Screenshots of the error were attached as images.)

Inference Issue

OSError                                   Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    241             if resolved_config_file is None:
--> 242                 raise EnvironmentError
    243             config_dict = cls._dict_from_json_file(resolved_config_file)

OSError:

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)

3 frames

<ipython-input> in <module>()
----> 1 correction_tokenizer = AutoTokenizer.from_pretrained("prithivida/grammar_error_correcter")
      2 correction_model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter")
      3 print("[Gramformer] Grammar error correction model loaded..")
      4
      5

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    204         config = kwargs.pop("config", None)
    205         if not isinstance(config, PretrainedConfig):
--> 206             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    207
    208         if "bert-base-japanese" in str(pretrained_model_name_or_path):

/usr/local/lib/python3.7/dist-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    201
    202         """
--> 203         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    204
    205         if "model_type" in config_dict:

/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    249                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    250             )
--> 251             raise EnvironmentError(msg)
    252
    253         except json.JSONDecodeError:

OSError: Can't load config for 'prithivida/grammar_error_correcter'. Make sure that:

  • 'prithivida/grammar_error_correcter' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'prithivida/grammar_error_correcter' is the correct path to a directory containing a config.json file

Any solutions for this issue?
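
The root cause, as the next issue also suggests, appears to be that the checkpoint was renamed on the Hub; a hedged fix is to point at the _v1 tag (an assumption based on the reports below):

# Sketch of a possible fix: load the renamed checkpoint directly.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

correction_model_tag = "prithivida/grammar_error_correcter_v1"
correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)
correction_model = AutoModelForSeq2SeqLM.from_pretrained(correction_model_tag)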

OSError: Can't load config for 'prithivida/grammar_error_correcter'

Hi, I have been using your code for the last few days. Suddenly, it started to crash.

Have a look at the code and error given below:

Code (Link: https://huggingface.co/prithivida/grammar_error_correcter_v1):

from gramformer import Gramformer
import torch

def set_seed(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(1212)


gf = Gramformer(models = 2, use_gpu=False) # 0=detector, 1=highlighter, 2=corrector, 3=all 

influent_sentences = [
    "Matt like fish",
    "the collection of letters was original used by the ancient Romans",
    "We enjoys horror movies",
    "Anna and Mike is going skiing",
    "I walk to the store and I bought milk",
    "We all eat the fish and then made dessert",
    "I will eat fish for dinner and drank milk",
    "what be the reason for everyone leave the company",
]   

for influent_sentence in influent_sentences:
    corrected_sentence = gf.correct(influent_sentence)
    print("[Input] ", influent_sentence)
    print("[Correction] ",corrected_sentence[0])
    print("-" *100)

Error

404 Client Error: Not Found for url: https://huggingface.co/prithivida/grammar_error_correcter/resolve/main/config.json
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    491                 use_auth_token=use_auth_token,
--> 492                 user_agent=user_agent,
    493             )

7 frames
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1278             use_auth_token=use_auth_token,
-> 1279             local_files_only=local_files_only,
   1280         )

/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   1441             r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
-> 1442             r.raise_for_status()
   1443             etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")

/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    942         if http_error_msg:
--> 943             raise HTTPError(http_error_msg, response=self)
    944 

HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/prithivida/grammar_error_correcter/resolve/main/config.json

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-10-0f43e537fe87> in <module>
     10 
     11 
---> 12 gf = Gramformer(models = 2, use_gpu=False) # 0=detector, 1=highlighter, 2=corrector, 3=all
     13 
     14 influent_sentences = [

/usr/local/lib/python3.7/dist-packages/gramformer/gramformer.py in __init__(self, models, use_gpu)
     14 
     15     if models == 2:
---> 16         self.correction_tokenizer = AutoTokenizer.from_pretrained(correction_model_tag)
     17         self.correction_model     = AutoModelForSeq2SeqLM.from_pretrained(correction_model_tag)
     18         self.correction_model     = self.correction_model.to(device)

/usr/local/lib/python3.7/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    400         kwargs["_from_auto"] = True
    401         if not isinstance(config, PretrainedConfig):
--> 402             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    403 
    404         use_fast = kwargs.pop("use_fast", True)

/usr/local/lib/python3.7/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    428         """
    429         kwargs["_from_auto"] = True
--> 430         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    431         if "model_type" in config_dict:
    432             config_class = CONFIG_MAPPING[config_dict["model_type"]]

/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    502                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    503             )
--> 504             raise EnvironmentError(msg)
    505 
    506         except json.JSONDecodeError:

OSError: Can't load config for 'prithivida/grammar_error_correcter'. Make sure that:

- 'prithivida/grammar_error_correcter' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'prithivida/grammar_error_correcter' is the correct path to a directory containing a config.json file
Screenshot: https://user-images.githubusercontent.com/4704211/124133526-5a9da900-da9b-11eb-9733-61df46ab01e1.png

Possible Solution:

Rename this link from: https://huggingface.co/prithivida/grammar_error_correcter/ to: https://huggingface.co/prithivida/grammar_error_correcter_v1/

Please help me fix this. Thank you.

Compatibility issues with spacy

Only in the last few days, I have been getting the error "Can't find model 'en'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead: nlp = spacy.load("en_core_web_sm")"

There seems to be an incompatibility between en-core-web-md 3.7.0, which requires spacy 3.7.0, and gramformer, which requires spacy 2.3.9.
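
A hedged way to surface the mismatch early, assuming Gramformer still pins spaCy 2.3.x and therefore needs a 2.3-series model package (e.g. en_core_web_sm 2.3.1 in a spacy==2.3.9 environment), is a quick version check:

# Quick environment sanity check (a sketch, assuming a spaCy 2.3.x environment is required):
# the installed model package's version must be compatible with the installed spaCy version.
import spacy

print("spaCy:", spacy.__version__)
nlp = spacy.load("en_core_web_sm")       # fails if no compatible model package is installed
print("model:", nlp.meta.get("version"))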

Suggestions to improve the grammar results for short sentences

Hello!

I have used the Gramformer model and I think it could be quite useful for checking and correcting some grammar points, especially singular/plural forms, verb forms and tenses, and spelling. However, some other grammar points (like sentence structure, comparative/superlative forms, pronoun cases, etc.) still seem tricky.

Note: I need to use the model on short sentences.

The biggest challenge I faced in my case is the following (please suggest how to avoid or improve it, or which parameters to change):
1 - Since it corrects grammar by generating text, most of the time it completely changes and rephrases the sentence. How can we avoid this?

whose bags you can bring? --> Which bags you can bring? (just a sample; sometimes it generates a completely different, verbose sentence)

2 - Every time I give the same sentence as input, it generates different outputs:

I go can there: three outputs in three different runs ("I go, there.", "can I go there?", "I go back there.")

Thanks!
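
On point 2, the varying outputs come from sampling during generation; a minimal sketch of pinning the seed before a run (mirroring the set_seed helper from the example quoted earlier on this page) makes repeated runs on the same sentence deterministic:

# Minimal sketch: fix the torch seed so repeated calls on the same input give the same output.
# Assumes gf is an initialized Gramformer corrector.
import torch

def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(1212)
print(gf.correct("I go can there", max_candidates=1))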
