dadmatech / dadmatools

DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.

License: Apache License 2.0

Python 100.00%

chunker constituency-parser dataset-loader dependency-parser embedding-vectors embeddings lemmatizer natural-language-processing ner nlptoolkit persian persian-nlp postagger spacy tokenizer

dadmatools's Issues

Set version 3.0.0 as a default version.

Replacing 'sentences' with 'sentence'

Improve NER

NER needs some important labels, such as date, and we should also improve the model's generalization.

The 'sklearn' PyPI package is deprecated, use 'scikit-learn'

When installing the package, the following error occurs:
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
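
The message points at a dependency declared under the old `sklearn` name; the remedy it suggests is to declare `scikit-learn` instead. Below is a minimal sketch of that rename, assuming the requirements are available as a list of strings. The `fix_requirement` helper is hypothetical and not part of DadmaTools; the sample entries are taken from the pip log in the "pip can't install dadmatools" report.

```python
import re

# Matches a requirement line for the deprecated "sklearn" distribution,
# with or without a version specifier (for example "sklearn>=0.0").
_SKLEARN_RE = re.compile(r"\s*sklearn\s*([<>=!~].*)?")


def fix_requirement(requirement: str) -> str:
    """Return 'scikit-learn' for a deprecated 'sklearn' entry, else the input unchanged."""
    if _SKLEARN_RE.fullmatch(requirement):
        # The old specifier referred to the placeholder package, so it is dropped.
        return "scikit-learn"
    return requirement


requirements = ["bpemb>=0.3.3", "nltk", "sklearn>=0.0"]
print([fix_requirement(r) for r in requirements])
```

Until a release with a corrected dependency list is available, installing `scikit-learn` separately does not help by itself, because pip still tries to build the `sklearn` placeholder that the package metadata requests.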

Cannot load fa_tokenizer.pt

I have downloaded fa_tokenizer.pt manually from the URL https://www.dropbox.com/s/bajpn68bp11o78s/fa_ewt_tokenizer.pt?dl=1. The file is 636k in size, and its md5 is:

2097a125c5f85b36d569857bd60d51b7  fa_tokenizer.pt

It cannot be loaded, however:

import dadmatools.pipeline.language as language

# here lemmatizer and pos tagger will be loaded
# as tokenizer is the default tool, it will be loaded as well even without calling
pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
nlp = language.Pipeline(pips)

# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))

# doc is an SpaCy object
doc = nlp('از قصهٔ کودکیشان که می‌گفت، گاهی حرص می‌خورد!')

Model fa_tokenizer exists in /Users/evar/.pernlp/fa_tokenizer.pt
2022-11-21 09:05:41,580 Cannot load model from /Users/evar/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/saved_models/fa_tokenizer/fa_tokenizer.pt
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 # here lemmatizer and pos tagger will be loaded
      4 # as tokenizer is the default tool, it will be loaded as well even without calling
      5 pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
----> 6 nlp = language.Pipeline(pips)
      8 # you can see the pipeline with this code
      9 print(nlp.analyze_pipes(pretty=True))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:258, in Pipeline.__new__(cls, pipeline)
    257 def __new__(cls, pipeline):
--> 258     language = NLP('fa', pipeline)
    259     nlp = language.nlp
    260     return nlp

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:64, in NLP.__init__(self, lang, pipelines)
     58 # if 'def-norm' in pipelines:
     59 #     global normalizer_model
     60 #     normalizer_model = normalizer.load_model()
     61 #     self.nlp.add_pipe('normalizer', first=True)
     63 global tokenizer_model
---> 64 tokenizer_model = tokenizer.load_model()
     65 self.nlp.add_pipe('tokenizer')
     67 global mwt_model

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenizer.py:125, in load_model()
    123 mwt_dict = load_mwt_dict(args['mwt_json_file'])
    124 use_cuda = args['cuda'] and not args['cpu']
--> 125 trainer = Trainer(model_file=args['save_dir'], use_cuda=use_cuda)
    126 loaded_args, vocab = trainer.args, trainer.vocab
    128 for k in loaded_args:

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:19, in Trainer.__init__(self, args, vocab, model_file, use_cuda)
     16 self.use_cuda = use_cuda
     17 if model_file is not None:
     18     # load everything from file
---> 19     self.load(model_file)
     20 else:
     21     # build model from scratch
     22     self.args = args

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:85, in Trainer.load(self, filename)
     83 def load(self, filename):
     84     try:
---> 85         checkpoint = torch.load(filename, lambda storage, loc: storage)
     86     except BaseException:
     87         logger.error("Cannot load model from {}".format(filename))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:713, in load(f, map_location, pickle_module, **pickle_load_args)
    711             return torch.jit.load(opened_file)
    712         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:938, in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    936 assert key in deserialized_objects
    937 typed_storage = deserialized_objects[key]
--> 938 typed_storage._storage._set_from_file(
    939     f, offset, f_should_read_directly,
    940     torch._utils._element_size(typed_storage.dtype))
    941 if offset is not None:
    942     offset = f.tell()

RuntimeError: unexpected EOF, expected 312321 more bytes. The file might be corrupted.

I am using dadmatools==1.5.2, Python 3.10, macOS 12.2.1.
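
The "unexpected EOF, expected 312321 more bytes" message suggests the file on disk is shorter than the checkpoint it claims to be, which is what an interrupted or partial download would produce. A minimal sketch for checking a downloaded file before loading it, assuming a reference checksum and size can be obtained from the maintainers. Both reference values are hypothetical inputs here; the md5 quoted above describes the local copy, which may itself be the truncated one.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_complete(path, expected_md5, expected_size=None):
    """Return True if the file matches the reference size (if given) and checksum."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5.lower()
```

If the check fails, deleting the cached file and downloading it again is the obvious next step.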

Error on loading parsbert

Hi,
I get the following error while loading models with Python 3.9:

RuntimeError Traceback (most recent call last)
Input In [31], in
1 import dadmatools.pipeline.language as language
3 pips = 'tok,lem,pos,dep,chunk,cons'
----> 4 nlp = language.Pipeline(pips)
5 def dadmatokenize(text):
6 doc = nlp(text)

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:216, in Pipeline.__new__(cls, pipeline)
215 def __new__(cls, pipeline):
--> 216 language = NLP('fa', pipeline)
217 nlp = language.nlp
218 return nlp

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:71, in NLP.__init__(self, lang, pipelines)
69 if 'dep' or 'chunk' in pipelines:
70 global depparser_model
---> 71 depparser_model = dp.load_model()
72 self.nlp.add_pipe('dependancyparser')
74 if 'pos' or 'chunk' in pipelines:

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:148, in load_model()
145 config['target_dir'] = prefix + config['target_dir']
146 config['embeddings']['BertEmbeddings-0']['bert_model_or_path'] = prefix + config['embeddings-saved-dir']
--> 148 student=create_model(config)
149 base_path=Path(config['target_dir'])/config['model_name']
151 return student

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:120, in create_model(config)
118 tagger = tagger.load(base_path / "final-model.pt")
119 elif (base_path).exists():
--> 120 tagger = tagger.load(base_path)
121 else:
122 assert 0, str(base_path)+ ' not exist!'

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/flair/nn.py:105, in Model.load(cls, model, device)
102 # load_big_file is a workaround by https://github.com/highway11git to load models on some Mac/Windows setups
103 # see flairNLP/flair#351
104 f = flair.file_utils.load_big_file(str(model_file))
--> 105 state = torch.load(f, map_location=device)
107 model = cls._init_model_with_state_dict(state, testing = device=='cpu')
109 model.eval()

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:600, in load(f, map_location, pickle_module, **pickle_load_args)
595 if _is_zipfile(opened_file):
596 # The zipfile reader is going to advance the current file position.
597 # If we want to actually tail call to torch.jit.load, we need to
598 # reset back to the original position.
599 orig_position = opened_file.tell()
--> 600 with _open_zipfile_reader(opened_file) as opened_zipfile:
601 if _is_torchscript_zip(opened_zipfile):
602 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
603 " dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to"
604 " silence this warning)", UserWarning)

File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:242, in _open_zipfile_reader.__init__(self, name_or_buffer)
241 def __init__(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
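
The traceback shows torch.load taking its zip-archive branch and then failing to find the central directory, which sits at the end of a zip file. One plausible cause is therefore a model file that was cut off during download. A minimal sketch for testing that hypothesis with the standard library, assuming the checkpoint is stored in the zip-based format; the helper name is hypothetical.

```python
import zipfile


def is_intact_zip_checkpoint(path):
    """Return True if `path` is a readable zip archive whose members pass their CRC checks."""
    # is_zipfile looks for the end-of-central-directory record,
    # which is missing when the tail of the file was never written.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip returns the name of the first corrupt member, or None if all are fine.
        return archive.testzip() is None
```

A False result for the file under dadmatools/saved_models would point to a broken download rather than a problem in the loading code. Checkpoints written in the older non-zip format would also return False, so the check is only meaningful for zip-based files.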

Update the Read the Docs documentation for the new version

pip can't install dadmatools

I tried to use the Colab notebook you provided, but unfortunately it cannot install dadmatools properly.

The error output is below:

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dadmatools
  Using cached dadmatools-1.5.2-py3-none-any.whl (862 kB)
Collecting bpemb>=0.3.3 (from dadmatools)
  Using cached bpemb-0.3.4-py3-none-any.whl (19 kB)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.8.1)
Requirement already satisfied: folium>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (0.14.0)
Requirement already satisfied: spacy>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.5.2)
Collecting sklearn>=0.0 (from dadmatools)
  Using cached sklearn-0.0.post5.tar.gz (3.7 kB)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I really want to use your tool, but I don't know how. Thanks for your project.

Does not recognize dadmatools as a package on Windows

Hello. When I import the library, it raises an error saying that dadmatools is not a package:
import dadmatools.pipline.language
ModuleNotFoundError: No module named 'dadmatools.pipline'; 'dadmatools' is not a package
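
Python adds "'dadmatools' is not a package" to this error when the name dadmatools resolves to a plain module rather than a package directory. A common cause is a local file named dadmatools.py that shadows the installed package. Separately, the import in this report is spelled `pipline`, while the other reports on this page import `dadmatools.pipeline`. A minimal diagnostic sketch follows; the `describe_module` helper is hypothetical and uses only the standard library.

```python
import importlib.util


def describe_module(name: str) -> str:
    """Report whether a top-level name resolves to a package, a plain module, or nothing."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found"
    if spec.submodule_search_locations is None:
        # A plain module has no search path for submodules, so `name.sub` cannot be imported.
        return f"{name}: plain module at {spec.origin} (not a package)"
    return f"{name}: package at {list(spec.submodule_search_locations)}"


print(describe_module("dadmatools"))
```

If the reported location is a single .py file in the working directory, renaming that file should let the installed package be found again.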

Add document Clustering task

Is there a way to correct the half-spaces?

Hello again,
As you know, correcting half-spaces in languages such as Persian can affect the performance of other NLP tools and applications. Is there a way to correct half-spaces in DadmaTools?
Thank you
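
For illustration, the half-space is the zero-width non-joiner (U+200C), and a correction step typically replaces an ordinary space with it around certain affixes. The sketch below is a rule-based example covering only two cases, the verbal prefixes می / نمی and the plural suffixes ها / های. It is not DadmaTools functionality, and both the `fix_half_spaces` name and the rule set are assumptions chosen to keep the example short.

```python
import re

ZWNJ = "\u200c"          # zero-width non-joiner, the Persian "half-space"
NA = "\u0646"            # ن (negation prefix)
MI = "\u0645\u06cc"      # می (verbal prefix)
HA = "\u0647\u0627"      # ها (plural suffix)
YEH = "\u06cc"           # ی (turns ها into های)

# A standalone می or نمی followed by whitespace and another word.
PREFIX_RE = re.compile(rf"\b({NA}?{MI})\s+(?=\S)")
# A word followed by whitespace and a standalone ها or های.
SUFFIX_RE = re.compile(rf"(?<=\S)\s+({HA}{YEH}?)\b")


def fix_half_spaces(text: str) -> str:
    """Replace the space around a few common affixes with a ZWNJ."""
    text = PREFIX_RE.sub(lambda m: m.group(1) + ZWNJ, text)
    text = SUFFIX_RE.sub(lambda m: ZWNJ + m.group(1), text)
    return text
```

Rules like these will misfire on words that merely look like affixes, so a production normalizer would need a lexicon or a trained model behind it.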

Installation Error in Ubuntu

ERROR: Could not build wheels for spacy, tokenizers which use PEP 517 and cannot be installed directly

I installed Rust, but the error still remains.

Thank you in advance.

Topic classification

It should have a feature to add new classes to the classification method.

Datasets could not be downloaded

Some of the datasets raise a runtime error when they are downloaded.

Having an Adapter Pipeline with multiple pre-trained models

parsbert error

file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/parsbert/config.json not found

OSError Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
511 # Load from URL or cache if already cached
--> 512 resolved_config_file = cached_path(
513 config_file,

~/anaconda3/lib/python3.9/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1377 # File, but it doesn't exist.
-> 1378 raise EnvironmentError(f"file {url_or_filename} not found")
1379 else:

OSError: file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/pa

Comparing our tool with other Persian tools

We have to compare our tool with other Persian tools in terms of speed, performance, and available models.

Error when importing "import dadmatools.pipeline.language as language"

Hello. When I try to run import dadmatools.pipeline.language as language on my local machine, I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to <undefined>
How can I fix this?

This is the full trace of the error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[34], line 1
----> 1 import dadmatools.pipeline.language as language
      3 # here lemmatizer and pos tagger will be loaded
      4 # as tokenizer is the default tool, it will be loaded as well even without calling
      5 pips = 'lem'

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\__init__.py:1
----> 1 from .language import Pipeline
      2 from .tpipeline import TPipeline
      3 from .language import supported_langs, langwithner, remove_with_path

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\language.py:4
      1 from typing import List
      3 from .config import config as master_config
----> 4 from .informal2formal.main import Informal2Formal
      5 from .models.base_models import Multilingual_Embedding
      6 from .models.classifiers import TokenizerClassifier, PosDepClassifier, NERClassifier, SentenceClassifier, \
      7     KasrehClassifier

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\main.py:6
      4 import yaml
      5 from .download_utils import download_dataset
----> 6 import dadmatools.pipeline.informal2formal.utils as utils
      7 from .formality_transformer import FormalityTransformer
      8 from dadmatools.pipeline.persian_tokenization.tokenizer import SentenceTokenizer

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\utils.py:10
      7 from dadmatools.pipeline.persian_tokenization.tokenizer import WordTokenizer
      8 from dadmatools.normalizer import Normalizer
---> 10 normalizer = Normalizer()
     11 tokenizer = WordTokenizer('cache/dadmatools')
     12 # tokenizer = WordTokenizer(separate_emoji=True)

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\normalizer.py:32, in Normalizer.__init__(self, full_cleaning, unify_chars, refine_punc_spacing, remove_extra_space, remove_puncs, remove_html, remove_stop_word, replace_email_with, replace_number_with, replace_url_with, replace_mobile_number_with, replace_emoji_with, replace_home_number_with)
     30 self.remove_puncs = remove_puncs
     31 self.remove_stop_word = remove_stop_word
---> 32 self.STOPWORDS = open(prefix+save_dir+'stopwords-fa.py').read().splitlines()
     33 self.PUNCS = string.punctuation.replace('<', '').replace('>', '') + '،؟'
     34 if full_cleaning:

File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to <undefined>
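
The failing line in normalizer.py calls open() without an encoding argument, so Python uses the platform default, which the last frame shows is cp1252 on this machine. Byte 0x81 has no mapping in cp1252, yet it occurs in UTF-8 text; the Persian letter ف, for instance, is encoded as 0xD9 0x81. A minimal sketch of reading such a file with an explicit encoding, assuming the stopword list is stored as UTF-8; the `read_lines` helper is hypothetical.

```python
def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, encoding=encoding) as handle:
        return handle.read().splitlines()
```

Until the library passes an encoding itself, running Python in UTF-8 mode (setting the environment variable PYTHONUTF8=1, or starting the interpreter with -X utf8) changes the default used by open() and may sidestep the error without editing the installed package.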

"Pip install dadmatools" install previous version of dadmatools.

We should have two versions: one for the legacy version and one for the adapter version.

Fix the sentiment analyzer bug in the adapter version.

Verb suffixes

Hello, and thank you for developing this tool.
I wanted to find the suffix (that is, the personal ending) of a verb, but I did not know whether this is possible at all.
For example, for the verb می‌رویم, I would like to get 'یم'. A few more examples of this kind:
خواهم نوشت => م
می‌نویسند => ند
Is this possible for Persian, and can it be done with this tool?
I would be grateful for a reply.
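
As an illustration only, the three examples above can be reproduced with a naive longest-suffix match over the six personal endings (یم، ید، ند، م، ی، د), taking the ending from the auxiliary when the verb is in the future tense. This is a hypothetical heuristic, not a DadmaTools feature, and it is wrong for forms whose stem happens to end in one of these letters (for instance a third-person singular past verb ending in د).

```python
# Personal endings, longest first so that "یم" wins over "م".
ENDINGS = [
    "\u06cc\u0645",  # یم
    "\u06cc\u062f",  # ید
    "\u0646\u062f",  # ند
    "\u0645",        # م
    "\u06cc",        # ی
    "\u062f",        # د
]
FUTURE_AUX = "\u062e\u0648\u0627\u0647"  # خواه, stem of the future auxiliary


def personal_ending(verb: str) -> str:
    """Guess the personal ending of a Persian verb form; return '' if none matches."""
    words = verb.split()
    if not words:
        return ""
    # In the future tense the ending sits on the auxiliary (خواهم نوشت), not on the main verb.
    if len(words) > 1 and words[0].startswith(FUTURE_AUX):
        target = words[0]
    else:
        target = words[-1]
    for ending in ENDINGS:
        if target.endswith(ending) and len(target) > len(ending):
            return ending
    return ""
```

A morphological analyzer or the lemmatizer's output would be a sounder basis than string matching, since the lemma makes it possible to separate the stem from the ending.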

word embedding

Thanks for developing such a nice library;

I wonder why English words were not removed before training, for example for GloVe.
Could you also share the code used to train the word embeddings (fastText, GloVe)?

Thanks.

Potential performance Issue: Slow read_csv() Function with pandas 1.3.3

Issue Description:

Hello.
I have discovered a performance degradation in the read_csv function of pandas version 1.3.3. Parts of this repository depend on pandas 1.3.3 (see dadmatools/requirements.txt), and some other dependencies require pandas below 1.4. I am not sure whether this performance problem affects this repository. There are related discussions on the pandas GitHub, including #44158 and #44610.
I also found that dadmatools/pipeline/informal2formal/utils.py and dadmatools/pipeline/informal2formal/VerbHandler.py use the affected API. Other files may use it as well.

Suggestion

I would recommend considering an upgrade to pandas >= 1.4 or exploring other ways to improve the performance of read_csv.
Any other workarounds or solutions would be greatly appreciated.
Thank you!

Add sentiment analysis to the GitHub README.md

Constituency parser

I am using 'import dadmatools.models.constituency_parser as cons', as written in the test file, but I get this error: ModuleNotFoundError: No module named 'dadmatools.models'. How can I solve it?
