dadmatech / dadmatools
DadmaTools is a set of Persian NLP tools developed by Dadmatech Co.
License: Apache License 2.0
NER needs some important labels such as date, and we should also improve the generalization of the model.
When installing the package, the following error occurs:
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
I have downloaded fa_tokenizer.pt manually from https://www.dropbox.com/s/bajpn68bp11o78s/fa_ewt_tokenizer.pt?dl=1 and it is 636k in size. Its md5 is:
2097a125c5f85b36d569857bd60d51b7 fa_tokenizer.pt
It cannot be loaded, however:
import dadmatools.pipeline.language as language
# here lemmatizer and pos tagger will be loaded
# as tokenizer is the default tool, it will be loaded as well even without calling
pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh'
nlp = language.Pipeline(pips)
# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))
# doc is an SpaCy object
doc = nlp('از قصهٔ کودکیشان که میگفت، گاهی حرص میخورد!')
Model fa_tokenizer exists in /Users/evar/.pernlp/fa_tokenizer.pt
2022-11-21 09:05:41,580 Cannot load model from /Users/evar/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/saved_models/fa_tokenizer/fa_tokenizer.pt
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [5], in <cell line: 6>()
3 # here lemmatizer and pos tagger will be loaded
4 # as tokenizer is the default tool, it will be loaded as well even without calling
5 pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh'
----> 6 nlp = language.Pipeline(pips)
8 # you can see the pipeline with this code
9 print(nlp.analyze_pipes(pretty=True))
File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:258, in Pipeline.__new__(cls, pipeline)
257 def __new__(cls, pipeline):
--> 258 language = NLP('fa', pipeline)
259 nlp = language.nlp
260 return nlp
File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:64, in NLP.__init__(self, lang, pipelines)
58 # if 'def-norm' in pipelines:
59 # global normalizer_model
60 # normalizer_model = normalizer.load_model()
61 # self.nlp.add_pipe('normalizer', first=True)
63 global tokenizer_model
---> 64 tokenizer_model = tokenizer.load_model()
65 self.nlp.add_pipe('tokenizer')
67 global mwt_model
File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenizer.py:125, in load_model()
123 mwt_dict = load_mwt_dict(args['mwt_json_file'])
124 use_cuda = args['cuda'] and not args['cpu']
--> 125 trainer = Trainer(model_file=args['save_dir'], use_cuda=use_cuda)
126 loaded_args, vocab = trainer.args, trainer.vocab
128 for k in loaded_args:
File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:19, in Trainer.__init__(self, args, vocab, model_file, use_cuda)
16 self.use_cuda = use_cuda
17 if model_file is not None:
18 # load everything from file
---> 19 self.load(model_file)
20 else:
21 # build model from scratch
22 self.args = args
File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:85, in Trainer.load(self, filename)
83 def load(self, filename):
84 try:
---> 85 checkpoint = torch.load(filename, lambda storage, loc: storage)
86 except BaseException:
87 logger.error("Cannot load model from {}".format(filename))
File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:713, in load(f, map_location, pickle_module, **pickle_load_args)
711 return torch.jit.load(opened_file)
712 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:938, in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
936 assert key in deserialized_objects
937 typed_storage = deserialized_objects[key]
--> 938 typed_storage._storage._set_from_file(
939 f, offset, f_should_read_directly,
940 torch._utils._element_size(typed_storage.dtype))
941 if offset is not None:
942 offset = f.tell()
RuntimeError: unexpected EOF, expected 312321 more bytes. The file might be corrupted.
I am using dadmatools==1.5.2, Python 3.10, macOS 12.2.1.
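The traceback ends with `unexpected EOF, expected 312321 more bytes`, which suggests the file on disk is shorter than the checkpoint expects. Below is a minimal sketch for checking a download before handing it to the library, using only the Python standard library. The function names are hypothetical and not part of dadmatools, and the expected size or checksum would have to come from the maintainers, since the report only gives the values of the file that failed to load.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for the file at `path`."""
    digest = hashlib.md5()
    size = 0
    with Path(path).open("rb") as handle:
        # Read in chunks so large checkpoints do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_expected(path, expected_size=None, expected_md5=None):
    """Compare a file against a known-good size and/or MD5 (both optional)."""
    size, md5_hex = file_fingerprint(path)
    if expected_size is not None and size != expected_size:
        return False
    if expected_md5 is not None and md5_hex != expected_md5.lower():
        return False
    return True
```

If the fingerprint differs between two downloads of the same URL, an interrupted download is the likely explanation; deleting the cached file and downloading again is the usual next step.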
Hi,
I have an issue while loading models with Python 3.9:
RuntimeError Traceback (most recent call last)
Input In [31], in
1 import dadmatools.pipeline.language as language
3 pips = 'tok,lem,pos,dep,chunk,cons'
----> 4 nlp = language.Pipeline(pips)
5 def dadmatokenize(text):
6 doc = nlp(text)
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:216, in Pipeline.__new__(cls, pipeline)
215 def __new__(cls, pipeline):
--> 216 language = NLP('fa', pipeline)
217 nlp = language.nlp
218 return nlp
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/pipeline/language.py:71, in NLP.__init__(self, lang, pipelines)
69 if 'dep' or 'chunk' in pipelines:
70 global depparser_model
---> 71 depparser_model = dp.load_model()
72 self.nlp.add_pipe('dependancyparser')
74 if 'pos' or 'chunk' in pipelines:
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:148, in load_model()
145 config['target_dir'] = prefix + config['target_dir']
146 config['embeddings']['BertEmbeddings-0']['bert_model_or_path'] = prefix + config['embeddings-saved-dir']
--> 148 student=create_model(config)
149 base_path=Path(config['target_dir'])/config['model_name']
151 return student
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/dependancy_parser.py:120, in create_model(config)
118 tagger = tagger.load(base_path / "final-model.pt")
119 elif (base_path).exists():
--> 120 tagger = tagger.load(base_path)
121 else:
122 assert 0, str(base_path)+ ' not exist!'
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/dadmatools/models/flair/nn.py:105, in Model.load(cls, model, device)
102 # load_big_file is a workaround by https://github.com/highway11git to load models on some Mac/Windows setups
103 # see flairNLP/flair#351
104 f = flair.file_utils.load_big_file(str(model_file))
--> 105 state = torch.load(f, map_location=device)
107 model = cls._init_model_with_state_dict(state, testing = device=='cpu')
109 model.eval()
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:600, in load(f, map_location, pickle_module, **pickle_load_args)
595 if _is_zipfile(opened_file):
596 # The zipfile reader is going to advance the current file position.
597 # If we want to actually tail call to torch.jit.load, we need to
598 # reset back to the original position.
599 orig_position = opened_file.tell()
--> 600 with _open_zipfile_reader(opened_file) as opened_zipfile:
601 if _is_torchscript_zip(opened_zipfile):
602 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
603 " dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to"
604 " silence this warning)", UserWarning)
File ~/repos/Nasim/nasim_venv/lib/python3.9/site-packages/torch/serialization.py:242, in _open_zipfile_reader.__init__(self, name_or_buffer)
241 def __init__(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
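The traceback shows `torch.load` taking its zip-archive branch and failing to find the central directory. A zip file's central directory sits at the end of the file, so this error commonly points to a truncated or incomplete model file, although that is an assumption about this report and not something it confirms. Below is a small sketch that inspects a checkpoint with the standard `zipfile` module only, without importing torch. The function name and the returned labels are made up for illustration.

```python
import zipfile
from pathlib import Path


def describe_checkpoint(path):
    """Classify a checkpoint file by looking at its zip structure only."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    if zipfile.is_zipfile(str(path)):
        # The end-of-central-directory record was found; verify member CRCs.
        with zipfile.ZipFile(str(path)) as archive:
            bad_member = archive.testzip()
        return "ok" if bad_member is None else "corrupt-member"
    with path.open("rb") as handle:
        magic = handle.read(4)
    # A local-file header without a central directory means the tail is gone.
    if magic == b"PK\x03\x04":
        return "truncated-zip"
    return "not-zip"
```

A result of `truncated-zip` for a file under the `saved_models` directory would be consistent with the error above; removing that file so it is fetched again is the usual remedy.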
I tried to use the Colab notebook you provided, but unfortunately it cannot install dadmatools properly.
The error is below:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dadmatools
Using cached dadmatools-1.5.2-py3-none-any.whl (862 kB)
Collecting bpemb>=0.3.3 (from dadmatools)
Using cached bpemb-0.3.4-py3-none-any.whl (19 kB)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.8.1)
Requirement already satisfied: folium>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (0.14.0)
Requirement already satisfied: spacy>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from dadmatools) (3.5.2)
Collecting sklearn>=0.0 (from dadmatools)
Using cached sklearn-0.0.post5.tar.gz (3.7 kB)
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (setup.py) ... error
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
I really want to use your tool, but I don't know how. Thanks for your project.
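The pip log shows `sklearn>=0.0` being collected as a dependency of dadmatools, and the metadata step of that deprecated alias is what fails. One possible workaround, sketched here under the assumption that you install from a local copy of the source whose requirements you can edit, is to rewrite the alias to `scikit-learn` before installing. The helper below is hypothetical and is not the maintainers' fix.

```python
import re

# Deprecated PyPI alias -> the distribution that should be installed instead.
RENAMED = {"sklearn": "scikit-learn"}

# Leading distribution name, followed by whatever version specifier remains.
_REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)(.*)$")


def fix_requirement(line):
    """Return `line` with a deprecated distribution name replaced, if any."""
    match = _REQUIREMENT.match(line)
    if match is None:
        return line  # comments, blank lines, options such as "-r other.txt"
    name, rest = match.groups()
    replacement = RENAMED.get(name.lower())
    if replacement is None:
        return line
    return replacement + rest


def fix_requirements(lines):
    """Apply fix_requirement to every line of a requirements list."""
    return [fix_requirement(line) for line in lines]
```

For example, `fix_requirements(["sklearn>=0.0", "spacy>=3.0.0"])` leaves the spacy line alone and renames only the first entry.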
Hello, when I import the library, it throws an error saying that dadmatools is not a package:
import dadmatools.pipline.language
ModuleNotFoundError: No module named 'dadmatools.pipline'; 'dadmatools' is not a package
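Python adds "is not a package" to this message when the top-level name resolves to a plain module, which often happens when a local script called `dadmatools.py` shadows the installed library. That cause is an assumption here, since the report does not show the working directory. Note also that the import above spells the submodule `pipline`, whereas the other reports on this page import `dadmatools.pipeline`. The sketch below uses `importlib.util.find_spec` to show where a top-level name resolves; the helper name is made up for illustration.

```python
import importlib.util


def locate(top_level_name):
    """Return (origin, is_package) for a top-level module name, or None."""
    spec = importlib.util.find_spec(top_level_name)
    if spec is None:
        return None  # nothing importable under that name
    # Packages have a search path for submodules; plain modules do not.
    is_package = spec.submodule_search_locations is not None
    return spec.origin, is_package
```

Calling `locate("dadmatools")` in the failing environment would show whether the name points into `site-packages` or at a file in the current project.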
Hello again,
As you know, correcting half-spaces in languages such as Persian can affect the performance of other NLP tools and applications. Is there a way to correct half-spaces in DadmaTools?
Thank you.
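To make the question concrete: a half-space is the zero-width non-joiner (U+200C), and a common correction is to replace a full space with it after the verb prefixes می / نمی and before the plural suffixes ها / های. The sketch below is a deliberately small, rule-based illustration of that idea. It is not DadmaTools code, the two rules are only a tiny subset of what a real normalizer needs, and it will mis-handle cases such as می used as a standalone word.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space

# Verb prefixes written with a full space before the stem.
_PREFIX = re.compile(r"\b(می|نمی) (?=\w)")
# Plural suffixes written with a full space after the noun.
_SUFFIX = re.compile(r"(?<=\w) (ها|های)\b")


def fix_half_spaces(text):
    """Replace the full space in two common patterns with a half-space."""
    text = _PREFIX.sub(r"\1" + ZWNJ, text)
    text = _SUFFIX.sub(ZWNJ + r"\1", text)
    return text
```

The word-boundary anchor keeps the prefix rule from firing inside words that merely end in می, such as قدیمی.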
ERROR: Could not build wheels for spacy, tokenizers which use PEP 517 and cannot be installed directly
I installed Rust, but the error remains.
Thank you in advance.
It should have a feature to add new classes to the classification method.
Some of the datasets raise a runtime error when downloading.
file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/parsbert/config.json not found
OSError Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
511 # Load from URL or cache if already cached
--> 512 resolved_config_file = cached_path(
513 config_file,
~/anaconda3/lib/python3.9/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1377 # File, but it doesn't exist.
-> 1378 raise EnvironmentError(f"file {url_or_filename} not found")
1379 else:
OSError: file /home/milad/anaconda3/lib/python3.9/site-packages/dadmatools/saved_models/parsbert/pa
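The error says that `config.json` is missing from the `saved_models/parsbert/parsbert` directory, which suggests the model files were never fully downloaded or extracted there. A quick way to confirm this before loading is to list which expected files are absent. In the sketch below only `config.json` comes from the traceback; the helper name is hypothetical, and any other file names you pass in are your own assumption about what the model directory should contain.

```python
from pathlib import Path


def missing_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not regular files in `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

An empty list means every listed file is present; a non-empty list names what has to be downloaded again.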
We have to compare our models with others in terms of speed and performance.
Hello, I have an issue: when I try to run import dadmatools.pipeline.language as language on my local machine, I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to <undefined>
How can I fix this?
This is the full trace of the error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[34], line 1
----> 1 import dadmatools.pipeline.language as language
3 # here lemmatizer and pos tagger will be loaded
4 # as tokenizer is the default tool, it will be loaded as well even without calling
5 pips = 'lem'
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\__init__.py:1
----> 1 from .language import Pipeline
2 from .tpipeline import TPipeline
3 from .language import supported_langs, langwithner, remove_with_path
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\language.py:4
1 from typing import List
3 from .config import config as master_config
----> 4 from .informal2formal.main import Informal2Formal
5 from .models.base_models import Multilingual_Embedding
6 from .models.classifiers import TokenizerClassifier, PosDepClassifier, NERClassifier, SentenceClassifier, \
7 KasrehClassifier
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\main.py:6
4 import yaml
5 from .download_utils import download_dataset
----> 6 import dadmatools.pipeline.informal2formal.utils as utils
7 from .formality_transformer import FormalityTransformer
8 from dadmatools.pipeline.persian_tokenization.tokenizer import SentenceTokenizer
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\pipeline\informal2formal\utils.py:10
7 from dadmatools.pipeline.persian_tokenization.tokenizer import WordTokenizer
8 from dadmatools.normalizer import Normalizer
---> 10 normalizer = Normalizer()
11 tokenizer = WordTokenizer('cache/dadmatools')
12 # tokenizer = WordTokenizer(separate_emoji=True)
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\site-packages\dadmatools\normalizer.py:32, in Normalizer.__init__(self, full_cleaning, unify_chars, refine_punc_spacing, remove_extra_space, remove_puncs, remove_html, remove_stop_word, replace_email_with, replace_number_with, replace_url_with, replace_mobile_number_with, replace_emoji_with, replace_home_number_with)
30 self.remove_puncs = remove_puncs
31 self.remove_stop_word = remove_stop_word
---> 32 self.STOPWORDS = open(prefix+save_dir+'stopwords-fa.py').read().splitlines()
33 self.PUNCS = string.punctuation.replace('<', '').replace('>', '') + '،؟'
34 if full_cleaning:
File c:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 220: character maps to <undefined>
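The last frames show `open(...)` being called without an `encoding` argument and the file then being decoded with `cp1252`, the default on this Windows machine. Byte 0x81 is undefined in cp1252, and it occurs in the UTF-8 encoding of Persian letters such as ف (0xD9 0x81), so a UTF-8 stop-word file fails to decode. A minimal sketch of reading such a file with an explicit encoding follows; the function name is made up, and whether the maintainers patched `normalizer.py` this way is not stated in the report.

```python
from pathlib import Path


def read_lines_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

Without editing the installed package, running Python in UTF-8 mode (`python -X utf8`, or setting the environment variable `PYTHONUTF8=1`) makes `open()` default to UTF-8 and may avoid the error.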
We should have two versions: one legacy version and one adapter version.
Hello, and thank you for developing this tool.
I wanted to find the suffix (the personal ending) of a verb, but I did not know whether this is possible at all.
For example, for the verb میرویم, I would like to get 'یم'. A few more examples of the same kind:
خواهم نوشت => م
مینویسند => ند
Is this possible for Persian, and can it be done with this tool?
I would be grateful for an answer.
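To illustrate what is being asked, here is a naive longest-suffix heuristic over the six Persian personal endings. It is not a DadmaTools feature and not a morphological analyzer: it reproduces the three examples above, but it will give wrong answers for forms whose stem happens to end in one of these letters (for instance a third-person singular past such as آمد, which has no ending). All names in the sketch are made up for illustration.

```python
ZWNJ = "\u200c"  # half-space, sometimes written between می and the stem

# Personal endings, longest first so that "یم" is tried before "م".
ENDINGS = ("یم", "ید", "ند", "م", "ی", "د")

# In the future tense the ending sits on the auxiliary (خواهم نوشت).
FUTURE_AUX_STEM = "خواه"


def personal_ending(verb):
    """Guess the personal ending of a verb form by suffix matching."""
    tokens = verb.replace(ZWNJ, " ").split()
    if not tokens:
        return None
    if len(tokens) > 1 and tokens[0].startswith(FUTURE_AUX_STEM):
        carrier = tokens[0]
    else:
        carrier = tokens[-1]
    for ending in ENDINGS:
        # Require at least one character of stem before the ending.
        if carrier.endswith(ending) and len(carrier) > len(ending):
            return ending
    return None
```

A reliable answer would need the verb's lemma or stem (for example from a lemmatizer) so that the ending can be separated from the stem instead of guessed.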
Thanks for developing such a nice library;
Thanks.
Issue Description:
Hello.
I have discovered a performance degradation in the read_csv function of pandas version 1.3.3. I noticed that some parts of the repository depend on pandas 1.3.3 in dadmatools/requirements.txt, and some other dependencies require pandas below 1.4. I am not sure whether this performance problem in pandas will affect this repository. I found some discussions on the pandas GitHub related to this issue, including #44158 and #44610.
I also found that dadmatools/pipeline/informal2formal/utils.py and dadmatools/pipeline/informal2formal/VerbHandler.py use the affected API. There may be more files using the affected API.
Suggestion
I would recommend considering an upgrade to pandas >= 1.4 or exploring other solutions to optimize the performance of read_csv.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
I am using 'import dadmatools.models.constituency_parser as cons' as written in the test file, but I get the error 'ModuleNotFoundError: No module named 'dadmatools.models''. How can I solve it?