lpty / nlp_base
Basic natural language models
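I built my own dataset to train the Chinese interrogative-sentence classifier. At test time, every input sentence gets the same prob score and is always classified as negative, whatever I enter. What is the cause?
The training corpus for the model is probably missing.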
Sorry to bother you again.
When I call "train()", the following error occurs:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.173 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
File "", line 1, in <module>
File "interrogative/api.py", line 17, in train
model.train()
File "interrogative/model.py", line 76, in train
self.initialize_model()
File "interrogative/model.py", line 31, in initialize_model
train, label = self.corpus.generator()
File "interrogative/corpus.py", line 62, in generator
corpus = cls.read_corpus_from_file(corpus_path)
File "interrogative/corpus.py", line 34, in perform_word_segment
tokenizer = jieba.Tokenizer()
File "/home1/liuxin/anaconda3/envs/py27/lib/python2.7/site-packages/pandas/core/series.py", line 3591, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
File "interrogative/corpus.py", line 34, in
tokenizer = jieba.Tokenizer()
File "/home1/liuxin/.local/lib/python2.7/site-packages/jieba/__init__.py", line 282, in cut
sentence = strdecode(sentence)
File "/home1/liuxin/.local/lib/python2.7/site-packages/jieba/_compat.py", line 37, in strdecode
sentence = sentence.decode('utf-8')
AttributeError: 'float' object has no attribute 'decode'
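The last frame shows jieba receiving a float instead of a string. A common cause, assumed here rather than confirmed in the thread, is an empty cell in the corpus file, which pandas loads as NaN (a float). The following is a minimal Python 3 sketch of filtering such rows before segmentation; `is_valid_text` and `drop_invalid_rows` are hypothetical helpers and not part of nlp_base.

```python
def is_valid_text(value):
    """Return True only for non-empty strings (NaN floats and None fail)."""
    return isinstance(value, str) and value.strip() != ""


def drop_invalid_rows(rows):
    """Keep only (text, label) pairs whose text can be passed to a tokenizer."""
    return [(text, label) for text, label in rows if is_valid_text(text)]


# float("nan") stands in for what pandas yields for an empty CSV cell.
rows = [("公司在哪里", 1), (float("nan"), 0), ("", 0), ("我在这里", 0)]
clean_rows = drop_invalid_rows(rows)
```

With pandas itself, the equivalent step would be to drop missing values from the text column (for example with `dropna`) before applying the tokenizer.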
I would like to build on your XGBoost-based Chinese interrogative-sentence classifier, but I need the corpus.
The content of question_recog.csv is:
content,label
在么,1
你好,0
公司在哪里,1
需要多少钱,1
未成年可以贷款吗,1
你现在在干什么,1
我在这里,0
When I call 'train()', the following error occurs:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.201 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
File "", line 1, in <module>
File "interrogative/api.py", line 17, in train
model.train()
File "interrogative/model.py", line 77, in train
_, best_param, best_iter_round = self.model_param_select()
File "interrogative/model.py", line 63, in model_param_select
early_stopping_rounds=self.early_stopping_rounds) # stop when metrics not get better
File "/home1/liuxin/anaconda3/envs/py27/lib/python2.7/site-packages/xgboost/training.py", line 446, in cv
res = aggcv([f.eval(i, feval) for f in cvfolds])
File "/home1/liuxin/anaconda3/envs/py27/lib/python2.7/site-packages/xgboost/training.py", line 234, in eval
return self.bst.eval_set(self.watchlist, iteration, feval)
File "/home1/liuxin/anaconda3/envs/py27/lib/python2.7/site-packages/xgboost/core.py", line 1173, in eval_set
ctypes.byref(msg)))
File "/home1/liuxin/anaconda3/envs/py27/lib/python2.7/site-packages/xgboost/core.py", line 178, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [10:15:09] /workspace/src/metric/rank_metric.cc:144: Check failed: !auc_error AUC: the dataset only contains pos or neg samples
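The sample file quoted above has seven rows, only two of which are negative. Assuming five-fold cross-validation (the nfold value in the configuration quoted elsewhere on this page), at least three validation folds cannot contain a negative sample, and AUC is undefined on a fold with a single class. The sketch below is a plain pre-check, not part of nlp_base or XGBoost; `classes_too_small_for_cv` is a hypothetical helper.

```python
from collections import Counter


def classes_too_small_for_cv(labels, nfold, classes=(0, 1)):
    """Return {label: count} for classes with fewer samples than folds.

    If a class has fewer samples than folds, at least one validation fold
    holds no sample of that class, so AUC cannot be computed on that fold.
    """
    counts = Counter(labels)
    return {c: counts[c] for c in classes if counts[c] < nfold}


# Labels of the seven rows in the question_recog.csv sample.
labels = [1, 0, 1, 1, 1, 1, 0]
too_small = classes_too_small_for_cv(labels, nfold=5)
```

An empty result means every class has at least one sample available per fold; here the negative class is reported, which suggests adding more negative examples or lowering the number of folds.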
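As the title says.
I have finished a word segmentation model and am satisfied with the results. If I want to use your POS tagger to train a model that tags each word token based on the segmentation output, what structure should the corpus be converted into?
As the title says.
I want to run your code, but I could not find a usable model. Could you send me your trained model and vocabulary dictionary?
When training a CRF for named entity recognition, how can part-of-speech features be added on top of the character features to improve accuracy?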
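One way to approach this, sketched under assumptions, is to tag the sentence at word level first and then copy each word's POS tag onto its characters, together with the character's position inside the word. The feature-dict format below is the one accepted by python-crfsuite and sklearn-crfsuite; whether nlp_base uses the same format is not stated in the thread. The function names and the sample tags are illustrative only, and the (word, pos) pairs could come from any tagger, for example `jieba.posseg`.

```python
def char_pos_sequence(tagged_words):
    """Expand (word, pos) pairs into per-character (char, pos, seg) triples.

    seg is "B" for the first character of a word and "I" otherwise.
    """
    triples = []
    for word, pos in tagged_words:
        for i, ch in enumerate(word):
            triples.append((ch, pos, "B" if i == 0 else "I"))
    return triples


def char_features(triples, i):
    """Build the feature dict for character i, including POS context."""
    ch, pos, seg = triples[i]
    feats = {"bias": 1.0, "char": ch, "pos": pos, "seg": seg}
    if i > 0:
        prev_ch, prev_pos, _ = triples[i - 1]
        feats["-1:char"] = prev_ch
        feats["-1:pos"] = prev_pos
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(triples) - 1:
        next_ch, next_pos, _ = triples[i + 1]
        feats["+1:char"] = next_ch
        feats["+1:pos"] = next_pos
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tagged_words):
    """Return one feature dict per character of the sentence."""
    triples = char_pos_sequence(tagged_words)
    return [char_features(triples, i) for i in range(len(triples))]


# Illustrative input: word-level POS tags for a short phrase.
tagged = [("国务院", "nt"), ("总理", "n"), ("李鹏", "nr")]
features = sentence_features(tagged)
```

The resulting list has one dict per character, so it lines up with a character-level BIO label sequence of the same length.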
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 0.906 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
File "C:/Users/Lenovo/Desktop/interrogative/manage.py", line 3, in <module>
train()
File "C:\Users\Lenovo\Desktop\interrogative\src\api.py", line 17, in train
model.train()
File "C:\Users\Lenovo\Desktop\interrogative\src\model.py", line 81, in train
_, best_param, best_iter_round = self.model_param_select()
File "C:\Users\Lenovo\Desktop\interrogative\src\model.py", line 67, in model_param_select
early_stopping_rounds=self.early_stopping_rounds) # stop when metrics not get better
File "D:\python project\venv\lib\site-packages\xgboost\training.py", line 445, in cv
fold.update(i, obj)
File "D:\python project\venv\lib\site-packages\xgboost\training.py", line 230, in update
self.bst.update(self.dtrain, iteration, fobj)
File "D:\python project\venv\lib\site-packages\xgboost\core.py", line 1109, in update
dtrain.handle))
File "D:\python project\venv\lib\site-packages\xgboost\core.py", line 176, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: Invalid Parameter format for silent expect boolean but value='91'
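The rejected value '91' happens to be the code point of '[', which is what Python 3 returns when a bytes object such as b'[1]' is indexed at position 0. One possible reading, an assumption the thread does not confirm, is that the 'silent' entry reached the parameter grid as unparsed bytes rather than as the list [1]. The sketch below shows that behaviour and a simple guard; `find_unparsed_grid_params` is a hypothetical helper.

```python
def find_unparsed_grid_params(grid):
    """Return names of grid entries that are not lists or tuples of candidates."""
    return sorted(name for name, value in grid.items()
                  if not isinstance(value, (list, tuple)))


raw = b"[1]"
first_of_bytes = raw[0]   # indexing bytes yields an int in Python 3
first_of_list = [1][0]    # the value the grid was meant to supply

# A grid in which one entry was never parsed from its raw form.
grid = {"max_depth": [4, 5, 6], "eta": [0.1, 0.05, 0.02], "silent": raw}
unparsed = find_unparsed_grid_params(grid)
```

Running such a check before cross-validation would point at the configuration parsing step rather than at XGBoost.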
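Hello,
I was glad to come across your code. While running the CRF, I wondered: isn't computing F1 over the (BIO) tag classes a poor choice? I think F1 computed over the entities actually recognized would be better.
Hello, while studying your code I ran /interrogative/manage.py and got this error: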
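The difference between the two metrics can be sketched as follows. An entity counts as correct only when its type and both boundaries match, so one wrong tag inside an entity costs the whole entity. The tag names ("B-PER", "I-LOC" and so on) are an assumption about the label scheme, and the functions are illustrative rather than taken from nlp_base.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence.

    end is exclusive. An I- tag that does not continue an open entity of
    the same type is ignored in this sketch.
    """
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            entities.add((etype, start, i))
        start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Compute entity-level precision, recall and F1 for one sequence."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
precision, recall, f1 = entity_f1(gold, pred)
```

In this example five of the six tags are correct, yet only one of the two entities is recovered exactly, so the entity-level scores are lower than a tag-level view would suggest. For a whole corpus, the counts would be accumulated over all sentences before computing the ratios.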
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1758, in <module>
main()
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/jinglun/PycharmProjects/DownloadProjects/nlp_base/interrogative/manage.py", line 4, in <module>
train()
File "/Users/jinglun/PycharmProjects/DownloadProjects/nlp_base/interrogative/src/api.py", line 17, in train
model.train()
File "/Users/jinglun/PycharmProjects/DownloadProjects/nlp_base/interrogative/src/model.py", line 81, in train
self.initialize_model()
File "/Users/jinglun/PycharmProjects/DownloadProjects/nlp_base/interrogative/src/model.py", line 40, in initialize_model
self.max_depth = to_json(self.config.get('model', 'max_depth'))
File "/Users/jinglun/PycharmProjects/DownloadProjects/nlp_base/interrogative/src/util.py", line 16, in to_json
return demjson.decode(text, encoding='utf-8')
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 5699, in decode
return_stats=(return_stats or write_stats) )
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 4915, in decode
raise errors[0]
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 2428, in set_input
self.buf = buffered_stream( txt, encoding=encoding )
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 1614, in __init__
self.set_text( txt, encoding )
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 1685, in set_text
raise newerr
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 1675, in set_text
decoded = helpers.unicode_decode( txt, encoding )
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/site-packages/demjson.py", line 1256, in unicode_decode
unitxt, numbytes = cdk.decode( txt, **cdk_kw ) # DO THE DECODE HERE!
File "/Users/jinglun/software/miniconda2/envs/nlp_base36/lib/python3.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
demjson.JSONDecodeError: a Unicode decoding error occurred
The failing line is this one in /interrogative/model.py:
self.max_depth = to_json(self.config.get('model', 'max_depth'))
The model section of my config.py is unchanged:
'model': {
'max_depth': [4, 5, 6],
'eta': [0.1, 0.05, 0.02],
'subsample': [0.5, 0.7, 1.0],
'max_iterations': 100,
'objective': ['binary:logistic'],
'silent': [1],
'num_boost_round': 2000,
'nfold': 5,
'stratified': 1,
'metrics': 'auc',
'early_stopping_rounds': 50,
'model_path': ' src/data/{}.model'
}
I don't understand why this error is raised, and searching online turned up no solution. How can I fix it?
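The traceback ends inside demjson while it decodes the text as UTF-8 under Python 3.6. One possible cause, an assumption not confirmed in the thread, is that the value is already a text string there, so the explicit decoding step is what fails. The sketch below parses such values with the standard library only. It reuses the name `to_json` from the traceback for readability but is not the author's implementation, and it assumes the config values are written as JSON or as Python literals.

```python
import ast
import json


def to_json(text):
    """Parse a config value given as JSON text or as a Python literal."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    try:
        return json.loads(text)
    except ValueError:
        # Single-quoted values such as "['binary:logistic']" are not valid
        # JSON, so fall back to safe evaluation of Python literals.
        return ast.literal_eval(text)


max_depth = to_json("[4, 5, 6]")
objective = to_json(b"['binary:logistic']")
nfold = to_json("5")
```

`ast.literal_eval` only evaluates literals (lists, numbers, strings and similar), so it does not execute arbitrary code from the config.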
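Hello, could you share the data, or point to a way of obtaining the training corpus?
Hello, regarding your post (Dependency parsing: implementing a Chinese dependency parser based on sequence labeling, https://blog.csdn.net/sinat_33741547/article/details/79321401): could you provide the corpus preprocessing program? My run also fails with ConfigParser.NoSectionError: No section: 'depparser'. If possible, could you guide me remotely? I have been debugging for many days. Contact: [email protected].
I already have your word segmentation code working and would really appreciate your help.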
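A NoSectionError often means the config file was never read: the parser's `read()` silently skips paths it cannot open and only returns the list of files it actually loaded. Whether that is what happened here is an assumption. The sketch below uses the Python 3 module name `configparser` (the traceback comes from Python 2, where the module is `ConfigParser`); `load_config` and the option name `corpus_path` are hypothetical.

```python
import configparser
import os
import tempfile


def load_config(path, required_section):
    """Read an INI file and fail early with a clear message."""
    parser = configparser.ConfigParser()
    loaded = parser.read(path, encoding="utf-8")  # files actually read
    if not loaded:
        raise FileNotFoundError("config file not found: " + path)
    if not parser.has_section(required_section):
        raise KeyError("section missing: " + required_section)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "config.ini")
    with open(good_path, "w", encoding="utf-8") as handle:
        handle.write("[depparser]\ncorpus_path = data/corpus.txt\n")
    parser = load_config(good_path, "depparser")
    corpus_path = parser.get("depparser", "corpus_path")

    # A wrong path is reported directly instead of surfacing later
    # as a missing section.
    try:
        load_config(os.path.join(tmp, "missing.ini"), "depparser")
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = str(exc)
```

Relative config paths are resolved against the current working directory, so running the script from a different directory is a frequent reason for the file not being found.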
When I use 'from interrogative.api import *'
The error occurs:
Traceback (most recent call last):
File "", line 1, in <module>
File "/home1/lx/nlp_base/interrogative/interrogative/api.py", line 7, in <module>
from model import get_model
ModuleNotFoundError: No module named 'model'
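ModuleNotFoundError exists only in Python 3.6 and later, and Python 3 no longer resolves `from model import ...` against the importing file's own package. Inside a package the import has to be explicit, either relative (`from .model import get_model`) or absolute (`from interrogative.model import get_model`). The sketch below reproduces both behaviours with a throwaway package; the names `demo_interrogative` and `demo_model` are hypothetical stand-ins.

```python
import importlib
import os
import sys
import tempfile

PACKAGE = "demo_interrogative"  # hypothetical package name for this demo


def write_file(path, content):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)


def try_import(module_name):
    """Import a module and return get_model()'s result or the error type."""
    try:
        module = importlib.import_module(module_name)
        return module.get_model()
    except ImportError as exc:  # ModuleNotFoundError is a subclass
        return type(exc).__name__


with tempfile.TemporaryDirectory() as root:
    package_dir = os.path.join(root, PACKAGE)
    os.makedirs(package_dir)
    write_file(os.path.join(package_dir, "__init__.py"), "")
    write_file(os.path.join(package_dir, "demo_model.py"),
               "def get_model():\n    return 'model loaded'\n")
    # Implicit relative import, the style used by the failing api.py line.
    write_file(os.path.join(package_dir, "api_implicit.py"),
               "from demo_model import get_model\n")
    # Explicit relative import, which Python 3 requires inside a package.
    write_file(os.path.join(package_dir, "api_explicit.py"),
               "from .demo_model import get_model\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        implicit_result = try_import(PACKAGE + ".api_implicit")
        explicit_result = try_import(PACKAGE + ".api_explicit")
    finally:
        sys.path.remove(root)
```

The implicit form fails with ModuleNotFoundError while the explicit form imports normally, which matches the error reported above.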
from ner.api import recognize
sentence = (u'新华社北京十二月三十一日电(**人民广播电台记者刘振英、新华社记者张宿堂)今天是一九九七年的最后一天。'
            u'辞旧迎新之际,国务院总理李鹏今天上午来到北京石景山发电总厂考察,向广大企业职工表示节日的祝贺,'
            u'向将要在节日期间坚守工作岗位的同志们表示慰问')
predict = recognize(sentence)
##########################################################################
y_predict = self.model.predict(features)
This step returns <type 'list'>: [['O', 'O', 'O', ...]], with every tag in the sequence being 'O'. Have you run into this problem?
The NER training corpus can no longer be found. Could you provide a link to it?