v-mipeng / LexiconNER
Lexicon-based Named Entity Recognition
License: Apache License 2.0
I merged the datasets of all entity types (i.e., all train.XXX.txt files) and trained the vanilla BiLSTM+CRF directly on the merged set. The overall F1 exceeded 90.0, which seems unreasonably high given that the labels were generated by dictionaries. Did I misunderstand anything? Many thanks!
I use 100-dimensional GloVe embeddings and 30-dimensional character embeddings (from an LSTM).
The hidden dimension is 200 (i.e., 100 per direction), the dropout rate is 0.5, the optimizer is SGD with a learning rate of 0.01, and the batch size is 32.
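For concreteness, the setup described above can be sketched as follows. This is a hypothetical skeleton, not the repository's actual model; the vocabulary size and the character-level feature extractor are assumed, and the CRF layer is omitted:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the described encoder: 100-dim word embeddings plus
    30-dim character features into a BiLSTM with 100 units per direction."""

    def __init__(self, vocab_size=10000, word_dim=100, char_feat_dim=30):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.dropout = nn.Dropout(0.5)
        # Input per token = word embedding + character feature vector.
        self.lstm = nn.LSTM(word_dim + char_feat_dim, 100,
                            batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_feats):
        x = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        out, _ = self.lstm(self.dropout(x))
        return out  # shape: (batch, seq_len, 200)

enc = BiLSTMEncoder()
# Optimizer as described: SGD with learning rate 0.01.
opt = torch.optim.SGD(enc.parameters(), lr=0.01)
h = enc(torch.randint(0, 10000, (32, 20)), torch.randn(32, 20, 30))
print(h.shape)  # torch.Size([32, 20, 200])
```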
When I run feature_pu_model.py I get the following error:
Traceback (most recent call last):
File "feature_pu_model.py", line 11, in <module>
from utils.data_utils import DataPrepare
ModuleNotFoundError: No module named 'utils.data_utils'
Could you please help? Thank you.
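This error usually means Python cannot find the `utils` package on its module search path. Assuming the repository keeps `data_utils.py` under a `utils/` directory at the repo root (as the import suggests), running the script from that root, or putting the root on `PYTHONPATH`, typically resolves it. A self-contained demonstration of the mechanism in a scratch directory:

```shell
# A package only imports if its parent directory is on the search path.
mkdir -p /tmp/lexdemo/utils
touch /tmp/lexdemo/utils/__init__.py
echo "class DataPrepare: pass" > /tmp/lexdemo/utils/data_utils.py
PYTHONPATH=/tmp/lexdemo python -c "from utils.data_utils import DataPrepare; print('ok')"
```

A missing `utils/__init__.py` can cause the same error on older Python versions, so that is worth checking too.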
Thanks for sharing your paper and code; the paper is interesting.
I downloaded the code and ran feature_pu_model.py.
I found that the calculation of
Looking forward to your replies.
Hello author, I tried to reproduce the PU algorithm on the Chinese dataset ResumeNER. By repeatedly tuning the loss weight parameter and the positive-class prior, I managed to reproduce one entity type, but I could not reproduce the others. The choice of these two parameters feels quite arbitrary, so I would like to ask what the underlying principle for setting them is, and whether you have tried the method on Chinese datasets. Many thanks!
Thanks for the paper and code!
The calculation of risk in the bnPU setup is a little confusing.
In the paper, the non-negativity constraint makes Risk = π · pRisk + max(0, nRisk).
However, when nRisk < self.beta, the following code sets risk = -self.gamma * nRisk.
Could you please explain why the risk is calculated this way when nRisk is smaller than the small threshold beta? I cannot match the code with the equations in the paper.
hP = result.masked_select(torch.from_numpy(postive).byte().cuda()).contiguous().view(-1, 2)
hU = result.masked_select(torch.from_numpy(unlabeled).byte().cuda()).contiguous().view(-1, 2)
if len(hP) > 0:
    pRisk = self.model.loss_func(1, hP, args.type)
else:
    pRisk = torch.FloatTensor([0]).cuda()
uRisk = self.model.loss_func(0, hU, args.type)
nRisk = uRisk - self.prior * (1 - pRisk)
risk = self.m * pRisk + nRisk
if args.type == 'bnpu':
    if nRisk < self.beta:
        risk = -self.gamma * nRisk
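The branching above can be illustrated numerically. In this minimal sketch, `prior`, `beta`, `gamma`, and `m` mirror the attribute names in the snippet, and all the values are made up:

```python
# Hypothetical values, chosen only to illustrate the bnPU branching.
prior, beta, gamma, m = 0.3, 0.0, 1.0, 1.0

def bnpu_risk(pRisk, uRisk):
    nRisk = uRisk - prior * (1 - pRisk)
    risk = m * pRisk + nRisk
    if nRisk < beta:
        # Instead of the paper's max(0, nRisk), when nRisk goes negative the
        # code backpropagates -gamma * nRisk: a corrective step (a common
        # device in non-negative PU implementations) that pushes the
        # estimated negative risk back toward non-negative territory.
        risk = -gamma * nRisk
    return nRisk, risk

print(bnpu_risk(0.1, 0.5))  # nRisk ≈ 0.23 ≥ beta, so risk ≈ 0.33
print(bnpu_risk(0.1, 0.1))  # nRisk ≈ -0.17 < beta, so risk ≈ 0.17
```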
Thanks to the author for the generous sharing!
I have one point of confusion and would be grateful for an explanation.
Looking at the AdaPU loss computation, it seems equivalent to (m - p) · pRisk + uRisk, which resembles a weighted BCE loss. Is my understanding correct? If so, where does the significance of PU learning lie?
It is not immediately clear how to modify this repository for NER on an unlabeled dataset with new classes. For example, the files ada_dict_generation.py
and adaptive_pu_model.py
both require model files produced by supervised training on labeled data.
However, the paper describes the benefit of the proposed solution as being able to infer NER instances without training data.
How can this code be modified to support inference of novel NER classes in the absence of labeled data? That is, what steps must be taken to enable training and inference without any labeled data? Or do I misunderstand the paper (is some supervised training always required before the approach can be used on unlabeled data)?
Thanks for sharing your paper and code; the paper is interesting.
I downloaded the code and wanted to run it, but I could not work out how.
Could you add detailed usage instructions to the README.md?
Looking forward to your replies.
Hello,
Thanks for sharing this code. It is extremely easy to use and very readable. I wanted to know if there are practical considerations for deciding on the hyperparameters associated with the loss function. In particular, what are your recommendations for choosing appropriate values of m (the class-balance rate), beta, the prior, and gamma for a given dataset?
Hello, regarding the loss function, the paper says:
"Therefore, in this work, we force ℓ to be bounded by replacing the common unbounded cross entropy loss function with the mean absolute error, resulting in a bounded unbiased positive-unlabeled learning (buPU) algorithm. This slightly differs from the setting of uPU, which only requires ℓ to be symmetric."
My understanding is that the buPU model in the paper uses MAE as the loss function, but the loss defined in the code:
def loss_func(self, yTrue, yPred, type):
    y = torch.eye(2)[yTrue].float()
    if len(y.shape) == 1:
        y = y[None, :]
    # y = torch.from_numpy(yTrue).float().cuda()
    if type == 'bnpu' or type == 'bpu':
        loss = torch.mean((y * (1 - yPred)).sum(dim=1))
    elif type == 'upu':
        loss = torch.mean((-y * torch.log(yPred)).sum(dim=1))
    # loss = 0.5 * torch.max(1-yPred*(2.0*yTrue-1),0)
    return loss
looks like a 0-1 loss rather than MAE.
Could you clarify? Many thanks!
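For what it is worth, for a two-class softmax output the quantity in the code is proportional to the MAE, since the two predicted probabilities sum to 1. A quick check with made-up probabilities:

```python
import torch

# Made-up example: two samples, softmax outputs over 2 classes.
yPred = torch.tensor([[0.8, 0.2], [0.3, 0.7]])
yTrue = torch.tensor([0, 1])
y = torch.eye(2)[yTrue]                    # one-hot targets

code_loss = (y * (1 - yPred)).sum(dim=1)   # what the repo computes per sample
mae = (y - yPred).abs().sum(dim=1)         # per-sample mean absolute error

print(code_loss)  # tensor([0.2000, 0.3000])
print(mae)        # tensor([0.4000, 0.6000])
# Rows of yPred sum to 1, so MAE = 2 * code_loss: the "0-1-looking"
# expression is the MAE up to a constant factor of 2.
```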
While running the code I ran into the following problem; does anyone know how to solve it? I am using PyTorch 1.1.0 in a Python 3 environment.
Traceback (most recent call last):
File "feature_pu_model.py", line 272, in <module>
acc, risk, prisk, nrisk = trainer.train_mini_batch(batch, args)
File "feature_pu_model.py", line 136, in train_mini_batch
(risk).backward()
AttributeError: 'float' object has no attribute 'backward'
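The error says that by the time `.backward()` is called, `risk` has become a plain Python float rather than a tensor; only tensors carry an autograd graph. This commonly happens when a 0-d tensor is collapsed via `.item()` or mixed into pure-number arithmetic somewhere upstream. A defensive sketch of the failure and a wrap-around (note the real fix is to keep `risk` a tensor throughout, since a freshly wrapped tensor has no connection back to the model parameters):

```python
import torch

risk = 0.5  # imagine the risk has degenerated into a plain Python float

# Plain floats have no .backward(); wrapping restores a tensor so the call
# succeeds, but gradients will not reach the model unless `risk` stays a
# tensor from the loss computation onward.
if not torch.is_tensor(risk):
    risk = torch.tensor(risk, requires_grad=True)
risk.backward()
print(risk.grad)  # tensor(1.)
```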
Hi! I have obtained the original OntoNotes 4.0 dataset, but I do not know how to preprocess it; the tutorials online only cover 5.0. Could you share the preprocessing pipeline for the 4.0 dataset?
For PER and ORG, python feature_pu_model.py --dataset conll2003 --flag PER produces results normally, but for LOC and MISC I always get Precision: 0, Recall: 0.0, F1: 0 and no usable result.