
lexiconner's People

Contributors

v-mipeng, xingxiaoyu


lexiconner's Issues

Are the "train.XXX.txt" files generated by dictionaries?

I merged the datasets of all entity types (i.e. all train.XXX.txt files) and directly trained a vanilla BiLSTM+CRF on the merged set. The overall F1 exceeded 90.0, which seems unreasonably high given that the labels were generated by dictionaries. Did I misunderstand anything? Many thanks!

I use 100-dimensional GloVe embeddings and 30-dimensional character embeddings (produced by an LSTM).
The hidden dimension is 200 (i.e. 100 per direction), the dropout rate is 0.5, the optimizer is SGD with a learning rate of 0.01, and the batch size is 32.
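
(For concreteness, the configuration above as a minimal sketch; the dictionary keys are illustrative, not names from the repo:)

    # Rough sketch of the reported BiLSTM+CRF hyperparameters; key names are illustrative.
    config = {
        "word_embedding": "glove.100d",   # 100-dimensional GloVe vectors
        "char_embedding_dim": 30,         # character embeddings from an LSTM
        "hidden_dim": 200,                # BiLSTM hidden size, 100 per direction
        "dropout": 0.5,
        "optimizer": "SGD",
        "learning_rate": 0.01,
        "batch_size": 32,
    }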

Code issue

When I run feature_pu_model.py, I get the following error:

Traceback (most recent call last):
File "feature_pu_model.py", line 11, in
from utils.data_utils import DataPrepare
ModuleNotFoundError: No module named 'utils.data_utils'

Could you please help? Thank you.
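
(A likely cause is running the script from outside the repository root, so the utils package is not on the import path. A minimal workaround, assuming utils/data_utils.py sits next to feature_pu_model.py:)

    # Put the directory containing this script on sys.path before importing from utils.
    import os
    import sys
    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    from utils.data_utils import DataPrepare

(Running the script from the repository root, i.e. python feature_pu_model.py rather than python path/to/feature_pu_model.py, may also resolve it.)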

Estimate $\pi_p$

Thanks for sharing your paper and code; the paper is interesting.

I downloaded the code and ran feature_pu_model.py.
I found that the calculation of $\pi_p$ is based on the gold labels, so I tried using 0.04 for the PER type as mentioned in the paper, but only got a result of 93.44 for 'bnpu'.

Looking forward to your reply.
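
(For reference, this is roughly what an estimate of $\pi_p$ from gold labels looks like; a sketch only, with the function name and data layout illustrative rather than taken from the repo:)

    # Sketch: estimate the positive-class prior pi_p as the fraction of tokens
    # carrying the target entity type. `sentences` is a list of (token, tag) lists.
    def estimate_prior(sentences, entity_type="PER"):
        positive = total = 0
        for sentence in sentences:
            for _token, tag in sentence:
                total += 1
                if tag.endswith(entity_type):  # matches e.g. B-PER and I-PER
                    positive += 1
        return positive / total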

How should the key parameters be adjusted for different entity distributions?

Hello. I tried to reproduce this PU-learning method on the Chinese dataset ResumeNER. By repeatedly tuning the loss weight parameter and the positive-class proportion parameter I managed to reproduce one entity type, but I could not reproduce the others. The choice of these two parameters feels like black magic, so I would like to ask what principle governs their setting, and whether you have tried the method on Chinese datasets. Many thanks!

about the bnPU loss

Thanks for the paper and code!

The calculation of the risk in the bnPU setup is a little confusing.
In the paper, the non-negativity constraint gives $\mathrm{Risk} = \pi_p \cdot \mathrm{pRisk} + \max(0, \mathrm{nRisk})$.
However, in the code below, when nRisk < self.beta the risk becomes -self.gamma * nRisk.
Could you please explain why the risk is calculated this way when nRisk is smaller than a small beta? I cannot match the code with the equations in the paper.

# Select model outputs at positive (dictionary-labeled) and unlabeled tokens.
hP = result.masked_select(torch.from_numpy(postive).byte().cuda()).contiguous().view(-1, 2)
hU = result.masked_select(torch.from_numpy(unlabeled).byte().cuda()).contiguous().view(-1, 2)
if len(hP) > 0:
    pRisk = self.model.loss_func(1, hP, args.type)
else:
    pRisk = torch.FloatTensor([0]).cuda()
uRisk = self.model.loss_func(0, hU, args.type)
nRisk = uRisk - self.prior * (1 - pRisk)  # estimated risk on the negative part
risk = self.m * pRisk + nRisk

if args.type == 'bnpu':
    if nRisk < self.beta:
        risk = -self.gamma * nRisk  # the step in question
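
(A note for readers, offered as an interpretation rather than an answer from the authors: this looks like the non-negative risk trick of Kiryo et al. (2017). The objective clamps the negative-risk estimate, $\mathrm{Risk} = \pi_p \cdot \mathrm{pRisk} + \max(0, \mathrm{nRisk})$, but clamping alone gives a zero gradient once nRisk goes negative, so the model can keep overfitting the unlabeled data. Kiryo et al. instead take a gradient step on the negated term, scaled by gamma, whenever the estimate falls below the threshold beta; backpropagating through risk = -self.gamma * nRisk moves the parameters so that nRisk rises back toward zero.)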

Confusion about AdaPU

Thanks to the author for sharing this work!
I have one point of confusion and would be grateful for an answer.
Looking at the AdaPU loss computation, it seems equivalent to (m - p) * pRisk + uRisk, which resembles a weighted BCE loss. Is my understanding wrong? If it is correct, where does the value of PU learning come in?
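
(A reader's derivation, for what it is worth: expanding the snippet quoted in the bnPU issue above gives risk = m * pRisk + uRisk - prior * (1 - pRisk) = (m + prior) * pRisk + uRisk - prior, so the coefficient on pRisk is $m + \pi_p$ and the $-\pi_p$ term is a constant. The PU-specific part is not the weighting itself but that uRisk is computed on unlabeled tokens rather than true negatives, with the $\pi_p (1 - \mathrm{pRisk})$ correction subtracting the contribution of positives hidden in the unlabeled set.)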

How should Chinese be handled?

  • Hello, I have a question I hope you can answer.
  • In English, every word can be decomposed into individual letters for the embedding. For Chinese, is word segmentation required? Segmentation boundaries can easily be wrong.
  • Also, how do you handle adjacent entity words? For example, for 北京上海都是一个国际大都市 ("Beijing and Shanghai are both international metropolises"), the labeling comes out as [1,1,1,1,0,0,0,0,0,0,0,0]. Does the current approach simply treat 北京上海 (Beijing-Shanghai) as one entity word?

Training without any supervised labeling

It is not immediately clear how to modify this repository for NER on an unlabeled data set with new classes. For example, the files ada_dict_generation.py and adaptive_pu_model.py both require model files produced by supervised training on labeled data.
However, the paper describes the benefit of the proposed solution as being able to infer NER instances without training data.

How can this code be modified to support inference of novel NER classes in the absence of labeled data? That is, what steps are needed to enable training and inference without any labeled data? Or do I misunderstand the paper (is some supervised training always required before the approach can be used on unlabeled data)?

Questions about running the code

Thanks for sharing your paper and code; the paper is interesting.

I downloaded the code and wanted to run it, but I could not figure out how.
Could you add detailed instructions to the README.md?

Looking forward to your reply.
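
(For anyone else stuck here: a later issue in this thread invokes the trainer as

    python feature_pu_model.py --dataset conll2003 --flag PER

which may serve as a starting point; --flag appears to select the entity type.)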

Choosing appropriate hyperparameters for the loss function

Hello,
Thanks for sharing this code. It is extremely easy to use and very readable. I wanted to know whether there are practical considerations for deciding on the hyperparameters associated with the loss function. In particular, what are your recommendations for choosing appropriate values of m (the class balance rate), beta, the prior, and gamma for a given dataset?
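
(Absent an official recipe, one pragmatic fallback is a small search over these values against a held-out dev set. A minimal sketch; train_and_eval is a hypothetical stand-in for your own training harness, and the grid values are illustrative:)

    from itertools import product

    def train_and_eval(m, beta, gamma, prior):
        # Hypothetical stand-in: train with these settings and return dev-set F1.
        return 0.0  # replace with a real training + evaluation run

    grid = {
        "m": [0.3, 0.5, 1.0],        # class balance rate
        "beta": [0.0, 1e-4, 1e-3],
        "gamma": [1.0],
        "prior": [0.01, 0.04, 0.1],  # pi_p; see the prior-estimation issue above
    }

    best = max(
        (dict(zip(grid, values)) for values in product(*grid.values())),
        key=lambda params: train_and_eval(**params),
    )
    print(best)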

关于损失函数

Hello. Regarding the loss function, the paper says:

Therefore, in this work, we force $\ell$ to be bounded by replacing the common unbounded cross entropy loss function with the mean absolute error, resulting in a bounded unbiased positive-unlabeled learning (buPU) algorithm. This slightly differs from the setting of uPU, which only requires $\ell$ to be symmetric.

My understanding is that the buPU model in the paper uses MAE as the loss function, but the loss function defined in the code:

def loss_func(self, yTrue, yPred, type):
    y = torch.eye(2)[yTrue].float()  # one-hot encode the label
    if len(y.shape) == 1:
        y = y[None, :]
    # y = torch.from_numpy(yTrue).float().cuda()
    if type == 'bnpu' or type == 'bpu':
        loss = torch.mean((y * (1 - yPred)).sum(dim=1))  # 1 - probability of the true class
    elif type == 'upu':
        loss = torch.mean((-y * torch.log(yPred)).sum(dim=1))  # cross entropy
    # loss = 0.5 * torch.max(1-yPred*(2.0*yTrue-1),0)
    return loss

looks to me like a 0-1 loss rather than MAE.
Could you clarify? Many thanks!
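
(A possible resolution, offered as a reader's note: with $y$ one-hot over two classes and $\hat{y}$ a probability distribution, the coded term $(y \cdot (1 - \hat{y}))$ summed over classes equals $1 - \hat{y}_{\mathrm{true}}$, while the MAE is $\|y - \hat{y}\|_1 = (1 - \hat{y}_{\mathrm{true}}) + \hat{y}_{\mathrm{false}} = 2(1 - \hat{y}_{\mathrm{true}})$. So the coded loss is MAE up to a constant factor of 2; unlike a true 0-1 loss it is continuous in $\hat{y}$, and it is bounded in $[0, 1]$ as the paper requires.)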

AttributeError: 'float' object has no attribute 'backward'

While running the code I hit the following problem; does anyone know how to solve it? I am using PyTorch 1.1.0 with Python 3.
Traceback (most recent call last):
File "feature_pu_model.py", line 272, in
acc, risk, prisk, nrisk = trainer.train_mini_batch(batch, args)
File "feature_pu_model.py", line 136, in train_mini_batch
(risk).backward()
AttributeError: 'float' object has no attribute 'backward'
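
(A guess from the traceback: by line 136 risk has become a plain Python float rather than a tensor, so it has no .backward(). A purely diagnostic sketch that could be inserted before the backward call in feature_pu_model.py; it assumes nothing about the repo beyond the quoted line:)

    # Diagnostic: confirm `risk` is still a tensor before backpropagating.
    if not torch.is_tensor(risk):
        raise TypeError("risk is %s, not a tensor; check that pRisk/uRisk "
                        "stayed tensors upstream" % type(risk).__name__)
    risk.backward()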

Preprocessing the original OntoNotes 4.0 dataset

Hi! I have obtained the original OntoNotes 4.0 dataset but do not know how to process it; the tutorials available online only cover 5.0. Could you share the preprocessing pipeline for the 4.0 dataset?

LOC and MISC type training

For PER and ORG, python feature_pu_model.py --dataset conll2003 --flag PER works and produces normal results, but LOC and MISC always give Precision: 0, Recall: 0.0, F1: 0, with no usable output.
