bert-kpe's Issues

loss function is wrong

File "G:\BERT-KPE\scripts\train.py", line 52, in train
loss = model.update(step, inputs, scaler)
File "G:\BERT-KPE\scripts\model.py", line 86, in update
loss = self.network(**inputs)
File "G:\BERT-KPE\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "G:\BERT-KPE\scripts..\bertkpe\networks\Roberta2Joint.py", line 356, in forward
Rank_Loss_Fct(
File "G:\BERT-KPE\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "G:\BERT-KPE\venv\lib\site-packages\torch\nn\modules\loss.py", line 1333, in forward
return F.margin_ranking_loss(input1, input2, target, margin=self.margin, reduction=self.reduction)
File "G:\BERT-KPE\venv\lib\site-packages\torch\nn\functional.py", line 3328, in margin_ranking_loss
raise RuntimeError(
RuntimeError: margin_ranking_loss : All input tensors should have same dimension but got sizes: input1: torch.Size([2, 1]), input2: torch.Size([1, 307]), target: torch.Size([1])
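
For what it's worth, here is a minimal sketch of the mismatch and one possible fix (my own reading, not an official patch): after the unsqueeze calls the two score tensors are 2-D, but the target passed to MarginRankingLoss is 1-D, and recent PyTorch versions require all three tensors to have the same number of dimensions. Giving the target the inputs' dimensionality restores the old broadcasting behaviour:

import torch
import torch.nn.functional as F

true_score = torch.randn(2)    # e.g. scores of positive phrases
neg_score = torch.randn(307)   # e.g. scores of negative candidates

input1 = true_score.unsqueeze(-1)  # shape (2, 1)
input2 = neg_score.unsqueeze(0)    # shape (1, 307)

# Fails on recent PyTorch: the target is 1-D while the inputs are 2-D.
# F.margin_ranking_loss(input1, input2, torch.ones(1), margin=1.0)

# Works: a 2-D target broadcasts against both inputs.
target = torch.ones_like(input1)   # shape (2, 1)
loss = F.margin_ranking_loss(input1, input2, target, margin=1.0)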

Increasing max_phrase_words runs into memory issues

Hi, first of all, thank you very much for the repository! :)

I want to retrain the model using a larger number of keyphrases and output longer keyphrases in general.

To achieve this I:

  • increased max_phrase_words from 5 to 10 in scripts/config.py
  • increased the max_gram parameter from 5 to 10 in my model (in bertkpe/networks/)

However, any value larger than 5 makes me run out of memory during the step
"start preparing (train) features for bert2joint (bert) ..."
I can also see that if I increase the value to 6, I run out of memory at a much later point in the preparation step than with a higher value like 10, even though the operation is performed in batches. I suspect a memory leak in one of the dataloader functions.
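
For a sense of scale, a rough candidate count (my own back-of-the-envelope sketch, not repo code) shows why raising max_gram roughly doubles the candidate set, and with it every tensor built over the candidates:

# Number of n-gram candidates a document of doc_len tokens produces
# when every n-gram with n = 1..max_gram is enumerated.
def num_candidates(doc_len, max_gram):
    return sum(max(doc_len - n + 1, 0) for n in range(1, max_gram + 1))

print(num_candidates(512, 5))   # 2550
print(num_candidates(512, 10))  # 5075

This alone would not explain a leak, but it does mean the per-batch footprint grows quickly with max_gram.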

API for key phrase extraction inference

Currently the provided code (test.py) evaluates a pretrained checkpoint on datasets such as OpenKP. Is there an API where the input is some arbitrary text and the outputs are the extracted keywords, or tutorials on how to adapt the current code to do that? Thank you very much!
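
In case it helps others, a possible starting point (hypothetical glue code, not an official API; the "url"/"doc_words" field names are my assumption based on the preprocessed OpenKP files) is to wrap arbitrary text in the same JSON-line shape the loaders read and then reuse the test.py pipeline:

import json

def text_to_example(text, url="doc-0"):
    # whitespace tokenization stands in for the repo's real preprocessing
    return {"url": url, "doc_words": text.split()}

with open("my_input.json", "w") as f:
    f.write(json.dumps(text_to_example("BERT based keyphrase extraction for arbitrary text")) + "\n")

# Then point test.py at this file the same way it is pointed at the
# OpenKP dev file, and collect the predicted keyphrases it writes out.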

Using kp20k dataset

The dataloader code you provide is tailored to OpenKP. Could you add a tutorial for using KP20k as well?

Meaning of the variable max_diff_gram_num

Hi Dr. Sun. While studying the KPE model source code recently, I became confused about the variable max_diff_gram_num in the batchify_XXX_XXX functions. Taking /bertkpe/dataloader/bert2joint_dataloader.py as an example:

line 482: max_diff_gram_num = (1 + max([max(_mention_mask[-1]) for _mention_mask in mention_mask]))

Does max_diff_gram_num here represent the maximum number of candidate keyphrases over the documents in a batch?

If so, is this line equivalent to max(phrase_list_lens)? After all, phrase_list_lens holds the number of candidate keyphrases for each document.
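
If I understand the layout correctly (purely my illustration; the mask contents are assumed), the equivalence would hold whenever each document's phrase indices run contiguously from 0 to n-1:

mention_mask = [
    [[0, 0, 1, 2, 2]],   # doc 1: phrase indices 0..2 -> 3 candidates
    [[0, 1, 1, 2, 3]],   # doc 2: phrase indices 0..3 -> 4 candidates
]
phrase_list_lens = [3, 4]

max_diff_gram_num = 1 + max(max(_m[-1]) for _m in mention_mask)
assert max_diff_gram_num == max(phrase_list_lens)  # both are 4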

Installing the requirements.txt file does not work

Hi, thanks for providing the code!
Unfortunately, I cannot install all the requirements with your provided command:

python 3.5
pip install -r requirements.txt

Do you really use this outdated Python version?

The GitHub repository referenced by -e git+https://github.com/xaynetwork/xayn_ai_research.git@23d366ff8a05eca164718a6857eb31d439d52448#egg=xain_ai_research does not exist.

There is also a version conflict between allennlp and transformers:

The conflict is caused by:
    The user requested transformers==4.12.3
    allennlp 2.5.0 depends on transformers<4.7 and >=4.1
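
If allennlp 2.5.0 is genuinely required, one workaround (untested, just following the pins above) is to let pip pick a transformers version inside allennlp's range:

pip install "allennlp==2.5.0" "transformers>=4.1,<4.7"

Otherwise, dropping allennlp from requirements.txt and keeping transformers==4.12.3 might also resolve the conflict.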

Unable to use checkpoints for inference

I was trying to use the checkpoints provided for inference on openkp dataset, but I am getting this error for bert-base-cased:

RuntimeError: Error(s) in loading state_dict for BertForChunkTFRanking:                                                         
Missing key(s) in state_dict: "bert.embeddings.position_ids".

For roberta-base:

RuntimeError: Error(s) in loading state_dict for RobertaForChunkTFRanking:                                                      
Missing key(s) in state_dict: "roberta.embeddings.position_ids".

I am using transformers==4.12.3 and pytorch==1.8 as mentioned in the requirements.txt file.
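
A common workaround (my suggestion, not from the repo): newer transformers releases register a position_ids buffer on the BERT/RoBERTa embeddings, so a checkpoint saved before that change is simply missing the key under strict loading. Since the buffer is non-learned (just a torch.arange), loading non-strictly is safe:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
# Simulate a checkpoint saved before the buffer existed.
state = {k: v for k, v in model.state_dict().items()
         if not k.endswith("position_ids")}
missing, unexpected = model.load_state_dict(state, strict=False)
print(missing)   # ['embeddings.position_ids'] -- safe to ignore

Applied to this repo, that would mean passing strict=False to self.network.load_state_dict(state_dict) in scripts/model.py.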

micro-F1 or macro-F1?

Hi, in kp20k_evaluator.py you annotate the computation with "# Micro-Averaged Method":

logger.info("F1:{}".format(np.mean(f1_scores[i])))

But the approach looks more like the macro-averaged method: the F1 of each sample is computed first, and the overall F1 is the average of those. Which of the two do you report in your final results?
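
For reference, the two averaging schemes generally give different numbers (illustrative sketch, not the repo's evaluator): macro averages per-document F1 scores, while micro pools the counts first.

import numpy as np

per_doc = [      # (true positives, #predicted, #gold) per document
    (1, 5, 1),   # doc A
    (4, 5, 10),  # doc B
]

def f1(tp, pred, gold):
    p = tp / pred if pred else 0.0
    r = tp / gold if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

macro = np.mean([f1(*d) for d in per_doc])      # average of per-doc F1
tp, pred, gold = map(sum, zip(*per_doc))
micro = f1(tp, pred, gold)                      # F1 of pooled counts
print(f"macro={macro:.3f}  micro={micro:.3f}")  # 0.433 vs 0.476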

Model loading failed

We used test.sh to load the model in checkpoints/bert2joint/bert2joint.openkp.bert.checkpoint and encountered the following error. Judging from the error message, the provided checkpoint may not match the current code. Could you check it and provide a model that can be used directly? Thank you.

Some weights of the model checkpoint at ../data/pretrain_model/bert-base-cased were not used when initializing BertForTFRanking: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTFRanking from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTFRanking from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTFRanking were not initialized from the model checkpoint at ../data/pretrain_model/bert-base-cased and are newly initialized: ['cnn2gram.cnn_list.4.weight', 'cnn2gram.cnn_list.1.bias', 'classifier.bias', 'cnn2gram.cnn_list.3.weight', 'cnn2gram.cnn_list.4.bias', 'cnn2gram.cnn_list.0.weight', 'classifier.weight', 'cnn2gram.cnn_list.2.weight', 'cnn2gram.cnn_list.2.bias', 'cnn2gram.cnn_list.3.bias', 'cnn2gram.cnn_list.1.weight', 'cnn2gram.cnn_list.0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "test.py", line 327, in <module>
    args.eval_checkpoint, args
  File "/home/smji/BERT-KPE-master/scripts/model.py", line 215, in load_checkpoint
    model = KeyphraseSpanExtraction(args, state_dict)
  File "/home/smji/BERT-KPE-master/scripts/model.py", line 35, in __init__
    self.network.load_state_dict(state_dict)
  File "/home/smji/anaconda3/envs/bert_kpe_up/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BertForTFRanking:
	Missing key(s) in state_dict: "bert.embeddings.position_ids".

Error while running test.sh

Hello, thank you for this great work and for sharing your code. I am trying to run the script test.sh using your provided dataset and pre-trained model checkpoints, for bert2joint. I get the following error:

Traceback (most recent call last):
  File "test.py", line 276, in <module>
    dev_candidate = candidate_decoder(args, dev_data_loader, dev_dataset, model, test_input_refactor, pred_arranger, 'dev')
  File "test.py", line 152, in bert2rank_decoder
    for step, batch in enumerate(tqdm(data_loader)):
  File "/home/ec2-user/anaconda3/envs/bert-kpe/lib/python3.5/site-packages/tqdm/std.py", line 1165, in __iter__
    for obj in iterable:
  File "/home/ec2-user/anaconda3/envs/bert-kpe/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ec2-user/anaconda3/envs/bert-kpe/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/anaconda3/envs/bert-kpe/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "../bertkpe/dataloader/loader_utils.py", line 104, in __getitem__
    self.tokenizer, self.mode, self.max_phrase_words)
  File "../bertkpe/dataloader/bert2joint_dataloader.py", line 229, in bert2joint_converter
    src_tensor = torch.LongTensor(tokenizer.convert_tokens_to_ids(src_tokens))
TypeError: an integer is required (got type NoneType)

Please let me know if you have any suggestions on how to fix this. Thank you.

I am using Python 3.5.5 and huggingface transformers-2.5.1.
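
My guess at the failure mode (not verified against the repo): with a transformers/vocab mismatch, convert_tokens_to_ids can return None for tokens missing from the vocabulary, and torch.LongTensor then fails with exactly this message:

import torch

token_ids = [2023, None, 2003]   # as if convert_tokens_to_ids returned None
torch.LongTensor(token_ids)      # TypeError: an integer is required (got type NoneType)

Checking that the tokenizer files match the transformers version pinned in requirements.txt would be my first step.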

When I ran the code, some errors occurred.

true_score.unsqueeze(-1) and neg_score.unsqueeze(0) have different sizes, but margin_ranking_loss requires all of its input tensors to have the same number of dimensions. I think that is why the errors occurred; it looks like the same mismatch reported in "loss function is wrong" above.
