
templatener's Introduction

Template-Based NER

Source Code For Template-Based Named Entity Recognition Using BART

Training

Training: train.py

Inference: inference.py

Corpus

ATIS (https://github.com/yvchen/JointSLU/tree/master/data)

MIT Restaurant Corpus (https://groups.csail.mit.edu/sls/downloads/)

MIT Movie Corpus (https://groups.csail.mit.edu/sls/downloads/)

Contact

If you have any questions, please feel free to contact Leyang Cui ([email protected]).

Citation

@inproceedings{cui-etal-2021-template,
    title = "Template-Based Named Entity Recognition Using {BART}",
    author = "Cui, Leyang  and
      Wu, Yu  and
      Liu, Jian  and
      Yang, Sen  and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.161",
    doi = "10.18653/v1/2021.findings-acl.161",
    pages = "1835--1845",
}

templatener's People

Contributors

nealcly, parakalan


templatener's Issues

code

When will you make the code public?

Implementation for other language

Hi,

Thank you for your great contribution to this interesting template NER topic. I wonder if it is possible to adapt this code to another language. I have added the model and tokenizer to MODEL_CLASSES (and other parts), since the language uses a different tokenizer than English BART.

MODEL_CLASSES = {
    "auto": (AutoConfig, AutoModel, AutoTokenizer),
    "bart": (BartConfig, BartForConditionalGeneration, BartTokenizer),
    "bert": (BertConfig, BertModel, BertTokenizer),
    "roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
    "indobart": (MBartConfig, MBartForConditionalGeneration, IndoNLGTokenizer)
}
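For context, a minimal sketch of how such a registry is typically consumed; the model_type key and the checkpoint id below are illustrative assumptions, not taken from the repo. Beyond MODEL_CLASSES, the template strings and the hard-coded template count in inference.py (discussed in a later issue on this page) usually need the same adaptation.

# Hypothetical usage of the registry above; "indobart" and the checkpoint id
# are illustrative assumptions, not part of the original repo.
model_type = "indobart"
config_class, model_class, tokenizer_class = MODEL_CLASSES[model_type]

tokenizer = tokenizer_class.from_pretrained("indobenchmark/indobart-v2")  # assumed checkpoint id
model = model_class.from_pretrained("indobenchmark/indobart-v2")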

Could you share some hints on which part I should put attention to when adding other pre-trained models/language to the code?

Thank you so much for your help!

Best,
Oryza

Loss fluctuation

Hi, when I run train.py with data/train.csv and data/dev.csv, the loss fluctuates between 0.6 and 0.3 and doesn't seem to improve. Do you have any idea what might be the reason for this?

Further consideration of efficiency?

The paper considers efficiency, yet inference is still time-consuming, because each sample x of length n creates 8n * k templates (k being the number of entity types).

The source sequence of the model is an input text X = {x_1, ..., x_n}, and the target sequence T_{y_k, x_{i:j}} = {t_1, ..., t_m} is a template filled by the candidate text span x_{i:j} and the entity type y_k.

For efficiency, we restrict the number of n-grams for a span from one to eight, so 8n templates are created for each sentence.
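For concreteness, here is a minimal sketch (names are illustrative, not from the repo) of the span enumeration that produces this blow-up:

def enumerate_templates(tokens, templates, max_span_len=8):
    # Yield every (span, filled template) pair the decoder must score.
    # For a sentence of n tokens and k templates this yields roughly
    # max_span_len * n * k items, which is why inference is slow.
    n = len(tokens)
    for start in range(n):
        for length in range(1, max_span_len + 1):
            if start + length > n:
                break
            span = " ".join(tokens[start:start + length])
            for t in templates:
                yield span, span + t

templates = [" is a location entity .", " is not a named entity ."]  # shortened list
pairs = list(enumerate_templates("Japan began the defence".split(), templates))
print(len(pairs))  # about max_span_len * n * len(templates), truncated at the sentence end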

Evaluation Metrics

The accuracy computed in the project is only the exact-match accuracy of the generated sequence, which includes template tokens such as "is", "a", "an", and "entity".

I calculated the P, R, and F1 of the entity token content and entity class of the generated sequences (trained on CoNLL03). The evaluation results are inconsistent with the paper, with the 'organization' entity obtaining only 0.58 F1. Could you please publish the dataset used in the paper and the complete evaluation method?
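For reference, span-level precision/recall/F1 over (start, end, label) triples is the standard NER metric; a minimal sketch follows (this is not the repo's evaluation script):

def span_f1(pred_spans, gold_spans):
    # pred_spans / gold_spans: sets of (start, end, label) triples
    tp = len(pred_spans & gold_spans)
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(span_f1({(0, 0, "ORG"), (3, 4, "PER")}, {(0, 0, "ORG")}))  # (0.5, 1.0, 0.667)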

Error encountered when running the inference.py file

Hi,
Thank you for sharing. Could you please provide the versions of transformers and the other packages? I encountered a problem when running inference.py at the line output = model(input_ids=input_ids.to(device), decoder_input_ids=output_ids[:, :output_ids.shape[1] - 2].to(device))[0].
The error is as below:

(error screenshot not reproduced here)

Fine-tuning on few-shot datasets

Dear authors, I have already reproduced your results on CoNLL03 and obtained the corresponding model. How do I fine-tune on the MIT Movie few-shot dataset? Is it enough to change the dataset path in train.py and run it?

Final trained models

Hi,

I am working on using your paper for my research purposes. Would you be releasing the trained checkpoint anytime soon?

Thanks and Regards,

The answer sentence in csv files

The CSV file in the data directory seems to use only the gold entities as answer sentences. Should I add some negative samples, such as "by is not a named entity", as answer sentences in the CSV file for training? Thanks.
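For illustration, a minimal sketch of building training rows with such negatives, assuming the two-column (input_text, target_text) CSV convention that the seq2seq training setup implies; the sampling strategy is an assumption, not the authors' recipe:

import random

def make_pairs(tokens, entities, neg_per_sent=1):
    # entities: list of (start, end, type_name); returns (input_text, target_text) rows
    sent = " ".join(tokens)
    rows = [(sent, " ".join(tokens[s:e + 1]) + " is a %s entity ." % t) for s, e, t in entities]
    # negative sampling (assumed strategy): a random non-entity token mapped to the 'none' template
    inside = {i for s, e, _ in entities for i in range(s, e + 1)}
    candidates = [i for i in range(len(tokens)) if i not in inside]
    for i in random.sample(candidates, min(neg_per_sent, len(candidates))):
        rows.append((sent, tokens[i] + " is not a named entity ."))
    return rows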

CSV input files

Can you share the format of the input CSV files?
Thank you,
Viet

Seq2SeqModel predicts one entity at a time

Hi,
The Seq2SeqModel.predict function predicts one entity at a time, e.g.

predict("Tesla, IBM, and Amazon are the good tech companies") -> "Tesla is an Organization"

What about extracting IBM and Amazon as well, at the same time? (See the sketch below.)
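One workaround, sketched under the assumption that predict_span behaves like the template_entity function quoted later on this page (returns [start, end, label, score], with 'O' meaning no entity); the greedy skip-past-match loop is an illustration, not the repo's exact decoding:

def extract_all(tokens, predict_span, max_len=8):
    # Greedy multi-entity pass: keep every non-'O' prediction instead of a single one.
    text = " ".join(tokens)
    found, i = [], 0
    while i < len(tokens):
        # candidate n-grams starting at position i, up to max_len tokens long
        ngrams = [" ".join(tokens[i:i + j]) for j in range(1, max_len + 1) if i + j <= len(tokens)]
        start, end, label, score = predict_span(ngrams, text, i)
        if label != "O":
            found.append((start, end, label, score))
            i = end + 1  # continue after the matched span
        else:
            i += 1
    return found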

Could you please release the test data

Hi, thank you for your nice work.
Could you please release the test data for this code?
I have no idea what format the data should take for inference.py.

Hello, questions about the test set

Hello, I ran into two problems during test-set inference and would like to ask the authors.
(1) Why does this line need to set id[:, 0] == 2?

(screenshot not reproduced here)

(2) Why subtract 2 here? I don't understand the purpose of doing so.

(screenshot not reproduced here)

Training question

Hello, is the workflow simply: download the code, run train.py to train and obtain a model, download the datasets mentioned in the README, and then run inference.py directly? Also, which transformers version do you use? Some imports are flagged as errors in my environment.

Hard coded numbers in template_entity function of inference.py

Hi,

would you mind explaining some hard-coded numbers in the template_entity function from inference.py?

def template_entity(words, input_TXT, start):
    # input text -> template
    words_length = len(words)
    words_length_list = [len(i) for i in words]
    # one copy of the source sentence per (candidate span, template) pair; 5 == len(template_list)
    input_TXT = [input_TXT]*(5*words_length)

    input_ids = tokenizer(input_TXT, return_tensors='pt')['input_ids']
    model.to(device)
    template_list = [" is a location entity .", " is a person entity .", " is an organization entity .",
                     " is an other entity .", " is not a named entity ."]
    entity_dict = {0: 'LOC', 1: 'PER', 2: 'ORG', 3: 'MISC', 4: 'O'}
    temp_list = []
    for i in range(words_length):
        for j in range(len(template_list)):
            temp_list.append(words[i]+template_list[j])

    output_ids = tokenizer(temp_list, return_tensors='pt', padding=True, truncation=True)['input_ids']
    # 2 is BART's </s> id, which BART also uses as the decoder start token
    output_ids[:, 0] = 2
    output_length_list = [0]*5*words_length

    for i in range(len(temp_list)//5):
        # number of decoder positions to score for this span: tokenized length
        # minus the special tokens and the fixed template tail
        base_length = ((tokenizer(temp_list[i * 5], return_tensors='pt', padding=True, truncation=True)['input_ids']).shape)[1] - 4
        output_length_list[i*5:i*5+5] = [base_length]*5
        # " is not a named entity ." is one token longer than the other templates
        output_length_list[i*5+4] += 1

    score = [1]*5*words_length
    with torch.no_grad():
        # drop the last two decoder-input positions (the final " ." and </s>)
        output = model(input_ids=input_ids.to(device), decoder_input_ids=output_ids[:, :output_ids.shape[1] - 2].to(device))[0]
        for i in range(output_ids.shape[1] - 3):
            logits = output[:, i, :]
            logits = logits.softmax(dim=1)
            logits = logits.to('cpu').numpy()
            for j in range(0, 5*words_length):
                if i < output_length_list[j]:
                    # accumulate the probability of the gold next token along the template
                    score[j] = score[j] * logits[j][int(output_ids[j][i + 1])]

    # best (span, template) pair: row // 5 gives the span offset, row % 5 the entity type
    end = start + (score.index(max(score)) // 5)
    return [start, end, entity_dict[score.index(max(score)) % 5], max(score)]  # [start_index, end_index, label, score]

I learned from the open issues that the 5s are the length of template_list, but what about the other numbers?

It would be a great help if you could respond to this. Thank you in advance!
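Not an authoritative decoding of every constant, but for BART, id 0 is <s> and id 2 is </s> (which BART reuses as the decoder start token); that matches output_ids[:, 0] = 2 and the -2 trimming of the final " ." and </s>. A quick probe to count the positions yourself:

from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained('facebook/bart-large')
ids = tok("Japan is a location entity .")['input_ids']
print(ids)                              # begins with 0 (<s>) and ends with 2 (</s>)
print(tok.convert_ids_to_tokens(ids))   # count which positions the scoring loop visits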

Repository not found

I tried to run inference.py but it gives the error

/configuration_utils.py", line 609, in _get_config_dict
    user_agent=user_agent,
  File "/usr/local/lib/python3.7/dist-packages/transformers/utils/hub.py", line 292, in cached_path
    local_files_only=local_files_only,
  File "/usr/local/lib/python3.7/dist-packages/transformers/utils/hub.py", line 495, in get_from_cache
    _raise_for_status(r)
  File "/usr/local/lib/python3.7/dist-packages/transformers/utils/hub.py", line 418, in _raise_for_status
    f"401 Client Error: Repository not found for url: {response.url}. "
transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/checkpoint-3060/resolve/main/config.json. If the repo is private, make sure you are authenticated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 104, in <module>
    model = BartForConditionalGeneration.from_pretrained('./checkpoint-3060')
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py", line 1934, in from_pretrained
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 526, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 553, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 614, in _get_config_dict
    f"{pretrained_model_name_or_path} is not a local folder and is not a valid model identifier listed on "
OSError: ./checkpoint-3060 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
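The traceback suggests transformers fell back to treating './checkpoint-3060' as a Hub repo id because the local folder was not found; a minimal sketch of the usual check, assuming the checkpoint directory was produced by train.py (the path is an assumption):

import os
from transformers import BartForConditionalGeneration

ckpt = os.path.abspath('./checkpoint-3060')  # adjust to wherever train.py saved the model
assert os.path.isdir(ckpt), "checkpoint folder not found: %s" % ckpt
model = BartForConditionalGeneration.from_pretrained(ckpt)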

Cross-domain Few-Shot NER Result

With no source-domain data, for MIT Movie 10-shot the paper reports 37.3, but I get 51.06. Could you share some details about this setting?

Output duplication

Hi, I tried your code, and when I use your demo to generate output I expected ["original text", "predicted entity with prompt"]; however, it just outputs ['JapanJapanJapan JapanJapanJapan'].
Is there any layer you added during fine-tuning that is not shown in the code?

Custom Label Problem

When I train the model with custom labels, the training code works well. However, adapting the inference.py code to my custom-trained model does not work.

I changed the inference.ipynb code to adapt to my 11 labels as follows:

    LABELS=["Adjective","API","Core","GUI","Hardware","Language","Platform","Standard","User","Verb","O"]
    template_list=[" is a %s entity"%(e) for e in LABELS]
    entity_dict={i:e for i, e in enumerate(LABELS)}

Here is the checkpoint loading:

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('./outputs/best_model')

Here is the inference call and the error:

prediction("As a user I should be able to use the attribute type User in my queries.")

RuntimeError
----> 2 prediction("As a user I should be able to use the attribute type User in my queries.")
/usr/local/lib/python3.7/dist-packages/transformers/models/bart/modeling_bart.py in _shape(self, tensor, seq_len, bsz)
157 def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
--> 158 return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
RuntimeError: shape '[88, -1, 16, 64]' is invalid for input of size 778240
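The failing batch dimension of 88 equals 11 templates * 8 candidate spans, which is consistent with the hard-coded 5s in template_entity (quoted above): input_TXT is replicated 5 * words_length times while 11 templates produce 11 * words_length decoder rows, so the encoder and decoder batches disagree. A hedged sketch of the parametrization (illustrative, not a drop-in patch):

# Parametrize what inference.py hard-codes as 5 (illustrative, not a drop-in patch)
LABELS = ["Adjective", "API", "Core", "GUI", "Hardware", "Language",
          "Platform", "Standard", "User", "Verb", "O"]
template_list = [" is a %s entity ." % e for e in LABELS[:-1]] + [" is not a named entity ."]
num_templates = len(template_list)        # 11, not the hard-coded 5

words_length = 8                          # illustrative number of candidate spans
input_TXT = "As a user I should be able to use the attribute type User in my queries."
batch = [input_TXT] * (num_templates * words_length)
print(len(batch))                         # 88 == the failing batch dimension in the traceback
# ...and every `* 5`, `// 5`, `% 5` on batch/score indices must become num_templates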

Cross-domain question

Hello, I ran into a key-value problem in the cross-domain experiments. I am using bert-softmax; how did you handle this when running the Sequence Labeling BERT baseline? Looking forward to your reply.
