macbert's People

Contributors

ymcui

macbert's Issues

huggingface model clone failed

I tried several times, but I am still getting the following error:

(base) vimos@vimos-Z270MX-Gaming5 pretrained_models  % git lfs install
git clone https://huggingface.co/hfl/chinese-macbert-base
Git LFS initialized.
Cloning into 'chinese-macbert-base'...
remote: Authorization error.
fatal: unable to access 'https://huggingface.co/hfl/chinese-macbert-base/': The requested URL returned error: 403

I can clone other models without problems. Are there any special authentication requirements for MacBERT?
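
A possible workaround (my assumption, not an official fix) is to skip git-lfs entirely and fetch the repository through the huggingface_hub API, which caches the files locally:

import huggingface_hub  # pip install huggingface_hub

# Downloads all files of the model repo into the local cache and returns the path.
local_dir = huggingface_hub.snapshot_download(repo_id="hfl/chinese-macbert-base")
print(local_dir)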

Some details about MacBERT's masking

I would like to implement MacBERT's masking myself and have two questions I hope @ymcui can help with:

  1. The paper says "We use a percentage of 15% input words for masking". Can this be read as masking 15% of the words rather than 15% of the tokens?
  2. When randomly sampling word-level 1-, 2-, 3- and 4-grams, does the sampling avoid picking the same word twice, as Google's original implementation does? (A rough sketch of what I have in mind follows below.)
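
A minimal sketch of what such word-level N-gram candidate selection could look like, assuming the 15% budget and the 40/30/20/10 length distribution from the paper (pick_spans and the sample sentence are illustrative, not the authors' code):

import random

NGRAM_LENGTHS = [1, 2, 3, 4]
NGRAM_WEIGHTS = [0.4, 0.3, 0.2, 0.1]   # unigram to 4-gram, as in the paper

def pick_spans(words, mask_rate=0.15):
    # words: a sentence already segmented into Chinese words (e.g. by LTP)
    budget = max(1, round(len(words) * mask_rate))
    covered = set()      # word indices already chosen, so no word is picked twice
    spans = []
    starts = list(range(len(words)))
    random.shuffle(starts)
    for start in starts:
        if len(covered) >= budget:
            break
        n = random.choices(NGRAM_LENGTHS, weights=NGRAM_WEIGHTS)[0]
        span = range(start, min(start + n, len(words)))
        if covered.intersection(span):
            continue     # overlaps an already selected word: skip
        covered.update(span)
        spans.append(list(span))
    return spans         # each span is a list of word indices to replace

print(pick_spans(["使用", "语言", "模型", "来", "预测", "下一个", "单词"]))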

How are similar words found for single Chinese characters?

The paper mentions that the probability for single characters (unigrams) is 40%. For English it is fairly easy to find a synonym for a single word, but for a single Chinese character there is usually no synonym at all. Are all of these characters without synonyms simply replaced with random words? How is this case handled?
Will the code be open-sourced at some point?
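
A minimal sketch of the fallback the paper describes (random replacement when no similar word exists), using the Synonyms package; the function name and the fallback list are illustrative assumptions, not the released pipeline:

import random
import synonyms  # pip install synonyms

def replace_single_char(ch, fallback_chars):
    # synonyms.nearby returns ([candidate words], [similarity scores])
    candidates, _ = synonyms.nearby(ch)
    candidates = [c for c in candidates if c != ch]
    if candidates:
        return candidates[0]
    # No similar word found (the usual case for a single Chinese character):
    # fall back to a random replacement, as the paper describes.
    return random.choice(fallback_chars)

print(replace_single_char("吃", list("喝看走跑说")))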

missing weights

Hi,

Thanks for releasing the pre-trained models on huggingface. I have a basic question and was hoping you could help.

I loaded the masked-LM model using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-macbert-base")

However, I received the following warning, which I didn't expect given that the pre-trained weights were trained for masked LM:

Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

I was wondering if you know why this happens?

Thanks in advance,
Yi Peng
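
Not an official answer, but the message most likely just means that cls.seq_relationship.* (the sentence-level prediction head stored in the checkpoint) has no counterpart in BertForMaskedLM, so those two tensors are discarded while all MLM weights load normally. A quick way to check, assuming the standard transformers classes:

from transformers import BertForMaskedLM, BertForPreTraining

mlm = BertForMaskedLM.from_pretrained("hfl/chinese-macbert-base")
# The MLM-only class has no sentence-level head, hence the warning:
print(any("seq_relationship" in n for n, _ in mlm.named_parameters()))   # False

# BertForPreTraining keeps both the MLM head and the seq_relationship head:
full = BertForPreTraining.from_pretrained("hfl/chinese-macbert-base")
print(any("seq_relationship" in n for n, _ in full.named_parameters()))  # True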

Question

Hi, can this model be applied to knowledge extraction tasks, and how well does it perform there?

About continuing to train the pretrained models

Hello, and first of all many thanks to @ymcui and the team for making these resources available.
I am a student and would like to ask a few conceptual questions. I have some articles crawled from medical and health websites,
and I would like to continue pre-training the whole-word-masking BERT-like series and MacBERT released by HIT on them.
I plan to use run_mlm.py and run_mlm_wwm.py provided by huggingface
(sources: https://github.com/huggingface/transformers/tree/master/examples/research_projects/mlm_wwm
https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling)
My questions are:
(1) Can a non-whole-word-masking model (bert-base-chinese) be further trained with the whole-word-masking script run_mlm_wwm.py?
(2) Do the whole-word-masking models (BERT-wwm-ext, RoBERTa-wwm-ext-large) have to be further trained with run_mlm_wwm.py? If I keep training them with the token-level run_mlm.py instead, will that cause problems or hurt performance? (See the sketch after this post.)
(3) Can MacBERT be further trained with either of these two scripts? The paper says it performs similar-word replacement, which matches neither script conceptually.
(4) How should I judge that continued training on a special domain (e.g. medical or legal) has converged: by the loss or by the number of training steps?

Finally, thank you for taking the time to read this; any guidance on these points would be greatly appreciated.
Stay safe
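
Not an authoritative answer to (2), but one way to see the practical difference between the two scripts is the data collator each of them builds; a minimal sketch with the classes shipped in transformers (the model name is just an example):

from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          DataCollatorForWholeWordMask)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

# run_mlm.py masks individual WordPiece tokens independently ...
token_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                 mlm_probability=0.15)

# ... while run_mlm_wwm.py masks all pieces of a whole word together.
wwm_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer,
                                            mlm_probability=0.15)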

Meaning of <S> and <T> in the vocabulary

Hello, I noticed two special tokens, <S> and <T>, in the pre-training vocabulary. What do these two tokens mean? Were they used during pre-training, or are they just randomly initialized placeholders like the unused tokens? I would appreciate a reply.
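
A quick way to at least locate the two tokens in the released vocabulary (just an inspection snippet, not an answer about how they were used in pre-training):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
vocab = tok.get_vocab()
# Prints the vocabulary ids of the two tokens, or None if they are absent.
print(vocab.get("<S>"), vocab.get("<T>"))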

What if the computed similar word has a different length than the original word? Thanks

Thanks to the authors; the idea is simple and effective, but I still have two questions:

  1. If I understand correctly, N-gram masking operates on whole words, i.e. a 4-gram masks 4 consecutive whole words?
  2. For the MAC part, if the similar word obtained for a phrase has a different length than the original phrase, how is that handled? (A possible convention is sketched below.)

Thanks.
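
On question 2, one workable convention (purely an assumption, not the authors' recipe) is to keep the replaced span the same length as the original: filter the candidates returned by the Synonyms package to equal-length ones and otherwise fall back to a random word of that length.

import random
import synonyms  # pip install synonyms

def same_length_similar(ngram, fallback_words):
    # Keep only candidates whose length matches the original span, so the
    # replaced positions stay aligned with the prediction labels.
    candidates, _ = synonyms.nearby(ngram)
    same_len = [c for c in candidates if len(c) == len(ngram) and c != ngram]
    if same_len:
        return same_len[0]
    pool = [w for w in fallback_words if len(w) == len(ngram)]
    return random.choice(pool) if pool else ngram

print(same_length_similar("电脑", ["手机", "键盘", "计算机"]))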

A few questions about MacBERT

Hi, I am a bit confused while reading the MacBERT paper. I originally wanted to email you, but it seems the message cannot be delivered to that address.

Regarding "We use whole word masking as well as N-gram masking strategies for selecting candidate tokens for masking, with a percentage of 40%, 30%, 20%, 10% for word-level unigram to 4-gram.": does this mean that a single word is replaced with a similar word with 40% probability, two words with 30% probability, and so on?
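
My own reading (not confirmed by the authors) is that 40/30/20/10 is the distribution over the span length n, drawn once per masked span rather than per input word, roughly:

import random

# Draw the N-gram length for one masked span: 40% of spans are single words,
# 30% are two-word spans, and so on.
n = random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1])[0]
print(f"replace a span of {n} consecutive whole words with similar words")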

How are whole word masking and N-gram masking used together?

Hello! Both MacBERT and PERT state that whole word masking and N-gram masking are used together during pre-training. Could you explain how exactly they are combined? For example:

How are the ratios chosen?

Are whole word masking and N-gram masking implemented jointly (N-gram masking only selects N-grams made up of whole words), or separately (N-gram masking ignores whether the span forms whole words)?
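
For what it is worth, one way the two can be combined (an assumption, not the released recipe) is to select N-grams over segmented words first and then expand every selected word to all of its WordPiece tokens, so whole-word masking is preserved inside each N-gram:

def word_ngram_to_token_positions(word_to_token_positions, start, n, num_words):
    # word_to_token_positions: for each word index, the positions of all of
    # its WordPiece tokens in the input sequence (illustrative structure).
    positions = []
    for w_idx in range(start, min(start + n, num_words)):
        positions.extend(word_to_token_positions[w_idx])
    return positions

# Example: three words, each tokenized into two pieces.
mapping = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(word_ngram_to_token_positions(mapping, start=1, n=2, num_words=3))  # [3, 4, 5, 6]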

When using Synonyms for masking, the synonyms are not in the vocabulary

Hi, I am trying to reproduce MacBERT's similar-word replacement masking with the Synonyms package you recommend, and I ran into a problem while using it.

When I want to replace a word, its synonyms often cannot be found in my own vocabulary, so there is no corresponding word vector for them. My current workaround is to skip such a word and not mask it at all.

Have you run into this situation, and how did you handle it? Any advice would be appreciated, thanks. @ymcui
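
Not the authors' answer, but a small sanity check along the lines of what the post describes, assuming a transformers tokenizer: keep only candidates whose characters all exist in the vocabulary, and skip (or random-replace) the word otherwise.

import synonyms  # pip install synonyms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
vocab = tokenizer.get_vocab()

def usable_candidates(word):
    # A candidate is usable only if every character maps to a known token
    # (i.e. none of its characters would become [UNK]).
    candidates, _ = synonyms.nearby(word)
    return [c for c in candidates if c != word and all(ch in vocab for ch in c)]

print(usable_candidates("电脑"))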
