macbert's People

Contributors

ymcui

macbert's Issues

huggingface model clone failed

I tried several times, but I am still getting the following error:

(base) vimos@vimos-Z270MX-Gaming5 pretrained_models  % git lfs install
git clone https://huggingface.co/hfl/chinese-macbert-base
Git LFS initialized.
Cloning into 'chinese-macbert-base'...
remote: Authorization error.
fatal: unable to access 'https://huggingface.co/hfl/chinese-macbert-base/': The requested URL returned error: 403

I can clone other models without problems. Are there any special authentication requirements for MacBERT?
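
A possible workaround (my assumption, not an official fix) is to skip git-lfs entirely and fetch the repository through the huggingface_hub API, which caches the files locally:

import huggingface_hub  # pip install huggingface_hub

# Downloads all files of the model repo into the local cache and returns the path.
local_dir = huggingface_hub.snapshot_download(repo_id="hfl/chinese-macbert-base")
print(local_dir)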

Some details about MacBERT's masking

I would like to implement MacBERT's masking myself and have two questions I hope @ymcui can help with:

  1. The paper says "We use a percentage of 15% input words for masking". Can this be read as masking 15% of the words rather than 15% of the tokens?
  2. When randomly sampling word-level 1-, 2-, 3- and 4-grams, does the sampling avoid picking the same word twice, as Google's original implementation does? (A rough sketch of what I have in mind follows below.)
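
A minimal sketch of what such word-level N-gram candidate selection could look like, assuming the 15% budget and the 40/30/20/10 length distribution from the paper (pick_spans and the sample sentence are illustrative, not the authors' code):

import random

NGRAM_LENGTHS = [1, 2, 3, 4]
NGRAM_WEIGHTS = [0.4, 0.3, 0.2, 0.1]   # unigram to 4-gram, as in the paper

def pick_spans(words, mask_rate=0.15):
    # words: a sentence already segmented into Chinese words (e.g. by LTP)
    budget = max(1, round(len(words) * mask_rate))
    covered = set()      # word indices already chosen, so no word is picked twice
    spans = []
    starts = list(range(len(words)))
    random.shuffle(starts)
    for start in starts:
        if len(covered) >= budget:
            break
        n = random.choices(NGRAM_LENGTHS, weights=NGRAM_WEIGHTS)[0]
        span = range(start, min(start + n, len(words)))
        if covered.intersection(span):
            continue     # overlaps an already selected word: skip
        covered.update(span)
        spans.append(list(span))
    return spans         # each span is a list of word indices to replace

print(pick_spans(["使用", "语言", "模型", "来", "预测", "下一个", "单词"]))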

How are similar words found for single Chinese characters?

The paper mentions that the probability for single characters (unigrams) is 40%. For English it is fairly easy to find a synonym for a single word, but for a single Chinese character there is usually no synonym at all. Are all of these characters without synonyms simply replaced with random words? How is this case handled?
Will the code be open-sourced at some point?
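
A minimal sketch of the fallback the paper describes (random replacement when no similar word exists), using the Synonyms package; the function name and the fallback list are illustrative assumptions, not the released pipeline:

import random
import synonyms  # pip install synonyms

def replace_single_char(ch, fallback_chars):
    # synonyms.nearby returns ([candidate words], [similarity scores])
    candidates, _ = synonyms.nearby(ch)
    candidates = [c for c in candidates if c != ch]
    if candidates:
        return candidates[0]
    # No similar word found (the usual case for a single Chinese character):
    # fall back to a random replacement, as the paper describes.
    return random.choice(fallback_chars)

print(replace_single_char("吃", list("喝看走跑说")))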

missing weights

Hi,

Thanks for releasing the pre-trained models on huggingface. I have a basic question and was hoping you could help.

I loaded the masked-LM model using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-macbert-base")

However, I received the following warning, which I didn't expect given that the pre-trained weights were trained for masked LM:

Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

I was wondering if you know why this happens?

Thanks in advance,
Yi Peng
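
Not an official answer, but the message most likely just means that cls.seq_relationship.* (the sentence-level prediction head stored in the checkpoint) has no counterpart in BertForMaskedLM, so those two tensors are discarded while all MLM weights load normally. A quick way to check, assuming the standard transformers classes:

from transformers import BertForMaskedLM, BertForPreTraining

mlm = BertForMaskedLM.from_pretrained("hfl/chinese-macbert-base")
# The MLM-only class has no sentence-level head, hence the warning:
print(any("seq_relationship" in n for n, _ in mlm.named_parameters()))   # False

# BertForPreTraining keeps both the MLM head and the seq_relationship head:
full = BertForPreTraining.from_pretrained("hfl/chinese-macbert-base")
print(any("seq_relationship" in n for n, _ in full.named_parameters()))  # True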

Question

Hi, can this model be applied to knowledge extraction tasks, and how well does it perform there?

About continuing to train the pretrained models

Hello, and first of all many thanks to @ymcui and the team for making these resources available.
I am a student and would like to ask a few conceptual questions. I have some articles crawled from medical and health websites,
and I would like to continue pre-training the whole-word-masking BERT-like series and MacBERT released by HIT on them.
I plan to use run_mlm.py and run_mlm_wwm.py provided by huggingface
(sources: https://github.com/huggingface/transformers/tree/master/examples/research_projects/mlm_wwm
https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling)
My questions are:
(1) Can a non-whole-word-masking model (bert-base-chinese) be further trained with the whole-word-masking script run_mlm_wwm.py?
(2) Do the whole-word-masking models (BERT-wwm-ext, RoBERTa-wwm-ext-large) have to be further trained with run_mlm_wwm.py? If I keep training them with the token-level run_mlm.py instead, will that cause problems or hurt performance? (See the sketch after this post.)
(3) Can MacBERT be further trained with either of these two scripts? The paper says it performs similar-word replacement, which matches neither script conceptually.
(4) How should I judge that continued training on a special domain (e.g. medical or legal) has converged: by the loss or by the number of training steps?

Finally, thank you for taking the time to read this; any guidance on these points would be greatly appreciated.
Stay safe
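
Not an authoritative answer to (2), but one way to see the practical difference between the two scripts is the data collator each of them builds; a minimal sketch with the classes shipped in transformers (the model name is just an example):

from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          DataCollatorForWholeWordMask)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

# run_mlm.py masks individual WordPiece tokens independently ...
token_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                 mlm_probability=0.15)

# ... while run_mlm_wwm.py masks all pieces of a whole word together.
wwm_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer,
                                            mlm_probability=0.15)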

Meaning of <S> and <T> in the vocabulary

Hello, I noticed two special tokens, <S> and <T>, in the pre-training vocabulary. What do these two tokens mean? Were they used during pre-training, or are they just randomly initialized placeholders like the unused tokens? I would appreciate a reply.
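
A quick way to at least locate the two tokens in the released vocabulary (just an inspection snippet, not an answer about how they were used in pre-training):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
vocab = tok.get_vocab()
# Prints the vocabulary ids of the two tokens, or None if they are absent.
print(vocab.get("<S>"), vocab.get("<T>"))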

What if the computed similar word has a different length than the original word? Thanks

Thanks to the authors; the idea is simple and effective, but I still have two questions:

  1. If I understand correctly, N-gram masking operates on whole words, i.e. a 4-gram masks 4 consecutive whole words?
  2. For the MAC part, if the similar word obtained for a phrase has a different length than the original phrase, how is that handled? (A possible convention is sketched below.)

Thanks.
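
On question 2, one workable convention (purely an assumption, not the authors' recipe) is to keep the replaced span the same length as the original: filter the candidates returned by the Synonyms package to equal-length ones and otherwise fall back to a random word of that length.

import random
import synonyms  # pip install synonyms

def same_length_similar(ngram, fallback_words):
    # Keep only candidates whose length matches the original span, so the
    # replaced positions stay aligned with the prediction labels.
    candidates, _ = synonyms.nearby(ngram)
    same_len = [c for c in candidates if len(c) == len(ngram) and c != ngram]
    if same_len:
        return same_len[0]
    pool = [w for w in fallback_words if len(w) == len(ngram)]
    return random.choice(pool) if pool else ngram

print(same_length_similar("电脑", ["手机", "键盘", "计算机"]))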

A few questions about MacBERT

Hi, I am a bit confused while reading the MacBERT paper. I originally wanted to email you, but it seems the message cannot be delivered to that address.

Regarding "We use whole word masking as well as N-gram masking strategies for selecting candidate tokens for masking, with a percentage of 40%, 30%, 20%, 10% for word-level unigram to 4-gram.": does this mean that a single word is replaced with a similar word with 40% probability, two words with 30% probability, and so on?
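
My own reading (not confirmed by the authors) is that 40/30/20/10 is the distribution over the span length n, drawn once per masked span rather than per input word, roughly:

import random

# Draw the N-gram length for one masked span: 40% of spans are single words,
# 30% are two-word spans, and so on.
n = random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1])[0]
print(f"replace a span of {n} consecutive whole words with similar words")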

How are whole word masking and N-gram masking used together?

Hello! Both MacBERT and PERT state that whole word masking and N-gram masking are used together during pre-training. Could you explain how exactly they are combined? For example:

How are the ratios chosen?

Are whole word masking and N-gram masking implemented jointly (N-gram masking only selects N-grams made up of whole words), or separately (N-gram masking ignores whether the span forms whole words)?
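
For what it is worth, one way the two can be combined (an assumption, not the released recipe) is to select N-grams over segmented words first and then expand every selected word to all of its WordPiece tokens, so whole-word masking is preserved inside each N-gram:

def word_ngram_to_token_positions(word_to_token_positions, start, n, num_words):
    # word_to_token_positions: for each word index, the positions of all of
    # its WordPiece tokens in the input sequence (illustrative structure).
    positions = []
    for w_idx in range(start, min(start + n, num_words)):
        positions.extend(word_to_token_positions[w_idx])
    return positions

# Example: three words, each tokenized into two pieces.
mapping = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(word_ngram_to_token_positions(mapping, start=1, n=2, num_words=3))  # [3, 4, 5, 6]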

When using Synonyms for masking, the synonyms are not in the vocabulary

Hi, I am trying to reproduce MacBERT's similar-word replacement masking with the Synonyms package you recommend, and I ran into a problem while using it.

When I want to replace a word, its synonyms often cannot be found in my own vocabulary, so there is no corresponding word vector for them. My current workaround is to skip such a word and not mask it at all.

Have you run into this situation, and how did you handle it? Any advice would be appreciated, thanks. @ymcui
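
Not the authors' answer, but a small sanity check along the lines of what the post describes, assuming a transformers tokenizer: keep only candidates whose characters all exist in the vocabulary, and skip (or random-replace) the word otherwise.

import synonyms  # pip install synonyms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
vocab = tokenizer.get_vocab()

def usable_candidates(word):
    # A candidate is usable only if every character maps to a known token
    # (i.e. none of its characters would become [UNK]).
    candidates, _ = synonyms.nearby(word)
    return [c for c in candidates if c != word and all(ch in vocab for ch in c)]

print(usable_candidates("电脑"))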
