codertimo / bert-pytorch

Google AI 2018 BERT pytorch implementation

License: Apache License 2.0

Python 99.83% Makefile 0.17%

bert language-model nlp pytorch transformer

bert-pytorch's Introduction

I’m currently working on 🔭

Developing an open-domain dialog system named Luda
Building a Korean MRC (Machine Reading Comprehension) dataset at KLUE
Trying hard to reduce the Learning Machine Learning (LML) loss 😂
Coding every day for better research engineering skills

I’m currently learning 🌱

Theoretical Machine Learning from the basics
TensorFlow 2.x
Research Environment in Cloud (GCP, AWS etc)
Go language study
Asset Investing (Portfolio, Risk Management, Macro-Finance etc)

How to reach me 📫

Email: [email protected]
CV: https://junseong.oopy.io/introduction
Blog: https://junseong.oopy.io/
Facebook: https://fb.com/codertimo
LinkedIn: https://www.linkedin.com/in/codertimo/

bert-pytorch's Issues

Tie the input and output embedding?

I think it's reasonable to tie the input and output embeddings, especially the output embedding for each token. But I still can't find a way to do this. Does anyone have an idea?
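
One common way to tie the two embeddings in PyTorch is to let the output projection share the input embedding's weight tensor. The sketch below is only an illustration, not this repository's code; `TiedMaskedLM` and its arguments are hypothetical names.

```python
import torch
import torch.nn as nn


class TiedMaskedLM(nn.Module):
    """Hypothetical output head whose projection reuses the token embedding matrix."""

    def __init__(self, vocab_size, hidden):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        # nn.Linear stores its weight as (out_features, in_features) = (vocab_size, hidden),
        # which is the same shape as the embedding matrix, so the Parameter can be shared.
        self.decoder = nn.Linear(hidden, vocab_size, bias=False)
        self.decoder.weight = self.embedding.weight
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden) -> log-probabilities over the vocabulary
        return torch.log_softmax(self.decoder(hidden_states) + self.bias, dim=-1)


model = TiedMaskedLM(vocab_size=100, hidden=16)
tokens = torch.randint(0, 100, (2, 5))
log_probs = model(model.embedding(tokens))  # the embedding stands in for the encoder output
```

Because both modules hold the same Parameter object, a gradient step on the output head also updates the input embedding.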

IndexError: list index out of range

pred_loss decreases fast while avg_acc stays at 50%

I tried running the code on a small dataset and found that pred_loss decreases quickly while avg_acc stays at 50%. This seems strange to me, since a decrease in pred_loss should indicate an increase in accuracy.

Tensor transform question in pretrain.py

There is a line like below in pretrain.py,

mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

I ran it and found that "mask_lm_output" has shape "batch_size * input_length * vocab_size" and data["bert_label"] has shape "batch_size * input_length". If it is transposed as above, does that make sense? I am confused.
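
For reference, `nn.NLLLoss` expects the class dimension in position 1: an input of shape (N, C, d1) pairs with a target of shape (N, d1). Transposing (batch, length, vocab) to (batch, vocab, length) satisfies that. The small check below uses made-up sizes and shows the transposed call gives the same value as flattening everything to 2-D.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, input_length, vocab_size = 2, 5, 11  # illustrative sizes only

# Log-probabilities shaped like mask_lm_output: (batch, length, vocab)
log_probs = torch.log_softmax(torch.randn(batch_size, input_length, vocab_size), dim=-1)
# Labels shaped like data["bert_label"]: (batch, length); 0 marks ignored positions
labels = torch.randint(1, vocab_size, (batch_size, input_length))
labels[0, 0] = 0

criterion = nn.NLLLoss(ignore_index=0)

# (batch, vocab, length) input with (batch, length) target
loss_transposed = criterion(log_probs.transpose(1, 2), labels)
# Equivalent 2-D form: (batch * length, vocab) input with (batch * length,) target
loss_flat = criterion(log_probs.reshape(-1, vocab_size), labels.reshape(-1))
```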

How does your code implement Bidirectional Transformers?

Hi,
I am a new user of PyTorch. I want to know which part of your code represents the Bidirectional Transformer. Thanks.

model/embedding/position.py

div_term = (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
should be:
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

In [51]: (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
...:
Out[51]:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
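
The two lines differ only in where the cast to float happens. `torch.arange` with integer arguments yields an integer tensor, so in a PyTorch version where multiplying it by a Python float keeps the integer dtype (an assumption about the reporter's environment), every product truncates to 0 and `exp()` returns all ones. A minimal sketch of the corrected ordering, with `d_model = 128` chosen to match the 64 values printed above:

```python
import math

import torch

d_model = 128

steps = torch.arange(0, d_model, 2)  # integer dtype because the arguments are ints
# Cast to float before scaling so the result does not depend on type-promotion rules.
div_term = (steps.float() * -(math.log(10000.0) / d_model)).exp()
```

With the cast applied first, `div_term` starts at 1 and decays towards 1/10000 instead of staying constant.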

Additional question:
I don't quite understand how the "bidirectional" transformer in the original paper is implemented. Maybe like a BiLSTM: concatenating the transformer outputs of the two directions? I didn't find a similar structure in your code.

The LayerNorm implementation

I am wondering why you don't use the standard nn version of LayerNorm.
I notice the difference is the denominator: nn.LayerNorm uses {sqrt of (variance + epsilon)} rather than {standard deviation + epsilon}.

Could you clarify these two approaches?
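
The two forms can be compared numerically. The sketch below assumes the custom version divides by `x.std(-1) + eps`, as described in the question; `layer_norm_std_eps` is a hypothetical helper, not the repository's class. Apart from where epsilon sits, `torch.std` defaults to the unbiased estimator while `nn.LayerNorm` uses the biased variance, so the outputs differ by a factor close to sqrt((n - 1) / n).

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
features = 16
eps = 1e-6


def layer_norm_std_eps(x, eps=1e-6):
    """(x - mean) / (std + eps), with torch.std's default unbiased estimator."""
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return (x - mean) / (std + eps)


x = torch.randn(4, features)
reference = nn.LayerNorm(features, eps=eps, elementwise_affine=False)(x)
alternative = layer_norm_std_eps(x, eps=eps)

# Expected ratio between the two outputs when eps is negligible
scale = math.sqrt((features - 1) / features)
```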

Question about random sampling.

BERT-pytorch/bert_pytorch/dataset/dataset.py

Lines 50 to 64 in 7efd2b5

    
    prob = random.random()
    if prob < 0.15:
        # 80% randomly change token to make token
        if prob < prob * 0.8:
            tokens[i] = self.vocab.mask_index
        # 10% randomly change token to random token
        elif prob * 0.8 <= prob < prob * 0.9:
            tokens[i] = random.randrange(len(self.vocab))
        # 10% randomly change token to current token
        elif prob >= prob * 0.9:
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
        output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

Well, it seems random.random() always returns a non-negative number, so prob >= prob * 0.9 will always be true?

Making Book Corpus

Building the same corpus as the original paper. Please share your tips for downloading and preprocessing the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.

Erroneous code

BERT-pytorch/bert_pytorch/dataset/dataset.py

Lines 53 to 62 in 7efd2b5

    
    if prob < prob * 0.8:
        tokens[i] = self.vocab.mask_index
    # 10% randomly change token to random token
    elif prob * 0.8 <= prob < prob * 0.9:
        tokens[i] = random.randrange(len(self.vocab))
    # 10% randomly change token to current token
    elif prob >= prob * 0.9:
        tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

This code is incorrect - it will always go into the last if clause. For instance, prob < prob * 0.8 is never true.
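
One possible fix is to rescale `prob` to [0, 1) once a token has been selected, and then compare it against fixed thresholds. The sketch below works on integer token ids rather than the repository's vocab object; `mask_tokens` and its parameters are hypothetical names used only for illustration.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_index, rng=random):
    """Select ~15% of positions; of those, 80% -> mask, 10% -> random id, 10% -> unchanged."""
    masked = list(token_ids)
    labels = []
    for i, token_id in enumerate(token_ids):
        prob = rng.random()
        if prob < 0.15:
            prob /= 0.15  # rescale to [0, 1) so the thresholds below are meaningful
            if prob < 0.8:
                masked[i] = mask_index
            elif prob < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # otherwise the token is kept as it is
            labels.append(token_id)
        else:
            labels.append(0)  # 0 marks a position that was not selected
    return masked, labels


random.seed(0)
ids = [5 + (i % 900) for i in range(20000)]  # synthetic ids that avoid the special range
masked_ids, label_ids = mask_tokens(ids, vocab_size=1000, mask_index=4)
```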

Why doesn't the counter in data_iter increase?

I am currently playing around with training and testing the model. However, as I implemented the test section, I'm noticing that during the LM training, your counter doesn't increase when looping over data_iter found in pretrain.py. This would cause problems when calculating the average loss/accuracy, wouldn't it?

How to embed the segment label

Thanks for your code, which let me learn more details about this paper. But I can't understand segment.py. You haven't written how to embed the segment label.
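
A segment label can be embedded with a plain lookup table, the same way token ids are. The sketch below assumes three label values (0 for padding, 1 for the first sentence, 2 for the second); that convention and the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

embed_size = 8

# 3 rows: 0 = padding, 1 = first sentence, 2 = second sentence (assumed convention)
segment_embedding = nn.Embedding(3, embed_size, padding_idx=0)

segment_label = torch.tensor([[1, 1, 1, 2, 2, 0, 0]])  # (batch, seq_len)
segment_vectors = segment_embedding(segment_label)     # (batch, seq_len, embed_size)
```

The resulting vectors have the same shape as the token embeddings, so they can simply be added to them.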

Would you provide some sample datasets to demo the pre-training?

readme has wrong commands

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

should be

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

according to

bert-vocab -c data/corpus.small -o data/vocab.small

Bidirectional Encoder = Transformer (self-attention), is it true?

https://github.com/codertimo/BERT-pytorch/blob/alpha0.0.1a4/bert_pytorch/model/transformer.py#L9

Thank you!

Single Sentence Input support

In the paper, they note that they optionally use single sentence input for some classification tasks. I'll try to take a look at doing it myself, as it looks like it is not currently supported.

PositionalEmbedding

The position embedding in BERT is not the same as in the Transformer. Why not use the form used in BERT?
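
For comparison, a learned position embedding can be written as a lookup table indexed by position. This is a minimal sketch; `LearnedPositionalEmbedding` is a hypothetical name, and `max_len` and `d_model` are illustrative values.

```python
import torch
import torch.nn as nn


class LearnedPositionalEmbedding(nn.Module):
    """Trainable position vectors, one per position up to max_len."""

    def __init__(self, max_len, d_model):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len) token ids; only the sequence length is used
        positions = torch.arange(x.size(1), device=x.device)
        return self.pe(positions).unsqueeze(0)  # (1, seq_len, d_model), broadcast over batch


position_embedding = LearnedPositionalEmbedding(max_len=512, d_model=32)
token_ids = torch.randint(0, 100, (4, 10))
position_vectors = position_embedding(token_ids)
```

Unlike the fixed sinusoidal table, these vectors receive gradients and are updated during training.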

Should random sampling work per every epoch?

Is it okay to use randomly sampled data that was saved before training?
I mean, does it have to be changed every epoch?

Question about the implementation of learning_rate

Nice implementation! However, I have a question about the learning rate. The learning rate schedule from the original Transformer is warm-up restart, but your implementation is just a simple decay. Could you implement it in your BERT code?
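
The schedule given in the original Transformer paper is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5): a linear warm-up followed by inverse square root decay. A minimal sketch in plain Python follows; `noam_lr` is a hypothetical helper name and the default values are only examples.

```python
def noam_lr(step, d_model=768, warmup_steps=4000):
    """Learning rate with linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until step == warmup_steps and falls afterwards.
schedule = [noam_lr(s) for s in (1, 2000, 4000, 8000, 16000)]
```

A function like this could be wrapped in a scheduler that sets the optimizer's learning rate at every step.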

ner

How to get dataset.small from corpus.small

In the 0th step, a corpus named corpus.small is built;
in the 2nd step, a dataset.small is used.
Here is my question: should dataset.small be built from corpus.small or not?

Is there any result?

Is there any result that can be compared to the original paper? Looking forward to your updates.

When training the masked LM, are the unmasked words (with label 0) trained together with the masked words?

According to the code

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # 80% randomly change token to make token
                if prob < prob * 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10% randomly change token to random token
                elif prob * 0.8 <= prob < prob * 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10% randomly change token to current token
                elif prob >= prob * 0.9:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

Do we need to exclude the unmasked words when training the LM?
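
If the loss is `nn.NLLLoss(ignore_index=0)`, as quoted in another issue on this page, positions labeled 0 are already excluded from the loss. The check below uses made-up sizes and shows that changing the predictions at label-0 positions leaves the loss unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 10
criterion = nn.NLLLoss(ignore_index=0)

log_probs = torch.log_softmax(torch.randn(1, 4, vocab_size), dim=-1)
labels = torch.tensor([[0, 7, 0, 3]])  # 0 = position not selected for masking

loss = criterion(log_probs.transpose(1, 2), labels)

# Overwrite the predictions at the two label-0 positions with different values.
perturbed = log_probs.clone()
perturbed[0, 0] = torch.log_softmax(torch.randn(vocab_size), dim=-1)
perturbed[0, 2] = torch.log_softmax(torch.randn(vocab_size), dim=-1)
loss_perturbed = criterion(perturbed.transpose(1, 2), labels)

# Mean negative log-likelihood over the two selected positions only
manual = -(log_probs[0, 1, 7] + log_probs[0, 3, 3]) / 2
```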

Attention maybe changed

Hi, thanks for your great work.
I wonder whether the attention mechanism in your code has been changed.
The shape of the attention vector should be (batch, timestep, timestep), but according to your code, the shape of the self-attention vector is (batch, timestep, hidden_size). Below is new code with my fix. Please review it; I would appreciate your comments. Thank you.

    import math

    import torch as t
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn import LayerNorm  # stand-in; the post may have used the repository's own LayerNorm


    class Attention(nn.Module):
        def __init__(self, num_hidden, h=8):
            super(Attention, self).__init__()

            self.num_hidden_per_attn = num_hidden // h
            self.h = h

            self.key = nn.Linear(num_hidden, num_hidden)
            self.value = nn.Linear(num_hidden, num_hidden)
            self.query = nn.Linear(num_hidden, num_hidden)

            self.layer_norm_1 = LayerNorm(num_hidden)
            self.layer_norm_2 = LayerNorm(num_hidden)
            self.out_linear = nn.Linear(num_hidden, num_hidden)

            self.dropout = nn.Dropout(p=0.1)

        def forward(self, input_):
            batch_size = input_.size(0)

            key = F.relu(self.key(input_))
            value = F.relu(self.value(input_))
            query = F.relu(self.query(input_))

            # split the hidden dimension into h heads: (batch, timestep, h, hidden // h)
            key, value, query = list(map(lambda x: x.view(batch_size, -1, self.h, self.num_hidden_per_attn), (key, value, query)))
            params = [(key[:, :, i, :], value[:, :, i, :], query[:, :, i, :]) for i in range(self.h)]

            _attn = list(map(self._multihead, params))
            attn = list(map(lambda x: x[0], _attn))
            probs = list(map(lambda x: x[1], _attn))  # each is (batch, timestep, timestep)
            result = t.cat(attn, -1)

            result = self.dropout(result)
            result = result.view(batch_size, -1, self.h * self.num_hidden_per_attn)

            # residual connection
            result = self.layer_norm_1(F.relu(input_ + result))

            out = self.out_linear(result)
            out = self.layer_norm_2(F.relu(result + out))

            # as in the original post, `result` is returned and `out` is unused
            return result, probs

        def _multihead(self, params):
            key, value, query = params[0], params[1], params[2]

            # scaled dot-product scores: (batch, timestep, timestep)
            attn = t.bmm(query, key.transpose(1, 2)) / math.sqrt(key.shape[-1])

            attn = F.softmax(attn, dim=-1)
            result = t.bmm(attn, value)

            return result, attn

Example of Input Data

Could you give a concrete example of the input data? You gave an example of the corpus data, but not the dataset.small file found in this line:

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

If you could show perhaps a couple of examples, that would be very helpful! I am new to pytorch, so the dataloader function is a little confusing.

What should be the shapes and example of values of x and segment_info in BERT.forward?

I'm trying to add BERT as trainable part to my model and want to pass some data to it.
Could you complete my code example with some x and segment_info?

from bert_pytorch import BERT
N = 30000
bert_model = BERT(N)
x = ...
segment_info = ...
bert_model.forward(x, segment_info)
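
A sketch of what the two inputs could look like. The conventions here are assumptions rather than documented API: both tensors are integer tensors of shape (batch_size, seq_len), index 0 is padding, and segment labels are 1 for the first sentence and 2 for the second. The model call itself is left as a comment so the snippet runs without the package installed.

```python
import torch

vocab_size = 30000
batch_size, seq_len = 2, 8

# x: token indices, shape (batch_size, seq_len), dtype long; 0 is assumed to be padding
x = torch.randint(5, vocab_size, (batch_size, seq_len))
x[:, -2:] = 0

# segment_info: same shape as x; assumed 1 = first sentence, 2 = second, 0 = padding
segment_info = torch.tensor([[1, 1, 1, 2, 2, 2, 0, 0]] * batch_size)

# With the package installed, these would be passed as: bert_model(x, segment_info)
```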

will "random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)" report error?

In dataset.py line 31, this seems to raise an error: random.randint() needs two positional arguments.
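
`random.randint(a, b)` does require both bounds and includes both ends. A short check, with a made-up value for `corpus_lines`:

```python
import random

corpus_lines = 5000  # hypothetical value
upper = corpus_lines if corpus_lines < 1000 else 1000

try:
    random.randint(upper)  # a single argument raises TypeError
    single_argument_works = True
except TypeError:
    single_argument_works = False

skip = random.randint(0, upper)  # explicit lower bound; both ends are inclusive
```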

what’s your data set?

Shape match for mul?

https://github.com/codertimo/BERT-pytorch/blob/alpha0.0.1a5/bert_pytorch/model/embedding/position.py#L18

What are the shapes of position and div_term?
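
In the usual sinusoidal construction, `position` is a column of shape (max_len, 1) and `div_term` is a vector of shape (d_model / 2,), so the product broadcasts to (max_len, d_model / 2). The sketch below uses illustrative sizes and is not copied from the linked file.

```python
import math

import torch

max_len, d_model = 512, 128

position = torch.arange(0, max_len).float().unsqueeze(1)  # (512, 1)
div_term = (torch.arange(0, d_model, 2).float()
            * -(math.log(10000.0) / d_model)).exp()       # (64,)

angles = position * div_term                              # broadcasts to (512, 64)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(angles)  # even columns
pe[:, 1::2] = torch.cos(angles)  # odd columns
```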

Imbalanced GPU memory usage

Hi,

Nice work on the BERT implementation.

I tried to run your code on 4 V100s and found that the memory usage is imbalanced: the first GPU consumes 2x as much memory as the others. Any idea about the reason?

Btw, I think the parameter order in train.py line 64 is incorrect.

Why output_label=0 in datasets generation for Masked LM

In dataset.py, function 'random_word', line 90: why is the output_label of the 85% of the data that is not masked set to 0, i.e. output_label.append(0)?

segmentation fault

The error occurs when the code's Dataset class is modified. I tried to debug the error; the code fails at the line loss.backward().

Very low GPU usage when training on 8 GPU in a single machine

Hi, I am currently pretraining BERT on my own data. I use the alpha0.0.1a5 branch (newest version).
I found that only 20% of the GPU is in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with PyTorch. Does anyone know why?

Question about the loss of Masked LM

Thank you very much for this great contribution.
I found that the masked LM loss stops decreasing once it reaches a value around 7. However, in the official TensorFlow implementation, the MLM loss decreases to 1 easily. I think something went wrong in your implementation.
In addition, I found the code cannot predict the next sentence correctly. I think the reason is: self.criterion = nn.NLLLoss(ignore_index=0). It cannot be used as the criterion for sentence prediction because the sentence label is 1 or 0. We should remove ignore_index=0 for sentence prediction.
I am looking forward to your reply~
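
The effect described above can be reproduced with a few hand-written probabilities. With `ignore_index=0`, examples whose label is 0 contribute nothing, so only the label-1 example is scored. The numbers below are made up for illustration.

```python
import math

import torch
import torch.nn as nn

# Each row holds log-probabilities for the two classes (label 0, label 1).
log_probs = torch.log(torch.tensor([[0.9, 0.1],    # label 0, predicted well
                                    [0.2, 0.8],    # label 0, predicted badly
                                    [0.3, 0.7]]))  # label 1
is_next = torch.tensor([0, 0, 1])

loss_ignoring_zero = nn.NLLLoss(ignore_index=0)(log_probs, is_next)
loss_all_labels = nn.NLLLoss()(log_probs, is_next)

# Only the third example remains when label 0 is ignored.
expected_ignoring_zero = -math.log(0.7)
```

A separate criterion without `ignore_index` for the sentence-level labels would let both classes contribute.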

how to test the model?

Could you please give some example code to load in the pre-trained model(i.e. bert.model.ep0 files)?
The code might take me a while to understand so I really appreciate it if you can help me.

Is it possible to train BERT?

Is it possible to achieve the same result as the paper in a short time?
Well, I don't have enough GPUs and computation power to see results comparable to Google AI's.

If we can't train on the full corpus as Google did, then how can we verify that this code is correct?
Training on a 256M-size corpus without Google AI class GPU computation is nearly impossible for me.

If you have any thoughts (e.g. reducing the model size), please let me know!

Maybe Bugs?

In pretrain.py, in the save() method, I guess self.bert.to(self.device) should be removed, right?

How to test your model after training on my own dataset?

I am using this for next sent. gen. It has stored the model in .ep* format, but how do I run my test dataset using those models?

Thanks

[BERT] Cannot import bert

I have problems importing bert when following http://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

(mxnet_p36) [ec2-user@master ~]$ ipython
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import warnings
   ...: warnings.filterwarnings('ignore')
   ...:
   ...: import random
   ...: import numpy as np
   ...: import mxnet as mx
   ...: from mxnet import gluon
   ...: import gluonnlp as nlp
   ...:
   ...:


In [2]:

In [2]: np.random.seed(100)
   ...: random.seed(100)
   ...: mx.random.seed(10000)
   ...: ctx = mx.gpu(0)
   ...:
   ...:

In [3]: from bert import *
   ...:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-40b999f3ea6a> in <module>()
----> 1 from bert import *

ModuleNotFoundError: No module named 'bert'

It looks like gluonnlp was installed successfully. Any idea?

(mxnet_p36) [ec2-user@master site-packages]$ ll /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
-rw-rw-r-- 1 ec2-user ec2-user 499320 Dec 28 23:15 /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg

Mask language model loss

Hi,
Thank you for your clean code on Bert. I have a question about Mask LM loss after I read your code. Your program computes a mask language model loss on both positive sentence pairs and negative pairs.

Does it make sense to compute Mask LM loss on negative sentence pairs? I am not sure how Google computes this loss.

Vocab Replace \t to blank issue

When the corpus is:
how are you \t nice to meet you
and the bert-vocab command is applied, the output vocab is
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to'].
But when the corpus is changed to
how are you\tnice to meet you, the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice']; the last token becomes younice.
A blank is needed on both sides of '\t'.
It may not be a bug.
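
The behaviour is consistent with the tab being removed rather than replaced by a space before splitting. That is an assumption about the vocab builder, inferred from the output above. A small comparison:

```python
line = "how are you\tnice to meet you\n"

# Deleting the tab glues the neighbouring words together.
joined = line.replace("\n", "").replace("\t", "").split()

# Replacing the tab with a space (or splitting on any whitespace) keeps them apart.
separated = line.replace("\n", "").replace("\t", " ").split()
whitespace_split = line.split()
```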

made a script to generate bert pre-train data

The script is similar to https://github.com/google-research/bert/blob/master/create_pretraining_data.py from google-research.
It can convert a document into BERT training data.

EP_train:0: 0%|| 0/15636 [00:00<?, ?it/s]Segmentation fault (core dumped)

the data format is s, t, s_l, t_l, IsNext.

chooses 15% of token

The paper mentions:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy.

This means that 15% of the tokens will be chosen for sure.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68,
every single token has a 15% chance of going through the follow-up procedure. Is that consistent with 15% of the tokens being chosen?
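
The two readings can be written side by side. Drawing a fixed number of positions always selects about 15% of the sequence, while an independent draw per token selects 15% only on average. Both helpers below are hypothetical names for illustration.

```python
import random


def select_exact(num_tokens, ratio=0.15, rng=random):
    """Pick a fixed count of positions (at least one), without replacement."""
    k = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), k))


def select_bernoulli(num_tokens, ratio=0.15, rng=random):
    """Pick each position independently; the count varies from call to call."""
    return [i for i in range(num_tokens) if rng.random() < ratio]


random.seed(0)
exact_positions = select_exact(20)
bernoulli_positions = select_bernoulli(20)
```

For short sentences the independent draw can select no token at all, which the fixed-count version avoids.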

Pretrained model transfer to pytorch

As you all know, it's nearly impossible to train from scratch because of the lack of computation power. So I'm going to implement transfer code so that the pretrained model can be used in PyTorch too.

This implementation will start when Google releases their official BERT code and pretrained model. If anyone is interested in joining this work, please leave a comment below.

Thank you to everyone who is carefully watching this project👍
By Junseong Kim

the format of input

You mentioned that

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

and gave an example:

Welcome to the \t the jungle\n
I can stay \t here all night\n

However, the example is actually ONE sentence in one line.
Should it be:

Welcome to the jungle \t I can stay here all night\n

(suppose these two sentences are continuous in the broader context)
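
Under that reading, each corpus line splits into exactly two sentences at the tab character. A minimal sketch of parsing one such line, using the example sentences from above:

```python
line = "Welcome to the jungle \t I can stay here all night\n"

# Split on the tab separator and trim the surrounding spaces and newline.
first, second = (part.strip() for part in line.split("\t"))
```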

self.d_k = d_model // h gives 64 dimension ?

BERT-pytorch/bert_pytorch/model/attention/multi_head.py

Line 15 in d10dc4f

self.d_k = d_model // h

It looks like self.d_k = d_model // h, i.e. embed size 768 divided by the number of heads 12 = 64

        self.d_k = d_model // h # 64
        self.h = h # 12 heads
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

Why convert the 768-dimensional [q, v, k] into 64-dimensional embeddings?

Reference:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
I put some comments on the shape:

class MultiHeadedAttention(nn.Module): # d_model=512, h=8
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h # 512//8=64
        self.h = h # 8
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k) # (nbatches, -1, 512)
        return self.linears[-1](x)
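
In the code above, the model dimension is split across heads rather than reduced: 768 features become 12 heads of 64, attention runs per head, and the heads are concatenated back to 768. A shape-only sketch with illustrative sizes, leaving out the linear projections:

```python
import torch

batch, seq_len, d_model, h = 2, 7, 768, 12
d_k = d_model // h  # 64 features per head

x = torch.randn(batch, seq_len, d_model)

# Split the last dimension into heads: (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)

# Per-head attention scores: (batch, h, seq_len, seq_len)
scores = heads @ heads.transpose(-2, -1) / d_k ** 0.5
attn = torch.softmax(scores, dim=-1)

# Concatenate the heads back: (batch, seq_len, h * d_k) == (batch, seq_len, d_model)
merged = (attn @ heads).transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)
```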

Making Wikipedia Corpus