
bertsum's Introduction

BertSum

This code is for the paper Fine-tune BERT for Extractive Summarization (https://arxiv.org/pdf/1903.10318.pdf)

New: Please see our full paper with trained models

Results on CNN/Dailymail (25/3/2019):

Models                 ROUGE-1   ROUGE-2   ROUGE-L
Transformer Baseline   40.9      18.02     37.17
BERTSUM+Classifier     43.23     20.22     39.60
BERTSUM+Transformer    43.25     20.24     39.63
BERTSUM+LSTM           43.22     20.17     39.59

Python version: This code is written for Python 3.6

Package Requirements: pytorch, pytorch_pretrained_bert, tensorboardX, multiprocess, pyrouge

Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py)

Data Preparation For CNN/Dailymail

Option 1: download the processed data

Download https://drive.google.com/open?id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6

Unzip the archive and put all .pt files into bert_data

Option 2: process the data yourself

Step 1. Download Stories

Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files in one directory (e.g. ../raw_stories)

Step 2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2017-06-09 directory.
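A quick way to check, before running Step 3, that the variable is visible to Python (a minimal sketch; matching on the substring "stanford-corenlp" is just an assumption about how the jar is named):

import os

classpath = os.environ.get("CLASSPATH", "")
if "stanford-corenlp" not in classpath:
    raise RuntimeError("CLASSPATH does not seem to include the Stanford CoreNLP jar: %r" % classpath)
print("CoreNLP jar found on CLASSPATH:", classpath)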

Step 3. Sentence Splitting and Tokenization

python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
  • RAW_PATH is the directory containing the story files (../raw_stories); TOKENIZED_PATH is the target directory for the tokenized files (../merged_stories_tokenized)

Step 4. Format to Simpler Json Files

python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower 
  • RAW_PATH is the directory containing the tokenized files (../merged_stories_tokenized); JSON_PATH is the target directory for the generated json files (../json_data/cnndm); MAP_PATH is the directory containing the url mapping files (../urls)

Step 5. Format to PyTorch Files

python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
  • JSON_PATH is the directory containing json files (../json_data), BERT_DATA_PATH is the target directory to save the generated binary files (../bert_data)

  • -oracle_mode can be greedy or combination; combination is more accurate but takes much longer to process (a rough sketch of the greedy idea follows below)
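For intuition, here is a rough sketch of the greedy oracle idea. It is my simplification, using unigram overlap as a cheap stand-in for ROUGE; the repository's own greedy_selection scores real ROUGE-1/2 and differs in detail:

def greedy_oracle(src_sents, ref_sents, max_sents=3):
    # src_sents / ref_sents: lists of token lists
    ref = set(w for s in ref_sents for w in s)

    def overlap(selected):
        sel = set(w for i in selected for w in src_sents[i])
        return len(ref & sel) / max(len(ref), 1)

    selected, best = [], 0.0
    for _ in range(max_sents):
        candidates = [(overlap(selected + [i]), i)
                      for i in range(len(src_sents)) if i not in selected]
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:          # no remaining sentence improves the overlap -> stop
            break
        selected.append(idx)
        best = score
    return sorted(selected)        # the 0/1 oracle labels are derived from these indices

Combination mode presumably scores sets of sentences jointly rather than adding them one at a time, which is why it is much slower.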

Model Training

First run: the first time you train, use a single GPU so the code can download the BERT model: change -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 to -visible_gpus 0 -gpu_ranks 0 -world_size 1. Once the download has finished, you can kill the process and rerun the code with multiple GPUs. (A minimal pre-download sketch is shown below.)
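A minimal sketch of pre-downloading the weights from a Python shell instead; cache_dir='../temp' is an assumption based on the temp_dir='../temp' default visible in the argument dumps further down, so adjust it to your setup:

from pytorch_pretrained_bert import BertModel, BertTokenizer

# Downloads and caches the bert-base-uncased vocabulary and weights so that
# later multi-GPU runs find them locally instead of downloading concurrently.
BertTokenizer.from_pretrained('bert-base-uncased', cache_dir='../temp')
BertModel.from_pretrained('bert-base-uncased', cache_dir='../temp')
print('bert-base-uncased cached.')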

To train the BERT+Classifier model, run:

python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0,1,2  -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000

To train the BERT+Transformer model, run:

python train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_transformer -lr 2e-3 -visible_gpus 0,1,2  -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048 -inter_layers 2 -heads 8

To train the BERT+RNN model, run:

python train.py -mode train -encoder rnn -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_rnn -lr 2e-3 -visible_gpus 0,1,2  -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_rnn -use_interval true -warmup_steps 10000 -rnn_size 768 -dropout 0.1
  • -mode can be {train, validate, test}. validate inspects the model directory and evaluates each newly saved checkpoint; test must be used with -test_from, indicating the checkpoint you want to use

Model Evaluation

After training has finished, run

python train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path MODEL_PATH  -visible_gpus 0  -gpu_ranks 0 -batch_size 30000  -log_file LOG_FILE  -result_path RESULT_PATH -test_all -block_trigram true
  • MODEL_PATH is the directory of saved checkpoints
  • RESULT_PATH is where you want to put decoded summaries (default ../results/cnndm)
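If you want to score the decoded summaries in RESULT_PATH yourself, here is a hedged sketch using pyrouge (the directory names and filename patterns are placeholders, not what train.py produces, and a working ROUGE-1.5.5 installation must already be registered with pyrouge; see the settings.ini issue further down):

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = '../results/candidate'            # decoded summaries, one file per document
r.model_dir = '../results/gold'                  # reference summaries, one file per document
r.system_filename_pattern = r'cand.(\d+).txt'
r.model_filename_pattern = 'gold.#ID#.txt'
output = r.convert_and_evaluate()
print(r.output_to_dict(output))                  # ROUGE-1/2/L precision, recall, F1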

bertsum's People

Contributors

nlpyang


bertsum's Issues

Question about Learning rate

According to your paper:

[screenshot of the learning-rate schedule from the paper]

As you said, it follows Attention Is All You Need:

[screenshot of the learning-rate schedule from Attention Is All You Need]

But in your case, what is the reason you chose 2e-3 as the initial learning rate?

If we follow the formula of Vaswani et al., we have:

d_model = 768 (because we use BERT-base)
d_model^-0.5 ~= 0.036

So where does this 2e-3 come from?


Also, why choose 10,000 warmup steps?

The original Transformer paper used 4,000 warmup steps over 100,000 total steps.

The BERT paper used 10,000 warmup steps over 1,000,000 total steps, but it did not use the noam decay method, just linear decay.
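For reference, a minimal sketch of the noam schedule as it appears to be used here, with the d_model^-0.5 factor replaced by a tunable constant. This reading is an assumption rather than a confirmed answer, but it is consistent with the lr values in the training log pasted in a later issue (e.g. step 500 -> lr 1e-6):

def noam_lr(step, lr=2e-3, warmup_steps=10000):
    # step >= 1. lr plays the role of d_model ** -0.5 in Vaswani et al.; with
    # lr = 2e-3 and warmup_steps = 10000 the peak rate is 2e-3 * 10000 ** -0.5 = 2e-5.
    return lr * min(step ** -0.5, step * warmup_steps ** -1.5)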

Doubt about processed data

Sorry for opening an issue on this; the question is sort of trivial.
What oracle mode was used for the processed data, combination or greedy?

Format JSON file to Pytorch File

First of all, many thanks for the code!

I am trying to convert the sample json file in the ../json_data directory to a PyTorch file in the ../bert_data directory (testing this out so I can use my own text in JSON format):

python preprocess.py -mode format_to_bert -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log

However, the code doesn't seem to do much. I get the following back:

[('../json_data\\cnndm_sample.train.0.json', Namespace(dataset='', log_file='../logs/preprocess.log', lower=True, map_path='../data/', max_nsents=100, max_src_ntokens=200, min_nsents=3, min_src_ntokens=5, mode='format_to_bert', n_cpus=4, oracle_mode='greedy', raw_path='../json_data/', save_path='../bert_data/', shard_size=2000), '../bert_data/bert.pt_data\\cnndm_sample.train.0.bert.pt')]

And it has been stuck on this for the last 2-3 hours. I would think that converting the .json files to .pt shouldn't take this long (also, my CPUs are not utilized at all). Have you encountered this?

summarization

Hi,
There are 50 checkpoints in the folder. I need the summaries of the articles to compare, but the results files contain nothing.

Evaluation part

[screenshot of the evaluation error]
Sorry, I encountered some errors while running the model evaluation. How can I solve them?

Processed data lacks files

Dear author,

May I ask why the number of data points in processed training data is only 287084, while the paper claims it uses 287227 training points?

Looking forward to your reply,
Thank you,

Problem about Running Test to get Rouge Score

Hello Yang. First, thank you very much for sharing your method and code. I have run into a problem when doing the model evaluation. Could you please help me figure it out?

I have run this code for testing the model on test set:
python train.py -mode test -bert_data_path ../bert_data/cnndm -model_path MODEL_PATH -visible_gpus -1 -gpu_ranks 0 -batch_size 30000 -log_file LOG_FILE -result_path RESULT_PATH -test_all -block_trigram true -test_from /Users/admin/Desktop/XXX/model_step_50000.pt

An Error is raised:
FileNotFoundError: [Error 2] No such file or directory: 'XXX/.pyrouge/settings.ini'

Have you met this kind of problem before? Or can you suggest another way to calculate ROUGE for the model? Thank you very much!
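For what it's worth, this settings.ini error usually means pyrouge has not yet been pointed at a ROUGE-1.5.5 installation. A hedged sketch (the path is a placeholder; as far as I recall, the pyrouge_set_rouge_path command-line helper does the same thing):

from pyrouge import Rouge155

# Registers the ROUGE-1.5.5 location with pyrouge, which writes its settings.ini.
Rouge155(rouge_dir='/path/to/ROUGE-1.5.5')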

About rouge score

Hi:

[screenshot of the reported results from the paper]

In the paper, is the reported result the best single checkpoint, or the average over the top-3 checkpoints? Because in my test, for BERTSUM+Transformer the best result is 43.23 and the averaged result is 43.1466.

Does the Option 1 processed data use combination_selection or greedy_selection?
Because using the Option 1 data gives a better ROUGE result than Option 2.

Thanks.

Empty Logs Directory for First Run

I am trying to reproduce the code in Google Colab using a GPU. However, after preprocessing, when I try to run the BERT+Classifier model for the first time (with -visible_gpus 0, etc.), I get an error that the '/logs/bert_classifier' file/directory doesn't exist, as the logs folder is empty. Should there be anything there? Or is the issue that the code hasn't downloaded the BERT model?

[screenshot: BertSum logs error]

Thanks

Training without GPU

Hi, I have downloaded the processed data and I am trying to run the command

python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000

However, I get the error:

RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /Users/distiller/project/conda/conda-bld/pytorch_1556653464916/work/torch/lib/THD/process_group/General.cpp:20

My machine does not have a GPU. Is there any way I can run this only on a CPU? Thanks!

requirements for bert-large?

What issues, if any, would occur if bert-large were used? For example, GPU requirements and training time: would it be too costly? Is there any reason why bert-base was used instead of bert-large?

Testing Performance

Just wanted to know if this is comparable to your model, and if not, where could I possibly improve?
[screenshot of my test results]

A question about format_to_lines

Hi,

My commands:
python3 preprocess.py -mode tokenize -raw_path ../raw_stories/ -save_path ../merged_stories_tokenized -log_file ../logs/cnndmtoken.log -n_cpus 50 -log_file ../logs/cnndmtoken.log
python3 preprocess.py -mode format_to_lines -raw_path ../merged_stories_tokenized -save_path ../json_data/cnndm -map_path ../urls -lower -log_file ../logs/cnndmtoken.log

But after format_to_lines, the tgt field in my files is an empty list.

Part of cnndm.train.0.json:

portedly", "treated", "for", "appendicitis", "but", "was", "well", "enough", "to", "walk", "out", "of", "the", "clinic", "on", "his", "own", "@highlight", "took", "a", "private", "jet", "to", "the", "argentinian", "capital", "and", "is", "now", "under", "`", "observation", "'"]], "tgt": []},

Appreciated!
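For context, a rough sketch of how the @highlight-based source/target split is commonly understood to work on CNN/DM stories (my assumption, not the repository's exact load_json code): a tokenized sentence that is exactly "@highlight" marks the next sentence as a summary sentence. If "@highlight" ends up inside a merged sentence, as in the fragment above, tgt stays empty:

def split_src_tgt(sentences):
    # sentences: list of token lists produced by the tokenizer
    src, tgt = [], []
    next_is_highlight = False
    for tokens in sentences:
        if tokens == ["@highlight"]:
            next_is_highlight = True
            continue
        if next_is_highlight:
            tgt.append(tokens)       # sentence right after "@highlight" -> summary
            next_is_highlight = False
        else:
            src.append(tokens)       # everything else -> source article
    return src, tgt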

question about candidate

Hello,

About the trigram blocking: is the candidate c (as per your paper) a group of 3-grams randomly selected from each source sentence?

Is this the code for the candidate?

for i, idx in enumerate(selected_ids):
    _pred = []
    if (len(batch.src_str[i]) == 0):
        continue
    for j in selected_ids[i][:len(batch.src_str[i])]:
        if (j >= len(batch.src_str[i])):
            continue
        candidate = batch.src_str[i][j].strip()
        if (self.args.block_trigram):
            if (not _block_tri(candidate, _pred)):
                _pred.append(candidate)
        else:
            _pred.append(candidate)

        if ((not cal_oracle) and (not self.args.recall_eval) and len(_pred) == 3):
            break
I'm trying to understand this code. Is 'j' in the for loop indexing a character, or is it a word? I was trying to match this with the preprocessing step and trigram blocking.

def _block_tri(c, p):

If you could help me understand this code a bit, that would be great.

Is this code saying: take the characters of each source string and find repeating trigrams? Is '_pred' appending words or characters here?
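For reference, a minimal sketch of what _block_tri is commonly understood to do (an approximation written from context, not the repository's exact code). In the loop above, candidate is a whole source sentence string and _pred holds the sentences already selected, so the blocking works on word trigrams of sentences, not characters:

def _block_tri(candidate, selected):
    # candidate: one source sentence (a string); selected: sentences already chosen
    def trigrams(sentence):
        words = sentence.split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    cand_tri = trigrams(candidate)
    for sent in selected:
        if cand_tri & trigrams(sent):
            return True      # shares a word trigram with a chosen sentence -> block it
    return False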

how long does it take with 3 gpus?

Hello,

I was wondering how long it will take to train the model with 3 GPUs. I'm trying to calculate the cost and whether it is affordable for me to use an AWS p3.16xlarge to train the model.

saved checkpoints

Are you planning to publish your saved checkpoints for the models ? (particularly BERTSUM+Transformer)

Thanks in advance :)

Truncated article without oracle

Thank you for sharing such a great codebase :)

I have a question about truncated article.


As mentioned in #14, articles are truncated at 512 tokens. In some cases, if the oracle sentences are located at the end of the article, this produces samples with no gold labels.

So for these "empty" samples, the network will be trained to classify all of the article's sentences as not salient.

This process raises several questions:

  1. Is it useful for performance to keep such "empty" samples?
    Did you compare the performance of the actual network with a network trained without "empty" samples (even empirically)?

  2. It seems similar to SQuAD 2.0: teaching the network that there are not always 3 salient sentences in the input (sometimes there are 2, sometimes 1, sometimes 0).
    Yet at test time, you invariably pick the 3 best sentences, no matter their score (i.e. no matter whether the network decided that only 2/1/0 sentences were really salient).
    It seems to be an important difference between training and inference. Is my intuition wrong?
    If I'm wrong, can you (quickly) explain where I misunderstood?
    If I'm right, isn't it going to hurt performance (maybe there are too few of these empty samples to really hurt it)?


Thank you for your answer !

Bert Fine-tuned problem

Hi:
[screenshot of the BERTSUM input embedding figure from the paper]
Will the token embeddings, interval segment embeddings, and position embeddings be trained (fine-tuned)?

Thank you!

A question about preprocessing

Hi,

I have some problems in preprocessing.

I downloaded the cnn_stories_tokenized and dm_stories_tokenized data; however, it is all *.story files. Your preprocess.py requires *.json as input; can you provide the code that transforms *.story to *.json? I ran into some problems reading the sample json with your load_json function in data_builder.py. Appreciated!

Do you have the limitation of the article length?

The original BERT model restricts the maximum input length to 512 tokens. So in summarization, did you set any hand-crafted scheme to restrict the article length, or do you just feed all of the article's tokens into the pre-trained BERT model?

ROUGE1.5.5

hi,
Can I have your ROUGE-1.5.5 files or a download link?
Because my Lead-3 and other results are better than the results in your paper.
Is that normal?

Thanks.

testing on new text

Hi,

I am currently in the process of testing some recent approaches to extractive summarization. I just want to test the models on a collection of text that I have, but I still could not work out what I should do to summarize a new piece of text using your codebase. Any pointers?

Thanks!

about positional encoding

Under model_builder.py, BERT is initialized and vectors from BERT are fed into the encoder. The encoder itself applies positional embeddings to these vectors (under encoder.py). There are no positional embeddings prior to this. Am I to assume that the BERT model adds positional embeddings internally, so we are not required to add them before vectorization with BERT?

list index out of range

File "D:/untitled/BertSum-master/src/train.py", line 349, in
step = int(cp.split('.')[-2].split('_')[-1])
IndexError: list index out of range

What should I do?
Thanks.

Access to pretrained model

Hi Yang, I wanted to see the pretrained model results. Could you please provide access to the pretrained model? Permission to download is not open from the link mentioned in the README file.

Problem with finding the original article

Hi, sorry to trouble you:
How can I get the original article?
For example, for the reference summary ref.83.txt and candidate can.83.txt, I cannot find the original article.

And the results I get are different from yours; my pyrouge configuration is correct, so what am I doing wrong? Can you help me? Thanks!
This is the classifier log file: [screenshot attachments]

About preprocessing for CNN/DailyMail

Hi,
In your paper, or in the Option 1 data: did you remove the sentences shorter than args.min_nsents 5 (default) for Lead-3 or any mode on CNN/DailyMail?

Thank you very much.

about load_pretrained_bert=True?

I see that load_pretrained_bert=True was used during training but not for validation or testing. Is there a particular reason for this? I'm assuming 'load_pretrained_bert=True' is there to load the pretrained BERT weights. Why not use it for testing and validation as well?

Experimenting with the number of sentences selected

I read in the paper that for extractive summarization you only take the top 3 sentence scores. There's no explicit mention of this in the code; is there any way I could modify it to get more sentences and experiment a little?
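The hard-coded len(_pred) == 3 check in the selection loop quoted in an earlier issue looks like the relevant place. For experimentation, a hedged sketch of the general idea with a configurable k (my illustration, not the repository's code):

import numpy as np

def select_top_k(sent_scores, k=3):
    # Rank sentences by predicted score, keep the k best, restore document order.
    order = np.argsort(-np.asarray(sent_scores))[:k]
    return sorted(order.tolist())

# Example: select_top_k([0.1, 0.9, 0.3, 0.7], k=2) -> [1, 3]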

How to test?

Hello, can you tell me how to test? How should I set -test_from?
Can you show me the command?

Two confusing parts

  1. From the code in trainer.py (function test) and utils.py (function test_rouge), I think you compare cnndm_step50000.candidate and cnndm_step50000.gold to compute ROUGE for model evaluation. In my understanding, cnndm_step50000.gold is the oracle summary generated by the greedy algorithm, i.e. it is not the abstractive summary from the original document. I wonder why you take cnndm_step50000.gold as the reference rather than the abstractive summary of the document? I think taking the original abstractive summary as the reference would give a more comparable ROUGE score.

  2. In the paper's Table 1 (test set results on the CNN/DailyMail dataset using ROUGE F1), you show ROUGE scores for Oracle and the other models. I want to know how you calculate the Oracle ROUGE (52.59; 31.24; 48.87), and what you take as the reference. And how do you calculate the BERTSUM+Transformer ROUGE (43.25; 20.24; 39.63), taking what as the reference?

model training interrupt

Following the README, I downloaded the processed data and tried to train the model myself.
I used a single GPU to download the BERT model, and after that I reran the code with multiple GPUs. The program starts executing, but after a while, whether with a single GPU or multiple GPUs, the following problem always occurs. The step at which the interruption happens can differ between experiments.

[2019-04-04 00:46:20,545 INFO] loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2019-04-04 00:46:23,405 INFO] Step 400/50000; xent: 3.30; lr: 0.0000008; 20 docs/s; 368 sec
[2019-04-04 00:47:08,724 INFO] Step 450/50000; xent: 3.33; lr: 0.0000009; 22 docs/s; 413 sec
[2019-04-04 00:47:51,472 INFO] Step 500/50000; xent: 3.19; lr: 0.0000010; 23 docs/s; 456 sec
[2019-04-04 00:48:35,173 INFO] Step 550/50000; xent: 3.22; lr: 0.0000011; 23 docs/s; 500 sec
[2019-04-04 00:49:16,433 INFO] Loading train dataset from ../bert_data/cnndm.train.6.bert.pt, number of examples: 2001
[2019-04-04 00:49:41,427 ERROR] Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
File "train.py", line 335, in
train(args, device_id)
File "train.py", line 267, in train
trainer.train(train_iter_fct, args.train_steps)
File "/users4/zyfeng/gitcodes/BertSum/src/models/trainer.py", line 142, in train
for i, batch in enumerate(train_iter):
File "/users4/zyfeng/gitcodes/BertSum/src/models/data_loader.py", line 141, in iter
self.cur_iter = self._next_dataset_iterator(dataset_iter)
File "/users4/zyfeng/gitcodes/BertSum/src/models/data_loader.py", line 159, in _next_dataset_iterator
device=self.device, shuffle=self.shuffle, is_test=self.is_test)
File "/users4/zyfeng/gitcodes/BertSum/src/models/data_loader.py", line 175, in init
self.bert_data = BertData(args)
File "/users4/zyfeng/gitcodes/BertSum/src/models/data_loader.py", line 15, in init
self.sep_vid = self.tokenizer.vocab['[SEP]']
AttributeError: 'NoneType' object has no attribute 'vocab'

The following files are downloaded in the temp dir.

26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.json
9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba.json

I am curious about the reason for this. I don't know what I should do.

Infinite loop on loading training dataset

Hi, my command line is "python train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../news_bert/cnndm -model_path ../bert_model/bert_transformer -lr 2e-3 -visible_gpus 1 -gpu_ranks 1 -world_size 1 -report_every 50 -save_checkpoint_steps 1 -batch_size 3000 -decay_method noam -train_steps 1 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048 -inter_layers 2 -heads 8",
and then I encountered a bug: the dataset is loaded an infinite number of times during the training phase.

[2019-06-12 16:39:12,359 INFO] Loading train dataset from ../news_bert/cnndm.train.1.bert.pt, number of examples: 1961
[2019-06-12 16:39:13,338 INFO] Loading train dataset from ../news_bert/cnndm.train.6.bert.pt, number of examples: 1970
[2019-06-12 16:39:14,270 INFO] Loading train dataset from ../news_bert/cnndm.train.15.bert.pt, number of examples: 1564
[2019-06-12 16:39:15,057 INFO] Loading train dataset from ../news_bert/cnndm.train.7.bert.pt, number of examples: 1962
[2019-06-12 16:39:16,033 INFO] Loading train dataset from ../news_bert/cnndm.train.3.bert.pt, number of examples: 1971
[2019-06-12 16:39:17,077 INFO] Loading train dataset from ../news_bert/cnndm.train.11.bert.pt, number of examples: 1959
[2019-06-12 16:39:18,029 INFO] Loading train dataset from ../news_bert/cnndm.train.13.bert.pt, number of examples: 1972
[2019-06-12 16:39:18,870 INFO] Loading train dataset from ../news_bert/cnndm.train.8.bert.pt, number of examples: 1967
[2019-06-12 16:39:19,761 INFO] Loading train dataset from ../news_bert/cnndm.train.14.bert.pt, number of examples: 1973
[2019-06-12 16:39:20,687 INFO] Loading train dataset from ../news_bert/cnndm.train.2.bert.pt, number of examples: 1970
[2019-06-12 16:39:21,623 INFO] Loading train dataset from ../news_bert/cnndm.train.9.bert.pt, number of examples: 1971
[2019-06-12 16:39:22,526 INFO] Loading train dataset from ../news_bert/cnndm.train.9.bert.pt, number of examples: 1971
[2019-06-12 16:39:23,377 INFO] Loading train dataset from ../news_bert/cnndm.train.14.bert.pt, number of examples: 1973
[2019-06-12 16:39:24,187 INFO] Loading train dataset from ../news_bert/cnndm.train.4.bert.pt, number of examples: 1963
[2019-06-12 16:39:25,108 INFO] Loading train dataset from ../news_bert/cnndm.train.7.bert.pt, number of examples: 1962
[2019-06-12 16:39:26,105 INFO] Loading train dataset from ../news_bert/cnndm.train.10.bert.pt, number of examples: 1970
[2019-06-12 16:39:27,056 INFO] Loading train dataset from ../news_bert/cnndm.train.11.bert.pt, number of examples: 1959
[2019-06-12 16:39:28,027 INFO] Loading train dataset from ../news_bert/cnndm.train.0.bert.pt, number of examples: 1960
[2019-06-12 16:39:28,954 INFO] Loading train dataset from ../news_bert/cnndm.train.5.bert.pt, number of examples: 1982
[2019-06-12 16:39:29,930 INFO] Loading train dataset from ../news_bert/cnndm.train.13.bert.pt, number of examples: 1972
[2019-06-12 16:39:30,946 INFO] Loading train dataset from ../news_bert/cnndm.train.3.bert.pt, number of examples: 1971
What can I do? Thanks!

Run error due to dataset

Traceback (most recent call last):
File "train.py", line 340, in
train(args, device_id)
File "train.py", line 272, in train
trainer.train(train_iter_fct, args.train_steps)
File "/home/wsy/xry/BertSum-master/src/models/trainer.py", line 142, in train
for i, batch in enumerate(train_iter):
File "/home/wsy/xry/BertSum-master/src/models/data_loader.py", line 131, in iter
for batch in self.cur_iter:
File "/home/wsy/xry/BertSum-master/src/models/data_loader.py", line 235, in iter
batch = Batch(minibatch, self.device, self.is_test)
File "/home/wsy/xry/BertSum-master/src/models/data_loader.py", line 27, in init
src = torch.tensor(self._pad(pre_src, 0))
File "/home/wsy/xry/BertSum-master/src/models/data_loader.py", line 14, in _pad
width = max(len(d) for d in data)
ValueError: max() arg is an empty sequence

A question about the final vocabulary

Dear,
Thanks for your great work.
I have a question about the final vocabulary.

Should I maintain a vocabulary for the corpus, or use the exact vocabulary provided by BertTokenizer.from_pretrained('bert-base-uncased').vocab?
In other words, should I use the exact vocab of the BertTokenizer when fine-tuning BERT?
I found that there are only 27615 English words in the vocab of BertTokenizer.

Best regards.

Problem of processed data

Hi,

I cloned the repo, downloaded the processed data, and then ran this command:
----------- cmd-----------
python train.py -mode train -encoder rnn -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0 -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 64 -decay_method noam -train_steps 5120 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 256
----------- cmd-----------

I got this error:
------------error------------
[2019-05-14 03:21:58,963 INFO] Start training...
[2019-05-14 03:21:59,150 INFO] Loading train dataset from ../bert_data_tmp/cnndm.train.0.bert.pt, number of examples: 2001
Traceback (most recent call last):
File "train.py", line 340, in
train(args, device_id)
File "train.py", line 272, in train
trainer.train(train_iter_fct, args.train_steps)
File "/notebooks/workspace/git/BertSum/src/models/trainer.py", line 158, in train
report_stats)
File "/notebooks/workspace/git/BertSum/src/models/trainer.py", line 323, in _gradient_accumulation
sent_scores, mask = self.model(src, segs, clss, mask, mask_cls)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/notebooks/workspace/git/BertSum/src/models/model_builder.py", line 96, in forward
sent_scores = self.encoder(sents_vec, mask_cls).squeeze(-1)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/notebooks/workspace/git/BertSum/src/models/encoder.py", line 129, in forward
memory_bank = self.dropout(memory_bank) + x
RuntimeError: The size of tensor a (512) must match the size of tensor b (768) at non-singleton dimension 2
------------error------------

Unable to preprocess my own data

Hi, I'm currently trying to preprocess my own news articles so that I can use them with the pre-trained model. I'm using Stanford CoreNLP to preprocess the data and am then looking to use preprocess.py. Am I on the right track, and is this something I need to do if I want to generate summaries of my own articles?

Evaluation problem

Hi,
The program is running, but the results directory doesn't contain any results; is something wrong?
My command is:
python train.py -mode validate -bert_data_path /home/test/WangHN/BertSum-master/bert_data/cnndm -model_path /home/test/WangHN/BertSum-master/models/bert_classifier -visible_gpus 0 -gpu_ranks 0 -batch_size 30000 -log_file /home/test/WangHN/BertSum-master/logs/Evaluation/bert_classifier -result_path /home/test/WangHN/BertSum-master/results/classifier/cnndm -test_all -block_trigram true

The information is:
[2019-05-08 20:19:55,407 INFO] Loading checkpoint from /home/test/WangHN/BertSum-master/models/bert_classifier/model_step_3000.pt
Namespace(accum_count=1, batch_size=30000, bert_config_path='../bert_config_uncased_base.json', bert_data_path='/home/test/WangHN/BertSum-master/bert_data/cnndm', beta1=0.9, beta2=0.999, block_trigram=True, dataset='', decay_method='', dropout=0.1, encoder='classifier', ff_size=512, gpu_ranks=[0], heads=4, hidden_size=128, inter_layers=2, log_file='/home/test/WangHN/BertSum-master/logs/Evaluation/bert_classifier', lr=1, max_grad_norm=0, mode='validate', model_path='/home/test/WangHN/BertSum-master/models/bert_classifier', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='/home/test/WangHN/BertSum-master/results/classifier/cnndm', rnn_size=512, save_checkpoint_steps=5, seed=666, temp_dir='../temp', test_all=True, test_from='', train_from='', train_steps=1000, use_interval=True, visible_gpus='0', warmup_steps=8000, world_size=1)
[2019-05-08 20:20:04,796 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.0.bert.pt, number of examples: 2001
gpu_rank 0
[2019-05-08 20:20:04,799 INFO] * number of parameters: 109483009
[2019-05-08 20:20:42,455 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.1.bert.pt, number of examples: 2001
[2019-05-08 20:21:21,366 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.2.bert.pt, number of examples: 2001
[2019-05-08 20:22:00,217 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.3.bert.pt, number of examples: 2001
[2019-05-08 20:22:39,093 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.4.bert.pt, number of examples: 2001
[2019-05-08 20:23:17,807 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.5.bert.pt, number of examples: 2000
[2019-05-08 20:23:56,463 INFO] Loading valid dataset from /home/test/WangHN/BertSum-master/bert_data/cnndm.valid.6.bert.pt, number of examples: 1362
[2019-05-08 20:24:22,847 INFO] Validation xent: 0.125946 at step 3000

Should I wait for this command to finish running?

about preprocessing step

Can you explain what role labels plays in the BERT tokenization step.

especially this code:

labels = labels[:len(cls_ids)]

I don't understand what the above code does. How do labels play a part in tokenization for BERT?
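A hedged reading of that line (my assumption about the intent, not a confirmed answer): labels holds one 0/1 oracle flag per source sentence, and cls_ids holds the position of each surviving sentence's [CLS] token after the 512-token truncation, so the label list is cut to match the sentences that survived. A toy illustration:

# Toy example of the truncation-alignment idea (values are made up):
labels  = [1, 0, 0, 1, 0, 0]     # oracle flag per original sentence
cls_ids = [0, 18, 45, 77]        # [CLS] positions of the sentences kept after truncation
labels  = labels[:len(cls_ids)]  # -> [1, 0, 0, 1]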
