
bert4doc-classification's Introduction

How to Fine-Tune BERT for Text Classification?

This is the code and source for the paper How to Fine-Tune BERT for Text Classification?

In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification tasks and provide a general solution for BERT fine-tuning.

*********** update at Mar 14, 2020 *************

Our checkpoints can be loaded into BertEmbedding from the latest fastNLP package.

Link to fastNLP.embeddings.BertEmbedding

Requirements

For further pre-training, we borrow some code from Google BERT. Thus, we need:

  • tensorflow==1.1x
  • spacy
  • pandas
  • numpy

Note that you need Python 3.7 or earlier for compatibility with tensorflow 1.1x.

For fine-tuning, we borrow some code from the pytorch-pretrained-bert package (now well known as transformers). Thus, we need:

  • torch>=0.4.1,<=1.2.0

Run the code

1) Prepare the data set:

Sogou News

We determine the category of each news article from its URL, e.g., “sports” corresponds to “http://sports.sohu.com”. We choose 6 categories: “sports”, “house”, “business”, “entertainment”, “women” and “technology”. We select 9,000 training samples and 1,000 test samples per class.
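
A minimal sketch of this URL-based labeling (only the “sports” sub-domain is stated above; the other sub-domain names are illustrative assumptions, and the function name is hypothetical):

from urllib.parse import urlparse

# Map sohu.com sub-domains to the 6 chosen categories.
# Only "sports" -> http://sports.sohu.com is given in the text above;
# the remaining sub-domains are assumed to follow the same pattern.
SUBDOMAIN_TO_LABEL = {
    "sports": "sports",
    "house": "house",
    "business": "business",
    "yule": "entertainment",   # assumed sub-domain
    "women": "women",
    "it": "technology",        # assumed sub-domain
}

def label_from_url(url):
    """Return the category for a sohu.com article URL, or None to discard it."""
    subdomain = urlparse(url).netloc.split(".")[0]
    return SUBDOMAIN_TO_LABEL.get(subdomain)

# label_from_url("http://sports.sohu.com/some-article.shtml") -> "sports"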

Data is available here.

The remaining data sets

The remaining data sets were built by Zhang et al. (2015). We download them from the URL provided by Xiang Zhang.

2) Prepare Google BERT:

BERT-Base, Uncased

BERT-Base, Chinese

3) Further Pre-Training:

Generate Further Pre-Training Corpus

Here we use AG's News as an example:

python generate_corpus_agnews.py

The file agnews_corpus_test.txt can be found in the directory ./data.

Run Further Pre-Training

python create_pretraining_data.py \
  --input_file=./AGnews_corpus.txt \
  --output_file=tmp/tf_AGnews.tfrecord \
  --vocab_file=./uncased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
  
python run_pretraining.py \
  --input_file=./tmp/tf_AGnews.tfrecord \
  --output_dir=./uncased_L-12_H-768_A-12_AGnews_pretrain \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --save_checkpoints_steps=10000 \
  --learning_rate=5e-5

4) Fine-Tuning

Convert Tensorflow checkpoint to PyTorch checkpoint

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path ./uncased_L-12_H-768_A-12_AGnews_pretrain/model.ckpt-100000 \
  --bert_config_file ./uncased_L-12_H-768_A-12_AGnews_pretrain/bert_config.json \
  --pytorch_dump_path ./uncased_L-12_H-768_A-12_AGnews_pretrain/pytorch_model.bin

Fine-Tuning on downstream tasks

While fine-tuning on downstream tasks, we notice that different GPUs (e.g., 1080Ti vs. Titan Xp) may cause slight differences in experimental results even though we fix the initial random seed. Here we use 4 × 1080Ti as an example.

Take Exp-I (see Section 5.3) as an example:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python run_classifier_single_layer.py \
  --task_name imdb \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir ./IMDB_data/ \
  --vocab_file ./uncased_L-12_H-768_A-12_IMDB_pretrain/vocab.txt \
  --bert_config_file ./uncased_L-12_H-768_A-12_IMDB_pretrain/bert_config.json \
  --init_checkpoint ./uncased_L-12_H-768_A-12_IMDB_pretrain/pytorch_model.bin \
  --max_seq_length 512 \
  --train_batch_size 24 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./imdb \
  --seed 42 \
  --layers 11 10 \
  --trunc_medium -1

where num_train_epochs can be 3.0, 4.0, or 6.0.

layers indicates the list of layers whose hidden states are taken as features for classification: -2 means use the pooled output, -1 means concatenate all layers, and the command above concatenates layer-10 and layer-11 (the last two layers).
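
As a rough illustration (a minimal sketch, not the repository's exact implementation), selecting and concatenating the [CLS] hidden states of the chosen layers could look like this:

import torch

def build_features(all_layer_states, layers):
    """
    all_layer_states: list of 12 tensors of shape [batch, seq_len, hidden],
                      one per transformer layer (assumed input for this sketch).
    layers: e.g. [11, 10] concatenates the last two layers; [-1] uses all layers.
            (The -2 / pooled-output case is omitted here.)
    """
    chosen = all_layer_states if layers == [-1] else [all_layer_states[i] for i in layers]
    # take the hidden state of the [CLS] token (position 0) from each chosen layer
    cls_vectors = [h[:, 0, :] for h in chosen]
    return torch.cat(cls_vectors, dim=-1)   # shape: [batch, hidden * len(chosen)]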

trunc_medium indicates how long texts are truncated: -2 means head-only, -1 means tail-only, 0 means head-half + tail-half (e.g., head-256 + tail-256), and any other natural number k means head-k + tail-rest (i.e., head-k + tail-(512-k)).
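
A minimal sketch of these truncation strategies (ignoring the positions that the real script must reserve for the [CLS] and [SEP] tokens):

def truncate_tokens(tokens, max_len=512, trunc_medium=-1):
    """Keep at most max_len tokens according to the trunc_medium options above."""
    if len(tokens) <= max_len:
        return tokens
    if trunc_medium == -2:              # head-only
        return tokens[:max_len]
    if trunc_medium == -1:              # tail-only
        return tokens[-max_len:]
    if trunc_medium == 0:               # head-half + tail-half
        half = max_len // 2
        return tokens[:half] + tokens[-(max_len - half):]
    k = trunc_medium                    # head-k + tail-(max_len - k)
    return tokens[:k] + tokens[-(max_len - k):]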

There are also other arguments for fine-tuning:

pooling_type indicates which feature is used for classification: mean means mean-pooling over the hidden states of the whole sequence, max means max-pooling, and default means taking the hidden state of the [CLS] token as the feature.
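
A minimal sketch of the three pooling options (a full implementation would also mask out padding positions before mean/max pooling):

def pool_features(hidden_states, pooling_type="default"):
    """hidden_states: tensor of shape [batch, seq_len, hidden]."""
    if pooling_type == "mean":          # mean-pooling over the whole sequence
        return hidden_states.mean(dim=1)
    if pooling_type == "max":           # max-pooling over the whole sequence
        return hidden_states.max(dim=1).values
    return hidden_states[:, 0, :]       # default: hidden state of the [CLS] token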

layer_learning_rate and layer_learning_rate_decay in run_classifier_discriminative.py control the layer-wise decreasing learning rate (see Section 5.3.4).
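
For intuition, a hedged sketch of building per-layer parameter groups with a decayed learning rate (the parameter-name pattern assumes the usual BERT naming convention; this is not the repository's exact code):

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.95, num_layers=12):
    """Top layer gets base_lr; each lower layer's rate is multiplied by `decay`."""
    groups = []
    for layer in range(num_layers):
        lr = base_lr * decay ** (num_layers - 1 - layer)
        params = [p for n, p in model.named_parameters()
                  if f"encoder.layer.{layer}." in n]
        groups.append({"params": params, "lr": lr})
    # embeddings, pooler, and classifier are kept at base_lr in this sketch
    rest = [p for n, p in model.named_parameters() if "encoder.layer." not in n]
    groups.append({"params": rest, "lr": base_lr})
    return groups

# optimizer = torch.optim.Adam(layerwise_lr_groups(model), lr=2e-5)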

Further Pre-Trained Checkpoints

We upload the IMDb-based further pre-trained checkpoints here.

For other checkpoints, please contact us by e-mail.

How to cite our paper

@inproceedings{sun2019fine,
  title={How to fine-tune {BERT} for text classification?},
  author={Sun, Chi and Qiu, Xipeng and Xu, Yige and Huang, Xuanjing},
  booktitle={China National Conference on Chinese Computational Linguistics},
  pages={194--206},
  year={2019},
  organization={Springer}
}

bert4doc-classification's People

Contributors

evrys, mhilmiasyrofi, xuyige


bert4doc-classification's Issues

OOM when batchSize=1

Hi, thanks for your great work.
While running run_pretraining.py, I kept getting OOM for any matrix size.
I already reduced the batch size to 1, but it didn't help.
I'm using a 960M GPU, tensorflow-gpu 1.10, and CUDA Toolkit 9.0.
I'm wondering what version of TensorFlow you are using? Any thoughts on this issue?
Thanks in advance.

Further Pre-Training on the IMDB dataset

Dear Yige,
thanks a lot for sharing the code!
I was wondering if you could provide some more details on "further pre-training" on the IMDB dataset, e.g. the hyperparameter settings for it.
Or, is it possible to share the BERT model that was LM pre-trained on the IMDB dataset?

further-pretraining

I got this error when doing further pre-training:

My environment:
Ubuntu 18.04.4 LTS (GNU/Linux 5.4.0-74-generic x86_64)
GPU: 2080 Ti

I use the following command:
python run_pretraining.py \
  --input_file=./tmp/tf_AGnews.tfrecord \
  --output_dir=./uncased_L-12_H-768_A-12_AGnews_pretrain \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=8 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --save_checkpoints_steps=10000 \
  --learning_rate=5e-5

I got the following message and further pre-training does not work.
How can I fix this problem?

WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 62 vs previous value: 62. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W0622 17:33:44.304897 140418054317888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 62 vs previous value: 62. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.

How much time did it take to run the further pre-training step?

@xuyige Time taken

!python run_pretraining.py \
  --input_file=./tmp/tf_examples.tfrecord \
  --output_dir=./tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=5e-5 \
  --use_tpu=False \
  --save_checkpoints_steps=10000

0 instances written while further pre-training on my own dataset

Hey,
When I run create_pretraining_data.py, I see the following message:

INFO:tensorflow:*** Reading from input files ***
I1210 15:59:58.812381 140714487977856 create_pretraining_data.py:419] *** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
I1210 15:59:58.815751 140714487977856 create_pretraining_data.py:430] *** Writing to output files ***
INFO:tensorflow: tmp/tf_AGnews.tfrecord
I1210 15:59:58.815884 140714487977856 create_pretraining_data.py:432] tmp/tf_AGnews.tfrecord
WARNING:tensorflow:From create_pretraining_data.py:97: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

W1210 15:59:58.816398 140714487977856 module_wrapper.py:139] From create_pretraining_data.py:97: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

INFO:tensorflow:Wrote 0 total instances
I1210 15:59:58.819541 140714487977856 create_pretraining_data.py:162] Wrote 0 total instances
Does this mean no data is created? If yes, can you tell me why this is happening?

Thanks in advance.

Dealing with multiple sentences

Hi, sorry to bother you, but I have one question.

Documents have multiple sentences, so how do you deal with that? Do you split the text into sentences and then concatenate the final embeddings for each sentence, or do you remove all punctuation marks so the text won't have any [SEP] tokens?

Questions about discriminative_fine_tuning

In Section 5.4.3: "We find that assign a lower learning rate to the lower layer is effective to fine-tuning BERT, and an appropriate setting is ξ=0.95 and lr=2.0e-5."
Compared with the code in https://github.com/xuyige/BERT4doc-Classification/blob/master/codes/fine-tuning/run_classifier.py#L812
it seems that you divide the BERT layers into 3 parts (4 layers per part) and set a different learning rate for each part.
Some questions about it:

  1. How does the decay factor 0.95 match the number 2.6 in the code?
  2. The last classification layer does not seem to be included; is there no need to set a learning rate for it?
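
A rough sketch of the grouping I mean (the learning-rate values here are just placeholders; the actual relation between 2.6 and 0.95 is what I am asking about):

def three_group_lrs(model, base_lr=2e-5, factor=2.6):
    """Split the 12 transformer layers into 3 groups of 4, each with its own lr."""
    group_lrs = [base_lr / factor ** 2, base_lr / factor, base_lr]  # low, mid, top layers
    groups = []
    for g, lr in enumerate(group_lrs):
        layer_ids = range(g * 4, (g + 1) * 4)
        params = [p for n, p in model.named_parameters()
                  if any(f"encoder.layer.{i}." in n for i in layer_ids)]
        groups.append({"params": params, "lr": lr})
    return groups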

further pre-training

Hi,

I followed your code to further pre-train a BERT model on my own corpus, but I got only checkpoint files without any config or vocab.txt file. Any ideas, please?

Thank you

save_checkpoints_steps doesn't work

The parser option for save_checkpoints_steps doesn't do anything for me.

I'm running:

python3 run_classifier_single_layer.py --task_name imdb --do_train --do_eval --do_lower_case --data_dir ./stock --vocab_file ./uncased_L-12_H-768_A-12/vocab.txt --bert_config_file ./uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint ./uncased_L-12_H-768_A-12/pytorch_model.bin --max_seq_length 512 --train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir ./stock_output --seed 42 --layers 11 10 --trunc_medium -1 --save_checkpoints_steps 1000

Any idea to solve this?

Validation dataset split

Hi,
Thanks so much for sharing the code for this fantastic work!
In the paper you mentioned that "We empirically set the max number of the epoch to 4 and save the best model on the validation set for testing". I am wondering how you created the validation dataset for the classification tasks. Did you split the original training dataset into train/validation? If so, what ratio did you use to split the train/validation data for IMDB, AG's News, etc.?

Thanks so much for your help in advance!

The Embedding Layer in BERT

Hello, the embedding layer in BERT comes before Layer 0. Would it be better to set its learning rate to that of Layer 0 multiplied by the weight ξ (0.95)?

max_sequence_length in create_pretraining_data

Hello, thank you very much; this project is very helpful for my current work. I am working on automatic scoring of student essays: the data is about 450 MB, roughly 930,000 essays. I used the create_pretraining_data script to generate a 17 GB tf.records file with max_sequence_length set to 128. My question is: when generating the pre-training data, should max_sequence_length be the length of the longest essay, or the length of the longest sentence?

How to fine-tune the model on multiple tasks?

Sorry to bother you!
But it seems to me that run_classifier_single_layer.py does not save the model; what should I do to further fine-tune the fine-tuned model?
Thanks!

High perplexity when further pre-training

When doing further pre-training on my own data, the perplexity is very high, for example 709. I have 3,582,619 examples and use batch size = 8, epochs = 3, learning rate = 5e-5. Is there any advice? Thanks a lot!

Generate Further Pre-Training Corpus

Hi,
Thank you for sharing your code. I ran into the following problem when running "python generate_corpus_agnews.py":

Traceback (most recent call last):
File "generate_corpus_agnews.py", line 18, in
f.write(str(test_data[i][1])+"\n")
IndexError: index 1 is out of bounds for axis 0 with size 1

Also, could you provide some guidelines on how I can apply your code to my own dataset?

Question about Further Pre-training

Hi:
I tried to use your code on my own corpus, which consists of many short sentences, to do classification. I want to try some experiments with further pre-training without the NSP task. But in your code in "create_pretraining_data.py", I found that you randomly choose a doc from the dataset and concatenate it to another doc after [SEP] as input, which confuses me a lot. Could you please explain why this is done? Thanks a lot.

For Layer-wise Decreasing Layer Rate

Thanks for your hard work!
I have two questions. First, for the layer-wise decreasing layer rate, did you also use warm-up or polynomial_decay, i.e., are the warm-up rate and the layer-wise decreasing layer rate used simultaneously? Second, for BERT-large, how did you set the learning rate and decay factor, which the paper doesn't give?

Resource exhausted

Hi,

first, thank you for sharing your code with us.

I am trying to further pre-train a BERT model on my own corpus on a Colab GPU, but I am getting a resource exhausted error.
Can someone tell me how to fix this?

Also, what are the expected outputs of this further pre-training?
Are they the BERT TensorFlow files that we can use for fine-tuning (checkpoint, config, and vocab)?

Thank you
