
convbert's Introduction

ConvBERT

Introduction

In this repo, we introduce ConvBERT, a new architecture for pre-training language models. The code is tested on a V100 GPU. For a detailed description and experimental results, please refer to our NeurIPS 2020 paper ConvBERT: Improving BERT with Span-based Dynamic Convolution.

Requirements

  • Python 3
  • tensorflow 1.15
  • numpy
  • scikit-learn

Experiments

Pre-training

These instructions pre-train a medium-small sized ConvBERT model (17M parameters) using the OpenWebText corpus.

To build the tf-records and pre-train the model, download the OpenWebText corpus (12G) and set up your data directory in build_data.sh and pretrain.sh. Then run

bash build_data.sh

The processed data require roughly 30G of disk space. Then, to pre-train the model, run

bash pretrain.sh

See configure_pretraining.py for the details of the supported hyperparameters.
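For orientation, here is a hedged sketch of the kind of hyperparameter overrides such ELECTRA-style pre-training setups typically expose; the names and values below are illustrative assumptions on my part, not the repo's actual defaults, so check configure_pretraining.py and pretrain.sh for the real ones.

import json

# Illustrative overrides only; the authoritative names live in configure_pretraining.py.
hparam_overrides = {
    "model_size": "medium-small",  # assumed name for the 17M-parameter configuration
    "train_batch_size": 128,       # illustrative value
    "learning_rate": 2e-4,         # default value mentioned in the issues below
    "max_seq_length": 128,         # illustrative value
    "num_train_steps": 1000000,    # illustrative value
}

# ELECTRA-style pre-training scripts usually accept such overrides as a JSON string;
# check how pretrain.sh invokes the pre-training script for the exact wiring.
print(json.dumps(hparam_overrides))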

Fine-tuning

Here we give instructions for fine-tuning a pre-trained medium-small sized ConvBERT model (17M parameters) on GLUE. You can refer to the Google Colab notebook for a quick example. See our paper for more details on model performance. The pre-trained model can be found here. (You can also download it from Baidu cloud with extraction code m9d2.)

To evaluate the performance on GLUE, you can download the GLUE data by running

python3 download_glue_data.py

Set up the data by running

mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data

After preparing the GLUE data, set up your data directory in finetune.sh and run

bash finetune.sh

You can evaluate different tasks by changing the configs in finetune.sh.

If you find this repo helpful, please consider citing

@inproceedings{NEURIPS2020_96da2f59,
 author = {Jiang, Zi-Hang and Yu, Weihao and Zhou, Daquan and Chen, Yunpeng and Feng, Jiashi and Yan, Shuicheng},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
 pages = {12837--12848},
 publisher = {Curran Associates, Inc.},
 title = {ConvBERT: Improving BERT with Span-based Dynamic Convolution},
 url = {https://proceedings.neurips.cc/paper/2020/file/96da2f590cd7246bbde0051047b0d6f7-Paper.pdf},
 volume = {33},
 year = {2020}
}

References

Here are some great resources we benefit from:

Codebase: Our codebase is based on ELECTRA.

Dynamic convolution: Implementation from Pay Less Attention with Lightweight and Dynamic Convolutions

Dataset: OpenWebText from Language Models are Unsupervised Multitask Learners

convbert's People

Contributors

philipmay, zhoudaquan, zihangjiang


convbert's Issues

Are the released pre-trained models Chinese or English, and what data were they trained on? Could you share some details?

I downloaded the pre-trained models convbert_base, convbert_medium and convbert_small from the README. None of the three model folders contains a vocabulary file. Based on the vocab.txt in this project (30522 entries), I assume these are English pre-trained models; is that correct? (Judging from ELECTRA, the English vocabulary has 30522 entries while the Chinese one has 21128.) Thanks for your answer.
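(As a quick illustrative check, not something shipped with the repo: when a vocab.txt is available, counting its entries tells the two vocabularies apart.)

# Count WordPiece entries: 30522 is the standard English (uncased) BERT/ELECTRA
# vocabulary size, 21128 the Chinese one.
with open("vocab.txt", encoding="utf-8") as f:
    print(sum(1 for _ in f))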

UnboundLocalError: local variable 'seq_length' referenced before assignment

Hi, I am using the ConvBertForTokenClassification model in transformers and encountered a bug when passing only inputs_embeds to forward().
The traceback points to line 833 in modeling_convbert.py:

if token_type_ids is None:
    if hasattr(self.embeddings, "token_type_ids"):
        buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]

Here seq_length is unassigned.

I noticed that just above this piece of code, in

elif input_ids is not None:
    input_shape = input_ids.size()
    batch_size, seq_length = input_shape
elif inputs_embeds is not None:
    input_shape = inputs_embeds.size()[:-1]

seq_length is not assigned if the program enters the elif inputs_embeds is not None branch.

I am not sure whether a batch_size, seq_length = input_shape assignment is simply missing for the inputs_embeds branch, or whether I am using the model incorrectly.
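For what it is worth, a minimal sketch of the likely fix (my assumption, not an official patch) is to derive seq_length from inputs_embeds the same way the input_ids branch does:

import torch

def resolve_shape(input_ids=None, inputs_embeds=None):
    # Mirror the transformers-style shape handling so seq_length is defined
    # on both code paths.
    if input_ids is not None:
        input_shape = input_ids.size()
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.size()[:-1]  # drop the hidden dimension
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")
    batch_size, seq_length = input_shape
    return batch_size, seq_length

# Embeddings-only input of shape (batch, seq_len, hidden):
print(resolve_shape(inputs_embeds=torch.zeros(2, 16, 256)))  # (2, 16)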

Inference performance

Hi,
are there any inference performance numbers available?
For example, for bert_base, bert_tiny, conv_bert, conv_bert_small?

Question about LConv

[screenshot of the relevant figure from the paper]
In the paper's description, this part is LConv. I am a bit confused by it and would appreciate an explanation. Thanks.

Question about the span-based lightweight convolution

Hello, I would like to ask: in the span-based lightweight convolution, tf.layers.separable_conv1d already produces the span-aware matrix key_conv_attn_layer, so why is the element-wise product with query_layer still needed, i.e. conv_attn_layer = tf.multiply(key_conv_attn_layer, query_layer)? The multiplication does not seem strictly necessary to me.
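To make the question concrete, here is a simplified PyTorch sketch of the span-based dynamic convolution as I read the paper (my own re-implementation under simplifying assumptions, not the repo's TF code). As I understand it, the element-wise product of the query with the span-aware keys is what makes the generated convolution kernel depend on the current query token as well as its local span; without it the kernel would depend only on the span.

import torch
import torch.nn.functional as F

def span_dynamic_conv(query, key_span, value, kernel_gen, kernel_size=9):
    # query, key_span, value: (batch, seq_len, dim). key_span plays the role of
    # key_conv_attn_layer, i.e. the output of the depthwise-separable conv over keys.
    conv_attn = query * key_span                        # the tf.multiply step in question
    kernels = F.softmax(kernel_gen(conv_attn), dim=-1)  # per-position kernels: (batch, seq_len, kernel_size)
    pad = kernel_size // 2
    v = F.pad(value, (0, 0, pad, pad))                  # pad along the sequence axis
    windows = v.unfold(1, kernel_size, 1).permute(0, 1, 3, 2)  # (batch, seq_len, kernel_size, dim)
    return torch.einsum("bsk,bskd->bsd", kernels, windows)

# Illustrative usage with random tensors and a single shared kernel generator.
dim, k = 64, 9
kernel_gen = torch.nn.Linear(dim, k)
q = ks = v = torch.randn(2, 16, dim)
print(span_dynamic_conv(q, ks, v, kernel_gen, k).shape)  # torch.Size([2, 16, 64])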

Questions about pre-training

I would like to ask: in practice, how do you decide how many pre-training steps are enough? Also, roughly what value should the training loss reach? Mine keeps hovering around 9-11; is something wrong?

Question about the inference speed of mixed attention

The paper reports FLOPs of 26.5G and 19.3G; how were these numbers obtained? When I profile the 12-layer medium-small model myself, the encoder as a whole comes to roughly 1 GFLOPs. Also, under what conditions was the reported inference speed measured?
On my side the inference is actually slower than the original self-attention. My guess is that although there are fewer floating-point operations, more time is spent on data movement (reshape, transpose).
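As a rough cross-check, one way to count FLOPs in TensorFlow 1.15 is the built-in profiler. The sketch below is only a suggestion of mine, with a stand-in graph; the paper's numbers may have been computed differently, e.g. analytically or for a different sequence length.

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    # Stand-in graph; replace this with the actual encoder graph to profile it.
    x = tf.placeholder(tf.float32, [1, 128, 512])
    y = tf.layers.dense(x, 512)

opts = tf.profiler.ProfileOptionBuilder.float_operation()
flops = tf.profiler.profile(g, options=opts)
print("total float ops:", flops.total_float_ops)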

Training on multiple GPUs for BASE or LARGE Models

Hi,

Here, in #16 (comment), you say

Our code is only tested on a single V100 GPU.

But in your paper you write about BASE-size ConvBERT models.

BASE-size models cannot be trained (created) on a single GPU, though. From my experience you need 8 GPUs.

Could you please explain this? I would like to create a new German BASE or maybe even LARGE language model.

In #16 (comment) you say that Hugging Face might be an option for multi-GPU training. From my experience they are good at downstream-task training but not at the initial language model creation.

I would be very happy about some help with creating a new ConvBERT BASE or larger model in different languages.

Many thanks
Philip

NaN loss when pre-training on my own data

Hello, and thanks for open-sourcing this. When I pre-train on my own data with the base model and the default learning rate of 2e-4, the loss becomes NaN right at the start of training. After switching to the medium-small model, with a learning rate of either 2e-4 or 2e-5, training still exits with a NaN loss after a few thousand steps. Could you advise how to resolve this?

Train on GPU instead of TPU - different distribution strategies

Hi,
many thanks for this nice new model type and your research.
We would like to train a ConvBERT on GPU rather than TPU.
Do you have any experience or tips on how to do this?
We have concerns regarding the different distribution strategies
between GPUs and TPUs.

Thanks
Philip
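A minimal sketch of one possible route (my assumption, not something the repo ships): in TensorFlow 1.15, replace the TPUEstimator path with a plain Estimator whose RunConfig carries a MirroredStrategy for local data-parallel multi-GPU training.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # data-parallel across all local GPUs
run_config = tf.estimator.RunConfig(
    model_dir="/tmp/convbert_gpu",           # hypothetical output directory
    train_distribute=strategy,
    save_checkpoints_steps=1000,
)
# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=input_fn, max_steps=num_train_steps)
# model_fn and input_fn would still have to be adapted from the repo's TPUEstimator code.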
