
ATPapers

Worth-reading papers and related resources on the attention mechanism, the Transformer architecture, and pretrained language models (PLMs) such as BERT.

Suggestions for fixing errors or adding papers, repositories, and other resources are welcome!

Since I am Chinese, I mainly focus on Chinese resources; recommendations of excellent resources in English or other languages are also very welcome!


Attention

Papers

  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. (ICML 2015) [paper] - Hard & Soft Attention
  • Effective Approaches to Attention-based Neural Machine Translation. Minh-Thang Luong, Hieu Pham, Christopher D. Manning. (EMNLP 2015) [paper] - Global & Local Attention
  • Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. (ICLR 2015) [paper]
  • Non-local Neural Networks. Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He. (CVPR 2018) [paper][code]
  • Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. Gongbo Tang, Mathias Müller, Annette Rios, Rico Sennrich. (EMNLP 2018) [paper]
  • Phrase-level Self-Attention Networks for Universal Sentence Encoding. Wei Wu, Houfeng Wang, Tianyu Liu, Shuming Ma. (EMNLP 2018) [paper]
  • Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling. Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang. (ICLR 2018) [paper][code] - Bi-BloSAN
  • Efficient Attention: Attention with Linear Complexities. Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li. (CoRR 2018) [paper][code]
  • Leveraging Local and Global Patterns for Self-Attention Networks. Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, Lidia S. Chao. (ACL 2019) [paper] [tf code][pt code]
  • Attention over Heads: A Multi-Hop Attention for Neural Machine Translation. Shohei Iida, Ryuichiro Kimura, Hongyi Cui, Po-Hsuan Hung, Takehito Utsuro, Masaaki Nagata. (ACL 2019) [paper]
  • Are Sixteen Heads Really Better than One?. Paul Michel, Omer Levy, Graham Neubig. (NeurIPS 2019) [paper]
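
As a quick reference, the sketch below shows the core soft-attention computation (Luong-style global dot-product attention) that most of the papers above build on. It assumes PyTorch; shapes, scoring functions, and masking conventions vary from paper to paper.

```python
# A minimal sketch of Luong-style global (soft) attention with a dot-product score.
# Illustrative only; individual papers use different scoring functions and masks.
import torch
import torch.nn.functional as F

def global_attention(query, keys, values):
    """query: (batch, d); keys/values: (batch, src_len, d)."""
    # Dot-product score between the query and every source position.
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)    # (batch, src_len)
    weights = F.softmax(scores, dim=-1)                          # attention distribution
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1) # (batch, d)
    return context, weights

# Example: one decoder state attending over 5 encoder states.
q = torch.randn(2, 16)
enc = torch.randn(2, 5, 16)
ctx, attn = global_attention(q, enc, enc)
print(ctx.shape, attn.shape)  # torch.Size([2, 16]) torch.Size([2, 5])
```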

Survey & Review

  • An Attentive Survey of Attention Models. Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, Varun Mithal. (IJCAI 2019) [paper]

English Blog

Chinese Blog

Repositories

Transformer

Papers

  • Attention is All you Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (NIPS 2017) [paper][code] - Transformer
  • Weighted Transformer Network for Machine Translation. Karim Ahmed, Nitish Shirish Keskar, Richard Socher. (CoRR 2017) [paper][code]
  • Accelerating Neural Transformer via an Average Attention Network. Biao Zhang, Deyi Xiong, Jinsong Su. (ACL 2018) [paper][code] - AAN
  • Self-Attention with Relative Position Representations. Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. (NAACL 2018) [paper] [unofficial code]
  • Universal Transformers. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser. (ICLR 2019) [paper][code]
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, Ruslan Salakhutdinov. (ACL 2019) [paper]
  • Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov. (ACL 2019) [paper]
  • Star-Transformer. Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. (NAACL 2019) [paper]
  • Generating Long Sequences with Sparse Transformers. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. (CoRR 2019) [paper][code]
  • Memory Transformer Networks. Jonas Metzger. (CS224n Winter 2019 Reports) [paper]
  • Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel. Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov. (EMNLP 2019) [paper][code]
  • Transformers without Tears: Improving the Normalization of Self-Attention. Toan Q. Nguyen, Julian Salazar. (IWSLT 2019) [paper][code]
  • TENER: Adapting Transformer Encoder for Named Entity Recognition. Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. (CoRR 2019) [paper]
  • Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun. (CoRR 2019) [paper][code]
  • Compressive Transformers for Long-Range Sequence Modelling. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap. (ICLR 2020) [paper][code]
  • Reformer: The Efficient Transformer. Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. (ICLR 2020) [paper] [code 1][code 2][code 3]
  • On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. (ICML 2020) [paper]
  • Lite Transformer with Long-Short Range Attention. Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han. (ICLR 2020) [paper][code]
  • ReZero is All You Need: Fast Convergence at Large Depth. Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley. (CoRR 2020) [paper] [code] [related Chinese post]
  • Improving Transformer Models by Reordering their Sublayers. Ofir Press, Noah A. Smith, Omer Levy. (ACL 2020) [paper]
  • Highway Transformer: Self-Gating Enhanced Self-Attentive Networks. Yekun Chai, Jin Shuo, Xinwen Hou. (ACL 2020) [paper][code]
  • HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han. (ACL 2020) [paper][code]
  • Longformer: The Long-Document Transformer. Iz Beltagy, Matthew E. Peters, Arman Cohan. (CoRR 2020) [paper][code]
  • Talking-Heads Attention. Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou. (CoRR 2020) [paper]
  • Synthesizer: Rethinking Self-Attention in Transformer Models. Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng. (CoRR 2020) [paper]
  • Linformer: Self-Attention with Linear Complexity. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma. (CoRR 2020) [paper]
  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. (ICML 2020) [paper][code][project]
  • Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. (CoRR 2020) [paper][code]
  • Fast Transformers with Clustered Attention. Apoorv Vyas, Angelos Katharopoulos, François Fleuret. (CoRR 2020) [paper][code]
  • Memory Transformer. Mikhail S. Burtsev, Grigory V. Sapunov. (CoRR 2020) [paper]
  • Multi-Head Attention: Collaborate Instead of Concatenate. Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi. (CoRR 2020) [paper][code]
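
For orientation, here is a minimal sketch of a pre-LN Transformer encoder block, the variant analyzed in "On Layer Normalization in the Transformer Architecture" above. It assumes PyTorch; the hyperparameters are placeholders, not values from any particular paper.

```python
# A minimal pre-LN Transformer encoder block: LayerNorm is applied before each
# sublayer, with residual connections around self-attention and the feed-forward net.
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sublayer with residual connection.
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(h)
        # Position-wise feed-forward sublayer with residual connection.
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

block = PreLNEncoderBlock()
out = block(torch.randn(2, 10, 512))  # (batch, seq_len, d_model)
```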

Chinese Blog

English Blog

Repositories

Pretrained Language Model

Models

  • Deep Contextualized Word Representations (NAACL 2018) [paper] - ELMo
  • Universal Language Model Fine-tuning for Text Classification (ACL 2018) [paper] - ULMFiT
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019) [paper][code][official PyTorch code] - BERT
  • Improving Language Understanding by Generative Pre-Training (CoRR 2018) [paper] - GPT
  • Language Models are Unsupervised Multitask Learners (CoRR 2019) [paper][code] - GPT-2
  • MASS: Masked Sequence to Sequence Pre-training for Language Generation (ICML 2019) [paper][code] - MASS
  • Unified Language Model Pre-training for Natural Language Understanding and Generation (CoRR 2019) [paper][code] - UNILM
  • Multi-Task Deep Neural Networks for Natural Language Understanding (ACL 2019) [paper][code] - MT-DNN
  • 75 Languages, 1 Model: Parsing Universal Dependencies Universally (EMNLP 2019) [paper][code] - UDify
  • ERNIE: Enhanced Language Representation with Informative Entities (ACL 2019) [paper][code] - ERNIE (THU)
  • ERNIE: Enhanced Representation through Knowledge Integration (CoRR 2019) [paper] - ERNIE (Baidu)
  • Defending Against Neural Fake News (CoRR 2019) [paper][code] - Grover
  • ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (CoRR 2019) [paper] - ERNIE 2.0 (Baidu)
  • Pre-Training with Whole Word Masking for Chinese BERT (CoRR 2019) [paper] - Chinese-BERT-wwm
  • SpanBERT: Improving Pre-training by Representing and Predicting Spans (CoRR 2019) [paper] - SpanBERT
  • XLNet: Generalized Autoregressive Pretraining for Language Understanding (CoRR 2019) [paper][code] - XLNet
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (CoRR 2019) [paper] - RoBERTa
  • NEZHA: Neural Contextualized Representation for Chinese Language Understanding (CoRR 2019) [paper][code] - NEZHA
  • K-BERT: Enabling Language Representation with Knowledge Graph (AAAI 2020) [paper][code] - K-BERT
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (CoRR 2019) [paper][code] - Megatron-LM
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (CoRR 2019) [paper][code] - T5
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (CoRR 2019) [paper] - BART
  • ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations (CoRR 2019) [paper][code] - ZEN
  • The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service (CoRR 2019) [paper][code] - BAAI-JDAI-BERT
  • Knowledge Enhanced Contextual Word Representations (EMNLP 2019) [paper] - KnowBert
  • UER: An Open-Source Toolkit for Pre-training Models (EMNLP 2019) [paper][code] - UER
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (ICLR 2020) [paper] - ELECTRA
  • StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding (ICLR 2020) [paper] - StructBERT
  • FreeLB: Enhanced Adversarial Training for Language Understanding (ICLR 2020) [paper][code] - FreeLB
  • HUBERT Untangles BERT to Improve Transfer across NLP Tasks (CoRR 2019) [paper] - HUBERT
  • CodeBERT: A Pre-Trained Model for Programming and Natural Languages (CoRR 2020) [paper] - CodeBERT
  • ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training (CoRR 2020) [paper] - ProphetNet
  • ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation (CoRR 2020) [paper][code] - ERNIE-GEN
  • Efficient Training of BERT by Progressively Stacking (ICML 2019) [paper][code] - StackingBERT
  • PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination (CoRR 2020) [paper][code]
  • Towards a Human-like Open-Domain Chatbot (CoRR 2020) [paper] - Meena
  • UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training (CoRR 2020) [paper][code] - UNILMv2
  • Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space (CoRR 2020) [paper][code] - Optimus
  • SegaBERT: Pre-training of Segment-aware BERT for Language Understanding. He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Ming Li. (CoRR 2020) [paper]
  • MPNet: Masked and Permuted Pre-training for Language Understanding (CoRR 2020) [paper][code] - MPNet
  • Language Models are Few-Shot Learners (CoRR 2020) [paper][code] - GPT-3
  • SPECTER: Document-level Representation Learning using Citation-informed Transformers (ACL 2020) [paper] - SPECTER
  • Recipes for building an open-domain chatbot (CoRR 2020) [paper][post][code] - Blender
  • PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning (CoRR 2020) [paper][code] - PLATO-2
  • DeBERTa: Decoding-enhanced BERT with Disentangled Attention (CoRR 2020) [paper][code] - DeBERTa
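
Many of the models above can be loaded through the Hugging Face Transformers library (listed under Analysis & Tools below). A minimal usage sketch, assuming transformers and PyTorch are installed; the checkpoint name is just one example to swap for the model you need:

```python
# Load a pretrained checkpoint and run a forward pass to get contextual embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Attention is all you need.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```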

Multi-Modal

  • VideoBERT: A Joint Model for Video and Language Representation Learning (ICCV 2019) [paper]
  • Learning Video Representations using Contrastive Bidirectional Transformer (CoRR 2019) [paper] - CBT
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (NeurIPS 2019) [paper][code]
  • VisualBERT: A Simple and Performant Baseline for Vision and Language (CoRR 2019) [paper][code]
  • Fusion of Detected Objects in Text for Visual Question Answering (EMNLP 2019) [paper][code] - B2T2
  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training (AAAI 2020) [paper]
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP 2019) [paper][code]
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations (CoRR 2019) [paper][code]
  • UNITER: Learning UNiversal Image-TExt Representations (CoRR 2019) [paper]
  • FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval (SIGIR 2020) [paper] - FashionBERT
  • VD-BERT: A Unified Vision and Dialog Transformer with BERT (CoRR 2020) [paper] - VD-BERT

Multilingual

  • Cross-lingual Language Model Pretraining (CoRR 2019) [paper] - XLM
  • MultiFiT: Efficient Multi-lingual Language Model Fine-tuning (EMNLP 2019) [paper][code] - MultiFiT
  • XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization (CoRR 2020) [paper][code] - XTREME
  • Pre-training via Paraphrasing (CoRR 2020) [paper] - MARGE
  • WikiBERT Models: Deep Transfer Learning for Many Languages (CoRR 2020) [paper][code] - WikiBERT
  • Language-agnostic BERT Sentence Embedding (CoRR 2020) [paper] - LaBSE

Compression & Accelerating

  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (CoRR 2019) [paper]
  • Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System (CoRR 2019) [paper] - MKDM
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (CoRR 2019) [paper]
  • Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (CoRR 2019) [paper]
  • Small and Practical BERT Models for Sequence Labeling (EMNLP 2019) [paper]
  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT (CoRR 2019) [paper] - Q-BERT
  • Patient Knowledge Distillation for BERT Model Compression (EMNLP 2019) [paper] - BERT-PKD
  • Extreme Language Model Compression with Optimal Subwords and Shared Projections (CoRR 2019) [paper]
  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter [paper][code] - DistilBERT
  • TinyBERT: Distilling BERT for Natural Language Understanding (CoRR 2019) [paper][code] - TinyBERT
  • Q8BERT: Quantized 8Bit BERT (NeurIPS 2019 Workshop) [paper] - Q8BERT
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR 2020) [paper][code] - ALBERT
  • Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning (ICLR 2020) [paper][PyTorch code]
  • Reducing Transformer Depth on Demand with Structured Dropout (ICLR 2020) [paper] - LayerDrop
  • Multilingual Alignment of Contextual Word Representations (ICLR 2020) [paper]
  • AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search (CoRR 2020) [paper] - AdaBERT
  • BERT-of-Theseus: Compressing BERT by Progressive Module Replacing. Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou. (CoRR 2020) [paper][pt code][tf code][keras code]
  • MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (CoRR 2020) [paper][code] - MiniLM
  • FastBERT: a Self-distilling BERT with Adaptive Inference Time (ACL 2020) [paper][code] - FastBERT
  • MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (ACL 2020) [paper][code] - MobileBERT
  • DynaBERT: Dynamic BERT with Adaptive Width and Depth (CoRR 2020) [paper] - DynaBERT
  • SqueezeBERT: What can computer vision teach NLP about efficient neural networks?. Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. (CoRR 2020) [paper]
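
Most of the distillation papers above (DistilBERT, BERT-PKD, TinyBERT, etc.) train a student against temperature-softened teacher outputs in some form. Below is a minimal sketch of such a soft-target loss; the temperature and loss weighting are illustrative defaults, not values taken from any one paper.

```python
# Soft-target knowledge distillation: KL divergence to the teacher's softened
# distribution plus standard cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 3)           # student logits
t = torch.randn(4, 3)           # teacher logits
y = torch.tensor([0, 2, 1, 0])  # gold labels
print(distillation_loss(s, t, y).item())
```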

Application

  • BERT for Joint Intent Classification and Slot Filling (CoRR 2019) [paper]
  • GPT-based Generation for Classical Chinese Poetry (CoRR 2019) [paper]
  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) [paper][code]
  • Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring (ICLR 2020) [paper]
  • Pre-training Tasks for Embedding-based Large-scale Retrieval (ICLR 2020) [paper]
  • K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters (CoRR 2020) [paper] - K-Adapter
  • Keyword-Attentive Deep Semantic Matching (CoRR 2020) [paper & code] [post] - Keyword BERT
  • Unified Multi-Criteria Chinese Word Segmentation with BERT (CoRR 2020) [paper]
  • ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues (CoRR 2020) [paper][code]
  • Spelling Error Correction with Soft-Masked BERT (ACL 2020) [paper] - Soft-Masked BERT
  • DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering (ACL 2020) [paper][code] - DeFormer
  • BLEURT: Learning Robust Metrics for Text Generation (ACL 2020) [paper][code] - BLEURT
  • Context-Aware Document Term Weighting for Ad-Hoc Search (WWW 2020) [paper][code] - HDCT
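
As an illustration of the Sentence-BERT idea above, the sketch below mean-pools BERT token vectors into sentence embeddings and compares them by cosine similarity. A real SBERT model is additionally fine-tuned with a siamese objective, which this sketch omits; it assumes the Hugging Face transformers library.

```python
# Mean-pooled sentence embeddings from a pretrained encoder, compared by cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding positions
    return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling

a, b = embed(["A man is playing guitar.", "Someone plays an instrument."])
print(torch.cosine_similarity(a, b, dim=0).item())
```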

Analysis & Tools

  • Probing Neural Network Comprehension of Natural Language Arguments (ACL 2019) [paper][code]
  • Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (ACL 2019) [paper] [code]
  • To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks (RepL4NLP@ACL 2019) [paper]
  • Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection (CICLing 2019) [paper]
  • Understanding the Behaviors of BERT in Ranking (CoRR 2019) [paper]
  • How to Fine-Tune BERT for Text Classification? (CoRR 2019) [paper]
  • What Does BERT Look At? An Analysis of BERT's Attention (BlackBoxNLP 2019) [paper][code]
  • Visualizing and Understanding the Effectiveness of BERT (EMNLP 2019) [paper]
  • exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models (CoRR 2019) [paper] [code]
  • Transformers: State-of-the-art Natural Language Processing [paper][code 1][code 2]
  • Do Attention Heads in BERT Track Syntactic Dependencies? [paper]
  • Fine-tune BERT with Sparse Self-Attention Mechanism (EMNLP 2019) [paper]
  • How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings (EMNLP 2019) [paper]
  • oLMpics -- On what Language Model Pre-training Captures (CoRR 2019) [paper]
  • Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment (AAAI 2020) [paper][code] - TextFooler
  • A Mutual Information Maximization Perspective of Language Representation Learning (ICLR 2020) [paper]
  • Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping (CoRR 2020) [paper]
  • How Much Knowledge Can You Pack Into the Parameters of a Language Model? (CoRR 2020) [paper]
  • A Primer in BERTology: What we know about how BERT works. Anna Rogers, Olga Kovaleva, Anna Rumshisky. (CoRR 2020) [paper]
  • BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations (CoRR 2020) [paper]
  • Contextual Embeddings: When Are They Worth It? (ACL 2020) [paper]
  • Weight Poisoning Attacks on Pre-trained Models (ACL 2020) [paper][code] - RIPPLe
  • Roles and Utilization of Attention Heads in Transformer-based Neural Language Models (ACL 2020) [paper][code] - Transformer Anatomy
  • Adversarial Training for Large Neural Language Models (CoRR 2020) [paper][code]
  • Cross-Lingual Ability of Multilingual BERT: An Empirical Study (ICLR 2020) [paper][code]
  • DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference (ACL 2020) [paper][code][huggingface implementation]
  • Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. (ACL 2020 Best Paper) [paper][code]
  • Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith. (ACL 2020) [paper][code]
  • TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing. Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, Guoping Hu. (ACL 2020) [paper][code]
  • Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT. Zhiyong Wu, Yun Chen, Ben Kao, Qun Liu. (ACL 2020) [paper][pt code][keras code]
  • Rethinking Positional Encoding in Language Pre-training. Guolin Ke, Di He, Tie-Yan Liu. (CoRR 2020) [paper][code] - TUPE
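
For head-level analyses such as "What Does BERT Look At?" above, per-head attention maps can be pulled directly out of a pretrained model. A minimal sketch assuming the Hugging Face transformers library; the layer and head indices are arbitrary examples:

```python
# Extract per-layer, per-head attention maps from a pretrained BERT for inspection.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The keys to the cabinet are on the table.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
layer, head = 8, 10
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(attn.shape)  # (seq_len, seq_len) attention map for one head
```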

Tutorial & Survey

  • Transfer Learning in Natural Language Processing. Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, Thomas Wolf. (NAACL 2019) [paper]
  • Evolution of Transfer Learning in Natural Language Processing. Aditya Malte, Pratik Ratadiya. (CoRR 2019) [paper]
  • Transferring NLP Models Across Languages and Domains. Barbara Plank. (DeepLo 2019) [slides]
  • Recent Breakthroughs in Natural Language Processing. Christopher Manning. (BAAI 2019) [slides]
  • Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang. (Invited Review of Science China Technological Sciences 2020) [paper]
  • Embeddings in Natural Language Processing. Mohammad Taher Pilehvar, Jose Camacho-Collados. (2020) [book]

Repository

Chinese Blog

English Blog
