Awesome Pretrained Chinese NLP Models

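In natural language processing, pretrained language models have become a fundamental technology. This repository collects high-quality pretrained Chinese models that are publicly available online (thanks to everyone who shared these resources) and will be updated continuously.

🤗 Hugging Face model download locations: 1. Tsinghua University open source mirror 2. Official site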

NLU Series

BERT

  • 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, et al. | arXiv | PDF
  • 2019 | Pre-Training with Whole Word Masking for Chinese BERT | Yiming Cui, et al. | arXiv | PDF
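
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| BERT-Base | 110M | Chinese Wikipedia (0.4B words) | Google Drive | Google Research | bert | General | |
| BERT-wwm | 110M | Chinese Wikipedia (0.4B words) | Google Drive, iFLYTEK Cloud (password: 07Xj), Google Drive | Yiming Cui | Chinese-BERT-wwm | General | |
| BERT-wwm-ext | 110M | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: 4cMG), Google Drive | Yiming Cui | Chinese-BERT-wwm | General | |
| bert-base-民事 (civil) | | 26.54M civil legal documents | Alibaba Cloud | THUNLP | OpenCLaP | Legal | |
| bert-base-刑事 (criminal) | | 6.63M criminal legal documents | Alibaba Cloud | THUNLP | OpenCLaP | Legal | |
| BAAI-JDAI-BERT | | 42 GB of e-commerce customer service dialogue data (9B words) | JD Cloud | JDAI | pretrained_models_and_embeddings | E-commerce customer service dialogue | |
| FinBERT | | 4M financial-domain records (3B words) | Google Drive, Baidu Netdisk (password: 1cmp), Google Drive, Baidu Netdisk (password: 986f) | Value Simplex | FinBERT | FinTech | |
| EduBERT | | 20M education-domain records (380M words) | 好未来AI (TAL AI), tal-tech | tal-tech | edu-bert | Education | |
| WoBERT | | 30 (unit not stated) of general corpus + medical terminology dictionary | Baidu Netdisk (password: kim2) | natureLanguageQing | Medical_WoBERT | Medical | |
| MC-BERT | | | Google Drive | Alibaba AI Research | ChineseBLUE | Medical | |
| guwenbert-base | | Classical Chinese literature corpus (1.7B words) | Baidu Netdisk (password: 4jng), huggingface | Ethan | guwenbert | Classical Chinese | |
| guwenbert-large | | Classical Chinese literature corpus (1.7B words) | Baidu Netdisk (password: m5sz), huggingface | Ethan | guwenbert | Classical Chinese | |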

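Notes:

[1] wwm stands for **Whole Word Masking**: if any WordPiece subword of a word is masked, all other subwords belonging to that word are masked as well.

[2] ext indicates that the model was trained on additional data.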
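
The idea behind note [1] can be illustrated with a minimal sketch. It assumes the WordPiece convention in which a `##` prefix marks a continuation of the previous word; the function name `whole_word_mask` and its parameters are hypothetical. It is a simplification, not the code used to train the models listed here: real pretraining pipelines also skip special tokens and sometimes substitute random or unchanged tokens instead of `[MASK]`.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask complete words: a word is either fully masked or left untouched."""
    rng = random.Random(seed)

    # Group token positions into words; '##' pieces join the preceding word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Visit words in random order until the masking budget is used up.
    rng.shuffle(words)
    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered >= budget:
            break
        for i in word:
            masked[i] = mask_token
        covered += len(word)
    return masked


tokens = ["the", "phil", "##am", "##mon", "sang"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=1))
```

Because masking is decided per word rather than per token, the pieces `phil`, `##am`, `##mon` are always masked together or not at all. For Chinese text, where WordPiece tokens are usually single characters, word boundaries would have to come from a word segmenter instead of the `##` prefix.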

RoBERTa

  • 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Yinhan Liu, et al. | arXiv | PDF
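
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| RoBERTa-tiny-clue | 7.5M | General corpus, 100 GB | Google Drive, Baidu Netdisk (password: 8qvb) | CLUE | CLUEPretrainedModels | General | |
| RoBERTa-tiny-pair | 7.5M | General corpus, 100 GB | Google Drive, Baidu Netdisk (password: 8qvb) | CLUE | CLUEPretrainedModels | General | |
| RoBERTa-tiny3L768-clue | 38M | General corpus, 100 GB | Google Drive | CLUE | CLUEPretrainedModels | General | |
| RoBERTa-tiny3L312-clue | <7.5M | General corpus, 100 GB | Google Drive, Baidu Netdisk (password: 8qvb) | CLUE | CLUEPretrainedModels | General | |
| RoBERTa-large-pair | 290M | General corpus, 100 GB | Google Drive, Baidu Netdisk (password: 8qvb) | CLUE | CLUEPretrainedModels | General | |
| RoBERTa-large-clue | 290M | General corpus, 100 GB | Google Drive, Baidu Netdisk (password: 8qvb) | CLUE | CLUEPretrainedModels | General | |
| RBTL3 | | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: vySW), Google Drive | Yiming Cui | Chinese-BERT-wwm | General | |
| RBTL4 | | General corpus (5.4B words) | iFLYTEK Cloud (password: e8dN) | Yiming Cui | Chinese-BERT-wwm | General | |
| RBTL6 | | General corpus (5.4B words) | iFLYTEK Cloud (password: XNMA) | Yiming Cui | Chinese-BERT-wwm | General | |
| RoBERTa-wwm-ext | | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: Xe1p), Google Drive | Yiming Cui | Chinese-BERT-wwm | General | |
| RoBERTa-wwm-ext-large | | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: u6gC), Google Drive | Yiming Cui | Chinese-BERT-wwm | General | |
| RoBERTa-base | | General corpus, 30 GB | Google Drive, Baidu Netdisk, Google Drive, Baidu Netdisk | brightmart | roberta_zh | General | |
| RoBERTa-Large | | General corpus, 30 GB | Google Drive, Baidu Netdisk, Google Drive | brightmart | roberta_zh | General | |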
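
A minimal sketch of how one of the checkpoints above might be loaded with the Hugging Face `transformers` library. The Hub IDs in `HUB_IDS` are assumptions (this list does not give them), and the helper name `load_encoder` is hypothetical. The Chinese-BERT-wwm project documents that its RoBERTa-wwm-ext checkpoints are stored in BERT format, so the sketch uses `BertTokenizer` and `BertModel` rather than the RoBERTa classes.

```python
# Assumed Hugging Face Hub IDs; verify them before relying on this mapping.
HUB_IDS = {
    "BERT-wwm-ext": "hfl/chinese-bert-wwm-ext",
    "RoBERTa-wwm-ext": "hfl/chinese-roberta-wwm-ext",
    "RoBERTa-wwm-ext-large": "hfl/chinese-roberta-wwm-ext-large",
}


def load_encoder(name):
    """Download and return (tokenizer, model) for a name in HUB_IDS.

    Requires `pip install transformers torch` and network access.
    """
    # Imported lazily so the mapping can be inspected without the library.
    from transformers import BertModel, BertTokenizer

    hub_id = HUB_IDS[name]
    tokenizer = BertTokenizer.from_pretrained(hub_id)
    model = BertModel.from_pretrained(hub_id)
    return tokenizer, model


if __name__ == "__main__":
    for name, hub_id in HUB_IDS.items():
        print(f"{name} -> {hub_id}")
```

Calling `load_encoder("RoBERTa-wwm-ext")` would then download the weights on first use and return objects that can encode Chinese text in the usual `transformers` way.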

ALBERT

  • 2019 | ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations | Zhenzhong Lan, et al. | arXiv | PDF
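
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| Albert_base_zh | 12M | General corpus, 30 GB | Google Drive, Google Drive | brightmart | albert_zh | General | |
| Albert_large_zh | | General corpus, 30 GB | Google Drive, Google Drive | brightmart | albert_zh | General | |
| Albert_xlarge_zh | | General corpus, 30 GB | Google Drive, Google Drive | brightmart | albert_zh | General | |
| Albert_base | | General corpus, 30 GB | Google Drive | Google Research | ALBERT | General | |
| Albert_large | | General corpus, 30 GB | Google Drive | Google Research | ALBERT | General | |
| Albert_xlarge | | General corpus, 30 GB | Google Drive | Google Research | ALBERT | General | |
| Albert_xxlarge | | General corpus, 30 GB | Google Drive | Google Research | ALBERT | General | |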

NEZHA

  • 2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv | PDF
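
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| NEZHA-base | | | Google Drive, Baidu Netdisk (password: ntn3), lonePatient | HUAWEI Noah's Ark Lab | link | General | |
| NEZHA-base-WWM | | | Google Drive, Baidu Netdisk (password: f68o), lonePatient | HUAWEI Noah's Ark Lab | link | General | |
| NEZHA-large | | | Google Drive, Baidu Netdisk (password: 7thu), lonePatient | HUAWEI Noah's Ark Lab | link | General | |
| NEZHA-large-WWM | | | Google Drive, Baidu Netdisk (password: ni4o), lonePatient | HUAWEI Noah's Ark Lab | link | General | |
| NEZHA-Gen | | | Google Drive, Baidu Netdisk (password: ytim) | HUAWEI Noah's Ark Lab | link | General | |
| NEZHA-Gen | | | Google Drive, Baidu Netdisk (password: rb5m) | HUAWEI Noah's Ark Lab | link | | |
| WoNEZHA | | 30 (unit not stated) of general corpus + medical terminology dictionary | Baidu Netdisk (password: qgkq) | natureLanguageQing | link | Medical | |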

MacBERT

  • 2020 | Revisiting Pre-Trained Models for Chinese Natural Language Processing | Yiming Cui, et al. | arXiv | PDF
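
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| MacBERT-base | 102M | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: E2cP) | Yiming Cui | link | General | |
| MacBERT-large | 324M | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: 3Yg3) | Yiming Cui | link | General | |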
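
The 102M listed for MacBERT-base can be reproduced with simple arithmetic. The sketch below assumes the standard BERT-base layout (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types) and a 21128-token Chinese vocabulary. These hyperparameters and the vocabulary size are assumptions that this list does not state, and the count excludes the pretraining heads. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count the weights and biases of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up and down projection), followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(bert_param_count(21128))  # 102267648, roughly 102M
print(bert_param_count(30522))  # 109482240, roughly 110M
```

Under these assumptions only the embedding matrix depends on the vocabulary, which accounts for the gap between the two results: the 30522-token vocabulary of the original English BERT gives about 110M, while a 21128-token Chinese vocabulary gives about 102M.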

XLNET

  • 2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Zhilin Yang, et al. | arXiv | PDF
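
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| XLNet-base | 117M | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: uCpe), Google Drive | Yiming Cui | link | General | |
| XLNet-mid | 209M | General corpus (5.4B words) | Google Drive, iFLYTEK Cloud (password: 68En), Google Drive | Yiming Cui | link | General | |
| XLNet_zh_Large | | | Baidu Netdisk | brightmart | link | General | |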

ELECTRA

  • 2020 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | Kevin Clark, et al. | arXiv | PDF
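
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| ELECTRA-180g-large | | | Google Drive, iFLYTEK Cloud (password: Yfcy) | Yiming Cui | link | General | |
| ELECTRA-180g-small-ex | | | Google Drive, iFLYTEK Cloud (password: GUdp) | Yiming Cui | link | General | |
| ELECTRA-180g-base | | | Google Drive, iFLYTEK Cloud (password: Xcvm) | Yiming Cui | link | General | |
| ELECTRA-180g-small | | | Google Drive, iFLYTEK Cloud (password: qsHj) | Yiming Cui | link | General | |
| legal-ELECTRA-large | | | Google Drive, iFLYTEK Cloud (password: 7f7b) | Yiming Cui | link | Legal | |
| legal-ELECTRA-base | | | Google Drive, iFLYTEK Cloud (password: 7f7b) | Yiming Cui | link | Legal | |
| legal-ELECTRA-small | | | Google Drive, iFLYTEK Cloud (password: 7f7b) | Yiming Cui | link | Legal | |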

ZEN

  • 2019 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | Shizhe Diao, et al. | arXiv | PDF
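
| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| ZEN-Base | | | Google Drive, Baidu Netdisk | Sinovation Ventures AI Institute | link | General | |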

ERNIE

  • 2019 | ERNIE: Enhanced Representation through Knowledge Integration | Yu Sun, et al. | arXiv | PDF

  • 2020 | SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis | Hao Tian, et al. | arXiv | PDF

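| Model | Parameters | Corpus | Download links (PaddlePaddle / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| ernie-1.0-base | | | link, [nghuyong cloud drive](http://pan.nghuyong.top/#/s/y7Uz) | PaddlePaddle | link | General | |
| ernie_1.0_skep_large_ch | | | link | Baidu | link | Sentiment analysis | |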

NLG Series

GPT

  • 2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | arXiv | PDF

  • 2019 | Language Models are Unsupervised Multitask Learners | Alec Radford, et al. | arXiv | PDF

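| Model | Parameters | Corpus | Download links (TensorFlow / PyTorch) | Provider | Source | Domain | Notes |
|---|---|---|---|---|---|---|---|
| GPT2 | 1.5B | 30 GB | Google Drive, Baidu Netdisk (password: ffz6) | Caspar ZHANG | gpt2-ml | | |
| GPT2 | 1.5B | 15 GB | Google Drive, Baidu Netdisk (password: q9vr) | Caspar ZHANG | gpt2-ml | | |
| CDial-GPT_LCCC-base | 95.5M | LCCC-base | [huggingface](https://huggingface.co/thu-coai/CDial-GPT_LCCC-base) | thu-coai | CDial-GPT | | |
| CDial-GPT2_LCCC-base | 95.5M | LCCC-base | [huggingface](https://huggingface.co/thu-coai/CDial-GPT2_LCCC-base) | thu-coai | CDial-GPT | | |
| CDial-GPT_LCCC-large | 95.5M | LCCC-large | [huggingface](https://huggingface.co/thu-coai/CDial-GPT_LCCC-large) | thu-coai | CDial-GPT | | |
| GPT2-dialogue | | Common Chinese chitchat | Google Drive, Baidu Netdisk (password: osi6) | yangjianxin1 | GPT2-chitchat | | |
| GPT2-mmi | | 500K Chinese chitchat corpus | Baidu Netdisk (password: jk8d), GoogleDrive, Google Drive, Baidu Netdisk (password: 1j88) | yangjianxin1 | GPT2-chitchat | | |
| GPT2-散文模型 (prose model) | | 130 MB prose dataset | Google Drive, Baidu Netdisk (password: fpyu) | Zeyao Du | GPT2-Chinese | | |
| GPT2-诗词模型 (poetry model) | | 180 MB classical poetry dataset | Google Drive, Baidu Netdisk (password: 7fev) | Zeyao Du | GPT2-Chinese | | |
| GPT2-对联模型 (couplet model) | | 40 MB couplet dataset | Google Drive, Baidu Netdisk (password: i5n0) | Zeyao Du | GPT2-Chinese | | |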

Contributors

lonepatient
