Some useful resources for NLP

🎉NLP_resource 🎊

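What is the fastest way to get started with natural language processing?

Prof. Liu Zhiyuan's NLP研究入门之道 (A Guide to Getting Started in NLP Research) 👍👍👍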

🎈Contents


Paper List ⤵️	News and Updates ⤵️
Toolkits ⤵️	Datasets ⤵️
Major Research Institutions ⤵️	Fundamentals / Training ⤵️

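🎓 Paper List

Top conferences: ACL, EMNLP, NAACL, COLING (the number of papers accepted at the first three is the metric CSRankings uses for this area)

Acceptance rates of selected conferences over the years

The main difficulty in natural language processing is resolving ambiguity, which shows up in lexical analysis, syntactic analysis, semantic analysis, and other stages. Resolving it requires both linguistic knowledge (morphology, syntax, semantics, context) and world knowledge (which is independent of any language).

Natural Language Processing - Overview: an overview of deep learning techniques applied to natural language, covering theory, implementations, applications, and state-of-the-art results.

NLP 的巨人肩膀 (On the Shoulders of NLP Giants): a fairly detailed account of how several lines of NLP research developed

List

Survey the current state of your research area thoroughly, including its methodology (whether it has a mathematical foundation, and the steps of each method), its datasets (the accepted training and test sets), and its research teams (follow their ongoing work). Start from practice, and improve at reading papers and implementing code in parallel.

Survey papers 💥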

Pre-trained models: Pre-trained Models for Natural Language Processing: A Survey. paper
Contextual embeddings: A Survey on Contextual Embeddings. paper
Text classification: Deep Learning Based Text Classification: A Comprehensive Review. paper
Named entity recognition: A Survey on Deep Learning for Named Entity Recognition. paper
Generative adversarial networks: A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. paper
Relation extraction: More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction. paper
Knowledge graphs: A Survey on Knowledge Graphs: Representation, Acquisition and Applications. paper

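Text Representation

Most task-specific NLP research follows the same three steps: text representation, encoding and updating, and target prediction. Text representation is the most important of the three. It means representing input text at the word, sentence, or document level as low-dimensional dense vectors, that is, embeddings.

Only commonly used models are listed below. For the many other papers and implementations, see the list of embedding-related papers and code, and pay particular attention to papers with more than 999 citations.

Statistical language model: A Neural Probabilistic Language Model paper 👍👍👍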
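To make the idea of an embedding concrete, here is a minimal sketch using only the Python standard library. The toy vocabulary, the 4-dimensional vectors, the random initialisation, and the function names `embed_words` and `embed_sentence` are all assumptions made for illustration; real models learn the table from data and use far larger dimensions.

```python
import random

# Toy vocabulary (hypothetical); index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "natural": 1, "language": 2, "processing": 3}
dim = 4  # embedding dimension, chosen arbitrarily for the example

rng = random.Random(0)
# Embedding table: one dense vector per vocabulary entry.
# Randomly initialised here; a trained model would learn these values.
embedding = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]


def embed_words(tokens):
    """Map each token to its vector, falling back to <unk> for unseen words."""
    return [embedding[vocab.get(token, 0)] for token in tokens]


def embed_sentence(tokens):
    """Sentence embedding as the element-wise mean of its word vectors."""
    if not tokens:
        raise ValueError("cannot embed an empty sentence")
    vectors = embed_words(tokens)
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Mean pooling is only one simple way to move from word vectors to a sentence vector; the contextual models in the table below compute token vectors that depend on the surrounding words instead.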

Date	Model Name	Paper	Codes
Shallow word embeddings
2013/01	Word2Vec	Efficient Estimation of Word Representations in Vector Space	C
2014	GloVe	GloVe: Global Vectors for Word Representation	C
2014/05	Doc2Vec	Distributed Representations of Sentences and Documents	PyTorch Python
2016/07	fastText	Enriching Word Vectors with Subword Information	C++
Contextual embeddings			(most pre-trained models can be loaded through the transformers library)
2018	GPT	Improving Language Understanding by Generative Pre-Training	TF Keras PyTorch, TF2.0
-	GPT-2(117M, 124M, 345M, 355M, 774M, 1558M)	Language Models are Unsupervised Multitask Learners	TF PyTorch, TF2.0 Keras
2018/02	ELMo(AllenNLP, TF-Hub)	Deep contextualized word representations	PyTorch TF
2018/10	BERT(BERT, ERNIE, KoBERT)	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	TF Keras PyTorch, TF2.0 MXNet PaddlePaddle TF Keras
2019/01	Transformer-XL	Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context	TF PyTorch PyTorch, TF2.0
2019/05	ERNIE	ERNIE: Enhanced Language Representation with Informative Entities	PyTorch
2019/07	RoBERTa	RoBERTa: A Robustly Optimized BERT Pretraining Approach	PyTorch PyTorch, TF2.0
2019/09	ALBERT	ALBERT: A Lite BERT for Self-supervised Learning of Language Representations	TF
2019/10	DistilBERT	DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter	PyTorch, TF2.0
2019/10	T5	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	TF
2020/05	GPT-3	Language Models are Few-Shot Learners	https://github.com/openai/gpt-3

Text Classification

TextCNN: Convolutional Neural Networks for Sentence Classification
TextRNN:
TextRCNN: Recurrent Convolutional Neural Network for Text Classification
HAN: Hierarchical Attention Networks for Document Classification

Back to contents ⤴️

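💘 Toolkits

Tools for word segmentation, part-of-speech tagging, named entity recognition, and similar tasks, mainly in Python and Java

NLTK - the Natural Language Toolkit 👍
spacy - a high-performance NLP library written in Python and Cython 👍
gensim - a library for unsupervised semantic modelling of plain text, with support for word2vec and related algorithms 👍
StanfordNLP - a multilingual NLP library, available for Java and Python 👍
OpenNLP - a machine-learning-based NLP toolkit written in Java 👍
TextBlob - provides a consistent API for diving into common NLP tasks
Jieba (结巴分词) - a powerful Chinese word segmentation library for Python 👍
HanLP - a multilingual NLP toolkit aimed at production environments
SnowNLP - a Python package for Chinese NLP; it does not use NLTK, and all algorithms are implemented from scratch
FudanNLP - a Java library for Chinese text processing
THULAC - provides Chinese word segmentation and part-of-speech tagging.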

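Pre-trained model resources

transformers - a powerful library for loading and training pre-trained models 👍
- Note: with transformers > 3.1.0, adding the mirror option to a from_pretrained call, for example AutoModel.from_pretrained('bert-base-uncased', mirror='tuna'), can speed up model downloads.
- Pass cache_dir="XXX" to set the cache location manually. If it is not set, files are downloaded to ~/.cache/torch or C:\Users\XXXX\.cache\torch by default. Each downloaded file comes with a JSON file that acts as a marker describing what the file is for.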
Chinese-Word-Vectors
Chinese-BERT-wwm Pre-Training with Whole Word Masking for Chinese BERT
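The cache_dir note for transformers above can be sketched as follows. `cache_dir` is a keyword argument accepted by `from_pretrained`; the helper names `resolve_cache_dir` and `load_model` and the `./hf_cache` path are hypothetical and exist only for this example. The mirror option is left out on purpose, because whether it is accepted depends on the installed transformers version.

```python
from pathlib import Path


def resolve_cache_dir(cache_dir):
    """Expand '~' and create the directory so downloads have somewhere to go."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path


def load_model(name, cache_dir):
    """Load a tokenizer and model, storing downloaded files under cache_dir.

    Requires the transformers package and network access on first use.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    cache = str(resolve_cache_dir(cache_dir))
    tokenizer = AutoTokenizer.from_pretrained(name, cache_dir=cache)
    model = AutoModel.from_pretrained(name, cache_dir=cache)
    return tokenizer, model


# Example call (downloads the model on first use):
# tokenizer, model = load_model("bert-base-uncased", "./hf_cache")
```

Keeping the cache in an explicit directory makes it easier to see what has been downloaded and to copy the files to a machine without network access.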

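Deep learning frameworks

Other

Interactive Attention Visualization - interactive visualization of attention
TextGrapher - takes a document as input and renders its semantic content as a graph.
Scattertext - finds words or phrases that distinguish categories in a corpus and displays them in an interactive HTML scatter plot
Seaborn - a visualization library, useful for plots such as attention heatmaps

Back to contents ⤴️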

🌀 Datasets

nlp-datasets - a good collection of natural language datasets
The Big Bad NLP Database
CLUEDatasetSearch

Fakenews

Fake News Detection on Social Media: A Data Mining Perspective

Event

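Back to contents ⤴️

🔥 Major Research Institutions

If any information here is incorrect or missing, corrections and comments are welcome. The list will be updated regularly.

PS: the entries below are in no particular order; for rankings, see CSRankings.

The figure below shows the academic lineage of NLP research in China:

Figure: lineage of NLP research in China (provided by a Zhihu user)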

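Name	GitHub	Notes
Universities
Stanford NLP Group, Stanford University	https://github.com/stanfordnlp	Stanford CoreNLP
Language Technologies Institute, Carnegie Mellon University
Institute of Computational Linguistics, Peking University	Language Computing and Machine Learning Group https://github.com/lancopku	Key Laboratory of Computational Linguistics, Ministry of Education
Natural Language Processing and Computational Social Science Lab, Tsinghua University	https://github.com/thunlp	Sun Maosong and Liu Zhiyuan's team
Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology	https://hub.fastgit.org/HIT-SCIR	Liu Ting's team
Natural Language Processing Group, Institute of Computing Technology, Chinese Academy of Sciences	https://github.com/ictnlp
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Natural Language Processing Lab, Fudan University	https://github.com/FudanNLP
Natural Language Processing Group, Nanjing University		WeChat account NJU-NLP
Human Language Technology Center, Hong Kong University of Science and Technology
Natural Language Processing Group, University of Edinburgh (EdinburghNLP)	https://github.com/EdinburghNLP/
Industry
Tencent AI Lab
Natural Language Computing Group, Microsoft Research Asia
Baidu NLP	https://github.com/baidu	Provides the PaddlePaddle framework
Sogou Labs		Provides corpus resources
Language Technology Lab, Alibaba DAMO Academy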

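Back to contents ⤴️

📢 News and Updates

机器学习算法与自然语言处理 (Machine Learning Algorithms and NLP): WeChat official account and Zhihu column
机器之心 (Synced): WeChat official account and Zhihu
Tracking progress in natural language processing (NLP): https://nlpprogress.com/
ruder's blog
52nlp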

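Back to contents ⤴️

📓 Fundamentals / Training

Courses and materials

PS: it is best to learn machine learning and deep learning concepts and models in the context of a specific area, then go back and study the key topics in depth as needed.

Stanford University - Natural Language Processing with Deep Learning - CS224n 👍 Bilibili video url
Stanford University - Natural Language Understanding - CS224U
Stanford University - Machine Learning - CS229 (old) CS229 (new)
University of Massachusetts - Advanced Natural Language Processing - CS 685
Johns Hopkins University - Machine Translation - EN 601.468/668
MIT - Deep Learning - 6.S094, 6.S091, 6.S093
UPC Barcelona - Deep Learning for Speech and Language
MIT - Linear Algebra - 18.06 SC
47 machine learning projects for 2021
Deep Learning book 📖
Natural Language Processing with Python 📖
code-of-learn-deep-learning-with-pytorch url
nlp_course https://github.com/yandexdataschool/nlp_course
Deep Learning for NLP resources (DL4NLP)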
Basic models / methods

Long short-term memory network: LSTM (Long Short-term Memory). paper
Residual network: Residual Network (Deep Residual Learning for Image Recognition). paper
DenseNet: Densely Connected Convolutional Networks paper code
- ResNet residual connections + dense connectivity + composite function
Dropout (Improving neural networks by preventing co-adaptation of feature detectors). paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. paper
Survey of optimization algorithms: An overview of gradient descent optimization algorithms paper
Xavier initialization: Understanding the Difficulty of Training Deep Feedforward Neural Networks paper
Comparison of activation functions in NLP: Comparing Deep Learning Activation Functions Across NLP Tasks paper
Attention mechanism: Attention Is All You Need paper

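Blogs

Transformer explained: The Illustrated Transformer (Chinese translation), Transformers from scratch
放弃幻想，全面拥抱Transformer (Abandon Illusions and Fully Embrace the Transformer): a comparison of the three major feature extractors in NLP (CNN/RNN/Transformer) url
- An RNN accepts variable-length input and propagates information linearly from front to back, so it encodes position information implicitly
- A CNN captures k-gram fragments of the word sequence, where k is the size of the sliding window
- Replacing the self-attention module in a Transformer with a bidirectional RNN or a CNN improves on the results of the original RNN/CNN
RNN vs LSTM vs GRU: which one to choose? url
An unusually clear walkthrough of LSTM and GRU (animations + video). url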
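The k-gram note in the feature-extractor comparison above can be illustrated with a short sketch. It only shows which token windows a width-k convolution looks at, not the convolution itself; the function name `kgram_windows` and the example sentence are assumptions made for illustration.

```python
def kgram_windows(tokens, k):
    """Return every window of k consecutive tokens, in order.

    A 1-D convolution with kernel width k (stride 1, no padding) computes
    one feature vector per window returned here.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


sentence = ["deep", "learning", "for", "text"]
bigrams = kgram_windows(sentence, 2)
# bigrams == [("deep", "learning"), ("learning", "for"), ("for", "text")]
```

Because each window sees only k neighbouring tokens, a single convolution layer cannot relate words that are further apart than the window size, which is one reason stacked layers or attention are used for long-range dependencies.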

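Other

Leaderboards Evaluation 🥇

Training tricks

Neural Networks: Tricks of the Trade 📖

Tuning tips for deep networks: hyperparameters include the learning rate, batch size, regularization, and so on.
Early Stopping (stop when the generalization error exceeds a set threshold, when it rises during training, or when it has not decreased for a long time)
Weight Decay Parameter
Regularization methods
Multi-task learning
Data augmentation

Back to contents ⤴️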
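As a concrete illustration of the early stopping item above, here is a minimal, framework-independent sketch of the third criterion (validation loss has not improved for a number of epochs). The class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the helper `epochs_run` are assumptions chosen for this example, not the API of any particular library.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs run before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

In a real training loop, `step` would be called once per epoch with the loss measured on a held-out validation set, and the weights from the best epoch would usually be saved so they can be restored after stopping.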

⬜ TODO

✅

⬜

📃 References

emoji-list
https://github.com/keon/awesome-nlp
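https://github.com/kmario23/deep-learning-drizzle lectures on deep learning, reinforcement learning, machine learning, computer vision, and NLP
GitHub API: https://api.github.com/repos/{:owner}/{:repository} for looking up related projects automatically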
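The GitHub API entry above can be used roughly as follows. This is a minimal sketch with the standard library only: the function names `repo_url` and `fetch_repo_info`, the User-Agent string, and the choice of returned fields are assumptions for illustration. Unauthenticated requests are rate-limited by GitHub.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com/repos"


def repo_url(owner, repository):
    """Build the repository endpoint URL from the owner and repository name."""
    return f"{API_ROOT}/{owner}/{repository}"


def fetch_repo_info(owner, repository, timeout=10):
    """Fetch repository metadata and keep a few commonly used fields.

    Requires network access; raises urllib.error.HTTPError on a bad response.
    """
    request = Request(
        repo_url(owner, repository),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "nlp-resource-example",  # hypothetical client name
        },
    )
    with urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    fields = ("full_name", "stargazers_count", "forks_count", "updated_at")
    return {key: data.get(key) for key in fields}


# Example call (needs network access):
# info = fetch_repo_info("keon", "awesome-nlp")
```

Looping such a call over the repositories listed in this README is one way to refresh star counts or spot projects that are no longer maintained.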

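🙏 Contributing

If you find material that fits any category in this project, please open an issue or send a PR.

Thanks to everyone who has helped with this project and to the materials it draws on. 💝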
