The-NLP-Pandect

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

The-NLP-Resources

Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries
Conferences

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and compendiums

Pre-trained NLP models

NLP History

General
2020 Year in Review

The-NLP-Podcasts

NLP-only podcasts

Many NLP episodes

Some NLP episodes

The-NLP-Newsletter

The-NLP-Meetups

The-NLP-Youtube

The-NLP-Benchmarks

General NLU

  • GLUE - General Language Understanding Evaluation (GLUE) benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • RACE - ReAding Comprehension dataset collected from English Examinations
  • dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
  • DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking

Summarization

  • WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset

Question Answering

  • SQuAD - Stanford Question Answering Dataset (SQuAD)
  • XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
  • GrailQA - Strongly Generalizable Question Answering (GrailQA)
  • CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

  • XTREME - Massively Multilingual Multi-task Benchmark
  • GLUECoS - A benchmark for code-switched NLP
  • IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
  • IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
  • LinCE - Linguistic Code-Switching Evaluation Benchmark
  • Russian SuperGlue - Russian SuperGlue Benchmark

Bio, Law, and other scientific domains

  • BLURB - Biomedical Language Understanding and Reasoning Benchmark
  • BLUE - Biomedical Language Understanding Evaluation benchmark

Transformer Efficiency

Speech Processing

  • SUPERB - Speech processing Universal PERformance Benchmark

Other

  • CodeXGLUE - A benchmark dataset for code intelligence
  • CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
  • MultiNLI - Multi-Genre Natural Language Inference corpus

The-NLP-Research

General

Embeddings

Repositories

Blogs

Cross-lingual Word and Sentence Embeddings

  • vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 559 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5733 stars]

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 982 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1750 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub, 147 stars]
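The merge-learning idea behind byte pair encoding can be sketched in a few lines. This is a minimal illustration only, not the API of bpemb, subword-nmt or python-bpe: the function names and the toy word counts are assumptions made up for the example, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily learn up to `num_merges` merge rules from word frequencies."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy word counts, invented for illustration only.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

On this toy input the frequent suffix shared by "newest" and "widest" is merged first, which is the behaviour that lets subword vocabularies cover rare and unseen words.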

Transformer-based Architectures

General

Transformer

BERT

Other Transformer Variants

T5
BigBird
Reformer / Linformer / Longformer / Performers
Switch Transformer

GPT-family

General
GPT-3
Learning Resources
Applications
  • Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3256 stars]
  • GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
  • GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
  • OpenAI API - API Demo to use GPT-3 for commercial applications
Open-source Efforts

Other

Distillation, Pruning and Quantization

Automated Summarization

The-NLP-Industry

Best Practices for NLP

MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

  • Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
  • Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
  • Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
  • Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
  • Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as blue/green or canary deploys
  • Data and Model Observability - track data drift, model accuracy drift etc.
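The first two processes above, data versioning and experiment tracking, can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not how DVC, mlflow or any other listed tool works internally: `dataset_version`, `log_experiment` and the in-memory `registry` are hypothetical names, and the metric values are placeholders rather than real results.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short content hash that changes whenever the data changes."""
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_experiment(registry, name, params, metrics, data_version):
    """Append one experiment record to an in-memory registry (hypothetical)."""
    record = {
        "name": name,
        "params": params,
        "metrics": metrics,
        "data_version": data_version,
    }
    registry.append(record)
    return record


# Toy training data, invented for illustration only.
train_data = [
    {"text": "great movie", "label": 1},
    {"text": "awful plot", "label": 0},
]
registry = []

v1 = dataset_version(train_data)
# Metric values below are placeholders, not measured results.
log_experiment(registry, "baseline", {"lr": 0.001}, {"accuracy": None}, v1)

# Adding an example changes the content hash, so the new run is tied
# to a different data version than the baseline.
train_data.append({"text": "fine acting", "label": 1})
v2 = dataset_version(train_data)
log_experiment(registry, "more-data", {"lr": 0.001}, {"accuracy": None}, v2)

for run in registry:
    print(run["name"], run["data_version"])
```

Tying every run to a data hash is what makes an experiment replicable: the parameters alone do not say which snapshot of the data produced a given score.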

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

  • Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
  • Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.

Reading Material

Learning Material

  • MLOps course by Made With ML
  • GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub

MLOps Communities

Data Versioning

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
  • Optuna - hyperparameter optimization framework [GitHub, 4894 stars]
  • Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 4500 stars]

Model Registry

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1301 stars]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • Valohai - End-to-end ML pipelines [Paid Service]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1452 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1568 stars]
  • WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 66 stars]
  • Great Expectations - Write tests for your data [GitHub, 4768 stars]
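A behavioral test of the invariance kind can be sketched as follows: swapping a person's name in an otherwise identical sentence should not change the prediction. This is a minimal sketch and not the CheckList or TextAttack API; `toy_sentiment`, `biased_model` and `invariance_failures` are hypothetical stand-ins, and a real test would call an actual model instead.

```python
def toy_sentiment(text):
    """Hypothetical stand-in for a real model: counts positive vs negative cue words."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def biased_model(text):
    """Deliberately broken variant that reacts to one specific name."""
    return "negative" if "Li" in text else toy_sentiment(text)


def invariance_failures(predict, template, fillers):
    """Return the fillers whose prediction differs from the first filler's."""
    expected = predict(template.format(name=fillers[0]))
    return [f for f in fillers[1:] if predict(template.format(name=f)) != expected]


template = "{name} said the service was great."
names = ["Anna", "Mohammed", "Li", "Oluwaseun"]

failures = invariance_failures(toy_sentiment, template, names)
biased_failures = invariance_failures(biased_model, template, names)
print(failures, biased_failures)
```

The same helper works with any callable that maps a string to a label, so it can sit next to regular unit tests in a CI pipeline.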

Model Deployability and Serving

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • Amazon SageMaker [Paid Service]
  • Valohai - End-to-end ML pipelines [Paid Service]
  • NLP Cloud - Production-ready NLP API [Paid Service]
  • Saturn Cloud [Paid Service]
  • SELDON - machine learning deployment for enterprise [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 1999 stars]
  • Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
  • KFServing - Serverless Inferencing on Kubernetes [GitHub, 1013 stars]
  • TFX - TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines [Paid Service]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • Cortex - containers as a service on AWS [Paid Service]
  • Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
  • End2End Serverless Transformers On AWS Lambda [GitHub, 73 stars]
  • NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 11 stars]
  • Dagster - data orchestrator for machine learning [Free and Open Source]
  • Verta - AI and machine learning deployment and operations [Paid Service]
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 4500 stars]
  • flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 1600 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 420 stars]
  • DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI

Data and Model Observability

General
  • whylogs - open source standard for data and ML logging [GitHub, 515 stars]
  • Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 154 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 420 stars]
  • DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
  • Cortex - containers as a service on AWS [Paid Service]
Model Centric
  • Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
  • Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
  • Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
  • Fiddler - ML Model Performance Management Tool [Paid Service]
  • Hydrosphere - open-source platform for managing ML models [Paid Service]
  • Verta - AI and machine learning deployment and operations [Paid Service]
  • Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
  • iguazio - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]
Data Centric
  • Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
  • acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
  • Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
  • datakin - end-to-end, real-time data lineage solution [Paid Service]
  • Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
  • SODA - data monitoring, testing and validation [Paid Service]
  • whatify - data quality and action recommendation on it [Paid Service]

Feature Stores

  • Tecton - enterprise feature store for machine learning [Paid Service]
  • FEAST - open source feature store for machine learning Website [GitHub, 2084 stars]
  • Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

  • ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 358 stars]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 4500 stars]
  • kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 4200 stars]
  • Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 2500 stars]
  • ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 1200 stars]
  • Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]

Transformer-based Architectures

General

Multi-GPU Transformers

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

  • Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 496 stars]
  • Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 975 stars]
  • FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 147 stars]
  • LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 464 stars]
  • NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
  • Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 296 stars]
  • BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 159 stars]

The-NLP-Speech

General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5811 stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 17826 stars]
  • Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Kaldi is a toolkit for speech recognition [GitHub, 10711 stars]
  • awesome-kaldi - resources for using Kaldi [GitHub, 434 stars]
  • ESPnet - End-to-End Speech Processing Toolkit [GitHub, 4015 stars]
  • HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech

  • FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 637 stars]
  • TTS - a deep learning toolkit for Text-to-Speech [GitHub, 1954 stars]

Datasets

  • VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 199 stars]

The-NLP-Topics

Blogs

Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub, 12305 stars]
  • Spark NLP [GitHub, 2258 stars]

Repositories

Keyword-Extraction

Text Rank

  • PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1584 stars]
  • textrank - TextRank implementation for Python 3 [GitHub, 1045 stars]

RAKE - Rapid Automatic Keyword Extraction

  • rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 835 stars]
  • yake - Single-document unsupervised keyword extraction [GitHub, 735 stars]
  • RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 352 stars]

Other Approaches

  • flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 4891 stars]
  • BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 191 stars]
  • keyBERT - Minimal keyword extraction with BERT [GitHub, 754 stars]

Further Reading

Responsible-NLP

NLP and ML Interpretability

NLP-centric

General

  • Language Interpretability Tool (LIT) [GitHub, 2610 stars]
  • WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 288 stars]
  • Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 261 stars]
  • InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 3987 stars]

Ethics, Bias, and Equality in NLP

Adversarial Attacks for NLP

The-NLP-Frameworks

General Purpose

  • spaCy by Explosion AI [GitHub, 21027 stars]
  • flair by Zalando [GitHub, 10638 stars]
  • AllenNLP by AI2 [GitHub, 10371 stars]
  • stanza (former Stanford NLP) [GitHub, 5596 stars]
  • spaCy stanza [GitHub, 552 stars]
  • nltk [GitHub, 10030 stars]
  • gensim - framework for topic modeling [GitHub, 12305 stars]
  • pororo - Platform of neural models for natural language processing [GitHub, 960 stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2705 stars]
  • FARM [GitHub, 1270 stars]
  • gobbli by RTI International [GitHub, 260 stars]
  • headliner - training and deployment of seq2seq models [GitHub, 230 stars]
  • SyferText - A privacy preserving NLP framework [GitHub, 178 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1142 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub, 2302 stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub, 7778 stars]
  • AdaptNLP - A high level framework and library for NLP [GitHub, 342 stars]
  • textacy - NLP, before and after spaCy [GitHub, 1715 stars]
  • texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2188 stars]
  • jiant - jiant is an NLP toolkit [GitHub, 1310 stars]

Data Augmentation

  • WildNLP - Text manipulation library to test NLP models [GitHub, 66 stars]
  • snorkel - Framework to generate training data [GitHub, 4724 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 2296 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 328 stars]
  • faker - Python package that generates fake data for you [GitHub, 12849 stars]
  • textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 490 stars]
  • Parrot - Practical and feature-rich paraphrasing framework [GitHub, 258 stars]
  • AugLy - data augmentations library for audio, image, text, and video [GitHub, 3721 stars]
  • TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 160 stars]

Adversarial NLP Attacks & Behavioral Testing

  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1568 stars]
  • CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5213 stars]
  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1452 stars]

Transformer-oriented

  • transformers by HuggingFace [GitHub, 49306 stars]
  • Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 493 stars]
  • haystack - Transformers at scale for question answering & neural search. [GitHub, 2174 stars]

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub, 5333 stars]
  • ParlAI by FAIR [GitHub, 8105 stars]
  • rasa - Framework for Conversational Agents [GitHub, 11836 stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5811 stars]
  • ChatterBot - conversational dialog engine for creating chat bots [GitHub, 11381 stars]
  • SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 2800 stars]

Word/Sentence-embeddings oriented

  • MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2830 stars]
  • vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 559 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5733 stars]

Social Media Oriented

  • Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 482 stars]

Phonetics

  • DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 57 stars]

Morphology

  • LemmInflect - python module for English lemmatization and inflection [GitHub, 131 stars]
  • Inflect - generate plurals, ordinals, indefinite articles [GitHub, 552 stars]
  • simplemma - simple multilingual lemmatizer for Python [GitHub, 2 stars]

Multi-lingual tools

  • polyglot - Multi-lingual NLP Framework [GitHub, 1878 stars]
  • trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 507 stars]

Distributed NLP / Multi-GPU NLP

Machine Translation

  • COMET - A Neural Framework for MT Evaluation [GitHub, 70 stars]
  • marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 818 stars]
  • argos-translate - Open source neural machine translation in Python [GitHub, 668 stars]
  • Opus-MT - Open neural machine translation models and web services [GitHub, 141 stars]
  • dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 156 stars]

Entity and String Matching

  • PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 366 stars]
  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 627 stars]
  • fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8388 stars]
  • jellyfish - approximate and phonetic matching of strings [GitHub, 1495 stars]
  • textdistance - Compute distance between sequences [GitHub, 2013 stars]
  • DeepMatcher - Python package for entity and text matching using deep learning [GitHub, 358 stars]

Discourse Analysis

  • ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 283 stars]

PII scrubbing

  • scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 241 stars]

Non-English oriented

Japanese

  • fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 150 stars]
  • SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 230 stars]
  • Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 230 stars]
  • jProcessing - Japanese Natural Language Processing Libraries [GitHub, 138 stars]
  • Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 489 stars]
  • kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 776 stars]
  • nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 276 stars]
  • KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 180 stars]
  • Jigg - Pipeline framework for easy natural language processing [GitHub, 68 stars]
  • Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 265 stars]
  • RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 438 stars]
  • toiro - a comparison tool of Japanese tokenizers [GitHub, 100 stars]

Other

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 86 stars]
  • Kashgari - Transfer Learning with focus on Chinese [GitHub, 2152 stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub, 873 stars]

Text Data Labelling

  • Small-Text - Active Learning for Text Classification in Python [GitHub, 73 stars]
  • Doccano - open source annotation tool for machine learning practitioners [GitHub, 5100 stars]
  • Prodigy - annotation tool powered by active learning [Paid Service]

The-NLP-Learning

General

Courses

Books

Tutorials

The-NLP-Communities

Other-NLP-Topics

Tokenization

  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 4721 stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5262 stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 91 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks
  • WildNLP - Text manipulation library to test NLP models [GitHub, 66 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 2296 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 328 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1568 stars]
  • skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 323 stars]
  • NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 219 stars]
  • EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1000 stars]
  • snorkel - Framework to generate training data [GitHub, 4724 stars]
Reading Material and Tutorials

Named Entity Recognition (NER)

Relation Extraction

  • tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 290 stars]
  • tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 43 stars]
  • tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 60 stars]

Coreference Resolution

Domain Adaptation

Low Resource NLP

Spell Correction

  • Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 707 stars]
  • NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 226 stars]
  • SymSpellPy - Python port of SymSpell [GitHub, 458 stars]
  • Speller100 by Microsoft [Blog, Feb 2021]
  • JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 405 stars]

Style Transfer for NLP

  • Styleformer - Neural Language Style Transfer framework [GitHub, 216 stars]

Automata Theory for NLP

  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 627 stars]

Obscene words detection

  • LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1467 stars]

Reinforcement Learning for NLP

  • nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 95 stars]

AutoML / AutoNLP

  • AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 623 stars]
  • TPOT - Python Automated Machine Learning tool [GitHub, 8158 stars]
  • Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1287 stars]
  • HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 625 stars]
  • AutoML Natural Language - Google's paid AutoML NLP service
  • Optuna - hyperparameter optimization framework [GitHub, 4894 stars]
  • FLAML - fast and lightweight AutoML library [GitHub, 603 stars]

Text Generation

License CC0

Attributions

Resources

  • All linked resources belong to original authors

Icons

Fonts


the-nlp-pandect's People

Contributors

ivan-bilan, anoopkunchukuttan, dwhitena, stephenroller
