Thai Natural Language Processing (Thai NLP) Resource
A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| LK82 + Udom83 | Thai Soundex | Python | | | Korakot |
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, GitHub |
| wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, GitHub |
| CutKum | Thai word segmentation with deep learning (RNN) in TensorFlow | Python | 0.93 F-measure | MIT | Pucktada, GitHub |
| Thai Language Toolkit (tltk) | Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation uses a maximum collocation approach; syllable segmentation uses trigram statistics. (Dataset is included.) | Python | 0.9786 F-measure (measured on a different test set, so not directly comparable with the other models) | GPLv3 | awirote, the Python Package Index |
| DeepCut | A Thai word tokenization library using a deep neural network (CNN) | Python | 0.988 F-measure | MIT | rkcosmos, GitHub |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 0.992 F-measure | MIT | KenjiroAI, GitHub |
| CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai, GitHub |
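Several of the segmenters listed above are dictionary-based; Swath, for instance, lists longest matching among its features. The following is a minimal sketch of that general idea only, not the code of Swath or any other library in this list. The function name `longest_matching` and the four-word toy dictionary are assumptions made up for this example.

```python
def longest_matching(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character as an unknown token.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a word is found.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character, kept as its own token
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary, hypothetical and far too small for real use.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}

print(longest_matching("ตากลม", TOY_DICT))  # prints ['ตาก', 'ลม']
```

The input is ambiguous between ตา|กลม and ตาก|ลม; the greedy strategy always commits to the longer first word, which is one reason real systems add maximal matching or statistical disambiguation on top.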
Part of Speech Tagging (POS Tagging)
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 0.9163 F-measure (RNN, LSTM) | MIT | KenjiroAI, GitHub |
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Named Entity Tagging (Thai NEST) | Thai named entity tagging specification and tools | | | GPL | KINDML, SIIT, AIAT |
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, structure tagging, automatic news title generation | GPL | AIAT |
Syntactic Parsing & Tools
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chart-parser | Extracts syntactic structure from a POS-tagged sentence | C | | All rights reserved | Thanaruk T. |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |
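To illustrate what "Labelled Brackets -> Context Free Grammars" with probability computation can look like, here is a minimal sketch that reads bracketed trees, collects one production per node, and estimates each rule's probability by relative frequency. It is not the Grammar Processing tool listed above; the function names (`parse_brackets`, `rules`, `rule_probabilities`) and the two English toy trees are assumptions made up for this example.

```python
from collections import Counter


def parse_brackets(s):
    """Parse '(S (NP I) (VP (V run)))' into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def rules(tree):
    """Yield one (lhs, rhs) production for every node of the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)


def rule_probabilities(bracketings):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(r for s in bracketings for r in rules(parse_brackets(s)))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank.
TREES = ["(S (NP I) (VP (V run)))", "(S (NP you) (VP (V run)))"]
for (lhs, rhs), p in sorted(rule_probabilities(TREES).items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

With these two trees, the rule `NP -> I` gets probability 0.5 because `NP` is expanded twice and only once as `I`, while `S -> NP VP` gets 1.0.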
| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, example, word distance graph | LGPL | Kobkrit V. |
Dictionaries / Translation Pairs
| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Yaitron | LEXiTRON in machine-readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai news, web text, pop music, literature, toponyms | 20M chars | Phrase around a search text | | SEALang |
| HSE Thai Corpus | Modern texts written in the Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |
| Pre-trained Model | Description | Size | Dimensions | License | Link |
| --- | --- | --- | --- | --- | --- |
| fastText | Skip-gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2vec | AWD-LSTM language model trained on Wikipedia. Perplexity of 46.61 with 51556 embeddings. | 147.6MB | 300 | MIT | thai2vec / pyThaiNLP |
Text Classification Benchmarks
| Model | Description | Dataset | Accuracy | License | Link |
| --- | --- | --- | --- | --- | --- |
| thai2vec | Fine-tuned AWD-LSTM language model | BEST | 94.4% | MIT | thai2vec / pyThaiNLP |
Not found? Try another Thai NLP awesome list or resource, such as this one:
http://aiat.in.th/resources/