
thai2transformers's Introduction

thai2transformers

Pretraining transformer-based Thai language models


thai2transformers provides customized scripts to pretrain transformer-based masked language models on Thai texts with the following types of tokens (a short tokenization sketch follows the list):

  • spm: a subword-level token from the SentencePiece library.
  • newmm: a dictionary-based Thai word tokenizer based on maximal matching from PyThaiNLP.
  • syllable: a dictionary-based Thai syllable tokenizer based on maximal matching from PyThaiNLP. The list of syllables used is from pythainlp/corpus/syllables_th.txt.
  • sefr: an ML-based Thai word tokenizer based on Stacked Ensemble Filter and Refine (SEFR) [Limkonchotiwat et al., 2020], which uses probabilities from the CNN-based deepcut; the SEFR tokenizer is loaded with engine="best".
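
For illustration, here is a minimal sketch (not part of the repo's scripts) of two of these token types. It assumes pythainlp and sentencepiece are installed; the SentencePiece model path is a hypothetical placeholder for a model trained on your own corpus.

from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

text = "ประเทศไทยมีจังหวัดทั้งหมด 77 จังหวัด"

# newmm: dictionary-based maximal-matching word tokenization from PyThaiNLP
print(word_tokenize(text, engine="newmm"))

# spm: subword tokenization with a previously trained SentencePiece model (hypothetical path)
sp = spm.SentencePieceProcessor(model_file="path/to/spm.model")
print(sp.encode(text, out_type=str))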


Thai texts for language model pretraining


We curate a list of sources that can be used to pretrain a language model. The statistics for each data source are listed on this page.

Also, you can download the current version of the cleaned datasets from here.



Model pretraining and finetuning instructions:


a) Instructions for RoBERTa BASE model pretraining on a Thai Wikipedia dump:

In this example, we demonstrate how to pretrain a RoBERTa BASE model on a Thai Wikipedia dump from scratch.

  1. Install required libraries: 1_installation.md

  2. Prepare thwiki dataset from Thai Wikipedia dump: 2_thwiki_data-preparation.md

  3. Tokenizer training and vocabulary building:

    a) For SentencePiece BPE (spm), word-level tokens (newmm), and syllable-level tokens (syllable): 3_train_tokenizer.md

    b) For word-level tokens from Limkonchotiwat et al., 2020 (sefr-cut): 3b_sefr-cut_pretokenize.md

  4. Pretrain a masked language model: 4_run_mlm.md
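
For orientation, here is a minimal sketch of step 4 using the Hugging Face Trainer API. The file paths are hypothetical placeholders for the outputs of steps 2–3, and the actual hyperparameters in 4_run_mlm.md may differ.

from datasets import load_dataset
from transformers import (
    AutoTokenizer, RobertaConfig, RobertaForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# tokenizer trained in step 3 and thwiki text prepared in step 2 (hypothetical paths)
tokenizer = AutoTokenizer.from_pretrained("path/to/trained_tokenizer")
raw = load_dataset("text", data_files={"train": "path/to/thwiki_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa architecture initialized from scratch with the trained tokenizer's vocabulary
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./mlm_checkpoints", per_device_train_batch_size=32),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()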


b) Instructions for RoBERTa model finetuning on existing Thai text classification and NER/POS tagging datasets.

In this example, we demonstrate how to finetune WangchanBERTa, a RoBERTa BASE model pretrained on a Thai Wikipedia dump and assorted Thai texts.

  • Finetune the model for the sequence classification task on existing datasets including wisesight_sentiment, wongnai_reviews, generated_reviews_enth (review star prediction), and prachathai67k: 5a_finetune_sequence_classificaition.md (see the sketch after this list)

  • Finetune the model for the token classification task (NER and POS tagging) on existing datasets including thainer and lst20: 5b_finetune_token_classificaition.md
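
As an illustration of the sequence classification setup, here is a minimal sketch of finetuning WangchanBERTa on wisesight_sentiment. It assumes the column names (texts, category) and the four labels published on the Hugging Face Hub; the official 5a script may differ.

from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,
)

model_name = "airesearch/wangchanberta-base-att-spm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

dataset = load_dataset("wisesight_sentiment")

def tokenize(batch):
    return tokenizer(batch["texts"], truncation=True, max_length=416)

# tokenize the texts and expose the gold labels under the name the Trainer expects
tokenized = dataset.map(tokenize, batched=True).rename_column("category", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./wisesight_ckpt", per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()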



BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

thai2transformers's People

Contributors

computerscienceiscool, cstorm125, dependabot[bot], lalital, wannaphong, zincorca


thai2transformers's Issues

Wrong feature extraction in the WangchanBERTa Getting Started Notebook

Under the Feature Extraction section in the WangchanBERTa: Getting Started Notebook, there is this function:

def extract_last_k(input_text, feature_extractor, last_k=4):
    hidden_states = feature_extractor(input_text)[0]
    last_k_layers = [hidden_states[i] for i in [-i for i in range(1,last_k+1)]]
    cat_hidden_states = sum(last_k_layers, [])
    return np.array(cat_hidden_states)

If I'm not mistaken, this function is meant to extract embeddings from the last k layers, but what it actually does is extract embeddings from only the last layer and take only the last k token embedding vectors.

Below is a snippet of what I think it should be:

import torch

def extract_embeddings(model, tokenizer, text, last_k=4):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # concatenate the hidden states of the last k layers: (last_k, #tokens, hidden_size)
    embeddings = torch.cat([outputs.hidden_states[-i] for i in range(1, last_k + 1)])
    return embeddings

Some examples of how to use the modified function (tokenizers below is the model-name-to-tokenizer-class mapping defined in the Getting Started notebook):

from transformers import AutoModel

text1 = "เพื่อน"
model_name = "wangchanberta-base-att-spm-uncased"
model = AutoModel.from_pretrained(f'airesearch/{model_name}', revision='main')
tokenizer = tokenizers[model_name].from_pretrained(f'airesearch/{model_name}', revision='main', model_max_length=416)
embeddings = extract_embeddings(model, tokenizer, text1) # (#layers, #tokens, hidden_size) -> this case (4, 5, 768)

# sum across layers first
token_embeddings_last_4_layers = embeddings.sum(dim=0) # (#tokens, hidden_size) -> this case (5, 768)
sum_sentence_embedding = token_embeddings_last_4_layers.sum(dim=0) # (hidden_size) -> this case (768)

# sum across tokens first
layer_embeddings = embeddings.sum(dim=1) # (#layers, hidden_size) -> this case (4, 768)
concat_sentence_embedding = torch.cat(tuple(layer_embeddings)) # (#layers x hidden_size) -> this case (3072)

Not sure how to make this into a pipeline, but I hope this code helps.
Please confirm if this makes sense.

New WangchanBERTa

List of issues to take care of:

  • CC-100 dataset
  • Space token fix

developing

Experiment on full set of `iapp_wiki_qa_squad`

Sub tasks

Optimize for QA performance

WangchanBERTa still underperforms xlm-roberta-base:

  • Perform error analysis on iapp_wiki_qa_squad validation and test set
  • Error analysis
  • Re-finetune for optimal performance

Refactor thai2transformers as utility package for transformers

transformers is currently the de facto way to train NLP models (maybe speech and image soon?). For Thai, we have some difficulties using the default settings; for example, tokenization for sequence-based metrics such as BLEU is based on space tokenization (a sketch of this issue follows the checklist below). We also want to include some quality-of-life functions, such as easily loading datasets into datasets objects, and the preprocessing functions that are available in the tutorial notebooks.

  • Thai-language specific metrics
  • Preprocessing functions
  • Load datasets
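
For example, a minimal sketch (not the package's actual implementation) of the BLEU problem above, pre-tokenizing with PyThaiNLP's newmm instead of splitting on spaces:

from pythainlp.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

reference = "ฉันชอบกินข้าวผัด"
hypothesis = "ฉันชอบข้าวผัด"

# space tokenization yields a single "token" for a Thai sentence, so BLEU is meaningless
print(reference.split())  # ['ฉันชอบกินข้าวผัด']

# word-level tokenization gives BLEU something meaningful to compare
ref_tokens = word_tokenize(reference, engine="newmm")
hyp_tokens = word_tokenize(hypothesis, engine="newmm")
print(sentence_bleu([ref_tokens], hyp_tokens, weights=(0.5, 0.5)))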

PR #67

Error บน wongnai_reviews

Hello,

I tried wangchanberta on wongnai_reviews with the code below and got a strange error. Could you tell me how to fix it?

from transformers import (
    CamembertTokenizer,
    AutoModelForSequenceClassification,
    pipeline
)
from thai2transformers.preprocess import process_transformers

# Load pre-trained tokenizer
tokenizer = CamembertTokenizer.from_pretrained(
                                  'airesearch/wangchanberta-base-att-spm-uncased',
                                  revision='main')
tokenizer.additional_special_tokens = ['<s>NOTUSED', '</s>NOTUSED', '<_>']

# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
                                  'airesearch/wangchanberta-base-att-spm-uncased',
                                  revision='finetuned@wongnai_reviews')

classify_sequence = pipeline(task='text-classification',
          tokenizer=tokenizer,
          model=model)

from datasets import load_dataset

dataset = load_dataset('wongnai_reviews', cache_dir='huggingface_cache')

text = dataset['test'][3]['review_body']

processed_input_text = process_transformers(text)

result = classify_sequence(processed_input_text)
print(result)

The error I got is in the attached screenshot.

Consolidate Thai extractive question answering datasets

Consolidate all Thai QA datasets into one benchmark. This is done to have a sizeable sample to train Thai QA (see the sketch after the lists below).
Training sets (checklist of availability on huggingface datasets):

  • thaiqa_squad
  • iapp_wiki_qa_squad
  • xquad

Test set:

  • iapp_wiki_qa_squad test set; removing questions with contexts overlapping with training sets

Available on huggingface datasets but cannot be used for Thai extractive QA:

  • tydiqa
  • mkqa
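
A minimal sketch of the consolidation, assuming each training set has first been mapped to a common SQuAD-style schema (concatenate_datasets requires identical features):

from datasets import load_dataset, concatenate_datasets

# load the Thai QA training sets from the Hugging Face Hub
thaiqa = load_dataset("thaiqa_squad", split="train")
iapp = load_dataset("iapp_wiki_qa_squad", split="train")

# in practice, map both to identical column names/types before concatenating
combined_train = concatenate_datasets([thaiqa, iapp])
print(combined_train)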

Fix an issue where input tokens to the WangchanBERTa NER pipeline may be in the incorrect form

According to the model finetuning pipeline, the input tokens are first tokenized with another tokenizer (e.g. PyThaiNLP's newmm for the thainer dataset) and then retokenized with the SentencePiece tokenizer. However, the input tokens fed to the finetuned model are tokenized with SentencePiece only (not newmm first, followed by SentencePiece).

Proposed solution (a rough sketch follows the steps):

  1. Pretokenize with PyThaiNLP's newmm tokenizer
  2. Retokenize with the subword tokenizer (SentencePiece)
  3. Map the prediction results of the subword tokens to the tokens tokenized with newmm
  4. Return the prediction results in word-level and chunk-level
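
A rough sketch of steps 1–3 (a hypothetical helper, not the feature/ner_pipeline code; it assumes a fast tokenizer so that word_ids() is available, and predict_subword_tags stands in for the finetuned model's per-subword predictions):

from pythainlp.tokenize import word_tokenize

def map_subword_predictions(text, tokenizer, predict_subword_tags):
    words = word_tokenize(text, engine="newmm")        # 1. pretokenize with newmm
    enc = tokenizer(words, is_split_into_words=True)   # 2. retokenize into subwords
    tags = predict_subword_tags(enc)                    # hypothetical model call, one tag per subword
    word_ids = enc.word_ids()
    word_tags = []
    for idx, word_id in enumerate(word_ids):
        # 3. keep the tag of the first subword of each newmm word
        if word_id is not None and (idx == 0 or word_ids[idx - 1] != word_id):
            word_tags.append(tags[idx])
    return list(zip(words, word_tags))                  # word-level results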

Branch name: feature/ner_pipeline

NER Pipeline Demo (via Colab): https://colab.research.google.com/drive/1-54NeM_wsjitaiSXfMBpcnqzbPMR0a9R#scrollTo=VzSGZbwWaiOI

Missing model_max_length in roberta config

When loaded with transformers.AutoTokenizer.from_pretrained, model_max_length was set to 1000000000000000019884624838656.

This results in IndexError: index out of range in self when using with flair in the code below.

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

sentence = Sentence("ตัวอย่างข้อความภาษาไทย")  # the document to embed
wangchanberta = TransformerDocumentEmbeddings('airesearch/wangchanberta-base-att-spm-uncased')
wangchanberta.embed(sentence)

After searching, I found this issue huggingface/transformers#14315 (comment) and it stated that model_max_length is missing from the configuration file.

My current workaround is to manually call the following code to override the missing config.

wangchanberta.tokenizer.model_max_length = 510
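
A possible alternative when using the tokenizer directly through transformers (not via flair) is to pass the limit at load time; from_pretrained forwards extra keyword arguments such as model_max_length to the tokenizer constructor:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'airesearch/wangchanberta-base-att-spm-uncased', model_max_length=510)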

Broken URLs

The links to the data source statistics and cleaned datasets in the README are broken or removed. Please update the links to the latest version.

tokenizers package conflicting

Hi, I'm not sure if this is a real problem. My environment is a fresh Anaconda environment with Python 3.7. I tried to install the package with pip install thai2transformers==0.1.0 and got the error in the attached screenshot.

thanks.

Major, breaking refactoring to v1

Refactoring thai2transformers as

Huggingface Utility Functions, Scripts and Notebooks for Thai language

thai2transformers provides utility functions, scripts and notebooks to pretrain, finetune, evaluate and run inference with Huggingface models and datasets for the Thai language.

Progress tracking: https://docs.google.com/spreadsheets/d/1Arusyp3NOiBSv3KAdNPnN1VWOsVnj5ZHG71fKxnEK7o/edit?usp=sharing
Planned benchmark downstream tasks: https://docs.google.com/spreadsheets/d/1HE-6A4VuxVMvc78WTVGF0aOkrogd-BnJrBUdZBB13U8/edit?usp=sharing
Working branch: https://github.com/vistec-AI/thai2transformers/tree/refactor-as-package
