
thai2transformers's Introduction

thai2transformers

Pretraining transformer-based Thai language models


thai2transformers provides customized scripts to pretrain transformer-based masked language models on Thai texts with the following types of tokens (a short tokenization sketch follows the list):

  • spm: a subword-level token from the SentencePiece library.
  • newmm: a dictionary-based Thai word tokenizer based on maximal matching from PyThaiNLP.
  • syllable: a dictionary-based Thai syllable tokenizer based on maximal matching from PyThaiNLP. The list of syllables used is from pythainlp/corpus/syllables_th.txt.
  • sefr: an ML-based Thai word tokenizer based on Stacked Ensemble Filter and Refine (SEFR) [Limkonchotiwat et al., 2020], which uses probabilities from the CNN-based deepcut; the SEFR tokenizer is loaded with engine="best".
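
For illustration, here is a minimal sketch (not part of the repo's scripts) of two of these token types. It assumes pythainlp and sentencepiece are installed; the SentencePiece model path is a hypothetical placeholder for a model trained on your own corpus.

from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

text = "ประเทศไทยมีจังหวัดทั้งหมด 77 จังหวัด"

# newmm: dictionary-based maximal-matching word tokenization from PyThaiNLP
print(word_tokenize(text, engine="newmm"))

# spm: subword tokenization with a previously trained SentencePiece model (hypothetical path)
sp = spm.SentencePieceProcessor(model_file="path/to/spm.model")
print(sp.encode(text, out_type=str))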


Thai texts for language model pretraining


We curate a list of sources that can be used to pretrain a language model. The statistics for each data source are listed on this page.

Also, you can download the current version of the cleaned datasets from here.



Model pretraining and finetuning instructions:


a) Instructions for RoBERTa BASE model pretraining on a Thai Wikipedia dump:

In this example, we demonstrate how to pretrain a RoBERTa BASE model on a Thai Wikipedia dump from scratch.

  1. Install required libraries: 1_installation.md

  2. Prepare thwiki dataset from Thai Wikipedia dump: 2_thwiki_data-preparation.md

  3. Tokenizer training and vocabulary building:

    a) For SentencePiece BPE (spm), word-level tokens (newmm), and syllable-level tokens (syllable): 3_train_tokenizer.md

    b) For word-level tokens from Limkonchotiwat et al., 2020 (sefr-cut): 3b_sefr-cut_pretokenize.md

  4. Pretrain a masked language model: 4_run_mlm.md
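
For orientation, here is a minimal sketch of step 4 using the Hugging Face Trainer API. The file paths are hypothetical placeholders for the outputs of steps 2–3, and the actual hyperparameters in 4_run_mlm.md may differ.

from datasets import load_dataset
from transformers import (
    AutoTokenizer, RobertaConfig, RobertaForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# tokenizer trained in step 3 and thwiki text prepared in step 2 (hypothetical paths)
tokenizer = AutoTokenizer.from_pretrained("path/to/trained_tokenizer")
raw = load_dataset("text", data_files={"train": "path/to/thwiki_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa architecture initialized from scratch with the trained tokenizer's vocabulary
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./mlm_checkpoints", per_device_train_batch_size=32),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()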


b) Instructions for RoBERTa model finetuning on existing Thai text classification and NER/POS tagging datasets.

In this example, we demonstrate how to finetune WangchanBERTa, a RoBERTa BASE model pretrained on a Thai Wikipedia dump and assorted Thai texts.

  • Finetune the model for the sequence classification task on existing datasets including wisesight_sentiment, wongnai_reviews, generated_reviews_enth (review star prediction), and prachathai67k: 5a_finetune_sequence_classificaition.md (see the sketch after this list)

  • Finetune the model for the token classification task (NER and POS tagging) on existing datasets including thainer and lst20: 5b_finetune_token_classificaition.md
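
As an illustration of the sequence classification setup, here is a minimal sketch of finetuning WangchanBERTa on wisesight_sentiment. It assumes the column names (texts, category) and the four labels published on the Hugging Face Hub; the official 5a script may differ.

from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,
)

model_name = "airesearch/wangchanberta-base-att-spm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

dataset = load_dataset("wisesight_sentiment")

def tokenize(batch):
    return tokenizer(batch["texts"], truncation=True, max_length=416)

# tokenize the texts and expose the gold labels under the name the Trainer expects
tokenized = dataset.map(tokenize, batched=True).rename_column("category", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./wisesight_ckpt", per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()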



BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

thai2transformers's People

Contributors

computerscienceiscool, cstorm125, dependabot[bot], lalital, wannaphong, zincorca


thai2transformers's Issues

Wrong feature extraction in the WangchanBERTa Getting Started Notebook

Under the Feature Extraction section in the WangchanBERTa: Getting Started Notebook, there is this function:

def extract_last_k(input_text, feature_extractor, last_k=4):
    hidden_states = feature_extractor(input_text)[0]
    last_k_layers = [hidden_states[i] for i in [-i for i in range(1,last_k+1)]]
    cat_hidden_states = sum(last_k_layers, [])
    return np.array(cat_hidden_states)

If I'm not mistaken, this function is meant to extract embeddings from the last k layers, but what it actually does is extract embeddings from only the last layer and take only the last k token embedding vectors.

Below is a snippet of what I think it should be:

import torch

def extract_embeddings(model, tokenizer, text, last_k=4):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # concatenate the hidden states of the last k layers: (last_k, #tokens, hidden_size)
    embeddings = torch.cat([outputs.hidden_states[-i] for i in range(1, last_k + 1)])
    return embeddings

Some examples of how to use the modified function (tokenizers below is the model-name-to-tokenizer-class mapping defined in the Getting Started notebook):

from transformers import AutoModel

text1 = "เพื่อน"
model_name = "wangchanberta-base-att-spm-uncased"
model = AutoModel.from_pretrained(f'airesearch/{model_name}', revision='main')
tokenizer = tokenizers[model_name].from_pretrained(f'airesearch/{model_name}', revision='main', model_max_length=416)
embeddings = extract_embeddings(model, tokenizer, text1) # (#layers, #tokens, hidden_size) -> this case (4, 5, 768)

# sum across layers first
token_embeddings_last_4_layers = embeddings.sum(dim=0) # (#tokens, hidden_size) -> this case (5, 768)
sum_sentence_embedding = token_embeddings_last_4_layers.sum(dim=0) # (hidden_size) -> this case (768)

# sum across tokens first
layer_embeddings = embeddings.sum(dim=1) # (#layers, hidden_size) -> this case (4, 768)
concat_sentence_embedding = torch.cat(tuple(layer_embeddings)) # (#layers x hidden_size) -> this case (3072)

Not sure how to make this into a pipeline, but I hope this code helps.
Please confirm if this makes sense.

New WangchanBERTa

List of issues to take care of:

  • CC-100 dataset
  • Space token fix

developing

Experiment on full set of `iapp_wiki_qa_squad`

Sub tasks

Optimize for QA performance

WangchanBERTa still underperforms xlm-roberta-base:

  • Perform error analysis on iapp_wiki_qa_squad validation and test set
  • Error analysis
  • Re-finetune for optimal performance

Refactor thai2transformers as utility package for transformers

transformers is currently the de facto way to train NLP models (maybe speech and image soon?). For Thai, we have some difficulties using the default settings; for example, tokenization for sequence-based metrics such as BLEU is based on space tokenization (a sketch of this issue follows the checklist below). We also want to include some quality-of-life functions, such as easily loading datasets into datasets objects, and the preprocessing functions that are available in the tutorial notebooks.

  • Thai-language specific metrics
  • Preprocessing functions
  • Load datasets
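
For example, a minimal sketch (not the package's actual implementation) of the BLEU problem above, pre-tokenizing with PyThaiNLP's newmm instead of splitting on spaces:

from pythainlp.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

reference = "ฉันชอบกินข้าวผัด"
hypothesis = "ฉันชอบข้าวผัด"

# space tokenization yields a single "token" for a Thai sentence, so BLEU is meaningless
print(reference.split())  # ['ฉันชอบกินข้าวผัด']

# word-level tokenization gives BLEU something meaningful to compare
ref_tokens = word_tokenize(reference, engine="newmm")
hyp_tokens = word_tokenize(hypothesis, engine="newmm")
print(sentence_bleu([ref_tokens], hyp_tokens, weights=(0.5, 0.5)))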

PR #67

Error บน wongnai_reviews

Hello,

I tried wangchanberta on wongnai_reviews with the code below and got a strange error. Could you tell me how to fix it?

from transformers import (
    CamembertTokenizer,
    AutoModelForSequenceClassification,
    pipeline
)
from thai2transformers.preprocess import process_transformers

# Load pre-trained tokenizer
tokenizer = CamembertTokenizer.from_pretrained(
                                  'airesearch/wangchanberta-base-att-spm-uncased',
                                  revision='main')
tokenizer.additional_special_tokens = ['<s>NOTUSED', '</s>NOTUSED', '<_>']

# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
                                  'airesearch/wangchanberta-base-att-spm-uncased',
                                  revision='finetuned@wongnai_reviews')

classify_sequence = pipeline(task='text-classification',
          tokenizer=tokenizer,
          model=model)

from datasets import load_dataset

dataset = load_dataset('wongnai_reviews', cache_dir='huggingface_cache')

text = dataset['test'][3]['review_body']

processed_input_text = process_transformers(text)

result = classify_sequence(processed_input_text)
print(result)

The error I got is in the attached screenshot.

Consolidate Thai extractive question answering datasets

Consolidate all Thai QA datasets into one benchmark. This is done to have a sizeable sample to train Thai QA (see the sketch after the lists below).
Training sets (checklist of availability on huggingface datasets):

  • thaiqa_squad
  • iapp_wiki_qa_squad
  • xquad

Test set:

  • iapp_wiki_qa_squad test set; removing questions with contexts overlapping with training sets

Available on huggingface datasets but cannot be used for Thai extractive QA:

  • tydiqa
  • mkqa
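
A minimal sketch of the consolidation, assuming each training set has first been mapped to a common SQuAD-style schema (concatenate_datasets requires identical features):

from datasets import load_dataset, concatenate_datasets

# load the Thai QA training sets from the Hugging Face Hub
thaiqa = load_dataset("thaiqa_squad", split="train")
iapp = load_dataset("iapp_wiki_qa_squad", split="train")

# in practice, map both to identical column names/types before concatenating
combined_train = concatenate_datasets([thaiqa, iapp])
print(combined_train)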

Fix an issue where input tokens to the WangchanBERTa NER pipeline may be in the incorrect form

According to the model finetuning pipeline, the input tokens are first tokenized with another tokenizer (e.g. PyThaiNLP's newmm for the thainer dataset) and then retokenized with the SentencePiece tokenizer. However, the input tokens fed to the finetuned model are tokenized with SentencePiece only (not newmm first, followed by SentencePiece).

Proposed solution (a rough sketch follows the steps):

  1. Pretokenize with PyThaiNLP's newmm tokenizer
  2. Retokenize with the subword tokenizer (SentencePiece)
  3. Map the prediction results of the subword tokens to the tokens tokenized with newmm
  4. Return the prediction results in word-level and chunk-level
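
A rough sketch of steps 1–3 (a hypothetical helper, not the feature/ner_pipeline code; it assumes a fast tokenizer so that word_ids() is available, and predict_subword_tags stands in for the finetuned model's per-subword predictions):

from pythainlp.tokenize import word_tokenize

def map_subword_predictions(text, tokenizer, predict_subword_tags):
    words = word_tokenize(text, engine="newmm")        # 1. pretokenize with newmm
    enc = tokenizer(words, is_split_into_words=True)   # 2. retokenize into subwords
    tags = predict_subword_tags(enc)                    # hypothetical model call, one tag per subword
    word_ids = enc.word_ids()
    word_tags = []
    for idx, word_id in enumerate(word_ids):
        # 3. keep the tag of the first subword of each newmm word
        if word_id is not None and (idx == 0 or word_ids[idx - 1] != word_id):
            word_tags.append(tags[idx])
    return list(zip(words, word_tags))                  # word-level results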

Branch name: feature/ner_pipeline

NER Pipeline Demo (via Colab): https://colab.research.google.com/drive/1-54NeM_wsjitaiSXfMBpcnqzbPMR0a9R#scrollTo=VzSGZbwWaiOI

Missing model_max_length in roberta config

When loaded with transformers.AutoTokenizer.from_pretrained, model_max_length was set to 1000000000000000019884624838656.

This results in IndexError: index out of range in self when using with flair in the code below.

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

sentence = Sentence("ตัวอย่างข้อความภาษาไทย")  # the document to embed
wangchanberta = TransformerDocumentEmbeddings('airesearch/wangchanberta-base-att-spm-uncased')
wangchanberta.embed(sentence)

After searching, I found this issue huggingface/transformers#14315 (comment) and it stated that model_max_length is missing from the configuration file.

My current workaround is to manually call the following code to override the missing config.

wangchanberta.tokenizer.model_max_length = 510
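
A possible alternative when using the tokenizer directly through transformers (not via flair) is to pass the limit at load time; from_pretrained forwards extra keyword arguments such as model_max_length to the tokenizer constructor:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'airesearch/wangchanberta-base-att-spm-uncased', model_max_length=510)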

Broken URLs

The links to the data source statistics and cleaned datasets in the README are broken or removed. Please update the links to the latest version.

tokenizers package conflicting

Hi, I'm not sure if this is a real problem. My environment is a fresh Anaconda environment with Python 3.7. I tried to install the package with pip install thai2transformers==0.1.0 and got the error in the attached screenshot.

thanks.

Major, breaking refactoring to v1

Refactoring thai2transformers as

Huggingface Utility Functions, Scripts and Notebooks for Thai language

thai2transformers provides utility functions, scripts and notebooks to pretrain, finetune, evaluate and run inference with Huggingface models and datasets for the Thai language.

Progress tracking: https://docs.google.com/spreadsheets/d/1Arusyp3NOiBSv3KAdNPnN1VWOsVnj5ZHG71fKxnEK7o/edit?usp=sharing
Planned benchmark downstream tasks: https://docs.google.com/spreadsheets/d/1HE-6A4VuxVMvc78WTVGF0aOkrogd-BnJrBUdZBB13U8/edit?usp=sharing
Working branch: https://github.com/vistec-AI/thai2transformers/tree/refactor-as-package
