
ALBERT-Mongolian

HuggingFace demo

This repo provides a pretrained ALBERT model ("A Lite" version of BERT) and a SentencePiece model (an unsupervised text tokenizer and detokenizer), both trained on a Mongolian text corpus.

Contents:

  • Usage
  • Tutorials
  • Results
  • Reproduce
  • Reference
  • Citation

Usage

You can use ALBERT-Mongolian in both PyTorch and TensorFlow 2.0 through the transformers library.

link to HuggingFace model card 🤗

import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForMaskedLM.from_pretrained('bayartsogt/albert-mongolian')
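
As a quick sanity check that the model and tokenizer load correctly, you can run a masked-word prediction with the objects created above. A minimal sketch (the example sentence is only illustrative, and the .logits attribute assumes a recent transformers version):

text = 'Улаанбаатар бол Монгол улсын [MASK] юм.'
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# take the top prediction at the [MASK] position
mask_pos = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))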

Tutorials

AWS-Mongolians e-meetup #3

Results

| Model       | Problem             | Dataset       | weighted F1 |
|-------------|---------------------|---------------|-------------|
| ALBERT-base | Text Classification | Eduge dataset | 0.90        |
| ...         | ...                 | ...           | ...         |
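
The classification score above comes from fine-tuning on the Eduge news dataset (9 categories, the same ones listed in the reports below). A minimal sketch of such a fine-tuning setup with the transformers Trainer, using a tiny in-memory stand-in for the real data (the toy texts, label indices, and hyperparameters are placeholders, not the configuration behind the reported 0.90):

import torch
from transformers import (AlbertForSequenceClassification, AlbertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForSequenceClassification.from_pretrained(
    'bayartsogt/albert-mongolian', num_labels=9)  # 9 Eduge categories

class EdugeDataset(torch.utils.data.Dataset):
    """Tiny in-memory stand-in for the real Eduge train/test splits."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item['labels'] = torch.tensor(self.labels[i])
        return item

train_ds = EdugeDataset(['спортын мэдээ ...', 'эдийн засгийн тойм ...'], [0, 1])

args = TrainingArguments(output_dir='albert-eduge', num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()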

Comparison between ALBERT and BERT

Note that while ALBERT-base is comparable in terms of the results shown below, it is more than 10 times smaller (only 135 MB) than BERT-base (1.2 GB).

  • ALBERT-Mongolian:
                          precision    recall  f1-score   support

            байгал орчин       0.85      0.83      0.84       999
               боловсрол       0.80      0.80      0.80       873
                   спорт       0.98      0.98      0.98      2736
               технологи       0.88      0.93      0.91      1102
                 улс төр       0.92      0.85      0.89      2647
              урлаг соёл       0.93      0.94      0.94      1457
                   хууль       0.89      0.87      0.88      1651
             эдийн засаг       0.83      0.88      0.86      2509
              эрүүл мэнд       0.89      0.92      0.90      1159

                accuracy                           0.90     15133
               macro avg       0.89      0.89      0.89     15133
            weighted avg       0.90      0.90      0.90     15133

  • BERT-base:
                          precision    recall  f1-score   support

            байгал орчин       0.82      0.84      0.83       999
               боловсрол       0.91      0.70      0.79       873
                   спорт       0.97      0.98      0.97      2736
               технологи       0.91      0.85      0.88      1102
                 улс төр       0.87      0.86      0.86      2647
              урлаг соёл       0.88      0.96      0.92      1457
                   хууль       0.86      0.85      0.86      1651
             эдийн засаг       0.84      0.87      0.85      2509
              эрүүл мэнд       0.90      0.90      0.90      1159

                accuracy                           0.88     15133
               macro avg       0.88      0.87      0.87     15133
            weighted avg       0.88      0.88      0.88     15133
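
These reports are in the format produced by scikit-learn's classification_report; a minimal sketch of generating one, assuming you already have gold labels and model predictions for the test split (the short lists below are placeholders):

from sklearn.metrics import classification_report

y_true = ['спорт', 'улс төр', 'спорт', 'хууль']       # gold Eduge labels (placeholder)
y_pred = ['спорт', 'эдийн засаг', 'спорт', 'хууль']   # model predictions (placeholder)
print(classification_report(y_true, y_pred, digits=2))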

Reproduce

Pretrain from Scratch: you can follow PRETRAIN_SCRATCH.md to reproduce the results.

Here is the pretraining loss curve: [Pretraining Loss figure]
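
The full recipe (corpus preparation, SentencePiece training, TPU pretraining) is in PRETRAIN_SCRATCH.md. Purely to illustrate the tokenizer-training step, a SentencePiece model can be trained on a one-sentence-per-line Mongolian text file roughly like this (the file name, vocabulary size, and model type are assumptions, not the exact settings from the recipe):

import sentencepiece as spm

# Train an unsupervised subword tokenizer on the raw Mongolian corpus.
spm.SentencePieceTrainer.train(
    input='mn_corpus.txt',              # placeholder corpus path
    model_prefix='albert_mongolian_sp',
    vocab_size=30000,                   # assumed vocabulary size
    model_type='unigram',
)

sp = spm.SentencePieceProcessor(model_file='albert_mongolian_sp.model')
print(sp.encode('Монгол хэл дээрх жишээ өгүүлбэр.', out_type=str))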

Reference

  1. ALBERT - official repo
  2. WikiExtractor
  3. Mongolian BERT
  4. ALBERT - Japanese
  5. Mongolian Text Classification
  6. Yang You et al.'s paper
  7. AWS-Mongolia e-meetup #3

Citation

@misc{albert-mongolian,
  author = {Bayartsogt Yadamsuren},
  title = {ALBERT Pretrained Model on Mongolian Datasets},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}

Issues

Хагас И (half i, i.e. the letter Й)

It looks like Й is being replaced with И. Is there a problem with the text normalization? Did the same substitution happen during training?

[CLS] машины ихэнх ачааг жолооч нь дааж баигаа.[SEP]
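
One way to narrow this down is to round-trip a sentence through the tokenizer; a minimal sketch (the sentence is the one from the issue, spelled with Й):

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
text = 'Машины ихэнх ачааг жолооч нь дааж байгаа.'
print(tokenizer.decode(tokenizer.encode(text)))
# If 'байгаа' comes back as 'баигаа', the Й -> И substitution happens in the
# tokenizer's normalization rather than in the model itself.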

Albert for long text

Hello,

  1. Can I fine-tune on top of this ALBERT?
  2. How should I build a text classifier for long text, i.e. text longer than 512 tokens? (A sliding-window sketch follows this list.)
  3. Are there any plans to train Transformer_XL on Mongolian?
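
For question 2, one common workaround is to split the document into overlapping windows of at most 512 tokens, classify each window, and aggregate the logits. A minimal sketch, assuming a 9-class classifier (the window/stride sizes and the mean-pooling of logits are illustrative choices, not a recommendation from this repo):

import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
# The classification head here is freshly initialized; in practice you would
# load your own fine-tuned checkpoint instead.
model = AlbertForSequenceClassification.from_pretrained(
    'bayartsogt/albert-mongolian', num_labels=9)
model.eval()

def classify_long_text(text, window=510, stride=255):
    """Split the token ids into overlapping windows and average the logits."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + window] for i in range(0, max(len(ids), 1), stride)]
    logits = []
    for chunk in chunks:
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        with torch.no_grad():
            logits.append(model(input_ids=input_ids).logits)
    return torch.cat(logits).mean(dim=0).argmax().item()

print(classify_long_text('Урт мэдээний жишээ өгүүлбэр. ' * 200))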
