Bangla BERT Base

After a long journey, here is our Bangla-Bert! It is now available on the Hugging Face model hub.

Bangla-Bert-Base is a pretrained language model for the Bengali language, trained with the masked language modeling objective described in BERT and its GitHub repository.

NB: If you use this model for any NLP task, please share your evaluation results with us and we will add them here.

Download Model

Model            | TF Version | PyTorch Version | Vocab
Bangla BERT Base | -----      | Huggingface Hub | Vocab

Pretrain Corpus Details

Corpus was downloaded from two main sources:

After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with an extra blank line between documents.

sentence 1
sentence 2

sentence 1
sentence 2
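
The layout above can be produced with a few lines of Python. This is only a minimal sketch, not the preprocessing script used for this model: the function name `to_bert_pretraining_format` is hypothetical, and splitting sentences on the Bengali danda (।), `?` and `!` is an assumption that a real pipeline would likely refine.

```python
import re

# Assumption: a sentence ends with the Bengali danda, "?" or "!" followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[\u0964?!])\s+")


def to_bert_pretraining_format(documents):
    """Return text with one sentence per line and a blank line between documents."""
    blocks = []
    for doc in documents:
        sentences = [s.strip() for s in SENTENCE_END.split(doc.strip()) if s.strip()]
        if sentences:
            blocks.append("\n".join(sentences))
    return "\n\n".join(blocks) + "\n"


docs = [
    "আমি বাংলায় গান গাই। আমি বাংলার গান গাই।",
    "এটি দ্বিতীয় নথি। এটি ছোট।",
]
print(to_bert_pretraining_format(docs))
```

Each input string is treated as one document, so the printed text has two sentences, a blank line, and two more sentences.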

Building Vocab

We used the BNLP package to train a Bengali SentencePiece model with a vocabulary size of 102025, then converted the output vocab file to the BERT format. The final vocab file is available at https://github.com/sagorbrur/bangla-bert and on the Hugging Face model hub.
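
The exact conversion script is not shown here. One common way to turn a SentencePiece `.vocab` file (one `piece<TAB>score` pair per line) into a BERT-style vocab is sketched below. The function name, the special-token order and the skipping rules are assumptions for illustration, and SentencePiece and WordPiece segment text differently, so this mapping is only approximate.

```python
# Assumption: the usual BERT special tokens come first in the vocab file.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
WORD_START = "\u2581"  # SentencePiece marks word-initial pieces with this character
SKIPPED_PIECES = {"<unk>", "<s>", "</s>", "<pad>", "", WORD_START}


def sentencepiece_to_bert_vocab(vocab_lines):
    """Convert lines of a SentencePiece .vocab file into a list of BERT tokens."""
    tokens = list(SPECIAL_TOKENS)
    seen = set(tokens)
    for line in vocab_lines:
        piece = line.rstrip("\n").split("\t")[0]
        if piece in SKIPPED_PIECES:
            continue  # control symbols and the bare marker have no BERT equivalent
        if piece.startswith(WORD_START):
            token = piece[len(WORD_START):]  # word-initial piece: drop the marker
        else:
            token = "##" + piece             # continuation piece: add the WordPiece prefix
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "\u2581গান\t-3.1", "য\t-4.2"]
print(sentencepiece_to_bert_vocab(sample))
```

Writing the returned tokens one per line gives a file in the same shape as a BERT `vocab.txt`.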

Training Details

  • Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
  • The currently released model follows the bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
  • Total Training Steps: 1 Million
  • The model was trained on a single Google Cloud TPU
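
The 110M figure is the one quoted for the English bert-base-uncased model, whose vocabulary has 30,522 tokens. As a rough illustration of how the vocabulary size feeds into the total, the sketch below counts encoder weights for the standard BERT-base layout. The helper name is hypothetical, and the 3072-unit feed-forward size, 512 positions and 2 segment types are assumed defaults rather than values confirmed for this checkpoint.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                         max_positions=512, type_vocab=2):
    """Count the weights of a BERT encoder with pooler (no pretraining heads)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(bert_parameter_count(30522))   # English bert-base-uncased vocabulary
print(bert_parameter_count(102025))  # vocabulary size from "Building Vocab"
```

Under these assumptions the larger Bengali vocabulary only enlarges the embedding table; the twelve transformer layers are unchanged.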

Evaluation Results

LM Evaluation Results

Here are the evaluation results after 1 million training steps.

global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.24) = 9.393331287442784  (computed from the loss rounded to 2.24)
Loss for final step: 2.426227
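
Perplexity is the exponential of a natural-log cross-entropy loss, so it can be recomputed from the numbers above. The snippet below is a small sketch using only the standard library; the dictionary and function names are illustrative. In Google's BERT pretraining code the total loss is the masked LM loss plus the next-sentence loss, so exponentiating `masked_lm_loss` gives a masked-LM-only perplexity.

```python
import math

# Values copied from the evaluation output above
eval_metrics = {
    "loss": 2.2406516,
    "masked_lm_loss": 2.201459,
    "next_sentence_loss": 0.040997364,
}


def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is measured in nats."""
    return math.exp(cross_entropy_loss)


print("from total loss:    ", round(perplexity(eval_metrics["loss"]), 4))
print("from masked LM loss:", round(perplexity(eval_metrics["masked_lm_loss"]), 4))
```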

Downstream Task Evaluation Results

Huge thanks to Nick Doiron for providing evaluation results for the classification tasks. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali Electra and multilingual BERT (mBERT), Bangla BERT Base achieves the best score on every task in the table below. Here is the evaluation script.

Model            | Sentiment Analysis | Hate Speech Task | News Topic Task | Average
mBERT            | 68.15              | 52.32            | 72.27           | 64.25
Bengali Electra  | 69.19              | 44.84            | 82.33           | 65.45
Bangla BERT Base | 70.37              | 71.83            | 89.19           | 77.13
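
The Average column is the unweighted mean of the three task scores. A short check, with the scores copied from the table and an illustrative helper name, might look like this:

```python
# Task scores in the order: sentiment analysis, hate speech, news topic
scores = {
    "mBERT": [68.15, 52.32, 72.27],
    "Bengali Electra": [69.19, 44.84, 82.33],
    "Bangla BERT Base": [70.37, 71.83, 89.19],
}


def average(values):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


for model, task_scores in scores.items():
    print(model, average(task_scores))
```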

Bangla BERT Visualization

bertviz

How to Use

Bangla BERT Tokenizer

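from transformers import AutoTokenizer

# Load the tokenizer published on the Hugging Face model hub
bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
# Split the sentence into WordPiece tokens ("##" marks a continuation piece)
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']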

MASK Generation

You can use this model directly with a pipeline for masked language modeling:

from transformers import BertForMaskedLM, BertTokenizer, pipeline

# Load the pretrained masked language model and its tokenizer
model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Replace one word with the mask token and print the candidate fills
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# Example output (the exact format may vary with the transformers version):
# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}

Author

Sagor Sarker

Acknowledgements

  • Thanks to Google TensorFlow Research Cloud (TFRC) for providing the free TPU credits.
  • Thanks to everyone around us who keeps helping us build something for Bengali.
