Masked Language Modeling for Bangla Text Using BERT Model

  • Input: Bangla text containing a mask token, e.g.: আমার সোনার বাংলা < blank > তোমায় ভালবাসি
  • Output: the input text with the mask replaced by the predicted token, e.g.: আমার সোনার বাংলা আমি তোমায় ভালবাসি

Introduction

This repository contains the code and resources for a BERT (Bidirectional Encoder Representations from Transformers) model trained on a Bangla language dataset for Masked Language Modeling (any Bangla dataset in .txt format can be used). In Masked Language Modeling (MLM), certain tokens of the input sentence are masked before the sentence is passed to BERT, and the model's weights are trained so that it outputs the original sentence. The model thus learns to reconstruct the original sentence from the masked input.

Figure: Before the tokens are passed into BERT, the token lincoln is masked by replacing it with [MASK] (this project uses < blank > instead).

So the task amounts to giving BERT an incomplete sentence and asking it to fill in the missing token.
Using a pretrained BERT model is advisable for better output.
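
The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's data_preprocess.py: the function name mask_tokens, the [MASK] string, and the caller-supplied vocabulary are assumptions, and the 15% selection rate with the 80/10/10 replacement split follows the BERT paper rather than anything confirmed for this project.

```python
import random

# Assumption: a generic mask string; this project writes its mask differently.
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one tokenized sentence.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

During training, the loss would typically be computed only at positions where labels is not None, so the model is scored on the tokens it had to reconstruct.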

Project Structure

├── images
├── dataset
├── bert_module.py
├── config.py
├── data_preprocess.py
├── inference.py
├── logs
│   ├── bert_mlm_bangla.keras
│   ├── fit
│   └── vectorizer_layer.pkl
├── README.md
├── environment.yml
├── train.py
└── utils.py

Installation

To use the Bangla BERT-MLM model in your project, follow these steps:

1. Clone the repository:
git clone https://github.com/kamrul-brur/BERT-Bangla-MLM.git
2. Install the required dependencies:
  • Create a virtual environment and install the dependencies with the command below
conda env create -f environment.yml
  • Then activate the environment
conda activate bert_mlm

Edit the config to change parameters as needed

In the config.py file, update the dataset path, log directory, model name, vectorizer layer name, number of epochs, and any other parameters needed for training. Default values are provided:

    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 0.001
    VOCAB_SIZE = 10000
    EMBED_DIM = 128
    NUM_HEAD = 8
    FF_DIM = 128
    NUM_LAYERS = 1
    EPOCHS = 100
    DATASET_PATH = "./dataset/dataset.txt"
    LOG_DIRECTORY = "logs"
    SAVED_MODEL_NAME = "bert_mlm_bangla.keras"
    SAVED_VECTORIZED_LAYER_NAME = "vectorizer_layer.pkl"
    TENSORBOARD_LOG_DIR = "logs/fit"
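
Before starting a long run, it can help to sanity-check edited values. The sketch below mirrors the numeric defaults above in a hypothetical TrainConfig dataclass and a hypothetical check_config helper; neither exists in the repository. The divisibility rule is an assumption that only matters if the per-head size is derived as EMBED_DIM // NUM_HEAD, which is a common choice but is not confirmed for this project.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical mirror of the numeric defaults in config.py."""
    max_len: int = 256
    batch_size: int = 32
    lr: float = 0.001
    vocab_size: int = 10000
    embed_dim: int = 128
    num_head: int = 8
    ff_dim: int = 128
    num_layers: int = 1
    epochs: int = 100


def check_config(cfg):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in fields(cfg):
        if getattr(cfg, field.name) <= 0:
            problems.append(f"{field.name} must be positive")
    # Assumption: the per-head dimension is embed_dim // num_head.
    if cfg.num_head > 0 and cfg.embed_dim % cfg.num_head != 0:
        problems.append("embed_dim should be divisible by num_head")
    return problems
```

With the defaults, each of the 8 heads would get 128 / 8 = 16 dimensions under that assumption.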

Model Training and Performance Monitoring

  • Run the train.py file using the command
python train.py

When training completes, the log files are saved in the logs directory (LOG_DIRECTORY in config.py).
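
A quick way to confirm that a run produced its outputs is to check for the files listed under logs in the project structure. The helper below is a hedged sketch: missing_artifacts is a hypothetical name, and the constants are copied from the config defaults rather than imported from config.py.

```python
import os

# Copied from the config.py defaults shown earlier; adjust if you changed them.
LOG_DIRECTORY = "logs"
SAVED_MODEL_NAME = "bert_mlm_bangla.keras"
SAVED_VECTORIZED_LAYER_NAME = "vectorizer_layer.pkl"
TENSORBOARD_LOG_DIR = "logs/fit"


def missing_artifacts(base_dir="."):
    """Return the expected output paths that do not exist under base_dir."""
    expected = [
        os.path.join(base_dir, LOG_DIRECTORY, SAVED_MODEL_NAME),
        os.path.join(base_dir, LOG_DIRECTORY, SAVED_VECTORIZED_LAYER_NAME),
        os.path.join(base_dir, TENSORBOARD_LOG_DIR),
    ]
    return [path for path in expected if not os.path.exists(path)]
```

Calling missing_artifacts() from the repository root returns an empty list when the saved model, the vectorizer layer, and the TensorBoard log directory are all present.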

View in TensorBoard

  • Training metrics can be visualized in TensorBoard with the following command.
tensorboard --logdir logs

Then open the URL printed in the terminal to monitor performance in TensorBoard.

Model Inference and Evaluation

  • Write your masked input sentences in a .txt file, e.g. inputs.txt
  • In the inference.py file, change the path of the input file accordingly.
input_text_path = "dataset/inputs.txt"
  • Change the paths of the model and vectorizer_layer files if needed.
vectorized_layer_path = os.path.join(config.LOG_DIRECTORY, config.SAVED_VECTORIZED_LAYER_NAME)
model_path = os.path.join(config.LOG_DIRECTORY, config.SAVED_MODEL_NAME)
  • Run the inference.py file using the command
python inference.py
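
The input and output handling around the inference steps above can be sketched as follows. This is an illustration under stated assumptions, not the repository's inference script: read_masked_sentences and fill_mask are hypothetical helpers, the mask string is copied as it appears in this README (the exact string the code expects is not confirmed), and the predicted token is passed in by the caller rather than produced by the model.

```python
# Mask string as written in this README; treat the exact spelling as an assumption.
MASK_TOKEN = "< blank >"


def read_masked_sentences(path):
    """Read one masked sentence per line, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fill_mask(sentence, predicted_token, mask_token=MASK_TOKEN):
    """Replace the first mask in the sentence with the predicted token."""
    if mask_token not in sentence:
        raise ValueError(f"no mask token {mask_token!r} found in sentence")
    return sentence.replace(mask_token, predicted_token, 1)
```

For the example at the top of this README, fill_mask applied to the input sentence with the prediction আমি yields the output sentence shown there.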

Resources

Citation

@article{devlin2018bert,
  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

Contributors

kamrulhasanrony