
Bangla Tokenizer

Overview

This repository contains a tokenizer trained on a diverse dataset of 20GB of text. The data is primarily Bangla, with some English text intentionally retained to facilitate future expansion. The tokenizer aims to provide robust tokenization for Bangla text (English support will be added soon), making it a useful tool for various natural language processing (NLP) tasks, particularly for those working with limited data resources.

Additionally, alongside the tokenizer, a tokenizer visualizer has been developed. This visualizer generates tokenization results in HTML format, allowing users to view the tokenization output directly in their web browsers. It provides an intuitive way to inspect the tokenization process and can even accept custom models for comparison.

Motivation

The main motivation behind creating this tokenizer is to contribute to the open-source community and assist those who may not have access to large datasets. Tokenization is a crucial step in NLP, and incorrect tokenization can significantly impact the performance of downstream models. For instance, with limited training data, Bangla words might be incorrectly split, such as "আমার" being tokenized as "আম + ার", which is not desirable. This tokenizer helps mitigate such issues by leveraging a large dataset to generalize well and produce accurate tokenizations.

Furthermore, it's essential to note that tokenizers trained on small datasets may struggle to handle unknown Bangla words properly. However, this tokenizer, trained on a large amount of data, excels in handling such scenarios, enhancing its utility and reliability in real-world applications.

Requirements

sentencepiece
matplotlib
tokenizer_viz (only needed for the notebook visualization)
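The dependencies can be installed from PyPI (assuming the standard package names; `tokenizer-viz` is only required for the notebook visualization shown below):

```shell
# Install the tokenizer's dependencies
# (package names are the standard PyPI ones; adjust if you pin versions)
pip install sentencepiece matplotlib tokenizer-viz
```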

Usage

import sentencepiece as spm

# Load the tokenizer model
tokenizer = spm.SentencePieceProcessor()
tokenizer.load('bn_tokenizer/tokenizer.model.model')

# Sample text
input_text = 'বাংলা টোকেনাইজার যা 20 গিগাবাইট টেক্সট ডেটার উপর প্রশিক্ষিত এবং ১০০ হাজার টোকেন রয়েছে'

# Encode the text
encoded_tokens = tokenizer.encode(input_text)

# Decode and print tokens separately
decoded_tokens = [tokenizer.decode([token]) for token in encoded_tokens]
print(decoded_tokens)
# output
['বাংলা', 'টোকেন', 'াইজার', 'যা', '20', 'গিগাবাইট', 'টেক্সট', 'ডেটা', 'র', 'উপর', 'প্রশিক্ষিত', 'এবং', '১০০', 'হাজার', 'টোকেন', 'রয়েছে']

Visualize Tokenizer

Run

python visualize_token.py

This will produce a token_visualize.html file; open it in a browser to see the tokenization output (as shown in the tokenizer_image screenshot).

Visualize Tokenizer on Notebook

To visualize the tokenizer in a notebook, open a notebook and run:

from tokenizer_viz import TokenVisualization
from IPython.display import HTML
import sentencepiece as spm

# Load the tokenizer model
tokenizer = spm.SentencePieceProcessor()
tokenizer.load('bn_tokenizer/tokenizer.model.model')

# Initialize the TokenVisualization class with the tokenizer's
# encode and decode functions
token_viz = TokenVisualization(
    encoder=tokenizer.encode,
    decoder=tokenizer.decode
)

# Define a sample text to visualize tokenization boundaries
sample_text = "বাংলা টোকেনাইজার যা 20 গিগাবাইট টেক্সট ডেটার উপর প্রশিক্ষিত এবং ১০০ হাজার টোকেন রয়েছে"

# Visualize the tokenization boundaries
html = token_viz.visualize(sample_text)
HTML(html)
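Instead of displaying the result inline, the returned HTML string can also be written to disk, mirroring what visualize_token.py produces. A minimal sketch (the `html` variable here is a placeholder standing in for the string returned by `token_viz.visualize(sample_text)`):

```python
# Minimal sketch: persist a visualization HTML string to disk.
# `html` is a placeholder for the string returned by token_viz.visualize().
html = "<html><body><span>বাংলা</span> <span>টোকেন</span></body></html>"

# Write with UTF-8 encoding so Bangla text renders correctly in the browser
with open("token_visualize.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Opening the written file in a browser then gives the same view as the command-line script.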

Future Work

Expand English Dataset: Integrate an additional 50GB of English text data to add English tokenization capabilities. The current model includes some English, but not enough for standard English tokenization.

Contact

For any queries or suggestions, please open an issue on this repository, contact me at [email protected], or reach out on LinkedIn.

Contributors

hassanaliemon
