Tokenizer

Video demonstration of code: https://drive.google.com/file/d/1KjsPPoEl-lHvEuN2WyoBWHTx6HgqP47Y/view?usp=sharing

About Tokenization

Natural Language Processing (NLP) enables machine learning algorithms to organize and understand human language: machines not only gather text and speech but also identify the core meaning they should respond to. Tokenization is one of the many pieces of that puzzle. Although tokenization is well known for its use in cybersecurity and the creation of NFTs, it is also an important part of the NLP process: it splits paragraphs and sentences into smaller units (tokens) that can be more easily assigned meaning. The first step of the NLP process is therefore to take the raw data (a sentence) and break it into understandable parts (words). Here’s an example of a string of data:

“What restaurants are nearby?”

For this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts. With tokenization, we’d get something like this:

‘what’ ‘restaurants’ ‘are’ ‘nearby’

This may seem simple, but breaking a sentence into its parts allows a machine to understand the parts as well as the whole. This will help the program understand each of the words by themselves, as well as how they function in the larger text.
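Conceptually, this splitting step can be sketched in a few lines of plain Python. This is a naive whitespace tokenizer for illustration only; real NLP tokenizers such as iNLTK's handle punctuation, subwords, and scripts far more carefully:

```python
import string

def simple_tokenize(text):
    # Lowercase, strip punctuation, then split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

print(simple_tokenize("What restaurants are nearby?"))
# → ['what', 'restaurants', 'are', 'nearby']
```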

Data/Packages used

We have used the following data/package:
Natural Language Toolkit for Indic Languages (iNLTK). This package helps by providing out-of-the-box support for various NLP tasks that an application developer might need. It supports a wide variety of languages:

Language (code): Hindi (hi), Punjabi (pa), Gujarati (gu), Kannada (kn), Malayalam (ml), Oriya (or), Marathi (mr), Bengali (bn), Tamil (ta), Urdu (ur), Nepali (ne), Sanskrit (sa), English (en), Telugu (te)

https://github.com/goru001/inltk

We have used the “HindiEnglish Corpora” dataset provided by Aiswaryaramachandran on Kaggle. It comprises the Hindi English Truncated Corpus, a large list of sentences translated from English to Hindi, giving us enough data to work with. https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora

Code

https://colab.research.google.com/drive/1deNNkra2rS2imrAvGj90mHYp05EA6lYp?usp=sharing

Code Explanation

To upload the file from the local drive, we write the following code in a cell and run it:

from google.colab import files
uploaded = files.upload()

We click the “Choose Files” option, then select the CSV dataset file ('Hindi_English_Truncated_Corpus.csv', downloaded earlier from Kaggle) from our local drive. We then write the following code snippet to import it into a pandas data frame:

import pandas as pd
import io

df = pd.read_csv(io.BytesIO(uploaded['Hindi_English_Truncated_Corpus.csv']))

The head() function returns the first n rows of the object, based on position (the first five by default). It is useful for quickly checking whether the object contains the right kind of data.

df.head()
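For readers running outside Colab, the same flow can be sketched with an in-memory stand-in for the uploaded bytes, since read_csv accepts any file-like object such as io.BytesIO. The 'english_sentence' column name here is an assumption for illustration; 'hindi_sentence' matches the column accessed later in the notebook:

```python
import io
import pandas as pd

# A tiny stand-in for the uploaded CSV bytes; 'english_sentence' is an
# assumed column name, 'hindi_sentence' matches the column used below
data = "english_sentence,hindi_sentence\nHello,नमस्ते\nGoodbye,अलविदा\n".encode("utf-8")
df = pd.read_csv(io.BytesIO(data))

print(df.head(1))            # first row only
print(df.hindi_sentence[0])  # → नमस्ते
```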

Next, we install PyTorch, a Python package that provides two high-level features:

  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system

pip install torch==1.12.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

iNLTK runs on CPU, which is the desired behaviour for most deep learning models in production. The command above installs the CPU-only build of PyTorch, which, as the name suggests, does not have CUDA support.
Once all its requirements (Python libraries and packages) are satisfied, iNLTK is installed with the following command:

pip install inltk

The torch-1.12.1-cp37-cp37m-manylinux1_x86_64.whl wheel gets downloaded. Once the download has completed successfully, we set up the language we want to use the tokenizer for:

from inltk.inltk import setup
setup('hi')

We used ‘hi’ since we will be using the tokenizer for Hindi.
Note: the runtime error can be ignored; it is probably caused by the difference between the torch version the package was built against and the newer one we installed. At the end of the output we can see the code runs without error and prints “Done!”.
We import the tokenizer using the following command from the iNLTK package:

from inltk.inltk import tokenize

Since we have already provided the dataset to the program, we simply call the tokenize function on a sentence selected by its index, as shown in the df.head() output:

# Tokenize the first five Hindi sentences in the data frame
for i in range(5):
    print(tokenize(df.hindi_sentence[i], "hi"))

We will receive the output in the form of tokens of the sentence provided.
An alternative way to provide a sentence to our program is to name a string variable and assign the sentence or paragraph to it, like this:

hindi_input = """प्राचीन काल में विक्रमादित्य नाम के एक आदर्श राजा हुआ करते थे।
अपने साहस, पराक्रम और शौर्य के लिए राजा विक्रम मशहूर थे।
ऐसा भी कहा जाता है कि राजा विक्रम अपनी प्राजा के जीवन के दुख दर्द जानने के लिए रात्री के पहर में भेष बदल कर नगर में घूमते थे।"""

The tokenize call now takes the format:
tokenize(<input text>, <language code>)

tokenize(hindi_input, "hi")

This command’s output will likewise give us the tokens of the paragraph we provided in “hindi_input”.
Going further, we also import the package’s feature for removing foreign languages:

from inltk.inltk import remove_foreign_languages

The call to this import has the format:
remove_foreign_languages(text, "<language code>")
If the program detects a word in the sentence that does not belong to the language whose code we provided, that word shows up in the output replaced by a placeholder token ("<unk>" in iNLTK).

remove_foreign_languages("इस्लाम धर्म (الإسلام) ईसाई धर्म के बाद अनुयाइयों के आधार पर दुनिया का दूसरा सब से बड़ा धर्म है।", "hi")

Here, الإسلام is not a Hindi word, hence it will be flagged in the output.
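Assuming the function returns the token list with foreign words replaced by an “<unk>” placeholder (the tokens below are a hypothetical illustration, not actual program output), the flagged entries can then be filtered out with ordinary Python:

```python
# Hypothetical tokens after remove_foreign_languages: the Arabic word
# has been replaced by the '<unk>' placeholder (assumed output format)
tokens = ["इस्लाम", "धर्म", "(", "<unk>", ")", "ईसाई", "धर्म"]

# Keep only the tokens recognised as belonging to the native language
native_tokens = [t for t in tokens if t != "<unk>"]
print(native_tokens)
# → ['इस्लाम', 'धर्म', '(', ')', 'ईसाई', 'धर्म']
```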
