gujarati-nlp-toolkit's Introduction

Gujarati-NLP-Toolkit

Added Features:

1) POS Tagger:

a) Implementation:

import posTagger as pt
tagger = pt.posTagger(corpus='prose')	# corpus='poetry' if you are tagging a sentence of a poem
tagger.eval()	# Set the tagger in evaluation/inference mode
sentence = 'તારુ નામ શુ છે?'  # What is your name?
print(tagger.pos_tag(sentence))   # [('તારુ', 'PR_PRP'), ('નામ', 'N_NN'), ('શુ', 'N_NN'), ('છે', 'V_VAUX'), ('?', 'RD_PUNC')]

b) Training your own posTagger:

import posTagger as pt
tagger = pt.posTagger('model_name')   # Give any name for your model
train_data = tagger.structure_data('path/to/corpus/file')

Your corpus must be a tab-delimited .txt file or a .tsv file (CSV is best avoided because the comma [ , ] is a punctuation mark in Gujarati text and would clash with the delimiter). It must contain a column named 'Value' whose rows hold tagged sentences in the form:

Row1:    'word1\tag1 word2\tag2 word3\tag3 ....... wordn\tagn'
Row2:    'word1\tag1 word2\tag2 word3\tag3 ....... wordn\tagn'
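
As a minimal sketch of the row format above, the snippet below reads a tab-delimited corpus with Python's standard csv module and splits each `word\tag` token into a pair. The helper name `read_tagged_rows` and the extra `Id` column are assumptions made for illustration; this is not the toolkit's own `structure_data` implementation.

```python
import csv
import io


def read_tagged_rows(handle):
    """Yield one sentence per row as a list of (word, tag) pairs.

    Hypothetical helper: it only illustrates the 'Value' column format.
    """
    # QUOTE_NONE keeps quote characters in the text as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        pairs = []
        for token in row["Value"].split():
            # Split on the last backslash: everything before it is the word,
            # everything after it is the tag.
            word, _, tag = token.rpartition("\\")
            pairs.append((word, tag))
        yield pairs


# Tab-delimited sample; the "Id" column is purely illustrative.
sample = (
    "Id\tValue\n"
    "1\tતારુ\\PR_PRP નામ\\N_NN શુ\\N_NN છે\\V_VAUX ?\\RD_PUNC\n"
)
sentences = list(read_tagged_rows(io.StringIO(sample)))
print(sentences[0])
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding='utf-8', newline='')`.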

Alternatively, you may create your own train_X and train_y of the form:

train_X:  [[feature_dict of word1, feature_dict of word2, ........ , feature_dict of wordn],   # Sentence 1
           ........,
           [feature_dict of word1, feature_dict of word2, ........ , feature_dict of wordn]]   # Sentence n
train_y:  [[tags of sentence 1], [tags of sentence 2], ........, [tags of sentence n]]
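
One way to build train_X and train_y by hand is sketched below. The feature names (`word`, `prefix_1`, `suffix_1`, `prev_word`, `next_word`) and both helper functions are assumptions chosen for illustration; the features the toolkit itself extracts may differ.

```python
def word_features(words, i):
    """Build a feature dict for words[i]; the feature names are illustrative."""
    word = words[i]
    return {
        "word": word,
        # Slicing works on code points, so a Gujarati vowel sign counts as
        # its own character here.
        "prefix_1": word[:1],
        "suffix_1": word[-1:],
        "prev_word": words[i - 1] if i > 0 else "<START>",
        "next_word": words[i + 1] if i < len(words) - 1 else "<END>",
    }


def build_training_data(tagged_sentences):
    """Turn [(word, tag), ...] sentences into parallel train_X / train_y lists."""
    train_X, train_y = [], []
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        train_X.append([word_features(words, i) for i in range(len(words))])
        train_y.append([tag for _, tag in sentence])
    return train_X, train_y


tagged = [[("તારુ", "PR_PRP"), ("નામ", "N_NN"), ("શુ", "N_NN"),
           ("છે", "V_VAUX"), ("?", "RD_PUNC")]]
train_X, train_y = build_training_data(tagged)
print(train_X[0][0])
print(train_y[0])
```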

 

You can also train on data in languages other than Gujarati to create a POS tagger for your own language.

 

Once the training data is ready, train the model with:

tagger.train(train_data, tagger._model_file)

c) Loading your trained posTag Model:

import posTagger as pt
tagger = pt.posTagger('model_name')
tagger.eval()
# Use the tagger as usual, e.g. tagger.pos_tag(sentence)

   

2) Tokenizers:

a) Word Tokenizer:

from tokenizer import WordTokenizer
sentence = 'તારુ નામ શુ છે?'  # What is your name?
tokens = WordTokenizer(sentence, keep_punctuations=True, keep_stopwords=True)   # Set False to remove Punctuations and Stopwords respectively
print(tokens)  # ['તારુ', 'નામ', 'શુ', 'છે', '?']

b) Sentence Tokenizer:

from tokenizer import SentenceTokenizer
sentence = 'તારુ નામ શુ છે? તુ શુ કરે છો?'  # What is your name? What are you doing?
tokens = SentenceTokenizer(sentence)
print(tokens)  # ['તારુ નામ શુ છે?', 'તુ શુ કરે છો?']
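
For readers curious how such tokenizers can work, here is a minimal regex-based sketch. It is not the toolkit's implementation: the function names, the punctuation set, and the omission of stopword handling are all assumptions made to keep the example short.

```python
import re

# Assumed punctuation set; the toolkit may recognise more marks.
PUNCTUATION = set(".?!,;:")


def word_tokenize(text, keep_punctuations=True):
    """Split text into words, with each punctuation mark as its own token."""
    tokens = re.findall(r"[^\s.?!,;:]+|[.?!,;:]", text)
    if not keep_punctuations:
        tokens = [token for token in tokens if token not in PUNCTUATION]
    return tokens


def sentence_tokenize(text):
    """Split text into sentences, keeping the closing . ? or ! on each one."""
    parts = re.findall(r"[^.?!]+[.?!]*", text)
    return [part.strip() for part in parts if part.strip()]


print(word_tokenize("તારુ નામ શુ છે?"))
print(word_tokenize("તારુ નામ શુ છે?", keep_punctuations=False))
print(sentence_tokenize("તારુ નામ શુ છે? તુ શુ કરે છો?"))
```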

   

3) Transliterator:

This feature is helpful for people who are not acquainted with Gujarati. It takes a Gujarati word or letter as input and returns its pronunciation written in English (Latin) letters.

a) Letter Transliteration:

from transliterator import Transliterator
transliterator = Transliterator()
transliteration = transliterator.letter_transliterate_gujarati_to_english('ત')  # Letter 'ta'
print(transliteration)   # ta

b) Word/Sentence Transliteration:

from transliterator import Transliterator
transliterator = Transliterator()
transliteration = transliterator.transliterate('તારુ')  # Meaning 'your' or 'yours'
print(transliteration)   # taaru
transliteration = transliterator.transliterate('મારુ નામ રુત્વિક છે')	# meaning 'My name is Rutvik'
print(transliteration)   # maaru naam rutvik chhe

   

4) Stemmer:

The Stemmer is entirely rule based, so it will not always produce accurate stems or meaningful words. It strips both prefixes and suffixes; the example below exercises both. The stemmer can be used as follows:

a) Stem a text:

from stemmer import Stemmer
stemmer = Stemmer()
example_text = 'હું દોડ​વા જઉં છું. દોડ​વા માટે તમારે સાથે આવ​વું પ‌ડશે. ચાલશે ને? અન્યાય ના કરતા.'
# Meaning: 'I am going for a run. You will have to come with me. Is it ok? Don't be unfair.'
# This will exhibit both prefix (in 'અન્યાય') and suffix (in 'દોડ​વા' and other words) stripping.
stemmed_text = stemmer.stem(example_text)
print(stemmed_text)
# Output is a list of stemmed sentences: ['હુ દોડ જઉં છું.', 'દોડ માટે તમાર સાથ ચાલ પડશ.', 'ચાલશ ને', 'ન્યાય ના કર.']

b) Add and Remove Prefixes and Suffixes:

from stemmer import Stemmer
stemmer = Stemmer()
stemmer.add_suffix('suffix_in_gujarati') # Adds the suffix to the stripping list; it is stripped from now on.
stemmer.stem('your_sentence') # The added suffix is now stripped from this sentence as well.
stemmer.add_prefix('prefix_in_gujarati') # Same as add_suffix(), but for a prefix.
stemmer.delete_suffix('suffix_to_delete') # The stemmer stops stripping this suffix for the rest of the session.
stemmer.delete_prefix('prefix_to_delete') # Same as delete_suffix(), but for a prefix.
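
The idea of rule-based affix stripping with an editable rule list can be sketched in a few lines. The class `SimpleStemmer`, its `min_stem_length` guard, and the example rules are assumptions for illustration only; the toolkit's Stemmer has its own rule lists and may behave differently.

```python
class SimpleStemmer:
    """Illustrative rule-based stemmer: strips at most one prefix and one suffix."""

    def __init__(self, prefixes=(), suffixes=(), min_stem_length=2):
        self.prefixes = set(prefixes)
        self.suffixes = set(suffixes)
        # Never strip an affix if fewer than this many characters would remain.
        self.min_stem_length = min_stem_length

    def add_prefix(self, prefix):
        self.prefixes.add(prefix)

    def delete_prefix(self, prefix):
        self.prefixes.discard(prefix)

    def add_suffix(self, suffix):
        self.suffixes.add(suffix)

    def delete_suffix(self, suffix):
        self.suffixes.discard(suffix)

    def stem_word(self, word):
        # Try longer affixes first so the most specific rule wins.
        for prefix in sorted(self.prefixes, key=len, reverse=True):
            if word.startswith(prefix) and len(word) - len(prefix) >= self.min_stem_length:
                word = word[len(prefix):]
                break
        for suffix in sorted(self.suffixes, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= self.min_stem_length:
                word = word[:-len(suffix)]
                break
        return word


stemmer = SimpleStemmer(prefixes=["અ"], suffixes=["વા"])
print(stemmer.stem_word("અન્યાય"))   # prefix rule applies
print(stemmer.stem_word("દોડવા"))    # suffix rule applies
stemmer.delete_suffix("વા")
print(stemmer.stem_word("દોડવા"))    # rule removed, word is left unchanged
```

Because the rules are purely string based, a prefix rule such as 'અ' also fires on words where it is not a real prefix, which is the kind of inaccuracy the section above warns about.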

   

TODO:

  • Improve the POS Tagger performance
  • Create a morphological tokenizer
  • Create a Sentiment Analyser

gujarati-nlp-toolkit's People

Contributors

rutvik-trivedi

gujarati-nlp-toolkit's Issues

pandas.errors.ParserError: Error tokenizing data. C error:

Unable to train the model. Please help me solve this.

Traceback (most recent call last):
File "pos_guj.py", line 5, in <module>
train_data = tagger.structure_data('/home/jigisha/Downloads/HIN-GUJ_Sample/Sample_Gujarati.txt')
File "/home/jigisha/Desktop/new_pos/Gujarati-NLP-Toolkit-dev/posTagger.py", line 197, in structure_data
data = self.collect_train_data(file)
File "/home/jigisha/Desktop/new_pos/Gujarati-NLP-Toolkit-dev/posTagger.py", line 176, in collect_train_data
data = pd.read_csv(file)
File "/home/jigisha/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 611, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/jigisha/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 469, in _read
return parser.read(nrows)
File "/home/jigisha/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1059, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/jigisha/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2064, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2
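
A possible reading of this error: the traceback shows `pd.read_csv(file)` being called without a separator argument, so pandas splits on commas by default, and "Expected 1 fields in line 8, saw 2" is consistent with line 8 of the corpus containing a comma. The sketch below reproduces the effect with the standard csv module on a made-up two-line sample (the sample text is an assumption, not the reporter's file). In pandas itself, `read_csv` accepts `sep='\t'` for tab-delimited input; whether the toolkit exposes a way to pass it is not shown in the traceback.

```python
import csv
import io

# Made-up one-column sample: a header, then a row whose text contains a comma.
sample = "Value\nનામ\\N_NN ,\\RD_PUNC શુ\\N_NN\n"

# Splitting on commas breaks the second row into two fields.
comma_rows = list(csv.reader(io.StringIO(sample), delimiter=","))

# Splitting on tabs keeps every row as a single field.
tab_rows = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print([len(row) for row in comma_rows])  # [1, 2]
print([len(row) for row in tab_rows])    # [1, 1]
```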

ValueError: Tagger is not opened

તારુ નામ શુ છે?
['તારુ', 'નામ', 'શુ', 'છે']
[['WORD_તારુ', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'NEXT_WORD_નામ', 'SUF_ુ', 'PRE_શ', 'NEXT_NEXT_WORD_શુ'], ['WORD_નામ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_WORD_તારુ', 'SUF_ુ', 'PRE_શ', 'NEXT_WORD_શુ', 'SUF_ે', 'PRE_છ', 'NEXT_NEXT_WORD_છે'], ['WORD_શુ', 'SUF_ુ', 'PRE_શ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'PREV_WORD_નામ', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_PREV_WORD_તારુ', 'SUF_ે', 'PRE_છ', 'NEXT_WORD_છે'], ['WORD_છે', 'SUF_ે', 'PRE_છ', 'SUF_ુ', 'PRE_શ', 'PREV_WORD_શુ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'PREV_PREV_WORD_નામ']]
Traceback (most recent call last):
File "pos_guj.py", line 4, in <module>
print(tagger.pos_tag(sen))
File "/home/jigisha/Desktop/Gujarati-NLP-Toolkit-master/posTagger.py", line 213, in pos_tag
tags = self.tag(sent)
File "/home/jigisha/.local/lib/python3.8/site-packages/nltk/tag/crf.py", line 210, in tag
return self.tag_sents([tokens])[0]
File "/home/jigisha/.local/lib/python3.8/site-packages/nltk/tag/crf.py", line 168, in tag_sents
labels = self._tagger.tag(features)
File "pycrfsuite_pycrfsuite.pyx", line 637, in pycrfsuite._pycrfsuite.Tagger.tag
File "pycrfsuite_pycrfsuite.pyx", line 695, in pycrfsuite._pycrfsuite.Tagger.set
ValueError: Tagger is not opened.

The feature list is generated in the crf.py file (as printed above), so the failure happens afterwards, at the tagging step. Kindly help me solve this issue.
