Comments (7)
Hi @jigishaSA, this error seems to be due to a problem in your training data file. As the error says in the last line,
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2
you need to check line 8 of your Sample_Gujarati.txt file. That line appears to contain two fields instead of one, which is what triggers the error. Fix that line and the error should go away. Hope this helps.
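To locate every offending line rather than only the first one the parser reports, a small check like the following may help. It is a minimal sketch that assumes the file is split on commas (the pandas default); the function name `find_bad_lines` and the sample text are made up for illustration and are not part of the toolkit.

```python
import csv
import io


def find_bad_lines(text, expected_fields=1, delimiter=","):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    bad = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        # Blank lines come back as empty lists; skip them.
        if row and len(row) != expected_fields:
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: the third line contains a comma, so it splits into two fields.
sample = "line one\nline two\nwords, more words\n"
print(find_bad_lines(sample))  # [(3, 2)]
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.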
from gujarati-nlp-toolkit.
I observe that the error appears only on lines that contain a comma (,) with the RD_PUNC tag. I am attaching the tag file here. Please guide me.
Sample_Gujarati.txt
from gujarati-nlp-toolkit.
Hi @jigishaSA, sorry for the slightly late reply. I have updated the code; a typo there caused this problem. Please run
git pull origin master
and run the code again. It should work now.
Moreover, as the file requirements state, the column containing the text must have a Value heading. Your data does not have that initial header row. Add the header row so that your dataset looks something like
ID | Value |
---|---|
id1 | text1 |
id2 | text2 |
Otherwise, you will get another error, such as
KeyError: 'Value'
Please make these changes, and you should be good to continue. Let me know in case you face any more problems. Please do close the issue if everything works fine for you. Thanks
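A quick way to confirm the header row is present before running the tagger is sketched below. It assumes a comma-separated file, which this thread does not confirm, and `has_column` is a hypothetical helper, not a toolkit function.

```python
import csv
import io


def has_column(text, column="Value", delimiter=","):
    """Return True if the first row of the data names the given column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    return column in [name.strip() for name in header]


# Hypothetical data: one version with the header row, one without it.
with_header = "ID,Value\nid1,text1\nid2,text2\n"
without_header = "id1,text1\nid2,text2\n"
print(has_column(with_header))
print(has_column(without_header))
```

If the check returns False, looking up the Value column would be expected to fail with the KeyError shown above.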
from gujarati-nlp-toolkit.
Thank you for the reply. The tagger now runs successfully, but when it trains on the given tag file, the output looks like this:
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 0
Seconds required: 0.001
L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20
L-BFGS terminated with error code (-1020)
Total seconds required for training: 0.000
Storing the model
Number of active features: 0 (0)
Number of active attributes: 0 (0)
Number of active labels: 1 (1)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.001
[['WORD_તારુ', 'LENGTH_4', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'NEXT_WORD_નામ', 'NEXT_LENGTH_3', 'SUF_ુ', 'PRE_શ', 'NEXT_NEXT_WORD_શુ', 'NEXT_NEXT_LENGTH_2'], ['WORD_નામ', 'LENGTH_3', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_WORD_તારુ', 'PREV_LENGTH_4', 'SUF_ુ', 'PRE_શ', 'NEXT_WORD_શુ', 'NEXT_LENGTH_2'], ['WORD_શુ', 'LENGTH_2', 'SUF_ુ', 'PRE_શ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'PREV_WORD_નામ', 'PREV_LENGTH_3', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_PREV_WORD_તારુ', 'PREV_PREV_LENGTH_4']]
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]
My code is
tagger=pt.posTagger('guj_pos_tagger')
train_data=tagger.structure_data('/home/jigisha/example/Gujarati-NLP-Toolkit-master/HIN-GUJ_Sample/train_Gujarati.txt')
tagger.train(train_data,'model_fl.txt')
tagger.eval()
sentence = 'તારુ નામ શુ છે?' # What is your name?
fp=open('tag_guj.txt','w')
fp.write(str(tagger.pos_tag(sentence)))
fp.close()
from gujarati-nlp-toolkit.
Hi @jigishaSA, the problem you are facing comes from this call:
tagger.train(train_data, 'model_fl.txt')
The first argument of the function is the training data, and the second is the name of the file in which to save the model. That name must be the same string you passed when creating the tagger object, which in your code is 'guj_pos_tagger'. So, instead of 'model_fl.txt', please use 'guj_pos_tagger' (do not add a file extension). More generally, you can write
tagger.train(train_data, tagger._model_file)
Here, tagger._model_file holds the correct name automatically, so you do not need to keep track of it yourself.
As for the output printed while the training script runs: if you do not want the script to print it, pass verbose=False when initializing the tagger object:
tagger = pt.posTagger('guj_pos_tagger', verbose=False)
from gujarati-nlp-toolkit.
Sorry to say, but after applying all the changes you suggested, it still gives the same output:
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]
from gujarati-nlp-toolkit.
I hope our discussion has established that this problem is due to the improper structure of the dataset and not to any bug in the code. I am closing this issue as there has not been any recent activity. Feel free to reopen it if the problem persists.
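The training log earlier in this thread reports 0 features and only 1 label, which suggests the trainer received no usable tagged data. A rough sanity check before training could look like the sketch below. It assumes the data can be viewed as a list of sentences, each a list of (word, tag) pairs; that layout and both function names are assumptions for illustration, not the confirmed output of structure_data.

```python
from collections import Counter


def summarize_tags(sentences):
    """Count how often each tag occurs across all (word, tag) pairs."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


def looks_trainable(sentences):
    """A tagger needs at least two distinct tags to learn anything useful."""
    return len(summarize_tags(sentences)) > 1


# Hypothetical examples: one varied dataset, one where every tag is identical.
varied = [[("word1", "N_NN"), (",", "RD_PUNC")]]
degenerate = [[("word1", "EMPTY"), ("word2", "EMPTY")]]
print(looks_trainable(varied))
print(looks_trainable(degenerate))
```

If the summary shows a single tag, or none at all, the dataset structure is worth checking before looking at the training call.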
from gujarati-nlp-toolkit.