Unable to train the model. Please help me solve this: Traceback (most recent c

pandas.errors.ParserError: Error tokenizing data. C error: about gujarati-nlp-toolkit HOT 7 CLOSED

rutvik-trivedi commented on June 2, 2024

pandas.errors.ParserError: Error tokenizing data. C error:

from gujarati-nlp-toolkit.

Comments (7)

Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, this error seems to be caused by a problem in your training data file. As the last line of the error says,

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2

you need to check line 8 of your Sample_Gujarati.txt file. That line appears to contain two fields instead of one, which is what triggers this error. Fix that line and the error should go away. Hope this helps.
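To locate every offending line at once rather than fixing them one error at a time, something like the following may help. It is only a rough sketch using Python's standard csv module, not the toolkit's own loader: the helper name find_bad_lines is made up for illustration, and it assumes the file is split on a single delimiter character with no quoting.

```python
import csv
import io

def find_bad_lines(text, delimiter=",", expected_fields=None):
    """Return (line_number, field_count) for every row whose field count
    differs from the expected one (hypothetical helper, not toolkit code)."""
    # QUOTE_NONE keeps one physical line per row, so line numbers stay exact.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    bad = []
    for line_no, row in enumerate(reader, start=1):
        if expected_fields is None:
            # Assumption: the first line defines the expected field count.
            expected_fields = len(row)
        if len(row) != expected_fields:
            bad.append((line_no, len(row)))
    return bad

# Small made-up sample: the third line contains a comma.
sample = "word1\nword2\nword3, word4\n"
print(find_bad_lines(sample))                  # split on commas
print(find_bad_lines(sample, delimiter="\t"))  # split on tabs

# For a real file, read it first, for example:
# with open("Sample_Gujarati.txt", encoding="utf-8", newline="") as f:
#     print(find_bad_lines(f.read()))
```

If lines are reported only when splitting on commas, the commas are part of the text itself and the file is being read with the wrong separator.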

jigishaSA commented on June 2, 2024

I observe that the error appears only on lines that contain a , (comma) tagged as RD_PUNC. I am attaching the tag file here. Please guide me.
Sample_Gujarati.txt

Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, sorry for the slightly late reply. I have updated the code. A typo in the code was causing this problem. Please run

git pull origin master

and run the code again. It should work now.

Moreover, as the file requirements state, the column containing the text must have a Value heading. Your data does not have that initial header row. Please add it so that your dataset looks something like

ID	Value
id1	text1
id2	text2

otherwise, you will get another error, something like

KeyError: 'Value'

Please make these changes, and you should be good to continue. Let me know in case you face any more problems. Please do close the issue if everything works fine for you. Thanks
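A quick way to confirm the header row is present before training could look like the sketch below. It uses only the standard csv module and is not part of the toolkit. The helper names has_column and add_header are hypothetical, and the tab delimiter is an assumption based on how the example above is laid out.

```python
import csv
import io

def has_column(text, column="Value", delimiter="\t"):
    """Return True if the first row of the data contains the given column name."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    header = next(reader, [])  # an empty file yields an empty header
    return column in [name.strip() for name in header]

def add_header(text, header=("ID", "Value"), delimiter="\t"):
    """Prepend a header row (assumed column names) to data that lacks one."""
    return delimiter.join(header) + "\n" + text

# Made-up data without a header row, mirroring the example layout.
data = "id1\ttext1\nid2\ttext2\n"
if not has_column(data):
    data = add_header(data)
print(data)
```

Running the check first avoids discovering the missing header only when the KeyError is raised during training.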

jigishaSA commented on June 2, 2024

Thank you for the reply. Your tagger now runs successfully, but when it trains on the given tag file, the output looks like this:
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 0
Seconds required: 0.001

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

L-BFGS terminated with error code (-1020)
Total seconds required for training: 0.000

Storing the model
Number of active features: 0 (0)
Number of active attributes: 0 (0)
Number of active labels: 1 (1)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.001

[['WORD_તારુ', 'LENGTH_4', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'NEXT_WORD_નામ', 'NEXT_LENGTH_3', 'SUF_ુ', 'PRE_શ', 'NEXT_NEXT_WORD_શુ', 'NEXT_NEXT_LENGTH_2'], ['WORD_નામ', 'LENGTH_3', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_WORD_તારુ', 'PREV_LENGTH_4', 'SUF_ુ', 'PRE_શ', 'NEXT_WORD_શુ', 'NEXT_LENGTH_2'], ['WORD_શુ', 'LENGTH_2', 'SUF_ુ', 'PRE_શ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'PREV_WORD_નામ', 'PREV_LENGTH_3', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_PREV_WORD_તારુ', 'PREV_PREV_LENGTH_4']]
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]

My code is
tagger=pt.posTagger('guj_pos_tagger')
train_data=tagger.structure_data('/home/jigisha/example/Gujarati-NLP-Toolkit-master/HIN-GUJ_Sample/train_Gujarati.txt')
tagger.train(train_data,'model_fl.txt')
tagger.eval()
sentence = 'તારુ નામ શુ છે?' # What is your name?
# Write the tagged output as UTF-8 so the Gujarati text is saved correctly
with open('tag_guj.txt', 'w', encoding='utf-8') as fp:
    fp.write(str(tagger.pos_tag(sentence)))

Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, the problem you are facing here is because of the

tagger.train(train_data, 'model_fl.txt')

you are calling. The function must be called in a specific way. The first argument is the training data and the second is the name of the file in which to save the model. That name must be the same string you passed when creating the tagger object, which in your code is 'guj_pos_tagger'. So, instead of 'model_fl.txt', please use 'guj_pos_tagger' (do not add a file extension). More generally, you can use

tagger.train(train_data, tagger._model_file)

Here, tagger._model_file automatically provides the correct name, so you don't need to keep track of it yourself.

As for the output printed while the training script is running: if you do not want the script to print it, you can initialize the tagger object with

tagger = pt.posTagger('guj_pos_tagger', verbose=False)

jigishaSA commented on June 2, 2024

Sorry to say, but after applying all the changes you suggested, it gives the same output:
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]

Rutvik-Trivedi commented on June 2, 2024

I hope that, after our discussion, it has been established that this problem is due to the improper structure of the dataset and not to any bug in the code. I am closing this issue as there has not been any activity here. Feel free to reopen it if the problem persists.
