Comments (7)

Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, this error seems to be caused by a problem in your training data file. As the last line of the error says,

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2

you need to check line 8 of your Sample_Gujarati.txt file. That line appears to contain two fields instead of one, which triggers this error. Fixing that line should resolve it. Hope this helps.
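
To locate such lines before handing the file to pandas, a quick check along these lines may help. This is only a sketch using Python's standard csv module, not the toolkit's own code: the function name find_irregular_lines and the sample text are made up for illustration, and it assumes the file is split on commas (the pandas default separator).

```python
import csv
import io


def find_irregular_lines(text, delimiter=","):
    """Return (line_number, field_count) for every row whose number of
    fields differs from that of the first non-blank row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = None
    irregular = []
    for row in reader:
        if not row:  # blank line, nothing to count
            continue
        if expected is None:
            expected = len(row)  # the first row sets the expected width
        elif len(row) != expected:
            irregular.append((reader.line_num, len(row)))
    return irregular


# Hypothetical sample: seven one-field lines, then a line containing a comma.
sample = "\n".join(["line%d" % i for i in range(1, 8)] + ["left,right", "line9"])
print(find_irregular_lines(sample))  # [(8, 2)]
```

For a real file, read its contents with open(path, encoding='utf-8').read() and pass the result to the function.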

from gujarati-nlp-toolkit.

jigishaSA commented on June 2, 2024

I observe that the error appears only on lines that contain a , (comma) with the RD_PUNC tag. I am attaching the tag file here. Please guide me.
Sample_Gujarati.txt


Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, sorry for the slightly late reply. I have updated the code; a typo in the code caused this problem. Please run

git pull origin master

and run the code again. It should work now.

Moreover, as the file requirements say, the column containing the text must have a Value heading. Your data does not have that initial header row. Please add it so that your dataset looks something like

ID Value
id1 text1
id2 text2

otherwise, you will again get an error like

KeyError: 'Value'
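
A small pre-check of the header row can catch this before training. The following is a sketch only, not part of the toolkit: the function name has_value_column is hypothetical, and the tab separator is an assumption, since this thread does not state which separator the file uses.

```python
import csv
import io


def has_value_column(text, delimiter="\t", required="Value"):
    """Return True if the first row of the text contains the required
    column name (compared after stripping surrounding whitespace)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # empty list if the text has no rows
    return required in [name.strip() for name in header]


# Hypothetical samples: one with the header row, one without it.
with_header = "ID\tValue\nid1\ttext1\nid2\ttext2\n"
without_header = "id1\ttext1\nid2\ttext2\n"

print(has_value_column(with_header))     # True
print(has_value_column(without_header))  # False
```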

Please make these changes, and you should be good to continue. Let me know if you face any more problems. Please close the issue if everything works fine for you. Thanks.


jigishaSA commented on June 2, 2024

Thank you for the reply. Your tagger now runs successfully, but when it trains on the given tag file, the output looks like this:
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 0
Seconds required: 0.001

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

L-BFGS terminated with error code (-1020)
Total seconds required for training: 0.000

Storing the model
Number of active features: 0 (0)
Number of active attributes: 0 (0)
Number of active labels: 1 (1)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.001

[['WORD_તારુ', 'LENGTH_4', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'NEXT_WORD_નામ', 'NEXT_LENGTH_3', 'SUF_ુ', 'PRE_શ', 'NEXT_NEXT_WORD_શુ', 'NEXT_NEXT_LENGTH_2'], ['WORD_નામ', 'LENGTH_3', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_WORD_તારુ', 'PREV_LENGTH_4', 'SUF_ુ', 'PRE_શ', 'NEXT_WORD_શુ', 'NEXT_LENGTH_2'], ['WORD_શુ', 'LENGTH_2', 'SUF_ુ', 'PRE_શ', 'SUF_મ', 'PRE_ન', 'SUF_ામ', 'PRE_ના', 'PREV_WORD_નામ', 'PREV_LENGTH_3', 'SUF_ુ', 'PRE_ત', 'SUF_રુ', 'PRE_તા', 'SUF_ારુ', 'PRE_તાર', 'PREV_PREV_WORD_તારુ', 'PREV_PREV_LENGTH_4']]
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]

My code is:
tagger=pt.posTagger('guj_pos_tagger')
train_data=tagger.structure_data('/home/jigisha/example/Gujarati-NLP-Toolkit-master/HIN-GUJ_Sample/train_Gujarati.txt')
tagger.train(train_data,'model_fl.txt')
tagger.eval()
sentence = 'તારુ નામ શુ છે?' # What is your name?
fp=open('tag_guj.txt','w')
fp.write(str(tagger.pos_tag(sentence)))
fp.close()


Rutvik-Trivedi commented on June 2, 2024

Hi @jigishaSA, the problem you are facing is caused by the

tagger.train(train_data, 'model_fl.txt')

you are calling. The function must be called in a specific way. The first argument is the training data and the second is the name of the file in which to save the model. That name must be the same string you entered when creating the tagger object, which in your code is 'guj_pos_tagger'. So, instead of 'model_fl.txt', please use 'guj_pos_tagger' (without any file extension). More generally, you can use

tagger.train(train_data, tagger._model_file)

Here, tagger._model_file automatically supplies the correct name, so you don't need to worry about it.

As for the output you see while the training script runs: if you do not want the script to print it, you can initialize the tagger object with

tagger = pt.posTagger('guj_pos_tagger', verbose=False)


jigishaSA commented on June 2, 2024

Sorry to say, but after applying all the changes you suggested, it gives the same output:
[[('તારુ', 'EMPTY'), ('નામ', 'EMPTY'), ('શુ', 'EMPTY')]]


Rutvik-Trivedi commented on June 2, 2024

I hope that, after our discussion, it is established that this problem is due to an improper dataset structure and not to any bug in the code. So, I am closing this issue as there has not been any activity here. Feel free to reopen it if the problem persists.
