boudinfl commented on May 28, 2024

Hi @qyangcauc

The input should contain the documents, each in XML format, already pre-processed with Stanford CoreNLP.

The reference file contains the keywords for each document in the SemEval-2010 format. Details about this format are available at http://docs.google.com/Doc?id=ddshp584_46gqkkjng4

You can look at the SemEval data for an example: http://semeval2.fbk.eu/semeval2.php?location=data

The use_lemmas parameter lets you use the lemmas produced by Stanford CoreNLP instead of the stems generated with nltk. For most uses, it is better to leave it set to False.

Best regards,

f.

from pke.

qyangcauc commented on May 28, 2024

Dear boudinfl

The problem:
Thanks to your help I now know how to train a model, but the README only explains how to extract keywords with the unsupervised models. I have already trained a model on my own data set, but after reading https://boudinfl.github.io/pke/build/html/supervised.html I still don't know how to use the model that I've trained.

This is what I adapted from the unsupervised example for my model:

# test file from which the keywords should be extracted
extractor = pke.supervised.Kea(input_file='./dataset/4046.txt')
# format of the test set
extractor.read_document(format='preprocessed')
# df counts for the test set
df_counts = pke.load_document_frequency_file(input_file='./dataset/df.txt')
extractor.feature_extraction(df=df_counts)
# Kea model that has been trained
extractor.classify_candidates(model='./dataset/kea_model/kea_kdd_model')
# I don't know how to get the keywords
print(extractor.get_n_best(n=2))

I know it's wrong. Could you tell me how to extract keywords from a test document with a supervised model that I have already trained?

Debatable:
After computing the document frequencies and writing them to df_count.txt, I found that the file's content was garbled, as if the encoding were wrong. In the function compute_document_frequency(), you use

with gzip.open(output_file, 'w') as f:
    f.write('--NB_DOC--' + delimiter + str(nb_documents) + '\n')
    for ngram in frequencies:
        f.write((ngram).encode('utf-8') + delimiter + str(len(frequencies[ngram])) + '\n')

When I use your code, the content of my df_count.txt is garbled. I tried changing the encoding, but the problem remained. So I changed the code to:

with open(output_file, 'w') as f: ...

This works. My environment: Windows 10, PyCharm and Python 2.7. I don't know why this happens.

Best regards,
qyang.


boudinfl commented on May 28, 2024

So for supervised models (this example is for Kea), the API is as follows:

import pke
extractor = pke.supervised.Kea(input_file='/path/to/input')

# here the input document should be in preprocessed format, i.e. 
# whitespace-separated POS-tagged tokens, one sentence per line.
extractor.read_document(format='preprocessed')

# load the df counts
# the df counts file is normally gzip-compressed, which is why you ran into
# the issue described in the second part of your comment
df_counts = pke.load_document_frequency_file(input_file='/path/to/dfcounts')

# Extract the candidates (I think you missed this step)
extractor.candidate_selection()

# Compute the features for the candidates
extractor.feature_extraction(df=df_counts)

# Classify the candidates using your trained model
extractor.classify_candidates(model='./dataset/kea_model/kea_kdd_model')

# Get the best 2 candidates
print(extractor.get_n_best(n=2))
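
To illustrate the gzip point: a gzip-compressed file looks like garbage when opened in a text editor, whatever encoding is chosen, because the bytes on disk are compressed data and not text. The sketch below is not pke code. It uses only the standard library, assumes Python 3 and a tab delimiter, and the helper names (write_df_file, read_df_file, looks_gzipped) are made up for illustration. It writes a small file in the same layout as the snippet quoted above (taking ready-made counts rather than len(frequencies[ngram])), checks for the gzip magic bytes, and reads the counts back.

```python
import gzip
import os
import tempfile

# Assumption: tab-separated columns, standing in for the `delimiter`
# variable used in the snippet quoted earlier in the thread.
DELIMITER = "\t"


def write_df_file(path, nb_documents, df, delimiter=DELIMITER):
    """Write document frequencies as gzip-compressed UTF-8 text."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("--NB_DOC--" + delimiter + str(nb_documents) + "\n")
        for ngram, count in df.items():
            f.write(ngram + delimiter + str(count) + "\n")


def read_df_file(path, delimiter=DELIMITER):
    """Read the file back into a dict, decompressing on the fly."""
    df = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split(delimiter)
            df[ngram] = int(count)
    return df


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


# The file name ends in .txt, but the content is still gzip data.
path = os.path.join(tempfile.mkdtemp(), "df_count.txt")
write_df_file(path, 2, {"keyphrase extract": 2, "modèle": 1})
is_gz = looks_gzipped(path)
counts = read_df_file(path)
```

If looks_gzipped() returns True for a df file, it has to be read through gzip and not with a plain open(), regardless of its extension.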

Best,

f.

