Comments (3)
Hi @qyangcauc
The input should contain the documents, each document in XML format (already pre-processed using Stanford CoreNLP).
The reference file contains the keywords for each document in the SemEval-2010 format. Details about this format are available at http://docs.google.com/Doc?id=ddshp584_46gqkkjng4
You can look at the SemEval data for an example: http://semeval2.fbk.eu/semeval2.php?location=data
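To give a rough idea of what reading such a reference file involves, here is a minimal sketch in plain Python. It is not pke's own loader. It assumes each line holds a document id, a colon, and a comma-separated list of keyphrases; the separators, the function name and the example line are assumptions made for illustration, so check them against the SemEval data linked above.

```python
def parse_reference_line(line, sep_doc_id=":", sep_keyphrases=","):
    """Split one reference line into (doc_id, [keyphrases]).

    Assumed layout: "<doc id> : <keyphrase>,<keyphrase>,..."
    """
    doc_id, _, keyphrases = line.strip().partition(sep_doc_id)
    # Drop empty entries and surrounding whitespace
    phrases = [p.strip() for p in keyphrases.split(sep_keyphrases) if p.strip()]
    return doc_id.strip(), phrases


# Hypothetical line, made up for illustration (not taken from the dataset)
example = "C-41 : keyphrase extraction,document frequency"
doc_id, phrases = parse_reference_line(example)
```

Each document then maps to a list of gold keyphrases that the extracted candidates can be compared against.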
The use_lemmas parameter lets you use the lemmas produced by Stanford CoreNLP instead of the stems generated with nltk. For most uses, it is better to leave it set to False.
Best regards,
f.
from pke.
Dear boudinfl
The problem:
With your help I now know how to train a model, but the README only shows how to get keywords with the unsupervised models. I have already trained a model on my own set, but after reading https://boudinfl.github.io/pke/build/html/supervised.html I still don't know how to use the model that I've trained.
This is what I adapted from the unsupervised example for my model:
# test file from which the keywords should be extracted
extractor = pke.supervised.Kea(input_file='./dataset/4046.txt')
# test set format
extractor.read_document(format='preprocessed')
# df counts for the test set
df_counts = pke.load_document_frequency_file(input_file='./dataset/df.txt')
extractor.feature_extraction(df=df_counts)
# Kea model that has been trained
extractor.classify_candidates(model='./dataset/kea_model/kea_kdd_model')
# I don't know how to get the keywords
print extractor.get_n_best(n=2)
I know it's wrong. Could you tell me how to get keywords for the test set with a supervised model that I have trained?
Debatable:
After I computed the document frequencies and wrote them to the file df_count.txt, I found that the file had the wrong encoding: its content is a mess. In the function compute_document_frequency(), you use
with gzip.open(output_file, 'w') as f:
    f.write('--NB_DOC--' + delimiter + str(nb_documents) + '\n')
    for ngram in frequencies:
        f.write((ngram).encode('utf-8') + delimiter + str(len(frequencies[ngram])) + '\n')
I tried to modify the encoding, but the problem remained. So I changed it to:
with open(output_file, 'w') as f: ...
This works. My running environment: Windows 10, PyCharm, Python 2.7. I don't know why this happens.
Best regards,
qyang.
So for supervised models (this example is for Kea), the API is as follows:
import pke
extractor = pke.supervised.Kea(input_file='/path/to/input')
# here the input document should be in preprocessed format, i.e.
# whitespace-separated POS-tagged tokens, one sentence per line.
extractor.read_document(format='preprocessed')
# load the df counts
# the df counts file is normally a gzip-compressed file, which explains the issue described
# in the second part of your comment
df_counts = pke.load_document_frequency_file(input_file='/path/to/dfcounts')
# Extract the candidates, I think you missed this step
extractor.candidate_selection()
# Compute the features for the candidates
extractor.feature_extraction(df=df_counts)
# Classify the candidates using your trained model
extractor.classify_candidates(model='./dataset/kea_model/kea_kdd_model')
# Get the best 2 candidates
print extractor.get_n_best(n=2)
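On the df file: the "mess" you see when opening it in a text editor is consistent with the file being gzip-compressed binary data rather than wrongly encoded text. The following is a minimal sketch of that round trip, written for Python 3 with the standard library only. It is not pke's implementation: the tab delimiter, the file name and the helper names are assumptions made for illustration.

```python
import gzip
import os
import tempfile

DELIMITER = "\t"  # assumption: columns are tab-separated


def write_df_file(path, nb_documents, df):
    """Write document frequencies as gzip-compressed, UTF-8 encoded text."""
    # 'wt' opens the gzip stream in text mode, so str values are encoded for us
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("--NB_DOC--" + DELIMITER + str(nb_documents) + "\n")
        for ngram, count in df.items():
            f.write(ngram + DELIMITER + str(count) + "\n")


def read_df_file(path):
    """Read the file back into a dict mapping n-gram to count."""
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit(DELIMITER, 1)
            counts[ngram] = int(count)
    return counts


# Hypothetical counts, made up for illustration
path = os.path.join(tempfile.mkdtemp(), "df.tsv.gz")
write_df_file(path, 2, {"keyphras extract": 2, "graph": 1})
counts = read_df_file(path)

# The raw bytes start with the gzip magic number, which is why the file
# looks garbled in a text editor even though its content is valid
with open(path, "rb") as f:
    magic = f.read(2)
```

If you write the file with a plain open() instead, it becomes readable in an editor, but it is then no longer in the compressed form that the comment above describes as the normal one.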
Best,
f.