
pke - python keyphrase extraction

pke is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.


Installation

To install pke from GitHub using pip:

pip install git+https://github.com/boudinfl/pke.git

pke relies on spacy (>= 3.2.3) for text processing and requires models to be installed:

# download the english model
python -m spacy download en_core_web_sm

Minimal example

pke provides a standardized API for extracting keyphrases from a document. Start with the few lines below. To use another model, simply replace pke.unsupervised.TopicRank with another model (see the list of implemented models).

import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here the document is expected to be a simple
# text string and preprocessing is carried out using spacy
extractor.load_document(input='text', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

A detailed example is provided in the examples/ directory.

Getting started

To get your hands dirty with pke, we invite you to try out our tutorials.

Name                                                Link
Getting started with pke and keyphrase extraction   Open In Colab
Model parameterization                              Open In Colab
Benchmarking models                                 Open In Colab

Implemented models

pke currently implements the following keyphrase extraction models:

  • unsupervised, statistical models: TfIdf, KPMiner, YAKE
  • unsupervised, graph-based models: TextRank, SingleRank, TopicRank, TopicalPageRank, PositionRank, MultipartiteRank
  • supervised, feature-based models: Kea, WINGNUS

Model performances

For comparison purposes, overall results of the implemented models on commonly used benchmark datasets are available in results. The code for reproducing these experiments is in the benchmarking notebook (also available on Colab).

Citing pke

If you use pke, please cite the following paper:

@InProceedings{boudin:2016:COLINGDEMO,
  author    = {Boudin, Florian},
  title     = {pke: an open source python-based keyphrase extraction toolkit},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {69--73},
  url       = {http://aclweb.org/anthology/C16-2015}
}

pke's People

Contributors

boudinfl, brunoberisso, knok, limberc, paul-mannino, poulain-tim, sp1thas, sqrtminusone, suhasmohan, tagucci, theorm, timrepke, ygorg


pke's Issues

YAKE implementation error!

EDIT 29/10/2020: Converted results into markdown table for easy comparison

YAKE has some implementation errors. For example,

  1. the stopword list MUST be more complete, so we use this one here
  2. the weights of stopwords and non-stopwords are computed differently
  3. the computation of co-occurrence frequencies needs to take punctuation into account. And some others...

In addition, I have already made the changes necessary for YAKE to work properly in pke, but as I have no privileges to open a pull request, I leave the corrected version in my git.

All of these errors have been detrimental to YAKE's performance; see the table below for a quick experiment.
The first column is the original YAKE version, and the pke_yake_vX.Y results mean:

  • X = 0 is the default version implemented by pke
  • X = 1 is my corrected version
  • Y = 1 uses the nltk stopword lists (as pke does)
  • Y = 2 uses these stopword lists

Original: Yake_n3_w1_seqm-0.90_f-NonC
pke_vX.Y: pke_yake_vX.Y

corpus / MAP          Original   pke_v0.1   pke_v0.2   pke_v1.1   pke_v1.2
110-PT-BN-KP             41.51      15.40      39.46      35.18      41.76
500N-KPCrowd-v1.1.1      13.88       7.93      13.72      12.99      15.13
Inspec                   21.58       7.84      14.54      16.25      17.79
PubMed                    4.45       1.88       0.10       3.74       4.04
SemEval2010              12.47       6.65       9.59      10.98      11.25
WKC                      31.24      20.13      32.28      31.84      34.88

Alternatively, I suggest using the version of YAKE that is on PyPI; it is updated frequently.
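For reference, a minimal sketch of using the PyPI version (assuming the yake package keeps its documented KeywordExtractor API; lower scores indicate more relevant keyphrases):

import yake

text = "pke is an open source python-based keyphrase extraction toolkit."

# n is the maximum n-gram size, top the number of keyphrases returned
kw_extractor = yake.KeywordExtractor(lan="en", n=3, top=10)

# returns a list of (keyphrase, score) tuples
keyphrases = kw_extractor.extract_keywords(text)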

ZeroDivisionError during candidate weighting in MultipartiteRank-based keyphrase extraction

My code:

import pke
import string
from nltk.corpus import stopwords
import os

extractor = pke.unsupervised.MultipartiteRank()
pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')

inputDir = 'D:\\RevAnalyzer\\TestingSet\\'
outputDir = 'D:\\RevAnalyzer\\Unsupervised-KeyPhraseExtraction\\MultipartiteRank\\'
i = 0
for doc in os.listdir(inputDir):
    if doc.endswith(".txt"):
        i = i + 1
        extractor.load_document(input=inputDir + doc)
        extractor.candidate_selection(pos=pos, stoplist=stoplist)
        extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
        keyphrases = extractor.get_n_best(20)
        with open(outputDir + doc.replace('.txt', '.key'), 'w') as kpFile:
            for kp in keyphrases:
                kpFile.write(kp[0] + '\n')
        print(str(i) + ' File Done')

Error

ZeroDivisionError Traceback (most recent call last)
in ()
19 extractor.load_document(input=inputDir+doc)
20 extractor.candidate_selection(pos=pos, stoplist=stoplist)
---> 21 extractor.candidate_weighting(alpha=1.1,threshold=0.74,method='average')
22 keyphrases = extractor.get_n_best(20)
23 kpFile=open(outputDir+doc.replace('.txt','.key'),'w')

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in candidate_weighting(self, threshold, method, alpha)
216
217 # build the topic graph
--> 218 self.build_topic_graph()
219
220 if alpha > 0.0:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in build_topic_graph(self)
142 gap -= len(self.candidates[node_j].lexical_form) - 1
143
--> 144 weights.append(1.0 / gap)
145
146 # add weighted edges

ZeroDivisionError: float division by zero

'module' object has no attribute 'YAKE'

I receive this error when I try to import YAKE.

  File "YAKE.py", line 10, in <module>
    extractor = pke.unsupervised.YAKE(input_file=input_file, language=None)
AttributeError: 'module' object has no attribute 'YAKE'

I'm using Python 2.7.12.

Benchmarking reproduction

Hi!

I'm trying to validate the use of this toolkit for my master's thesis by duplicating the results of the benchmarking you've done, but I'm unable to reproduce them. I systematically get much lower scores than what you present at the bottom of the GitHub page. Would you be so kind as to share the code for how you did the evaluation?

Here is a gist of how I would go about evaluating the TopicRank algorithm: https://gist.github.com/miikargh/fa8f301125fa433fc796cb8376ee0dce

Thanks for the awesome toolkit!

Hitting "array is too big" error when loading a 3.72MB text file with 9K lines

Using the pke script as shown in the GitHub README page, i.e., using TopicRank:

C:\Python27\lib\site-packages\unidecode\__init__.py:50: RuntimeWarning: Surrogate character u'\ud83d' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
C:\Python27\lib\site-packages\unidecode\__init__.py:50: RuntimeWarning: Surrogate character u'\ude0a' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
C:\Python27\lib\site-packages\unidecode\__init__.py:50: RuntimeWarning: Surrogate character u'\ude1d' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
Traceback (most recent call last):
File "E:\Projects\NLP\KeyPhraseExtraction.py", line 17, in
extractor.candidate_weighting()
File "C:\Python27\lib\site-packages\pke\unsupervised.py", line 437, in candidate_weighting
self.topic_clustering(threshold=threshold, method=method)
File "C:\Python27\lib\site-packages\pke\unsupervised.py", line 384, in topic_clustering
candidates, X = self.vectorize_candidates()
File "C:\Python27\lib\site-packages\pke\unsupervised.py", line 365, in vectorize_candidates
X = np.zeros((len(C), len(dim)))
ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.

How can I read multiple documents as a corpus?

Hi,
I am using pke.unsupervised.TfIdf to extract keyphrases, but it seems that the function only takes one file as input. I have a csv file that contains multiple documents and I want to read them together as a corpus. Can I use pke to achieve this? Can I use a string as input?
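In recent versions of pke, load_document also accepts a raw text string, so one way to process a csv corpus is to run one extractor per row; a sketch, where the 'text' column name is hypothetical:

import csv
import pke

with open('corpus.csv', newline='') as f:
    for row in csv.DictReader(f):
        # one fresh extractor per document, fed a raw string
        extractor = pke.unsupervised.TfIdf()
        extractor.load_document(input=row['text'], language='en')
        extractor.candidate_selection()
        extractor.candidate_weighting()
        print(extractor.get_n_best(n=10))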

Language issue: "French" & clarifications about inputs

Hello,
I am trying to use the KEA algorithm for French.
I want to know the difference between df and model_file in the candidate_weighting function.
I'm also facing a problem with the fr model; I'm getting this error: "Can't find model 'fr'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory." This is despite spacy being installed with all the languages.

Thank you in advance
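For what it's worth, the E050 error for 'fr' usually means no French spacy model is installed; a sketch of the usual fix, assuming your pke version resolves language='fr' to an installed French pipeline:

# download a french model
python -m spacy download fr_core_news_sm

import pke

extractor = pke.supervised.Kea()
extractor.load_document(input='/path/to/input.txt', language='fr')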

AttributeError: module 'pke' has no attribute 'unsupervised'

UPDATE -- pilot error! I had named the test script pke.py.
So, the error now, when running the pke_example.py script is:
ImportError: bad magic number in 'pke': b'\x03\xf3\r\n'

I will leave this in place for others who make similar silly errors.

python 3.6.7
When typed in by HAND into a live python3 shell, it works & I get keyphrase tuples.
When running the cut/pasted code example:

$ python3  /tmp/pke.py 
Traceback (most recent call last):
  File "/tmp/pke.py", line 2, in <module>
    import pke
  File "/tmp/pke.py", line 5, in <module>
    extractor = pke.unsupervised.TopicRank()
AttributeError: module 'pke' has no attribute 'unsupervised'

Error when calling extractor.candidate_weighting()

/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/coo.py:200: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
if np.rank(self.data) != 1 or np.rank(self.row) != 1 or np.rank(self.col) != 1:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py:130: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
if np.rank(self.data) != 1 or np.rank(self.indices) != 1 or np.rank(self.indptr) != 1:

Unicode decode error while using KPMiner

While using KPMiner I am getting a Unicode decode error. This happens while reading the document frequency file, which is hardcoded to pke\models\df-semeval2010.tsv.gz.

Please find the detailed error below.

UnicodeDecodeError Traceback (most recent call last)
in ()
21 # minimum similarity for clustering, and the method parameter defines the
22 # linkage method
---> 23 extractor.candidate_weighting()
24
25 # print the n-highest (10) scored candidates

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pke\unsupervised\statistical\kpminer.py in candidate_weighting(self, df, sigma, alpha)
119 logging.warning('LoadFile._df_counts is hard coded to {}'.format(
120 self._df_counts))
--> 121 df = load_document_frequency_file(self._df_counts, delimiter='\t')
122
123 # initialize the number of documents as --NB_DOC-- + 1 (current)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pke\utils.py in load_document_frequency_file(input_file, delimiter)
55
56 # populate the dictionary
---> 57 for row in df_reader:
58 frequencies[row[0]] = int(row[1])
59

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 104: character maps to <undefined>

question about the algorithm

Hi everyone,
I'm new to this forum and to the world of AI and machine learning. I am very interested in supervised learning and I would like to learn more about these algorithms.
From my reading, I understand that KEA is based on supervised learning, but I'm not sure I have understood it correctly:
As inputs we give the model the document from which we want to extract keywords, a trained model, and the number of keyphrases we would like to get?
So the trained model is used to find out whether our document contains words that were already keyphrases in the training data, and if so, are they automatically considered keyphrases? And for the rest, does KEA apply TF-IDF, compute the "distance", and then apply a naive Bayes model to produce a list of words ranked by probability?
Is WINGNUS based on the same algorithm or is it different?
I hope I was clear. I don't know whether what I understood is right, and I would be pleased if someone could correct this information if it's incorrect.

Keyphrase extraction not from text file

From most of your examples I have noticed that a text file path is provided as input. Is there any way to take input from another data type, such as a list or a dictionary?

For example, say my input corpus is stored in a list and I would like to provide the list as input for keyphrase extraction. Is that possible?

feed directly text as input

Hi,
Is it possible to feed text directly as input instead of reading it from a text file? I want to give it POS-tagged sentences as input.
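A sketch of one way to do this, assuming a recent pke version whose load_document accepts a raw string or an already-processed spacy Doc (the latter lets you control tokenization and POS tagging yourself):

import pke
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('the text you have already tagged')

extractor = pke.unsupervised.TopicRank()
# input can be the raw string or the spacy Doc
extractor.load_document(input=doc, language='en')
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))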

extractor.load_document(input='/path/to/input.txt', language='en')

Hi, I'm so sorry to ask this simple question, but I've been struggling with it for days and think I should ask you. I would really appreciate it if you could make some suggestions.

I believe I installed pke correctly (I can import pke).

I believe I installed spacy correctly (I can import spacy and do the following: nlp = spacy.load('en_core_web_sm')).

And I tried the first two lines of your minimal example, which is:

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input='C-1.txt', language='en')

I downloaded your C-1.txt and I also tried my own sample .txt files, and I get the following error messages:

Traceback (most recent call last):
  File "C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32\test_pke.py", line 7, in <module>
    extractor.load_document(input='C-1.txt', language='en')
  File "C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pke\base.py", line 107, in load_document
    doc = parser.read(text=text, path=input, **kwargs)
  File "C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pke\readers.py", line 74, in read
    max_length=max_length)
  File "C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32\lib\site-packages\spacy\__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32\lib\site-packages\spacy\util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I do have the .txt files in the following path:
C:\Users\kjung2\AppData\Local\Programs\Python\Python37-32
where I have test_pke.py, which currently contains only the two lines above.

I know this must be a really silly question for you, but I really want to use your code. Please help, and thanks again.
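For the record, E050 means spacy cannot resolve the name 'en' to an installed model; a sketch of the usual fix (the link step applies to spacy 2.x, where 'en' was a shortcut name):

# install the model, then link it to the 'en' shortcut (spacy 2.x)
python -m spacy download en_core_web_sm
python -m spacy link en_core_web_sm en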

Module import error

I am trying to use your pke; however, when running your simplest example, I encounter a problem:
"ModuleNotFoundError: No module named 'base'"
I have no idea about this problem, so could you help me? By the way, I am using your package under Python 3.

Issue about TopicalPageRank

Hi @boudinfl

I am using the semeval2010 (raw text) data with TopicalPageRank, but I got a UnicodeDecodeError.

Code:

class TopicalPageRank:

    def unsupervised_TopicalPageRank(self, path, Top_n):
        extractor = pke.unsupervised.TopicalPageRank(input_file=path)
        extractor.read_document(format='raw')
        extractor.candidate_selection()
        extractor.candidate_weighting()
        keyphrases = extractor.get_n_best(n=Top_n, stemming=False)
        return keyphrases

Error:


UnicodeDecodeError Traceback (most recent call last)
in ()
4
5 Data_Processing = DataProcessing()
----> 6 Data_Processing.Test_Data(Keyphrases_pro_type)
7 #Data_Processing.Train_Data(Keyphrases_pro_type)

in Test_Data(self, GSK_type)
44
45
---> 46 Keyphrases = topicalpagerank.unsupervised_TopicalPageRank(full_path[i],Top_n)
47
48 if Keyphrases_pro_type == "combined":

in unsupervised_TopicalPageRank(self, path, Top_n)
5 extractor.read_document(format='raw')
6 extractor.candidate_selection()
----> 7 extractor.candidate_weighting()
8 keyphrases = extractor.get_n_best(n=Top_n, stemming=False)
9 return keyphrases

~\Anaconda3\lib\site-packages\pke\unsupervised\graph_based.py in candidate_weighting(self, window, pos, normalized, lda_model)
788 model.components_,
789 model.exp_dirichlet_component_,
--> 790 model.doc_topic_prior_) = pickle.load(f)
791
792 # build the document representation

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

Data (raw text): semeval2010 → Test data → File name: C-1.txt.final

tf-idf with lemmatizer

For tf-idf there is no way to compute tf for the lemmatized form of a word (we can count tf for stemmed words or for words with no normalization). Maybe in the load_file method, in the # word normalization section, we need to add a condition for lemmatization, like:

elif self.normalization is 'lemmatization':
    for i, sentence in enumerate(self.sentences):
        self.sentences[i].stems = sentence.stems

Issue on Multipartite Rank

Hi.

I am using Multipartite Rank to extract keypharses from Persian documents and I have two questions:

  1. The paper states that candidates matching the pattern (/adj* noun+/) are selected. In Persian, adjectives appear after nouns; how can I make this work correctly in that case?

  2. Topics are selected based on the stems of the words. How should I input the stems when I'm using the 'preprocessed' mode to read the documents?

Thanks
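One possible approach to question 1, assuming a pke version where candidate selection goes through LoadFile.grammar_selection: pass a grammar with the adjectives after the nouns instead of the default <ADJ>*<NOUN|PROPN>+ pattern. A sketch:

import pke

extractor = pke.unsupervised.MultipartiteRank()
# 'fa' is hypothetical and requires a Persian spacy pipeline
extractor.load_document(input='/path/to/input.txt', language='fa')

# nouns followed by adjectives, matching Persian word order
extractor.grammar_selection(grammar="NP: {<NOUN|PROPN>+<ADJ>*}")
extractor.candidate_weighting()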

Tfidf

While trying out the TfIdf model for keyphrase extraction, I'm getting an error in the candidate_weighting method when I pass the argument df=None.
The error is: ValueError: Invalid mode ('rtb')
I have looked into the source code to get an idea of what the df argument is and what the format of the file should be, but found little information. Can you shed some light on this issue? Thanks.
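For context, df is a dictionary mapping terms to document-frequency counts, loaded from a gzipped tsv file; a sketch of building and using one with the pke.utils helpers (exact signatures vary across pke versions, so treat the parameter names as assumptions):

import pke

# count n-gram document frequencies over your own collection
pke.utils.compute_document_frequency(input_dir='/path/to/collection/',
                                     output_file='df_counts.tsv.gz')

# load the counts back and hand them to the model
df = pke.load_document_frequency_file(input_file='df_counts.tsv.gz')

extractor = pke.unsupervised.TfIdf()
extractor.load_document(input='/path/to/input.txt', language='en')
extractor.candidate_selection()
extractor.candidate_weighting(df=df)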

German support

Currently pke supports English, Portuguese and French.
What would it take to add support for German?
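Since pke delegates preprocessing to spacy, any language with a spacy model and a stoplist should work in principle; a sketch for German, assuming language='de' resolves to the installed model:

# download the german model
python -m spacy download de_core_news_sm

import pke

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input='Ein kurzer deutscher Beispieltext.', language='de')
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))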

All zero weights when use a newly trained Kea Model

Hi boudinfl,

when I try to train a new Kea model using the given lvl-2 data set and the train.reader.stem.final reference data, the model just doesn't seem to work. It always outputs keyphrases with all-zero weights, like below.

[(u'angl densiti', 0.0),
(u'significantli improv local', 0.0),
(u'tell that node', 0.0),
(u'boundari indic', 0.0),
(u'smaller statist mean', 0.0),
(u'clearli that iter', 0.0),
(u'undesir high deploy', 0.0),
(u'dbe vs non\u2212db', 0.0),
(u'range-bas and range-fre', 0.0),
(u'60 70 basic', 0.0)]

here is the code:

import pke

input_dir = r'lvl-2/'

# load the DF counts from file
df_counts = pke.load_document_frequency_file(input_file='out.gz', delimiter='\t')

# train a new Kea model
pke.train_supervised_model(input_dir=input_dir,
                           reference_file='ref/train.reader.stem.final',
                           model_file='output_mdl.pickle',
                           df=df_counts,
                           format='corenlp',
                           use_lemmas=False,
                           stemmer='porter',
                           model=pke.supervised.Kea(),
                           language='english',
                           extension='xml')

Thank you!

Best,
Xiang

TopicRank and MultipartiteRank

The examples for TopicRank and MultipartiteRank list zero sentences for several of the highest-scored candidates. I have verified that the keyphrases are in the document. I am using extractor.load_document on a plain text file.

Specifically, the following line returns an empty array for some high-scoring keyphrases:
extractor.candidates['top-keyphrase'].sentence_ids

Here is some example output from the attached file via the TopicRank example code (note that the keyphrase output has been modified to contain Rank, Sentence Count, Keyphrase, Weight):

First 10 keyphrases (Rank, Sentence Count, Keyphrase, Weight):
0. (87): "rude", Weight: 0.029621
1. (0): "service", Weight: 0.019362 
2. (0): "awful food", Weight: 0.016355
3. (38): "place", Weight: 0.013839
4. (20): "time", Weight: 0.010575
5. (0): "thanksgiving eveningthe manager", Weight: 0.010379
6. (0): "restaurant", Weight: 0.009929
7. (0): "good things", Weight: 0.009552
8. (0): "minutes", Weight: 0.009408
9. (20): "order", Weight: 0.007797
10. (16): "staff", Weight: 0.007071

review_text_rude-extremely.txt

issue about WINGNUS and Kea

I tried to use WINGNUS to train a model on semeval2010; however, none of the candidates selected by candidate_selection() is in the reference file.
For example, the reference keyphrases for C-41.txt are
C-41 : adapt resourc manag,distribut real-time embed system,end-to-end qualiti of servic+servic end-to-end qualiti,hybrid adapt resourcemanag middlewar,hybrid control techniqu,real-time video distribut system,real-time corba specif,video encod/decod,resourc reserv mechan,dynam environ,stream servic,distribut real-time emb system,hybrid system,qualiti of servic+servic qualiti
but the candidates selected by candidate_selection() are
differ execut, variou class of applic, qo of best-effort, rate with hyarm without, system architectur wireless, network bandwidth, endto-end real-tim qo, end-to-end qualiti of servic, dynam resourc, middlewar support in wide-area, system util, hyarm without, hyarm uav1 qo, receiv uav camera, system case, function of hyarm, natarajan, charter, system architectur, network with limit bandwidth, mpeg1 mpeg4 real, period of time... and so on.
As a result, when the candidates are labeled as true keyphrases or not, they are all labeled 0:

for candidate in model.instances:
    if candidate in references[doc_id]:
        training_classes.append(1)
    else:
        training_classes.append(0)
    training_instances.append(model.instances[candidate])

I have not modified the code of candidate_selection(), and the same problem appears when I use Kea. Could you please tell me why?

def candidate_selection(self, NP='^((JJ|NN) ){,2}NN$', NP_IN_NP='^((JJ|NN) )?NN IN ((JJ|NN) )?NN$'):
    self.ngram_selection(n=4)
    self.candidate_filtering(stoplist=list(string.punctuation) +
                             ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-'])
    for k, v in self.candidates.items():
        valid_surface_forms = []
        for i in range(len(v.pos_patterns)):
            pattern = ' '.join([u[:2] for u in v.pos_patterns[i]])
            if re.search(NP, pattern) or re.search(NP_IN_NP, pattern):
                valid_surface_forms.append(i)
        if not valid_surface_forms:
            del self.candidates[k]
        else:
            self.candidates[k].surface_forms = [v.surface_forms[i] for i in valid_surface_forms]
            self.candidates[k].offsets = [v.offsets[i] for i in valid_surface_forms]
            self.candidates[k].pos_patterns = [v.pos_patterns[i] for i in valid_surface_forms]

Thanks a lot!!

Getting ZeroDivisionError in pke.unsupervised.MultipartiteRank() during candidate_weighting when reading input from a string (not loaded from a file) on Ubuntu, Python 3

I am trying to extract keyphrases from a string.

The Code snippet is as follows:

extractor = pke.unsupervised.MultipartiteRank()
extractor.read_text(input_text)
extractor.ngram_selection(n=5)
# extractor.read_document(format='raw')
extractor.candidate_weighting(alpha=1.1,
                              threshold=0.74,
                              method='average')
keyphrases = extractor.get_n_best(n=100, stemming=False, redundancy_removal=True)

Error Message is as follows:

In [55]: extractor.candidate_weighting(alpha=1.1,
    ...:                               threshold=0.74,
    ...:                               method='average')
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-55-5f01a411dccc> in <module>()
      1 extractor.candidate_weighting(alpha=1.1,
      2                               threshold=0.74,
----> 3                               method='average')

/usr/local/lib/python3.5/dist-packages/pke/unsupervised/graph_based.py in candidate_weighting(self, threshold, method, alpha)
    572
    573         # build the topic graph
--> 574         self.build_topic_graph()
    575
    576         if alpha > 0.0:

/usr/local/lib/python3.5/dist-packages/pke/unsupervised/graph_based.py in build_topic_graph(self)
    496             gap -= len(self.candidates[node_j].lexical_form) - 1
    497
--> 498             weights.append(1.0 / gap)
    499
    500         # add weighted edges

ZeroDivisionError: float division by zero

If I use the snippet below instead, i.e., reading from a file, it does not throw any error.

extractor = pke.unsupervised.MultipartiteRank(aFilePath)
extractor.ngram_selection(n=5)
extractor.read_document(format='raw')

Please provide suggestions/guidance to fix this issue.
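One possible cause: ngram_selection produces many overlapping candidates, which can leave a zero positional gap between two nodes of the topic graph. A sketch of the call order from the README, which avoids this in my tests (assuming a pke version whose load_document accepts a string):

import pke

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input=input_text, language='en')

# use the model's own candidate selection instead of raw n-grams
extractor.candidate_selection()
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
keyphrases = extractor.get_n_best(n=100, stemming=False, redundancy_removal=True)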

E050

Tried to load a text file using

extractor = pke.unsupervised.KPMiner()

# 2. load the content of the document.
extractor.load_document(input=r'C:\Users\mahadev\Downloads\401KPlan.txt', language='en')

and I am getting the following error:
[E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

What am I doing wrong?

Issue when using MultipartiteRank

Hello @boudinfl ,
I have an issue when I use MultipartiteRank; I get

File "xml_to_kw.py", line 11, in
extractor.candidate_weighting()
File "/home/matthias/.local/lib/python2.7/site-packages/pke/unsupervised/graph_based.py", line 481, in candidate_weighting
self.weight_adjustment(alpha)
File "/home/matthias/.local/lib/python2.7/site-packages/pke/unsupervised/graph_based.py", line 442, in weight_adjustment
for start, end in self.graph.edges_iter(first):
AttributeError: 'DiGraph' object has no attribute 'edges_iter'

Here is my code :

import pke
import sys
import os

extractor = pke.unsupervised.MultipartiteRank(input_file= sys.argv[1], language='french')
extractor.build_topic_graph()
extractor.read_document(format='corenlp', use_lemmas=False, stemmer='french')
extractor.candidate_selection()
extractor.candidate_weighting()
extractor.topic_clustering()
extractor.weight_adjustment()
tuples = extractor.get_n_best(n=60)
print(tuples)

I think it's a problem with the order in which I call the functions, but I can't find an example of that... :(

Thanks

Moving reading/loading document from initializer

Moving reading/loading document from

extractor = pke.unsupervised.TopicRank(input_file='/path/to/input')
extractor.read_document(format='raw')

to

extractor = pke.unsupervised.TopicRank()
extractor.read_document(input_file='/path/to/input', format='raw')

Issue importing pke

After installing pke with

pip install git+https://github.com/boudinfl/pke.git

it seems like pke is not imported correctly. More specifically, pke resolves to its containing file and not to the package, thus nothing is found, at least for me.

I am on Ubuntu with Python 2.7 and I looked into the directory where pke is installed (/usr/local/lib/python2.7/dist-packages/pke); its content is the same as the pke folder in this repository. To be honest, I don't really understand the error. It seems like it is expecting/finding a pke.py file, but I don't see any.

Pip install results in ImportErrors.

pip install git+https://github.com/boudinfl/pke.git results in:

In [1]: import pke
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-43231e3b5697> in <module>()
----> 1 import pke

/Users/martijn/anaconda/envs/py3k/lib/python3.5/site-packages/pke/__init__.py in <module>()
----> 1 from base import *
      2 from utils import *
      3 from readers import *
      4 from unsupervised import *
      5 from supervised import *

ImportError: No module named 'base'

Error with TextRank

I get the following error when extracting keyphrases from file 2000_10_09-13_00_00-JornaldaTarde-8-topic-seg.txt-Nr1 that appears in the 110-PT-BN-KP dataset.

INFO:root:running [output/110-PT-BN-KP//110-PT-BN-KP.TextRank.json]
INFO:root:extracting keyphrases from [2000_10_09-13_00_00-JornaldaTarde-8-topic-seg.txt-Nr1]
Traceback (most recent call last):
  File "run.py", line 272, in <module>
    extractor.candidate_weighting()
  File "/usr/local/lib/python3.6/site-packages/pke/unsupervised/graph_based/textrank.py", line 239, in candidate_weighting
    self.weights[k] = sum([w[t] for t in tokens])
  File "/usr/local/lib/python3.6/site-packages/pke/unsupervised/graph_based/textrank.py", line 239, in <listcomp>
    self.weights[k] = sum([w[t] for t in tokens])
KeyError: 'encontr'

ValueError: The number of observations cannot be determined on an empty distance matrix.

Hi

I'm trying to run python examples/keyphrase-extraction.py but I keep getting an error saying

  File "examples/keyphrase-extraction.py", line 25, in <module>
    method='average')
  File "/Volumes/Data/Projects/pke/src/pke/pke/unsupervised/graph_based/topicrank.py", line 203, in candidate_weighting
    self.topic_clustering(threshold=threshold, method=method)
  File "/Volumes/Data/Projects/pke/src/pke/pke/unsupervised/graph_based/topicrank.py", line 155, in topic_clustering
    Z = linkage(Y, method=method)
  File "/Volumes/Data/Projects/pke/lib/python3.7/site-packages/scipy/cluster/hierarchy.py", line 1112, in linkage
    n = int(distance.num_obs_y(y))
  File "/Volumes/Data/Projects/pke/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2384, in num_obs_y
    raise ValueError("The number of observations cannot be determined on "
ValueError: The number of observations cannot be determined on an empty distance matrix.

Anyone know why this is the case?

Thanks

ValueError: The number of observations cannot be determined on an empty distance matrix.

I'm facing this particular error while running the code below.

import pke
import string
from nltk.corpus import stopwords

# 1. create a MultipartiteRank extractor.


pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')
stoplist += stop

# 2. load the content of the document.
for i, item in enumerate(df["Complaints"]):
    extractor = pke.unsupervised.MultipartiteRank()
    
    extractor.load_document(item)

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.

    try:
        extractor.candidate_selection(pos=pos, stoplist=stoplist)
        extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
        keyphrases = extractor.get_n_best(n=10)
        print(keyphrases)
    except ZeroDivisionError:
        print()

Error


ValueError                                Traceback (most recent call last)
<ipython-input-24-03b5c5c29248> in <module>
     23     try:
     24         extractor.candidate_selection(pos=pos, stoplist=stoplist)
---> 25         extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
     26         keyphrases = extractor.get_n_best(n=10)
     27         print(keyphrases)

~/.local/lib/python3.5/site-packages/pke/unsupervised/graph_based/multipartiterank.py in candidate_weighting(self, threshold, method, alpha)
    213 
    214         # cluster the candidates
--> 215         self.topic_clustering(threshold=threshold, method=method)
    216 
    217         # build the topic graph

~/.local/lib/python3.5/site-packages/pke/unsupervised/graph_based/multipartiterank.py in topic_clustering(self, threshold, method)
    102 
    103         # compute the clusters
--> 104         Z = linkage(Y, method=method)
    105 
    106         # form flat clusters

~/.local/lib/python3.5/site-packages/scipy/cluster/hierarchy.py in linkage(y, method, metric, optimal_ordering)
   1110                          "finite values.")
   1111 
-> 1112     n = int(distance.num_obs_y(y))
   1113     method_code = _LINKAGE_METHODS[method]
   1114 

~/.local/lib/python3.5/site-packages/scipy/spatial/distance.py in num_obs_y(Y)
   2382     k = Y.shape[0]
   2383     if k == 0:
-> 2384         raise ValueError("The number of observations cannot be determined on "
   2385                          "an empty distance matrix.")
   2386     d = int(np.ceil(np.sqrt(k * 2)))

ValueError: The number of observations cannot be determined on an empty distance matrix.

Please let me know if somebody has resolved this issue.
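Until the underlying issue is fixed, a defensive sketch that skips documents yielding too few candidates (extractor.candidates is the dict pke fills during candidate selection):

extractor.candidate_selection(pos=pos, stoplist=stoplist)

# clustering needs at least two candidates to build a distance matrix
if len(extractor.candidates) < 2:
    print('skipping document: not enough candidates')
else:
    try:
        extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
        print(extractor.get_n_best(n=10))
    except (ZeroDivisionError, ValueError):
        print('skipping document: degenerate topic graph')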

`extractor.load_document` (Spacy) limitation of 1000000 characters

extractor.load_document() uses spacy, which has a limit of 1,000,000 characters. This can be overridden by setting nlp.max_length. Please see the full error message below.

Could you provide a parameter to manually set nlp.max_length, or an accessor to the spacy_doc = nlp(text) call used in RawTextReader.read()?

Text of length xxxx exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
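As a workaround, a sketch that parses the long text yourself with a raised limit and hands the resulting Doc to pke (assuming your pke version accepts a spacy Doc as input; note that parsing very long texts needs a lot of memory):

import pke
import spacy

nlp = spacy.load('en_core_web_sm')
# raise the character limit before parsing
nlp.max_length = len(text) + 1

doc = nlp(text)

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=doc, language='en')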

How to set up a training set in a supervised method?

From the readme, I know how to set up the training files; for example, I have n papers as training data. But I don't know how to use the keyphrases that the authors of these n papers have assigned. In examples/train-model.py, I understand the following code:

pke.train_supervised_model(input_dir=input_dir,
                           reference_file=reference_file,
                           model_file=output_mdl,
                           df=df_counts,
                           format="corenlp",
                           use_lemmas=False,
                           stemmer="porter",
                           model=pke.supervised.Kea(),
                           language='english',
                           extension="xml")

input_dir points to the collection of n papers. I don't know what reference_file and use_lemmas mean, nor how the authors' keyphrases for the n papers are used to train the Kea model in a supervised way, because you only specify the n papers, not their keyphrases. I don't know whether reference_file is the keyphrase document. So I am eager to know how the keyphrases are used in the code; I would be grateful if you could explain it in detail.

stat: path too long for Windows

Hello!
I've received an exception about a too-long path on Windows when calling extractor.load_document(input=content, language='en', normalization='lemmatization'):
Exception: stat: path too long for Windows

For the input, I used the text content of an article (https://en.wikipedia.org/wiki/London). It seems that the os.path.isfile(input) call in load_document() throws this exception. Could you add a boolean is_file parameter to avoid this when long content is passed as input?

Expected 2D array, got 1D array instead

Hi, I am facing a problem when I try to train the KEA model with the files provided in the example.

Traceback (most recent call last):

  File "<ipython-input-7-322b61673763>", line 1, in <module>
    runfile('C:/Users/123/Desktop/PKE/train/test/Train_model.py', wdir='C:/Users/123/Desktop/PKE/train/test')

  File "C:\Users\123\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "C:\Users\123\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/123/Desktop/PKE/train/test/Train_model.py", line 32, in <module>
    model=pke.supervised.Kea())

  File "C:\Users\123\Anaconda3\lib\site-packages\pke\utils.py", line 216, in train_supervised_model
    model_file=model_file)

  File "C:\Users\123\Anaconda3\lib\site-packages\pke\supervised\feature_based\kea.py", line 168, in train
    clf.fit(training_instances, training_classes)

  File "C:\Users\123\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 585, in fit
    X, y = check_X_y(X, y, 'csr')

  File "C:\Users\123\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 749, in check_X_y
    estimator=estimator)

  File "C:\Users\123\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 549, in check_array
    "if it contains a single sample.".format(array))

ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I am not sure what the problem is.
By the way, I face another problem when I try to train the model with my own data (which is in 'raw' format).
No error was given, but I get the following warning:

naive_bayes.py:465: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = (np.log(self.class_count_) -

And when I test with the trained model, the resulting keyphrases all have zero scores.

Thanks in advance for your help.

Feature request: pos_blacklist

Hiya -- first off, this library is amazing. Thank you so much for creating it 🙏

It would be immensely useful to add a parameter to the candidate_filtering function called something like pos_blacklist.

The idea is that one could filter out, e.g., all adverbs, or all patterns like "NNS VBZ".

I'm having to do this manually in a loop right now after candidate selection, which is suboptimal for a variety of reasons. Hopefully this request isn't too usecase-specific, or I'm not just missing an easy way to do this -- feel free to close if I am.
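For reference, the manual loop I mean looks roughly like this, dropping every candidate whose POS patterns contain a blacklisted tag (candidate objects expose pos_patterns, one tag sequence per surface form):

blacklist = {'ADV'}

extractor.candidate_selection()

for k in list(extractor.candidates):
    patterns = extractor.candidates[k].pos_patterns
    # drop the candidate if any surface form contains a blacklisted tag
    if any(tag in blacklist for pattern in patterns for tag in pattern):
        del extractor.candidates[k]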

Not getting keyphrases as a whole

When I try to extract keyphrases using TopicRank, I get topics which are incomplete.

For example, for this text

"In April, 2015, an earthquake of magnitude 7.8 on the Richter scale struck Nepal causing sheer havoc, destruction and death. The earthquake left over 8,000 people dead and injured more than 21,000. In the aftermath of the earthquake, the Government of India along with the Indian Armed Forces launched rescue and relief operations in Nepal and termed it as Operation Maitri. Nepali ex-servicemen of the Indian Armed Forces also joined in the operations for guidance in the difficult terrain. Indian Armed Forces were the first to reach Nepal with a helping hand. The helping hand included aircraft from the Indian Air Force and soldiers from the Indian Army for rescue mission on ground. Medical units were also pressed in for relief and engineering units were deployed to clear debris and roads. Besides transporting essential items to stranded citizens of Nepal, the army succeeded in rescuing thousands of people trapped in remote locations."

It returned the following keyphrases:

indian arm forc -> 0.08056242226198376,
nepal -> 0.07927132788979997
relief oper -> 0.05869478479656734
earthquak -> 0.0531291849749921
rescu -> 0.04721509921710143

So the actual keyphrases should be:
Indian armed force
Nepal
Relief Operation
Earthquake and
Rescue

This is my code:

extractor = pke.TopicRank(input_file='data/data.txt')
extractor.read_document(format='raw')
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

So is there any problem in the package, or is my interpretation wrong?
If the output is correct, is there any way to get output similar to the one I have shown?
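A note on the stemmed output: get_n_best returns the stemmed lexical forms unless told otherwise; a sketch using the stemming flag that appears elsewhere in these examples to get surface forms back:

# return surface forms instead of stemmed forms
keyphrases = extractor.get_n_best(n=5, stemming=False)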

python version to use pke

Hi,
I'm trying to implement a supervised algorithm using pke.
I would like to know whether it works with Python 3.7. I'm not able to install it using pip3 install pke (Could not find a version that satisfies the requirement pke (from versions: ); No matching distribution found for pke), and even the command "pip install git+https://github.com/boudinfl/pke.git" does not work.
Thank you in advance

How to load a document that is too long and has some non-utf8 bytes?

I am trying to use pke to extract keywords from papers containing some non-utf8 bytes.

When I use the function load_document(), I got a problem:

  1. If I use the path as input, it raises an exception that some bytes can't be decoded, and there seems to be nothing I can do to avoid/ignore the error.

  2. When I tried to use the string (the content of the txt file) as input directly, the program threw the exception "the path is too long for windows"; I guess os.stat(path) can't accept such a long string.

In this situation, how can I load such a long and partly non-utf8 file?
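A workaround sketch: decode the bytes yourself, ignoring the invalid ones, and pass the resulting string (for very long texts, combine this with raising spacy's nlp.max_length as described above):

import pke

# decode permissively, dropping undecodable bytes
with open('/path/to/paper.txt', encoding='utf-8', errors='ignore') as f:
    text = f.read()

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=text, language='en')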
