Comments (5)
EDIT: My bad, we fixed this behaviour a while ago. Please share more code; in Python there should not be memory leaks, but there might be something that accumulates data?
I think it might be because you use only one TopicRank object. For example, if I process many files I'll do something like:

```python
# extractor = TopicRank()  # Not that!
for d in docs:
    extractor = TopicRank()  # That
    extractor.load_document(d)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=n)
```

It is very important that the extractor is recreated for each different document: `extractor.load_document` does not reset `extractor.candidates`, so candidates accumulate with each call to `extractor.candidate_selection`.
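The accumulation pitfall can be made concrete with a small stand-in class (a hypothetical `FakeExtractor` that only mimics the candidate-collecting behaviour described above; it is not pke's actual implementation):

```python
from collections import defaultdict

class FakeExtractor:
    """Illustrative stand-in that collects candidates the way a reused extractor would."""
    def __init__(self):
        self.candidates = defaultdict(list)

    def load_document(self, text):
        # Deliberately does NOT clear self.candidates, mirroring the behaviour above.
        self.words = text.split()

    def candidate_selection(self):
        for w in self.words:
            self.candidates[w].append(w)

# Reusing one extractor: the candidate set keeps growing across documents.
reused = FakeExtractor()
sizes = []
for doc in ["alpha beta", "gamma delta", "epsilon zeta"]:
    reused.load_document(doc)
    reused.candidate_selection()
    sizes.append(len(reused.candidates))
print(sizes)  # grows: [2, 4, 6]
```

Creating a fresh extractor inside the loop instead would make each count reflect only the current document.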
Also please note that pke is a research tool and not suited for production use. I would advise extracting only the code useful for your use case and optimizing it, instead of using pke as is! But I'm very glad to know that you use it at large scale!!
Sure thing; I am saving the spacy model within the "worker" object here, but I'm re-initializing the pke extractor for each document.
This is all (effectively) happening within the loop:
```python
import pke
from dask.distributed import get_worker  # assuming a Dask worker context

def extract_keyphrases(text=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    worker = get_worker()
    # Check if the spacy model is already loaded; if it isn't, load it now
    # and cache it on the worker.
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model
    extractor.load_document(input=text, language=lang, stoplist=stoplist, spacy_model=pke_model)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()
    keyphrases = extractor.get_n_best(n=n)
    return keyphrases
```
The code looks fine. If you try with the "FirstPhrases" extractor (which is simpler), do you still have this issue? And how many candidates are extracted in the documents (if you map `len(extractor.candidates)` for each document)? Maybe this can give insight.
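The candidate-count diagnostic suggested above could be sketched as a small helper; the `StubExtractor` below is a minimal stand-in for the pke extractor interface, so the real call (e.g. `candidate_counts(docs, pke.unsupervised.TopicRank)`, adapting `load_document` keywords as needed) is an assumption:

```python
def candidate_counts(docs, make_extractor):
    """Run candidate selection on each document with a FRESH extractor and
    record how many candidates were found, to spot unexpected growth."""
    counts = []
    for d in docs:
        ex = make_extractor()  # fresh extractor per document
        ex.load_document(d)
        ex.candidate_selection()
        counts.append(len(ex.candidates))
    return counts

# Demo with a minimal stub standing in for the extractor interface:
class StubExtractor:
    def __init__(self):
        self.candidates = {}
    def load_document(self, text):
        self._text = text
    def candidate_selection(self):
        self.candidates = {w: None for w in self._text.split()}

print(candidate_counts(["a b c", "d e"], StubExtractor))  # [3, 2]
```

If the per-document counts are stable but memory still climbs, the leak is likely outside candidate storage.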
You can also try to preprocess the documents with spacy and pass these to extract_keyphrases, like this:
```python
import pke
from dask.distributed import get_worker  # assuming a Dask worker context

def preprocess(text, lang='en'):
    worker = get_worker()
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model
    return pke_model(text)

def extract_keyphrases(doc=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=doc, language=lang, stoplist=stoplist)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()
    keyphrases = extractor.get_n_best(n=n)
    return keyphrases

doc = map(preprocess, doc)
kps = map(extract_keyphrases, doc)
```
Apart from this I don't know if I can be of any more help :(
Thanks for the suggestions! Using "FirstPhrases", the memory leak still appears to occur, but it seems to build up a bit more slowly. However, quite excitingly, when I moved the preprocessing step to a separate function, the memory leak appears to be resolved (or at least very significantly reduced)! Could it be that the extractor is hanging on to a copy of the spacy model or something when it is passed into `load_document()`?
edit: From digging into what the difference could be, it's possible that the memory leak is actually somewhere in RawTextReader. I'm going to let it keep running just to make sure the issue is resolved and keep at it, but after 300 iterations I should have seen it by now. Very exciting! Thank you very much.
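One pke-independent way to confirm whether memory really keeps growing across iterations is to snapshot allocations with the stdlib's tracemalloc. This is a generic diagnostic sketch (the leaky example step is invented for illustration, not pke code):

```python
import tracemalloc

def measure_growth(step, iterations=5):
    """Run `step()` repeatedly and return the traced allocation size (bytes)
    after each iteration; a steadily rising sequence suggests a leak."""
    tracemalloc.start()
    sizes = []
    for _ in range(iterations):
        step()
        current, _peak = tracemalloc.get_traced_memory()
        sizes.append(current)
    tracemalloc.stop()
    return sizes

# Example: a deliberately leaky step that keeps appending to a long-lived list.
leaky_store = []
growth = measure_growth(lambda: leaky_store.append("x" * 10_000))
print(growth)  # increases across iterations because the strings stay referenced
```

Wrapping the real per-document extraction call in `step` would show whether memory returns to baseline between documents.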
Thanks for your experiments! If loading documents beforehand reduces memory usage, then I'm closing this issue for now.