
Comments (5)

ygorg commented on September 27, 2024

EDIT: My bad, we fixed this behaviour a while ago. Please share more code; in Python there should not be memory leaks, but something might be accumulating data?

I think it might be because you use only one TopicRank object. For example, if I process many files, I'll do something like:

# extractor = TopicRank()  # Not that !!!
for d in docs:
    extractor = TopicRank()  # That
    extractor.load_document(d)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=n)

It is very important that the extractor is recreated for each document. extractor.load_document does not reset extractor.candidates, so candidates accumulate with each call to extractor.candidate_selection.
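The accumulation can be seen with a tiny stand-in class (hypothetical, not pke's actual code) that mimics the pattern described above: `load_document` stores the document but never clears `candidates`, so a reused object keeps growing.

```python
# Hypothetical stand-in illustrating the accumulation pattern (not pke's
# real implementation): load_document() does not clear self.candidates.
class Extractor:
    def __init__(self):
        self.candidates = {}

    def load_document(self, doc):
        self.doc = doc  # note: candidates are NOT reset here

    def candidate_selection(self):
        # naive "candidates": every whitespace-separated token
        for word in self.doc.split():
            self.candidates[word] = self.candidates.get(word, 0) + 1

shared = Extractor()
sizes = []
for doc in ["alpha beta", "gamma delta"]:
    shared.load_document(doc)
    shared.candidate_selection()
    sizes.append(len(shared.candidates))

print(sizes)  # grows across documents because old candidates linger
```

With a fresh `Extractor()` inside the loop, each document would see only its own candidates.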

Also please note that pke is a research tool and not suited for production use. I would advise extracting only the code useful for your use case and optimizing it, instead of using pke as is! But I'm very glad to know that you use it at large scale!!

from pke.

NathancWatts commented on September 27, 2024

Sure thing; I am saving the spacy model within the "worker" object here, but I'm re-initializing the PKE extractor between each document.
This is all (effectively) happening within the loop:

def extract_keyphrases(text=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    worker = get_worker()
    # Check if the spacy model is already loaded; if it isn't, load it now and cache it on the worker
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model

    extractor.load_document(input=text, language=lang, stoplist=stoplist, spacy_model=pke_model)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()

    keyphrases = extractor.get_n_best(n=n)
    return keyphrases


ygorg commented on September 27, 2024

The code looks fine. If you try with the "FirstPhrases" extractor (which is simpler), do you still have this issue? And how many candidates are extracted in the documents (if you map len(extractor.candidates) over each document)? Maybe this can give insight.
You can also try preprocessing the documents with spacy and passing these to extract_keyphrases like this:

def preprocess(text, lang='en'):
    worker = get_worker()
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model
    return pke_model(text)

def extract_keyphrases(doc=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()

    extractor.load_document(input=doc, language=lang, stoplist=stoplist)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()

    keyphrases = extractor.get_n_best(n=n)
    return keyphrases

# note: map is lazy in Python 3; wrap in list() to force evaluation
docs = map(preprocess, docs)
kps = map(extract_keyphrases, docs)

Apart from this, I don't know if I can be of any more help :(


NathancWatts commented on September 27, 2024

Thanks for the suggestions! Using "FirstPhrases", the memory leak still appears to occur, but seems to build up a bit slower; however, quite excitingly, when I moved the preprocessing step to a separate function, the memory leak appears to be resolved! (Or at least very significantly reduced.) Could it be that the extractor is hanging on to a copy of the spacy model or something when it is passed into load_document()?
edit: From digging into what the difference could be, it's possible that the memory leak is actually somewhere in RawTextReader. I'm going to let it keep running to make sure the issue is resolved, but after 300 iterations I should have seen it by now. Very exciting! Thank you very much.
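For anyone chasing a similar leak, Python's stdlib tracemalloc can localize which source line is accumulating memory. A minimal sketch, assuming a hypothetical `process()` stand-in in place of the real pipeline step:

```python
import tracemalloc

leaked = []  # stand-in for state that grows across iterations

def process(doc):
    # hypothetical work function; substitute the real extract_keyphrases() call
    leaked.append(doc * 1000)

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(100):
    process("some text ")
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Sort allocation sites by growth; a real leak shows one line climbing steadily
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

Running this around the real per-document loop would have pointed at the accumulating line (e.g. inside RawTextReader) directly.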


ygorg commented on September 27, 2024

Thanks for your experiments! If loading documents beforehand reduces memory usage, then I'm closing this issue for now.

