
Comments (5)

ygorg commented on September 27, 2024

EDIT: My bad, we fixed this behaviour a while ago. Please share more code; in Python there should not be memory leaks, but something might be accumulating data?

I think it might be because you use only one TopicRank object. For example, if I process many files, I'll do something like:

# extractor = TopicRank()  # Not that !!!
for d in docs:
    extractor = TopicRank()  # That
    extractor.load_document(d)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=n)

It is very important that the extractor is recreated for each document. extractor.load_document does not reset extractor.candidates, so candidates accumulate with each call to extractor.candidate_selection.
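The accumulation can be seen with a tiny stand-in class (hypothetical, not pke's actual code) that mimics the pattern described above: `load_document` stores the document but never clears `candidates`, so a reused object keeps growing.

```python
# Hypothetical stand-in illustrating the accumulation pattern (not pke's
# real implementation): load_document() does not clear self.candidates.
class Extractor:
    def __init__(self):
        self.candidates = {}

    def load_document(self, doc):
        self.doc = doc  # note: candidates are NOT reset here

    def candidate_selection(self):
        # naive "candidates": every whitespace-separated token
        for word in self.doc.split():
            self.candidates[word] = self.candidates.get(word, 0) + 1

shared = Extractor()
sizes = []
for doc in ["alpha beta", "gamma delta"]:
    shared.load_document(doc)
    shared.candidate_selection()
    sizes.append(len(shared.candidates))

print(sizes)  # grows across documents because old candidates linger
```

With a fresh `Extractor()` inside the loop, each document would see only its own candidates.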

Also please note that pke is a research tool and not suited for production use. I would advise extracting only the code useful for your use case and optimizing it, instead of using pke as is! But I'm very glad to know that you use it at large scale!!

from pke.

NathancWatts commented on September 27, 2024

Sure thing; I am saving the spacy model within the "worker" object here, but I'm re-initializing the PKE extractor between each document.
This is all (effectively) happening within the loop:

def extract_keyphrases(text=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    worker = get_worker()
    # Check if the spacy model is already loaded; if it isn't, load it now and cache it on the worker
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model

    extractor.load_document(input=text, language=lang, stoplist=stoplist, spacy_model=pke_model)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()

    keyphrases = extractor.get_n_best(n=n)
    return keyphrases


ygorg commented on September 27, 2024

The code looks fine. If you try with the "FirstPhrases" extractor (which is simpler), do you still have this issue? And how many candidates are extracted in the documents (if you map len(extractor.candidates) over each document)? Maybe this can give insight.
You can also try preprocessing the documents with spacy and passing these to extract_keyphrases like this:

def preprocess(text, lang='en'):
    worker = get_worker()
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model
    return pke_model(text)

def extract_keyphrases(doc=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()

    extractor.load_document(input=doc, language=lang, stoplist=stoplist)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()

    keyphrases = extractor.get_n_best(n=n)
    return keyphrases

# note: map is lazy in Python 3; wrap in list() to force evaluation
docs = map(preprocess, docs)
kps = map(extract_keyphrases, docs)

Apart from this, I don't know if I can be of any more help :(


NathancWatts commented on September 27, 2024

Thanks for the suggestions! Using "FirstPhrases", the memory leak still appears to occur, but seems to build up a bit slower; however, quite excitingly, when I moved the preprocessing step to a separate function, the memory leak appears to be resolved! (Or at least very significantly reduced.) Could it be that the extractor is hanging on to a copy of the spacy model or something when it is passed into load_document()?
edit: From digging into what the difference could be, it's possible that the memory leak is actually somewhere in RawTextReader. I'm going to let it keep running to make sure the issue is resolved, but after 300 iterations I should have seen it by now. Very exciting! Thank you very much.
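For anyone chasing a similar leak, Python's stdlib tracemalloc can localize which source line is accumulating memory. A minimal sketch, assuming a hypothetical `process()` stand-in in place of the real pipeline step:

```python
import tracemalloc

leaked = []  # stand-in for state that grows across iterations

def process(doc):
    # hypothetical work function; substitute the real extract_keyphrases() call
    leaked.append(doc * 1000)

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(100):
    process("some text ")
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Sort allocation sites by growth; a real leak shows one line climbing steadily
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

Running this around the real per-document loop would have pointed at the accumulating line (e.g. inside RawTextReader) directly.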


ygorg commented on September 27, 2024

Thanks for your experiments! If loading documents beforehand reduces memory usage, then I'm closing this issue for now.

