
Comments (4)

MaartenGr commented on May 22, 2024

@stolam There is also the option to set calculate_probabilities to False, which definitely helps with resource management and speeds things up. In UMAP there is similarly a low_memory option, which I have found to help on low-resource machines. I am thinking of replacing the calculate_probabilities parameter with a low_memory parameter in order to change both the calculation of probabilities and the low-memory setting in UMAP.

@Kingstonshaw

  1. This is something I am considering turning on if verbose is set to True, so I do think you will see this in an update in the near future!

  2. Not yet. This is mainly because the model in its entirety currently cannot be offloaded to the GPU, as not all models in BERTopic (UMAP & HDBSCAN) support that, although UMAP is nearing that point. Having said that, I might look into replacing sentence-transformers with Flair to make it easier to offload the creation of embeddings.


MaartenGr commented on May 22, 2024

@stolam I just released a new version of BERTopic (v0.5) that has a low_memory option built in as a parameter. Set it to True and it should use significantly less memory. Also, calculate_probabilities now defaults to False to prevent accidental memory issues. Hopefully this helps you out; if not, please let me know!


Wktx commented on May 22, 2024

I am having issues with big data as well; fit_transform is very resource-intensive and takes hours. I am curious:

1. Is there a way to turn on a progress bar when fit-transforming the model, like when using encode from sentence-transformers to get embeddings? I don't see a way to tell whether the model is still running or hanging.

2. Are there plans to add an option to offload the model to the GPU via torch? I have VRAM to spare.

Any suggestions and feedback would be greatly appreciated!


stolam commented on May 22, 2024

@MaartenGr Thank you for your reply. It is good to know about the calculate_probabilities option. In my case, the algorithm crashes during the UMAP phase, so I will use the low_memory option (you meant to set it to True, right?) and try to limit the number of cores; that should improve things.
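One way to limit the cores is to cap the native thread pools before numpy and numba are first imported; a sketch using standard environment variables (which knobs matter depends on how your BLAS and numba are built):

```python
import os

# These must be set before numpy, numba, and friends are first imported;
# fewer parallel workers usually also lowers peak memory usage.
os.environ["OMP_NUM_THREADS"] = "4"        # OpenMP-based libraries
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # OpenBLAS-backed numpy
os.environ["NUMBA_NUM_THREADS"] = "4"      # numba (used by UMAP)

# ...only now import bertopic / umap and fit the model.
```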

Just to put things into perspective: my dataset is 20 GB and my RAM is 128 GB, so I thought I would be OK. Memory consumption grew slowly from 20 to 40 GB and then exploded quickly during UMAP.

