When using big data, it becomes infeasible to hold everything in memory at once. W


Make the algorithm less memory intensive (bertopic, CLOSED)

maartengr commented on May 22, 2024

Make the algorithm less memory intensive


Comments (4)

MaartenGr commented on May 22, 2024 1

@stolam There is also the option to set calculate_probabilities to False, which definitely helps with resource management and speeds up the solution. In UMAP there is similarly the option to set low_memory to False, which I have found to help on low-resource machines. I am thinking of replacing the calculate_probabilities parameter with a low_memory parameter that would change both the calculation of probabilities and the low-memory setting in UMAP.

@Kingstonshaw

1. This is something I am considering turning on if verbose is set to True, so I do think you will see this in an update in the near future!
2. Not yet. The model in its entirety currently cannot be offloaded to the GPU because not all models in BERTopic have that feature (UMAP & HDBSCAN), although UMAP is nearing that point. Having said that, I might look into replacing sentence-transformers with Flair to make it easier to offload the creation of embeddings.
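
Until something like this is built in, one possible workaround is to create the embeddings separately, where both a progress bar and a GPU are available. The snippet below is only a sketch: `embed_with_progress` is a hypothetical helper, the model name is an arbitrary example, and it assumes sentence-transformers is installed.

```python
def embed_with_progress(docs, model_name="all-MiniLM-L6-v2", device=None):
    """Encode documents with a visible progress bar (hypothetical helper).

    device=None lets sentence-transformers pick a device itself;
    pass "cuda" to request the GPU explicitly.
    """
    # Imported lazily so the helper can be defined without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device=device)
    # show_progress_bar is the encode() flag that displays batch progress.
    return model.encode(docs, show_progress_bar=True)
```

If the installed BERTopic version accepts precomputed embeddings in `fit_transform` (an assumption worth checking for your version), the result can be passed along with the documents so that only UMAP and HDBSCAN run on the CPU.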


MaartenGr commented on May 22, 2024 1

@stolam I just released a new version of BERTopic (v0.5) that has a built-in low_memory option as a parameter. Set this to True and it should use significantly less memory. Also, calculate_probabilities is now set to False by default to prevent accidental memory issues. Hopefully, this helps you out. If not, please let me know!
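
For readers landing here, the two settings mentioned above might be combined roughly as follows. This is a minimal sketch: `LOW_MEMORY_KWARGS` and `make_topic_model` are hypothetical names, and only the two parameter names come from this thread.

```python
# Both parameter names are taken from the comment above.
LOW_MEMORY_KWARGS = {
    "low_memory": True,                # trade some speed for lower memory use
    "calculate_probabilities": False,  # the default as of v0.5, per the comment above
}


def make_topic_model(**overrides):
    """Build a BERTopic model with the low-memory settings (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without bertopic installed.
    from bertopic import BERTopic

    return BERTopic(**{**LOW_MEMORY_KWARGS, **overrides})
```

Calling `make_topic_model().fit_transform(docs)` on a list of strings would then fit the model with these settings applied.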


Wktx commented on May 22, 2024

I am having issues using big data as well; fit-transforming the model is very resource intensive and takes hours. I am curious:

1. Is there a way to turn on progress_bar=true when fit-transforming the model, like when using transformers.encode to get embeddings? I don't see a way to tell whether the model is still running or hanging.

2. Are there plans to add options to offload the model to the GPU via torch? I have VRAM to spare.

Any suggestion and feedback would be greatly appreciated!


stolam commented on May 22, 2024

@MaartenGr Thank you for your reply. It is good to know about the calculate_probabilities option. In my case, the algorithm crashes during the UMAP phase, so I will use the low_memory option (you meant to set it to True, right?) and try to limit the cores; that should improve things.
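
A rough sketch of that plan when using UMAP directly is shown below. It rests on assumptions: it relies on UMAP's parallel code running through Numba, which reads the `NUMBA_NUM_THREADS` environment variable at import time, and the thread count of 4 and the names `UMAP_KWARGS` and `make_umap` are purely illustrative.

```python
import os

# Assumption: this must be set before umap/numba are first imported,
# because Numba reads it at import time. 4 is an arbitrary example value.
os.environ.setdefault("NUMBA_NUM_THREADS", "4")

# The UMAP option discussed in this thread.
UMAP_KWARGS = {"low_memory": True}


def make_umap(**overrides):
    """Build a UMAP reducer with low_memory enabled (hypothetical helper)."""
    # Imported lazily so the thread limit above is already in place.
    from umap import UMAP

    return UMAP(**{**UMAP_KWARGS, **overrides})
```

Whether a custom UMAP instance can be handed to BERTopic depends on the installed version, so treat this as a standalone UMAP example rather than a BERTopic recipe.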

Just to put things into perspective, my dataset is 20GB, my RAM is 128GB so I thought I would be OK. The memory consumption was growing slowly from 20 to 40GB and then exploded quickly with UMAP.

