Comments (4)
@stolam There is also the option to set calculate_probabilities to False, which definitely helps with resource management and speeds up the solution. In UMAP there is similarly the option to set low_memory to False, which I have found to help with low-resource machines. I am thinking of replacing the calculate_probabilities parameter with a low_memory parameter in order to change both the calculation of probabilities and the low-memory setting in UMAP.
@Kingstonshaw
- This is something I am considering turning on if verbose is set to True, so I do think you will see this in an update in the near future!
- Not yet. This is mainly because the model in its entirety currently cannot be offloaded to the GPU, as not all models in BERTopic have that feature (UMAP & HDBSCAN), although UMAP is nearing that point. Having said that, I might look into replacing sentence-transformers with Flair to make it easier to offload the creation of embeddings.
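One workaround that may help with both points is to compute the embeddings yourself with sentence-transformers (on the GPU if one is available, with its own progress bar) and then pass them to fit_transform. This is only a sketch under assumptions: the helper names pick_device and fit_with_precomputed_embeddings, the model name all-MiniLM-L6-v2, and the batch size are illustrative choices, not something from this thread, and it assumes a BERTopic version whose fit_transform accepts precomputed embeddings. UMAP and HDBSCAN still run on the CPU here.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to offload to.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def fit_with_precomputed_embeddings(docs, model_name="all-MiniLM-L6-v2"):
    """Embed docs up front, then hand the embeddings to BERTopic.

    model_name is an assumed example; any sentence-transformers model works.
    """
    # Imported lazily so the helper above can be used without these packages.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model_name, device=pick_device())
    # show_progress_bar makes it visible whether encoding is still running.
    embeddings = encoder.encode(docs, batch_size=64, show_progress_bar=True)

    topic_model = BERTopic(verbose=True)
    topics, probabilities = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics, probabilities
```

Embedding separately also means the vectors can be saved to disk and reused, so a crash in a later stage does not require re-encoding the whole dataset.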
from bertopic.
@stolam I just released a new version of BERTopic (v0.5) that has a low_memory option built in as a parameter. Set this to True and it should use significantly less memory. Also, calculate_probabilities is now set to False by default to prevent accidental memory issues. Hopefully, this helps you out. If not, please let me know!
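As a rough illustration of the settings described above, the following sketch builds a model with low_memory=True and calculate_probabilities=False. The helper names low_memory_settings and build_topic_model are made up for this example; only the two parameter names come from the comment, and the sketch assumes BERTopic v0.5 or later.

```python
def low_memory_settings(want_probabilities=False):
    """Collect the two memory-related options mentioned in this thread."""
    return {
        "low_memory": True,  # forwarded to UMAP to reduce peak memory use
        "calculate_probabilities": want_probabilities,  # False skips the per-topic probability matrix
    }


def build_topic_model(**overrides):
    """Create a BERTopic model with the memory-friendly settings applied."""
    # Imported lazily so the settings helper can be used without BERTopic installed.
    from bertopic import BERTopic

    settings = low_memory_settings()
    settings.update(overrides)
    return BERTopic(**settings)
```

Calling build_topic_model(verbose=True).fit_transform(docs) would then run the usual pipeline with these settings, where docs is your own list of strings.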
I am having issues with big data as well: the model's fit_transform is very resource intensive and takes hours. I am curious:
1. Is there a way to turn on progress_bar=true while fit-transforming the model, like when using transformers.encode to get embeddings? I don't see a way to tell whether the model is still running or hanging.
2. Are there plans to add options to offload the model to the GPU via torch? I have VRAM to spare.
Any suggestions and feedback would be greatly appreciated!
@MaartenGr Thank you for your reply. It is good to know about the calculate_probabilities option. In my case, the algorithm crashes during the UMAP phase, so I will use the low_memory option (you meant to set it to True, right?) and try to limit the cores; that should improve things.
Just to put things into perspective, my dataset is 20GB and my RAM is 128GB, so I thought I would be OK. The memory consumption was growing slowly from 20 to 40GB and then exploded quickly with UMAP.