Comments (3)
Ah right, then we would calculate the distance matrix ourselves based on the metric that has been set within HDBSCAN. I think it's important to add checks here so that a missing "metric" parameter does not raise an error, or so that the metric is determined automatically.
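One way such a check could look is sketched below. `resolve_metric` is a hypothetical helper name, the "euclidean" fallback is an assumption rather than something decided in this thread, and scikit-learn's DBSCAN and KMeans are used only as stand-ins for a clustering model with and without a "metric" parameter.

```python
from sklearn.cluster import DBSCAN, KMeans


def resolve_metric(cluster_model, default="euclidean"):
    """Hypothetical helper: return the metric configured on a clustering
    model, or `default` when the model does not define one."""
    # Not every clustering model follows the scikit-learn estimator API.
    if not hasattr(cluster_model, "get_params"):
        return default
    # Models such as KMeans have no "metric" parameter, so .get() returns None.
    metric = cluster_model.get_params().get("metric")
    if metric is None:
        return default
    return metric


print(resolve_metric(DBSCAN(metric="manhattan")))  # manhattan
print(resolve_metric(KMeans(n_clusters=2)))        # euclidean (assumed fallback)
print(resolve_metric(object()))                    # euclidean (assumed fallback)
```

Falling back to a default keeps models without a "metric" parameter working, although raising a clear error instead would be an equally reasonable choice.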
Your work on this would be greatly appreciated!
from bertopic.
Thank you for sharing this extensive description of this use case! I agree that it would be nice to have something like this implemented, although I am curious how many users would end up using this feature.
Having said that, you can already pass the distance matrix to BERTopic and then simply skip over dimensionality reduction (as you already did before) in order to make this work. It would, however, introduce issues with topic embeddings but I'm actually curious about what would happen.
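The general idea can be illustrated outside BERTopic with a rough sketch. It assumes a no-op reduction step (`PassThroughReducer` is a hypothetical class, not a BERTopic API) and uses scikit-learn's DBSCAN with metric="precomputed" as a stand-in for HDBSCAN; the toy points and the eps/min_samples values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances


class PassThroughReducer:
    """Hypothetical no-op reducer: hands its input on unchanged, which is
    the effect of skipping dimensionality reduction."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# Toy data: two well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# The precomputed distance matrix takes the place of the embeddings.
distance_matrix = pairwise_distances(points, metric="euclidean")

# The reduction step leaves the matrix untouched.
reduced = PassThroughReducer().fit(distance_matrix).transform(distance_matrix)

# The clusterer is told that its input already contains distances.
clusterer = DBSCAN(metric="precomputed", eps=0.5, min_samples=2)
labels = clusterer.fit_predict(reduced)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

In this setup nothing downstream of the clusterer ever sees the original vectors, which is where the topic-embedding issues mentioned above would come from.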
Lastly, do you think there is a way to implement this without introducing an HDBSCAN-specific parameter to the initialization of BERTopic? The reason I ask is that my philosophy with BERTopic is to make it as modular as possible, so introducing this parameter might go against that if it is specific to HDBSCAN. Moreover, I want to keep the parameter space in the initialization as small as possible to keep BERTopic user-friendly. I have already seen some information overload happening with the current set of parameters.
What do you think?
Hey @MaartenGr, thank you for answering!
Yes, I think it's possible to implement this. As an initial idea, we can read the metric parameter from HDBSCAN (self.hdbscan_model.get_params()["metric"]) and then define the logic around it. We can leverage scikit-learn's pairwise metrics to compute the distance matrix without adding any extra parameters, which keeps BERTopic modular.
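As a minimal sketch of that idea, assuming a scikit-learn-style clustering model: here `cluster_model` stands in for `self.hdbscan_model` (scikit-learn's DBSCAN is used only because it also exposes a "metric" parameter), and the embeddings are toy values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Stand-in for self.hdbscan_model (assumption: any estimator with a
# "metric" parameter behaves the same way for this lookup).
cluster_model = DBSCAN(metric="manhattan")

# Toy embeddings: three documents in two dimensions.
embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])

# Read the metric from the clustering model instead of adding a new
# parameter, then let scikit-learn compute the pairwise distances.
metric = cluster_model.get_params()["metric"]
distance_matrix = pairwise_distances(embeddings, metric=metric)

print(metric)
print(distance_matrix)
```

The resulting square matrix is symmetric with zeros on the diagonal, and it could then be handed to a clustering model configured for precomputed distances.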
If I get your approval, I can start working on that.