Comments (1)
To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed to the initialized CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and LSA/LDA/NMF, so that we can actually compare the topic coherence values achieved by all algorithms and then select the one with the highest topic coherence?
It depends. Although we would typically like to use the same corpus/dictionary for all models, that would also mean being constrained to the same types of representations as the other models. Moreover, it means being constrained to the c-TF-IDF representations, whereas you could also use other forms of representations in BERTopic. Personally, and as shown in the mentioned issue, I'm not a big fan of optimizing BERTopic for coherence/diversity, especially since doing so ignores all the additional representations that are integrated in the library. It's always interesting to see papers using BERTopic and measuring coherence on the default pipeline without considering MMR, KeyBERTInspired, PartOfSpeech, and even LLM-based representations.
Also, consider the following. Is the model with the highest coherence actually the best model? What is the definition of the best model in your particular use case? In all honesty, I highly doubt that optimizing for coherence/diversity is the answer here, which is why I typically advise people to first find the metrics that fit their use case. That might also mean, and I hope it does, that human evaluation (for instance, with domain experts) or even your own validation is considered.
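For anyone who still wants to run the shared-corpus comparison from the question, here is a minimal sketch of the idea in plain Python. It is not gensim's CoherenceModel and not part of BERTopic: `umass_coherence` and `topic_diversity` are hypothetical helper names, the coherence is a simplified, averaged UMass-style score based on document co-occurrence, and the toy corpus and topic words are made up for illustration. The point is only that the top words of every model are scored against the same tokenized documents.

```python
import math


def umass_coherence(topics, tokenized_docs, eps=1.0):
    """Simplified, averaged UMass-style coherence (hypothetical helper).

    Every topic is scored against the same tokenized corpus, so the
    values are comparable across models. Words that never occur in the
    corpus are skipped, which is a simplification made for this sketch.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    topic_scores = []
    for words in topics:
        pair_scores = []
        for i in range(1, len(words)):
            for j in range(i):
                freq_j = doc_freq(words[j])
                if freq_j == 0:
                    continue  # word absent from the shared corpus
                joint = doc_freq(words[i], words[j])
                pair_scores.append(math.log((joint + eps) / freq_j))
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    if not topic_scores:
        return float("nan")
    return sum(topic_scores) / len(topic_scores)


def topic_diversity(topics):
    """Share of unique words among all top words (hypothetical helper)."""
    total = sum(len(words) for words in topics)
    if total == 0:
        return float("nan")
    unique = {w for words in topics for w in words}
    return len(unique) / total


# Toy shared corpus: the same tokenized documents are used for every model.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "pet", "food"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
]

# Made-up top words standing in for the output of two different models.
topics_model_a = [["cat", "pet"], ["market", "trade"]]
topics_model_b = [["cat", "market"], ["pet", "trade"]]

score_a = umass_coherence(topics_model_a, docs)
score_b = umass_coherence(topics_model_b, docs)
diversity_a = topic_diversity(topics_model_a)

print(f"model A coherence: {score_a:.3f}, diversity: {diversity_a:.2f}")
print(f"model B coherence: {score_b:.3f}")
```

With BERTopic, the top words would come from whichever representation you choose to evaluate, which is exactly why a single score on the default c-TF-IDF words says little about the other representations. A higher number here still only tells you which topics co-occur more in this corpus, not which model is best for your use case.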