MaartenGr avatar MaartenGr commented on June 18, 2024

To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed to the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and LSA/LDA/NMF, so that we can actually compare the topic coherence values achieved by all algorithms and then select the one with the highest topic coherence?

It depends. Although we would typically like to use the same corpus/dictionary, that would also mean being constrained to the same types of representations as the other models. It also means being constrained to the c-TF-IDF representations, whereas you could use other forms of representation in BERTopic. Personally, and as shown in the mentioned issue, I'm not a big fan of optimizing BERTopic for coherence/diversity, especially since doing so ignores all the additional representations that are integrated in the library. It's always interesting to see papers using BERTopic and measuring coherence on the default pipeline without considering MMR, KeyBERTInspired, PartOfSpeech, or even LLM-based representations.

Also, consider the following. Is the model with the highest coherence actually the best model? What is the definition of the best model in your particular use case? In all honesty, I highly doubt that optimizing for coherence/diversity is the answer here, which is why I typically advise people to first find the metrics that fit their use case. That might also mean, and I hope it does, that human evaluation (for instance, with domain experts) or even your own validation is considered.
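For readers who still want a like-for-like number, here is a minimal sketch of what scoring every model against the same reference corpus could look like. It is not gensim's CoherenceModel and not anything BERTopic ships: it is a simplified, document-level NPMI written in plain Python, and the function names (npmi_coherence, topic_diversity) and the toy data are hypothetical, chosen only for illustration. The point is that the topic words can come from any model or representation (c-TF-IDF, KeyBERTInspired, LDA, NMF, ...) as long as all of them are scored against one shared tokenized corpus.

```python
import math
from itertools import combinations


def npmi_coherence(topics, tokenized_docs):
    """Mean document-level NPMI over all word pairs of all topics.

    topics: list of lists of top words (from any topic model).
    tokenized_docs: the shared reference corpus, one token list per document.
    Simplified illustration only; gensim's coherence measures differ in detail.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for words in topics:
        pair_scores = []
        for w1, w2 in combinations(words, 2):
            p12 = prob(w1, w2)
            if p12 == 0.0:
                pair_scores.append(-1.0)  # the words never co-occur
            elif p12 == 1.0:
                pair_scores.append(0.0)   # both words appear in every document
            else:
                pmi = math.log(p12 / (prob(w1) * prob(w2)))
                pair_scores.append(pmi / -math.log(p12))
        topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)


def topic_diversity(topics):
    """Share of unique words among all top words across topics."""
    all_words = [w for words in topics for w in words]
    return len(set(all_words)) / len(all_words)


# Toy shared corpus (hypothetical data): every model is scored against it.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "trade"],
    ["stock", "market", "bank"],
]

# Top words as they might come out of two different models or representations.
topics_model_a = [["cat", "dog"], ["stock", "market"]]
topics_model_b = [["cat", "stock"], ["dog", "market"]]

score_a = npmi_coherence(topics_model_a, docs)
score_b = npmi_coherence(topics_model_b, docs)
diversity_a = topic_diversity(topics_model_a)
```

Because the reference corpus is fixed, the two scores are comparable with each other, but as argued above, a higher score is not automatically the better model for a given use case.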

from bertopic.
