
Comments (6)

dopc avatar dopc commented on May 18, 2024

I think this is related to HDBSCAN. You can check the original docs here.

from bertopic.

MaartenGr avatar MaartenGr commented on May 18, 2024

Unfortunately, soft clustering is still an experimental feature with its fair share of open issues, as you can see if you look through the HDBSCAN repo. As of this moment, it does seem that the probabilities for some documents do not represent the topics they were assigned to. Having said that, after some testing, it seems that 98.9% of the probabilities are correctly assigned. For the ones that aren't, the assigned topic matches their second-highest probability. Fortunately, this means that the probabilities themselves can still be interpreted, although you should indeed be careful about blindly taking the highest probability.
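To reproduce this kind of check on your own model, a minimal sketch follows. It assumes that topic ids index the columns of the probability matrix and that -1 marks outlier documents with no column of their own; the function name `top_k_agreement` and the toy data are hypothetical, and the code is plain Python so it runs without any dependencies (pass `probs.tolist()` if you have a NumPy array).

```python
def top_k_agreement(topics, probs, k=1):
    """Fraction of non-outlier documents whose assigned topic is among
    their k most probable topics."""
    hits = 0
    counted = 0
    for topic, row in zip(topics, probs):
        if topic < 0:  # assumption: -1 marks outliers, which have no column
            continue
        # Column indices sorted from most to least probable
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        counted += 1
        if topic in ranked[:k]:
            hits += 1
    return hits / counted if counted else 0.0


# Toy data, made up for illustration only
toy_topics = [0, 1, 2, -1, 1]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],  # assigned topic 1 is only the second-highest here
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],  # outlier, skipped
    [0.1, 0.8, 0.1],
]

top1 = top_k_agreement(toy_topics, toy_probs, k=1)
top2 = top_k_agreement(toy_topics, toy_probs, k=2)
print(top1, top2)
```

If `top1` is well below `top2`, most mismatches are cases where the assigned topic is the runner-up, which is the pattern described above.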

arielibaba avatar arielibaba commented on May 18, 2024

Thank you both for being so responsive.
Indeed, it seems that soft clustering in HDBSCAN still needs to be improved.
In the meantime, I will rely on the concept of exemplar points described in the documentation, as it seems to be more reliable.
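One possible reading of the exemplar idea is a distance-based membership: for each cluster, take the distance from a point to that cluster's nearest exemplar, and weight clusters by the inverse of that distance. The sketch below is an illustration under that assumption, not the library's implementation. The function name `exemplar_membership` and the toy exemplars are hypothetical; in practice the exemplars would presumably come from the fitted clusterer (the hdbscan library exposes an `exemplars_` attribute for this, as far as I know).

```python
import math


def exemplar_membership(point, exemplars_by_cluster):
    """Return one weight per cluster, proportional to the inverse of the
    distance from `point` to that cluster's nearest exemplar."""
    nearest = [
        min(math.dist(point, e) for e in exemplars)
        for exemplars in exemplars_by_cluster
    ]
    # A point that coincides with an exemplar belongs fully to that cluster
    if any(d == 0.0 for d in nearest):
        hits = [1.0 if d == 0.0 else 0.0 for d in nearest]
        return [h / sum(hits) for h in hits]
    inverse = [1.0 / d for d in nearest]
    total = sum(inverse)
    return [w / total for w in inverse]


# Toy exemplars for two clusters, made up for illustration only
toy_exemplars = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(4.0, 0.0), (5.0, 0.0)],
]
weights = exemplar_membership((1.0, 0.0), toy_exemplars)
print(weights)
```

The weights sum to 1, so they can be compared across clusters in the same way as the soft-clustering probabilities.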

firmai avatar firmai commented on May 18, 2024

Hey, I'm not entirely sure how to proceed. What is the best way to get probabilities over documents? Currently I have a 100% disconnect; is this normal?

daviddiazsolis avatar daviddiazsolis commented on May 18, 2024

Hello everyone, I faced the same issue. After some further research in the HDBSCAN repo, I found some new commits that propose a way around the problem. It is not a perfect solution, but the new probabilities match the topics much more closely than before.

Following the hdbscan_flat functions in flat.py, the solution involves three steps:

  1. Use your own HDBSCAN model, so first define one:

import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True, cluster_selection_method='eom')

  2. Then pass that HDBSCAN model to BERTopic:

import numpy as np
from bertopic import BERTopic

topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True,
                       hdbscan_model=clusterer)
# We get the old topics and probabilities, which still show the issue
topics, probs = topic_model.fit_transform(docs)
# The most probable topic per document does not match `topics`
that = [np.argmax(probs[r, :]) for r in range(len(probs))]

  3. Get the embeddings from the fitted model and run the hdbscan_flat functions:

cluster = topic_model.hdbscan_model
embs = topic_model.umap_model.embedding_

from hdbscan.flat import (HDBSCAN_flat,
                          approximate_predict_flat,
                          membership_vector_flat,
                          all_points_membership_vectors_flat)

def n_clusters_from_labels(labels_):
    return np.amax(labels_) + 1

n_clusters = n_clusters_from_labels(cluster.labels_)

# Ask for a flat clustering with the same n_clusters
clusterer_flat = HDBSCAN_flat(embs, n_clusters=n_clusters,
                              cluster_selection_method='eom')

# We get the new topics and probabilities
topics2 = clusterer_flat.labels_
probs2 = all_points_membership_vectors_flat(clusterer_flat)
that2 = [np.argmax(probs2[r, :]) for r in range(len(probs2))]

Now that2 (the argmax of probs2) matches topics2 much more closely.

I hope this helps.

wsosnowski avatar wsosnowski commented on May 18, 2024

Hi,
the inconsistency during the inference phase is now caused by UMAP rather than HDBSCAN, and there is a very easy fix: create a custom UMAP model and fix its random_state:

from umap import UMAP
from bertopic import BERTopic
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=True)
