Hi Maarten, When using BERTopic on fetch_20newsgroups dataset to ext

Inconsistency between topic with maximum probability and the predicted one for a document [CLOSED]

maartengr commented on May 18, 2024

Comments (6)

dopc commented on May 18, 2024

I think it is related to HDBSCAN. You can check the original docs here.

MaartenGr commented on May 18, 2024

Unfortunately, soft clustering is still an experimental feature with its fair share of open issues, as you can see if you look through the HDBSCAN repo. At the moment, the probabilities for some documents do not reflect the topics they were assigned to. Having said that, after some testing, it seems that 98.9% of the documents are assigned to the topic with the highest probability. For the ones that aren't, the assigned topic matches their second-highest probability. Fortunately, this means that the probabilities themselves can still be interpreted, although you should be careful about blindly taking the highest probability.
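To check a claim like this on your own model, one option is to compute, for each document, the rank of its assigned topic within its probability row. The sketch below is only an illustration: the helper name `assigned_topic_ranks` is hypothetical, the toy data is made up, and it assumes that column i of `probs` corresponds to topic i and that outliers are labelled -1.

```python
from collections import Counter

def assigned_topic_ranks(topics, probs):
    """Rank of each document's assigned topic within its probability row.

    Rank 1 means the assigned topic has the highest probability, rank 2 the
    second highest, and so on. Outliers (topic -1) are skipped because they
    have no column in the probability matrix.
    Assumption: column i of `probs` corresponds to topic i.
    """
    ranks = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            continue
        assigned = row[topic]
        # Count how many topics have a strictly higher probability.
        higher = sum(1 for p in row if p > assigned)
        ranks.append(higher + 1)
    return ranks

# Toy data, made up for illustration only.
topics = [0, 1, -1, 2]
probs = [
    [0.7, 0.2, 0.1],  # assigned topic 0 is the argmax
    [0.5, 0.4, 0.1],  # assigned topic 1 is only second highest
    [0.1, 0.1, 0.1],  # outlier, skipped
    [0.2, 0.1, 0.6],  # assigned topic 2 is the argmax
]
rank_counts = Counter(assigned_topic_ranks(topics, probs))
print(rank_counts)
```

With real output from `fit_transform`, a large share of rank 1 and a small remainder at rank 2 would be consistent with the behaviour described above.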

arielibaba commented on May 18, 2024

Thank you both for being so responsive.
Indeed, it seems that soft clustering in HDBSCAN still needs improvement.
In the meantime, I will use the concept of exemplar points from their documentation, as it seems to be more reliable.

firmai commented on May 18, 2024

Hey, I'm not entirely sure how to proceed. What is the best way to get probabilities over documents? Currently I have a 100% disconnect between the assigned topics and the highest probabilities. Is this normal?

daviddiazsolis commented on May 18, 2024

Hello everyone, I faced the same issue. After some further research in the HDBSCAN repo, I found some new commits that propose a way around the problem. It is not a perfect solution, but the new probabilities are a much closer match than before.

Following the hdbscan_flat functions in flat.py, the workaround involves three steps.

First, define your own HDBSCAN model:

import hdbscan
import numpy as np  # used below for argmax/amax

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True, cluster_selection_method='eom')

Then, pass that HDBSCAN model to BERTopic:

from bertopic import BERTopic

topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True,
                       hdbscan_model=clusterer)
# We get the old topics and probabilities, which show the issue
topics, probs = topic_model.fit_transform(docs)
# The argmax of each probability row does not always match the assigned topic
that = [np.argmax(probs[r, :]) for r in range(len(probs))]

Finally, get the embeddings from the original model and run the hdbscan_flat functions:

cluster = topic_model.hdbscan_model
embs = topic_model.umap_model.embedding_

from hdbscan.flat import (HDBSCAN_flat,
                          approximate_predict_flat,
                          membership_vector_flat,
                          all_points_membership_vectors_flat)

def n_clusters_from_labels(labels_):
    # Cluster labels run from 0 to n_clusters - 1; -1 marks outliers
    return np.amax(labels_) + 1

n_clusters = n_clusters_from_labels(cluster.labels_)

Next, ask for a flat clustering with the same n_clusters:

clusterer_flat = HDBSCAN_flat(embs, n_clusters=n_clusters,
                              cluster_selection_method='eom')

# We get the new topics and probabilities
topics2 = clusterer_flat.labels_
probs2 = all_points_membership_vectors_flat(clusterer_flat)
that2 = [np.argmax(probs2[r, :]) for r in range(len(probs2))]

Now the argmax of probs2 (that2) matches topics2 much more closely than before.
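To put a number on "much closer", the two argmax lists can be compared against their labels. This is a minimal sketch, not part of the hdbscan or BERTopic API: the helper name `argmax_match_rate` is hypothetical, the toy inputs are made up, and it assumes that column i of the probability matrix corresponds to label i and that outliers are labelled -1.

```python
def argmax_match_rate(labels, probs):
    """Fraction of non-outlier documents whose label equals the argmax of
    their probability row. Returns NaN when every document is an outlier.

    Assumption: column i of `probs` corresponds to label i.
    """
    kept = 0
    matched = 0
    for label, row in zip(labels, probs):
        if label == -1:  # outliers have no column to compare against
            continue
        kept += 1
        best = max(range(len(row)), key=lambda i: row[i])
        if best == label:
            matched += 1
    return matched / kept if kept else float("nan")

# Toy stand-ins for (topics, probs) and (topics2, probs2), made up for illustration.
labels = [0, 1, 1, -1]
probs_old = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
probs_new = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8], [0.5, 0.5]]

rate_old = argmax_match_rate(labels, probs_old)
rate_new = argmax_match_rate(labels, probs_new)
print(f"before: {rate_old:.2f}, after: {rate_new:.2f}")
```

With the variables from the steps above, the same comparison would be `argmax_match_rate(topics, probs)` against `argmax_match_rate(topics2, probs2)`.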

I hope this helps

wsosnowski commented on May 18, 2024

Hi,
the inconsistency during the inference phase is now caused by UMAP rather than HDBSCAN, and there is a very easy fix: create a custom UMAP model with a fixed random_state:

from umap import UMAP
from bertopic import BERTopic
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=True)
