topictuner's Introduction

TopicTuner — Tune BERTopic HDBSCAN Models

To install from PyPi :

pip install topicmodeltuner

The Problem

Out of the box, BERTopic relies on HDBSCAN to cluster topics. Two of the most important HDBSCAN parameters, min_cluster_size and min_samples, almost always have a dramatic effect on cluster formation: they dictate the number of clusters created, including the -1, or uncategorized, cluster. While for some datasets a large number of uncategorized documents may be the right clustering, in practice BERTopic will often discard a large percentage of "good" documents and exclude them from cluster and topic formation.

HDBSCAN is quite sensitive to the values of these two parameters relative to the text being clustered. With BERTopic's default of min_topic_size=10 (which is passed to HDBSCAN as min_cluster_size), the default parameters will more often than not produce an unmanageable number of topics, as well as a sub-optimal number of uncategorized documents. Additionally, documents assigned to the -1 category are not used to determine topic vocabulary.

The Solution

TopicTuner provides a TopicModelTuner class, a convenience wrapper for BERTopic models that efficiently manages the process of discovering optimized min_cluster_size and min_samples parameters, providing:

  • Random and grid search functionality to quickly discover optimized parameters for a given BERTopic model.
  • An internal datastore that records all searches for a given model, making parameter selection fast and easy.
  • Visualizations to assist in parameter tuning and selection.
  • Two-way import/export functionality, so you can start from scratch or from an existing BERTopic model, and export a BERTopic model with optimized parameters at the end of your session.
  • Save and load for persistence.

To get you started, this release includes both a demo notebook and API documentation.
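
Below is a minimal sketch of a tuning session, assuming the method names that appear in the demo notebook and in the issues below (createEmbeddings, reduce, pseudoGridSearch, visualizeEmbeddings); the import path and exact signatures may differ slightly from the released API.

from topictuner import TopicModelTuner as TMT  # import path assumed

tmt = TMT()                      # optionally pass embedding_model=...
tmt.createEmbeddings(docs=docs)  # embed the documents
tmt.reduce()                     # UMAP dimensionality reduction used for clustering

# Sweep min_cluster_size 62-70, trying min_samples as 10%..100% of each size.
lastRunResultsDF = tmt.pseudoGridSearch([*range(62, 71)],
                                        [x / 100 for x in range(10, 101, 10)])

# Inspect a promising parameter pair visually (values here are illustrative),
# then use the export functionality described above to produce a tuned BERTopic model.
tmt.visualizeEmbeddings(65, 33).show()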

topictuner's People

Contributors: drob-xx

topictuner's Issues

Using topictuner with a pretrained model

I have been playing with TopicTuner and wanted to use the gbert-large language model from Hugging Face. Unfortunately, I got an error message: 'FeatureExtractionPipeline' object has no attribute 'encode'.

My code is:


from transformers.pipelines import pipeline
gbert = pipeline("feature-extraction", model="deepset/gbert-large")
...
tmt = TMT(embedding_model = gbert)
tmt.createEmbeddings(docs=docs)

However, using a sentence-transformers model worked fine. Is the issue caused by using the Hugging Face pipeline? And do you have a suggestion for a workaround?

Thanks a lot!
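
A hedged workaround sketch, not a confirmed fix: the tuner appears to call .encode() on the embedding model (see the traceback in a later issue), which transformers pipelines do not provide. Two possible routes are loading the same checkpoint through sentence-transformers, or computing the embeddings yourself and assigning them directly; assigning tmt.embeddings is the pattern reported to work in a later issue, everything else here is an assumption.

# Option 1: load the checkpoint via sentence-transformers, which exposes .encode()
# (for plain Hugging Face models it adds a mean-pooling layer automatically).
from sentence_transformers import SentenceTransformer
gbert_st = SentenceTransformer("deepset/gbert-large")
tmt = TMT(embedding_model=gbert_st)
tmt.createEmbeddings(docs=docs)

# Option 2: compute the embeddings yourself and hand them to the tuner directly.
embeddings = gbert_st.encode(docs, show_progress_bar=True)
tmt = TMT()
tmt.embeddings = embeddings
tmt.docs = docs  # docs are needed later, e.g. for visualizations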

Curious why tmt.reduce() is faster than BERTopic's original UMAP step?

For the same docs, dimensionality reduction in BERTopic took 1.5 hours, but tmt.reduce() took only a little over 10 minutes.

The following is the output of tmt.reduce():

UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_neighbors=5, random_state=473921, verbose=2)
Wed Jan 10 00:40:47 2024 Construct fuzzy simplicial set
Wed Jan 10 00:40:48 2024 Finding Nearest Neighbors
Wed Jan 10 00:40:48 2024 Building RP forest with 37 trees
Wed Jan 10 00:41:01 2024 NN descent for 19 iterations
	 1 / 19
	 2 / 19
	 3 / 19
	 4 / 19
	Stopping threshold met -- exiting after 4 iterations
Wed Jan 10 00:41:30 2024 Finished Nearest Neighbor Search
Wed Jan 10 00:41:34 2024 Construct embedding
Epochs completed:   0/200 [00:09]
	completed  20 / 200 epochs
	completed  40 / 200 epochs
	completed  60 / 200 epochs
	completed  80 / 200 epochs
	completed 100 / 200 epochs
	completed 120 / 200 epochs
	completed 140 / 200 epochs
	completed 160 / 200 epochs
	completed 180 / 200 epochs
Wed Jan 10 00:54:18 2024 Finished embedding

How to use wrapBERTopicModel

I'm using wrapBERTopicModel and providing my own model. I see that the wrapper returns the following:

        return TopicModelTuner(
            embedding_model=BERTopicModel.embedding_model,
            reducer_model=BERTopicModel.umap_model,
            hdbscan_model=BERTopicModel.hdbscan_model,
        )

Following the provided notebook, the next step is to create embeddings. But when I execute that, I get the following error:

----> 1 tmt.createEmbeddings(docs)

[/usr/local/lib/python3.10/dist-packages/topictuner/topictuner.py](https://localhost:8080/#) in createEmbeddings(self, docs)
    180         if np.all(docs != None):
    181             self.docs = docs
--> 182         self.embeddings = self.embedding_model.encode(self.docs)
    183 
    184     def reduce(self):

AttributeError: 'SentenceTransformerBackend' object has no attribute 'encode'

Help appreciated, thank you :)
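
One possible workaround, sketched under the assumption that BERTopic stores its embedding model inside a backend wrapper (SentenceTransformerBackend) that keeps the underlying sentence-transformers model in an .embedding_model attribute; this is a guess about BERTopic internals, not a confirmed fix.

# Hypothetical: unwrap the BERTopic backend so the tuner gets an object with .encode().
tmt = TMT.wrapBERTopicModel(my_bertopic_model)  # however you invoke the wrapper
backend = tmt.embedding_model
if hasattr(backend, "embedding_model"):
    tmt.embedding_model = backend.embedding_model  # the raw SentenceTransformer
tmt.createEmbeddings(docs)

# Alternative: skip createEmbeddings and assign precomputed embeddings directly.
# tmt.embeddings = my_precomputed_embeddings
# tmt.docs = docs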

Is there a way to apply stopwords and/or c-TF-IDF after optimizing HDBSCAN with TopicTuner?

With BERTopic it is possible to select important words via c-TF-IDF and stopword removal after the HDBSCAN step. Is this possible with TopicTuner as well? (Or would that lead to problems, because TopicTuner does not optimize for this post-processing stage and would then produce many more uncategorized documents?)

In that case, would it make sense to do the stopword removal and c-TF-IDF at the beginning, before topic tuning?

Thanks for your consideration!!
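
For context: in standard BERTopic the CountVectorizer and c-TF-IDF steps run after clustering, so they do not change which documents land in the -1 cluster; tuning HDBSCAN first and applying stopword removal afterwards should therefore be safe. A hedged sketch of that order, where the export method name getBERTopicModel is an assumption based on the import/export functionality described in the README:

from sklearn.feature_extraction.text import CountVectorizer

# Export a BERTopic model configured with the tuned parameters
# (method name assumed; the parameter values are illustrative).
bt_model = tmt.getBERTopicModel(min_cluster_size=65, min_samples=33)

# Stopword removal and c-TF-IDF happen after clustering, so they only affect
# the topic vocabulary, not cluster membership.
bt_model.vectorizer_model = CountVectorizer(stop_words="english")
topics, probs = bt_model.fit_transform(docs)

# Equivalently, on an already fitted model:
# bt_model.update_topics(docs, vectorizer_model=CountVectorizer(stop_words="english"))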

ERROR: Cannot install bertopic because these package versions have conflicting dependencies.

Thanks for your amazing tool!

When I attempt to install the tool, I encounter the dependency conflict below:

(base) jimmy@Jimmys-MacBook-Air Asset-Management-Topic-Modeling % pip install -r requirements.txt                        
Collecting topicmodeltuner==0.3.4
  Using cached topicmodeltuner-0.3.4-py3-none-any.whl (27 kB)
Collecting wandb==0.13.10
  Using cached wandb-0.13.10-py3-none-any.whl (2.0 MB)
Collecting loguru
  Using cached loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting bertopic>=v0.10.0
  Using cached bertopic-0.14.1-py2.py3-none-any.whl (120 kB)
Requirement already satisfied: Click!=8.0.0,>=7.0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from wandb==0.13.10->-r requirements.txt (line 2)) (8.1.3)
Collecting GitPython>=1.0.0
  Using cached GitPython-3.1.31-py3-none-any.whl (184 kB)
Requirement already satisfied: requests<3,>=2.0.0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from wandb==0.13.10->-r requirements.txt (line 2)) (2.28.1)
Requirement already satisfied: psutil>=5.0.0 in /Users/jimmy/Library/Python/3.11/lib/python/site-packages (from wandb==0.13.10->-r requirements.txt (line 2)) (5.9.4)
Collecting sentry-sdk>=1.0.0
  Using cached sentry_sdk-1.16.0-py2.py3-none-any.whl (184 kB)
Collecting docker-pycreds>=0.4.0
  Using cached docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Requirement already satisfied: PyYAML in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from wandb==0.13.10->-r requirements.txt (line 2)) (6.0)
Collecting pathtools
  Using cached pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting setproctitle
  Using cached setproctitle-1.3.2-cp311-cp311-macosx_10_9_universal2.whl (16 kB)
Requirement already satisfied: setuptools in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from wandb==0.13.10->-r requirements.txt (line 2)) (65.5.0)
Collecting appdirs>=1.4.3
  Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Collecting protobuf!=4.21.0,<5,>=3.19.0
  Using cached protobuf-4.22.0-cp37-abi3-macosx_10_9_universal2.whl (397 kB)
Requirement already satisfied: numpy>=1.20.0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (1.23.5)
Collecting hdbscan>=0.8.29
  Using cached hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0
  Using cached umap-learn-0.5.3.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: pandas>=1.1.5 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (1.5.2)
Collecting scikit-learn>=0.22.2.post1
  Using cached scikit_learn-1.2.1-cp311-cp311-macosx_12_0_arm64.whl (8.4 MB)
Requirement already satisfied: tqdm>=4.41.1 in /Users/jimmy/Library/Python/3.11/lib/python/site-packages (from bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (4.64.1)
Collecting sentence-transformers>=0.4.1
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting plotly>=4.7.0
  Using cached plotly-5.13.1-py2.py3-none-any.whl (15.2 MB)
Requirement already satisfied: six>=1.4.0 in /Users/jimmy/Library/Python/3.11/lib/python/site-packages (from docker-pycreds>=0.4.0->wandb==0.13.10->-r requirements.txt (line 2)) (1.16.0)
Collecting gitdb<5,>=4.0.1
  Using cached gitdb-4.0.10-py3-none-any.whl (62 kB)
Requirement already satisfied: charset-normalizer<3,>=2 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests<3,>=2.0.0->wandb==0.13.10->-r requirements.txt (line 2)) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests<3,>=2.0.0->wandb==0.13.10->-r requirements.txt (line 2)) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests<3,>=2.0.0->wandb==0.13.10->-r requirements.txt (line 2)) (1.26.13)
Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests<3,>=2.0.0->wandb==0.13.10->-r requirements.txt (line 2)) (2022.12.7)
Collecting smmap<6,>=3.0.1
  Using cached smmap-5.0.0-py3-none-any.whl (24 kB)
Collecting cython>=0.27
  Using cached Cython-0.29.33-py2.py3-none-any.whl (987 kB)
Collecting scipy>=1.0
  Using cached scipy-1.10.1-cp311-cp311-macosx_12_0_arm64.whl (28.7 MB)
Requirement already satisfied: joblib>=1.0 in /Users/jimmy/Library/Python/3.11/lib/python/site-packages (from hdbscan>=0.8.29->bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (1.2.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/jimmy/Library/Python/3.11/lib/python/site-packages (from pandas>=1.1.5->bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from pandas>=1.1.5->bertopic>=v0.10.0->topicmodeltuner==0.3.4->-r requirements.txt (line 1)) (2022.6)
Collecting tenacity>=6.2.0
  Using cached tenacity-8.2.2-py3-none-any.whl (24 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting transformers<5.0.0,>=4.6.0
  Using cached transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting sentence-transformers>=0.4.1
  Using cached sentence-transformers-2.2.1.tar.gz (84 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-2.2.0.tar.gz (79 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-2.1.0.tar.gz (78 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers>=0.10.3
  Using cached tokenizers-0.13.2-cp311-cp311-macosx_12_0_arm64.whl (3.7 MB)
Collecting sentence-transformers>=0.4.1
  Using cached sentence-transformers-2.0.0.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.2.1.tar.gz (80 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.2.0.tar.gz (81 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.1.1.tar.gz (81 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.1.0.tar.gz (78 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.0.4.tar.gz (74 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.0.3.tar.gz (74 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.0.2.tar.gz (74 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.0.1.tar.gz (74 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-1.0.0.tar.gz (74 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-0.4.1.2.tar.gz (64 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-0.4.1.1.tar.gz (64 kB)
  Preparing metadata (setup.py) ... done
  Using cached sentence-transformers-0.4.1.tar.gz (64 kB)
  Preparing metadata (setup.py) ... done
INFO: pip is looking at multiple versions of scikit-learn to determine which version is compatible with other requirements. This could take a while.
Collecting scikit-learn>=0.22.2.post1
  Using cached scikit_learn-1.2.0-cp311-cp311-macosx_12_0_arm64.whl (8.3 MB)
INFO: pip is looking at multiple versions of plotly to determine which version is compatible with other requirements. This could take a while.
Collecting plotly>=4.7.0
  Using cached plotly-5.13.0-py2.py3-none-any.whl (15.2 MB)
INFO: pip is looking at multiple versions of pandas to determine which version is compatible with other requirements. This could take a while.
Collecting pandas>=1.1.5
  Using cached pandas-1.5.3-cp311-cp311-macosx_11_0_arm64.whl (10.8 MB)
INFO: pip is looking at multiple versions of numpy to determine which version is compatible with other requirements. This could take a while.
Collecting numpy>=1.20.0
  Using cached numpy-1.24.2-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
INFO: pip is looking at multiple versions of idna to determine which version is compatible with other requirements. This could take a while.
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
INFO: pip is looking at multiple versions of hdbscan to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of gitdb to determine which version is compatible with other requirements. This could take a while.
Collecting gitdb<5,>=4.0.1
  Using cached gitdb-4.0.9-py3-none-any.whl (63 kB)
INFO: pip is looking at multiple versions of charset-normalizer to determine which version is compatible with other requirements. This could take a while.
Collecting charset-normalizer<3,>=2
  Using cached charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
INFO: pip is looking at multiple versions of certifi to determine which version is compatible with other requirements. This could take a while.
Collecting certifi>=2017.4.17
  Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
INFO: pip is looking at multiple versions of setproctitle to determine which version is compatible with other requirements. This could take a while.
Collecting setproctitle
  Using cached setproctitle-1.3.1.tar.gz (27 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
INFO: pip is looking at multiple versions of pyyaml to determine which version is compatible with other requirements. This could take a while.
Collecting PyYAML
  Using cached PyYAML-6.0-cp311-cp311-macosx_11_0_arm64.whl (167 kB)
INFO: pip is looking at multiple versions of pathtools to determine which version is compatible with other requirements. This could take a while.
Collecting pathtools
  Using cached pathtools-0.1.1.tar.gz (41 kB)
  Preparing metadata (setup.py) ... done
INFO: pip is looking at multiple versions of loguru to determine which version is compatible with other requirements. This could take a while.
Collecting loguru
  Using cached loguru-0.5.3-py3-none-any.whl (57 kB)
INFO: pip is looking at multiple versions of sentry-sdk to determine which version is compatible with other requirements. This could take a while.
Collecting sentry-sdk>=1.0.0
  Using cached sentry_sdk-1.15.0-py2.py3-none-any.whl (181 kB)
INFO: pip is looking at multiple versions of requests to determine which version is compatible with other requirements. This could take a while.
Collecting requests<3,>=2.0.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
INFO: pip is looking at multiple versions of psutil to determine which version is compatible with other requirements. This could take a while.
Collecting psutil>=5.0.0
  Using cached psutil-5.9.4-cp38-abi3-macosx_11_0_arm64.whl (244 kB)
INFO: pip is looking at multiple versions of protobuf to determine which version is compatible with other requirements. This could take a while.
Collecting protobuf!=4.21.0,<5,>=3.19.0
  Using cached protobuf-4.21.12-cp37-abi3-macosx_10_9_universal2.whl (486 kB)
INFO: pip is looking at multiple versions of gitpython to determine which version is compatible with other requirements. This could take a while.
Collecting GitPython>=1.0.0
  Using cached GitPython-3.1.30-py3-none-any.whl (184 kB)
INFO: pip is looking at multiple versions of docker-pycreds to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of click to determine which version is compatible with other requirements. This could take a while.
Collecting Click!=8.0.0,>=7.0
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
INFO: pip is looking at multiple versions of bertopic to determine which version is compatible with other requirements. This could take a while.
Collecting bertopic>=v0.10.0
  Using cached bertopic-0.14.0-py2.py3-none-any.whl (119 kB)
  Using cached bertopic-0.13.0-py2.py3-none-any.whl (103 kB)
Collecting PyYAML
  Using cached PyYAML-5.4.1.tar.gz (175 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bertopic>=v0.10.0
  Using cached bertopic-0.12.0-py2.py3-none-any.whl (90 kB)
  Using cached bertopic-0.11.0-py2.py3-none-any.whl (76 kB)
  Using cached bertopic-0.10.0-py2.py3-none-any.whl (58 kB)
INFO: pip is looking at multiple versions of appdirs to determine which version is compatible with other requirements. This could take a while.
Collecting appdirs>=1.4.3
  Using cached appdirs-1.4.3-py2.py3-none-any.whl (12 kB)
INFO: pip is looking at multiple versions of bertopic to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of wandb to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of topicmodeltuner to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install bertopic because these package versions have conflicting dependencies.

The conflict is caused by:
    sentence-transformers 2.2.2 depends on torch>=1.6.0
    sentence-transformers 2.2.1 depends on torch>=1.6.0
    sentence-transformers 2.2.0 depends on torch>=1.6.0
    sentence-transformers 2.1.0 depends on torch>=1.6.0
    sentence-transformers 2.0.0 depends on torch>=1.6.0
    sentence-transformers 1.2.1 depends on torch>=1.6.0
    sentence-transformers 1.2.0 depends on torch>=1.6.0
    sentence-transformers 1.1.1 depends on torch>=1.6.0
    sentence-transformers 1.1.0 depends on torch>=1.6.0
    sentence-transformers 1.0.4 depends on torch>=1.6.0
    sentence-transformers 1.0.3 depends on torch>=1.6.0
    sentence-transformers 1.0.2 depends on torch>=1.6.0
    sentence-transformers 1.0.1 depends on torch>=1.6.0
    sentence-transformers 1.0.0 depends on torch>=1.6.0
    sentence-transformers 0.4.1.2 depends on torch>=1.6.0
    sentence-transformers 0.4.1.1 depends on torch>=1.6.0
    sentence-transformers 0.4.1 depends on torch>=1.6.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

I am using torch==1.13.1:

(base) jimmy@Jimmys-MacBook-Air Asset-Management-Topic-Modeling % python -c "import torch; print(torch.__version__)"
1.13.1

What should I do now?

Use grey for outliers in visualizeEmbeddings

Hi there,

It would be great if visualizeEmbeddings used grey or something like that for the -1 outlier topic so that it is distinguishable from other topics (like the visualisation functions from BERTopic do).

In this example with a few hundred topics, the outlier topic is orange, the same orange as several of the other real topics.

(screenshot of the visualizeEmbeddings plot omitted)
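
Until that lands, a post-hoc recoloring sketch, assuming visualizeEmbeddings returns a plotly Figure with one trace per topic and that the outlier trace can be identified from its name; both assumptions are guesses about the actual figure layout.

# Hypothetical: recolor the -1 trace grey in the returned plotly figure.
fig = tmt.visualizeEmbeddings(131, 78)
for trace in fig.data:
    if str(trace.name).strip().endswith("-1"):  # guess at how the outlier trace is named
        trace.marker.color = "lightgrey"
fig.show()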

HDBSCAN gives an array error

Hi,

Suddenly, whenever I use randomSearch, gridSearch, or anything related to runHDBSCAN, I get the following error:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It seems like there is a problem with the parameters for runHDBSCAN. Do you have an idea how to deal with that?

Thanks!!!
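
A hedged guess at the cause: "Expected 2D array, got scalar array instead: array=None" suggests HDBSCAN is being handed None as its input, i.e. no (reduced) embeddings exist yet when the search runs. One thing to check, sketched under that assumption:

# Make sure embeddings exist and have been reduced before any search;
# otherwise HDBSCAN receives None and raises the ValueError above.
tmt.createEmbeddings(docs=docs)   # or: tmt.embeddings = precomputed_embeddings
tmt.reduce()
lastRunResultsDF = tmt.pseudoGridSearch([*range(10, 51, 10)],
                                        [x / 100 for x in range(10, 101, 10)])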

Typo in example notebook

Hi there, cool package. I'm just starting to play around with it.

Just wanted to let you know that there is a typo in the code in the Google Colab example notebook.

lastRunResultsDF = tmt.psuedoGridSearch([*range(62,71)], [x/100 for x in range(10,101,10)])

pseudo is misspelled, so that line doesn't run. It's also misspelled throughout the text; a simple find-and-replace should fix it.
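
For reference, the corrected call (the spelling pseudoGridSearch matches the method name used in the issue below):

lastRunResultsDF = tmt.pseudoGridSearch([*range(62,71)], [x/100 for x in range(10,101,10)])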

How does TopicTuner help the parameter-setting process?

@drob-xx I checked your code, very impressive work, and I have a question. I think you used grid search to try different settings of min_cluster_size and min_samples and ran some experiments; I also looked at BaseHDBSCANTuner and the gridSearch, pseudoGridSearch, and randomSearch functions. But I still have questions about how this "grid search", or more precisely these functions, actually help with setting the parameters.
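
For intuition, here is a rough sketch (not the library's actual code) of what such a pseudo grid search amounts to: sweep min_cluster_size, derive min_samples as a fraction of it, and record the number of clusters and the size of the -1 cluster for each pair, so that promising parameter regions can be read off the resulting DataFrame or the visualizations.

# Illustration only: the idea behind a pseudo grid search over HDBSCAN parameters.
import pandas as pd
from hdbscan import HDBSCAN

def sketch_grid_search(reduced_embeddings, cluster_sizes, sample_fracs):
    rows = []
    for mcs in cluster_sizes:
        for frac in sample_fracs:
            ms = max(1, int(mcs * frac))  # min_samples as a fraction of min_cluster_size
            labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(reduced_embeddings)
            rows.append({
                "min_cluster_size": mcs,
                "min_samples": ms,
                "n_clusters": len(set(labels)) - (1 if -1 in labels else 0),
                "n_uncategorized": int((labels == -1).sum()),
            })
    return pd.DataFrame(rows)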

Unhelpful error in visualizeEmbeddings when docs not set

Hi there,

Small thing: I think it may be helpful to have some error checking on whether docs have been set when they are needed.

I was running through the example notebook and set the newsgroup embeddings with tmt.embeddings = embeddings rather than calculating them (because I use them all the time, I keep them saved), but didn't set the documents anywhere.

When I got to tmt.visualizeEmbeddings(131,78).show(), it threw the following error, generated in _check_CS_SS:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 tmt.visualizeEmbeddings(131,78).show()

File [c:\path\lib\site-packages\topictuner\basetuner.py:316](file:///C:/path/lib/site-packages/topictuner/basetuner.py:316), in BaseHDBSCANTuner.visualizeEmbeddings(self, min_cluster_size, min_samples, width, height, markersize, opacity)
    310     VizDF["wrappedText"] = [
    311         "Topic #: " + str(topic) + "<br><br>" + text
    312         for topic, text in zip(VizDF["topics"], wrappedText)
    313     ]
    314 else:
    315     VizDF["wrappedText"] = [
--> 316         "Topic #: " + str(topic) for topic in self.runHDBSCAN()
    317     ]
    318 for topiclabel in set(VizDF["topics"]):
    319     topicDF = VizDF.loc[VizDF["topics"] == topiclabel]

File [c:\path\lib\site-packages\topictuner\basetuner.py:94](file:///C:/path/lib/site-packages/topictuner/basetuner.py:94), in BaseHDBSCANTuner.runHDBSCAN(self, min_cluster_size, min_samples)
     88 def runHDBSCAN(self, min_cluster_size: int = None, min_samples: int = None):
     89     """
     90     Cluster the target embeddings (these will be the reduced embeddings when
     91     run as a TMT instance. Per HDBSCAN, min_samples must be more than 0 and less than
     92     or equal to min_cluster_size.
     93     """
---> 94     min_cluster_size, min_samples = self._check_CS_SS(
...
--> 408         raise ValueError("Cannot set min_cluster_size==None")
    409 if min_cluster_size == 1:
    410     raise ValueError("min_cluster_size must be more than 1")

ValueError: Cannot set min_cluster_size==None

This wasn't very helpful, as the issue was that no docs were set, not anything to do with min_cluster_size or min_samples.

Setting tmt.docs = docs resolves the issue.
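
For anyone reusing saved embeddings, the pattern that works is to set both attributes before visualizing (a sketch of the setup described above; the other notebook steps still apply, and the parameter values are taken from this report):

tmt.embeddings = embeddings   # precomputed newsgroup embeddings
tmt.docs = docs               # also required, e.g. for the hover text in the plot
tmt.visualizeEmbeddings(131, 78).show()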

Transform new docs

Hi,
Is it possible to transform new docs with the BERTopic model inside the tmt object? I didn't succeed; when I run the transform line, it returns the topic list for all of the docs from the fit...
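
Not a confirmed answer, but one route that should work with standard BERTopic: export a model built with the tuned parameters, fit it on the original corpus, and call transform() on the new documents. The export method name getBERTopicModel is assumed from the README's import/export bullet; fit() and transform() are standard BERTopic calls.

# Assumed export method name; parameter values are illustrative.
bt_model = tmt.getBERTopicModel(min_cluster_size=65, min_samples=33)
bt_model.fit(docs)                                     # original corpus
new_topics, new_probs = bt_model.transform(new_docs)   # unseen documents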
