Issues for topicx from hyintell

CETopic and BERTopic training fails.

Python version: 3.7.13
Operating System: Ubuntu 16.04.7 LTS

Description

Hi, I am trying to train CETopic and BERTopic on a custom dataset consisting of around 400K English tweets. The models train successfully on a very small subset of the same dataset, but training fails on the full dataset.

What I Did

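# Import paths are assumed from the repository layout shown in the traceback.
from octis.dataset.dataset import Dataset
from baselines.cetopictm import CETopicTM

dataset = Dataset()
# dataset_path is the folder that holds the custom tweet corpus.
dataset.load_custom_dataset_from_folder(dataset_path)
tm = CETopicTM(dataset=dataset, topic_model='cetopic', num_topics=200,
               embedding='sentence-transformers/bert-base-nli-mean-tokens',
               word_select_method='tfidf_idfi', dim_size=1, seed=42)
print("Begin model training...")
tm.train()
topic_words = tm.get_topics()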

The following error message was displayed for both models.

Begin model training...
Traceback (most recent call last):
  File "cetopic_train.py", line 32, in <module>
    tm.train()
  File "/home/devanshjain/mlda/topicx/baselines/cetopictm.py", line 32, in train
    self.topics = self.model.fit_transform(self.sentences)
  File "/home/devanshjain/mlda/topicx/baselines/cetopic/cetopic.py", line 55, in fit_transform
    embeddings = self._extract_embeddings(documents.Document)
  File "/home/devanshjain/mlda/topicx/baselines/cetopic/cetopic.py", line 84, in _extract_embeddings
    embeddings = self.embedding_model.embed_documents(documents)
  File "/home/devanshjain/mlda/topicx/baselines/cetopic/backend/_base.py", line 69, in embed_documents
    return self.embed(document, verbose)
  File "/home/devanshjain/mlda/topicx/baselines/cetopic/backend/_flair.py", line 71, in embed
    self.embedding_model.embed(sentence)
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/flair/embeddings/base.py", line 62, in embed
    self._add_embeddings_internal(data_points)
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/flair/embeddings/base.py", line 766, in _add_embeddings_internal
    self._add_embeddings_to_sentences(expanded_sentences)
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/flair/embeddings/base.py", line 684, in _add_embeddings_to_sentences
    return_tensors="pt",
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2512, in __call__
    **kwargs,
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2703, in batch_encode_plus
    **kwargs,
  File "/home/devanshjain/miniconda3/envs/cetopic/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 459, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

Thanks for the nice work by the way!
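
A possible first check, offered as an assumption rather than a confirmed diagnosis: the last frame of the traceback indexes `tokens_and_encodings[0][0]`, which would raise this IndexError if the tokenizer was handed no text at all. A large tweet corpus may contain documents that are empty or whitespace-only, for example after preprocessing, and a small subset could easily miss them. The helper below, `split_empty_documents`, is hypothetical and not part of topicx; it is a minimal sketch for locating such documents before training.

```python
from typing import List, Optional, Tuple


def split_empty_documents(
    documents: List[Optional[str]],
) -> Tuple[List[str], List[int]]:
    """Return (non-empty documents, indices of the empty ones)."""
    kept: List[str] = []
    empty_indices: List[int] = []
    for index, text in enumerate(documents):
        # Treat None and whitespace-only strings as empty.
        if text is None or not text.strip():
            empty_indices.append(index)
        else:
            kept.append(text)
    return kept, empty_indices


# Illustrative data only; these are not tweets from the reported dataset.
tweets = ["good morning", "", "   ", "topic models are fun", "\n"]
kept, empty_indices = split_empty_documents(tweets)
print(len(kept), "documents kept;", "empty at indices", empty_indices)
```

If the reported indices are non-empty, removing those documents from the dataset folder and retraining would show whether empty inputs are the cause.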

Using the word selection method

Hi,

I was reading your paper and came across your approach to word selection for topic representation. I implemented this method along with all the word selection methods you have in this link. Along the way, I found some code that could be cleaned up. In baselines/cetopictm.py there is a function called _calculate_topic_diversity(), and I believe its variable names are derived from the BERTopic code. I think you should rename them accordingly.

Warm regards
