
topicwizard's Introduction

topicwizard


Pretty and opinionated topic model visualization in Python.

[Badges: Open in Colab · PyPI version · pip downloads · Python version · Code style: black]

(Demo video: topicwizard_0.5.0_compressed.mp4)

New in version 1.0.0 🌟

  • Compatibility with contextually sensitive topic models in Turftopic and BERTopic.
  • Easier and more streamlined persistence and interoperability.
  • Smaller building blocks for constructing figures and apps.

Features

  • Investigate complex relations between topics, words, documents and groups/genres/labels interactively
  • Easy-to-use pipelines for classical topic models that can be used for downstream tasks
  • Sklearn, Turftopic, Gensim and BERTopic compatible 🔩
  • Interactive and composable Plotly figures
  • Rename topics at will
  • Share your results
  • Easy deployment 🌍

Installation

Install from PyPI:

Note that the package name on PyPI contains a dash: topic-wizard instead of topicwizard. I know it's a bit confusing; sorry about that.

pip install topic-wizard

The main abstraction of topicwizard around classical/bag-of-words models is the topic pipeline, which consists of a vectorizer, which turns texts into bag-of-words representations, and a topic model, which decomposes these representations into vectors of topic importance. topicwizard allows you to use either scikit-learn pipelines or its own TopicPipeline.

Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")

The topic model I will use for this example is Non-negative Matrix Factorization (NMF), as it is fast and usually finds good topics.

from sklearn.decomposition import NMF

model = NMF(n_components=10)

Then let's put this all together in a pipeline. You can either use sklearn Pipelines...

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

Or topicwizard's TopicPipeline:

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model)

You can also turn an already existing pipeline into a TopicPipeline.

from topicwizard.pipeline import TopicPipeline

# pipeline here is an existing sklearn Pipeline, e.g. the one created above
topic_pipeline = TopicPipeline.from_pipeline(pipeline)

Let's load a corpus that we would like to analyze. In this example I will use 20 Newsgroups from sklearn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# sklearn gives the labels back as integers; we have to map them back
# to the actual textual labels.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Then let's fit our pipeline to this data:

topic_pipeline.fit(corpus)

Models do not necessarily have to be fitted before visualizing: topicwizard fits the model automatically on the corpus if it isn't prefitted.

Then launch the topicwizard web app to interpret the model.

import topicwizard

topicwizard.visualize(corpus, model=topic_pipeline)
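
Since we also extracted group labels from 20 Newsgroups above, you can pass those along to explore groups in the app as well. (A sketch: the group_labels keyword argument is present in earlier versions of visualize(); double-check the signature in your installed version.)

topicwizard.visualize(corpus, model=topic_pipeline, group_labels=group_labels)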

Gensim

You can also use your Gensim topic models in topicwizard by converting them into a pipeline with gensim_pipeline().

from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from topicwizard.compatibility import gensim_pipeline

texts: list[list[str]] = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer'],
    ...
]

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(bow_corpus, num_topics=10)

pipeline = gensim_pipeline(dictionary, model=lda)
# Then you can use the pipeline as usual
corpus = [" ".join(text) for text in texts]
topicwizard.visualize(corpus, model=pipeline)

Contextually Sensitive Models (New in 1.0.0)

topicwizard can also help you interpret topic models that understand contextual nuances in text, by utilizing representations from sentence transformers. The package is mainly designed to be compatible with turftopic, which to my knowledge contains the broadest range of contextually sensitive models, but we also provide compatibility with BERTopic.

Here's an example of interpreting a Semantic Signal Separation model over the same corpus.

import topicwizard
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10)
topicwizard.visualize(corpus, model=model)

You can also use BERTopic models by wrapping them in a compatibility layer:

from bertopic import BERTopic
from topicwizard.compatibility import BERTopicWrapper

model = BERTopicWrapper(BERTopic(language="english"))
topicwizard.visualize(corpus, model=model)

The documentation also includes examples of how you can construct Top2Vec and CTM models in turftopic, or you can write your own wrapper quite easily if needed.

You can launch the topicwizard web application to interactively investigate your topic models. The app is also quite easy to deploy in case you want to create a client-facing interface.

import topicwizard

topicwizard.visualize(corpus, model=topic_pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, sparing yourself a lot of computation time:

# A large corpus takes a looong time to compute 2D projections for,
# so you can speed up preprocessing by disabling the documents page altogether.
topicwizard.visualize(corpus, model=topic_pipeline, exclude_pages=["documents"])
(Screenshots of the app pages: Topics, Words, Documents, and Groups.)

TopicData

All compatible models in topicwizard have a prepare_topic_data() method, which produces a TopicData object containing information about topical inference and model fit on a given corpus.

TopicData is in essence a typed dictionary, containing all information that is needed for interactive visualization in topicwizard.

You can produce this data with TopicPipeline:

pipeline = make_topic_pipeline(CountVectorizer(), NMF(10))
topic_data = pipeline.prepare_topic_data(corpus)

And with contextual models:

model = SemanticSignalSeparation(10)
topic_data = model.prepare_topic_data(corpus)

# or with BERTopic
model = BERTopicWrapper(BERTopic())
topic_data = model.prepare_topic_data(corpus)
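
Because TopicData behaves like a regular dictionary, you can inspect what your version of topicwizard puts into it directly:

print(list(topic_data.keys()))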

TopicData can then be used to spin up the web application.

import topicwizard

topicwizard.visualize(topic_data=topic_data)

This data structure can be serialized, saved and shared. topicwizard uses joblib for serializing the data.

Beware that topicwizard 1.0.0 is no longer fully backwards compatible with the old topic data files. No need to panic: you can either construct TopicData manually from the old data structures, or try to run the app anyway. It will probably work just fine, but certain functionality might be missing.

import joblib
from topicwizard.data import TopicData

# Save the data
joblib.dump(topic_data, "topic_data.joblib")

# Load the data
# (The type annotation is just for type checking, it doesn't do anything)
topic_data: TopicData = joblib.load("topic_data.joblib")

When sharing across machines, make sure that everyone is on the same page with versions of the different packages. For example, if the inference machine has scikit-learn==1.2.0, the server should have a compatible version, otherwise deserialization might fail. The same goes for BERTopic and turftopic, of course.
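
One lightweight way to catch such mismatches early (a sketch, not something topicwizard does for you) is to record the relevant versions in a sidecar file next to the data:

import json
import sklearn

# Alongside topic_data.joblib, store the library versions used at
# inference time, so the serving side can check compatibility first.
with open("topic_data.versions.json", "w") as f:
    json.dump({"scikit-learn": sklearn.__version__}, f)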

The reason this is useful is that you might want to have the results of an inference run from a server available locally, or you might want to run inference on a different machine from the one used to deploy the application. In fact, when you click the download button in the application, exactly this serialization happens in the background.

If you want customizable, faster, HTML-saveable interactive plots, you can use the figures API.

All figures are produced from a TopicData object so that you don't have to run inference twice on the same corpus for two different figures.

Here are a couple of examples:

from topicwizard.figures import (
    document_topic_timeline,
    topic_wordclouds,
    word_association_barchart,
    word_map,
)

# Word map
word_map(topic_data)

# Timeline of topics in a document
document_topic_timeline(
    topic_data,
    "Joe Biden takes over presidential office from Donald Trump.",
)

# Wordclouds of topics
topic_wordclouds(topic_data)

# Topic importances for given words
word_association_barchart(topic_data, ["supreme", "court"])

Figures in topicwizard are in essence just Plotly interactive figures and they can be modified at will. Consult Plotly's documentation for more details about manipulating and building plots.
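
For instance, you can restyle a figure and save it to a standalone HTML file using regular Plotly Figure methods (a minimal sketch):

fig = word_map(topic_data)
# update_layout() and write_html() are standard Plotly calls
fig = fig.update_layout(width=1000, height=800, template="plotly_white")
fig.write_html("word_map.html")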

For more information, consult our Documentation.

topicwizard's People

Contributors

kitchentable99, plnech, tsido, x-tabdeveloping


topicwizard's Issues

Compatibility layers

Future contributions should add compatibility layers for:

  • Gensim
  • BERTopic

And possibly some other emerging solutions like ETM.

Compatibility for Chinese

Hi! Thanks for this awesome package!

I am currently applying this package to a Chinese-language text corpus. The output is rendered as "empty squares"; the reason is that an explicit font for the language is needed (a .ttf file, which I have). Any idea how to incorporate this external font file into the package?

Thanks!

Bump scipy dependency

Sparse arrays were introduced in scipy 1.8.0; since they are used in topicwizard, this minimum version has to be specified in the dependencies.

Compatibility with latest versions of python

Hi,
I am unable to install topic-wizard on Python 3.11. Is this a known issue? Installation fails with an error in the wordcloud package while installing topic-wizard.

Use approximate nearest neighbours

We could use something like neofuzz to search for approximate nearest neighbours in embedding space instead of brute-forcing it every time; this would make the app a lot more flexible and scalable.

Unable to handle nan output from a topic model.

Hello! I am very impressed with this library as per Marton Kardos's article on Medium.

I attempted to use topicwizard to visualize short-text topic modeling inferences based on a quickly trained tweetopic model. The results of my issues and troubleshooting are located in this hosted Google Colab notebook. Please note that you can't run the notebook; I've published it only so you can view it easily via Google Colab.

Information about my Conda environment:

  • Python 3.9.16 (Installed via Anaconda)
  • ipykernel 6.9.12 and its dependencies (Anaconda)
  • tweetopic 0.3.0 (PyPI)
  • topic-wizard 0.2.5 (PyPI)
  • And all other dependencies pulled in by these two libraries.

I can train a topic model in tweetopic with no problems. I can import the topicwizard module with no problem. Once finished training on my tweetopic model, I can infer topic names via topicwizard.infer_topic_names(pipeline=pipeline) with no problems.

However, when I attempt to run topicwizard.visualize(vectorizer=vectorizer, topic_model=dmm, corpus=corpus_cleaned, port=8080) I receive the following error:

ValueError:
Invalid element(s) received for the 'size' property of scatter.marker
Invalid elements include: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

The 'size' property is a number and may be specified as:
  - An int or float in the interval [0, inf]
  - A tuple, list, or one-dimensional numpy array of the above

While troubleshooting, I found that when I .transform(...) my corpus post-training, the inferences contain nans. I dropped those rows so that they don't interfere with the elaborate computations the /prepare/<...py> files have in place to get the Dash app running. Despite cleaning up the nans, when I run the same .visualize() call above with the further-cleaned inferences, I receive the following error, tracing back to ...tweetopic/lib/site-packages/joblib/parallel.py:288. Further context on the steps I followed is available in that Google Colab notebook.

ValueError: cannot assign slice from input of different size
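
For reference, the nan-filtering step I applied was roughly the following (a sketch; pipeline and corpus_cleaned are the objects described above):

import numpy as np

doc_topic = pipeline.transform(corpus_cleaned)
# Keep only the documents whose topic distribution contains no nans
keep = ~np.isnan(doc_topic).any(axis=1)
corpus_no_nan = [doc for doc, ok in zip(corpus_cleaned, keep) if ok]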

Could anyone help me figure out what is preventing me from getting the Dash app working? Thank you!

Plots on the right side of Topics page are empty

Hi! Thank you for such a useful innovation!
I'm attempting to visualise the topic modelling of 100 short texts (phrases). The code has worked fine, but once I try to visualise, the plots that are supposed to appear on the right side of the Topics page are empty. I thought that fewer components (i.e., fewer topics) might make it work, but an additional error comes up when trying with 3 components.

(Screenshot: the plots on the right side of the Topics page are empty.)

KeyNMF doesn't work with topicwizard

I don't know exactly why, but my guess would be that the problem lies somewhere in the discrepancy between the vocabularies of the two vectorisers; this has to be looked into.

Add Querying Documents interactively based on topic axes.

Imagine the following scenario:
You're a restaurant chain; you run S3 on the reviews you got from your customers, and you get a handful of interesting axes.
You would most likely want to query documents based on their values on these axes, to see for instance which reviews are negative in valence and talk about the food.

A good solution to this would be a little interactive app where you can add or remove sliders over different topic axes, and the app would show the documents that rank closest to the set values on the given axes.
All axes without a slider would be ignored.

I think this would be immensely useful, so we should totally implement it.
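
A minimal sketch of the ranking logic (names are illustrative; document_topic_matrix is the usual (n_documents, n_topics) array produced at inference time):

import numpy as np

def query_documents(document_topic_matrix, slider_values, top_k=10):
    # slider_values maps topic index -> desired value on that axis;
    # axes without a slider attached are ignored entirely.
    axes = list(slider_values.keys())
    targets = np.array([slider_values[axis] for axis in axes])
    # Distance to the slider settings over the selected axes only
    distances = np.linalg.norm(document_topic_matrix[:, axes] - targets, axis=1)
    return np.argsort(distances)[:top_k]

# E.g. reviews that are negative in valence (axis 3) and about food (axis 7):
# query_documents(document_topic_matrix, {3: -1.0, 7: 1.0})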

Error with Gensim's NMF model

Hi there! First, I'd like to say that your package has been very helpful for the visualisation in my topic modelling project. Thank you!

While topicwizard works great with Gensim's LDA model for me, running it on Gensim's NMF model produced the error "AttributeError: 'Nmf' object has no attribute 'inference'" from topicwizard.visualize().

Here's the full error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/Users/magz/workspace/asia3012/topic-model-comp.ipynb Cell 9, line 3
      1 topic_pipeline = topicwizard.gensim_pipeline(dictionary, model=nmf)
      2 texts = [" ".join(text) for text in df['cleanedText']]
----> 3 topicwizard.visualize(texts, pipeline=topic_pipeline)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:297, in visualize(corpus, vectorizer, topic_model, pipeline, document_names, topic_names, exclude_pages, group_labels, port)
    295 if topic_names is None and hasattr(pipeline, "topic_names"):
    296     topic_names = pipeline.topic_names  # type: ignore
--> 297 app = get_dash_app(
    298     pipeline=pipeline,
    299     corpus=corpus,
    300     document_names=document_names,
    301     topic_names=topic_names,
    302     exclude_pages=exclude_pages,
    303     group_labels=group_labels,
    304 )
    305 return run_app(app, port=port)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:88, in get_dash_app(corpus, exclude_pages, pipeline, vectorizer, topic_model, document_names, topic_names, group_labels)
     86 if pipeline is None:
     87     pipeline = Pipeline([("Vectorizer", vectorizer), ("Model", topic_model)])
---> 88 blueprint = get_app_blueprint(
     89     pipeline=pipeline,
     90     corpus=corpus,
     91     document_names=document_names,
     92     topic_names=topic_names,
     93     exclude_pages=exclude_pages,
     94     group_labels=group_labels,
     95 )
     96 app = Dash(
     97     __name__,
     98     blueprint=blueprint,
   (...)
    108     ],
    109 )
    110 return app

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/app.py:32, in get_app_blueprint(pipeline, corpus, document_names, topic_names, *args, **kwargs)
     24 def get_app_blueprint(
     25     pipeline: Pipeline,
     26     corpus: Iterable[str],
   (...)
     30     **kwargs,
     31 ) -> DashBlueprint:
---> 32     blueprint = prepare_blueprint(
     33         pipeline=pipeline,
     34         corpus=corpus,
     35         document_names=document_names,
     36         topic_names=topic_names,
     37         create_blueprint=create_blueprint,
     38         *args,
     39         **kwargs,
     40     )
     41     return blueprint

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/blueprints/template.py:41, in prepare_blueprint(pipeline, corpus, create_blueprint, document_names, topic_names, group_labels, *args, **kwargs)
     35 vectorizer, topic_model = split_pipeline(None, None, pipeline)
     36 vocab = get_vocab(vectorizer)
     37 (
     38     document_term_matrix,
     39     document_topic_matrix,
     40     topic_term_matrix,
---> 41 ) = prepare_transformed_data(vectorizer, topic_model, corpus)
     42 nan_documents = np.isnan(document_topic_matrix).any(axis=1)
     43 n_nan_docs = np.sum(nan_documents)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/prepare/utils.py:31, in prepare_transformed_data(vectorizer, topic_model, corpus)
     13 """Transforms corpus with the topic model, and extracts important matrices.
     14
     15 Parameters
   (...)
     28 topic_term_matrix: array of shape (n_topics, n_terms)
     29 """
     30 document_term_matrix = vectorizer.transform(corpus)
---> 31 document_topic_matrix = topic_model.transform(document_term_matrix)
     32 topic_term_matrix = topic_model.components_
     33 return document_term_matrix, document_topic_matrix, topic_term_matrix

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/topicwizard/compatibility/gensim.py:162, in TopicModelWrapper.transform(self, X)
    149 """Turns documents into topic distributions.
    150
    151 Parameters
   (...)
    159     Sparse array of document-topic distributions.
    160 """
    161 corpus = self._prepare_corpus(X)
--> 162 X_trans = self.model.inference(corpus)[0]
    163 # Normalizing probabilities (so that all docs add up to one)
    164 X_trans = (X_trans.T / X_trans.sum(axis=1)).T

AttributeError: 'Nmf' object has no attribute 'inference'

Here's my code for reproducing the error:

from gensim.corpora.dictionary import Dictionary
from gensim.models import TfidfModel
from gensim.models.nmf import Nmf as GensimNmf
import topicwizard

dictionary = Dictionary(df['cleanedText'])
tfidf = TfidfModel(dictionary=dictionary)
corpus = [dictionary.doc2bow(doc) for doc in df['cleanedText']]
corpus_tfidf = tfidf[corpus]

nmf = GensimNmf(
    corpus=corpus_tfidf,
    num_topics=10,
    id2word=dictionary,
    chunksize=1000,
    passes=5,
    eval_every=10,
    minimum_probability=0,
    random_state=0,
    kappa=1,
)

topic_pipeline = topicwizard.gensim_pipeline(dictionary, model=nmf)
texts = [" ".join(text) for text in df['cleanedText']]
topicwizard.visualize(texts, pipeline=topic_pipeline)

Your help will be much appreciated!

I can't import topicwizard

I don't know how to handle this problem and am asking for help. Thank you. Here is the error information:
File d:\software\anaconda\anaconda\envs\topicmodel\lib\site-packages\topicwizard\__init__.py:1
----> 1 from topicwizard.app import get_dash_app, load, load_app, visualize
      2 from topicwizard.compatibility.bertopic import bertopic_pipeline
      3 from topicwizard.compatibility.gensim import gensim_pipeline

File d:\software\anaconda\anaconda\envs\topicmodel\lib\site-packages\topicwizard\app.py:220
    214     app = load_app(filename, exclude_pages=exclude_pages)
    215     return run_app(app, port=port)
    218 def split_pipeline(
    219     vectorizer: Any, topic_model: Any, pipeline: Optional[Pipeline]
--> 220 ) -> tuple[Any, Any]:
    221     """Check which arguments are provided,
    222     raises error if the arguments are not satisfactory, and if needed
    223     splits Pipeline into vectorizer and topic model."""
    224     if (vectorizer is None) or (topic_model is None):

TypeError: 'type' object is not subscriptable

Implement Words on Topic Axes Figure

Rationale:

It is useful to be able to understand topics as axes, especially with models like Semantic Signal Separation that conceptualize topics as such.
While you can easily get an overview of the most important words for a topic in topicwizard, it could be very useful to also see which words are the most negative for a given topic, or neutral to it.

Solution:

Add a plot to the figures API where users can display a word map according to two chosen topic axes, as opposed to the word_map() function, where the axes are calculated using UMAP.


Implementation:

Some code has already been written for this in my Medium article about S3:

import numpy as np
import pandas as pd
import plotly.express as px

# ds (the dataset) and selected_terms (the indices of the words to
# display) are defined earlier in the article.
vocab = model.get_vocab()

# We will produce a BoW matrix to extract term frequencies
document_term_matrix = model.vectorizer.transform(ds["abstract"])
frequencies = document_term_matrix.sum(axis=0)
frequencies = np.squeeze(np.asarray(frequencies))

# model.components_ is an n_topics x n_terms matrix.
# It contains the strength of all components for each word.
# Here we are selecting components for the words we selected earlier.
terms_with_axes = pd.DataFrame({
    "inference": model.components_[7][selected_terms],
    "measurement_devices": model.components_[1][selected_terms],
    "noise": model.components_[6][selected_terms],
    "term": vocab[selected_terms],
})

px.scatter(
    terms_with_axes,
    text="term",
    x="inference",
    y="noise",
    color="measurement_devices",
    template="plotly_white",
    color_continuous_scale="Bluered",
).update_layout(
    width=1200,
    height=800,
).update_traces(
    textposition="top center",
    marker=dict(size=12, line=dict(width=2, color="white")),
)

Doesn't seem to work for Gensim Topic Models

I have trained an LDA model using Gensim, and now want to use topicwizard for visualization.
But even after following the README for the Gensim topic model case, it doesn't seem to work.

Note: I am working in the Nepali language; the LDA model is also trained on Nepali text.
Code:

from gensim.corpora.dictionary import Dictionary
from topicwizard.compatibility import gensim_pipeline
import topicwizard

dictionary and lda_model are loaded from my earlier training with Gensim.

I have checked and there is no problem with either of them.

dictionary_form_data = Dictionary(dictionary)
pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)


corpus = [" ".join(tokenized_news) for tokenized_news in dictionary]

No problem till now; the corpus is shown as expected:

print(corpus[10:12])

(screenshot of the printed corpus)

Now fitting the corpus:

pipeline.fit(corpus)

(screenshot: fitting completes without errors)

Printing topic_names:

(screenshot: the topic names print correctly)

So, everything seems to be fine, but when doing the visualization:

(screenshot of the error)

So, what is the problem here? Is it that it's not working with Gensim topic models?

Switch to UMAP

TSNE is very slow, and I have to resort to stupid methods for speeding it up, like adding SVD layers in front of it; it would be really nice to just use UMAP instead.
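
For reference, a sketch of both approaches (embeddings stands in for whatever high-dimensional matrix is being projected; umap-learn assumed installed):

import umap
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

# Current workaround: an SVD layer in front of t-SNE to cut dimensionality
projection = make_pipeline(TruncatedSVD(50), TSNE(n_components=2))
coords = projection.fit_transform(embeddings)

# Proposed: project with UMAP directly
coords = umap.UMAP(n_components=2).fit_transform(embeddings)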

Usable on Colab

A lot of people have resorted to using Colab for its memory; it would be great to have a good working example without the conflicts, maybe a Colab version.

Thanks for this, great work!

How to read the generated graphs?

Hello,
Can you help me understand the generated graph results? As there are no axis labels or legends, I'm finding it difficult to interpret and read the graphs. For example, in the Documents visualization, what do the axes in the timeline chart represent? How should I read this timeline graph? Similar explanations for the other topic and word graphs would also be helpful. Thank you for your work and time!
