tweetopic's Introduction

tweetopic

⚡ Blazing Fast topic modelling over short texts in Python

Features

  • Fast ⚡
  • Scalable 💥
  • High consistency and coherence 🎯
  • High quality topics 🔥
  • Easy visualization and inspection 👀
  • Full scikit-learn compatibility 🔩

🛠 Installation

Install from PyPI:

pip install tweetopic

๐Ÿ‘ฉโ€๐Ÿ’ป Usage (documentation)

Train your a topic model on a corpus of short texts:

from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
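The `min_df=15, max_df=0.1` settings above prune the vocabulary by document frequency. The sketch below illustrates that filtering in plain Python, assuming scikit-learn's documented semantics (an integer threshold is an absolute document count, a float threshold is a proportion of the corpus). The helper `filter_vocabulary` and the toy corpus are hypothetical and are not part of tweetopic or scikit-learn.

```python
import re
from collections import Counter

# Token pattern matching the documented CountVectorizer default:
# words of two or more characters (applied here to lowercased text).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def filter_vocabulary(texts, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency passes both thresholds.

    Hypothetical helper for illustration only: an int threshold is an
    absolute document count, a float threshold is a proportion of the corpus.
    """
    n_docs = len(texts)
    doc_freq = Counter()
    for text in texts:
        # Use a set so each term counts at most once per document.
        doc_freq.update(set(TOKEN_PATTERN.findall(text.lower())))

    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items() if min_count <= df <= max_count
    )


# Toy corpus (made up for this sketch).
toy_corpus = [
    "cats chase mice",
    "dogs chase cats",
    "mice fear cats",
    "dogs bark loudly",
]

# Keep terms that appear in at least 2 documents and in at most 50% of them.
kept = filter_vocabulary(toy_corpus, min_df=2, max_df=0.5)
print(kept)
```

With thresholds as strict as `min_df=15`, a small corpus can end up with an empty vocabulary, so the values may need adjusting to the size of the dataset.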

You may fit the model with a stream of short texts:

pipeline.fit(texts)

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.

Install it from PyPI:

pip install topic-wizard

Then visualize your topic model:

import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)

topicwizard visualization

🎓 References

  • Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.

tweetopic's People

Contributors

kennethenevoldsen, pre-commit-ci[bot], x-tabdeveloping

tweetopic's Issues

Dependency clash with topic-wizard

As discussed in the ML team meeting last week, there is a dependency clash with topic-wizard related to numpy versions.

Expected Behavior

Running tweetopic and then visualising the results using topic-wizard should reproduce the kind of output described in the README.md.

Current Behavior

In a virtual environment, installing tweetopic via pip and then installing topic-wizard results in the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tweetopic 0.2.0 requires numpy<1.23.0,>=1.22.4, but you have numpy 1.24.2 which is incompatible.
tweetopic 0.2.0 requires scikit-learn<1.2.0,>=1.1.1, but you have scikit-learn 1.2.1 which is incompatible.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.24.2 which is incompatible.

This means that the workflow outlined in README.md is not possible.

(The same conflict occurs in reverse if topic-wizard is installed before tweetopic.)
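The conflicts in the pip output can be checked mechanically. The following is a minimal sketch, using only the standard library, of how an installed version is tested against a comma-separated specifier such as `<1.23.0,>=1.22.4`. The helpers `parse_version` and `satisfies` are hypothetical names introduced for illustration; they handle plain numeric versions only and are not a substitute for pip's own resolver.

```python
import operator

# Comparison operators supported by this simplified sketch.
OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '1.24.2' into (1, 24, 2). Plain numeric releases only."""
    return tuple(int(part) for part in text.strip().split("."))


def _pad(a, b):
    """Pad the shorter tuple with zeros so (1, 24) compares like (1, 24, 0)."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(installed, specifier):
    """Return True if `installed` meets every clause of `specifier`."""
    version = parse_version(installed)
    for clause in specifier.split(","):
        clause = clause.strip()
        # Check two-character operators first so "<=" is not read as "<".
        for symbol in ("<=", ">=", "==", "<", ">"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                left, right = _pad(version, bound)
                if not OPERATORS[symbol](left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Requirements and installed versions copied from the pip error output.
reported = [
    ("tweetopic 0.2.0", "numpy", "<1.23.0,>=1.22.4", "1.24.2"),
    ("tweetopic 0.2.0", "scikit-learn", "<1.2.0,>=1.1.1", "1.2.1"),
    ("numba 0.56.4", "numpy", "<1.24,>=1.18", "1.24.2"),
]

for requirer, package, specifier, installed in reported:
    status = "ok" if satisfies(installed, specifier) else "conflict"
    print(f"{requirer} needs {package}{specifier}, found {installed}: {status}")
```

Under these rules a numpy release such as 1.22.4 would satisfy both the tweetopic and the numba constraints listed in the log; whether it also satisfies topic-wizard's requirements is not shown in the output above.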

Possible Solution

Not a solution as such, but I'm wondering why exactly the two packages require different versions of numpy in the first place. Have there been breaking changes in numpy between minor versions that prevent the two packages from sharing the same dependencies? I haven't had a chance to dig into the code, so I might be missing something!

Steps to Reproduce

The error has been reproduced on Ubuntu and macOS, and for Python versions 3.9, 3.10, and 3.11.

  1. python3.9 -m venv .env
  2. pip install --upgrade pip
  3. source .env/bin/activate
  4. pip install tweetopic
  5. pip install topic-wizard
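After running the steps above, it can help to record which versions actually ended up in the environment. A small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is an assumption made for this example, and the package list simply mirrors the names that appear in the error output.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


# Distribution names taken from the pip error message in this report.
packages = ["tweetopic", "topic-wizard", "numpy", "scikit-learn", "numba"]
report = installed_versions(packages)

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output into a bug report alongside the pip error makes it easier to see which install step changed numpy or scikit-learn.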
