gsdmm

gsdmm implements short text clustering via the Dirichlet Multinomial Mixture (DMM) model proposed by Yin and Wang (2014). It provides a fast C++ implementation and an R interface for the Gibbs sampler described in the paper. Specifically, gsdmm implements the likelihood function that allows for multiple occurrences of the same word in a given text (Equation 4 in the paper).
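
To make the role of that likelihood more concrete, here is a rough base-R sketch of the unnormalised score the sampler evaluates when deciding whether a text belongs to a given cluster. It follows the form of Equation 4 as I understand it from the paper. It is not the package's C++ code, and the function name `log_cluster_score` and all argument names are made up for illustration.

```r
# Log of the unnormalised probability of assigning a text to one cluster.
# Illustrative sketch only: names below are assumptions, not the gsdmm API.
#   doc   character vector of tokens in the text
#   m_z   number of texts currently in the cluster (excluding doc)
#   n_z   number of tokens currently in the cluster (excluding doc)
#   n_zw  named numeric vector of per-word counts in the cluster (excluding doc)
#   D     total number of texts, K upper bound on clusters, V vocabulary size
log_cluster_score <- function(doc, m_z, n_z, n_zw, D, K, V, alpha, beta) {
  # first factor: larger clusters are more attractive
  log_prior <- log(m_z + alpha) - log(D - 1 + K * alpha)

  # numerator: for every distinct word, one term per occurrence in the text
  log_num <- 0
  for (w in unique(doc)) {
    n_dw <- sum(doc == w)                            # occurrences of w in doc
    n_w  <- if (w %in% names(n_zw)) n_zw[[w]] else 0 # occurrences of w in cluster
    log_num <- log_num + sum(log(n_w + beta + seq_len(n_dw) - 1))
  }

  # denominator: one term per token in the text
  log_den <- sum(log(n_z + V * beta + seq_len(length(doc)) - 1))

  log_prior + log_num - log_den
}

# Toy comparison of two hypothetical clusters for one text
doc <- c("cat", "cat", "fun")
score_cat <- log_cluster_score(
  doc, m_z = 3, n_z = 5, n_zw = c(cat = 3, sweet = 1, seem = 1),
  D = 8, K = 20, V = 14, alpha = 0.1, beta = 0.2
)
score_rocket <- log_cluster_score(
  doc, m_z = 4, n_z = 9,
  n_zw = c(rocket = 4, amaze = 1, witness = 1, flight = 1, marvel = 1, engineer = 1),
  D = 8, K = 20, V = 14, alpha = 0.1, beta = 0.2
)

# Normalise over the candidate clusters to get sampling probabilities
probs <- exp(c(cat = score_cat, rocket = score_rocket))
probs <- probs / sum(probs)
```

Working on the log scale avoids underflow for longer texts. The repeated word "cat" contributes one term per occurrence, which is the point of the multiple-occurrence form of the likelihood.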

Benefits:

  • very space and time efficient
  • unlike LDA, it requires only an upper bound on the number of clusters

Development:

  • I am planning to add a tuning function for the alpha and beta parameters of the Gibbs sampler

Installation

You can install the development version of gsdmm from GitHub with:

# install.packages("devtools")
devtools::install_github("till-tietz/gsdmm")

Usage

Here is a minimal working example.

# we lemmatize and tokenize, creating a list of character vectors (one per text)
text <- c(
  "Rockets are amazing.",
  "Witnessing a rocket in flight is a marvel of engineering.",
  "We should take a rocket to Mars.",
  "Rocket",
  "Have you ever seen a cat?",
  "Cats are fun.",
  "Your cat seems sweet.",
  "Cat"
) |>
  tolower() |>
  gsub(pattern = "[[:punct:] ]+", replacement = " ") |>
  textstem::lemmatize_strings() |>
  text2vec::word_tokenizer() |>
  lapply(function(i) i[!i %in% stopwords::stopwords()])

set.seed(42)

gsdmm::gsdmm(texts = text, n_iter = 100, n_clust = 20, alpha = 0.1, beta = 0.2, progress = FALSE)
#> $cluster
#> [1] 16  5 14 16  4 16  2 16
#> 
#> $distribution
#> 14 x 5 sparse Matrix of class "dgCMatrix"
#>          5 14 16 4 2
#> rocket   1  1  2 . .
#> amaze    .  .  1 . .
#> witness  1  .  . . .
#> flight   1  .  . . .
#> marvel   1  .  . . .
#> engineer 1  .  . . .
#> take     .  1  . . .
#> mar      .  1  . . .
#> ever     .  .  . 1 .
#> see      .  .  . 1 .
#> cat      .  .  2 1 1
#> fun      .  .  1 . .
#> seem     .  .  . . 1
#> sweet    .  .  . . 1
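
The returned cluster labels can be summarised with base R. The sketch below hard-codes the `$cluster` vector from the output above so that it runs without the package installed. In practice you would probably store the result of `gsdmm::gsdmm()` in a variable and use its `cluster` element instead.

```r
# Cluster labels copied from the example output above
cluster <- c(16, 5, 14, 16, 4, 16, 2, 16)

# Number of clusters actually populated (out of the upper bound n_clust = 20)
n_used <- length(unique(cluster))  # 5

# Cluster sizes, largest first
sizes <- sort(table(cluster), decreasing = TRUE)

# Indices of the input texts grouped by cluster label
members <- split(seq_along(cluster), cluster)
members[["16"]]  # texts 1, 4, 6 and 8
```

Only 5 of the 20 allowed clusters end up populated, which illustrates why only an upper bound on the number of clusters is needed. In this run cluster 16 contains both rocket and cat texts, so a corpus this small is best treated as a smoke test rather than a quality benchmark.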
