Topic Modeling

$p(word|corpus) = \sum_{topic}p(word|topic)*p(topic|corpus)$
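
As a worked instance of the formula above, the sketch below marginalizes over two topics. The topic names, words, and probabilities are invented for illustration and are not taken from the dataset.

```python
# Toy distributions, invented for illustration only.
p_word_given_topic = {
    "sports":   {"ball": 0.5, "vote": 0.1, "team": 0.4},
    "politics": {"ball": 0.1, "vote": 0.7, "team": 0.2},
}
p_topic_given_corpus = {"sports": 0.25, "politics": 0.75}

def word_probability(word):
    """p(word|corpus) = sum over topics of p(word|topic) * p(topic|corpus)."""
    return sum(
        p_word_given_topic[topic][word] * p_topic
        for topic, p_topic in p_topic_given_corpus.items()
    )

for word in ("ball", "vote", "team"):
    print(word, round(word_probability(word), 3))
```

Because each per-topic word distribution sums to 1 and the topic weights sum to 1, the resulting word probabilities also sum to 1.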

NMF (Non-negative Matrix Factorization):

  • MF (matrix factorization):

    $M \approx U I$

    • $M$ : rating matrix / tf-idf matrix ($m \times n$)
    • $U$ : user matrix / corpus' latent vectors ($m \times k$)
    • $I$ : item matrix / words' latent vectors ($k \times n$)
    • $k$ : number of topics
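
The factorization above can be sketched with NumPy. This is a minimal sketch that assumes the Lee-Seung multiplicative update rule, which is one common way to keep $U$ and $I$ non-negative; the README does not say which solver was used, and the function name `nmf`, its parameters, and the toy matrix are hypothetical.

```python
import numpy as np

def nmf(M, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative M (m x n) as U (m x k) @ I (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.random((m, k)) + eps   # non-negative random initialization
    I = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates (assumed solver): every factor entry is
        # multiplied by a non-negative ratio, so it never becomes negative.
        I *= (U.T @ M) / (U.T @ U @ I + eps)
        U *= (M @ I.T) / (U @ I @ I.T + eps)
    return U, I

# Toy tf-idf-like matrix (4 documents x 4 words); values are invented.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 1.0],
])
U, I = nmf(M, k=2)
print(np.round(U @ I, 2))   # reconstruction of M from the two factors
```

Each row of `U` is a document's weight on the $k$ topics, and each row of `I` is a topic's weight on the words, so the largest entries in a row of `I` are that topic's top words.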

LSA/LSI (Latent Semantic Analysis / Latent Semantic Indexing)

  • SVD decomposition:

    $M = U S \bar{I}^T$ , where $M$ can be a real or complex matrix ($m \times n$)

    • $U$ : real or complex unitary matrix ($m \times m$)
    • $S$ : rectangular diagonal matrix with non-negative real numbers (the singular values) on the diagonal ($m \times n$)
    • $I$ : real or complex unitary matrix ($n \times n$)
  • unitary matrix: a square matrix $Q$ is unitary when

    $Q\bar{Q}^T = \bar{Q}^TQ = E$ , where $E$ is the identity matrix

  • LSA(matrix factorization):

    $M = U S \bar{I}^T$ -> $M = \hat{U} \hat{S} \hat{I}$

    • $\hat{U}$ : initialized randomly ($m \times r$)
    • $\hat{S}$ : keeps the non-zero values of $S$ ($r \times r$)
    • $\hat{I}$ : initialized randomly ($r \times n$)
  • LSA(dimension reduction, truncated SVD): sort the singular values in $\hat{S}$ in descending order and keep the $k$ largest

  • LSA(prediction):

    for topic modeling:

    • Rating matrix -> tf-idf/bag-of-words matrix (corpus * words)
    • Users' preference matrix -> corpus' latent vectors
    • Items' feature matrix -> words' latent vectors
    • $k$ -> number of topics
  • optimize:

    • objective: root-mean-square error (RMSE) between $M$ and its reconstruction
    • optimizer: stochastic gradient descent (SGD)
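
The optimization step can be sketched as follows. This is a minimal illustration, assuming plain per-cell SGD on the squared error (minimizing it also minimizes the RMSE); the function names, learning rate, epoch count, and toy matrix are hypothetical and not taken from the repository.

```python
import numpy as np

def rmse(M, U, I):
    """Root-mean-square error between M and its reconstruction U @ I."""
    return float(np.sqrt(np.mean((M - U @ I) ** 2)))

def factorize_sgd(M, k, lr=0.05, epochs=500, seed=0):
    """Fit M (m x n) ~ U (m x k) @ I (k x n) by SGD on the squared error."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.normal(scale=0.1, size=(m, k))   # random initialization
    I = rng.normal(scale=0.1, size=(k, n))
    cells = [(i, j) for i in range(m) for j in range(n)]
    for _ in range(epochs):
        for idx in rng.permutation(len(cells)):   # visit cells in random order
            i, j = cells[idx]
            err = M[i, j] - U[i] @ I[:, j]        # error on a single cell
            u_old = U[i].copy()                   # keep the pre-update row
            U[i] += lr * err * I[:, j]            # gradient step on row i of U
            I[:, j] += lr * err * u_old           # gradient step on column j of I
    return U, I

# Toy document-term matrix (3 documents x 4 words); values are invented.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])
U, I = factorize_sgd(M, k=2)
print("RMSE:", round(rmse(M, U, I), 3))
```

A constant learning rate keeps the sketch short; in practice a decaying rate or a regularization term is often added.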

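The SVD decomposition and the truncation step described above can be checked numerically with `numpy.linalg.svd`. The toy matrix and the choice $k = 2$ are assumptions made for illustration.

```python
import numpy as np

# Toy document-term matrix (3 documents x 4 words); values are invented.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

# Full SVD: U is (m x m), s holds the singular values, and Vh is the
# conjugate transpose of the (n x n) unitary matrix called I in the text.
U, s, Vh = np.linalg.svd(M, full_matrices=True)

# NumPy returns the singular values already sorted in descending order,
# so truncation is just slicing off the first k components.
k = 2
U_k = U[:, :k]          # documents' latent vectors (m x k)
S_k = np.diag(s[:k])    # k largest singular values (k x k)
Vh_k = Vh[:k, :]        # words' latent vectors (k x n)

M_k = U_k @ S_k @ Vh_k  # rank-k approximation of M
print(np.round(M_k, 2))
print("approximation error:", round(float(np.linalg.norm(M - M_k)), 3))
```

The Frobenius-norm error of the rank-$k$ approximation equals the square root of the sum of the squared discarded singular values, which is why the smallest ones are dropped first.
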
LDA(Latent Dirichlet Allocation)

  • Plate Notation

    • $K$ : number of topics
    • $M$ : number of documents in the corpus
    • $N$ : number of words in a document ($N_i$ for document $i$)
    • $Z_{ij}$ is the topic for the $j$-th word in document $i$
    • $W_{ij}$ is the specific word (observed word)
  • Generative Process

    1. Choose $\theta_i \sim Dir(\alpha)$, where $i \in \{1, 2, \dots, M\}$
    2. Choose $\varphi_k \sim Dir(\beta)$, where $k \in \{1, 2, \dots, K\}$
    3. For each word position $i,j$, where $i \in \{1, 2, \dots, M\}$ and $j \in \{1, 2, \dots, N_i\}$
      • choose a topic $Z_{ij} \sim Multinomial(\theta_i)$
      • choose a word $W_{ij} \sim Multinomial(\varphi_{Z_{ij}})$
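
The generative process above can be simulated directly. The sketch below samples a tiny synthetic corpus; the vocabulary, document lengths, and the symmetric priors of 0.5 are hypothetical values chosen for illustration. It shows only the forward (generative) direction, not inference of topics from real text.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["vaccine", "drug", "apple", "google", "president", "election"]  # hypothetical
K = 2                              # number of topics
M = 3                              # number of documents
N = [5, 4, 6]                      # N_i: number of words in document i
alpha = np.full(K, 0.5)            # Dirichlet prior over topics per document
beta = np.full(len(vocab), 0.5)    # Dirichlet prior over words per topic

# Step 1: theta_i ~ Dir(alpha), one topic distribution per document.
theta = rng.dirichlet(alpha, size=M)   # shape (M, K)
# Step 2: phi_k ~ Dir(beta), one word distribution per topic.
phi = rng.dirichlet(beta, size=K)      # shape (K, vocabulary size)

# Step 3: for every word position (i, j), draw a topic, then a word.
documents = []
for i in range(M):
    words = []
    for j in range(N[i]):
        z_ij = rng.choice(K, p=theta[i])            # Z_ij ~ Multinomial(theta_i)
        w_ij = rng.choice(len(vocab), p=phi[z_ij])  # W_ij ~ Multinomial(phi_{Z_ij})
        words.append(vocab[w_ij])
    documents.append(words)

for i, words in enumerate(documents):
    print(f"document {i}: {words}")
```

Fitting LDA reverses this process: given only the observed words $W_{ij}$, it estimates the hidden $\theta_i$, $\varphi_k$, and $Z_{ij}$.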

dataset

test case

  • NMF

    Topic 0 : ['despite', 'plea', 'Kardashian', 'Kim', 'execution', 'Bernard', 'Indiana', 'Brandon', 'federal', 'EU']
    Topic 1 : ['Trump', 'president', 'elect', 'run', 'mate', 'finalist', 'Donald', 'beat', 'include', 'safe']
    Topic 2 : ['阅读全文', 'fall', 'foul', 'billionaire', 'profile', 'controversial', 'law', 'democracy', 'medium', 'figure']
    Topic 3 : ['home', 'family', 'shoot', 'dead', 'Mr', 'dentist', 'Goodson', 'appointment', 'de', 'say']
    Topic 4 : ['Apple', 'new', 'year', 'Google', 'event', 'vaccine', 'rise', 'smart', 'find', 'daily']
    
  • LSI

    Topic 0 : ['return', 'say', 'vaccine', 'appointment', 'Mr', 'Goodson', 'dentist', 'shoot', 'dead', 'family']
    Topic 1 : ['high', 'White', 'fatality', 'House', 'daily', 'hold', 'coronavirus', 'relate', 'rise', 'record']
    Topic 2 : ['vaccine', 'Johnson', 'Boris', 'negotiation', 'EU', 'continue', 'Drug', 'deem', 'Food', 'adviser']
    Topic 3 : ['finalist', 'mate', 'Donald', 'elect', 'Trump', 'beat', 'president', 'run', 'include', 'Boris']
    Topic 4 : ['Indiana', 'Bernard', 'execution', 'Kim', 'Kardashian', 'Brandon', 'plea', 'federal', 'despite', 'set']
    
  • LDA

    Topic 0 : ['vaccine', 'Administration', 'deem', 'Drug', 'Food', 'adviser', 'panel', 'safe', 'effective', 'Apple']
    Topic 1 : ['Apple', 'Google', 'Black', 'deal', 'new', 'Facebook', 'Friday', 'return', 'good', 'app']
    Topic 2 : ['Apple', 'figure', 'profile', 'billionaire', 'foul', 'democracy', 'law', 'controversial', 'fall', 'medium']
    Topic 3 : ['Apple', 'continue', 'photo', 'thousand', 'ask', 'categorise', 'Great', 'Barrier', 'volunteer', 'Reef']
    Topic 4 : ['Apple', 'rise', 'event', 'say', 'Google', 'relate', 'hold', 'fatality', 'White', 'House']
    

reference
