
hsbm_topicmodel's Issues

Add licence

Hi Martin, Thanks for speaking at Sydney U the other day. I'm wondering what licence applies to the code here. Could you please add a LICENSE.txt, to help others understand how derivative works may be licensed? Thanks.

Graph tool and/or sbmtm.py on Mac OS with M2 (AxisError: axis xxxx is out of bounds for array of dimension 0)

Hello,

I have used topsbm for quite some time on an Intel-based Mac without any significant problems. I recently switched to a Mac with the M2 processor and installed everything via conda, and when running the tutorial to check that everything works I get the following error:

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
Cell In[9], line 13
      5 model.make_graph(texts,documents=titles)
      7 ## we can also skip the previous step by saving/loading a graph
      8 # model.save_graph(filename = 'graph.xml.gz')
      9 # model.load_graph(filename = 'graph.xml.gz')
     10 
     11 ## fit the model
     12 # gt.seed_rng(32) ## seed for graph-tool's random number generator --> same results
---> 13 model.fit()

File ~/tmp/hSBM_Topicmodel/sbmtm.py:236, in sbmtm.fit(self, overlap, n_init, verbose, epsilon)
    234 for i_n_init in range(n_init):
    235     base_type = gt.BlockState if not overlap else gt.OverlapBlockState
--> 236     state_tmp = gt.minimize_nested_blockmodel_dl(g,
    237                                                  state_args=dict(
    238                                                      base_type=base_type,
    239                                                      **state_args),
    240                                                  multilevel_mcmc_args=dict(
    241                                                      verbose=verbose))
    242     L = 0
    243     for s in state_tmp.levels:

File ~/anaconda3/envs/gt/lib/python3.11/site-packages/graph_tool/inference/minimize.py:235, in minimize_nested_blockmodel_dl(g, init_bs, state, state_args, multilevel_mcmc_args)
    137 def minimize_nested_blockmodel_dl(g, init_bs=None,
    138                                   state=NestedBlockState, state_args={},
    139                                   multilevel_mcmc_args={}):
    140     r"""Fit the nested stochastic block model, by minimizing its description length
    141     using an agglomerative heuristic.
    142 
   (...)
    232 
    233     """
--> 235     state = state(g, bs=init_bs, **state_args)
    237     args = dict(niter=1, psingle=0, beta=numpy.inf)
    238     args.update(multilevel_mcmc_args)

File ~/anaconda3/envs/gt/lib/python3.11/site-packages/graph_tool/inference/nested_blockmodel.py:96, in NestedBlockState.__init__(self, g, bs, base_type, state_args, hstate_args, hentropy_args, **kwargs)
     81 self.hstate_args["copy_bg"] = False
     82 self.hentropy_args = dict(hentropy_args,
     83                           adjacency=True,
     84                           dense=True,
   (...)
     93                           recs_dl=False,
     94                           beta_dl=1.)
---> 96 self.levels = [base_type(g, b=bs[0] if bs is not None else None,
     97                          **self.state_args)]
     99 if bs is None:
    100     if base_type is OverlapBlockState:

File ~/anaconda3/envs/gt/lib/python3.11/site-packages/graph_tool/inference/blockmodel.py:380, in BlockState.__init__(self, g, b, B, eweight, vweight, recs, rec_types, rec_params, clabel, pclabel, bfield, Bfield, deg_corr, dense_bg, **kwargs)
    377 assert all(self.recdx.a >= 0), self.recdx.a
    379 if deg_corr:
--> 380     init_q_cache(max(2 * max(self.get_E(), self.get_N()), 100))
    382 self._entropy_args = dict(adjacency=True, deg_entropy=True, dl=True,
    383                           partition_dl=True, degree_dl=True,
    384                           degree_dl_kind="distributed", edges_dl=True,
    385                           dense=False, multigraph=True, exact=True,
    386                           recs=True, recs_dl=True, beta_dl=1.,
    387                           Bfield=True)
    389 if len(kwargs) > 0:

File ~/anaconda3/envs/gt/lib/python3.11/site-packages/numpy/core/fromnumeric.py:2810, in max(a, axis, out, keepdims, initial, where)
   2692 @array_function_dispatch(_max_dispatcher)
   2693 @set_module('numpy')
   2694 def max(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
   2695          where=np._NoValue):
   2696     """
   2697     Return the maximum of an array or maximum along an axis.
   2698 
   (...)
   2808     5
   2809     """
-> 2810     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
   2811                           keepdims=keepdims, initial=initial, where=where)

File ~/anaconda3/envs/gt/lib/python3.11/site-packages/numpy/core/fromnumeric.py:88, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     85         else:
     86             return reduction(axis=axis, out=out, **passkwargs)
---> 88 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

AxisError: axis 3203 is out of bounds for array of dimension 0

I could not figure this out. Do you have any idea what may be causing it and how to fix it?
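For what it's worth, the last frames of the traceback suggest that numpy's max ended up being called with two plain integers. Since its second positional argument is axis, that alone reproduces this exact error, independent of graph-tool (a minimal sketch; the numbers are illustrative):

import numpy as np

## np.max does not compute a pairwise maximum of two scalars: the second
## positional argument is axis, so a 0-d input with axis=3203 raises
## AxisError: axis 3203 is out of bounds for array of dimension 0
np.max(np.int64(5), 3203)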

Thank you in advance

Cannot reproduce plot in ipynb with counts=True

The Jupyter notebook uses make_graph with the implicit counts=True. The published plot looks something like this:
[image: the published hierarchy plot]

but with counts=True I can only get plots like this:
[image: the plot obtained with counts=True]

Is counts=True working for inference but not for plotting? Or is it broken for inference too?
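One way to narrow this down might be to rebuild the graph with counts=False, so that each token becomes its own edge instead of an edge weight, and compare the resulting fit and plot. A sketch, using make_graph and plot as they appear elsewhere in these issues (the filename is illustrative):

model_multi = sbmtm()
model_multi.make_graph(texts, documents=titles, counts=False) ## one edge per token instead of a weighted edge
model_multi.fit()
model_multi.plot(filename='counts_false.png')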

"RuntimeWarning: invalid value encountered in true_divide"

Hi, thank you very much for sharing your wonderful work.
When I call the model.print_topics function, I encounter a warning:
RuntimeWarning: invalid value encountered in true_divide

group membership of each word-node P(t_w | w)

topic-distribution for words P(w | t_w)
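This warning usually means that a row-normalisation divided by a zero row sum, producing NaNs; another issue below traces it to the p_td_d / p_tw_d computation in sbmtm.py. A minimal sketch of the mechanism and a possible workaround:

import numpy as np

n = np.zeros((2, 3))                        ## a count matrix with all-zero rows
p = n / np.sum(n, axis=1)[:, np.newaxis]    ## 0/0 -> NaN, triggering the RuntimeWarning
p = np.nan_to_num(p, nan=0.0)               ## replace the NaNs with 0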

P(w) computation doesn't take in account weights if counts=True

Hi.

I'm trying to reconstruct P(w) by multiplying P(word|topic) P(topic|sample) P(sample), i.e. by matrix-multiplying p_w_tw and p_tw_d.

If I understood everything correctly, P(w) should be the frequency of the word in the corpus. For the corpus in your paper, however, this is not what I get:

p_w_topsbm_original.pdf

I noticed that the following lines in the get_groups method

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    n_db[int(v1), z1] += 1
    n_dbw[int(v1), z2] += 1
    n_wb[int(v2) - D, z2] += 1

do not take the edge weights into account. This is an issue if the graph was built with counts=True.

Modifying those lines to

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    weight = g.ep["count"][e]
    n_db[int(v1), z1] += weight
    n_dbw[int(v1), z2] += weight
    n_wb[int(v2) - D, z2] += weight

restores the expected behaviour: the P(w) obtained from P(w|tw) * P(tw|d) * P(d) now matches the empirical frequency:

p_w_topsbm.pdf

Is there something wrong with my assumptions? Multiplying P(w|tw) by P(tw|d) by P(d) should give the frequency of word w, right?

Should the probabilities take the edge weights into account, or is there some other factor that I'm missing?
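For concreteness, this is the reconstruction I mean (a sketch: it assumes get_groups(l=0) returns the 'p_w_tw' and 'p_tw_d' matrices as in sbmtm.py, takes P(d) proportional to document length as one plausible choice, and texts is the tokenised corpus):

import numpy as np

groups = model.get_groups(l=0)
p_w_tw = groups['p_w_tw']                       ## P(w | t_w), shape (V, B_w)
p_tw_d = groups['p_tw_d']                       ## P(t_w | d), shape (B_w, D)
lengths = np.array([len(doc) for doc in texts])
p_d = lengths / lengths.sum()                   ## P(d) proportional to document length
p_w = p_w_tw @ p_tw_d @ p_d                     ## P(w) = sum over t_w, d of P(w|t_w) P(t_w|d) P(d)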

Thank you

Filippo

Hierarchical relations between groups

I wonder if there is any way to see how the groups in level 1 are connected to those in level 0 or level 2 (especially for the words). Maybe I am missing something, but in the paper (figure 5) you seem to distinguish the groups at the second hierarchical level with solid lines and then show how words are grouped at the third hierarchical level with dotted lines, and the graph illustrates how the groups at the second level connect to those at the third.

I wonder how I can recover such hierarchical relations with your code.
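One possibility, assuming the fitted graph-tool NestedBlockState is stored as model.state (the plot traceback elsewhere in these issues suggests it is): the partition at level l+1 assigns each level-l group to its parent group, so the hierarchy can be read off level by level. A sketch:

state = model.state                     ## fitted NestedBlockState
b0 = state.levels[0].get_blocks()       ## level-0 group of each node
b1 = state.levels[1].get_blocks()       ## level-1 parent of each level-0 group
for v in range(5):                      ## first few nodes, for illustration
    g0 = int(b0[v])
    print(v, '-> level-0 group', g0, '-> level-1 group', int(b1[g0]))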

"Overlapping not implemented yet"

It says in the docstring for sbmtm.fit that overlapping isn't implemented. Does this mean that equations 8-20 in the paper are not implemented?

Questions regarding the paper, particularly in Role Detection

After reading https://advances.sciencemag.org/content/advances/4/7/eaaq1360.full.pdf, I have some questions about the comparison between topic modeling and community detection. There are papers that do role detection (or structural/automorphic similarity, see below); some use NMF directly, while others claim to use SBMs. Are there any analogs to Fig. 2 in the paper?

tutorial not compatible with latest graph-tool 2.41

Hi, I believe the multilayer_SBM notebook tutorial is not compatible with the latest graph-tool version.

In particular, when running the fit_hyperlink_text_hsb part I get an error because the deg_corr and layers arguments of gt.minimize_nested_blockmodel_dl no longer exist.
Looking at graph-tool's history, they appear to have been last present in 2.37. Maybe you tested the code with earlier versions?
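For reference, a hedged sketch of how the same options appear to be passed in recent graph-tool releases, where per-level options go through state_args and layered models use LayeredBlockState (untested; g.ep['layer'] stands for whatever edge property holds the layer labels in the notebook):

import graph_tool.all as gt

state = gt.minimize_nested_blockmodel_dl(
    g,
    state_args=dict(
        base_type=gt.LayeredBlockState,  ## layered base state replaces the old layers= flag
        state_args=dict(deg_corr=True, ec=g.ep['layer'], layers=True)))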

Just flagging this because I saw the recently published paper on EPJ Data Science; it looks great and I was looking forward to applying the methodology to our use case!

Cannot plot fitted model

After fitting the model, I tried to use the plotting, but it failed. I guess it is an error within graph_tool:
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 model.plot()

/work/scripts/sbmtm.py in plot(self, filename, nedges)
    182         '''
    183         self.state.draw(layout='bipartite', output=filename,
--> 184                         subsample_edges=nedges, hshortcuts=1, hide=0)
    185
    186

/usr/lib/python3/dist-packages/graph_tool/inference/nested_blockmodel.py in draw(self, **kwargs)
    932         draws the hierarchical state."""
    933         import graph_tool.draw
--> 934         return graph_tool.draw.draw_hierarchy(self, **kwargs)
    935
    936

AttributeError: module 'graph_tool.draw' has no attribute 'draw_hierarchy'
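A guess rather than a confirmed cause: graph_tool.draw only exports its cairo-based functions (draw_hierarchy among them) when the drawing dependencies are available, so an AttributeError like this may point to an installation without pycairo/Gtk support. A quick check:

import graph_tool.draw as gtd

## If these print False, the drawing submodule loaded without its cairo
## backend, and plotting will not work until the dependencies are installed.
print(hasattr(gtd, 'draw_hierarchy'))
print(hasattr(gtd, 'graph_draw'))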

p_td_d and p_tw_d are of suspiciously similar construction

n_db and n_dbw are initially populated identically:

hSBM_Topicmodel/sbmtm.py, lines 397 to 406 at 2ba7658:

n_db = np.zeros((D,B))  ## number of half-edges incident on document-node d and labeled as document-group td
n_dbw = np.zeros((D,B)) ## number of half-edges incident on document-node d and labeled as word-group tw
for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    n_db[int(v1), z1] += 1
    n_dbw[int(v1), z2] += 1
    n_wb[int(v2)-D, z2] += 1

They are then indexed by ind_d and ind_w2 respectively, but ind_d and ind_w2 are also constructed identically before these matrices are normalised into probabilities.

Are these not meant to be distinct outputs?

model.clusters returns dict of empty lists

I'm trying to apply the model to a corpus of 5352 documents, following the tutorial notebook. After running the model.fit() method I can plot my results as a graph, see the topic distributions per document, and get valid outputs from model.clustering_query. However, when running model.clusters I get a dict with N topics but only empty lists:

>>> model.clusters(l=1, n=5)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: [],
 7: [],
 8: [],
 9: [],
 10: [],
 11: [],
 12: [],
 13: [],
 14: [],
 15: [],
 16: []}

Are there any known reasons why this may happen, or do I need to provide more info? I've installed graph-tool on my Windows system through Docker.

Edit:
After looking further into the source code for model.clusters, I see that the problem is that one of the objects contains NaN values; recoding the NaNs to 0s solved my problem. The problem then seems to originate in the model.get_groups() method, though I haven't had time to debug that yet.

def clusters(self,l=0,n=10):
    '''
    Get n 'most common' documents from each document cluster.
    most common refers to largest contribution in group membership vector.
    For the non-overlapping case, each document belongs to one and only one group with prob 1.

    '''
    # dict_groups = self.groups[l]
    dict_groups = self.get_groups(l=l)
    Bd = dict_groups['Bd']
    p_td_d = dict_groups['p_td_d']
    p_td_d = np.nan_to_num(p_td_d, nan=0.0) # <----- This solved my issue (note the nan= keyword; the second positional argument of nan_to_num is copy, not the fill value)

    docs = self.documents
    ## loop over all word-groups
    dict_group_docs = {}
    for td in range(Bd):
        p_d_ = p_td_d[td,:]
        ind_d_ = np.argsort(p_d_)[::-1]
        list_docs_td = []
        for i in ind_d_[:n]:
            if p_d_[i] > 0:
                list_docs_td+=[(docs[i],p_d_[i])]
            else:
                break
        dict_group_docs[td] = list_docs_td
    return dict_group_docs

The error pertains to these warnings:

/home/user/sbmtm.py:547: RuntimeWarning: invalid value encountered in true_divide
  p_td_d = (n_db/np.sum(n_db,axis=1)[:,np.newaxis]).T
/home/user/sbmtm.py:553: RuntimeWarning: invalid value encountered in true_divide
  p_tw_d = (n_dbw/np.sum(n_dbw,axis=1)[:,np.newaxis]).T

Clustering

I am really excited about applying these methods to my textual data, and they appear to be working really well. However, I have a question regarding the clustering: how does it decide on the number of clusters?
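Not an authoritative answer, but the number of groups is not a user-set parameter: minimize_nested_blockmodel_dl chooses it by minimising the model's description length, as described in the paper. Assuming the fitted NestedBlockState is stored as model.state, the selected sizes can be inspected like this:

state = model.state
print('description length:', state.entropy())
for l, level in enumerate(state.levels):
    print('level', l, ':', level.get_nonempty_B(), 'groups')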

Memory issues for medium-sized corpus

Thanks for this intriguing new method. A few toy models showed interesting and consistent patterns, so I would like to use it on a real corpus. When applying the method to a medium-sized corpus, however, I run into memory issues (20 GB RAM + 20 GB swapfile). The properties of the corpus are as follows:

#docs (unfiltered): 7897
#tokens (unfiltered): 9273368
#types (unfiltered): 67485

Following your paper, the algorithm should be scalable. Yet even when drastically reducing the size of the graph by filtering out nodes (docs + words) as well as edges (tokens), I still encounter the same issues; a sketch of the kind of pruning I mean is below.

What is the exact problem? Is there a way to make the fit more scalable?
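The pruning sketch (assumptions: texts is the tokenised corpus and titles the document names, as in the tutorial; the thresholds are illustrative):

from collections import Counter

tf = Counter(w for doc in texts for w in doc)        ## total token counts
df = Counter(w for doc in texts for w in set(doc))   ## document frequencies
keep = {w for w in tf
        if tf[w] >= 5 and df[w] <= 0.5 * len(texts)} ## drop rare and near-ubiquitous words
texts_filtered = [[w for w in doc if w in keep] for doc in texts]

model = sbmtm()
model.make_graph(texts_filtered, documents=titles)
model.fit()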

Edit: I use the graph tool Docker container
