
Comments (8)

martingerlach commented on September 4, 2024

The complexity (and thus memory requirements) of the algorithm scales with the number of nodes in the graph. For a corpus similar to yours, the number of nodes is dominated by the number of word-types. One pragmatic way to substantially reduce this number is to impose a minimum number of occurrences for each word-type to be included in the graph. This amounts to filtering the large number of low-frequency words which should not contain much information in any case.

I added an optional argument to make_graph (n_min=10) in order to filter out many of the word types (you can choose other values for n_min). This should allow you to analyze the corpus with 20GB of memory.
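
As a rough illustration, here is a minimal sketch of how the cutoff could be used, assuming the sbmtm wrapper class from this repository and the n_min argument described above (class and method names are assumptions on my part, not verified against the current code):

    # Minimal sketch, assuming the sbmtm wrapper class and the optional
    # n_min argument described above; names are assumptions.
    from sbmtm import sbmtm

    # Toy corpus: each document is a list of tokens.
    texts = [
        ["topic", "model", "word", "topic", "network"],
        ["network", "graph", "community", "topic", "graph"],
    ]

    model = sbmtm()
    model.make_graph(texts, n_min=2)  # on a real corpus, n_min=10 or higher as suggested
    model.fit()

The same idea applies to any frequency cutoff: the rarer the words you drop, the fewer nodes the graph has and the less memory the inference needs.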

I am looking into other options to reduce memory usage in the future.

Best,
Martin

aflueckiger commented on September 4, 2024

Thanks for the quick answer and the code amendment. However, reducing the size this way doesn't really help unless I stick to toy models. More specifically, I already filtered out words so drastically myself that the remaining corpus is no longer of any practical use. I set extreme thresholds for the term-document frequency (min_df=2000, max_df=0.7), which gives me the following numbers.

#docs (unfiltered): 7897
#tokens (unfiltered): 7953429
#types/unique words (unfiltered): 40891

#docs (filtered): 7897
#tokens (filtered): 3036422
#types/unique words (filtered): 530
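
For reference, here is a minimal sketch of how such document-frequency thresholds are typically applied, assuming scikit-learn's CountVectorizer (the thread doesn't say which tool was used); the toy thresholds below are scaled down so the snippet runs on a handful of documents:

    # Minimal sketch, assuming scikit-learn's CountVectorizer; on the full
    # corpus the thresholds above (min_df=2000, max_df=0.7) would be used.
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "the topic model builds a bipartite word-document network",
        "the network groups words and documents into blocks",
        "stochastic block models infer the groups from the network",
    ]

    vectorizer = CountVectorizer(min_df=2, max_df=0.7)  # toy values for this tiny corpus
    X = vectorizer.fit_transform(documents)
    print(len(vectorizer.vocabulary_))  # number of word-types that survive the filter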

I wonder how you processed the Twitter corpus that you listed in your paper (docs: 10,000, types: 12,258, words: 196,625). Something on my side doesn't seem to be in line with these results. Hence, the following questions:
Are the edges (i.e. tokens) really of no significance concerning complexity (they are an order of magnitude higher here)? Otherwise, I can only think of something not working properly in the current Docker build, unless you used a powerful server for these computations. Am I missing something?

martingerlach commented on September 4, 2024

You are right, the memory also depends on the number of edges (word tokens). Therefore, for a very large corpus, you would probably need to run the code on a cluster.

However, the Twitter corpus we used in the paper takes less than ~2GB of memory, so this should be no problem with 20GB. I tried to construct a corpus similar in size to the one you described by taking 10,000 books from Project Gutenberg. Setting a minimum threshold of 10-100 occurrences for each word yields a manageable corpus that I can run on my laptop.

So I am not completely sure where your problem stems from, since I haven't used the Docker build. Perhaps you can find additional information in the graph-tool GitLab issues: https://git.skewed.de/count0/graph-tool/issues

Beyond all this, Tiago (graph-tool) suggested trying the following to reduce memory usage (not yet implemented in the topic-model wrapper):

The main reason why minimize_nested_blockmodel_dl() uses a lot of RAM is
because it keeps several simultaneous copies of the global state when it's
searching for the optimal number of groups. Instead, a much more economical
approach is to use a single state the whole time. You do this by
initializing a single NestedBlockState, where each node belongs to its own
group at first. You then equilibrate a MCMC by calling repeatedly:

state.multiflip_mcmc_sweep(niter=n)

with n=10 or 100 or so. This attempts to merge and split groups, so it gives
good results. In practice, minimize_nested_blockmodel_dl() still gives
better results in many cases, but they are comparable. With this approach,
however, the memory use is far lower (a factor of 3 to 4, or so).
(Remember that using weighted graphs, via the 'eweight' parameter, will also
help reduce memory use.)
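
A rough sketch of what this merge-split approach could look like in graph-tool, using a small built-in graph as a stand-in for the word-document network (the edge property "counts" plays the role of the token multiplicities passed via eweight; none of this is the wrapper's actual code):

    # Rough sketch of the merge-split MCMC suggested above, not the wrapper's
    # actual implementation.
    import numpy as np
    import graph_tool.all as gt

    g = gt.collection.data["football"]   # toy stand-in for the word-document graph
    counts = g.new_edge_property("int")
    counts.a = 1                         # all edge multiplicities set to 1 in this toy example

    state = gt.NestedBlockState(g, state_args=dict(eweight=counts))

    # Greedy (zero-temperature) merge-split sweeps.
    for _ in range(100):
        state.multiflip_mcmc_sweep(niter=10, beta=np.inf)

    print(state.entropy())               # description length of the resulting hierarchy

Compared to minimize_nested_blockmodel_dl(), this keeps a single NestedBlockState in memory the whole time, which is where the factor of 3 to 4 in memory savings mentioned above comes from.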

aflueckiger commented on September 4, 2024

I really appreciate your support. I installed graph-tool locally and did some further experiments. The RAM consumption is still incredibly high and seems to differ from your observations. Do I understand correctly that you successfully processed 10,000 books on a laptop, with nothing but a minimum term-frequency filter of up to n_min=100? That must have resulted in a much bigger corpus than the one I am trying to model.

I managed to process an extremely filtered corpus successfully (min_df=1000, max_df=0.6); however, it still takes 17GB of RAM. Thus, something is still not in line.

#docs (filtered): 7897
#tokens (filtered): 3797133
#types/unique words (filtered): 1151

Currently, I don't have the expertise to readily extend the code following your additional suggestions as I work primarily in the field of computational linguistics and computational social science. I would be happy to test such an implementation though.

Moreover, can you share the script of your experiment (incl. the corpus loading from nltk?) to make sure we are on the same page? :)

martingerlach commented on September 4, 2024

Here is a notebook in which I analyze 1000 books from Project Gutenberg with a minimum threshold n_min=100 (so only words that occur at least 100 times are included in the corpus). This takes about 1GB and runs for a few minutes. I guess this would easily work with 10,000 documents, but I didn't want to wait.

I am looking into other methods to further reduce memory usage (along the lines of what I wrote above) and hope to be able to include some of these features at some point.

Let me know if you have further questions.
dev_20181017_demo-for-PG-corpus.html.tar.gz

aflueckiger commented on September 4, 2024

The corpus you tested with is quite different from mine. In your example, only 1000 documents are considered, with a maximal length of 1000 tokens each, which results in a relatively small corpus. Moreover, a minimum term frequency of 100 is very restrictive given that the corpus has only 1000 documents (a word needs to occur in every 10th document on average, so smaller "topics" disappear). For practical applications, these requirements are not appropriate.

Modeling bigger corpora may not be possible with the current implementation unless one has access to a cluster. The algorithm used has a complexity of O(V ln² V), which explains the memory issue with the bigger corpus: https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference.minimize.minimize_nested_blockmodel_dl

I only see the following options:

  • filtering in a linguistically informed way, based on part-of-speech (POS) tags, to keep only semantically relevant words (nouns, adjectives, verbs); a sketch follows after this list
  • waiting for an implementation that reduces the required memory as sketched out
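
A minimal sketch of the POS-based filtering option, assuming spaCy and its small English model (any POS tagger would do; nothing here is prescribed by the thread):

    # Minimal sketch of POS-based filtering, assuming spaCy and its small
    # English model (install with: python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    KEEP = {"NOUN", "PROPN", "ADJ", "VERB"}

    def content_words(text):
        """Keep only semantically relevant tokens: nouns, adjectives, verbs."""
        return [tok.lower_ for tok in nlp(text) if tok.pos_ in KEEP]

    print(content_words("The topic model infers hierarchical groups of words and documents."))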

Sampling only a subset of words from each document would jeopardize the explanatory power, so this is not an option for my purposes.

Do you have other suggestions? Thank you.

jnothman commented on September 4, 2024

I wonder whether there is scope to implement an alternative inference algorithm for hSBM, such as the variational approach of Park and Bader (https://arxiv.org/pdf/1711.05150.pdf).

count0 commented on September 4, 2024

With a recent graph-tool version (2.43 or above) the memory requirements should be significantly improved (as well as speed). Please try again!
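
For what it's worth, a quick way to check which graph-tool version is installed (nothing specific to this thread):

    # Check that the installed graph-tool is 2.43 or newer.
    import graph_tool
    print(graph_tool.__version__)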
