
Get raw ngram count in addition to logProb (berkeleylm, closed)

adampauls commented on September 28, 2024

Get raw ngram count in addition to logProb

from berkeleylm.

Comments (3)

GoogleCodeExporter commented on September 28, 2024

Do you need this access to be fast? There is already some functionality you 
can use by calling:
 new NgramMapWrapper<W, LongRef>(lm.getNgramMap(), lm.getWordIndexer());

where lm is a StupidBackoffLm. This gives you a Map from List<W> to LongRef, 
with each LongRef holding a count. However, this interface is slow because of 
all the boxing and unboxing.
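
To make the shape of that Map view concrete, here is a minimal, self-contained sketch of looking up a raw count through a Map keyed by word lists. It does not use berkeleylm at all: CountRef, toyCounts and rawCount are hypothetical stand-ins invented for illustration (CountRef only mimics the idea of a mutable long holder), and W is assumed to be String. Every lookup goes through object keys and object values rather than primitives, which is the kind of overhead mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mutable long holder; this is NOT berkeleylm's LongRef.
final class CountRef {
    long value;

    CountRef(long value) {
        this.value = value;
    }
}

final class CountLookupSketch {
    // Toy stand-in for the Map<List<W>, LongRef> view, with W assumed to be String.
    static Map<List<String>, CountRef> toyCounts() {
        Map<List<String>, CountRef> counts = new HashMap<>();
        counts.put(Arrays.asList("the", "cat"), new CountRef(12L));
        counts.put(Arrays.asList("the", "cat", "sat"), new CountRef(3L));
        return counts;
    }

    // Returns the raw count stored for the n-gram, or 0 if the n-gram is absent.
    static long rawCount(Map<List<String>, CountRef> counts, String... words) {
        CountRef ref = counts.get(Arrays.asList(words)); // allocates a List key per lookup
        return ref == null ? 0L : ref.value;
    }

    public static void main(String[] args) {
        Map<List<String>, CountRef> counts = toyCounts();
        System.out.println(rawCount(counts, "the", "cat", "sat"));
        System.out.println(rawCount(counts, "the", "dog"));
    }
}
```

With the real library, the map would come from the NgramMapWrapper constructor call quoted above instead of toyCounts().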

Original comment by [email protected] on 14 Jul 2011 at 5:39


GoogleCodeExporter commented on September 28, 2024

Of course, fast is always better :)

However, it seems I have not fully understood the way the library works.
Two questions:
1) The Javadocs say that getLogProb() is slow. What is a fast way to get the 
log probability for a given phrase?

2) How is this probability computed from the raw counts in the Google Web1T 
corpus? It seems to me there should be an easy way to just invert the process.

thanks for your help,
Torsten

Original comment by [email protected] on 15 Jul 2011 at 7:52


GoogleCodeExporter commented on September 28, 2024

1) NgramLanguageModel.getLogProb(List<W>) is "slow" because it has to turn the 
List<W> into an int[] first. Note that it is not actually "slow", just slow 
relative to the efficient accessors in 
ArrayEncodedNgramLanguageModel.getLogProb(int[]) and 
ContextEncodedNgramLanguageModel.getLogProb. I have added additional comments 
that direct you towards those calls so others are not confused by this. 

2) The probability is computed using Stupid Backoff. I have added a call to 
StupidBackoffLm that grabs the count, and will be releasing a new version of 
the code with this fix shortly.
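
For readers wondering how a score relates to the raw counts, here is a minimal sketch of Stupid Backoff scoring as described by Brants et al. (2007): use the relative frequency of the n-gram if it was seen, otherwise drop the first word and multiply by a fixed factor (0.4 in that paper). This is an illustration under stated assumptions, not berkeleylm's implementation: the class StupidBackoffSketch, its methods and the toy counts are all hypothetical, and it returns plain ratios rather than log values. Since each score is a ratio of two counts, a single score does not determine the raw count by itself, which is why a direct count accessor is useful.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Stupid Backoff over raw counts; not berkeleylm code.
final class StupidBackoffSketch {
    // Backoff factor from Brants et al. (2007); assumed here, the library may differ.
    static final double ALPHA = 0.4;

    private final Map<List<String>, Long> counts = new HashMap<>();
    private long totalTokens = 0; // sum of all unigram counts

    void addCount(long count, String... ngram) {
        counts.put(Arrays.asList(ngram), count);
        if (ngram.length == 1) {
            totalTokens += count;
        }
    }

    long count(List<String> ngram) {
        Long c = counts.get(ngram);
        return c == null ? 0L : c;
    }

    // Score of the last word given the preceding words.
    // Assumes consistent counts: every seen n-gram has a seen context.
    double score(List<String> ngram) {
        double factor = 1.0;
        for (int start = 0; start < ngram.size(); start++) {
            List<String> sub = ngram.subList(start, ngram.size());
            long c = count(sub);
            if (c > 0) {
                long denom = sub.size() == 1
                        ? totalTokens
                        : count(sub.subList(0, sub.size() - 1));
                return factor * c / denom;
            }
            factor *= ALPHA; // unseen at this order: shorten the context and penalize
        }
        return 0.0; // even the last word alone was never seen
    }

    // Small made-up count table used only for this example.
    static StupidBackoffSketch toyModel() {
        StupidBackoffSketch lm = new StupidBackoffSketch();
        lm.addCount(5, "the");
        lm.addCount(2, "cat");
        lm.addCount(1, "sat");
        lm.addCount(2, "dog");
        lm.addCount(2, "the", "cat");
        lm.addCount(1, "cat", "sat");
        lm.addCount(1, "the", "cat", "sat");
        return lm;
    }

    public static void main(String[] args) {
        StupidBackoffSketch lm = toyModel();
        System.out.println(lm.score(Arrays.asList("the", "cat", "sat"))); // seen trigram
        System.out.println(lm.score(Arrays.asList("the", "dog")));        // backs off once
    }
}
```

In the toy table, the seen trigram "the cat sat" scores count(the cat sat) / count(the cat), while the unseen bigram "the dog" falls back to 0.4 times the unigram relative frequency of "dog".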

Original comment by [email protected] on 15 Jul 2011 at 6:19