Comments (19)
Eight days ago I ran my Python implementation (shown in the post) on the NIPS dataset with alpha = 50/K, beta = 0.1 and 1500 iterations of a Gibbs sampling implementation (https://github.com/lda-project/lda), and it seems to agree with your figure above.
Yes, the Cao 2009 metric doesn't agree with that (it looks like it points to a larger number of topics or higher beta values), but IMO the results strongly indicate that this implementation follows the one of Arun et al. From a user's perspective on the ldatuning package, I'd expect the implementation to follow the original paper, even if that means it doesn't always agree with other metrics.
from ldatuning.
Thanks, I'd also like to know Nikita's opinion. I'm reopening this issue since it looks like we have a solution, even with working R code, and this way people can see that the Arun metric in this package is under discussion. If Nikita gives the green light, you could even provide a patch in a PR.
In the coming days, I will update my own (Python) package tmtoolkit to implement your "interpretation" of the paper, since I find it much more convincing than the current ldatuning implementation (which I had translated to Python in my package). Unfortunately, I doubt that the paper's authors will ever reply :(
Thank you, very good commentary. I am busy next month but will try to inspect the Arun2010 implementation. Please feel free to provide a patch.
I'm also busy this month, but in November I think I can have a look at it and provide a patch.
I had a look at it and ran some experiments. I think the current implementation is correct at least in terms of giving useful results (when compared to the other metrics), whereas when properly normalizing the distributions, you get no informative results that point to a range of optimal numbers of topics.
So I still find it strange how the KL divergence is used here with unnormalized distributions, but it "seems to work". Maybe that's what the authors mean when they write that they found "empirically" (p. 396) that this divergence is useful for determining a good number of topics.
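For reference, the symmetric KL divergence in question is straightforward to compute on proper distributions; here is a minimal numpy sketch (toy vectors for illustration only):

```python
import numpy as np

def sym_kl(p, q):
    """Symmetric Kullback-Leibler divergence: KL(p||q) + KL(q||p)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# two toy probability distributions (each sums to 1)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(sym_kl(p, p))  # 0.0 for identical distributions
print(sym_kl(p, q))  # a small positive value

# Note: if p and q are *not* normalized to sum to 1, this quantity is no
# longer a proper KL divergence (it can even turn negative), which is why
# applying it to unnormalized singular values is surprising.
```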
I am also confused about this implementation. @internaut, why did you decide in the end that the implementation was correct? I agree with you that the implemented metric gives similar "right topic" values to the other metrics, so from that perspective it gives nice results. However, I am struggling to reproduce the plots from the paper with this implementation. Did you also try this?
Well, I said "correct at least in terms of giving useful results". I'm not sure it is correct in the sense of what was proposed by Arun et al. Unfortunately, they don't provide code with their paper. I tried different implementations but couldn't reproduce their results on the AP dataset. They also fail to report the alpha and beta priors they used for their topic models, which makes it even more difficult to replicate the results.
I'd suggest at least displaying a warning about these issues in the documentation.
Thank you for getting back to me, particularly since this issue was opened and closed many years ago. Yes, I did see that in your original comment; sorry if I wasn't clear.
It seems we have gone down similar routes. I have also tried to reproduce their paper's plots and found the missing alpha and beta a problem. My current function accounts for the two issues you raise in your original comment, but I also took from the paper that cm2 should be sorted in decreasing order. My function looks like the below.
Arun2010 <- function(models, dtm) {
  # length of documents (count of words)
  len <- slam::row_sums(dtm)
  # evaluate metrics
  metrics <- sapply(models, FUN = function(model) {
    # matrix M1: topic-word distribution
    m1 <- exp(model@beta)  # rowSums(m1) == 1
    m1.svd <- svd(m1)
    cm1 <- as.matrix(m1.svd$d)
    cm1 <- cm1 / sum(cm1)
    # matrix M2: document-topic distribution
    m2 <- model@gamma  # rowSums(m2) == 1
    cm2 <- as.vector(len %*% m2)  # crossprod(len, m2)
    cm2 <- cm2 / sum(cm2)
    cm2 <- sort(cm2, decreasing = TRUE)
    # symmetric Kullback-Leibler divergence
    divergence <- sum(cm1 * log(cm1 / cm2)) + sum(cm2 * log(cm2 / cm1))
    return(divergence)
  })
  return(metrics)
}
Figure 1: the Arun metric as described in the paper (function above) applied to the NIPS dataset (ref. 15 in the paper).
Figure 2: the left-hand plot of figure 5 from the paper, i.e. their implementation applied to the same NIPS data.
I'm at least in the right ballpark, but I can't reproduce the dip at 100-120 topics. Is this where you got to? I think the idea in the paper is really neat and would love to have a working implementation.
You write "Their implementation applied to the same NIPS data" – does that mean you have their code? I'd be interested in that.
I don't read in the paper that CM2 should be sorted. I guess you mean the following passage:
"A point to be mentioned here is that both the distributions CM1 and CM2 are in sorted order so that the corresponding topic components are expected to match."
As far as I understand, it should only be ensured that each component in CM1 matches the corresponding one in CM2, i.e. CM1[1] matches CM2[1], which should be the case as they both provide information on topic 1. IMO sorting either of them destroys this connection.
Normalizing CM1 and CM2 in the same way as you did, I also get a similar result: basically a monotonically increasing curve over K (noise aside). When not normalizing CM1, I usually do get a curve with a dip. The location of the dip usually depends on how I set alpha and beta, which conceptually makes sense because both priors encode my beliefs about how "fine-grained" the topics are.
Unfortunately, I don't mean I've seen their code. I wish... It is the plot from their paper. I have emailed them, though; hopefully I get a response. I will let you know if I do.
The reason I think cm2 needs sorting is that svd returns the semi-axes of the hyper-ellipsoid (the diagonal of Sigma, or svd$d in R) in order of their importance, i.e. sigma[1] will not necessarily correspond to topic 1 but instead to the topic with the largest variance (when the topics are well separated, of course). The more complete way to do this is to:
- compute cm2 and find the order of topic importance
- order m1 according to the topic importance found from cm2
- perform svd on the reordered m1
- Finally sort cm2.
This way the topic components in cm1 and cm2 match up. See function below for implementation.
Arun2010_paper <- function(models, dtm) {
  # length of documents (count of words)
  len <- slam::row_sums(dtm)
  # evaluate metrics
  metrics <- sapply(models, FUN = function(model) {
    # matrix M2: document-topic distribution
    m2 <- model@gamma  # rowSums(m2) == 1
    cm2 <- as.vector(len %*% m2)  # crossprod(len, m2)
    cm2 <- cm2 / sum(cm2)
    topicOrder <- order(cm2, decreasing = TRUE)  # find importance of topics
    cm2 <- sort(cm2, decreasing = TRUE)
    # matrix M1: topic-word distribution
    m1 <- exp(model@beta)  # rowSums(m1) == 1
    m1 <- m1[topicOrder, ]  # order topics according to their importance
    m1.svd <- svd(m1)
    cm1 <- as.matrix(m1.svd$d)
    cm1 <- cm1 / sum(cm1)
    # symmetric Kullback-Leibler divergence
    divergence <- sum(cm1 * log(cm1 / cm2)) + sum(cm2 * log(cm2 / cm1))
    return(divergence)
  })
  return(metrics)
}
What do you think? Or have I misinterpreted the effect of the svd on the topic order in sigma?
Isn't "d" from svd(X) invariant to the order of rows in X? So I think m1 <- m1[topicOrder,] has no effect.
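This permutation invariance is easy to check numerically, e.g. in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
m1 = rng.random((5, 20))   # toy "topic-word" matrix: 5 topics, 20 words
perm = rng.permutation(5)  # an arbitrary reordering of the rows (topics)

d_original = np.linalg.svd(m1, compute_uv=False)
d_permuted = np.linalg.svd(m1[perm, :], compute_uv=False)

# The singular values come back in descending order and are identical
# regardless of row order, so reordering m1 before the SVD cannot change cm1.
print(np.allclose(d_original, d_permuted))  # True
```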
However, generally I think the idea might be what is conveyed in the paper – although I'm still unsure, and it would be great to have their code. I've experimented with the following implementation, which is however in Python:
import numpy as np

def metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths):
    # CM1 – singular values of the topic-word distribution
    cm1 = np.linalg.svd(topic_word_distrib, compute_uv=False)
    cm1 /= np.sum(cm1)  # normalize
    # CM2 – topics scaled by document lengths
    doc_lengths = np.asarray(doc_lengths).flatten()
    cm2 = doc_lengths @ doc_topic_distrib  # @ is matrix multiplication
    cm2 = -np.sort(-cm2)  # sort in descending order (cm1 is already sorted that way)
    cm2 /= np.sum(cm2)  # normalize
    # symmetric Kullback-Leibler divergence KL(cm1||cm2) + KL(cm2||cm1)
    return np.sum(cm1 * np.log(cm1 / cm2)) + np.sum(cm2 * np.log(cm2 / cm1))
I've tried this on the AP dataset for different beta values (0.01, 0.05, 0.1 and 0.2) with alpha = 1/K for K topics, and unfortunately it doesn't give any informative curve.
Yes, you are right, svd(x) is invariant to the order of rows in x. I did it more to illustrate that sorting cm2 doesn't break the correspondence between the topic components.
I also can't reproduce their results, but the magnitude of the KL divergence is at least in the right ballpark. I will see if they reply; if not, I'll try to contact them again after a week.
Thank you again for replying; it is nice to know someone else has run into this problem. I will let you know if I get a reply at any point. I would really like to resolve this.
So I think I've found the hyperparameters: beta = 0.1 and alpha = 50/K, at least for the NIPS data. These are the hyperparameters used in the Griffiths2004 paper (one of the other metrics in this package). Did you try this combination? Below I have included my results (in blue) overlaid on the left-hand plot of figure 5 from the Arun 2010 paper.
That looks promising. Do you have a quick link to the NIPS data? I only happen to have the AP dataset.
I used the link in reference 15 of the paper http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
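In case it helps anyone else: as far as I can tell, the docword.*.txt files behind that link share a simple triplet format – three header lines (number of documents, vocabulary size, number of non-zero entries) followed by `docID wordID count` lines with 1-based IDs. A minimal loader sketch in Python (dense for simplicity; a sparse matrix would be preferable for the full NIPS corpus):

```python
import numpy as np

def load_uci_bow(f):
    """Read a UCI Bag of Words file (e.g. docword.nips.txt) from a
    file-like object into a dense document-term count matrix."""
    n_docs = int(f.readline())
    n_words = int(f.readline())
    n_nnz = int(f.readline())
    dtm = np.zeros((n_docs, n_words), dtype=np.int64)
    for _ in range(n_nnz):
        doc_id, word_id, count = (int(tok) for tok in f.readline().split())
        dtm[doc_id - 1, word_id - 1] = count  # the file uses 1-based IDs
    return dtm
```

The document lengths needed for cm2 are then simply `dtm.sum(axis=1)`.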
@nikita-moor it would be great to have your input. Your implementation definitely has some merit, i.e. it generally agrees with the other metrics, but I can't work out the intuition for how it differs from the paper's description. The issue I'm having now is that my implementation doesn't tend to agree with the other metrics. If you have any insight here, it would be greatly appreciated.
Thank you for taking the time to check against your Python code. It's good to know you get the same results. Yes, I too think the implementation should follow the paper. What do you think of the discussion, @nikita-moor?
Regarding the comparison to the CaoJuan metric: I wouldn't expect it to change to an upward trend as the Arun metric does, but I do see a change in gradient between 75-100 (maybe I am seeing what I want to see, though), which would indicate to me that this is a good range.
If the authors ever reply I will update you, but at least from my perspective I'm happy we have solved this mystery.
Hello @internaut and @hkimber. Unfortunately, I haven't looked into this topic for several years and my current research interests are far from LDA, so I will rely on your investigation.
My only wish is to fix Deveaud2014. It worked on my data before I changed something in an attempt to reproduce the original paper, and after that I couldn't get it working again. I contacted the authors, but surprisingly they were not interested in reproducing their own work or providing the code. So, new enthusiasts are welcome.
Thanks, Nikita. So would you merge a patch if @hkimber provided it in a PR?
Regarding Deveaud2014: that is another matter. I only had a quick look, but comparing the paper and your implementation, I see at least two possible problems, if I interpret the paper correctly. If you like, I'll open another issue for that, since it isn't directly related to this one.