Comments (19)
Eight days ago I ran my Python implementation (shown in the post) on the NIPS dataset with alpha = 50/K, beta = 0.1 and 1500 iterations of a Gibbs sampling implementation (https://github.com/lda-project/lda), and it seems to agree with your figure above.
Yes, the Cao 2009 metric doesn't agree with that (it looks like it points to a larger number of topics or higher beta values), but IMO the results strongly indicate that this implementation follows the one of Arun et al. From a user's perspective on the ldatuning package, I'd expect the implementation to follow the original paper, even if that means it doesn't always agree with other metrics.
from ldatuning.
Thanks, I'd also like to know Nikita's opinion. I'm reopening this issue since it looks like we have a solution, even with working R code, and this way people can see that the Arun metric in this package is under discussion. If Nikita gives the green light, you could even provide a patch in a PR.
In the coming days, I will update my own (Python) package tmtoolkit to implement your "interpretation" of the paper, since I find it much more convincing than the current ldatuning implementation (which I had translated to Python in my package). Unfortunately, I doubt that the paper's authors will ever reply :(
Thank you, very good commentary. I am busy next month but will try to inspect the Arun2010 implementation. Please feel free to provide a patch.
I'm also busy this month, but in November I think I can have a look at it and provide a patch.
I had a look at it and ran some experiments. I think the current implementation is correct at least in terms of giving useful results (when compared to the other metrics), whereas when properly normalizing the distributions, you get no informative results that point to a range of optimal numbers of topics.
So I still find it strange how the KL divergence is used here with unnormalized distributions, but it "seems to work". Maybe that's what the authors mean when they write that they found "empirically" (p. 396) that this divergence is useful for determining a good number of topics.
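For reference, the symmetric KL divergence in question is straightforward to compute on proper distributions; here is a minimal numpy sketch (toy vectors for illustration only):

```python
import numpy as np

def sym_kl(p, q):
    """Symmetric Kullback-Leibler divergence: KL(p||q) + KL(q||p)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# two toy probability distributions (each sums to 1)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(sym_kl(p, p))  # 0.0 for identical distributions
print(sym_kl(p, q))  # a small positive value

# Note: if p and q are *not* normalized to sum to 1, this quantity is no
# longer a proper KL divergence (it can even turn negative), which is why
# applying it to unnormalized singular values is surprising.
```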
I am also confused about this implementation. @internaut, why did you decide in the end that the implementation was correct? I agree with you that the implemented metric gives similar "right topic" values to the other metrics, so from that perspective it gives nice results. However, I am struggling to reproduce the plots from the paper with this implementation. Did you also try this?
Well, I said "correct at least in terms of giving useful results". I'm not sure it is correct in the sense of what was proposed by Arun et al. Unfortunately, they don't provide code with their paper. I tried different implementations but couldn't reproduce their results on the AP dataset. They also fail to report the alpha and beta priors they used for their topic models, which makes it even more difficult to replicate the results.
I'd suggest at least displaying a warning about these issues in the documentation.
Thank you for getting back to me, particularly since this issue was opened and closed many years ago. Yes, I did see that in your original comment; sorry if I wasn't clear.
It seems we have gone down similar routes. I have also tried to reproduce their paper's plots and found the missing alpha and beta a problem. My current function accounts for the two issues you raise in your original comment, but I also took from the paper that cm2 should be sorted in decreasing order. My function looks like the below.
Arun2010 <- function(models, dtm) {
  # length of documents (count of words)
  len <- slam::row_sums(dtm)
  # evaluate metrics
  metrics <- sapply(models, FUN = function(model) {
    # matrix M1: topic-word distribution
    m1 <- exp(model@beta)  # rowSums(m1) == 1
    m1.svd <- svd(m1)
    cm1 <- as.matrix(m1.svd$d)
    cm1 <- cm1 / sum(cm1)
    # matrix M2: document-topic distribution
    m2 <- model@gamma  # rowSums(m2) == 1
    cm2 <- as.vector(len %*% m2)  # crossprod(len, m2)
    cm2 <- cm2 / sum(cm2)
    cm2 <- sort(cm2, decreasing = TRUE)
    # symmetric Kullback-Leibler divergence
    divergence <- sum(cm1 * log(cm1 / cm2)) + sum(cm2 * log(cm2 / cm1))
    return(divergence)
  })
  return(metrics)
}
Figure 1: the Arun metric as described in the paper (function above) applied to the NIPS dataset (ref. 15 in the paper).
Figure 2: the left-hand plot of figure 5 from the paper, i.e. their implementation applied to the same NIPS data.
I'm at least in the right ballpark, but I can't reproduce the dip at 100-120 topics. Is this where you got to? I think the idea in the paper is really neat and would love to have a working implementation.
You write "Their implementation applied to the same NIPS data" – does that mean you have their code? I'd be interested in that.
I don't read in the paper that CM2 should be sorted. I guess you mean the following passage:
"A point to be mentioned here is that both the distributions CM1 and CM2 are in sorted order so that the corresponding topic components are expected to match."
As far as I understand, it should only be ensured that each component in CM1 matches the corresponding one in CM2, i.e. CM1[1] matches CM2[1], which should be the case as they both provide information on topic 1. IMO sorting either of them destroys this connection.
Normalizing CM1 and CM2 in the same way as you did, I also get a similar result: basically a monotonically increasing curve over K (noise aside). When not normalizing CM1, I usually do get a curve with a dip. The location of the dip usually depends on how I set alpha and beta, which conceptually makes sense because both priors encode my beliefs about how "fine-grained" the topics are.
Unfortunately, I don't mean I've seen their code. I wish... It is the plot from their paper. I have emailed them, though; hopefully I get a response. I will let you know if I do.
The reason I think cm2 needs sorting is that svd returns the semi-axes of the hyper-ellipsoid (the diagonal of Sigma, or svd$d in R) in order of their importance, i.e. sigma[1] will not necessarily correspond to topic 1 but instead to the topic with the largest variance (when the topics are well separated, of course). The more complete way to do this is to:
- compute cm2 and find the order of topic importance
- order m1 according to the topic importance found from cm2
- perform svd on the reordered m1
- Finally sort cm2.
This way the topic components in cm1 and cm2 match up. See function below for implementation.
Arun2010_paper <- function(models, dtm) {
  # length of documents (count of words)
  len <- slam::row_sums(dtm)
  # evaluate metrics
  metrics <- sapply(models, FUN = function(model) {
    # matrix M2: document-topic distribution
    m2 <- model@gamma  # rowSums(m2) == 1
    cm2 <- as.vector(len %*% m2)  # crossprod(len, m2)
    cm2 <- cm2 / sum(cm2)
    topicOrder <- order(cm2, decreasing = TRUE)  # find importance of topics
    cm2 <- sort(cm2, decreasing = TRUE)
    # matrix M1: topic-word distribution
    m1 <- exp(model@beta)  # rowSums(m1) == 1
    m1 <- m1[topicOrder, ]  # order topics according to their importance
    m1.svd <- svd(m1)
    cm1 <- as.matrix(m1.svd$d)
    cm1 <- cm1 / sum(cm1)
    # symmetric Kullback-Leibler divergence
    divergence <- sum(cm1 * log(cm1 / cm2)) + sum(cm2 * log(cm2 / cm1))
    return(divergence)
  })
  return(metrics)
}
What do you think? Or have I misinterpreted the effect of the svd on the topic order in sigma?
Isn't "d" from svd(X) invariant to the order of rows in X? So I think m1 <- m1[topicOrder,] has no effect.
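This permutation invariance is easy to check numerically, e.g. in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
m1 = rng.random((5, 20))   # toy "topic-word" matrix: 5 topics, 20 words
perm = rng.permutation(5)  # an arbitrary reordering of the rows (topics)

d_original = np.linalg.svd(m1, compute_uv=False)
d_permuted = np.linalg.svd(m1[perm, :], compute_uv=False)

# The singular values come back in descending order and are identical
# regardless of row order, so reordering m1 before the SVD cannot change cm1.
print(np.allclose(d_original, d_permuted))  # True
```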
However, generally I think the idea might be what is conveyed in the paper – although I'm still unsure, and it would be great to have their code. I've experimented with the following implementation, which is however in Python:
import numpy as np

def metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths):
    # CM1 – singular values of the topic-word distribution
    cm1 = np.linalg.svd(topic_word_distrib, compute_uv=False)
    cm1 /= np.sum(cm1)  # normalize
    # CM2 – topics scaled by document lengths
    doc_lengths = np.asarray(doc_lengths).flatten()
    cm2 = doc_lengths @ doc_topic_distrib  # @ is matrix multiplication
    cm2 = -np.sort(-cm2)  # sort in descending order (cm1 is already sorted that way)
    cm2 /= np.sum(cm2)  # normalize
    # symmetric Kullback-Leibler divergence KL(cm1||cm2) + KL(cm2||cm1)
    return np.sum(cm1 * np.log(cm1 / cm2)) + np.sum(cm2 * np.log(cm2 / cm1))
I've tried this on the AP dataset for different beta values (0.01, 0.05, 0.1 and 0.2) with alpha = 1/K for K topics, and unfortunately it doesn't give any informative curve.
Yes, you are right, svd(x) is invariant to the order of rows in x. I did it more to illustrate that sorting cm2 doesn't break the correspondence between the topic components.
I also can't reproduce their results, but the magnitude of the KL divergence is at least in the right ballpark. I will see if they reply; if not, I'll try to contact them again after a week.
Thank you again for replying; it is nice to know someone else has run into this problem. I will let you know if I get a reply at any point. I would really like to resolve this.
So I think I've found the hyperparameters: beta = 0.1 and alpha = 50/K, at least for the NIPS data. These are the hyperparameters used in the Griffiths2004 paper (one of the other metrics in this package). Did you try this combination? Below I have included my results (in blue) overlaid on the left-hand plot of figure 5 from the Arun 2010 paper.
That looks promising. Do you have a quick link to the NIPS data? I only happen to have the AP dataset.
I used the link in reference 15 of the paper http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
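In case it helps anyone else: as far as I can tell, the docword.*.txt files behind that link share a simple triplet format – three header lines (number of documents, vocabulary size, number of non-zero entries) followed by `docID wordID count` lines with 1-based IDs. A minimal loader sketch in Python (dense for simplicity; a sparse matrix would be preferable for the full NIPS corpus):

```python
import numpy as np

def load_uci_bow(f):
    """Read a UCI Bag of Words file (e.g. docword.nips.txt) from a
    file-like object into a dense document-term count matrix."""
    n_docs = int(f.readline())
    n_words = int(f.readline())
    n_nnz = int(f.readline())
    dtm = np.zeros((n_docs, n_words), dtype=np.int64)
    for _ in range(n_nnz):
        doc_id, word_id, count = (int(tok) for tok in f.readline().split())
        dtm[doc_id - 1, word_id - 1] = count  # the file uses 1-based IDs
    return dtm
```

The document lengths needed for cm2 are then simply `dtm.sum(axis=1)`.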
@nikita-moor it would be great to have your input. Your implementation definitely has some merit, i.e. it generally agrees with the other metrics, but I can't work out the intuition for how it differs from the paper's description. The issue I'm having now is that my implementation doesn't tend to agree with the other metrics. If you have any insight here, it would be greatly appreciated.
Thank you for taking the time to check against your Python code. It's good to know you get the same results. Yes, I too think the implementation should follow the paper. What do you think of the discussion, @nikita-moor?
Regarding the comparison to the CaoJuan metric: I wouldn't expect it to change to an upward trend as the Arun metric does, but I do see a change in gradient between 75-100 (maybe I am seeing what I want to see, though), which would indicate to me that this is a good range.
If the authors ever reply I will update you, but at least from my perspective I'm happy we have solved this mystery.
Hello @internaut and @hkimber. Unfortunately, I haven't looked into this topic for several years and my current research interests are far from LDA, so I will rely on your investigation.
My only wish is to fix Deveaud2014. It worked on my data before I changed something in an attempt to reproduce the original paper, and after that I couldn't get it working again. I contacted the authors, but surprisingly they were not interested in reproducing their own work or providing the code. So, new enthusiasts are welcome.
Thanks, Nikita. So would you merge a patch if @hkimber provided it in a PR?
Regarding Deveaud2014: that is another matter. I only had a quick look, but comparing the paper and your implementation, I see at least two possible problems, if I interpret the paper correctly. If you like, I'll open another issue for that, since it isn't directly related to this one.