Code Monkey home page Code Monkey logo

cluster-analysis's Introduction

Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! (2020; Code for paper)

The repo contains the code needed to reproduce the results in Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! by Sia, Dalmia, and Mieke (2020)

Sia, S., Dalmia, A., & Mielke, S. J. (2020). Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1728โ€“1736. https://doi.org/10.18653/v1/2020.emnlp-main.135

How to use the code

To cluster the word embeddings to discover the latent topics, run the code/score.py file. Here are the arguments that can be passed in:

Required:

--entities : The type of pre-trained word embedding you are clustering with
choices= word2vec, fasttext, glove, KG KG stands for your own set of embeddings

--entities_file: The file name contain the embeddings

--clustering_algo: The clustering algorithm to use
choices= KMeans, SPKMeans, GMM, KMedoids, Agglo, DBSCAN , Spectral, VMFM

--vocab: List of vocab files to use for tokenization

Not Required:

--dataset: Dataset to test clusters against against
default = 20NG choices= 20NG, reuters

--preprocess: Cuttoff threshold for words to keep in the vocab based on frequency

--use_dims: Dimensions to scale with PCA (much be less than orginal dims)

--num_topics: List of number of topics to try default: 20

--doc_info: How to add document information choices= DUP, WGT

--rerank: Value used for reranking the words in a cluster
choices=tf, tfidf, tfdf

Example call: python3 code/score.py --entities KG --entities_file {dest_to_entities_file} --clustering_algo GMM --dataset reuters --vocab {dest_to_vocab_file} --num_topics 20 50 --doc_info WGT--rerank tf

How to cite

@inproceedings{sia-etal-2020-tired,
    title = "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!",
    author = "Sia, Suzanna  and
      Dalmia, Ayush  and
      Mielke, Sabrina J.",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.135",
    doi = "10.18653/v1/2020.emnlp-main.135",
    pages = "1728--1736",
}

cluster-analysis's People

Contributors

adalmia96 avatar jonaschn avatar sjmielke avatar suzyahyah avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

cluster-analysis's Issues

Could you provide a sample embedding vocab files to run the socre.py?

Hi,

Thanks for your fascinating work. I am trying to reproduce your results for my own project, but I found it hard to run the socre.py code. In your example call,
python3 code/score.py --entities KG --entities_file {dest_to_entities_file} --clustering_algo GMM --dataset reuters --vocab {dest_to_vocab_file} --num_topics 20 50 --doc_info WGT--rerank tf
You did not specify the type of embedding and vocab files, so I kept running into problem.
Following is the code from your preprocess.py.
def create_global_vocab(vocab_files): vocab_list = set(line.split()[0] for line in open(vocab_files[0])) for vocab in vocab_files: vocab_list = vocab_list & set(line.split()[0] for line in open(vocab)) return vocab_list
My question is what kind of vocab files(txt, pkl,csv etc.) can have open(vocab_files[0]) ?

It will be helpful you can provide a sample embedding and vocab file so we can run the demo.sh code in the bin.

Thanks in advance for your help.

Vocabs ???

hello, and thank you for this great work.

I'm trying to test the code, but i find a required argument that is :

"--vocab: List of vocab files to use for tokenization"

can you give us some examples or where we can get them ?

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.