iDocNADEe

AAAI 2019 paper: "Document Informed Neural Autoregressive Topic Models with Distributional Prior" (a Contextualized Neural Topic Model with Word Embeddings)

About

This repository contains implementations of the models from the AAAI-19 paper: DocNADE, iDocNADE, DocNADEe and iDocNADEe.

Requirements

Requires Python 3 (tested with 3.6.1). The remaining dependencies can then be installed via:

    $ pip install -r requirements.txt
    $ python -c "import nltk; nltk.download('all')"

Data format

Datasets: A directory containing CSV files. One CSV file is expected per split, with separate files for training, validation and test. The CSV files in the directory must be named accordingly: training_docnade.csv, validation_docnade.csv, test_docnade.csv. For this task, each CSV file (prior to preprocessing) consists of 2 string fields with a comma delimiter: the first is the label and the second is the document body.

Vocabulary files: A plain text file with 1 vocabulary token per line. Note that this file must be created in advance; we do not provide a script for creating vocabularies.
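
Since no vocabulary script is provided, the following is a minimal sketch of how the two input formats described above could be produced and read back. It is not part of the repository: the function names, the lowercased whitespace tokenization and the `min_count` cutoff are all assumptions made for illustration, and the preprocessing used in the paper may differ.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def read_docnade_csv(path):
    """Yield (label, document) pairs from a two-field, comma-delimited CSV."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                raise ValueError(f"expected 2 fields, got {len(row)}: {row!r}")
            yield row[0], row[1]


def build_vocabulary(documents, min_count=1):
    """Return tokens seen at least min_count times, most frequent first.

    Lowercased whitespace tokenization is an assumption of this sketch.
    """
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tok for tok, count in ranked if count >= min_count]


def write_vocabulary(tokens, path):
    """Write one token per line, as the vocabulary file format requires."""
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")


# Small demonstration in a temporary directory (toy data, not a real dataset).
workdir = Path(tempfile.mkdtemp())
train_csv = workdir / "training_docnade.csv"
with open(train_csv, "w", newline="", encoding="utf-8") as handle:
    # csv.writer quotes the document body when it contains a comma.
    csv.writer(handle).writerows(
        [["sports", "the game, the team"], ["tech", "the chip"]]
    )

rows = list(read_docnade_csv(train_csv))
vocab = build_vocabulary(doc for _, doc in rows)
vocab_file = workdir / "vocab_docnade.vocab"  # hypothetical file name
write_vocabulary(vocab, vocab_file)
```

Writing the CSV through `csv.writer` rather than joining strings by hand keeps documents that contain commas in a single field, which matters because the body is free text.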

How to use: Training of DocNADE, DocNADEe, iDocNADE and iDocNADEe

The script train_TMN_docnade_TASKTYPE.sh invokes train_model.py to train any of the four models, compute PPL as well as IR, and save the trained model in the model output directory. It also logs all information, including the PPL and IR scores, in the same model folder. Here's how to use the script:

    $ ./train_TMN_docnade_PPL.sh

    $ ./train_TMN_docnade_IR.sh

    - ``dataset`` 				is the path to the input dataset.
    - ``docnadeVocab`` 			is the path to vocabulary file used by DocNADE.
    - ``model`` 				is the path to model output directory.
    - ``learning-rate`` 		is learning rate.
    - ``batch-size`` 			is batch size for training data.
    - ``num-steps`` 			is the number of steps to train for.
    - ``log-every``             is to print training loss after this many steps.
    - ``validation-bs`` 		is the batch size for validation evaluation.
    - ``test-bs`` 				is the batch size for test evaluation.
    - ``validation-ppl-freq`` 	is to evaluate validation PPL and NLL after this many steps.
    - ``validation-ir-freq`` 	is to evaluate validation IR after this many steps.
    - ``test-ir-freq`` 			is to evaluate test IR after this many steps.
    - ``test-ppl-freq`` 		is to evaluate test PPL and NLL after this many steps.
    - ``num-classes`` 			is number of classes.
    - ``patience`` 				is patience for early stopping criterion.
    - ``hidden-size`` 			is size of the hidden layer.
    - ``activation``            is which activation to use: sigmoid|tanh. Note: use 'sigmoid' for **PPL** and 'tanh' for **IR** computations.
    - ``bidirectional``         is whether to use the iDocNADE model or not, i.e. True or False. If True, then model --> ``iDocNADE``.
    - ``initialize-docnade``    is whether to include the GloVe embedding prior or not, i.e. True or False. If True, then model --> ``DocNADEe``.
                                If ``bidirectional`` = True and ``initialize-docnade`` = True, then model --> ``iDocNADEe``.
    - ``combination-type`` 		is combination type for iDocNADE forward and backward hidden document representation, for instance 'sum'
    - ``vocab-size`` 			is the vocabulary size.
    - ``lambda-embeddings``     is the mixture weight for the word embeddings, in the range [0.0-1.0].
    - ``projection``            is whether to project prior embeddings or not, i.e. True or False. Set to False.
    - ``deep``                  is whether to make the model deep (deepDocNADE) or not (DocNADE/iDocNADE), i.e. True or False.
    - ``deep-hidden-sizes`` 	is sizes of the deep hidden layers for deepDocNADE, for instance, 200, 200.
    - ``reload``                is whether to reload a model or not, i.e. True or False.
    - ``reload-model-dir``      is the path to the directory of the model to be reloaded.
    - ``trainfile``             is the path to the training text file (required in case of topic coherence), for instance, ./datasets/20NSshort/training.txt
    - ``valfile``               is the path to the validation text file (required in case of topic coherence), for instance, ./datasets/20NSshort/validation.txt
    - ``testfile``              is the path to the test text file (required in case of topic coherence), for instance, ./datasets/20NSshort/test.txt
    - ``pretrained-embeddings-path`` is the path to a pre-trained DocNADE model used to initialize the iDocNADE/DocNADEe/iDocNADEe model for **PPL** computation, for instance, './docnade_embeddings_ppl_reduced_vocab/TMNtitle'
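
The way the ``bidirectional`` and ``initialize-docnade`` switches select one of the four models, and the activation recommended per task, can be summarized in a few lines. This is only an illustrative sketch: `model_variant`, `activation_for` and `to_cli_args` are hypothetical helpers that do not exist in the repository, and how train_model.py actually parses its options (boolean values in particular) is an assumption here.

```python
def model_variant(bidirectional, initialize_docnade):
    """Map the two switches to the model name given in the option list."""
    name = "iDocNADE" if bidirectional else "DocNADE"
    # The trailing "e" marks the variants that use the embedding prior.
    return name + "e" if initialize_docnade else name


def activation_for(task):
    """Return the activation recommended above for a task ('PPL' or 'IR')."""
    recommended = {"PPL": "sigmoid", "IR": "tanh"}
    return recommended[task.upper()]


def to_cli_args(options):
    """Flatten an options dict into ['--name', 'value', ...] strings.

    Passing booleans as the strings 'True'/'False' is an assumption.
    """
    args = []
    for name, value in options.items():
        args.extend([f"--{name}", str(value)])
    return args


options = {
    "dataset": "./datasets/20NSshort",  # example path, adjust to your data
    "activation": activation_for("IR"),
    "bidirectional": True,
    "initialize-docnade": True,
}
variant = model_variant(options["bidirectional"], options["initialize-docnade"])
cli_args = to_cli_args(options)
```

Keeping the options in one dict makes it easy to check, before a long training run, that the activation matches the task being evaluated.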

Directory structure for results and datasets

Contains dataset folders

Datasets directory: ./datasets/

Contains GloVe pretrained embeddings

Pre-trained embeddings dir: /home/usr/resources/pretrained_embeddings/

Contains results of training

Results directory: ./model/MODELNAME/

Saved PPL model dir: ./model/MODELNAME/model_ppl/

Saved IR model dir: ./model/MODELNAME/model_ir/

Saved logs model dir: ./model/MODELNAME/logs/

Reload IR results: ./model/MODELNAME/logs/reload_info_ir.txt

Reload PPL results: ./model/MODELNAME/logs/reload_info_ppl.txt
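
To locate these outputs from other scripts, the layout listed above can be expressed as a small path helper. This is a sketch under the assumption that the directory names are exactly as listed; `result_paths` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def result_paths(model_name, root="./model"):
    """Return the result locations listed above for one model directory."""
    base = Path(root) / model_name
    logs = base / "logs"
    return {
        "ppl_model": base / "model_ppl",
        "ir_model": base / "model_ir",
        "logs": logs,
        "reload_ir": logs / "reload_info_ir.txt",
        "reload_ppl": logs / "reload_info_ppl.txt",
    }


paths = result_paths("MODELNAME")
```

For example, `paths["reload_ppl"]` points at the file where reloaded PPL results are written for the model named MODELNAME.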

Reload Functionality: In case of reload, set the following:

--reload-model-dir
--trainfile
--valfile
--testfile

NOTE: When computing PPL or IR for large text datasets, it is recommended to use the reload functionality to generate the scores (PPL/IR) for the test set, and to set test-ppl-freq or test-ir-freq to a very large number so as to avoid overhead during training.
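
When comparing PPL scores across runs, it can help to recompute them from per-document log-likelihoods. The sketch below assumes the definition commonly used for DocNADE-style models, in which each document's log-likelihood is first divided by its length and the result is averaged over documents before exponentiating. Whether train_model.py computes its reported PPL in exactly this way is not confirmed here, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Per-document-averaged perplexity.

    PPL = exp(-(1/N) * sum_t (log p(v_t) / |v_t|)), where N is the number
    of documents and |v_t| is the number of words in document t.
    """
    if not doc_lengths or len(doc_log_likelihoods) != len(doc_lengths):
        raise ValueError("need one log-likelihood per non-empty document list")
    per_word = [ll / n for ll, n in zip(doc_log_likelihoods, doc_lengths)]
    return math.exp(-sum(per_word) / len(per_word))


# Sanity check: a model that is uniform over a 100-word vocabulary assigns
# log(1/100) to every word, so its perplexity should equal 100.
lengths = [5, 10]
log_likelihoods = [n * math.log(1 / 100) for n in lengths]
uniform_ppl = perplexity(log_likelihoods, lengths)
```

A uniform model over the vocabulary is a useful reference point: a reported PPL far below what the vocabulary size and data would suggest may indicate a difference in normalization or preprocessing rather than a better model.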

Citation

    @inproceedings{pankajgupta2018iDocNADEe,
      title={Document Informed Neural Autoregressive Topic Models with Distributional Prior},
      author={Gupta, Pankaj and Chaudhary, Yatin and Buettner, Florian and Sch{\"u}tze, Hinrich},
      booktitle={AAAI},
      year={2018}
    }

idocnadee's Issues

Low PPL on the 20NSshort dataset

Hello,

Thank you for sharing the source code. I tried to reproduce the iDocNADEe results on the 20NSshort dataset by running the train_20NSshort_docnade.sh file. However, I get a much lower PPL (18.21) than the number (633) presented in the paper. Am I missing some important configuration?

Thanks,

Jack

How to infer the subject

Sorry, I just read your paper, but I have many questions. For example, after Eq. 2 and 3 we can get the log-likelihood of a document. However, how do we get the topic-word distribution and the topic-document distribution?
