The structured-topics from smndf

Implementation of the initial prototype

Motivation

We need to develop a prototype of a system that builds a structured topic model and can label new texts according to these topics. This vertical prototype should have the minimal functionality of the full system (input/output) and be implemented with the most straightforward set of algorithms. The goal is to make an initial validation of the idea and then improve the prototype gradually. No evaluation is done in this step; it will be done later (during the "official" 6-month period reserved for writing the thesis).

It is also important to make a preliminary evaluation of the prototype, to show that its quality can be measured.

Implementation

The prototype will build structured topics out of sense similarity graphs. These graphs were built automatically using distributional semantics methods (http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/distributional-semantics/).

The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):

Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText and AdaGram models:
Frequency dictionary: http://panchenko.me/data/joint/word-freq-news.gz -- to filter the graph.

The data have the format: word cid prob cluster isas
Cluster the graphs of sense similarities using Chinese Whispers (CW), Markov Clustering (MCL) and the Louvain Method (LM).

For the first two algorithms use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively, you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/. Use:
CW: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/de/tudarmstadt/lt/cw/CW.java

MCL: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/net/sf/javaml/clustering/mcl/MarkovClustering.java

For the LM, use any available implementation, e.g. https://perso.uclouvain.be/vincent.blondel/research/louvain.html.

A description of CW is available here: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
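To make the clustering step more concrete, here is a minimal sketch of the Chinese Whispers idea on a small weighted graph. It is not the code from the repository linked above: the class name `ChineseWhispersSketch`, the `addEdge` helper, the fixed iteration count and the node names used in the usage note are assumptions made for illustration only.

```java
import java.util.*;

/** Illustrative Chinese Whispers sketch on a weighted undirected graph (hypothetical class). */
class ChineseWhispersSketch {

    /** graph: node -> (neighbour -> edge weight). Returns node -> cluster label. */
    static Map<String, Integer> cluster(Map<String, Map<String, Double>> graph,
                                        int iterations, long seed) {
        List<String> nodes = new ArrayList<>(graph.keySet());
        Collections.sort(nodes);
        Map<String, Integer> label = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            label.put(nodes.get(i), i);              // every node starts in its own class
        }
        Random random = new Random(seed);
        for (int it = 0; it < iterations; it++) {
            Collections.shuffle(nodes, random);      // visit nodes in random order
            for (String node : nodes) {
                // Sum the edge weights per class over the node's neighbourhood.
                Map<Integer, Double> score = new HashMap<>();
                for (Map.Entry<String, Double> e : graph.get(node).entrySet()) {
                    Integer neighbourLabel = label.get(e.getKey());
                    if (neighbourLabel != null) {
                        score.merge(neighbourLabel, e.getValue(), Double::sum);
                    }
                }
                // Adopt the strongest class; keep the current label if there are no neighbours.
                int best = label.get(node);
                double bestScore = Double.NEGATIVE_INFINITY;
                for (Map.Entry<Integer, Double> e : score.entrySet()) {
                    if (e.getValue() > bestScore) {
                        best = e.getKey();
                        bestScore = e.getValue();
                    }
                }
                label.put(node, best);
            }
        }
        return label;
    }

    /** Adds an undirected weighted edge between a and b. */
    static void addEdge(Map<String, Map<String, Double>> g, String a, String b, double w) {
        g.computeIfAbsent(a, k -> new HashMap<>()).put(b, w);
        g.computeIfAbsent(b, k -> new HashMap<>()).put(a, w);
    }
}
```

On a graph with two disconnected triangles, for example, the nodes of each triangle end up sharing one label while the two triangles keep different labels. Nodes that share a label form one structured topic.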

The output of clustering shall look like this:
```
structured-topic-id<TAB>sense-id-1,sense-id-2,sense-id-3,...
```
To make each topic more readable, assign to it the 3 most frequent hypernyms of its senses (topic-labels). The set of hypernyms will be provided.
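One possible reading of "frequent hypernyms" is to count, for each hypernym, how many senses of the topic it is attached to and keep the top 3. The sketch below assumes exactly that; the class name `TopicLabelSketch`, the input shape (one hypernym list per sense) and the alphabetical tie-break are illustrative assumptions, not part of the task description.

```java
import java.util.*;
import java.util.stream.*;

/** Illustrative selection of topic labels from the hypernyms of a topic's senses. */
class TopicLabelSketch {

    /** Returns up to n hypernyms that are attached to the largest number of senses. */
    static List<String> topHypernyms(List<List<String>> hypernymsPerSense, int n) {
        // TreeMap keeps keys sorted, so the stable sort below breaks ties alphabetically.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> hypernyms : hypernymsPerSense) {
            for (String h : new HashSet<>(hypernyms)) {   // count each hypernym once per sense
                counts.merge(h, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For a topic whose three senses carry the hypernyms (city, capital), (city, place) and (city, capital, location), this returns city, capital, location.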

In addition, for each topic label, find the URL of an image that depicts it on DBpedia (for instance http://dbpedia.org/page/Berlin). The image URLs are stored in the field topic-label-image-urls. Ambiguous words (e.g. http://dbpedia.org/page/Python) should be disambiguated. The output shall look like this:
```
structured-topic-id<TAB>topic-labels<TAB>topic-label-image-urls<TAB>sense-id-1,sense-id-2,sense-id-3,...
```
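A small sketch of how one such output line could be written and read back is given here. The four tab-separated fields follow the format above; joining the labels and image URLs with commas, the class name `TopicLineSketch` and the example sense ids and URL in the usage note are assumptions for illustration.

```java
import java.util.*;

/** Illustrative writer/reader for one tab-separated structured-topic line. */
class TopicLineSketch {

    /** Builds: id TAB labels TAB image-urls TAB sense-ids (lists are comma-joined, an assumption). */
    static String format(String topicId, List<String> labels,
                         List<String> imageUrls, List<String> senseIds) {
        return String.join("\t",
                topicId,
                String.join(",", labels),
                String.join(",", imageUrls),
                String.join(",", senseIds));
    }

    /** Splits a line back into its four fields; the limit -1 keeps empty trailing fields. */
    static String[] parse(String line) {
        return line.split("\t", -1);
    }
}
```

For example, `TopicLineSketch.format("42", Arrays.asList("city", "capital"), Arrays.asList("http://example.org/berlin.jpg"), Arrays.asList("berlin#0", "paris#1"))` yields a single line with four tab-separated fields. Image URLs that themselves contain commas would need a different list separator.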
Evaluate the interpretability of the topics by taking 100 topics at random and annotating each of them as "interpretable", "not interpretable" or "mixed".

Make a basic classification module that uses the structured topics (clusters of senses) to annotate text documents. The module should:

- load the structured topics in the format structured-topic-id<TAB>cluster-word-1,cluster-word-2,cluster-word-3,...
- for an input document, output a set of the most relevant topics
- attach a classification confidence to each output topic

To implement this module, use an ElasticSearch index. Each topic is indexed as one document, and an input document is used as the search query. The retrieval system then returns a list of documents (topics) ranked by their TF-IDF score.

How ElasticSearch scoring works:
- https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
- https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
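To illustrate the "topic as document, input text as query" idea without a running ElasticSearch instance, here is a small in-memory sketch. It does not call the ElasticSearch API. The scoring is only loosely modelled on Lucene's classic TF-IDF (square root of term frequency times squared inverse document frequency) and omits length norms, coordination and query normalisation, so its scores will differ from what ElasticSearch returns. The class `TopicTfIdfSketch` and its method names are hypothetical.

```java
import java.util.*;

/** Illustrative in-memory stand-in for TF-IDF retrieval over topics (not the ElasticSearch API). */
class TopicTfIdfSketch {
    private final Map<String, Map<String, Integer>> termCounts = new LinkedHashMap<>(); // topic -> term -> tf
    private final Map<String, Integer> docFreq = new HashMap<>();                      // term -> #topics

    /** Indexes one topic as a "document" made of its cluster words. */
    void addTopic(String topicId, List<String> words) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : words) {
            tf.merge(w.toLowerCase(), 1, Integer::sum);
        }
        termCounts.put(topicId, tf);
        for (String w : tf.keySet()) {
            docFreq.merge(w, 1, Integer::sum);
        }
    }

    /** Scores topics against a whitespace-tokenised document; best match first. */
    List<Map.Entry<String, Double>> classify(String document) {
        List<Map.Entry<String, Double>> result = new ArrayList<>();
        String[] tokens = document.toLowerCase().split("\\s+");
        for (Map.Entry<String, Map<String, Integer>> topic : termCounts.entrySet()) {
            double score = 0.0;
            for (String token : tokens) {
                Integer tf = topic.getValue().get(token);
                if (tf == null) {
                    continue;                       // token does not occur in this topic
                }
                double idf = 1.0 + Math.log((double) termCounts.size() / (docFreq.get(token) + 1.0));
                score += Math.sqrt(tf) * idf * idf;
            }
            if (score > 0.0) {
                result.add(new AbstractMap.SimpleEntry<>(topic.getKey(), score));
            }
        }
        result.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

The raw score can serve as a first, uncalibrated confidence value for each returned topic; normalising it (for example by the top score) is one option if a value in [0, 1] is needed.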

Evaluate the quality of the topic categorization by comparing it to a set of Wikipedia categories. In particular, use measures that quantify the quality of a clustering (purity, inverse purity and so on). See http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html. The gold-standard clustering will be the set of categories of the Wikipedia articles. The predicted clustering will be the set of structured topics assigned to the articles.
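Purity and inverse purity can be sketched as follows. The sketch assumes a simplification that the task description does not state: each article has exactly one predicted topic and exactly one gold category, given as two parallel lists. The class name `PuritySketch` is hypothetical.

```java
import java.util.*;

/** Illustrative purity / inverse purity for two parallel label lists (one entry per article). */
class PuritySketch {

    /** purity = (1/N) * sum over clusters of the size of the largest class overlap. */
    static double purity(List<String> clusterOf, List<String> classOf) {
        if (clusterOf.size() != classOf.size() || clusterOf.isEmpty()) {
            throw new IllegalArgumentException("label lists must be non-empty and of equal length");
        }
        // cluster -> (class -> number of shared articles)
        Map<String, Map<String, Integer>> overlap = new HashMap<>();
        for (int i = 0; i < clusterOf.size(); i++) {
            overlap.computeIfAbsent(clusterOf.get(i), k -> new HashMap<>())
                   .merge(classOf.get(i), 1, Integer::sum);
        }
        int matched = 0;
        for (Map<String, Integer> counts : overlap.values()) {
            matched += Collections.max(counts.values());
        }
        return (double) matched / clusterOf.size();
    }

    /** Inverse purity swaps the roles of predicted clusters and gold classes. */
    static double inversePurity(List<String> clusterOf, List<String> classOf) {
        return purity(classOf, clusterOf);
    }
}
```

With predicted topics t1, t1, t2, t2, t3, t3 and gold categories A, A, A, A, B, B, purity is 1.0 (every topic is pure) while inverse purity is 4/6 (category A is split over two topics). This shows why both measures should be reported together.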

Evaluation of the initial results

Motivation

Evaluate the first results so that you are able to write the first report.

Implementation

Select the 6 best configurations from the table:
- LM + adagram
- LM + ddt-wiki
- LM + ddt-news
- CW + adagram
- CW + ddt-wiki
- CW + ddt-news
For each of these 6 clusterings, add a column that ranks the clusters according to their quality. Introduce an ad-hoc ranking score, e.g. average-depth-of-hypernyms * average-similarity-of-hypernyms. Add this as an Excel formula and rank the clusters by it.
Add a column "Interpretable" to each of these 6 tables.
Fill the column with 1 for each row whose cluster is "interpretable", e.g. a list of cities, a list of drugs or a list of dinosaurs. Write 0 for uninterpretable clusters. All rows of all 6 sheets shall be annotated.
Draw the plot of Precision@k: x is k, i.e. the number of top-ranked clusters considered; y is Precision@k, i.e. the number of relevant (interpretable) clusters among the first k clusters divided by k. See http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
Post the plots here by 15 December 2015; the earlier the better.
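The Precision@k values for one ranked sheet can be computed as sketched below. The input is assumed to be the 0/1 "Interpretable" column, already sorted by the ranking score (best cluster first). The class name `PrecisionAtKSketch` is hypothetical.

```java
/** Illustrative Precision@k over a ranked list of 0/1 interpretability annotations. */
class PrecisionAtKSketch {

    /** interpretable[i] is the annotation of the cluster at rank i + 1. Returns P@1 .. P@n. */
    static double[] precisionAtK(int[] interpretable) {
        double[] precision = new double[interpretable.length];
        int relevant = 0;
        for (int k = 1; k <= interpretable.length; k++) {
            relevant += interpretable[k - 1];             // interpretable clusters among the first k
            precision[k - 1] = (double) relevant / k;
        }
        return precision;
    }
}
```

For the annotations 1, 1, 0, 1 this gives 1.0, 1.0, 2/3 and 0.75, which are the y values to plot against k = 1, 2, 3, 4.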
