structured-topics's Issues

Implementation of the initial prototype

Motivation

We need to develop a prototype of a system that builds a structured topic model and can label new texts according to these topics. This vertical prototype should provide the minimal end-to-end functionality of the system (input/output), implemented with the most straightforward set of algorithms. The goal is an initial validation of the idea, followed by gradual improvement of the prototype. No full evaluation is done at this step; it will be done later (during the "official" 6-month period reserved for writing the thesis).

It is still important to make a preliminary evaluation of the prototype (to show that its quality is measurable).

Implementation

The prototype will build structured topics out of sense similarity graphs. These graphs were built automatically using distributional semantics methods (http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/distributional-semantics/).

The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):

  1. Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText and AdaGram models:

    Frequency dictionary: http://panchenko.me/data/joint/word-freq-news.gz -- to filter the graph.

    The data have the format word cid prob cluster isas

  2. Cluster the graphs of sense similarities using Chinese Whispers (CW), Markov Clustering (MCL) and the Louvain Method (LM).

    For the first two algorithms use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively, you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/ . Use:
    CW: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/de/tudarmstadt/lt/cw/CW.java

    MCL: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/net/sf/javaml/clustering/mcl/MarkovClustering.java

    For the LM use any available implementation e.g. https://perso.uclouvain.be/vincent.blondel/research/louvain.html.

    Description of CW is available here: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf

    The output of clustering shall look like this:

    structured-topic-id<TAB>sense-id-1,sense-id-2,sense-id-3,...
    
  3. To make each topic more readable, assign 3 frequent hypernyms of its senses as labels (topic-labels). The set of hypernyms will be provided.

    In addition, for each topic label, find the URL of an image that depicts it on DBpedia (for instance http://dbpedia.org/page/Berlin). The image URLs go into the output field topic-label-image-urls. Ambiguous words (e.g. http://dbpedia.org/page/Python) should be disambiguated. The output shall look like this:

      structured-topic-id<TAB>topic-labels<TAB>topic-label-image-urls<TAB>sense-id-1,sense-id-2,sense-id-3,...
    
  4. Evaluate the interpretability of the topics by sampling 100 topics at random and annotating each as "interpretable", "not interpretable" or "mixed".
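To illustrate the clustering in step 2, here is a minimal Chinese Whispers sketch in Java. It is not the johannessimon/chinese-whispers implementation linked above. It assumes the sense graph is already loaded as a symmetric adjacency map with similarity weights. The class name ChineseWhispersSketch, both method names and the deterministic tie-breaking (the smallest label wins, whereas the original algorithm picks randomly among ties) are assumptions made for this example. The second method prints the clusters in the topic-id<TAB>sense-ids format of step 2.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative Chinese Whispers sketch; names and tie-breaking are assumptions. */
final class ChineseWhispersSketch {

    /**
     * graph: sense id -> (neighbour sense id -> similarity), assumed symmetric.
     * Returns a map from sense id to the label (topic id) it ends up with.
     */
    static Map<String, Integer> cluster(Map<String, Map<String, Double>> graph,
                                        int iterations, long seed) {
        List<String> nodes = new ArrayList<>(graph.keySet());
        Collections.sort(nodes); // fixed start order, so a seed gives a reproducible run
        Map<String, Integer> label = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            label.put(nodes.get(i), i); // every node starts in its own class
        }
        Random random = new Random(seed);
        for (int pass = 0; pass < iterations; pass++) {
            Collections.shuffle(nodes, random); // visit nodes in random order
            for (String node : nodes) {
                // Sum the edge weights towards each label seen among the neighbours.
                Map<Integer, Double> weightPerLabel = new HashMap<>();
                for (Map.Entry<String, Double> edge : graph.get(node).entrySet()) {
                    Integer neighbourLabel = label.get(edge.getKey());
                    if (neighbourLabel != null) {
                        weightPerLabel.merge(neighbourLabel, edge.getValue(), Double::sum);
                    }
                }
                // Adopt the strongest label; ties go to the smaller label (assumption).
                int best = label.get(node);
                double bestWeight = Double.NEGATIVE_INFINITY;
                for (Map.Entry<Integer, Double> candidate : weightPerLabel.entrySet()) {
                    double w = candidate.getValue();
                    int l = candidate.getKey();
                    if (w > bestWeight || (w == bestWeight && l < best)) {
                        best = l;
                        bestWeight = w;
                    }
                }
                label.put(node, best);
            }
        }
        return label;
    }

    /** Formats clusters as "topic-id<TAB>sense-id-1,sense-id-2,...", one line per topic. */
    static List<String> toTsvLines(Map<String, Integer> label) {
        Map<Integer, TreeSet<String>> topics = new TreeMap<>();
        for (Map.Entry<String, Integer> e : label.entrySet()) {
            topics.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, TreeSet<String>> t : topics.entrySet()) {
            lines.add(t.getKey() + "\t" + String.join(",", t.getValue()));
        }
        return lines;
    }
}
```

On a graph made of two disconnected triangles, each triangle ends up as one topic. The number of iterations is a free parameter here; a real run would rather stop once the labels no longer change.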

5. Make a basic classification module that uses the structured topics, i.e. clusters of senses, to annotate text documents. The module should

  1. Evaluate the quality of the topic categorization by comparing it to the set of Wikipedia categories. In particular, use measures that quantify the quality of a clustering (purity, inverse purity and so on). See http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html. The gold clustering is the set of categories of the Wikipedia articles. The predicted clustering is the set of structured topics assigned to the articles.
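The purity and inverse purity mentioned above can be sketched as follows. This is a simplified illustration under one strong assumption: every article has exactly one gold Wikipedia category and exactly one predicted structured topic. Real Wikipedia articles usually carry several categories, so an actual evaluation would need to decide how to handle that. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative purity / inverse purity; assumes one label per article on each side. */
final class PuritySketch {

    /**
     * predicted: article id -> predicted cluster, gold: article id -> gold class.
     * Each predicted cluster is credited with the size of its majority gold class;
     * the sum is divided by the number of articles that have both labels.
     */
    static double purity(Map<String, String> predicted, Map<String, String> gold) {
        // counts.get(cluster).get(goldClass) = number of articles shared by both
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        int n = 0;
        for (Map.Entry<String, String> e : predicted.entrySet()) {
            String goldClass = gold.get(e.getKey());
            if (goldClass == null) {
                continue; // article without a gold label is ignored
            }
            counts.computeIfAbsent(e.getValue(), k -> new HashMap<>())
                  .merge(goldClass, 1, Integer::sum);
            n++;
        }
        if (n == 0) {
            return 0.0;
        }
        int majoritySum = 0;
        for (Map<String, Integer> row : counts.values()) {
            majoritySum += Collections.max(row.values());
        }
        return (double) majoritySum / n;
    }

    /** Inverse purity is purity with the roles of gold and predicted swapped. */
    static double inversePurity(Map<String, String> predicted, Map<String, String> gold) {
        return purity(gold, predicted);
    }
}
```

For example, with six articles where topic t1 holds two articles of category A and two of category B, and topic t2 holds two articles of category C, purity is 4/6 and inverse purity is 1. Reporting both matters because purity alone rewards many tiny clusters and inverse purity alone rewards one big cluster.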

Evaluation of the initial results

Motivation

Evaluate the first results so that you can write the first report.

Implementation

  1. Select the 6 best configurations from the table:
    • LM + adagram
    • LM + ddt-wiki
    • LM + ddt-news
    • CW + adagram
    • CW + ddt-wiki
    • CW + ddt-news
  2. For each of these 6 clusterings, add an additional column that ranks the clusters according to their quality. Introduce an ad-hoc ranking, e.g. average-depth-of-hypernyms*average-similarity-of-hypernyms. Add this as an Excel formula and rank the clusters according to it.
  3. Add an additional column "Interpretable" to each of these 6 tables.
  4. Fill the column for each row with 1 if the cluster is "interpretable", e.g. a list of cities, a list of drugs, a list of dinosaurs. For uninterpretable clusters write 0. All rows of all 6 sheets shall be annotated.
  5. Draw the plot of Precision@k: x is k, i.e. the number of considered clusters; y is the fraction of relevant clusters among the first k clusters. See http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
  6. Post the plots here by 15 December 2015, the earlier the better.
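The ranking and the Precision@k curve from the steps above can also be computed outside of Excel. The following Java sketch assumes each annotated cluster is available as a row with its average hypernym depth, its average hypernym similarity and the 0/1 "Interpretable" annotation. The Row class and its field names are hypothetical. The score is the ad-hoc product suggested in step 2, and "relevant" is taken to mean "annotated as interpretable".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative Precision@k over clusters ranked by an ad-hoc quality score. */
final class PrecisionAtKSketch {

    /** One annotated cluster; field names are hypothetical. */
    static final class Row {
        final double avgHypernymDepth;
        final double avgHypernymSimilarity;
        final boolean interpretable;

        Row(double avgHypernymDepth, double avgHypernymSimilarity, boolean interpretable) {
            this.avgHypernymDepth = avgHypernymDepth;
            this.avgHypernymSimilarity = avgHypernymSimilarity;
            this.interpretable = interpretable;
        }

        /** Ad-hoc ranking score: average depth times average similarity. */
        double score() {
            return avgHypernymDepth * avgHypernymSimilarity;
        }
    }

    /**
     * Ranks rows by descending score and returns an array p where
     * p[k - 1] is the fraction of interpretable clusters among the top k.
     */
    static double[] precisionAtK(List<Row> rows) {
        List<Row> ranked = new ArrayList<>(rows);
        ranked.sort((a, b) -> Double.compare(b.score(), a.score()));
        double[] precision = new double[ranked.size()];
        int relevant = 0;
        for (int k = 1; k <= ranked.size(); k++) {
            if (ranked.get(k - 1).interpretable) {
                relevant++;
            }
            precision[k - 1] = (double) relevant / k;
        }
        return precision;
    }
}
```

Plotting the returned array against k = 1, 2, ... gives one curve per configuration. A ranking formula is useful if its curve starts high and decays slowly.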
