smndf / structured-topics
License: Apache License 2.0
We need to develop a prototype of a system that builds a structured topic model and can label new texts according to these topics. This vertical prototype should provide the minimal end-to-end functionality of the system (input/output), implemented with the most straightforward set of algorithms. The goal is an initial validation of the idea, followed by gradual improvement of the prototype. A full evaluation is not part of this step; it will be done later (during the "official" 6-month period reserved for writing the thesis).
It is nevertheless important to make a preliminary evaluation of the prototype, to show that its quality can be measured.
The prototype will build structured topics out of sense similarity graphs. These graphs were built automatically using distributional semantics methods (http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/distributional-semantics/).
The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):
Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText and AdaGram models:
Frequency dictionary: http://panchenko.me/data/joint/word-freq-news.gz -- to filter the graph.
The data have the format word cid prob cluster isas
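As a starting point for loading the DDT, the sketch below parses one line into its five columns. It assumes the columns are tab-separated, that cid is an integer and prob a decimal number, and that the file may start with a header row; none of this is stated above, so check it against the actual files. The class name DdtLineParser and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a DDT reader. Assumption: five tab-separated columns per line. */
class DdtLineParser {

    /** One row of the DDT; field names mirror the column names. */
    static final class DdtEntry {
        final String word;
        final int cid;        // assumption: sense (cluster) id is an integer
        final double prob;    // assumption: prob is a decimal number
        final String cluster; // kept as raw text, its inner format is not specified here
        final String isas;    // kept as raw text as well

        DdtEntry(String word, int cid, double prob, String cluster, String isas) {
            this.word = word;
            this.cid = cid;
            this.prob = prob;
            this.cluster = cluster;
            this.isas = isas;
        }
    }

    /** Parses a single line into its five columns. */
    static DdtEntry parseLine(String line) {
        // A limit of -1 keeps trailing empty columns, e.g. an empty isas field.
        String[] cols = line.split("\t", -1);
        if (cols.length != 5) {
            throw new IllegalArgumentException("Expected 5 columns, got " + cols.length);
        }
        return new DdtEntry(cols[0], Integer.parseInt(cols[1]),
                Double.parseDouble(cols[2]), cols[3], cols[4]);
    }

    /** Parses many lines, skipping blank lines and a header row if present. */
    static List<DdtEntry> parseAll(List<String> lines) {
        List<DdtEntry> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.equals("word\tcid\tprob\tcluster\tisas")) {
                continue;
            }
            entries.add(parseLine(line));
        }
        return entries;
    }
}
```

The cluster and isas columns are left unparsed on purpose, because their inner structure should be taken from the data itself rather than guessed.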
Cluster the graphs of sense similarities using Chinese Whispers (CW), Markov Clustering (MCL) and the Louvain Method (LM).
For the first two algorithms use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/ . Use:
CW: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/de/tudarmstadt/lt/cw/CW.java
For LM, use any available implementation, e.g. https://perso.uclouvain.be/vincent.blondel/research/louvain.html.
Description of CW is available here: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
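To make the idea behind CW concrete, here is a minimal sketch of its label propagation on an undirected weighted graph. It is not the code of the linked implementation: the class ChineseWhispersSketch, its method names and the graph representation are hypothetical, and ties are resolved by keeping the current label where possible, whereas the original algorithm breaks ties randomly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Minimal Chinese Whispers sketch (simplified tie-breaking, see lead-in). */
class ChineseWhispersSketch {

    /** Stores the undirected edge u-v in both directions. */
    static void addEdge(Map<String, Map<String, Double>> graph, String u, String v, double weight) {
        graph.computeIfAbsent(u, k -> new HashMap<>()).put(v, weight);
        graph.computeIfAbsent(v, k -> new HashMap<>()).put(u, weight);
    }

    /** Returns a class id per node; nodes with the same id form one cluster. */
    static Map<String, Integer> cluster(Map<String, Map<String, Double>> graph,
                                        int maxIterations, long seed) {
        List<String> nodes = new ArrayList<>(graph.keySet());
        Collections.sort(nodes); // fixed starting order, so a seed reproduces a run
        Map<String, Integer> label = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            label.put(nodes.get(i), i); // initially every node is its own class
        }
        Random random = new Random(seed);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            Collections.shuffle(nodes, random); // visit nodes in random order
            for (String node : nodes) {
                // Sum the edge weights per class over the node's neighbourhood.
                Map<Integer, Double> score = new HashMap<>();
                for (Map.Entry<String, Double> edge : graph.get(node).entrySet()) {
                    Integer neighbourLabel = label.get(edge.getKey());
                    if (neighbourLabel != null) {
                        score.merge(neighbourLabel, edge.getValue(), Double::sum);
                    }
                }
                // Switch only to a class that is strictly stronger than the current one.
                int best = label.get(node);
                double bestScore = score.getOrDefault(best, 0.0);
                for (Map.Entry<Integer, Double> candidate : score.entrySet()) {
                    if (candidate.getValue() > bestScore) {
                        best = candidate.getKey();
                        bestScore = candidate.getValue();
                    }
                }
                if (best != label.get(node)) {
                    label.put(node, best);
                    changed = true;
                }
            }
            if (!changed) {
                break; // no label changed in a full pass
            }
        }
        return label;
    }
}
```

On a graph of two tightly connected triangles joined by a single weak edge, this sketch yields two classes, one per triangle. For the actual experiments the implementations linked above should be preferred.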
The output of clustering shall look like this:
structured-topic-id<TAB>sense-id-1,sense-id-2,sense-id-3,...
To make each topic more readable, assign the 3 most frequent hypernyms of its senses as labels (topic-labels). The set of hypernyms will be provided.
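One straightforward way to pick the labels is to count how often each hypernym occurs across the senses of a topic and keep the top 3. The sketch below assumes the provided hypernyms can be loaded into a map from sense id to a list of hypernyms; that shape, the class name TopicLabeler and the alphabetical tie-breaking are assumptions, not part of the task description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: label a topic with the most frequent hypernyms of its senses. */
class TopicLabeler {

    /**
     * Returns up to k hypernyms, most frequent first.
     * Ties are broken alphabetically so that the output is stable.
     */
    static List<String> topHypernyms(List<String> senseIds,
                                     Map<String, List<String>> hypernymsBySense,
                                     int k) {
        // Count each hypernym once per sense that lists it.
        Map<String, Integer> counts = new HashMap<>();
        for (String senseId : senseIds) {
            List<String> hypernyms =
                    hypernymsBySense.getOrDefault(senseId, Collections.<String>emptyList());
            for (String hypernym : hypernyms) {
                counts.merge(hypernym, 1, Integer::sum);
            }
        }
        // Sort by descending count, then alphabetically.
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

Calling topHypernyms(senseIds, hypernymsBySense, 3) gives the content of the topic-labels field for one topic. If the provided hypernyms come with their own frequencies or weights, summing those instead of counting would be a natural variation.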
In addition, for each topic label, find the URL of an image that depicts it on DBpedia (for instance http://dbpedia.org/page/Berlin). The images are located in the field topic-label-image-urls. Ambiguous words (e.g. http://dbpedia.org/page/Python) should be disambiguated. The output shall look like this:
structured-topic-id<TAB>topic-labels<TAB>topic-label-image-urls<TAB>sense-id-1,sense-id-2,sense-id-3,...
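Writing this line could look like the sketch below. The format above only shows that the sense ids are comma-separated; joining the labels and the image URLs with commas as well is an assumption, and it would need a different separator if the image URLs themselves contain commas. The class name TopicLineFormat is hypothetical.

```java
import java.util.List;

/** Sketch: serialise one labelled topic as a tab-separated line. */
class TopicLineFormat {

    /**
     * Builds: topic id, labels, image URLs, sense ids, separated by tabs.
     * Assumption: multi-valued fields are joined with commas.
     */
    static String format(String topicId, List<String> labels,
                         List<String> imageUrls, List<String> senseIds) {
        return String.join("\t",
                topicId,
                String.join(",", labels),
                String.join(",", imageUrls),
                String.join(",", senseIds));
    }
}
```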
Evaluate the interpretability of the topics by sampling 100 topics at random and annotating each as "interpretable", "not interpretable" or "mixed".
Make a basic classification module that uses the structured topics (clusters of senses) to annotate text documents. The module should:
load the structured topics
structured-topic-id<TAB>cluster-word-1,cluster-word-2,cluster-word-3,...
for an input document output a set of most relevant topics
each output topic should have a confidence of the classification
To implement this module, use an ElasticSearch index: each topic is indexed as one document, and the input document is used as the search query. The retrieval system then returns a list of documents (topics) ranked by their TF-IDF score.
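The retrieval idea can be tried out without a running index. The sketch below is a plain in-memory stand-in, not ElasticSearch code: it treats each topic as a bag of cluster words and scores it against the input text with a simplified TF-IDF (term frequency times ln(1 + N/df)). The class TfIdfTopicRanker, its methods and this exact formula are assumptions for illustration; the scores an ElasticSearch index returns are computed differently, and newer versions default to BM25 rather than classic TF-IDF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of "topic = document, input text = query" with a simplified TF-IDF. */
class TfIdfTopicRanker {
    private final Map<String, Map<String, Integer>> termCountsByTopic = new HashMap<>();
    private final Map<String, Integer> documentFrequency = new HashMap<>();

    /** Indexes one topic; call once per topic id. */
    void addTopic(String topicId, List<String> clusterWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : clusterWords) {
            counts.merge(word.toLowerCase(), 1, Integer::sum);
        }
        termCountsByTopic.put(topicId, counts);
        for (String term : counts.keySet()) {
            documentFrequency.merge(term, 1, Integer::sum);
        }
    }

    /** Returns matching topics with their scores, best first. */
    Map<String, Double> rank(String text) {
        int topicCount = termCountsByTopic.size();
        // Crude tokenisation: lower-case and split on non-word characters.
        String[] queryTerms = text.toLowerCase().split("\\W+");
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> topic : termCountsByTopic.entrySet()) {
            double score = 0.0;
            for (String term : queryTerms) {
                Integer tf = topic.getValue().get(term);
                if (tf == null) {
                    continue; // term does not occur in this topic
                }
                double idf = Math.log(1.0 + (double) topicCount / documentFrequency.get(term));
                score += tf * idf;
            }
            if (score > 0) {
                scores.put(topic.getKey(), score);
            }
        }
        // Sort by descending score and keep that order in the result map.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(scores.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : sorted) {
            ranked.put(entry.getKey(), entry.getValue());
        }
        return ranked;
    }
}
```

The returned score can serve as a first, uncalibrated confidence value; dividing by the top score would map it to the range (0, 1] if a relative confidence is preferred.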
How scoring of ElasticSearch works:
Evaluate the first results so that you are able to write the first report.