Code Monkey home page Code Monkey logo

teem-tag's Introduction

#teem-tag Module which interacts with the Teem, a project under P2PValue Project.

Made with <3 for open-source. This module is part of the matchmaking proposal for which I got selected as a Google Summer of Code Student in 2016, under the organization Berkman Klein Center for Internet & Society.

Additional Links:

##Objective

Automatic analysis of the description of teams in Teem to get valid and important tokens. Currently new users have a lot of friction in finding communities of their interest. By analysing the data from the communities, we can generate a map of interests. These interests can therefore be grouped and the user can be onboarded more easily through the idea of having to select their interests and then being directed to relevant communities. Future goals of this module to become a full fletched module for ML + NLP tasks related to the project.

##Architecture This module heavily uses nltk package for most of it's inner working. It's build on top of Flask to provide a webserver. The module integrates as a docker container with SwellRT, the full-stack backend framework for Teem. Following is a schematic diagram for the same: alt tag

##Brief Explanation Extracting tags from a text document involves at least three steps:

  • Splitting the document into words. (Reading)
  • Grouping together variants of the same word. (Stemming)
  • Ranking them according to their relevance. (Ranking)

###Reader A Reader object accepts a input string (document), perform normalisation and noise filteringof the text (such as turning everything into lower case), analyse the structure of the phrases and punctuation, and return a list of words respecting the order in the text, with some additional information such as which ones look like proper nouns, or are at the end of a phrase.

###Stemmer The Stemmer tries to recognise the root of a word, in order to identify slightly different forms. This is already a quite complicated task, and it's clearly language-specific. The stem module in the NLTK package provides algorithms for many languages. Currently the module only supports English language and uses PorterStemmer as the default stemmer.

###Rater The Rater takes the list of words contained in the document, together with any additional information gathered at the previous stages, and returns a list of tags (i.e. words or small units of text) ordered by some idea of "relevance". It turns out that just working on the information contained in the document itself is not enough, because it says nothing about the frequency of a term in the language. For this reason, the module consists of analysing a corpus (i.e. a sample of documents written in the same language) to build a dictionary of known words. This is taken care by the build_dict() function. It is advised to build your own dictionaries, and the build_dict_from_nltk() function in the build_dict.py enables you to use the corpora included in NLTK.

teem-tag's People

Contributors

prastut avatar

Watchers

James Cloos avatar Fenil Patel avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.