
CipCipPy

Twitter IR system for the TREC Microblog track.

Authors:

For license information see LICENSE

Installation

In config.py, set the constant DATA_PATH to the directory where CipCipPy will store its data (indexes, cache, ...).
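For example, config.py could contain an entry like the following (the path shown is a placeholder, not a real default):

```python
# config.py -- DATA_PATH is the only required setting; the path below is
# a placeholder, point it at any writable directory on your machine.
DATA_PATH = "/home/user/cipcippy-data"
```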

Install the package by executing python setup.py install.

Some hints:

  • The language training files included in data/resources/languageTraining are made for the CipCipPy.utils.language.Lang class.
  • Hashtag segmentation requires a dictionary {term: frequency}; for example, we used the Google Books 1-grams: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.
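Such a dictionary can be built by aggregating counts from a tab-separated term/count file. A minimal sketch, assuming a generic `term<TAB>count` layout (the Google 1-grams files carry extra columns, such as the year, so counts for the same term would be summed across rows, as done here):

```python
# Sketch: build a {term: frequency} dictionary for hashtag segmentation
# from a TSV file with a term in the first column and a count in the
# second. The file layout is an assumption, not CipCipPy's own format.
from collections import defaultdict


def load_term_frequencies(path):
    """Aggregate counts per lowercased term from (term, count) rows."""
    freqs = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[1].isdigit():
                continue  # skip malformed rows
            freqs[fields[0].lower()] += int(fields[1])
    return dict(freqs)
```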

Dependencies

Packages and files

  • config.py Global parameters for configuring the environment.
  • corpus A new corpus can be generated with the function corpus.build, passing a list of instances of the classes from corpus.filters.
indexing Each module in this package contains a function index for generating an index from a plain-text corpus, filtering out documents after a specific time. index takes as argument a name that is later used in the retrieval phase.
  • retrieval Tools for accessing (e.g., searching) the indexes.
  • classification Supervised learning utilities.
  • filtering Classes for real-time filtering (a TREC microblog task), which mainly use classification.
  • utils Generic classes and functions used in the library.
scripts (not a package) Scripts for executing applications that use the library (building indexes, validating models, ...).
  • data (not a package) The data path contains all the data files used by the library (cached data, indexes, ...).
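The corpus/filters design described above can be illustrated with a minimal, self-contained sketch: each filter is a callable applied to every document in order. The class and function names here are illustrative, not CipCipPy's actual API; see corpus.build and corpus.filters for the real ones.

```python
# Minimal sketch of the filter-chain idea: filters are callables that
# transform a document's text, applied in sequence to every document.
import html


class HtmlUnescapeFilter:
    """Decode HTML entities, e.g. '&amp;' -> '&'."""
    def __call__(self, text):
        return html.unescape(text)


class LowercaseFilter:
    """Normalize case."""
    def __call__(self, text):
        return text.lower()


def build_corpus(docs, filters):
    """Apply each filter in order to every document."""
    out = []
    for doc in docs:
        for f in filters:
            doc = f(doc)
        out.append(doc)
    return out
```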

Running experiments

This guide covers real-time filtering experiments of the TREC Microblog track: https://sites.google.com/site/microblogtrack/2012-guidelines. The experiments in [1] were conducted according to this protocol. You will need to run a Dexter REST API server and to download the files from http://trec.nist.gov/data/microblog2012.html. Do not hesitate to email the authors for help.

  1. First install the TREC twitter tools (http://github.com/lintool/twitter-tools) in order to download the corpus, which you must obtain following the TREC guidelines. CipCipPy includes tools for downloading the corpus via twitter-tools, in CipCipPy.corpus.trec. Write the corpus to a directory on disk as plain text files, using CipCipPy.corpus.trec.dump.
  2. Now you can build processed versions of the corpus using CipCipPy.corpus.build, i.e. the English corpus and the link-titles corpus. Take the buildCorpus.py script as an example, in which the filter HtmlUnescape is already set. Create three different corpora with the filters EnglishTri, EnglishLangid and LinkTitles (the last must be created from the final English corpus). Create the union of the two English corpora using CipCipPy.corpus.enrich; see the enrichCorpus.py script as an example.
  3. Corpora must be indexed in order to generate the dataset. Two scripts, index.py and dataset.py, executed in this order with the proper arguments, generate the final dataset. See the scripts' code for details.
  4. A separate script is needed for annotating query topics: annotateQueries.py.
  5. Finally, with validate.py and test.py you can run experiments. The first script outputs a text file with the evaluation of the topics file passed as argument (we created a file containing only the validation topics). The second script performs a similar process and writes the dump files of the results.
    The doc strings of both scripts detail the input arguments. An example of the model parameter string: RO-0.655-1000-0.2-0.1-terms.stems.bigrams.hashtags-terms.stems.bigrams-candidateEntities. CipCipPy.classification.feature contains the code and the doc strings for understanding the feature extraction functions.
  6. Use evaluate.py for printing the evaluation from result dump files.
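The model parameter string shown in step 5 is dash-separated, with the trailing fields naming dot-separated feature-extraction functions. The split into five leading scalar fields and three feature groups below is an assumption drawn from the string's shape; the authoritative field meanings are in the scripts' doc strings.

```python
# Illustrative only: decompose a dash-separated model parameter string.
# The 5-field head / feature-group split is an assumption, not
# CipCipPy's documented grammar.
def parse_model_params(s):
    fields = s.split("-")
    head = fields[:5]                              # scalar parameters
    feature_groups = [f.split(".") for f in fields[5:]]  # feature names
    return head, feature_groups


head, features = parse_model_params(
    "RO-0.655-1000-0.2-0.1-"
    "terms.stems.bigrams.hashtags-terms.stems.bigrams-candidateEntities")
```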

Publications

[1] On the Impact of Entity Linking in Microblog Real-Time Filtering. Berardi G., Ceccarelli D., Esuli A., & Marcheggiani D.
[2] ISTI@TREC Microblog track 2012: real-time filtering through supervised learning. Berardi G., Esuli A., & Marcheggiani D.
[3] ISTI@TREC Microblog track 2011: exploring the use of hashtag segmentation and text quality ranking. Berardi G., Esuli A., Marcheggiani D., & Sebastiani F.

