
CipCipPy

Twitter IR system for the TREC Microblog track.

Authors:

For license information see LICENSE

Installation

In config.py, set the constant DATA_PATH to the directory where CipCipPy will store its data (indexes, cache, ...).
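For example, config.py could contain an entry like the following (the path shown is a placeholder, not a real default):

```python
# config.py -- DATA_PATH is the only required setting; the path below is
# a placeholder, point it at any writable directory on your machine.
DATA_PATH = "/home/user/cipcippy-data"
```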

Install the package by executing python setup.py install.

Some hints:

  • The language training files included in data/resources/languageTraining are made for the CipCipPy.utils.language.Lang class.
  • Hashtag segmentation requires a dictionary {term: frequency}; for example, we used the Google Books 1-grams: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.
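Such a dictionary can be built by aggregating counts from a tab-separated term/count file. A minimal sketch, assuming a generic `term<TAB>count` layout (the Google 1-grams files carry extra columns, such as the year, so counts for the same term would be summed across rows, as done here):

```python
# Sketch: build a {term: frequency} dictionary for hashtag segmentation
# from a TSV file with a term in the first column and a count in the
# second. The file layout is an assumption, not CipCipPy's own format.
from collections import defaultdict


def load_term_frequencies(path):
    """Aggregate counts per lowercased term from (term, count) rows."""
    freqs = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[1].isdigit():
                continue  # skip malformed rows
            freqs[fields[0].lower()] += int(fields[1])
    return dict(freqs)
```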

Dependencies

Packages and files

  • config.py Global parameters for configuring the environment.
  • corpus A new corpus can be generated with the function corpus.build, passing a list of instances of the classes from corpus.filters.
indexing Each module in this package contains a function index for generating an index from a plain-text corpus, filtering out documents after a specific time. index takes as argument a name that is later used in the retrieval phase.
  • retrieval Tools for accessing (e.g., searching) the indexes.
  • classification Supervised learning utilities.
  • filtering Classes for real-time filtering (a TREC microblog task), which mainly use classification.
  • utils Generic classes and functions used in the library.
scripts (not a package) Scripts for executing applications that use the library (building indexes, validating models, ...).
  • data (not a package) The data path contains all the data files used by the library (cached data, indexes, ...).
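The corpus/filters design described above can be illustrated with a minimal, self-contained sketch: each filter is a callable applied to every document in order. The class and function names here are illustrative, not CipCipPy's actual API; see corpus.build and corpus.filters for the real ones.

```python
# Minimal sketch of the filter-chain idea: filters are callables that
# transform a document's text, applied in sequence to every document.
import html


class HtmlUnescapeFilter:
    """Decode HTML entities, e.g. '&amp;' -> '&'."""
    def __call__(self, text):
        return html.unescape(text)


class LowercaseFilter:
    """Normalize case."""
    def __call__(self, text):
        return text.lower()


def build_corpus(docs, filters):
    """Apply each filter in order to every document."""
    out = []
    for doc in docs:
        for f in filters:
            doc = f(doc)
        out.append(doc)
    return out
```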

Running experiments

This guide covers real-time filtering experiments of the TREC Microblog track: https://sites.google.com/site/microblogtrack/2012-guidelines. The experiments in [1] were conducted according to this protocol. You will need to run a Dexter REST API server and to download the files from http://trec.nist.gov/data/microblog2012.html. Do not hesitate to email the authors for help.

  1. First install the TREC twitter tools (http://github.com/lintool/twitter-tools) in order to download the corpus, which you must obtain following the TREC guidelines. CipCipPy includes tools for downloading the corpus via twitter-tools, in CipCipPy.corpus.trec. Write the corpus to a directory on disk as plain text files, using CipCipPy.corpus.trec.dump.
  2. Now you can build processed versions of the corpus using CipCipPy.corpus.build, i.e. the English corpus and the link-titles corpus. Take the buildCorpus.py script as an example, in which the filter HtmlUnescape is already set. Create three different corpora with the filters EnglishTri, EnglishLangid and LinkTitles (the last must be created from the final English corpus). Create the union of the two English corpora using CipCipPy.corpus.enrich; see the enrichCorpus.py script as an example.
  3. Corpora must be indexed in order to generate the dataset. Two scripts, index.py and dataset.py, executed in this order with the proper arguments, generate the final dataset. See the scripts' code for details.
  4. A separate script is needed for annotating query topics: annotateQueries.py.
  5. Finally, with validate.py and test.py you can run experiments. The first script outputs a text file with the evaluation of the topics file passed as argument (we created a file containing only the validation topics). The second script performs a similar process and writes the dump files of the results.
    The doc strings of both scripts detail the input arguments. An example of the model parameter string: RO-0.655-1000-0.2-0.1-terms.stems.bigrams.hashtags-terms.stems.bigrams-candidateEntities. CipCipPy.classification.feature contains the code and the doc strings for understanding the feature extraction functions.
  6. Use evaluate.py for printing the evaluation from result dump files.
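The model parameter string shown in step 5 is dash-separated, with the trailing fields naming dot-separated feature-extraction functions. The split into five leading scalar fields and three feature groups below is an assumption drawn from the string's shape; the authoritative field meanings are in the scripts' doc strings.

```python
# Illustrative only: decompose a dash-separated model parameter string.
# The 5-field head / feature-group split is an assumption, not
# CipCipPy's documented grammar.
def parse_model_params(s):
    fields = s.split("-")
    head = fields[:5]                              # scalar parameters
    feature_groups = [f.split(".") for f in fields[5:]]  # feature names
    return head, feature_groups


head, features = parse_model_params(
    "RO-0.655-1000-0.2-0.1-"
    "terms.stems.bigrams.hashtags-terms.stems.bigrams-candidateEntities")
```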

Publications

[1] On the Impact of Entity Linking in Microblog Real-Time Filtering. Berardi G., Ceccarelli D., Esuli A., & Marcheggiani D.
[2] ISTI@TREC Microblog track 2012: real-time filtering through supervised learning. Berardi G., Esuli A., & Marcheggiani D.
[3] ISTI@TREC Microblog track 2011: exploring the use of hashtag segmentation and text quality ranking. Berardi G., Esuli A., Marcheggiani D., & Sebastiani F.

