
anserini's People

Contributors

16bitnarwhal, arthurchen189, borislin, chriskamphuis, crystina-z, dependabot[bot], edwinzhng, emmileaf, iorixxx, jasper-xian, jimmy0017, jmmackenzie, justram, kytabyte, lintool, luchentan, lukuang, mofetoluwa, mxueguang, nikhilro, peilin-yang, rodrigonogueira4, ronakice, rosequ, shaneding, stephaniewhoo, toluclassics, tteofili, victor0118, yuki617


anserini's Issues

Indexing all of ClueWeb09

Quite impressively, I was able to index all of ClueWeb09 (English):

nohup sh target/appassembler/bin/IndexClueWeb09b \
  -input /scratch1/collections/ClueWeb09.English/data/ \
  -index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &

Took ~18 hours:

2015-10-16 07:51:04,775 INFO  [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04

Index size (note: no positions):

$ du -h lucene-index.cw09.cnt/
254G    lucene-index.cw09.cnt/

Implement RM3

We probably need some relevance feedback model... RM3 is probably our best bet.

Put example command to dump LTR features into the README.md

Specifically:

sh target/appassembler/bin/DumpTweetsLtrData \
  -index tweets2011-index/ \
  -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt \
  -output ltr.data.txt \
  -qrels src/main/resources/topics-and-qrels/qrels.microblog2011.txt -ql

Integrate CACM collection

The CACM collection is small enough that we can include it in the repository... so we can have indexing/retrieval experiments completely integrated with the system.

Refactoring of the index and document

For now, everything is based on WARC-formatted records.
We'll want other record types too, e.g., TREC text, and possibly other formats in the future.
It would be better to have a base record class that everything else inherits from.

Implement indexing for selective search

In selective search, the document collection is divided into different partitions (e.g., by clustering). Write an indexer that takes a cluster mapping (docid to clusterid mapping) and builds the right indexes - i.e., puts the documents in the appropriate partition index.
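A minimal sketch of the routing step described above, assuming a mapping file with lines of the form "docid clusterid" (the class and method names here are hypothetical; a real indexer would hold one Lucene IndexWriter per partition and write each document through the writer its cluster maps to):

```java
import java.util.*;

class SelectivePartitioner {
  private final Map<String, String> docToCluster = new HashMap<>();

  // Load a mapping file with lines of the form "docid clusterid".
  public void loadMapping(Iterable<String> lines) {
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length == 2) {
        docToCluster.put(parts[0], parts[1]);
      }
    }
  }

  // Route a document to its partition; a real indexer would look up the
  // IndexWriter associated with the returned cluster id and add the
  // document there. Unmapped docids fall into a catch-all partition.
  public String partitionFor(String docid) {
    return docToCluster.getOrDefault(docid, "default");
  }
}
```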

Implement Tweets2011/2012 baseline

Get basic indexing/retrieval working on TREC Microblog track data from 2011 to 2014. Let's start with TREC 2011 and TREC 2012 microblog data since the corpus is smaller...

Refactor ClueWeb09b to parallel structure of IndexGov2

@iorixxx Please check out my branch cw09b-refactoring

I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:

  • Clean up various references (e.g., pom.xml) to make sure everything still works?
  • Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?

Thanks!

Experiment with different analyzers on Gov2

According to @iorixxx

EnglishAnalyzer: PorterStemmer is aggressive, and stop word removal would make certain queries (the wall, the current, the sun, to be or not to be) meaningless.
I think analysis should be minimal.

We should play with different analyzers and evaluate impact on effectiveness.

Issue with using QueryParser to parse TREC topics

We're currently using QueryParser to parse TREC topics, which means that symbols in the topics like parentheses and quotes get interpreted as query operators... this isn't the desired behavior.
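One possible remedy (a sketch, not necessarily the fix we'd adopt) is to escape operator characters before handing topic text to the parser. Lucene ships a static QueryParser.escape for exactly this; its behavior is roughly the following stdlib-only reimplementation:

```java
class TopicEscaper {
  // Characters the classic QueryParser treats as operators; escaping them
  // makes topic text parse as plain terms. (Mirrors Lucene's
  // QueryParser.escape special-character set.)
  private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

  public static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\');  // prefix each operator character with a backslash
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

In practice we could just call Lucene's own QueryParser.escape on the topic text before parsing, rather than maintaining a copy like this.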

Create RerankerCascade abstraction

It would make sense to create a RerankerCascade abstraction for running a sequence of rerankers. Something like:

RerankerCascade cascade = new RerankerCascade(context).add(foo).add(bar);
cascade.run(docs);
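A minimal sketch of what such an abstraction could look like (pure Java, with docids standing in for scored documents; the context argument from the snippet above is omitted for brevity, and type names beyond RerankerCascade are hypothetical):

```java
import java.util.*;

// Each stage takes a ranking in and hands a ranking back out.
interface Reranker {
  List<String> rerank(List<String> docs);
}

class RerankerCascade {
  private final List<Reranker> rerankers = new ArrayList<>();

  // Fluent add, enabling cascade.add(foo).add(bar).
  RerankerCascade add(Reranker r) {
    rerankers.add(r);
    return this;
  }

  // Run each stage in order, feeding the output of one into the next.
  List<String> run(List<String> docs) {
    List<String> current = docs;
    for (Reranker r : rerankers) {
      current = r.rerank(current);
    }
    return current;
  }
}
```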

Write an indexer for flat text files

@yb1 You probably want to dump out the cleaned text in a simple text format, something like this:

URL1 document1 ....
URL2 document2 ...

And write an indexer for it. Look at IndexTweets.java and IndexWebCollection.java here:
https://github.com/lintool/Anserini/tree/master/src/main/java/io/anserini/index

The tweets indexer should be fairly easy to understand - it's single-threaded so it's slower. IndexWebCollection is multi-threaded and thus much faster.

I would start with a single-threaded implementation. Call the class IndexPlainText or something like that.
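The proposed flat format could be parsed with something like the following (a sketch; the class name is hypothetical), which an IndexPlainText implementation would iterate over line by line before handing documents to Lucene:

```java
class PlainTextRecord {
  public final String url;       // serves as the docid
  public final String contents;  // cleaned document text

  public PlainTextRecord(String url, String contents) {
    this.url = url;
    this.contents = contents;
  }

  // Split "URL1 document1 ..." at the first space into URL and body text.
  public static PlainTextRecord parse(String line) {
    int space = line.indexOf(' ');
    if (space < 0) {
      return new PlainTextRecord(line, "");  // URL with no body
    }
    return new PlainTextRecord(line.substring(0, space), line.substring(space + 1));
  }
}
```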

Lucene query parser cannot parse wildcard queries

Lucene query parser gives the following error if the query has wildcard characters in it:
'*' or '?' not allowed as first character in WildcardQuery
Ex: Cannot parse 'where is the Eldorado Casino in Reno ?': '*' or '?' not allowed as first character in WildcardQuery.

IndexCounter code broken

@LuchenTan IndexCounter code doesn't compile, so master is currently broken.

  • Wrong package
  • Uses Args class which has been removed. See IndexGov2 for example of how to use args4j
  • Class has a weird name - can you please rename to DumpDocids or something like that?
  • Can you please change indentation to 2 spaces instead of tabs? Search online for the Eclipse code formatter; indentation is one of its options. Change it to be consistent with everyone else.

Indentation size

@iorixxx Do you mind if we agree on code indentation being two spaces, just to be consistent?

If so, can you please reformat your code? I'd rather you do it, to better preserve history for git blame. Please send a pull request.

Thanks!

Connect NRTS demo with RTS mobile push broker

@aroegies @xeniaqian94 Can you two coordinate on making this happen?

RTS mobile push broker: https://github.com/aroegies/trecrts-tools

  • Decide on a REST API so that the NRTS demo can call the broker
  • Note that the REST API should have the notion of a queryid, user, and token (=password)
  • Modify the NRTS demo so that you pass in a queryid and a query (e.g., "birthday") on the command line, and also an interval, e.g., 1 minute. Every minute, the NRTS demo wakes up, runs the query, and pushes results to the RTS broker

Try out different analyzers on Tweets collection

@xeniaqian94 It would be great for you to get some experience running end-to-end ad hoc experiments, which is a core activity of IR research. Let's start with something simple, like playing with different analyzers - currently, the tweet indexing uses PorterStemFilter. Try removing it and see what the effect is. So:

  • change the analyzer to remove stemming
  • rebuild index
  • run retrieval experiments - report effects on MAP, P@30 (compare with original index).

It would also be nice to know the effects of indexing only English tweets, using the same procedure as above.

Code Comments

Before we get too far into hacking on Anserini, we should probably decide on how we want to deal with comments.

Do we want to do Javadoc? Something else?

Nondeterminism in documents indexed for Gov2?

Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.

Weird - some non-determinism in the multi-threading?

Not that important if we can replicate effective results on standard test collections, but worth noting.

Try out RM3 on Gov2

Current implementation of RM3 works for Tweets... let's see if it works for Gov2.

Need to build a Gov2 index that stores doc vectors.

Prepare TST data for baselines by Anserini

From @aroegies

If you tell me what fields are desired from:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_3_0.thrift
Open questions:

  • Do we want raw HTML, cleaned up HTML, cleaned up visible only HTML?
  • Do we just want the sentences (e.g. for compatibility with TST eval)?
  • Do we want some combination of the above?
  • We likely don't want to save any of their tagging.
  • What metadata should we retain, though?

Then I should be able to quickly put together a script to re-crawl, format, and encode in JSON the documents.

Likely we just want to use the entire KBA dataset rather than the TST subset but whatever.

Implement DocumentReranker interface

It seems what we need is a generic document reranking interface: takes a document ranking and spits another document ranking back out. This would implement a standard multi-stage retrieval pipeline: e.g., BM25 (or QL) + 1st stage reranker + 2nd stage reranker, etc.

Simple LTR implementation for Tweets

@LuchenTan @xeniaqian94 Let's start with a simple two-feature LTR implementation for Tweets:

  • Start with the current tweet search implementation, which has two rerankers, rm3 and cleanup.
  • Your implementation is going to go in a third reranker that you tack on to the end.

Let's build an LTR implementation that has just two features: RM3 score + number of hashtags. Inside your new reranker, you already have the RM3 score; use getField on the document to pull out the text, and then just count the number of hashtags. Print out a line like this:

1 325263 0.432 3

Topic 1, docid 325263, RM3 score of 0.432, 3 hashtags. Dump this information for all docs.

You'll need to take this file and join it with the qrels to get the relevance judgments (i.e., write a simple Python script to do it). You'll end up with a file like:

1 325263 0.432 3 1

The final column is the relevance judgment. Now you can run learning to rank using http://sourceforge.net/p/lemur/wiki/RankLib/
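The join itself is straightforward; here is a sketch of the logic (shown in Java for consistency with the rest of the codebase, though the Python script suggested above works just as well; qrels lines are assumed to follow the standard TREC "topic iteration docid judgment" format, and unjudged documents default to 0):

```java
import java.util.*;

class QrelsJoin {
  // Join feature lines ("topic docid score hashtags") with qrels lines
  // ("topic 0 docid judgment"), appending the judgment as a final column.
  public static List<String> join(List<String> features, List<String> qrels) {
    Map<String, String> judgments = new HashMap<>();
    for (String q : qrels) {
      String[] f = q.trim().split("\\s+");
      judgments.put(f[0] + " " + f[2], f[3]);  // key: "topic docid"
    }
    List<String> out = new ArrayList<>();
    for (String line : features) {
      String[] f = line.trim().split("\\s+");
      String judgment = judgments.getOrDefault(f[0] + " " + f[1], "0");
      out.add(line + " " + judgment);
    }
    return out;
  }
}
```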
