mpkato / interleaving
A python library for conducting interleaving, which compares two or multiple rankers based on observed user clicks by interleaving their results.
License: MIT License
Needless to say, documentation is necessary
Harrie Oosterhuis, and Maarten de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017.
In [Schuth+, CIKM 2014], the return value L of the prefix constraint sampling is a set. Therefore, each prefix should be distinct. However, the current implementation of Optimized::_sample_rankings (inherited from InterleavingMethod) seems to allow generating the same prefix multiple times. Shouldn't we modify the method?
Is it OK if the output of InterleavingMethod.evaluate is changed to a list where each element (i,j) indicates i won j?
Current:
Return one of the following tuples:
- (1, 0): Ranking 'a' won
- (0, 1): Ranking 'b' won
- (0, 0): Tie
Expected:
Return a list of pairs of ranker indices in which element (i, j) indicates i won j.
e.g. a result [(1, 0), (2, 1), (2, 0)] indicates that ranker 1 won against ranker 0, and ranker 2 won against both rankers 0 and 1.
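As a small sketch of how a caller might consume the proposed return value (the aggregation below is illustrative, not part of the proposal):

```python
# Hypothetical result of the proposed evaluate: each pair (i, j)
# means ranker i won against ranker j.
result = [(1, 0), (2, 1), (2, 0)]

# Count pairwise wins per ranker from the list of outcomes.
wins = {i: sum(1 for winner, _ in result if winner == i) for i in range(3)}
assert wins == {0: 0, 1: 1, 2: 2}
```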
I forgot to remove a call to print from the test code for Probabilistic.
This library is great. Would it be possible to make a pypi package out of it?
If you need help in doing so, I'd love to help out.
Some methods use sets to ensure the uniqueness of items and rely on the order of items in the set. Python's set order is not deterministic across interpreters, which makes it hard to control, especially during tests, as there is no seed for set that can be controlled. Using dict.fromkeys instead of set achieves the same uniqueness while guaranteeing a deterministic ordering of items.
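A minimal illustration of the difference:

```python
items = [2, 3, 1, 2, 3]

# set() deduplicates, but its iteration order is not guaranteed
unique_via_set = set(items)

# dict.fromkeys deduplicates too, and preserves first-seen order
# (dicts keep insertion order as of Python 3.7)
unique_ordered = list(dict.fromkeys(items))
assert unique_ordered == [2, 3, 1]
```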
The following methods are affected:
Suppose A = [1, 2, 3] and B = [2, 3, 1]. Only two rankings are obtained by TeamDraft: [1, 2, 3] and [2, 1, 3]. However, there should be four types of rankings, i.e.
The current implementation of Ranking.hash only takes into account the equality of the ranking, but not that of the team.
I have been thinking about how to pass the interleaving results to a production environment.
Probably, an instance of InterleavingMethod can have a method dump_rankings(self, file), which dumps sampled rankings into a file. Rankings, their probabilities, and some additional information (e.g. teams or credits) must be stored in the dump file. It is probably convenient to assign a unique ID to each ranking by using hash.
The format of the dump file should be something readable by other programs, such as JSON or XML.
A production system can load the dump file, show the rankings to users, observe user clicks, and record the number of impressions and the rank of clicks for each ranking.
InterleavingMethod.load_rankings(cls, file) can then load the rankings from the dump file and somehow obtain the number of impressions and the rank of clicks. InterleavingMethod.evaluate(cls, ranking, clicks) can be used for deciding the ranking preference based on the loaded information.
In summary, my proposal is to implement:
- InterleavingMethod.dump_rankings(self, file) for dumping sampled rankings
- InterleavingMethod.load_rankings(cls, file) for loading the dumped rankings
- Ranking.dumps(self) and Ranking.loads(self, s)
{
    ranking_id: {
        'probability': float,
        'ranking': {
            'ranking_list': list,
            'teams': { ranker_index: list }
        }
    }
}
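A minimal sketch of a round trip through the proposed format, assuming the schema above; using hash() of the ranking tuple as the unique ranking ID is just one hypothetical choice, and the sampled rankings here are made up:

```python
import io
import json

# Two hypothetical sampled rankings with their probabilities and teams.
sampled = [
    {"probability": 0.5, "ranking_list": [1, 2, 3], "teams": {0: [1, 3], 1: [2]}},
    {"probability": 0.5, "ranking_list": [2, 1, 3], "teams": {0: [1], 1: [2, 3]}},
]

dump = {
    # hash of the ranking tuple as a (hypothetical) unique ranking ID
    str(hash(tuple(r["ranking_list"]))): {
        "probability": r["probability"],
        "ranking": {
            "ranking_list": r["ranking_list"],
            # JSON object keys must be strings, so ranker indices are stringified
            "teams": {str(k): v for k, v in r["teams"].items()},
        },
    }
    for r in sampled
}

# Serialize and load back, as a production system would from a file.
buf = io.StringIO()
json.dump(dump, buf)
buf.seek(0)
loaded = json.load(buf)
assert sum(v["probability"] for v in loaded.values()) == 1.0
```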
Sorry for trying to change the arguments in InterleavingMethod again.
The current arguments of InterleavingMethod are:
def __init__(self, max_length, sample_num, *lists)
This could be improved, since keyword arguments with default values cannot be used together with a variable-length argument.
There are some options:
def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3)
(lists is no longer a variable-length argument)
def __init__(self, max_length, *lists, **kwargs)
(kwargs is a dict of keyword arguments)
e.g.
im = InterleavingMethod(10, [1, 2], [3, 4], hoge=3)
# kwargs = {"hoge": 3}
# If "foo" is missing in this dict, we may use a default value for "foo" by explicitly doing so:
if "foo" not in kwargs:
    kwargs["foo"] = FOO_DEFAULT_VALUE
Which do you prefer?
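For comparison, a sketch of the first option, with lists as a regular (non-variadic) argument so that keyword arguments with defaults can follow it; the attribute assignments are illustrative, not the library's actual implementation:

```python
class InterleavingMethod:
    def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3):
        # With `lists` as a plain positional argument, defaults work as usual.
        self.max_length = max_length
        self.lists = lists
        self.sample_num = sample_num
        self.some_keyword_argument = some_keyword_argument

# The rankings are now passed as one list of lists:
im = InterleavingMethod(10, [[1, 2], [3, 4]], sample_num=100)
assert im.some_keyword_argument == 3
```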
The current Optimized._sample_ranking samples rankings until the number of sampled rankings becomes sample_num.
However, this sometimes never terminates: the number of possible distinct rankings can be less than sample_num.
So it must be "until rankings are sampled sample_num times", not "until the number of distinct sampled rankings reaches sample_num".
The solver used in Optimized or RoughlyOptimized sometimes returns probabilities whose sum is not 1.0. We can simply renormalize the probabilities before returning the result.
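The renormalization step, with illustrative numbers:

```python
# A solver output whose sum drifts from 1.0, e.g. due to numerical tolerance.
p = [0.30, 0.30, 0.35]          # sums to 0.95
total = sum(p)
p = [x / total for x in p]      # renormalize before returning
assert abs(sum(p) - 1.0) < 1e-12
```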
It seems that the choice of rankers differs between Probabilistic Interleaving and Probabilistic Multileaving.
In Probabilistic Interleaving, the authors said:
After constructing s1 and s2, l is generated similarly to the team draft method. However, instead of randomizing the ranker to contribute the next document per pair, one of the softmax functions is randomly selected at each rank. Doing so is mathematically convenient, as the only component that changes at each rank is the distribution over documents.
"one of the softmax functions is randomly selected at each rank" may mean that it randomly selects a ranker every time it decides a document to be put in the combined ranking.
In Probabilistic Multileaving, on the other hand, it randomly picks one of the rankers without replacement, as you explained.
The conclusion might be one of the following:
>python setup.py build
...
>python setup.py install
...
>python ./tests/test_methods.py
Traceback (most recent call last):
File "./tests/test_methods.py", line 1, in <module>
import interleaving as il
File "C:\Users\owner\AppData\Local\Programs\Python\Python35\lib\site-packages\interleaving-0.0.1-py3.5.egg\interleaving\__init__.py", line 6, in <module>
ImportError: cannot import name 'simulation'
I think the original PM paper is misleading.
When we compute P(a|l,q) by using P(l|a,q), it is necessary to use P(l|q) for normalization.
More precisely,
P(a|l,q) = P(l|a,q)P(a|q)/P(l|q)
Since P(l|q) = sum_a(P(l|a,q)P(a|q)) and P(a|q) = 1/(# rankers), the correct equation should be
P(a|l,q) = P(l|a,q)/sum_a(P(l|a, q))
Therefore, Algorithm 2 in the original paper must normalize p' before computing the expected outcome.
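A worked instance of the corrected equation: with a uniform prior P(a|q) = 1/(# rankers), the posterior is the likelihood normalized by its sum over rankers. The likelihood values below are made up:

```python
# P(l|a,q) for three rankers (illustrative numbers)
likelihoods = [0.2, 0.1, 0.1]

# P(a|l,q) = P(l|a,q) / sum_a(P(l|a,q)) under a uniform prior
posterior = [p / sum(likelihoods) for p in likelihoods]
assert abs(sum(posterior) - 1.0) < 1e-12
assert abs(posterior[0] - 0.5) < 1e-9  # 0.2 / 0.4
```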
Sections 4.2 and 4.3 in "A Probabilistic Method for Inferring Preferences from Clicks, CIKM2011" explain how to infer the preference based on user clicks.
They seem different from the current implementation.
Implement the simulation with LETOR data.
It may be convenient if we can input the (maximum) length of the resulting inter/multileaving.
I wonder whether the current implementation of the random team selection in Probabilistic.multileave is correct.
The current code randomizes the order of the team list, uses all of them in order, and randomizes it again once all the teams have been used.
In "Probabilistic Multileave for Online Retrieval Evaluation, SIGIR 2015", on the other hand, Line 6 in Algorithm 1 may suggest that it draws one team from the team list every time.
What do you think?
The current implementation generates an interleaved ranking when two rankings are given.
This may not be convenient when
A proposal here is to receive two or more rankings through __init__, compute the distribution in the initialization, and return a ranking when the interleave method is called (without any arguments).
Moreover, we can return a dict or a list of tuples that includes all the possible rankings with their probabilities.
If computing the probabilities or generating all the possible rankings is not feasible, we can simply receive two or more rankings in the initialization and generate a ranking when the interleave method is called.
The input argument was changed unintentionally. Need to be fixed.
Some tests conduct too many samplings, which makes the tests quite slow.
I tried speeding up Probabilistic. Can you try this?
I found a way to handle too-tight unbiasedness constraints: simply ignore them.
This is described in the last sentence of Section 4.2.3 of the original paper [Schuth+, CIKM 2014].
>python ./tests/test_balanced.py
Traceback (most recent call last):
File "./tests/test_balanced.py", line 4, in <module>
from .test_methods import TestMethods
SystemError: Parent module '' not loaded, cannot perform relative import
The current setup.py cannot install this module due to some errors regarding numpy and scipy.
/Users/kato/.local/share/virtualenvs/interleaving-M68ZqYhg/lib/python3.6/site-packages/pulp/solvers.py:71: DeprecationWarning: The SafeConfigParser class has been renamed to ConfigParser in Python 3.2. This alias will be removed in future versions. Use ConfigParser directly instead.
'os':operating_system, 'arch':arch})
/Users/kato/dev/interleaving/interleaving/probabilistic.py:215: DeprecationWarning: `logsumexp` is deprecated!
Importing `logsumexp` from scipy.misc is deprecated in scipy 1.0.0. Use `scipy.special.logsumexp` instead.
p_all = np.exp(p_log_all - misc.logsumexp(p_log_all))
There are two messages about deprecation.
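For the second message, the fix suggested by the warning itself is to import logsumexp from scipy.special; a minimal sketch (the probabilities are illustrative):

```python
from scipy.special import logsumexp  # replacement for the deprecated scipy.misc location
import numpy as np

p_log_all = np.log([0.2, 0.3, 0.5])
p_all = np.exp(p_log_all - logsumexp(p_log_all))
assert abs(p_all.sum() - 1.0) < 1e-12
```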