mpkato / interleaving
A python library for conducting interleaving, which compares two or multiple rankers based on observed user clicks by interleaving their results.
License: MIT License
Needless to say, documentation is necessary
Harrie Oosterhuis, and Maarten de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017.
In [Schuth+, CIKM 2014], the return value L of the prefix constraint sampling is a set. Therefore, each prefix should be distinct. However, the current implementation of Optimized::_sample_rankings (inherited from InterleavingMethod) seems to allow generating the same prefix multiple times. Shouldn't we modify the method?
Is it OK if the output of InterleavingMethod.evaluate is changed to a list where each element (i,j) indicates i won j?
Current:
Return one of the following tuples:
- (1, 0): Ranking 'a' won
- (0, 1): Ranking 'b' won
- (0, 0): Tie
Expected:
Return a list of pairs of ranker indices in which element (i, j) indicates i won j.
e.g. a result [(1, 0), (2, 1), (2, 0)] indicates that ranker 1 won against ranker 0, and ranker 2 won against both rankers 0 and 1.
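As a small sketch of how a caller might consume the proposed return value (the aggregation below is illustrative, not part of the proposal):

```python
# Hypothetical result of the proposed evaluate: each pair (i, j)
# means ranker i won against ranker j.
result = [(1, 0), (2, 1), (2, 0)]

# Count pairwise wins per ranker from the list of outcomes.
wins = {i: sum(1 for winner, _ in result if winner == i) for i in range(3)}
assert wins == {0: 0, 1: 1, 2: 2}
```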
I forgot to remove a call to print from the test code for Probabilistic.
This library is great. Would it be possible to make a pypi package out of it?
If you need help in doing so, I'd love to help out.
Some methods use sets to ensure the uniqueness of items and rely on the order of items in the set. Python's set order is not deterministic across interpreters, which makes it hard to control, especially during tests, as there is no seed for set that can be controlled. Using dict.fromkeys instead of set achieves the same uniqueness while guaranteeing a deterministic ordering of items.
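A minimal illustration of the difference:

```python
items = [2, 3, 1, 2, 3]

# set() deduplicates, but its iteration order is not guaranteed
unique_via_set = set(items)

# dict.fromkeys deduplicates too, and preserves first-seen order
# (dicts keep insertion order as of Python 3.7)
unique_ordered = list(dict.fromkeys(items))
assert unique_ordered == [2, 3, 1]
```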
The following methods are affected:
Suppose A = [1, 2, 3] and B = [2, 3, 1]. Only two rankings are obtained by TeamDraft: [1, 2, 3] and [2, 1, 3]. However, there should be four types of rankings, i.e.
The current implementation of Ranking.hash only takes into account the equality of the ranking, but not that of the team.
I have been thinking about how to pass the interleaving results to a production environment.
Probably, an instance of InterleavingMethod can have a method dump_rankings(self, file), which dumps sampled rankings into a file. Rankings, their probabilities, and some additional information (e.g. teams or credits) must be stored in the dump file. It is probably convenient to assign a unique ID to each ranking by using hash.
The format of the dump file should be something readable by other programs, such as JSON or XML.
A production system can load the dump file, show the rankings to users, observe user clicks, and record the number of impressions and the rank of clicks for each ranking.
InterleavingMethod.load_rankings(cls, file) can then load the rankings from the dump file and somehow obtain the number of impressions and the rank of clicks. InterleavingMethod.evaluate(cls, ranking, clicks) can be used for deciding the ranking preference based on the loaded information.
In summary, my proposal is to implement:
- InterleavingMethod.dump_rankings(self, file) for dumping sampled rankings
- InterleavingMethod.load_rankings(cls, file) for loading the dumped rankings
- Ranking.dumps(self) and Ranking.loads(self, s)
{
    ranking_id: {
        'probability': float,
        'ranking': {
            'ranking_list': list,
            'teams': { ranker_index: list }
        }
    }
}
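A minimal sketch of a round trip through the proposed format, assuming the schema above; using hash() of the ranking tuple as the unique ranking ID is just one hypothetical choice, and the sampled rankings here are made up:

```python
import io
import json

# Two hypothetical sampled rankings with their probabilities and teams.
sampled = [
    {"probability": 0.5, "ranking_list": [1, 2, 3], "teams": {0: [1, 3], 1: [2]}},
    {"probability": 0.5, "ranking_list": [2, 1, 3], "teams": {0: [1], 1: [2, 3]}},
]

dump = {
    # hash of the ranking tuple as a (hypothetical) unique ranking ID
    str(hash(tuple(r["ranking_list"]))): {
        "probability": r["probability"],
        "ranking": {
            "ranking_list": r["ranking_list"],
            # JSON object keys must be strings, so ranker indices are stringified
            "teams": {str(k): v for k, v in r["teams"].items()},
        },
    }
    for r in sampled
}

# Serialize and load back, as a production system would from a file.
buf = io.StringIO()
json.dump(dump, buf)
buf.seek(0)
loaded = json.load(buf)
assert sum(v["probability"] for v in loaded.values()) == 1.0
```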
Sorry for trying to change the arguments in InterleavingMethod again.
The current arguments of InterleavingMethod are:
def __init__(self, max_length, sample_num, *lists)
This could be improved, since keyword arguments with default values cannot be used together with a variable-length argument.
There are some options:
def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3)
(lists is no longer a variable-length argument)
def __init__(self, max_length, *lists, **kwargs)
(kwargs is a dict of keyword arguments)
e.g.
im = InterleavingMethod(10, [1, 2], [3, 4], hoge=3)
# kwargs = {"hoge": 3}
# If "foo" is missing in this dict, we may use a default value for "foo" by explicitly doing so:
if "foo" not in kwargs:
    kwargs["foo"] = FOO_DEFAULT_VALUE
Which do you prefer?
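For comparison, a sketch of the first option, with lists as a regular (non-variadic) argument so that keyword arguments with defaults can follow it; the attribute assignments are illustrative, not the library's actual implementation:

```python
class InterleavingMethod:
    def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3):
        # With `lists` as a plain positional argument, defaults work as usual.
        self.max_length = max_length
        self.lists = lists
        self.sample_num = sample_num
        self.some_keyword_argument = some_keyword_argument

# The rankings are now passed as one list of lists:
im = InterleavingMethod(10, [[1, 2], [3, 4]], sample_num=100)
assert im.some_keyword_argument == 3
```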
The current Optimized._sample_ranking samples rankings until the number of sampled rankings becomes sample_num.
However, this sometimes never terminates: the number of possible distinct rankings can be less than sample_num.
So it must be "until rankings are sampled sample_num times", not "until the number of distinct sampled rankings reaches sample_num".
The solver used in Optimized or RoughlyOptimized sometimes returns probabilities whose sum is not 1.0. We can simply renormalize the probabilities before returning the result.
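The renormalization step, with illustrative numbers:

```python
# A solver output whose sum drifts from 1.0, e.g. due to numerical tolerance.
p = [0.30, 0.30, 0.35]          # sums to 0.95
total = sum(p)
p = [x / total for x in p]      # renormalize before returning
assert abs(sum(p) - 1.0) < 1e-12
```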
It seems that the choice of rankers differs between Probabilistic Interleaving and Probabilistic Multileaving.
In Probabilistic Interleaving, the authors said:
After constructing s1 and s2, l is generated similarly to the team draft method. However, instead of randomizing the ranker to contribute the next document per pair, one of the softmax functions is randomly selected at each rank. Doing so is mathematically convenient, as the only component that changes at each rank is the distribution over documents.
"one of the softmax functions is randomly selected at each rank" may mean that it randomly selects a ranker every time it decides a document to be put in the combined ranking.
In Probabilistic Multileaving, on the other hand, it randomly picks one of the rankers without replacement, as you explained.
The conclusion might be one of the following:
>python setup.py build
...
>python setup.py install
...
>python ./tests/test_methods.py
Traceback (most recent call last):
File "./tests/test_methods.py", line 1, in <module>
import interleaving as il
File "C:\Users\owner\AppData\Local\Programs\Python\Python35\lib\site-packages\interleaving-0.0.1-py3.5.egg\interleaving\__init__.py", line 6, in <module>
ImportError: cannot import name 'simulation'
I think the original PM paper is misleading.
When we compute P(a|l,q) by using P(l|a,q), it is necessary to use P(l|q) for normalization.
More precisely,
P(a|l,q) = P(l|a,q)P(a|q)/P(l|q)
Since P(l|q) = sum_a(P(l|a,q)P(a|q)) and P(a|q) = 1/(# rankers), the correct equation should be
P(a|l,q) = P(l|a,q)/sum_a(P(l|a, q))
Therefore, Algorithm 2 in the original paper must normalize p' before computing the expected outcome.
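A worked instance of the corrected equation: with a uniform prior P(a|q) = 1/(# rankers), the posterior is the likelihood normalized by its sum over rankers. The likelihood values below are made up:

```python
# P(l|a,q) for three rankers (illustrative numbers)
likelihoods = [0.2, 0.1, 0.1]

# P(a|l,q) = P(l|a,q) / sum_a(P(l|a,q)) under a uniform prior
posterior = [p / sum(likelihoods) for p in likelihoods]
assert abs(sum(posterior) - 1.0) < 1e-12
assert abs(posterior[0] - 0.5) < 1e-9  # 0.2 / 0.4
```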
Sections 4.2 and 4.3 in "A Probabilistic Method for Inferring Preferences from Clicks, CIKM2011" explain how to infer the preference based on user clicks.
They seem different from the current implementation.
Implement the simulation with LETOR data.
It may be convenient if we can input the (maximum) length of the resulting inter/multileaving.
I wonder whether the current implementation of the random team selection in Probabilistic.multileave is correct.
The current code randomizes the order of the team list, uses all of them in order, and randomizes it again once all the teams have been used.
In "Probabilistic Multileave for Online Retrieval Evaluation, SIGIR 2015", on the other hand, Line 6 in Algorithm 1 may suggest that it draws one team from the team list every time.
What do you think?
The current implementation generates an interleaved ranking when two rankings are given.
This may not be convenient when
A proposal here is to receive two or more rankings through __init__, compute the distribution in the initialization, and return a ranking when the interleave method is called (without any arguments).
Moreover, we can return a dict or a list of tuples that includes all the possible rankings with their probabilities.
If computing the probabilities or generating all the possible rankings is not feasible, we can simply receive two or more rankings in the initialization and generate a ranking when the interleave method is called.
The input argument was changed unintentionally. Need to be fixed.
Some tests conduct too many samplings, which makes the tests quite slow.
I tried speeding up Probabilistic. Can you try this?
I found a way to handle too-tight unbiasedness constraints: simply ignore them.
This is described in the last sentence of Section 4.2.3 of the original paper [Schuth+, CIKM 2014].
>python ./tests/test_balanced.py
Traceback (most recent call last):
File "./tests/test_balanced.py", line 4, in <module>
from .test_methods import TestMethods
SystemError: Parent module '' not loaded, cannot perform relative import
The current setup.py cannot install this module due to some errors regarding numpy and scipy.
/Users/kato/.local/share/virtualenvs/interleaving-M68ZqYhg/lib/python3.6/site-packages/pulp/solvers.py:71: DeprecationWarning: The SafeConfigParser class has been renamed to ConfigParser in Python 3.2. This alias will be removed in future versions. Use ConfigParser directly instead.
'os':operating_system, 'arch':arch})
/Users/kato/dev/interleaving/interleaving/probabilistic.py:215: DeprecationWarning: `logsumexp` is deprecated!
Importing `logsumexp` from scipy.misc is deprecated in scipy 1.0.0. Use `scipy.special.logsumexp` instead.
p_all = np.exp(p_log_all - misc.logsumexp(p_log_all))
There are two messages about deprecation.
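For the second message, the fix suggested by the warning itself is to import logsumexp from scipy.special; a minimal sketch (the probabilities are illustrative):

```python
from scipy.special import logsumexp  # replacement for the deprecated scipy.misc location
import numpy as np

p_log_all = np.log([0.2, 0.3, 0.5])
p_all = np.exp(p_log_all - logsumexp(p_log_all))
assert abs(p_all.sum() - 1.0) < 1e-12
```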