hpclab / rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
Home Page: http://rankeval.isti.cnr.it/
License: Mozilla Public License 2.0
Hi,
coremltools has a fixed version dependency on six 1.10.0, which is not the latest, and which often causes dependency/version problems. See e.g. apple/coremltools#141
It would be nice to turn this into a soft dependency (meaning we should try to import the module on demand, and give out a friendly error message if the import fails), because coremltools does not provide a core feature, but is merely used to support catboost.
Currently, I am commenting out the dependency completely to be able to install rankeval, however this is not a sustainable solution...
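A minimal sketch of the soft-dependency idea described above. The helper name `optional_import` and the `load_catboost_model` entry point are hypothetical and are not part of rankeval; the point is only that the import happens on demand and a missing package yields a readable message.

```python
import importlib


def optional_import(module_name, feature):
    """Import a module on demand, with a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            "%s requires the optional package '%s'; install it with "
            "'pip install %s'." % (feature, module_name, module_name)
        ) from exc


def load_catboost_model(path):
    # Hypothetical entry point: coremltools is imported only when this
    # function is actually called, not when the package is imported.
    coremltools = optional_import("coremltools", "CatBoost model loading")
    return coremltools, path
```

With this pattern the package would no longer need to list coremltools as a hard requirement; only users who load CatBoost models would have to install it.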
Xcode on Mac OS X supports OpenMP, per this post: https://iscinumpy.gitlab.io/post/omp-on-high-sierra/. I tried it by setting the following environment variables:
export CC='clang -Xpreprocessor '
export CXX='clang++ -Xpreprocessor '
and I got pretty far, were it not for this error:
clang -Xpreprocessor -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/paulperry/anaconda3/include -arch x86_64 -I/Users/paulperry/anaconda3/include -arch x86_64 -I./rankeval/analysis -I/Users/paulperry/anaconda3/include/python3.6m -I/Users/paulperry/anaconda3/lib/python3.6/site-packages/numpy/core/include -c ./rankeval/analysis/_efficient_feature_impl.cpp -o build/temp.macosx-10.7-x86_64-3.6/./rankeval/analysis/_efficient_feature_impl.o -fopenmp -O3 -w -std=c++11
./rankeval/analysis/_efficient_feature_impl.cpp:90:27: error: no matching constructor for initialization of 'std::vector<TreeNode>'
std::vector<TreeNode> queue = { root };
^ ~~~~~~~~
This might fix it: https://stackoverflow.com/questions/26144299/compiler-error-when-constructing-a-vector-of-pairs
But before I go mucking with the code I wondered if anyone else has gone down this path and succeeded. Thx
I'm running into an error and summarized it in this toy example:
X = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
y = pd.DataFrame([0,0,1])
g = pd.Series([1,1,2])
dataset = Dataset(X, y, g, name='dataset')
mse = MSE()
feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-be667221664d> in <module>
5 mse = MSE()
6
----> 7 feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)
~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/feature.py in feature_importance(model, dataset, metric, normalize)
63
64 if isinstance(metric, RMSE) or isinstance(metric, MSE):
---> 65 feature_imp, feature_count = eff_feature_importance(model, dataset)
66 if isinstance(metric, RMSE):
67 feature_imp[0] = np.sqrt(feature_imp[0])
~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/_efficient_feature.pyx in rankeval.analysis._efficient_feature.eff_feature_importance()
TypeError: Cannot convert DataFrame to numpy.ndarray
http://rankeval.isti.cnr.it/docs/rankeval.model.html
under rankeval.model.proxy_LightGBM module:
... This is required because LtR datasets do not have missing values, but have feature values equals to zero (while LightGBM consider zero valued feature as missing values). ...
I do not think this is correct.
This is what the LightGBM documentation says:
LightGBM enables the missing value handle by default. Disable it by setting use_missing=false.
LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true.
When zero_as_missing=false (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros.
When installing from Github (either via setup.py, or pip3), I get the following error when trying to import anything from rankeval:
Traceback (most recent call last):
File "./ranking-eval.py", line 10, in <module>
from rankeval.analysis.effectiveness import query_class_performance
File "/usr/local/lib/python3.6/dist-packages/rankeval/__init__.py", line 11, in <module>
encoding='utf-8').read().strip()
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/dist-packages/rankeval/../VERSION'
Looking at /usr/local/lib/python3.6/dist-packages/, there is no VERSION file (and there should not be, of course). Note that in the source tree, the VERSION file is present in the parent directory of __init__.py.
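One possible way to make the version lookup tolerant of both layouts, as a sketch only. The function name `read_version` and the idea of also shipping a copy of VERSION inside the package directory are assumptions, not what rankeval does.

```python
import io
import os


def read_version(package_dir, default="unknown"):
    """Return the contents of a VERSION file, or a default if none is found."""
    candidates = [
        os.path.join(package_dir, "VERSION"),        # copy shipped inside the package (assumed)
        os.path.join(package_dir, "..", "VERSION"),  # source-tree layout, as in the traceback
    ]
    for path in candidates:
        if os.path.isfile(path):
            with io.open(path, encoding="utf-8") as handle:
                return handle.read().strip()
    # Falling back keeps `import` working even when the file was not installed.
    return default
```

In `__init__.py` this would be called with the directory of the module itself, so an installed copy without the file degrades to the default instead of raising FileNotFoundError.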
In loading a model I get:
rankeval_lgb_model = RTEnsemble('lgb.model', name="LightGBM model", format="LightGBM")
[...]
AssertionError: Decision Tree not supported. RankEval does not support categorical features and missing values.
Is there a way to work around this? Or will there be support for LGBM cat features and missing values?
msn_validation = dataset_container.validation_dataset
msn_test = dataset_container.test_dataset
Is this intentional?
Or were they supposed to be used somewhere?
For example, instead of the training set, NDCG@10 could be measured on validation or test:
y_pred = msn_lgbm_lmart_1Ktrees_model.score(msn_train)
print "%s: %.3f" % (ndcg_10, ndcg_10.eval(msn_train, y_pred)[0])
Do you have plans to support Python 3?
Would you be interested in pull requests that move the project closer to supporting Python 3?
I'm a bit lost here. Is there a toy example I can play with ?
from rankeval.analysis.effectiveness import model_performance
model_perf = model_performance(
datasets=[rank_train],
models=[rankeval_model],
metrics=[precision_5, recall_5, ndcg_5])
model_perf.to_dataframe()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-211-bb7252cc105d> in <module>()
4 datasets=[rank_train],
5 models=[rankeval_model],
----> 6 metrics=[ ndcg_5])
7
8 model_perf.to_dataframe()
~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/analysis/effectiveness.py in model_performance(datasets, models, metrics, cache)
57 for idx_metric, metric in enumerate(metrics):
58 data[idx_dataset][idx_model][idx_metric] = metric.eval(dataset,
---> 59 y_pred)[0]
60
61 performance = xr.DataArray(data,
~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/metrics/ndcg.py in eval(self, dataset, y_pred)
91 for rel_id, (qid, q_y, _) in enumerate(
92 self.query_iterator(dataset, dataset.y)):
---> 93 idcg_score[rel_id] = self.dcg.eval_per_query(q_y, q_y)
94
95 self._cache_idcg_score[self._current_dataset] = idcg_score
~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/metrics/dcg.py in eval_per_query(self, y, y_pred)
97 gain = y[idx_y_pred_sorted]
98 elif self.implementation == "exp":
---> 99 gain = np.exp2(y[idx_y_pred_sorted]) - 1.0
100
101 dcg = (gain / discount).sum()
~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
808 key = check_bool_indexer(self.index, key)
809
--> 810 return self._get_with(key)
811
812 def _get_with(self, key):
~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in _get_with(self, key)
840 if key_type == 'integer':
841 if self.index.is_integer() or self.index.is_floating():
--> 842 return self.loc[key]
843 else:
844 return self._get_values(key)
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1476
1477 maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478 return self._getitem_axis(maybe_callable, axis=axis)
1479
1480 def _is_scalar_access(self, key):
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1899 raise ValueError('Cannot index with multidimensional key')
1900
-> 1901 return self._getitem_iterable(key, axis=axis)
1902
1903 # nested tuple slicing
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1141 if labels.is_unique and Index(keyarr).is_unique:
1142 indexer = ax.get_indexer_for(key)
-> 1143 self._validate_read_indexer(key, indexer, axis)
1144
1145 d = {axis: [ax.reindex(keyarr)[0], indexer]}
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1204 raise KeyError(
1205 u"None of [{key}] are in the [{axis}]".format(
-> 1206 key=key, axis=self.obj._get_axis_name(axis)))
1207
1208 # we skip the warning on Categorical/Interval
KeyError: 'None of [3807 76\n4956 59\n3972 72\n635 73\n3664 20\nName: target, dtype: int64] are in the [index]'
Is this method really needed/worth keeping?
It is not used within the code base.
dataset.X = None should be sufficient if one really wants to save memory under specific circumstances, and should leave X for garbage collection (if there are no other references to it).
del self.X modifies something that was passed into the object.
Not sure whether this is really something that you want to do...
It is customary to normalise feature importances so that the largest one is 100. See Hastie et al. pg 368:
https://web.stanford.edu/~hastie/Papers/ESLII.pdf
I used
100*feature_analysis.T[:,0]/np.max(feature_analysis.T[:,0])
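The same rescaling written out as a small helper, in plain Python so it runs without numpy. The name `scale_to_100` is hypothetical, and the sketch assumes importances are non-negative.

```python
def scale_to_100(importances):
    """Rescale importances so that the largest value becomes 100."""
    largest = max(importances)
    if largest <= 0:
        # Assumes non-negative importances: with no positive value there is
        # nothing to scale against, so return zeros instead of dividing by zero.
        return [0.0 for _ in importances]
    return [100.0 * value / largest for value in importances]


print(scale_to_100([0.5, 2.0, 1.0]))  # [25.0, 100.0, 50.0]
```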
It would be great to have support for the (fairly new) TF-Ranking library by Google: https://github.com/tensorflow/ranking
Updating to the latest build I get:
!pip install rankeval
Collecting rankeval
Downloading https://files.pythonhosted.org/packages/83/cb/20aa574ce29312e8a7e2bc79fd1f9ebccebff8015866133073979d99b543/rankeval-0.7.2.tar.gz (8.6MB)
100% |████████████████████████████████| 8.6MB 4.1MB/s
Requirement already satisfied: numpy>=1.13 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.15.2)
Requirement already satisfied: scipy>=0.14.0 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.1.0)
Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.11.0)
Requirement already satisfied: pandas>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.23.4)
Requirement already satisfied: xarray>=0.9.5 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.10.9)
Requirement already satisfied: seaborn>=0.8 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.9.0)
Collecting coremltools>=0.8 (from rankeval)
Downloading https://files.pythonhosted.org/packages/b9/9d/7ec5a2480c6afce4fcb99de1650b7abfd1457b2ef1de5ce39bf7bee8a8ae/coremltools-2.1.0-cp36-none-manylinux1_x86_64.whl (2.7MB)
100% |████████████████████████████████| 2.7MB 5.9MB/s
Requirement already satisfied: matplotlib>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from rankeval) (2.2.3)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19.1->rankeval) (2.6.0)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19.1->rankeval) (2018.5)
Requirement already satisfied: protobuf>=3.1.0 in /opt/conda/lib/python3.6/site-packages (from coremltools>=0.8->rankeval) (3.6.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (2.2.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (1.0.1)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (0.10.0)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from protobuf>=3.1.0->coremltools>=0.8->rankeval) (39.1.0)
Building wheels for collected packages: rankeval
Running setup.py bdist_wheel for rankeval ... done
Stored in directory: /root/.cache/pip/wheels/61/96/a8/6d3b323ae7c815d647e20e949b19437a9198c375afcb9c6d31
Successfully built rankeval
mxnet 1.3.0.post0 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.15.2 which is incompatible.
kmeans-smote 0.1.0 has requirement imbalanced-learn<0.4,>=0.3.1, but you'll have imbalanced-learn 0.5.0.dev0 which is incompatible.
kmeans-smote 0.1.0 has requirement numpy<1.15,>=1.13, but you'll have numpy 1.15.2 which is incompatible.
fastai 0.7.0 has requirement torch<0.4, but you'll have torch 0.4.1 which is incompatible.
anaconda-client 1.7.2 has requirement python-dateutil>=2.6.1, but you'll have python-dateutil 2.6.0 which is incompatible.
imbalanced-learn 0.5.0.dev0 has requirement scikit-learn>=0.20, but you'll have scikit-learn 0.19.1 which is incompatible.
Installing collected packages: coremltools, rankeval
Successfully installed coremltools-2.1.0 rankeval-0.7.2
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
import rankeval
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-122-c66b5899c31b> in <module>
----> 1 import rankeval
/opt/conda/lib/python3.6/site-packages/rankeval/__init__.py in <module>
9
10 __version__ = io.open(os.path.join(cur_dir, '..', 'VERSION'),
---> 11 encoding='utf-8').read().strip()
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.6/site-packages/rankeval/../VERSION'
But this stuff doesn't work! ;-)
I've created an XGBRanker object with the sklearn API and tried to use the rankeval effectiveness analysis. It requires a score() function, which makes sense, but I don't see that XGBRanker has one, and I don't know if sklearn requires one. Thoughts?
from rankeval.analysis.effectiveness import model_performance
model_perf = model_performance(
datasets=[x_valid],
models=[model],
metrics=[precision_10, recall_10, ndcg_10])
model_perf.to_dataframe()
with the following output
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-707-76c7248fd353> in <module>()
4 datasets=[x_valid],
5 models=[model],
----> 6 metrics=[precision_10, recall_10, ndcg_10])
7 model_perf.to_dataframe()
~/rankeval/rankeval/rankeval/analysis/effectiveness.py in model_performance(datasets, models, metrics, cache)
54 for idx_dataset, dataset in enumerate(datasets):
55 for idx_model, model in enumerate(models):
---> 56 y_pred = model.score(dataset, detailed=False, cache=cache)
57 for idx_metric, metric in enumerate(metrics):
58 data[idx_dataset][idx_model][idx_metric] = metric.eval(dataset,
AttributeError: 'XGBRanker' object has no attribute 'score'
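A possible workaround, sketched under assumptions: wrap the ranker in a thin adapter that provides `score()` by delegating to `predict()`. The class name `ScoreAdapter` is hypothetical, the keyword arguments are copied from the call in the traceback, and the sketch assumes the dataset exposes its feature matrix as `dataset.X`. Whether `model_performance` needs further attributes on the model is not verified here.

```python
class ScoreAdapter(object):
    """Expose a predict()-style model through a score() method."""

    def __init__(self, model, name="wrapped model"):
        self.model = model
        self.name = name  # assumed to be useful for labelling results

    def score(self, dataset, detailed=False, cache=False):
        # `detailed` and `cache` mirror the call seen in the traceback;
        # this sketch accepts them but does not act on them.
        return self.model.predict(dataset.X)
```

Usage would then be `models=[ScoreAdapter(model)]` instead of `models=[model]`.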
I have a 92921 line input file in LibSVM/RankLib format, with 2155 query IDs:
> wc -l output.libsvm
92921 output.libsvm
> cut -f 2 -d ' ' output.libsvm | sort | uniq | wc -l
2155
dataset = Dataset.load("output.libsvm")
print(dataset)
print("n_queries: %s" % dataset.n_queries)
print("len(query_ids): %s" % len(dataset.query_ids))
Expected output (according to documentation):
Dataset (92921, 6)
n_queries: 2155
len(query_ids): 92921
Actual output:
Dataset (92921, 6)
n_queries: 2155
len(query_ids): 2156
Documentation snippet in https://github.com/hpclab/rankeval/blob/master/rankeval/dataset/dataset.py I am referring to:
query_ids : numpy 1d array of int
It is a ndarray of shape(nsamples,)
I think the reason it does not behave as expected can be found in the constructor:
if len(query_ids) == X.shape[0]:
I guess this is done for easier by-query access.
However, it does not match the documented behavior.
Either the implementation or the documentation should be changed.
Note also that the logic here does not work in the case n_queries == n_instances.
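To illustrate the two representations being discussed, here is a small sketch that converts between one query id per sample (length n_samples) and boundary offsets (length n_queries + 1, which matches the 2156 observed above). Both function names are hypothetical, and the sketch assumes that samples of the same query are stored contiguously.

```python
def offsets_from_sample_ids(sample_query_ids):
    """Collapse per-sample query ids into n_queries + 1 boundary offsets."""
    if not sample_query_ids:
        return [0]
    offsets = [0]
    for position in range(1, len(sample_query_ids)):
        # A new query starts wherever the id differs from the previous sample.
        if sample_query_ids[position] != sample_query_ids[position - 1]:
            offsets.append(position)
    offsets.append(len(sample_query_ids))
    return offsets


def sample_index_from_offsets(offsets):
    """Expand boundary offsets into one query index per sample."""
    per_sample = []
    for query_index in range(len(offsets) - 1):
        count = offsets[query_index + 1] - offsets[query_index]
        per_sample.extend([query_index] * count)
    return per_sample
```

Note that the expansion returns positional query indices (0, 1, 2, ...), not the original qid values, since those cannot be recovered from the offsets alone.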
Hi,
Nice demo yesterday. Perhaps you could add support for parsing Jforests model files. An example XML model file is included below. NB: Jforests doesn't use an XML parser to parse it, so there is no declaration at the top of the file.
Let me know if you have questions about how to interpret the XML format as a tree.
<Ensemble>
<Tree leaves="7" weight="1.0">
<SplitFeatures>16 10 6 9 6 9</SplitFeatures>
<LeftChildren>1 4 -3 -4 -1 -6</LeftChildren>
<RightChildren>-2 2 3 -5 5 -7</RightChildren>
<Thresholds>25050 31260 24147 32216 24147 29700</Thresholds>
<OriginalThresholds>0.7645119941402674 0.9540377220289324 0.7369529390221571 0.9832143075138864 0.7369529390221571 0.9064273942501373</OriginalThresholds>
<LeafOutputs>-2.0 1.5769671648438965 -0.262839281614885 1.9562399004573359 -2.0 1.9353035413956268 1.5149362903356052</LeafOutputs>
</Tree>
<Tree leaves="7" weight="1.0">
<SplitFeatures>0 0 4 15 4 0</SplitFeatures>
<LeftChildren>1 3 -3 4 -1 -6</LeftChildren>
<RightChildren>-2 2 -4 -5 5 -7</RightChildren>
<Thresholds>32138 28687 31497 0 32358 18957</Thresholds>
<OriginalThresholds>0.9808337911249466 0.8755112006348044 0.9612708295184033 0.0 0.987548068119392 0.5785570408350119</OriginalThresholds>
<LeafOutputs>-1.8400881506001467 1.810074728431422 -1.843558296666071 1.8091450294064726 -1.617818578255056 -1.9877378361172633 1.810248328082937</LeafOutputs>
</Tree>
</Ensemble>
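As a starting point, a sketch of reading this format with Python's standard XML parser, which accepts input without a declaration. The function names are hypothetical. The sketch only loads the arrays and checks their lengths; it deliberately does not decode the child encoding (the negative values in the sample look like leaf references, but the exact mapping is left as an open question here).

```python
import xml.etree.ElementTree as ET


def _numbers(tree_el, tag, cast):
    """Split the whitespace-separated text of a child element into numbers."""
    return [cast(token) for token in tree_el.findtext(tag).split()]


def parse_jforests_ensemble(xml_text):
    """Parse an <Ensemble> document into a list of per-tree dictionaries."""
    root = ET.fromstring(xml_text)
    trees = []
    for tree_el in root.findall("Tree"):
        tree = {
            "leaves": int(tree_el.get("leaves")),
            "weight": float(tree_el.get("weight")),
            "split_features": _numbers(tree_el, "SplitFeatures", int),
            "left_children": _numbers(tree_el, "LeftChildren", int),
            "right_children": _numbers(tree_el, "RightChildren", int),
            "thresholds": _numbers(tree_el, "Thresholds", int),
            "original_thresholds": _numbers(tree_el, "OriginalThresholds", float),
            "leaf_outputs": _numbers(tree_el, "LeafOutputs", float),
        }
        # A binary tree with n leaves has n - 1 internal (split) nodes.
        if len(tree["split_features"]) != tree["leaves"] - 1:
            raise ValueError("unexpected number of split nodes")
        if len(tree["leaf_outputs"]) != tree["leaves"]:
            raise ValueError("unexpected number of leaf outputs")
        trees.append(tree)
    return trees
```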
Currently, RankEval does not support loading an XGBoost model with holes in the node identifiers. This happens when an XGBoost model is trained and the pruning phase (of XGBoost) removes some nodes from the final model, leaving the identifiers associated with these nodes unused.
For example, the following tree is missing the node identifiers 9 and 10, which were removed at training time (the training log of this tree reports 2 pruned nodes):
booster[0]:
0:[f64<0.00485350005] yes=1,no=2,missing=1
1:[f133<0.5] yes=3,no=4,missing=3
3:[f109<22.529314] yes=7,no=8,missing=7
7:[f114<-26.4824524] yes=15,no=16,missing=15
15:leaf=-0.0236083996
16:leaf=-0.0109101823
8:[f94<401.155518] yes=17,no=18,missing=17
17:leaf=-0.00601068465
18:leaf=0.012365385
4:leaf=0.0293395668
2:[f133<0.5] yes=5,no=6,missing=5
5:[f107<11.7844181] yes=11,no=12,missing=11
11:[f116<-6.82519722] yes=23,no=24,missing=23
23:leaf=-0.00373374182
24:leaf=0.0127977673
12:[f17<28.2889977] yes=25,no=26,missing=25
25:leaf=0.00760455802
26:leaf=0.0176214874
6:[f134<7.5] yes=13,no=14,missing=13
13:[f131<69.5] yes=27,no=28,missing=27
27:leaf=0.0253543444
28:leaf=6.53095121e-05
14:leaf=0.0354171321
The solution is to consider the node identifier not as strictly incremental but with possible holes inside.
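A sketch of that idea, assuming the text dump format shown above: nodes are stored in a dictionary keyed by their original identifier, so gaps are harmless, and can then be renumbered densely if a contiguous layout is needed. The function names and regular expressions are illustrative and are not taken from the RankEval code base.

```python
import re

SPLIT_RE = re.compile(
    r"^\s*(\d+):\[(\w+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)\s*$")
LEAF_RE = re.compile(r"^\s*(\d+):leaf=(\S+)\s*$")


def parse_dumped_tree(dump_text):
    """Parse one tree of a text dump into a dict keyed by original node id."""
    nodes = {}
    for line in dump_text.splitlines():
        split = SPLIT_RE.match(line)
        leaf = LEAF_RE.match(line)
        if split:
            node_id, feature, threshold, yes, no, missing = split.groups()
            nodes[int(node_id)] = {
                "feature": feature, "threshold": float(threshold),
                "yes": int(yes), "no": int(no), "missing": int(missing)}
        elif leaf:
            nodes[int(leaf.group(1))] = {"leaf": float(leaf.group(2))}
        # Any other line (e.g. the "booster[0]:" header) is skipped.
    return nodes


def compact_ids(nodes):
    """Renumber nodes densely (0..n-1) and rewrite the child references."""
    remap = {old: new for new, old in enumerate(sorted(nodes))}
    compact = {}
    for old, node in nodes.items():
        node = dict(node)  # copy so the input is left untouched
        for key in ("yes", "no", "missing"):
            if key in node:
                node[key] = remap[node[key]]
        compact[remap[old]] = node
    return compact
```

Keying by identifier avoids any assumption that node i lives at position i, which is exactly what breaks when pruning leaves gaps.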