rankeval's People

Contributors: claudio-lucchese, cmacdonald, cristinadece, francomarianardini, strani, zenogantner

rankeval's Issues

turn coremltools dependency into a soft dependency?

Hi,

coremltools pins its dependency on six to exactly 1.10.0, which is not the latest version and often causes dependency/version problems. See e.g. apple/coremltools#141

It would be nice to turn this into a soft dependency (i.e., try to import the module on demand and emit a friendly error message if the import fails), because coremltools does not provide a core feature; it is merely used to support catboost.
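
A minimal sketch of what the lazy import could look like (the helper name and message are mine, not rankeval's actual API):

def _require_coremltools():
    # Import coremltools only when a CatBoost model is actually loaded,
    # so its pinned six==1.10.0 never affects users of other formats.
    try:
        import coremltools
    except ImportError:
        raise ImportError(
            "coremltools is needed only for loading CatBoost models; "
            "install it with 'pip install coremltools'.")
    return coremltools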

Currently, I am commenting out the dependency completely in order to install rankeval; however, this is not a sustainable solution...

support for building with Xcode?

Mac OS X Xcode supports OpenMP, per this post: https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ . I tried it by setting the following environment variables:

export CC='clang -Xpreprocessor '
export CXX='clang++ -Xpreprocessor '

and I got pretty far, until I hit this error:

clang -Xpreprocessor -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/paulperry/anaconda3/include -arch x86_64 -I/Users/paulperry/anaconda3/include -arch x86_64 -I./rankeval/analysis -I/Users/paulperry/anaconda3/include/python3.6m -I/Users/paulperry/anaconda3/lib/python3.6/site-packages/numpy/core/include -c ./rankeval/analysis/_efficient_feature_impl.cpp -o build/temp.macosx-10.7-x86_64-3.6/./rankeval/analysis/_efficient_feature_impl.o -fopenmp -O3 -w -std=c++11
    ./rankeval/analysis/_efficient_feature_impl.cpp:90:27: error: no matching constructor for initialization of 'std::vector<TreeNode>'
        std::vector<TreeNode> queue = { root };
                              ^       ~~~~~~~~

This might fix it: https://stackoverflow.com/questions/26144299/compiler-error-when-constructing-a-vector-of-pairs

But before I go mucking with the code, I wondered whether anyone else has gone down this path and succeeded. Thanks.

feature_importance error

I'm running into an error, which I have reduced to this toy example:

import pandas as pd
from rankeval.dataset import Dataset
from rankeval.metrics import MSE
from rankeval.analysis.feature import feature_importance

X = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = pd.DataFrame([0, 0, 1])
g = pd.Series([1, 1, 2])
dataset = Dataset(X, y, g, name='dataset')
mse = MSE()

feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-be667221664d> in <module>
      5 mse = MSE()
      6 
----> 7 feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/feature.py in feature_importance(model, dataset, metric, normalize)
     63 
     64     if isinstance(metric, RMSE) or isinstance(metric, MSE):
---> 65         feature_imp, feature_count = eff_feature_importance(model, dataset)
     66         if isinstance(metric, RMSE):
     67             feature_imp[0] = np.sqrt(feature_imp[0])

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/_efficient_feature.pyx in rankeval.analysis._efficient_feature.eff_feature_importance()

TypeError: Cannot convert DataFrame to numpy.ndarray
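
Incidentally, the toy example also passes the feature DataFrame as model, where (if I read the API correctly) an RTEnsemble is expected; but the TypeError itself suggests the Cython helper cannot digest pandas objects at all. A hedged, untested workaround is to hand rankeval plain numpy arrays instead:

import numpy as np
# Untested workaround sketch: rankeval's Cython internals appear to
# expect numpy arrays, so convert before building the Dataset.
X = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = np.asarray([0, 0, 1], dtype=np.float32)
g = np.asarray([1, 1, 2])
dataset = Dataset(X, y, g, name='dataset')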

inaccurate description/reasoning in LightGBM proxy docs

http://rankeval.isti.cnr.it/docs/rankeval.model.html
under rankeval.model.proxy_LightGBM module:

... This is required because LtR datasets do not have missing values, but have feature values equals to zero (while LightGBM consider zero valued feature as missing values). ...

I do not think this is correct.

This is what the LightGBM documentation says:

LightGBM enables the missing value handle by default. Disable it by setting use_missing=false.
LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true.
When zero_as_missing=false (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros.

Problem with VERSION file

When installing from Github (either via setup.py, or pip3), I get the following error when trying to import anything from rankeval:

Traceback (most recent call last):
  File "./ranking-eval.py", line 10, in <module>
    from rankeval.analysis.effectiveness import query_class_performance
  File "/usr/local/lib/python3.6/dist-packages/rankeval/__init__.py", line 11, in <module>
    encoding='utf-8').read().strip()
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/dist-packages/rankeval/../VERSION'

Looking at /usr/local/lib/python3.6/dist-packages/, there is no VERSION file (and there should not be, of course). Note that in the source tree, the VERSION file is present in the parent directory of __init__.py.
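
A possible fix (my sketch, not the project's actual change) would be to ship VERSION inside the package directory via package_data and read it with a graceful fallback:

# rankeval/__init__.py -- hedged sketch, assuming setup.py is changed to
# install VERSION next to this file via package_data.
import io
import os

cur_dir = os.path.dirname(os.path.abspath(__file__))
try:
    __version__ = io.open(os.path.join(cur_dir, 'VERSION'),
                          encoding='utf-8').read().strip()
except FileNotFoundError:
    # Installed copies that still lack the file remain importable.
    __version__ = 'unknown'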

support for categorical features and missing values in LGBM

When loading a model, I get:

rankeval_lgb_model = RTEnsemble('lgb.model', name="LightGBM model", format="LightGBM")
[...]
AssertionError: Decision Tree not supported. RankEval does not support categorical features and missing values.

Is there a way to work around this? Or will there be support for LGBM cat features and missing values?

validation and test datasets are not used in feature analysis notebook

msn_validation = dataset_container.validation_dataset
msn_test = dataset_container.test_dataset

Is this intentional?
Or were they supposed to be used somewhere?

For example, instead of the training set, NDCG@10 could be measured on the validation or test set:

y_pred = msn_lgbm_lmart_1Ktrees_model.score(msn_test)
print("%s: %.3f" % (ndcg_10, ndcg_10.eval(msn_test, y_pred)[0]))

support Python 3

Do you have plans to support Python 3?
Would you be interested in pull requests that move the project closer to supporting Python 3?

KeyError: 'None of [...] are in the [index]'

I'm a bit lost here. Is there a toy example I can play with?

from rankeval.analysis.effectiveness import model_performance

model_perf = model_performance(
    datasets=[rank_train], 
    models=[rankeval_model], 
    metrics=[precision_5, recall_5, ndcg_5])

model_perf.to_dataframe()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-211-bb7252cc105d> in <module>()
      4     datasets=[rank_train],
      5     models=[rankeval_model],
----> 6     metrics=[ ndcg_5])
      7 
      8 model_perf.to_dataframe()

~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/analysis/effectiveness.py in model_performance(datasets, models, metrics, cache)
     57             for idx_metric, metric in enumerate(metrics):
     58                 data[idx_dataset][idx_model][idx_metric] = metric.eval(dataset,
---> 59                                                                        y_pred)[0]
     60 
     61     performance = xr.DataArray(data,

~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/metrics/ndcg.py in eval(self, dataset, y_pred)
     91             for rel_id, (qid, q_y, _) in enumerate(
     92                     self.query_iterator(dataset, dataset.y)):
---> 93                 idcg_score[rel_id] = self.dcg.eval_per_query(q_y, q_y)
     94 
     95             self._cache_idcg_score[self._current_dataset] = idcg_score

~/anaconda3/lib/python3.6/site-packages/rankeval-0.7.2-py3.6-macosx-10.7-x86_64.egg/rankeval/metrics/dcg.py in eval_per_query(self, y, y_pred)
     97             gain = y[idx_y_pred_sorted]
     98         elif self.implementation == "exp":
---> 99             gain = np.exp2(y[idx_y_pred_sorted]) - 1.0
    100 
    101         dcg = (gain / discount).sum()

~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    808             key = check_bool_indexer(self.index, key)
    809 
--> 810         return self._get_with(key)
    811 
    812     def _get_with(self, key):

~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in _get_with(self, key)
    840             if key_type == 'integer':
    841                 if self.index.is_integer() or self.index.is_floating():
--> 842                     return self.loc[key]
    843                 else:
    844                     return self._get_values(key)

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1476 
   1477             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478             return self._getitem_axis(maybe_callable, axis=axis)
   1479 
   1480     def _is_scalar_access(self, key):

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1899                     raise ValueError('Cannot index with multidimensional key')
   1900 
-> 1901                 return self._getitem_iterable(key, axis=axis)
   1902 
   1903             # nested tuple slicing

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1141             if labels.is_unique and Index(keyarr).is_unique:
   1142                 indexer = ax.get_indexer_for(key)
-> 1143                 self._validate_read_indexer(key, indexer, axis)
   1144 
   1145                 d = {axis: [ax.reindex(keyarr)[0], indexer]}

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1204                 raise KeyError(
   1205                     u"None of [{key}] are in the [{axis}]".format(
-> 1206                         key=key, axis=self.obj._get_axis_name(axis)))
   1207 
   1208             # we skip the warning on Categorical/Interval

KeyError: 'None of [3807    76\n4956    59\n3972    72\n635     73\n3664    20\nName: target, dtype: int64] are in the [index]'
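
From the traceback, dataset.y appears to be a pandas Series named 'target': dcg.py indexes it positionally (y[idx_y_pred_sorted]), which pandas resolves as a label lookup and then cannot find. A hedged, untested guess at a workaround is to feed Dataset a plain numpy array for the labels:

import numpy as np
import pandas as pd

# Untested guess: rankeval's metrics index y positionally, which a
# pandas Series treats as label-based lookup; an ndarray does not.
target = pd.Series([76, 59, 72, 73, 20], name='target')  # toy stand-in
y = np.asarray(target, dtype=np.float32)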


dataset.clear_X(): What is it used for?

Is this method really needed/worth keeping?

It is not used within the code base.
dataset.X = None should be sufficient if one really wants to save memory under specific circumstances, and should leave X for garbage collection (if there are no other references to it).

del self.X modifies something that was passed into the object.
Not sure whether this is really something that you want to do...

No version file: /site-packages/rankeval/../VERSION

Updating to the latest build I get:

!pip install rankeval

Collecting rankeval
  Downloading https://files.pythonhosted.org/packages/83/cb/20aa574ce29312e8a7e2bc79fd1f9ebccebff8015866133073979d99b543/rankeval-0.7.2.tar.gz (8.6MB)
    100% |████████████████████████████████| 8.6MB 4.1MB/s 
Requirement already satisfied: numpy>=1.13 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.15.2)
Requirement already satisfied: scipy>=0.14.0 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.1.0)
Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.6/site-packages (from rankeval) (1.11.0)
Requirement already satisfied: pandas>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.23.4)
Requirement already satisfied: xarray>=0.9.5 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.10.9)
Requirement already satisfied: seaborn>=0.8 in /opt/conda/lib/python3.6/site-packages (from rankeval) (0.9.0)
Collecting coremltools>=0.8 (from rankeval)
  Downloading https://files.pythonhosted.org/packages/b9/9d/7ec5a2480c6afce4fcb99de1650b7abfd1457b2ef1de5ce39bf7bee8a8ae/coremltools-2.1.0-cp36-none-manylinux1_x86_64.whl (2.7MB)
    100% |████████████████████████████████| 2.7MB 5.9MB/s 
Requirement already satisfied: matplotlib>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from rankeval) (2.2.3)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19.1->rankeval) (2.6.0)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19.1->rankeval) (2018.5)
Requirement already satisfied: protobuf>=3.1.0 in /opt/conda/lib/python3.6/site-packages (from coremltools>=0.8->rankeval) (3.6.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (2.2.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (1.0.1)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.0.2->rankeval) (0.10.0)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from protobuf>=3.1.0->coremltools>=0.8->rankeval) (39.1.0)
Building wheels for collected packages: rankeval
  Running setup.py bdist_wheel for rankeval ... done
  Stored in directory: /root/.cache/pip/wheels/61/96/a8/6d3b323ae7c815d647e20e949b19437a9198c375afcb9c6d31
Successfully built rankeval
mxnet 1.3.0.post0 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.15.2 which is incompatible.
kmeans-smote 0.1.0 has requirement imbalanced-learn<0.4,>=0.3.1, but you'll have imbalanced-learn 0.5.0.dev0 which is incompatible.
kmeans-smote 0.1.0 has requirement numpy<1.15,>=1.13, but you'll have numpy 1.15.2 which is incompatible.
fastai 0.7.0 has requirement torch<0.4, but you'll have torch 0.4.1 which is incompatible.
anaconda-client 1.7.2 has requirement python-dateutil>=2.6.1, but you'll have python-dateutil 2.6.0 which is incompatible.
imbalanced-learn 0.5.0.dev0 has requirement scikit-learn>=0.20, but you'll have scikit-learn 0.19.1 which is incompatible.
Installing collected packages: coremltools, rankeval
Successfully installed coremltools-2.1.0 rankeval-0.7.2
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

import rankeval

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-122-c66b5899c31b> in <module>
----> 1 import rankeval

/opt/conda/lib/python3.6/site-packages/rankeval/__init__.py in <module>
      9 
     10 __version__ = io.open(os.path.join(cur_dir, '..', 'VERSION'),
---> 11                       encoding='utf-8').read().strip()

FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.6/site-packages/rankeval/../VERSION'

'XGBRanker' object has no attribute 'score'

This stuff just doesn't work! ;-)

I've created an XGBRanker object with the sklearn API and tried to use the rankeval effectiveness analysis. It requires a score() function, which makes sense, but I don't see that XGBRanker has one, and I don't know if sklearn requires one. Thoughts?

from rankeval.analysis.effectiveness import model_performance

model_perf = model_performance(
    datasets=[x_valid], 
    models=[model], 
    metrics=[precision_10, recall_10, ndcg_10])
model_perf.to_dataframe()

with the following output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-707-76c7248fd353> in <module>()
      4     datasets=[x_valid],
      5     models=[model],
----> 6     metrics=[precision_10, recall_10, ndcg_10])
      7 model_perf.to_dataframe()

~/rankeval/rankeval/rankeval/analysis/effectiveness.py in model_performance(datasets, models, metrics, cache)
     54     for idx_dataset, dataset in enumerate(datasets):
     55         for idx_model, model in enumerate(models):
---> 56             y_pred = model.score(dataset, detailed=False, cache=cache)
     57             for idx_metric, metric in enumerate(metrics):
     58                 data[idx_dataset][idx_model][idx_metric] = metric.eval(dataset,


AttributeError: 'XGBRanker' object has no attribute 'score'
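
If I read the API correctly, model_performance expects rankeval's own RTEnsemble wrapper, whose score(dataset, detailed=..., cache=...) signature has nothing to do with sklearn's score(X, y). A hedged sketch of a possible route, assuming rankeval's XGBoost proxy can read a text dump of the booster ('xgb.model.txt' is a placeholder name):

from rankeval.model import RTEnsemble

# Dump the trained XGBRanker's underlying booster to a text file, then
# re-load it through rankeval's XGBoost proxy instead of passing the
# sklearn object to model_performance directly.
model.get_booster().dump_model('xgb.model.txt')
rk_model = RTEnsemble('xgb.model.txt', name="XGBRanker model",
                      format="XGBoost")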

dataset documentation does not match behavior

I have a 92921 line input file in LibSVM/RankLib format, with 2155 query IDs:

> wc -l output.libsvm
92921 output.libsvm
> cut -f 2 -d ' ' output.libsvm | sort | uniq | wc -l
2155

dataset = Dataset.load("output.libsvm")
print(dataset)
print("n_queries: %s" % dataset.n_queries)
print("len(query_ids): %s" % len(dataset.query_ids))

Expected output (according to documentation):

Dataset (92921, 6)
n_queries: 2155
len(query_ids): 92921

Actual output:

Dataset (92921, 6)
n_queries: 2155
len(query_ids): 2156

The documentation snippet in https://github.com/hpclab/rankeval/blob/master/rankeval/dataset/dataset.py that I am referring to:

query_ids : numpy 1d array of int
        It is a ndarray of shape(nsamples,)

I think the reason it does not work as expected can be found in the constructor:

if len(query_ids) == X.shape[0]:

I guess this is done for easier by-query access.
However, it does not match the documented behavior.
Either the implementation or the documentation should be changed.

Note also that the logic here does not work in the case n_queries == n_instances.
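
A tiny illustration of that ambiguity (my own example, mirroring the check quoted above):

import numpy as np

# With exactly one instance per query, a per-sample qid array and a
# per-query qid array have the same length, so the constructor's
# len(query_ids) == X.shape[0] heuristic cannot tell them apart.
X = np.zeros((3, 6), dtype=np.float32)
query_ids = np.array([1, 2, 3])  # per-sample or per-query? Both fit.
assert len(query_ids) == X.shape[0]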

Input proxy for Jforests.

Hi,

Nice demo yesterday. Perhaps you could add support for parsing Jforests model files. Example XML model file included below. NB: Jforests doesn't use an XML parser to parse it, so there is no declaration at the top of the file.

Let me know if you have questions about interpreting how to turn the XML format into a tree.

<Ensemble>
    <Tree leaves="7" weight="1.0">
        <SplitFeatures>16 10 6 9 6 9</SplitFeatures>
        <LeftChildren>1 4 -3 -4 -1 -6</LeftChildren>
        <RightChildren>-2 2 3 -5 5 -7</RightChildren>
        <Thresholds>25050 31260 24147 32216 24147 29700</Thresholds>
        <OriginalThresholds>0.7645119941402674 0.9540377220289324 0.7369529390221571 0.9832143075138864 0.7369529390221571 0.9064273942501373</OriginalThresholds>
        <LeafOutputs>-2.0 1.5769671648438965 -0.262839281614885 1.9562399004573359 -2.0 1.9353035413956268 1.5149362903356052</LeafOutputs>
    </Tree>
    <Tree leaves="7" weight="1.0">
        <SplitFeatures>0 0 4 15 4 0</SplitFeatures>
        <LeftChildren>1 3 -3 4 -1 -6</LeftChildren>
        <RightChildren>-2 2 -4 -5 5 -7</RightChildren>
        <Thresholds>32138 28687 31497 0 32358 18957</Thresholds>
        <OriginalThresholds>0.9808337911249466 0.8755112006348044 0.9612708295184033 0.0 0.987548068119392 0.5785570408350119</OriginalThresholds>
        <LeafOutputs>-1.8400881506001467 1.810074728431422 -1.843558296666071 1.8091450294064726 -1.617818578255056 -1.9877378361172633 1.810248328082937</LeafOutputs>
    </Tree>
</Ensemble>
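
In case it helps, here is a minimal parsing sketch (my own code, not rankeval's; I am assuming the usual convention that a negative child id -k denotes leaf k-1 of LeafOutputs):

import xml.etree.ElementTree as ET

# Jforests omits the XML declaration, but ElementTree parses the file anyway.
root = ET.parse('jforests.model.xml').getroot()
for tree in root.iter('Tree'):
    weight = float(tree.get('weight'))
    features = [int(v) for v in tree.find('SplitFeatures').text.split()]
    left = [int(v) for v in tree.find('LeftChildren').text.split()]
    right = [int(v) for v in tree.find('RightChildren').text.split()]
    thresholds = [float(v)
                  for v in tree.find('OriginalThresholds').text.split()]
    leaf_outputs = [float(v) for v in tree.find('LeafOutputs').text.split()]
    print(weight, len(features), 'splits,', len(leaf_outputs), 'leaves')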

XGBoost loader fails when the training prunes out some nodes

Currently, RankEval does not support loading an XGBoost model with holes in the node identifiers. This happens when XGBoost's pruning phase removes some nodes from the final model, leaving gaps in the identifier sequence.

E.g., the following tree is missing node identifiers 9 and 10, which were removed at training time (the training log reports 2 pruned nodes for this tree):

booster[0]:
0:[f64<0.00485350005] yes=1,no=2,missing=1
        1:[f133<0.5] yes=3,no=4,missing=3
                3:[f109<22.529314] yes=7,no=8,missing=7
                        7:[f114<-26.4824524] yes=15,no=16,missing=15
                                15:leaf=-0.0236083996
                                16:leaf=-0.0109101823
                        8:[f94<401.155518] yes=17,no=18,missing=17
                                17:leaf=-0.00601068465
                                18:leaf=0.012365385
                4:leaf=0.0293395668
        2:[f133<0.5] yes=5,no=6,missing=5
                5:[f107<11.7844181] yes=11,no=12,missing=11
                        11:[f116<-6.82519722] yes=23,no=24,missing=23
                                23:leaf=-0.00373374182
                                24:leaf=0.0127977673
                        12:[f17<28.2889977] yes=25,no=26,missing=25
                                25:leaf=0.00760455802
                                26:leaf=0.0176214874
                6:[f134<7.5] yes=13,no=14,missing=13
                        13:[f131<69.5] yes=27,no=28,missing=27
                                27:leaf=0.0253543444
                                28:leaf=6.53095121e-05
                        14:leaf=0.0354171321

The solution is to treat node identifiers as possibly non-contiguous rather than strictly incremental.
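
A hedged sketch of what that could look like (my own code, not the actual patch): collect the ids that actually occur in the dump and remap them onto dense indices before building the node arrays.

# Node ids present in the example tree above; 9 and 10 were pruned away.
node_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18,
            23, 24, 25, 26, 27, 28]
# Child references such as yes=11 are then translated through dense[11]
# instead of being used as array indices directly.
dense = {nid: i for i, nid in enumerate(sorted(node_ids))}
assert dense[11] == 9 and dense[28] == 22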
