andosa / treeinterpreter

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

treeinterpreter's Issues

Error when predicting with a RandomForest that its first trees were trained only on some of the data classes (batched training)

Happens when training on batched data with warm_start = True and the data is unbalanced.

Error:

/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
    test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
    return _predict_forest(model, X, joint_contribution=joint_contribution)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
    return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
  File "<__array_function__ internals>", line 6, in mean
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
    out=out, **kwargs)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
    arr = asanyarray(a)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (2,1) into shape (2)

Reproduction:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd

# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)

# data of chunk1
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})
# data of chunk2
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})


# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])
# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])

# test
test_data = chunk2_df.drop(['label'], axis='columns')
# regular predict
rf.predict_proba(test_data)
# tree interpreter predict
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
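
One way to read the failure: the first tree saw only class 0, so its per-tree arrays have one class column while the later tree has two, and np.mean cannot stack them. The sketch below is not treeinterpreter code. align_tree_proba is a hypothetical helper, and it assumes that each tree's class labels are available and that the forest's class labels are sorted.

```python
import numpy as np

def align_tree_proba(proba, tree_classes, forest_classes):
    """Scatter per-tree class columns into the forest's full class layout.

    Hypothetical helper, not part of treeinterpreter. Classes the tree
    never saw get probability 0. forest_classes must be sorted.
    """
    proba = np.asarray(proba, dtype=float)
    forest_classes = np.asarray(forest_classes)
    # Column position of each tree class within the sorted forest classes.
    cols = np.searchsorted(forest_classes, tree_classes)
    full = np.zeros((proba.shape[0], forest_classes.shape[0]))
    full[:, cols] = proba
    return full

# Tree 1 saw only class 0; tree 2 saw classes 0 and 1 (made-up numbers).
tree1 = align_tree_proba([[1.0], [1.0]], [0], [0, 1])
tree2 = align_tree_proba([[0.8, 0.2], [0.3, 0.7]], [0, 1], [0, 1])
forest_mean = np.mean([tree1, tree2], axis=0)  # both arrays are now (2, 2)
```

The same padding idea would have to be applied to the bias and contribution arrays before averaging.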

New release

Would it be possible for you to release a new version on pypi with the latest changes from the repository included?
Version 0.2.2 is from December 2018 and since then a couple of merge requests have been merged into the master branch which would be nice to have in the pip installable version.

Thanks in advance

Minimum python version?

Is there a minimum Python version recommended for this package? Would be useful to know it for future conda-forge builds, see for instance conda-forge/treeinterpreter-feedstock#1 (comment)

UnboundLocalError of line_shape variable while using ExtraTreeClassifier

ISSUE
The following error was encountered while running the ti._predict_forest function for an ExtraTreeClassifier model. The same call works perfectly for RandomForestClassifier.

POTENTIAL CAUSE
The following code block initializes the line_shape variable.

treeinterpreter/treeinterpreter/treeinterpreter.py

Lines 56 to 66 in a52a6d7

    
    # reshape if squeezed into a single float
    if len(values.shape) == 0:
        values = np.array([values])
    if isinstance(model, DecisionTreeRegressor):
        biases = np.full(X.shape[0], values[paths[0][0]])
        line_shape = X.shape[1]
    elif isinstance(model, DecisionTreeClassifier):
        # scikit stores category counts, we turn them into probabilities
        normalizer = values.sum(axis=1)[:, np.newaxis]
        normalizer[normalizer == 0.0] = 1.0
        values /= normalizer

I see there are blocks specific to DecisionTreeRegressor and DecisionTreeClassifier while initializing the line_shape variable.

Do we need to add something specific to Extra Tree Classifier?
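
The UnboundLocalError pattern itself can be shown without scikit-learn. In the sketch below, Regressor, Classifier and OtherModel are stand-in classes, line_shape_for is a hypothetical function, and the classifier branch's value is an assumption (the excerpt above cuts off before that assignment). An explicit else turns the silent fall-through into a clear error.

```python
class Regressor:
    """Stand-in for a regressor model type."""

class Classifier:
    """Stand-in for a classifier model type."""

class OtherModel:
    """Stand-in for a model type that matches neither branch."""

def line_shape_for(model, n_features, n_classes):
    # Without the final else, a model matching neither branch would reach
    # the return with line_shape never assigned (UnboundLocalError).
    if isinstance(model, Regressor):
        line_shape = n_features
    elif isinstance(model, Classifier):
        # Assumed shape for classifiers: one column per class.
        line_shape = (n_features, n_classes)
    else:
        raise ValueError(f"unsupported model type: {type(model).__name__}")
    return line_shape
```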

Tests?

How should tests be run?

I get 3 test failures using python setup.py test or pytest using Python 3.6.

======================================================================
FAIL: test_forest_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 75, in test_forest_regressor
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

======================================================================
FAIL: test_forest_regressor_joint (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 90, in test_forest_regressor_joint
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

======================================================================
FAIL: test_tree_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 40, in test_tree_regressor
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

Class order

Does anyone know if the class ordering is in ascending order (à la sklearn)?
E.g., the ordering of the columns of say contributions:
prediction, bias, contributions = ti.predict(clf, X_test)
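
For reference, scikit-learn classifiers expose their label order as clf.classes_, which, as far as I know, comes from np.unique and is therefore sorted ascending. The sketch below only shows that NumPy behavior. Whether the class axis of contributions follows the same order is an assumption worth checking against clf.classes_ on a real model.

```python
import numpy as np

# Made-up labels, deliberately unsorted.
y = np.array(["dog", "cat", "bird", "cat", "dog"])

# np.unique returns the sorted distinct labels plus each sample's index
# into that sorted array.
classes, y_encoded = np.unique(y, return_inverse=True)
```

Here classes comes back in ascending order, and y_encoded gives the column index each label would map to.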

AxisError: axis 1 is out of bounds for array of dimension 1 while using RandomForestClassifier

Here is the code to replicate the issue:
import numpy as np
import sklearn.ensemble
from treeinterpreter import treeinterpreter as ti

inp = np.array([[-1, -1],
                [-2, -1],
                [1, 1],
                [2, 1]])

model = sklearn.ensemble.RandomForestClassifier().fit(inp, [1, 1, 0, 0])
predictions, bias, contributions = ti.predict(model, inp)

Error occurs at line 63 in treeinterpreter.py in _predict_tree(model, X, joint_contribution)

This is because not every tree in model.estimators_ returns its tree_ values as a 2D array.
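
A small NumPy sketch of how a 1D array can arise. It assumes tree_.value has shape (node_count, n_outputs, n_classes) and that the installed version squeezes every unit axis. A tree with a single node is one plausible trigger, for example when a bootstrap sample contains only one class. The numbers are made up.

```python
import numpy as np

# Value array of a tree with one node, one output and two classes.
value = np.array([[[2.0, 0.0]]])

# Squeezing every unit axis also drops the node axis, leaving a 1D array,
# so a later .sum(axis=1) has no axis 1 to reduce over.
squeezed_all = value.squeeze()

# Squeezing only the output axis keeps the (node_count, n_classes) layout.
squeezed_axis1 = value.squeeze(axis=1)
normalizer = squeezed_axis1.sum(axis=1)[:, np.newaxis]
```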

'RandomForestRegressor' object has no attribute 'n_outputs_' needed for the predict function

I'm trying to test the predict function but it raises the following error:

AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'.

Yet it seems that actually it has when checking the sklearn webpage.

Here is the full error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-101-56bf611044fd> in <module>
      1 # checking the performance of training data itselfz
----> 2 prediction, bias, contributions = ti.predict(rf, numpy_df_train)
      3 idx = pd.date_range(train_start_date, train_end_date)
      4 predictions_df1 = pd.DataFrame(data=prediction[0:], index = idx, columns=['prices'])
      5 predictions_df1.plot()

C:\ProgramData\Anaconda3\lib\site-packages\treeinterpreter\treeinterpreter.py in predict(model, X, joint_contribution)
    193     """
    194     # Only single out response variable supported,
--> 195     if model.n_outputs_ > 1:
    196         raise ValueError("Multilabel classification trees not supported")
    197 

AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'

Performance?

Hi,

I'm working on a project where treeinterpreter is taking ~2 minutes per prediction.

I suppose this is because the implementation is in pure Python.

Do you know if anyone has looked at porting this to C (or cython or whatever) to make it go faster?

Jim

Python 3.6 + sklearn 0.24

Hello.
I am receiving this error with

treeinterpreter version '0.1.0' and sklearn 0.24.0

just after doing
from treeinterpreter import treeinterpreter as ti

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-729-679b5145dca2> in <module>
----> 1 from treeinterpreter import treeinterpreter as ti
      2 from sklearn.tree import DecisionTreeRegressor
      3 from sklearn.ensemble import RandomForestRegressor

~/miniconda3/envs/py36_ds_liv/lib/python3.6/site-packages/treeinterpreter/treeinterpreter.py in <module>
      3 import sklearn
      4 
----> 5 from sklearn.ensemble.forest import ForestClassifier, ForestRegressor
      6 from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, _tree
      7 from distutils.version import LooseVersion

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
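
A generic fallback-import sketch for this kind of module rename. import_first is a hypothetical helper, and the private path sklearn.ensemble._forest in the usage comment is an assumption about where newer scikit-learn versions keep these classes.

```python
import importlib

def import_first(module_names):
    """Return the first module in module_names that can be imported."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            last_error = exc
    raise ImportError(f"none of {module_names!r} could be imported") from last_error

# Possible usage (assumed module paths, newest first):
# forest = import_first(["sklearn.ensemble._forest", "sklearn.ensemble.forest"])
# ForestClassifier = forest.ForestClassifier
# ForestRegressor = forest.ForestRegressor
```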

Home page use example code error.

Hi team,
I'm testing the treeinterpreter module on a random forest using the example code from the home page. The last step compares the random forest prediction with the interpreter result, bias + np.sum(contributions, axis=1). I think using rf.predict to get the prediction is not right. The correct way should be to use rf.predict_proba() to get the probabilities, right? sklearn applies argmax to the probabilities to get the index used as the predicted label.
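
The arithmetic can be sketched with made-up numbers. The shapes are an assumption for a classifier: bias is (n_samples, n_classes) and contributions is (n_samples, n_features, n_classes). The reconstructed values are per-class probabilities, so they are comparable to predict_proba output, and argmax over them gives the index of the predicted label.

```python
import numpy as np

# Illustrative values only: 2 samples, 3 features, 2 classes.
bias = np.array([[0.6, 0.4],
                 [0.6, 0.4]])
contributions = np.array([
    [[0.1, -0.1], [0.2, -0.2], [0.0, 0.0]],
    [[-0.3, 0.3], [0.0, 0.0], [-0.1, 0.1]],
])

# Summing over the feature axis and adding the bias gives class probabilities.
reconstructed = bias + np.sum(contributions, axis=1)

# The hard label is the index of the largest probability in each row.
labels = np.argmax(reconstructed, axis=1)
```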

ValueError: 'axis' entry is out of bounds

In reference to:

https://github.com/andosa/treeinterpreter/blob/master/treeinterpreter/treeinterpreter.py#L60

Taking the sum along the first axis fails, if len(values.shape) == 1, a case which I just stumbled across.

Feature Importance per Class

Hi Ando,

Thank you for this wonderful work. It looks like this package should be able to generate feature importance for each class. I can run prediction, bias, contributions = ti.predict(rf, instance) for each instance and average the contributions for each feature. Is this the right thing to do?

If this really can generate feature importance for each class, then I suggest adding this point to the readme, because it took me a while to find. I am also wondering why feature importance per class is not a popular topic for rf, at least on Google. Is that because rf is not good at this job?
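
The averaging described above can be sketched as follows. The contributions array is made up and is assumed to have shape (n_samples, n_features, n_classes). This shows only the arithmetic, not an endorsement of it as an importance measure.

```python
import numpy as np

# Made-up contributions: 2 samples, 2 features, 2 classes.
contributions = np.array([
    [[0.25, -0.25], [0.125, -0.125]],
    [[-0.25, 0.25], [0.375, -0.375]],
])

# Averaging over the sample axis leaves one value per (feature, class) pair.
per_class_mean = contributions.mean(axis=0)
```

Row i, column k of per_class_mean is the average contribution of feature i toward class k over the samples.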

Thank you

Getting "ValueError: Wrong model type" when I try to run predict function

I ran the following code, mostly cut-and-pasted from the unit tests, in a python 3 jupyter notebook:

import numpy as np
import unittest

from sklearn.datasets import load_boston, load_iris
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              ExtraTreesClassifier, ExtraTreesRegressor,)
from sklearn.tree import (DecisionTreeClassifier, DecisionTreeRegressor,
                          ExtraTreeClassifier, ExtraTreeRegressor,)

from treeinterpreter import treeinterpreter

boston = load_boston()
iris = load_iris()

TreeRegressor = ExtraTreeRegressor
X = boston.data
Y = boston.target
testX = X[int(len(X)/2):]

#Predict for decision tree
dt = TreeRegressor()
dt.fit(X[:int(len(X)/2)], Y[:int(len(X)/2)])

base_prediction = dt.predict(testX)
pred, bias, contrib = treeinterpreter.predict(dt, testX)

This gave me the following error:

~/.local/lib/python3.5/site-packages/treeinterpreter/treeinterpreter.py in predict(model, X, joint_contribution)
    204     else:
    205         raise ValueError("Wrong model type. Base learner needs to be \
--> 206             DecisionTreeClassifier or DecisionTreeRegressor.")

ValueError: Wrong model type. Base learner needs to be             DecisionTreeClassifier or DecisionTreeRegressor.

Note that running isinstance(dt, DecisionTreeRegressor) returns True. Unfortunately, I don't know enough about python to understand why isinstance would give different answers when it is run in my notebook vs in treeinterpreter.predict.

Do you guys happen to know how to fix this?

Does this package work with xgboost model?

Curious here.

How is joint contribution calculated over deep tree? How to set max number of elements in joint set, i.e. to doublets or triplets, over a deep tree?

I was able to calculate joint contributions over a random forest model trained with max depth = 30 as a binary classifier. My understanding is that, if the path of a particular instance is 1->2->3->4->5->...->30 in the tree, the joint contributions should be calculated for (1,2), (2,3), (3,4)..., (1,2,3), (2,3,4), (3,4,5)... (1,2,3,4), (2,3,4,5), etc., i.e. the number of joint contributions scales as fibonacci_n where n = depth of tree / depth of path.

For a single instance, I get thousands of joint contributions, but the contribution groups are not what I expect, i.e. the sets of nodes are so disjointed, it doesn't seem to follow one path through the tree, so it can't seem to pertain to only one instance.

So questions are:

How is joint contribution calculated for a deep tree?
Is there a way to specify a max number of nodes per joint contribution?
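
One plausible reading, offered as an assumption rather than a confirmed description of the library: the keys are sets of features accumulated along the decision path, not windows of nodes. Under that reading each path step credits the value change to the set of features seen so far, so a single tree yields at most one key per step, and the large number of keys comes from combining many trees. joint_contributions below is a hypothetical function with made-up inputs.

```python
def joint_contributions(path_features, path_values):
    """Credit each value change on a path to the features seen so far.

    path_features[i] is the feature split on at the i-th node of the path
    (root first). path_values holds the node values along the same path,
    root first and leaf last, so it has one more entry than path_features.
    """
    seen = set()
    result = {}
    steps = zip(path_features, path_values, path_values[1:])
    for feature, before, after in steps:
        seen.add(feature)
        key = tuple(sorted(seen))
        result[key] = result.get(key, 0.0) + (after - before)
    return result

# Path that splits on features 3, 1, 3, 0 (made-up node values).
joint = joint_contributions([3, 1, 3, 0], [0.5, 0.75, 0.25, 0.5, 1.0])
```

In this sketch the contributions sum to the leaf value minus the root value, and a key such as (1, 3) collects every step taken after both features have appeared.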

iForest support

Hello,

I want to use an iForest model to detect anomalous (isolated) samples. There are a lot of features (3K+), and we want to get the top N contributing features. Do you have a plan to support iForest? Or do you have any suggestions for solving this?

Thanks in advance!

Support for H2O Random Forest Model

Hi,

I am currently using an H2O random forest model. Could you add support for interpreting that model, or suggest how to tweak your code to support it? In fact, if your logic can be made generic (rather than scikit-learn specific), I think this library would really add value in many different projects.

Thanks,
Vik

Application to GradientBoostingTree class

Hi Ando,

Thanks for this wonderful package, makes my life a lot easier!

treeinterpreter does not seem to work for the gradient boosting tree classes. It fails because the model does not expose n_outputs_, which your code checks to ensure the model has a univariate output.

This might be a quick fix. Would it be possible to do this?

I used the below code to test it.

Thanks,
Roel

----code----

import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from treeinterpreter import treeinterpreter as ti

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
# loss is left at its default (least squares), which is what loss='ls' selected
gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0).fit(X_train, y_train)
print(mean_squared_error(y_test, gbt.predict(X_test)))

print(X.shape)

instances = X[300:309, :]
# predict expects a 2D array, so slice to keep the row dimension
print("Instance 0 prediction:", gbt.predict(instances[0:1]))
print("Instance 1 prediction:", gbt.predict(instances[1:2]))

# this is the call that fails
prediction, bias, contributions = ti.predict(gbt, instances)

Support for pipeline objects

Currently, sklearn Pipeline objects cannot be used directly with treeinterpreter.

A possible workaround is provided here.
Would it be possible to support them natively?

Feature contribution

Thank you for your code.
I am wondering whether this code can predict every sample from a single set of feature weights. That is, is it possible to extract one set of feature contributions and use it to predict any sample? In your solution, each sample has its own set of feature contributions used to predict that sample.
Thank you.

Adding a CITATION file.

Hello and thank you for your package!

I am a PhD student and I am currently using your library for one of my research projects. I was wondering if you would be interested in creating a citation file so the library can be cited. Here is a very good blog post on why and how citing software is good practice: https://www.software.ac.uk/blog/2016-10-06-encouraging-citation-software-introducing-citation-files

Let me know if you would like me to open a pull request 👍

Should aggregation be the sum over absolute contributions?

In line 6 of aggregated_contribution the individual contributions are summed.

Shouldn't the contributions be converted to their absolute values first and then summed?

For example, a joint feature, that has both high negative and low negative contributions would get a low aggregated joint contribution as of the current implementation?
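
The difference between the two aggregation choices can be sketched as follows. aggregate is a hypothetical function, and the input format (one dict per instance, keyed by feature tuple) is an assumption. The numbers are made up.

```python
from collections import defaultdict

def aggregate(contribution_maps, use_abs=False):
    """Sum contributions per feature set, optionally by absolute value."""
    totals = defaultdict(float)
    for contrib in contribution_maps:
        for features, value in contrib.items():
            totals[features] += abs(value) if use_abs else value
    return dict(totals)

# Two instances: the joint feature (0, 1) pushes in opposite directions.
maps = [
    {(0, 1): 0.5, (2,): 0.125},
    {(0, 1): -0.5, (2,): 0.25},
]
signed = aggregate(maps)
absolute = aggregate(maps, use_abs=True)
```

With the signed sum the joint feature (0, 1) cancels to zero. With the absolute sum it ranks highest.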

joint_contribution attribute is not documented

It's not clear what joint_contribution=True/False does.

Most recent version not installed with pip

I still get the sklearn deprecation warning about sklearn.ensemble.forest, and when I ran pip install --upgrade treeinterpreter, I got the message that I was already up-to-date.
