
treeinterpreter's People

Contributors

andosa, iamdecode, juliangilbey, marctollin, micahjsmith, mickey946, saucecat


treeinterpreter's Issues

Error when predicting with a RandomForest whose first trees were trained on only some of the data classes (batched training)

This happens when training on batched data with warm_start=True and the data is unbalanced.

Error:

/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
    test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
    return _predict_forest(model, X, joint_contribution=joint_contribution)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
    return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
  File "<__array_function__ internals>", line 6, in mean
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
    out=out, **kwargs)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
    arr = asanyarray(a)
  File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (2,1) into shape (2)

Reproduction:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd

# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)

# data of chunk1
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})
# data of chunk2
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})


# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])
# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])

# test
test_data = chunk2_df.drop(['label'], axis='columns')
# regular predict
rf.predict_proba(test_data)
# tree interpreter predict
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
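
A likely explanation (my assumption, based on the ragged-array warning above): a tree fitted before the second class appears stores per-node value arrays with a single class column, so the per-tree contribution arrays have mismatched shapes and np.mean cannot stack them. A minimal workaround sketch consistent with that reading is to make sure the first warm-start batch already contains every class:

# Hypothetical workaround for the repro above: give chunk1 both classes so
# all trees agree on the number of class columns in their value arrays.
chunk1_data_vec = [0, 1]  # instead of [0, 0]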

New release

Would it be possible for you to release a new version on PyPI with the latest changes from the repository included?
Version 0.2.2 is from December 2018, and since then several merge requests have been merged into the master branch; it would be nice to have those in the pip-installable version.

Thanks in advance

UnboundLocalError of line_shape variable while using ExtraTreeClassifier

ISSUE
The following error trace was encountered while running the ti._predict_forest function for an ExtraTreeClassifier model. The same code works perfectly for RandomForestClassifier.

[screenshot: traceback ending in UnboundLocalError for the line_shape variable]

POTENTIAL CAUSE
The following code block initializes the line_shape variable.

# reshape if squeezed into a single float
if len(values.shape) == 0:
    values = np.array([values])
if isinstance(model, DecisionTreeRegressor):
    biases = np.full(X.shape[0], values[paths[0][0]])
    line_shape = X.shape[1]
elif isinstance(model, DecisionTreeClassifier):
    # scikit stores category counts, we turn them into probabilities
    normalizer = values.sum(axis=1)[:, np.newaxis]
    normalizer[normalizer == 0.0] = 1.0
    values /= normalizer

I see there are blocks specific to DecisionTreeRegressor and DecisionTreeClassifier when initializing the line_shape variable.

Do we need to add something specific to ExtraTreeClassifier?
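
For reference, a minimal sketch of the kind of call that reportedly triggers this (my example, not from the original report; it uses the iris data):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_iris(return_X_y=True)
model = ExtraTreesClassifier(n_estimators=10).fit(X, y)
# Reportedly raises UnboundLocalError for line_shape in affected versions:
prediction, bias, contributions = ti.predict(model, X[:5])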

Tests?

How should tests be run?

I get 3 test failures with python setup.py test or pytest on Python 3.6.

======================================================================
FAIL: test_forest_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 75, in test_forest_regressor
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

======================================================================
FAIL: test_forest_regressor_joint (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 90, in test_forest_regressor_joint
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

======================================================================
FAIL: test_tree_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 40, in test_tree_regressor
    self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true

Class order

Does anyone know if the class ordering is ascending (à la sklearn)?
E.g., the ordering of the class columns of, say, contributions:
prediction, bias, contributions = ti.predict(clf, X_test)
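
As a quick check (my sketch, using iris; it relies on the fact that sklearn sorts classes_ in ascending order and, as I understand the code, treeinterpreter takes the class axis straight from each tree's value arrays, which follow classes_):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50).fit(X, y)
print(clf.classes_)  # [0 1 2] -- ascending, à la sklearn
prediction, bias, contributions = ti.predict(clf, X[:5])
print(contributions.shape)  # (5, n_features, n_classes); class axis matches clf.classes_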

AxisError: axis 1 is out of bounds for array of dimension 1 while using RandomForestClassifier

Here is code to replicate the issue:

import numpy as np
import sklearn.ensemble
from treeinterpreter import treeinterpreter as ti

inp = np.array([[-1, -1],
                [-2, -1],
                [1, 1],
                [2, 1]])

model = sklearn.ensemble.RandomForestClassifier().fit(inp, [1, 1, 0, 0])
predictions, bias, contributions = ti.predict(model, inp)

The error occurs at line 63 of treeinterpreter.py, in _predict_tree(model, X, joint_contribution).

This is because not every estimator in model.estimators_ returns its tree_ values as a 2-D array.

'RandomForestRegressor' object has no attribute 'n_outputs_' needed for the predict function

I'm trying to test the predict function but it raises the following error:

AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'.

Yet according to the sklearn documentation, the attribute should exist.

Here is the full error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-101-56bf611044fd> in <module>
      1 # checking the performance of training data itself
----> 2 prediction, bias, contributions = ti.predict(rf, numpy_df_train)
      3 idx = pd.date_range(train_start_date, train_end_date)
      4 predictions_df1 = pd.DataFrame(data=prediction[0:], index = idx, columns=['prices'])
      5 predictions_df1.plot()

C:\ProgramData\Anaconda3\lib\site-packages\treeinterpreter\treeinterpreter.py in predict(model, X, joint_contribution)
    193     """
    194     # Only single out response variable supported,
--> 195     if model.n_outputs_ > 1:
    196         raise ValueError("Multilabel classification trees not supported")
    197 

AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'
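
One thing worth checking (my suggestion, not from the traceback itself): sklearn sets trailing-underscore attributes such as n_outputs_ only during fit, so this error usually means the forest was never fitted before being passed to ti.predict.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
print(hasattr(rf, "n_outputs_"))  # False -- the attribute appears only after rf.fit(X, y)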

Performance?

Hi,

I'm working on a project where treeinterpreter is taking ~2 minutes per prediction.

I suppose this is because the implementation is in pure Python.

Do you know if anyone has looked at porting this to C (or cython or whatever) to make it go faster?

Jim

Python 3.6 + sklearn 0.24

Hello.
I am receiving this error with treeinterpreter version 0.1.0 and sklearn 0.24.0, just after doing:

from treeinterpreter import treeinterpreter as ti

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-729-679b5145dca2> in <module>
----> 1 from treeinterpreter import treeinterpreter as ti
      2 from sklearn.tree import DecisionTreeRegressor
      3 from sklearn.ensemble import RandomForestRegressor

~/miniconda3/envs/py36_ds_liv/lib/python3.6/site-packages/treeinterpreter/treeinterpreter.py in <module>
      3 import sklearn
      4 
----> 5 from sklearn.ensemble.forest import ForestClassifier, ForestRegressor
      6 from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, _tree
      7 from distutils.version import LooseVersion

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
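
For context: sklearn 0.22 moved this module to the private path sklearn.ensemble._forest, and sklearn 0.24 removed the old public alias, which is why the import fails. Two workaround sketches, assuming you cannot upgrade treeinterpreter: pin scikit-learn below 0.22, or register a module alias before importing treeinterpreter (a shim, not an official fix):

import sys
import sklearn.ensemble._forest
# Point the removed public path at the private module before treeinterpreter imports it.
sys.modules["sklearn.ensemble.forest"] = sklearn.ensemble._forest
from treeinterpreter import treeinterpreter as ti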

Error in the home page example code

Hi team,
While testing the treeinterpreter module on the random forest example from the home page, I noticed the last step checks the prediction against bias + np.sum(contributions, axis=1). I don't think using rf.predict to get the prediction is right here; shouldn't it be rf.predict_proba(), since for a classifier sklearn applies argmax over the probabilities to produce the predicted label?
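
As I understand the library, for a classifier the returned prediction reconstructs the class probabilities, so the consistency check should indeed compare against predict_proba. A sketch of the check I would expect (my example, using iris):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50).fit(X, y)
prediction, bias, contributions = ti.predict(rf, X[:5])
# prediction matches the class probabilities, not the argmax'd labels:
assert np.allclose(prediction, rf.predict_proba(X[:5]))
assert np.allclose(prediction, bias + np.sum(contributions, axis=1))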

Feature Importance per Class

Hi Ando,

Thank you for this wonderful work. It seems this should be able to generate a feature importance for each class? I can run prediction, bias, contributions = ti.predict(rf, instance) for each instance and average the contributions for each feature. Is this the right thing to do?
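
For example, a sketch of that averaging over a whole dataset (my illustration, using the iris data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50).fit(X, y)
prediction, bias, contributions = ti.predict(rf, X)
# contributions has shape (n_samples, n_features, n_classes) for classifiers;
# averaging magnitudes over samples gives one importance column per class.
per_class_importance = np.abs(contributions).mean(axis=0)  # (n_features, n_classes)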

If this work really can generate per-class feature importances, then I suggest adding that point to the readme, because it took me a while to find. I am also wondering why feature importance per class is not a popular topic for random forests, at least on Google. Is that because random forests are not good at this job?

Thank you

Getting "ValueError: Wrong model type" when I try to run predict function

I ran the following code, mostly cut-and-pasted from the unit tests, in a python 3 jupyter notebook:

import numpy as np
import unittest

from sklearn.datasets import load_boston, load_iris
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              ExtraTreesClassifier, ExtraTreesRegressor,)
from sklearn.tree import (DecisionTreeClassifier, DecisionTreeRegressor,
                          ExtraTreeClassifier, ExtraTreeRegressor,)

from treeinterpreter import treeinterpreter

boston = load_boston()
iris = load_iris()

TreeRegressor = ExtraTreeRegressor
X = boston.data
Y = boston.target
testX = X[int(len(X)/2):]

#Predict for decision tree
dt = TreeRegressor()
dt.fit(X[:int(len(X)/2)], Y[:int(len(X)/2)])

base_prediction = dt.predict(testX)
pred, bias, contrib = treeinterpreter.predict(dt, testX)

This gave me the following error:

~/.local/lib/python3.5/site-packages/treeinterpreter/treeinterpreter.py in predict(model, X, joint_contribution)
    204     else:
    205         raise ValueError("Wrong model type. Base learner needs to be \
--> 206             DecisionTreeClassifier or DecisionTreeRegressor.")

ValueError: Wrong model type. Base learner needs to be             DecisionTreeClassifier or DecisionTreeRegressor.

Note that running isinstance(dt, DecisionTreeRegressor) returns True. Unfortunately, I don't know enough about python to understand why isinstance would give different answers when it is run in my notebook vs in treeinterpreter.predict.

Do you guys happen to know how to fix this?

How is joint contribution calculated over a deep tree? How can the max number of elements in a joint set be limited, e.g. to doublets or triplets?

I was able to calculate joint contributions over a random forest model trained with max depth = 30 as a binary classifier. My understanding is that, if the path of a particular instance is 1->2->3->4->5->...->30 in the tree, the joint contributions should be calculated for (1,2), (2,3), (3,4)..., (1,2,3), (2,3,4), (3,4,5)..., (1,2,3,4), (2,3,4,5), etc., i.e. the number of joint contributions scales like fibonacci_n, where n = depth of tree / depth of path.

For a single instance I get thousands of joint contributions, but the contribution groups are not what I expect: the sets of nodes are so disjointed that they don't seem to follow one path through the tree, so they can't pertain to only one instance.

So my questions are:

  1. How is joint contribution calculated for a deep tree?
  2. Is there a way to specify a max number of nodes per joint contribution?
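
For what it's worth, here is how the joint output appears to be structured, under my reading of the code (a sketch with synthetic data, not an authoritative answer): contributions are keyed by the set of features used along each root-to-node prefix of the decision path, not by every contiguous sub-path, which would explain why the groups don't look like single paths. As far as I can tell there is no built-in cap on the set size; filtering the returned dicts by len(key) is one possible post-hoc approach.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
rf = RandomForestClassifier(n_estimators=10, max_depth=30).fit(X, y)

prediction, bias, contributions = ti.predict(rf, X[:1], joint_contribution=True)
# contributions is a list with one dict per row; each dict maps a tuple of
# feature indices (an unordered feature set from a path prefix) to that set's
# joint contribution -- not one entry per contiguous sub-path.
for feature_set, contrib in list(contributions[0].items())[:5]:
    print(feature_set, contrib)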

iForest (Isolation Forest) support

Dear,

I have a scenario where I use an iForest model to detect exceptional (isolated) samples. There are lots of features (3K+), and we want to get the top-N contributing features. Do you have a plan to support iForest, or any suggestion to resolve this?

Thanks in advance!

Support for H2O Random Forest Model

Hi,

I am currently using an H2O random forest model. Is there a way you can add support to interpret that model, or can you provide some suggestions for tweaking your code to support it? In fact, if your logic could be made generic (rather than scikit-learn specific), I think this library would add value to many different projects.

Thanks,
Vik

Application to GradientBoostingTree class

Hi Ando,

Thanks for this wonderful package, makes my life a lot easier!

The treeinterpreter does not seem to work for the GradientBoostingRegressor class. It fails because this class does not expose n_outputs_, which your code checks to ensure the model has a univariate output.

This might be a quick fix. Would it be possible to do this?

I used the below code to test it.

Thanks,
Roel

----code----

import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from treeinterpreter import treeinterpreter as ti

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
mean_squared_error(y_test, gbt.predict(X_test))

X.shape

instances = X[300:309, :]
print("Instance 0 prediction:", gbt.predict(instances[0:1]))
print("Instance 1 prediction:", gbt.predict(instances[1:2]))

prediction, bias, contributions = ti.predict(gbt, instances)  # fails here

Support for pipeline objects

Currently, sklearn's Pipeline objects cannot be used directly with treeinterpreter.

A possible workaround is provided here; a sketch of one such workaround also follows below.
Could this be supported natively?
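
A workaround sketch (my example; it assumes sklearn >= 0.21, where Pipeline supports slicing):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier(n_estimators=50))]).fit(X, y)

# Apply every step except the final estimator, then interpret that estimator alone.
X_transformed = pipe[:-1].transform(X)
prediction, bias, contributions = ti.predict(pipe[-1], X_transformed)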

Feature contribution

Thank you for your code.
I am wondering whether this code can predict every sample using one set of feature weights. That is, is it possible to extract a single set of feature contributions and use it to predict any sample? In your solution, each sample gets its own set of feature contributions for predicting that sample.
Thank you.

Should aggregation be the sum over absolute contributions?

In line 6 of aggregated_contribution the individual contributions are summed.

Shouldn't the contributions be converted to their absolute values first and then summed?

For example, a joint feature that has both strongly positive and strongly negative contributions across samples would end up with a low aggregated joint contribution under the current implementation, since the values cancel out?
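
For illustration, here is the difference in a tiny sketch (my numbers; contribs stands for one feature's joint contributions collected across samples):

import numpy as np

contribs = np.array([3.0, -2.9, 0.1])   # hypothetical per-sample contributions
signed = contribs.sum()                 # 0.2 -- positives and negatives cancel
magnitude = np.abs(contribs).sum()      # 6.0 -- measures total influence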

Most recent version not installed with pip

I still get the sklearn deprecation warning about sklearn.ensemble.forest, and when I ran pip install --upgrade treeinterpreter, I got the message that I was already up-to-date.
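
Until a new version is tagged on PyPI, installing straight from the repository should pick up the master-branch fix (assuming the canonical repo location): pip install git+https://github.com/andosa/treeinterpreter.git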
