andosa / treeinterpreter
License: BSD 3-Clause "New" or "Revised" License
Happens when training on batched data with warm_start = True and the data is unbalanced.
Error:
/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: **Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes)** is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
return _predict_forest(model, X, joint_contribution=joint_contribution)
File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
File "<__array_function__ internals>", line 6, in mean
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
out=out, **kwargs)
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
arr = asanyarray(a)
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
**ValueError: could not broadcast input array from shape (2,1) into shape (2)**
Reproduction:
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd
# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)
# data of chunk1
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})
# data of chunk2
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})
# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])
# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])
# test
test_data = chunk2_df.drop(['label'], axis='columns')
# regular predict
rf.predict_proba(test_data)
# tree interpreter predict
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
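The ValueError at the end of the traceback can be reproduced without a forest. The following is a minimal sketch, assuming (as the shapes in the error message suggest) that the tree fitted on the single-label chunk produces one class column while the tree fitted on the second chunk produces two. The array values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical per-tree outputs for 2 test rows (values are made up).
tree1_pred = np.array([[1.0], [1.0]])            # shape (2, 1): tree saw one class
tree2_pred = np.array([[1.0, 0.0], [1.0, 0.0]])  # shape (2, 2): tree saw two classes

# The set of distinct shapes shows that the list is ragged.
shapes = {p.shape for p in (tree1_pred, tree2_pred)}

try:
    # Averaging over trees needs one rectangular array, so this raises.
    np.mean([tree1_pred, tree2_pred], axis=0)
    mean_failed = False
except ValueError as exc:
    mean_failed = True
    print("np.mean failed:", exc)
```

The exact error message depends on the NumPy version, but in each case the list of differently shaped arrays cannot be stacked into one array before averaging.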
Would it be possible for you to release a new version on pypi with the latest changes from the repository included?
Version 0.2.2 is from December 2018 and since then a couple of merge requests have been merged into the master branch which would be nice to have in the pip installable version.
Thanks in advance
Is there a minimum Python version recommended for this package? Would be useful to know it for future conda-forge builds, see for instance conda-forge/treeinterpreter-feedstock#1 (comment)
ISSUE
The following error trace was encountered while running the ti._predict_forest function for an ExtraTreeClassifier model. The same call works perfectly for RandomForestClassifier.
POTENTIAL CAUSE
The following code block initializes the line_shape variable.
treeinterpreter/treeinterpreter/treeinterpreter.py
Lines 56 to 66 in a52a6d7
I see there are blocks specific to DecisionTreeRegressor and DecisionTreeClassifier where line_shape is initialized.
Do we need to add something specific to ExtraTreeClassifier?
How should tests be run?
I get 3 test failures using python setup.py test or pytest, using Python 3.6.
======================================================================
FAIL: test_forest_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 75, in test_forest_regressor
self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true
======================================================================
FAIL: test_forest_regressor_joint (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 90, in test_forest_regressor_joint
self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true
======================================================================
FAIL: test_tree_regressor (tests.test_treeinterpreter.TestTreeinterpreter)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/c97346/p/treeinterpreter/tests/test_treeinterpreter.py", line 40, in test_tree_regressor
self.assertTrue(np.allclose(base_prediction, pred))
AssertionError: False is not true
Does anyone know if the class ordering is in ascending order (à la sklearn)?
E.g., the ordering of the columns of say contributions:
prediction, bias, contributions = ti.predict(clf, X_test)
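For context, scikit-learn classifiers store their labels in classes_ as the sorted unique values of the training labels. The sketch below shows that ordering with NumPy alone. That the last axis of contributions follows the same order is an assumption here, not something confirmed in this thread, and the labels are made up.

```python
import numpy as np

# Hypothetical training labels, deliberately not in sorted order.
y_train = np.array(["setosa", "virginica", "setosa", "versicolor"])

# np.unique returns the sorted unique labels, which is how classes_ is ordered.
classes = np.unique(y_train)

# Map each label to its column index under that ordering.
column_of = {label: i for i, label in enumerate(classes.tolist())}
```

With a fitted classifier, comparing clf.classes_ against the columns of rf.predict_proba is a direct way to check the ordering on real data.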
Here is the code to replicate the issue:
import numpy as np
import sklearn.ensemble
from treeinterpreter import treeinterpreter as ti

inp = np.array([[-1, -1],
                [-2, -1],
                [1, 1],
                [2, 1]])
model = sklearn.ensemble.RandomForestClassifier().fit(inp, [1, 1, 0, 0])
predictions, bias, contributions = ti.predict(model, inp)
The error occurs at line 63 of treeinterpreter.py, in _predict_tree(model, X, joint_contribution).
This is because not all of the trees in model.estimators_ return tree_ values as 2D.
I'm trying to test the predict function but it raises the following error:
AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'
Yet according to the sklearn webpage, the model does have this attribute.
Here is the full error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-101-56bf611044fd> in <module>
1 # checking the performance of training data itselfz
----> 2 prediction, bias, contributions = ti.predict(rf, numpy_df_train)
3 idx = pd.date_range(train_start_date, train_end_date)
4 predictions_df1 = pd.DataFrame(data=prediction[0:], index = idx, columns=['prices'])
5 predictions_df1.plot()
C:\ProgramData\Anaconda3\lib\site-packages\treeinterpreter\treeinterpreter.py in predict(model, X, joint_contribution)
193 """
194 # Only single out response variable supported,
--> 195 if model.n_outputs_ > 1:
196 raise ValueError("Multilabel classification trees not supported")
197
AttributeError: 'RandomForestRegressor' object has no attribute 'n_outputs_'
Hi,
I'm working on a project where treeinterpreter is taking ~2 minutes per prediction.
I suppose this is because the implementation is in pure Python.
Do you know if anyone has looked at porting this to C (or cython or whatever) to make it go faster?
Jim
Hello.
I am receiving this error with treeinterpreter version '0.1.0' and sklearn 0.24.0, just after doing
from treeinterpreter import treeinterpreter as ti
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-729-679b5145dca2> in <module>
----> 1 from treeinterpreter import treeinterpreter as ti
2 from sklearn.tree import DecisionTreeRegressor
3 from sklearn.ensemble import RandomForestRegressor
~/miniconda3/envs/py36_ds_liv/lib/python3.6/site-packages/treeinterpreter/treeinterpreter.py in <module>
3 import sklearn
4
----> 5 from sklearn.ensemble.forest import ForestClassifier, ForestRegressor
6 from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, _tree
7 from distutils.version import LooseVersion
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
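The module sklearn.ensemble.forest was deprecated in scikit-learn 0.22 in favour of the private sklearn.ensemble._forest and removed in 0.24, which matches the error above. A minimal sketch of a version-tolerant import follows. The helper name import_first is hypothetical and is not part of treeinterpreter or scikit-learn.

```python
import importlib


def import_first(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


# Intended use (requires scikit-learn, so it is left commented out here):
# forest = import_first(["sklearn.ensemble._forest", "sklearn.ensemble.forest"])
# ForestClassifier = forest.ForestClassifier
# ForestRegressor = forest.ForestRegressor
```

Listing the newer module path first avoids the deprecation warning on versions where both paths exist.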
Hi team,
I'm testing the treeinterpreter module for random forests with the example code from the home page. The last step compares the random forest prediction with the interpreter result, bias + np.sum(contributions, axis=1). I don't think rf.predict is the right function for that comparison. The correct way should be to use rf.predict_proba() to get the probabilities, right? sklearn applies argmax to the probabilities to get the index of the predicted label.
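The relationship described here can be checked on a toy example. This is a minimal sketch with made-up arrays shaped like the classifier output of ti.predict, which is assumed to be (n_samples, n_classes) for bias and (n_samples, n_features, n_classes) for contributions. No model is fitted.

```python
import numpy as np

# Hypothetical ti.predict output for 2 samples, 2 features, 2 classes.
bias = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
contributions = np.array([
    [[0.2, -0.2], [0.1, -0.1]],   # sample 0: one row per feature
    [[-0.3, 0.3], [-0.1, 0.1]],   # sample 1
])

# Summing over the feature axis reconstructs per-class probabilities ...
reconstructed = bias + np.sum(contributions, axis=1)

# ... and argmax over classes gives the index of the predicted label.
label_index = np.argmax(reconstructed, axis=1)
```

So the reconstructed values are comparable to rf.predict_proba, while rf.predict corresponds to looking up label_index in rf.classes_.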
In reference to:
https://github.com/andosa/treeinterpreter/blob/master/treeinterpreter/treeinterpreter.py#L60
Taking the sum along the first axis fails if len(values.shape) == 1, a case I just stumbled across.
Hi Ando,
Thank you for this wonderful work. I found that this should be able to generate feature importance for each class. I can call prediction, bias, contributions = ti.predict(rf, instance) for each instance and average the contributions for each feature. Is this the right thing to do?
If this work really can generate feature importance for each class, then I suggest adding this point to the README, because it took me a while to find. I am also wondering why feature importance per class is not a popular topic for rf, at least on Google. Is that because rf is not good at this job?
Thank you
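One way to read the averaging idea above is sketched below with made-up numbers. It assumes contributions has shape (n_instances, n_features, n_classes). Whether the signed mean or the mean of absolute values is the better per-class importance is left open here, and neither is confirmed as the recommended approach.

```python
import numpy as np

# Hypothetical contributions for 3 instances, 2 features, 2 classes.
contributions = np.array([
    [[0.2, -0.2], [0.1, -0.1]],
    [[-0.2, 0.2], [0.3, -0.3]],
    [[0.0, 0.0], [0.2, -0.2]],
])

# Signed mean per (feature, class): opposite effects across instances cancel.
mean_signed = contributions.mean(axis=0)

# Mean absolute contribution per (feature, class): magnitude only.
mean_abs = np.abs(contributions).mean(axis=0)
```

In this toy data feature 0 has a signed mean of zero for both classes even though it moves individual predictions, which is why the two summaries can rank features differently.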
I ran the following code, mostly cut-and-pasted from the unit tests, in a python 3 jupyter notebook:
import numpy as np
import unittest
from sklearn.datasets import load_boston, load_iris
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
ExtraTreesClassifier, ExtraTreesRegressor,)
from sklearn.tree import (DecisionTreeClassifier, DecisionTreeRegressor,
ExtraTreeClassifier, ExtraTreeRegressor,)
from treeinterpreter import treeinterpreter
boston = load_boston()
iris = load_iris()
TreeRegressor = ExtraTreeRegressor
X = boston.data
Y = boston.target
testX = X[int(len(X)/2):]
#Predict for decision tree
dt = TreeRegressor()
dt.fit(X[:int(len(X)/2)], Y[:int(len(X)/2)])
base_prediction = dt.predict(testX)
pred, bias, contrib = treeinterpreter.predict(dt, testX)
This gave me the following error:
~/.local/lib/python3.5/site-packages/treeinterpreter/treeinterpreter.py in predict(model, X, joint_contribution)
204 else:
205 raise ValueError("Wrong model type. Base learner needs to be \
--> 206 DecisionTreeClassifier or DecisionTreeRegressor.")
ValueError: Wrong model type. Base learner needs to be DecisionTreeClassifier or DecisionTreeRegressor.
Note that running isinstance(dt, DecisionTreeRegressor) returns True. Unfortunately, I don't know enough about Python to understand why isinstance would give different answers when it is run in my notebook vs. in treeinterpreter.predict.
Do you guys happen to know how to fix this?
Curious here.
I was able to calculate joint contributions over a random forest model trained with max depth = 30 as a binary classifier. My understanding is that, if the path of a particular instance is 1->2->3->4->5->...->30 in the tree, the joint contributions should be calculated for (1,2), (2,3), (3,4)..., (1,2,3), (2,3,4), (3,4,5)... (1,2,3,4), (2,3,4,5), etc., i.e. the number of joint contributions scales as fibonacci_n where n = depth of tree / depth of path.
For a single instance, I get thousands of joint contributions, but the contribution groups are not what I expect, i.e. the sets of nodes are so disjointed, it doesn't seem to follow one path through the tree, so it can't seem to pertain to only one instance.
So questions are:
Dear,
I have a scenario where an iForest (Isolation Forest) model is used to detect anomalous (isolated) samples. There are lots of features (3K+) and we want to get the top-N contributing features. Do you have a plan to support iForest? Or any suggestion for resolving this?
Thanks in advance!
Hi,
I am currently using an H2O random forest model. Is there a way you could add support for interpreting that model, or could you suggest how to tweak your code to support it? In fact, if your logic could be made generic (rather than scikit-learn specific), I think this library would really add value in many different projects.
Thanks,
Vik
Hi Ando,
Thanks for this wonderful package, makes my life a lot easier!
The treeinterpreter does not seem to work for the class GradientBoostingTree. It fails because this class does not expose n_outputs_, which your code checks to ensure the model has a univariate output.
This might be a quick fix. Would it be possible to do this?
I used the below code to test it.
Thanks,
Roel
----code----
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from treeinterpreter import treeinterpreter as ti

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
# The default loss is least squares (called 'ls' in the sklearn version used originally).
gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0).fit(X_train, y_train)
mean_squared_error(y_test, gbt.predict(X_test))
X.shape
instances = X[300:309, :]
# predict expects a 2D array, so slice one row instead of indexing it.
print("Instance 0 prediction:", gbt.predict(instances[0:1]))
print("Instance 1 prediction:", gbt.predict(instances[1:2]))
# This is the call that fails for the gradient boosting model.
prediction, bias, contributions = ti.predict(gbt, instances)
Currently, sklearn's Pipeline objects cannot be used directly in treeinterpreter.
A possible workaround is provided here.
Could this be supported natively?
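One generic way to work around this, not necessarily the workaround linked above, is to run the data through every pipeline step except the last and hand the final estimator plus the transformed data to treeinterpreter. A minimal sketch follows. The helper split_pipeline and the stand-in classes are hypothetical, and the only assumption about the pipeline is that it exposes steps as a list of (name, step) pairs, as a scikit-learn Pipeline does.

```python
def split_pipeline(pipeline, X):
    """Apply every step except the last to X; return (final_estimator, transformed_X).

    Assumes every non-final step is a fitted object with a transform method.
    """
    Xt = X
    for _name, step in pipeline.steps[:-1]:
        Xt = step.transform(Xt)
    return pipeline.steps[-1][1], Xt


# Stand-in objects so the sketch runs without scikit-learn (purely illustrative).
class Doubler:
    def transform(self, X):
        return [[2 * v for v in row] for row in X]


class FakePipeline:
    steps = [("double", Doubler()), ("model", "final-estimator-placeholder")]


model, Xt = split_pipeline(FakePipeline(), [[1, 2], [3, 4]])
# With real objects the next call would be: ti.predict(model, Xt)
```

Steps set to the string 'passthrough' have no transform method, so a real pipeline containing them would need an extra check.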
Thank you for your code.
I am wondering whether this code can predict every sample from a single set of feature weights. That is, is it possible to extract one set of feature contributions and use it to predict any sample? In your solution, each sample has its own set of feature contributions that predicts that sample.
Thank you.
Hello and thank you for your package!
I am a PhD student and I am currently using your library for one of my research projects. I was wondering if you would be interested in creating a citation file so the library can be cited. Here is a very good blog post on why/how citing software is good practice: https://www.software.ac.uk/blog/2016-10-06-encouraging-citation-software-introducing-citation-files
Let me know if you would like me to open a pull request.
In line 6 of aggregated_contribution the individual contributions are summed.
Shouldn't the contributions be converted to their absolute values first and then summed?
For example, a joint feature that has both high negative and low negative contributions would get a low aggregated joint contribution under the current implementation?
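The difference between the two aggregation choices is easiest to see when contributions of opposite sign cancel. This is a minimal sketch, not the library's aggregated_contribution function: the helper aggregate and the sample maps are hypothetical, and each map is assumed to go from a tuple of feature indices to a scalar contribution.

```python
from collections import defaultdict


def aggregate(contribution_maps, use_abs=False):
    """Sum contributions per feature tuple, optionally by absolute value."""
    totals = defaultdict(float)
    for cmap in contribution_maps:
        for features, value in cmap.items():
            totals[features] += abs(value) if use_abs else value
    return dict(totals)


# Two hypothetical instances: features (0, 1) push in opposite directions.
maps = [{(0, 1): 0.5, (2,): 0.25},
        {(0, 1): -0.5, (2,): 0.25}]

signed = aggregate(maps)                  # opposite signs cancel for (0, 1)
absolute = aggregate(maps, use_abs=True)  # magnitudes accumulate instead
```

The signed sum answers "in which direction does this feature set move predictions on average", while the absolute sum answers "how much does it move them at all".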
It's not clear what joint_contribution=True/False does.
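A sketch of one reading of the flag, to be treated as an assumption rather than confirmed behaviour: without joint contributions, the change in node value at each split is credited to the single feature split on; with joint contributions, it is credited to the set of all features seen so far on the decision path. The helper attribute_path and the numbers are hypothetical.

```python
def attribute_path(path_values, path_features, joint=False):
    """Credit each change in node value along one decision path.

    path_values: node value at each node on the path, root first.
    path_features: feature index split on at each non-leaf node of the path.
    Returns (bias, contributions), where bias is the root value.
    """
    contributions = {}
    seen = set()
    for i, feature in enumerate(path_features):
        delta = path_values[i + 1] - path_values[i]
        if joint:
            # Key is the sorted tuple of all features used up to this split.
            seen.add(feature)
            key = tuple(sorted(seen))
        else:
            key = feature
        contributions[key] = contributions.get(key, 0.0) + delta
    return path_values[0], contributions


# Hypothetical path: root value 0.5, splits on features 2, 0, 2, leaf value 1.0.
values = [0.5, 0.75, 0.25, 1.0]
features = [2, 0, 2]
bias, per_feature = attribute_path(values, features)
_, joint = attribute_path(values, features, joint=True)
```

In both modes the bias plus the sum of all contributions equals the leaf value, so only the grouping of the credit changes.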
I still get the sklearn deprecation warning about sklearn.ensemble.forest, and when I ran pip install --upgrade treeinterpreter, I got the message that I was already up-to-date.