
linear-tree's People

Contributors

cerlymarco · jrauch-pros


linear-tree's Issues

[Performance suggestions] Parallelism between trees, and replacing the linear fit with mini-batch SGD?

It seems that any two tree models in a forest can be trained in parallel. Is there a way to pass n_jobs=-1 as a parameter, or to wrap the whole fit in a joblib parallel context with n_jobs=-1?

Is it possible to replace the exact linear fit with an SGD fit for large-scale data? And should we, in terms of speed and model equivalence?

Also, is it possible to use a GPU to solve each linear fit (either the closed-form way or with gradient-based optimizers)?

I am thinking that this type of model, applied to tabular data, can have traceable error sensitivity (because the derivatives, i.e. the linear slopes, are known and the jumps are finite). Maybe one thing to try is to apply these models to a wide range of biostatistics tabular datasets (some of them are very small, under 2k observations and 50 variables, but have good local correlations and need good interpretability). So I am planning to use it at scale. A sketch of both suggestions follows.
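A minimal sketch of both ideas, assuming the released linear-tree API: an n_jobs constructor argument for the parallel split search, and any scikit-learn regressor with fit/predict as base_estimator. Using SGDRegressor here is my suggestion, not something the library ships with.

from lineartree import LinearTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50_000, n_features=20, random_state=0)

# n_jobs=-1 parallelizes the candidate-split evaluation via joblib;
# SGDRegressor swaps the closed-form least-squares fit in each node
# for a gradient-based one (a speed vs. exactness trade-off).
reg = LinearTreeRegressor(
    base_estimator=SGDRegressor(max_iter=1000, tol=1e-3),
    n_jobs=-1,
)
reg.fit(X, y)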

Why does each leaf node return three arrays of coefficients?

Hi, I was going through each leaf node to see how the coefficients for each feature behave, and I realised that each node returns three arrays of coefficients.
[screenshot: coefficient arrays for one leaf node]
The screenshot above shows how one node behaves. I know the output is correct, but I am not able to interpret it properly. Any insight would be appreciated.
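A likely explanation, not confirmed in the thread: with a three-class target, scikit-learn's linear classifiers fit one coefficient vector per class (one-vs-rest), so each leaf model exposes three rows in coef_. A minimal sketch:

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# 3-class problem: the fitted classifier stores one coefficient row per class
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
clf = RidgeClassifier().fit(X, y)
print(clf.coef_.shape)  # (3, 5): three arrays of coefficients, one per class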

Export to graphviz: AttributeError: 'LinearTreeRegressor' object has no attribute 'n_features_'

Hi

thanks for writing this great package!

When I try to display the decision tree with graphviz, I get this error:

AttributeError: 'LinearTreeRegressor' object has no attribute 'n_features_'

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

# `train` is the reporter's DataFrame, with feature columns `x_cols`
# and a target column "y"
reg = LinearTreeRegressor(base_estimator=LinearRegression())
reg.fit(train[x_cols], train["y"])

from graphviz import Source
from sklearn import tree

# This raises: sklearn's exporter expects a fitted sklearn decision tree
# and reads attributes (such as n_features_) that LinearTreeRegressor lacks
graph = Source(tree.export_graphviz(reg, out_file=None, feature_names=train.columns))
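Not in the original report: the linear-tree README documents the model's own export helpers, plot_model() and model_to_dot(), which avoid sklearn's exporter entirely. A sketch, assuming those methods as documented and continuing from the snippet above:

# Use linear-tree's own exporter instead of sklearn.tree.export_graphviz
graph = Source(reg.model_to_dot())  # dot source for the fitted tree
reg.plot_model()                    # or render directly with matplotlib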

Reference

I am using your Linear Tree code in a research paper related to audio, and I would like to cite your work.
Is there a specific way to reference it in the paper's bibliography?

Min impurity decrease

Hi, I really like your linear-tree library. I have been looking for something like this for a while, and it fits my use case perfectly.
If I understand LinearTreeRegressor correctly, a node is split when the weighted loss of the child nodes is less than the loss of the parent node.

What I would like to do is to only split a node if the decrease in loss is over a certain threshold. Scikit-learn has something called min_impurity_decrease which could be used.

I implemented a small suggestion in a PR and would be happy to expand on it (e.g. input validation, maybe extending it to classification) if you find it useful. A sketch of the proposed usage follows.
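A sketch of the proposed knob, mirroring scikit-learn's min_impurity_decrease semantics; the parameter name follows the PR and is not guaranteed to be in a released version:

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

# Split a node only if the weighted loss decrease exceeds the threshold
reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    min_impurity_decrease=0.01,  # hypothetical threshold value
)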

Which traversal method does linear-tree use to order the left and right nodes?

Hi all, I am having a hard time finding out which method linear-tree uses to traverse the tree. When I plot the tree and compare it with the summary, the mapping sometimes makes no sense: a node the summary lists as a left child is displayed on the right in the plot, and vice versa.
You can compare the summary below with the plot and let me know if I am mistaken somewhere.
[plot of the fitted tree]

0: {'col': 1,
'th': 0.0127,
'loss': 0.1937,
'samples': 160,
'children': (1, 2),
'models': (RidgeClassifier(), RidgeClassifier())},
1: {'col': 6,
'th': 0.1461,
'loss': 0.1,
'samples': 80,
'children': (3, 4),
'models': (RidgeClassifier(), RidgeClassifier())},
2: {'col': 0,
'th': 2.6051,
'loss': 0.05,
'samples': 80,
'children': (9, 10),
'models': (RidgeClassifier(), RidgeClassifier())},
4: {'col': 0,
'th': -0.0708,
'loss': 0.0364,
'samples': 55,
'children': (5, 6),
'models': (RidgeClassifier(), RidgeClassifier())},
6: {'col': 2,
'th': -0.7986,
'loss': 0.0,
'samples': 32,
'children': (7, 8),
'models': (RidgeClassifier(), RidgeClassifier())},
9: {'col': 2,
'th': -0.0865,
'loss': 0.0,
'samples': 59,
'children': (11, 12),
'models': (RidgeClassifier(), RidgeClassifier())},
3: {'loss': 0.08,
'samples': 25,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
5: {'loss': 0.0,
'samples': 23,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
7: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1])},
8: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
11: {'loss': 0.0,
'samples': 32,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
12: {'loss': 0.0,
'samples': 27,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
10: {'loss': 0.0476,
'samples': 21,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])}}
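Not part of the original question: given the summary() structure above, a depth-first walk makes the assumed orientation explicit. A sketch, for a fitted model reg, under the assumption that children = (left, right) and the left child holds the samples with X[:, col] <= th:

def walk(summary, node_id=0, depth=0):
    # Depth-first traversal of the dict returned by reg.summary()
    node = summary[node_id]
    indent = '  ' * depth
    if 'children' in node:  # internal node: carries the split rule
        print(f"{indent}node {node_id}: X[:, {node['col']}] <= {node['th']}")
        left, right = node['children']
        walk(summary, left, depth + 1)   # assumed: samples <= threshold
        walk(summary, right, depth + 1)  # assumed: samples > threshold
    else:                   # leaf: carries the fitted linear model
        print(f"{indent}leaf {node_id}: {node['samples']} samples")

walk(reg.summary())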

Maximum Slope limiter

Hi @cerlymarco , thanks for developing this method into a good library.

I'm thinking that in some (perhaps most) cases we need a maximum slope for each regressor. The main idea is to prevent over-optimistic extrapolation in the prediction output.

If slope > max_slope, split that node further. A rough sketch of such a check follows.
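A rough sketch of the check, hypothetical and outside the library: after fitting, inspect each leaf model in summary() and flag slopes above a cap. For a fitted linear-tree model reg:

import numpy as np

MAX_SLOPE = 10.0  # hypothetical cap

# Leaves carry the fitted base estimator; its coefficients are the slopes
for node_id, node in reg.summary().items():
    if 'children' not in node:  # leaf
        slopes = np.abs(node['models'].coef_)
        if slopes.max() > MAX_SLOPE:
            print(f"leaf {node_id} exceeds the cap: {slopes.max():.3f}")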

Finding breakpoints

Hello,

Thank you for your nice tool. I am using LinearTreeRegressor to fit a continuous piecewise-linear function. It works well. I am wondering: is it possible to show the locations (the coordinates) of the breakpoints?

thank you
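Not an official answer: based on the summary() structure shown in an earlier issue, the split thresholds of the internal nodes are exactly the breakpoint coordinates. A sketch, for a fitted LinearTreeRegressor reg:

# Internal nodes carry the splits; each (col, th) pair is a breakpoint
breakpoints = [
    (node['col'], node['th'])        # (feature index, threshold value)
    for node in reg.summary().values()
    if 'children' in node
]
print(breakpoints)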

Performance and possibility to split only on subset of features

Hey, I have been playing around a lot with your linear trees and like them very much. Thanks!

Nevertheless, I am somewhat disappointed by the runtime performance. Compared to XGBoost regressors (I know it's not a fair comparison) or linear regressions (also not fair), the linear tree is really slow:
50k observations, 80 features: 2s for linear regression, 27s for XGBoost, and 300s for the linear tree.
Have you seen similar runtimes or might I be using it wrong?

Another aspect that's interesting to me is whether it is possible to limit the features that are used for splits. I haven't found it in the code. Any chance of seeing it in the future?
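For what it's worth, the project docs describe a split_features parameter that does exactly this. A sketch, assuming that parameter, with n_jobs added as a possible runtime lever:

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    split_features=[0, 3, 7],  # only these columns are candidates for splits
    n_jobs=-1,                 # parallel split search may also cut runtime
)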

LinearTree does not fit well

Hi There!

I was working with the library when I noticed that the LinearTree does not fit the data even when I try to overfit it, as in the following image:

[plot: model predictions against the data]

Is there a reason for that?

Thanks

Use of categorical text attributes

Hello there!

This is a great package that I just found. I'm still experimenting with it, but it's working nicely.

I was trying to use categorical text features, but it seems the package only accepts numerical attributes and bins them internally to get the categories. Am I doing something wrong?
I’d love to give this project 5 stars.

Thanks!
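A sketch of a possible workaround, not an official answer: encode the text categories to integer codes first, then mark those columns with the categorical_features parameter from the docs.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor

X_text = np.array([['red'], ['green'], ['blue'], ['red']])
X_enc = OrdinalEncoder().fit_transform(X_text)  # text -> numeric codes

reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    categorical_features=[0],  # column 0 holds (encoded) categories
)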

How to gridsearch tree and regression parameters?

Hi, I am wondering how to perform a GridSearchCV to find the best parameters for both the tree and the regression model.
For now I am able to tune the tree component of my model:

from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.linear_model import ElasticNet
from lineartree import LinearForestRegressor

param_grid = {
    'n_estimators': [50, 100, 500, 700],
    'max_depth': [10, 20, 30, 50],
    'min_samples_split': [2, 4, 8, 16, 32],
    'max_features': ['sqrt', 'log2', None]
}
cv = RepeatedKFold(n_repeats=3,
                   n_splits=3,
                   random_state=1)

model = GridSearchCV(
    LinearForestRegressor(ElasticNet(random_state=0), random_state=42),
    param_grid=param_grid,
    n_jobs=-1,
    cv=cv,
    scoring='neg_root_mean_squared_error'
)
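A sketch of the missing piece, relying on the standard scikit-learn convention rather than anything library-specific: nested estimator parameters are addressed with double-underscore names, so the inner ElasticNet can be tuned in the same grid.

param_grid = {
    'max_depth': [10, 20, 30, 50],
    'base_estimator__alpha': [0.1, 1.0, 10.0],    # ElasticNet parameter
    'base_estimator__l1_ratio': [0.2, 0.5, 0.8],  # ElasticNet parameter
}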

Allow the hyperparameter "max_depth = 0".

Thanks for the good library.

When using LinearTreeRegressor, I think that max_depth is often optimized by cross-validation.

This library allows max_depth in the range 1-20. However, depending on the dataset, a simple linear regression may be the best fit. Even on such a dataset, max_depth is forced to be 1 or more, so a simple linear regression cannot be expressed with LinearTreeRegressor.

  • Of course, it is appropriate to use sklearn.linear_model.LinearRegression for such datasets.

My suggestion is to fall back to fitting base_estimator alone when "max_depth = 0".
With this change, LinearTreeRegressor could flexibly cover both segmented regression and simple regression by changing a single hyperparameter. A rough sketch follows.
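A rough sketch of the requested fallback, hypothetical and outside the library:

from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor

def make_model(max_depth, base_estimator=None):
    # max_depth == 0: plain base_estimator fitted on all the data
    base_estimator = base_estimator or LinearRegression()
    if max_depth == 0:
        return base_estimator
    return LinearTreeRegressor(base_estimator=base_estimator,
                               max_depth=max_depth)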

Make the threshold precision customizable

Currently, the threshold precision is hard-coded to 5 decimal places here. Making it customizable would allow the linear tree model to be used when the numbers involved are generally smaller. My suggestion is to add another parameter that defaults to 5; people who want to use the model with smaller numbers could then set it to the precision they need.

Let me know if this change sounds good and I can create a PR for it. I'm open to discussion.

numpy deprecation warning

/lineartree/_classes.py:338: DeprecationWarning:

the interpolation= argument to quantile was renamed to method=, which has additional options.
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they used. (Deprecated NumPy 1.22)

Seems like a quick update here would get this warning to stop showing up, right? I can always ignore it, but figured I would mention it in case it is actually an error on my side.

Also, sorry, I don't actually know what the best open-source etiquette is. If I'm supposed to create a pull request with a proposed fix instead of just mentioning it, feel free to correct me.
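For reference, the rename behind the warning (the method= spelling is the current NumPy API, 1.22+):

import numpy as np

data = np.arange(10.0)
# old, deprecated spelling:
#   np.quantile(data, 0.5, interpolation='midpoint')
print(np.quantile(data, 0.5, method='midpoint'))  # current spelling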

Reference

Hi,

I really like your work with Linear Trees. I would like to ask whether there are any references, such as papers, that accurately describe the split procedure in the form of equations.

Thank you in advance,
George Moiragias

learning_rate in boosting

Hi, is there a way to set the learning_rate in the boosting regressors and classifiers?

EDIT:
Also, does LinearBoostRegressor fit a linear regression first and then boost the residuals via regression trees, or does it boost via a series of linear model trees?

Performing Split on Node with Perfect Results

I have an example where a split is performed on a node with a loss of 0. Take a look at the example below: it performs a split on node 1 (where the loss = 0). This split adds no value to the results, since the parent node (node 1) already gives perfect results.

Is this the intended behavior? Or should it not perform splits when the results are already perfect?

[plot: tree showing a split under a zero-loss node]

Non-coherent splitting results

Hello,
I have a dataframe with a column X >= 0. I added its index to the split_features parameter of LinearTreeRegressor.
I set max_depth to 1 and used LinearRegression() as the base estimator.
When I count the number of samples at node 1 (i.e. those assumed to be <= the threshold indicated at node 0), I realize that it doesn't correspond to my data for column X.
When I increase max_depth, some negative split thresholds appear, even though column X is >= 0 as noted above.
Do you normalize or scale the data somehow before training?
Thanks in advance !

Extract Coefficients

How can someone extract the coefficients of the linear model fitted in each leaf?
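A sketch, based on the summary() structure shown in an earlier issue: leaf nodes carry the fitted base estimator, whose coefficients live in coef_. For a fitted model reg:

for node_id, node in reg.summary().items():
    if 'children' not in node:   # leaf node
        model = node['models']   # fitted linear model for this leaf
        print(node_id, model.coef_, model.intercept_)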

LinearForestRegressor may give biased coefficients for base estimator

Hi There!

I am very interested in the linear-tree package and found it inspiring for my research. But when I used LinearForestRegressor in my study, I found that its base estimator gave biased coefficients (with absolute values that are too small), so that the prediction was essentially fitted by the forest estimator alone. The structure of the linear forest therefore ends up very similar to a plain random forest regressor. It may be due to round-off error: the source code calls self._validate_data with dtype "float32".

I generated a synthetic dataset to compare scikit-learn's LinearRegression with LinearForestRegressor. By the way, how can we deal with data whose features span multiple orders of magnitude? Will the base_estimator parameter support an sklearn Pipeline (for preprocessing like StandardScaler) in a future release?

Thank you for your excellent work!

import numpy as np
from lineartree import LinearForestRegressor
from sklearn.linear_model import LinearRegression

SEED = 1234


# Generate a synthetic dataset
X1 = np.random.randn(1000, 1) * 1 + 10
X2 = np.random.randn(1000, 1) * 1e7 + 3e7
X3 = np.random.randn(1000, 1) * 100 + 200
X4 = np.random.randn(1000, 1) + 500
X5 = np.random.randn(1000, 1) + 1000
X6 = np.random.randn(1000, 1)
X7 = np.random.randn(1000, 1)
X8 = np.random.rand(1000, 1)

X = np.concatenate([X1, X2, X3, X4, X5, X6, X7, X8], axis=1)
y = X1 + np.sin(X2 * X6) + (X3 / 1e6) ** 2 + X4 / 1e3 + X2 / 1e7 + \
    X7 * X8 + np.random.randn(1000, 1) * 0.1
y = np.log(y)

# Fit a linear regression model
lr = LinearRegression()
lr.fit(X, y)
lr_coef = lr.coef_
print(lr_coef) 

# this will give [[ 7.49327164e-02  7.59350553e-09 -5.17630150e-06 -1.67616079e-05
#  -1.73796325e-03  3.13294480e-04  4.07092831e-02 -7.15923013e-03]]

# Fit a linear forest model
lf = LinearForestRegressor(base_estimator=LinearRegression(),
                           n_estimators=100, max_depth=5,
                           max_features=1.0, random_state=SEED)
lf.fit(X, y)
lf_coef = lf.coef_
print(lf_coef)

# this will give [ 1.3074668e-09  7.2390938e-09 -2.1693744e-05  9.1071959e-09
# -6.6003052e-09 -7.7589535e-09  7.1229582e-09  5.3837756e-09]

Error when running with multiple jobs: unexpected keyword argument 'target_offload'

I have been using your library for quite a while and am super happy with it. So first, thanks a lot!

Lately, I used my framework (which also uses your library) on a modern many-core server with many jobs. It worked fine. Then I updated everything via pip, and with 8 jobs on my MacBook I got the following error.

This error does not occur when using only a single job (I pass the number of jobs to n_jobs).

I cannot nail down the actual problem, but since it occurred right after the upgrade, I assume the update might be the cause?

Am I doing something wrong here?

"""
Traceback (most recent call last):
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 56, in __call__
    with config_context(**self.config):
  File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 239, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 82, in __init__
    self.gen = func(*args, **kwds)
TypeError: config_context() got an unexpected keyword argument 'target_offload'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "compression_selection_pipeline.py", line 41, in <module>
    model_pipeline.learn_runtime_models(calibration_result_dir)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 670, in learn_runtime_models
    non_splitting_models("table_scan", table_scans)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 590, in non_splitting_models
    fitted_model = model_dict["model"].fit(X_train, y_train)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 209, in fit
    return self.regression.fit(X, y)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/lineartree.py", line 187, in fit
    self._fit(X, y, sample_weight)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 576, in _fit
    self._grow(X, y, sample_weight)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 387, in _grow
    loss=loss)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 285, in _split
    for feat in split_feat)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: config_context() got an unexpected keyword argument 'target_offload'

PS: I have already left a star. :D

Rationale for rounding during _parallel_binning_fit and _grow

I noticed that the implementations of _parallel_binning_fit and _grow internally round loss values to 5 decimal places. This makes the regression results dependent on the scale of the labels, as data with a lower natural loss value will result in many different splits of the data having the same loss when rounded to 5 decimal places. Is there a reason why this is the case?

This behavior can be observed by fitting a LinearTreeRegressor using the default loss function and multiplying the scale of the labels by a small number (like 1e-9). This will result in the regressor no longer learning any splits.
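A sketch reproducing the reported behaviour, under the rounding behaviour described above:

import numpy as np
from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.where(X[:, 0] < 0, -X[:, 0], 2 * X[:, 0])  # piecewise-linear target

reg = LinearTreeRegressor(base_estimator=LinearRegression())
reg.fit(X, y)
print(len(reg.summary()))  # several nodes: splits were learned

reg.fit(X, y * 1e-9)       # same data, tiny label scale
print(len(reg.summary()))  # reportedly collapses: all losses round alike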

TypeError min_impurity_split

Hi,

I am trying the usage-LinearBoost notebook in Colab.

In cell 3:
regr = LinearBoostRegressor(Ridge(), loss='linear')
regr.fit(X, y)

I have a problem:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 regr = LinearBoostRegressor(Ridge(), loss='linear')
----> 2 regr.fit(X, y)

1 frames
/usr/local/lib/python3.7/dist-packages/lineartree/_classes.py in _fit(self, X, y, sample_weight)
    943     min_impurity_decrease=self.min_impurity_decrease,
    944     min_impurity_split=self.min_impurity_split,
--> 945     ccp_alpha=self.ccp_alpha
    946 )
    947

TypeError: __init__() got an unexpected keyword argument 'min_impurity_split'

I installed linear-tree with:
%pip install linear-tree
