
linear-tree's People

Contributors

cerlymarco · jrauch-pros


linear-tree's Issues

[Performance suggestions] Parallelism between trees, and replacing the linear fit with mini-batch SGD?

It seems that any two tree models in a forest can be trained in parallel. Is there a way to pass n_jobs=-1 as a parameter, or to wrap the whole fit in a joblib parallel context with n_jobs=-1?

Is it possible to replace the exact linear fit with an SGD fit for large-scale data? And should we, in terms of speed and model equivalence?

Also, is it possible to use a GPU to solve each linear fit (either the closed-form way or with gradient-based optimizers)?

I am thinking that this type of model, applied to tabular data, can have traceable error sensitivity (because the derivatives, i.e. the linear slopes, are known and the jumps are finite). Maybe one thing to try is to apply these models to a wide range of biostatistics tabular datasets (some of them are very small, under 2k observations and 50 variables, but have good local correlations and need good interpretability). So I am planning to use it at scale. A sketch of both suggestions follows.
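A minimal sketch of both ideas, assuming the released linear-tree API: an n_jobs constructor argument for the parallel split search, and any scikit-learn regressor with fit/predict as base_estimator. Using SGDRegressor here is my suggestion, not something the library ships with.

from lineartree import LinearTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50_000, n_features=20, random_state=0)

# n_jobs=-1 parallelizes the candidate-split evaluation via joblib;
# SGDRegressor swaps the closed-form least-squares fit in each node
# for a gradient-based one (a speed vs. exactness trade-off).
reg = LinearTreeRegressor(
    base_estimator=SGDRegressor(max_iter=1000, tol=1e-3),
    n_jobs=-1,
)
reg.fit(X, y)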

Why does each leaf node return three arrays of coefficients?

Hi, I was going through each leaf node to see how the coefficients for each feature behave, and I realised that each node returns three arrays of coefficients.
[screenshot: coefficient arrays for one leaf node]
The screenshot above shows how one node behaves. I know the output is correct, but I am not able to interpret it properly. Any insight would be appreciated.
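A likely explanation, not confirmed in the thread: with a three-class target, scikit-learn's linear classifiers fit one coefficient vector per class (one-vs-rest), so each leaf model exposes three rows in coef_. A minimal sketch:

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# 3-class problem: the fitted classifier stores one coefficient row per class
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
clf = RidgeClassifier().fit(X, y)
print(clf.coef_.shape)  # (3, 5): three arrays of coefficients, one per class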

Export to graphviz: AttributeError: 'LinearTreeRegressor' object has no attribute 'n_features_'

Hi

thanks for writing this great package!

When I try to display the decision tree with graphviz, I get this error:

AttributeError: 'LinearTreeRegressor' object has no attribute 'n_features_'

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

# `train` is the reporter's DataFrame, with feature columns `x_cols`
# and a target column "y"
reg = LinearTreeRegressor(base_estimator=LinearRegression())
reg.fit(train[x_cols], train["y"])

from graphviz import Source
from sklearn import tree

# This raises: sklearn's exporter expects a fitted sklearn decision tree
# and reads attributes (such as n_features_) that LinearTreeRegressor lacks
graph = Source(tree.export_graphviz(reg, out_file=None, feature_names=train.columns))
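Not in the original report: the linear-tree README documents the model's own export helpers, plot_model() and model_to_dot(), which avoid sklearn's exporter entirely. A sketch, assuming those methods as documented and continuing from the snippet above:

# Use linear-tree's own exporter instead of sklearn.tree.export_graphviz
graph = Source(reg.model_to_dot())  # dot source for the fitted tree
reg.plot_model()                    # or render directly with matplotlib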

Reference

I am using your Linear Tree code in a research paper related to audio, and I would like to cite your work.
Is there a specific way to reference it in the paper's bibliography?

Min impurity decrease

Hi, I really like your linear-tree library. I have been looking for something like this for a while, and it fits my use case perfectly.
If I understand LinearTreeRegressor correctly, a node is split when the weighted loss of the child nodes is less than the loss of the parent node.

What I would like to do is to only split a node if the decrease in loss is over a certain threshold. Scikit-learn has something called min_impurity_decrease which could be used.

I implemented a small suggestion in a PR and would be happy to expand on it (e.g. input validation, maybe extending it to classification) if you find it useful. A sketch of the proposed usage follows.
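A sketch of the proposed knob, mirroring scikit-learn's min_impurity_decrease semantics; the parameter name follows the PR and is not guaranteed to be in a released version:

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

# Split a node only if the weighted loss decrease exceeds the threshold
reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    min_impurity_decrease=0.01,  # hypothetical threshold value
)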

Which traversal method does linear-tree use to order the left and right nodes?

Hi all, I am having a hard time finding out which method linear-tree uses to traverse the tree. When I plot the tree and compare it with the summary, the mapping sometimes makes no sense: a node the summary lists as a left child is displayed on the right in the plot, and vice versa.
You can compare the summary below with the plot and let me know if I am mistaken somewhere.
[plot of the fitted tree]

0: {'col': 1,
'th': 0.0127,
'loss': 0.1937,
'samples': 160,
'children': (1, 2),
'models': (RidgeClassifier(), RidgeClassifier())},
1: {'col': 6,
'th': 0.1461,
'loss': 0.1,
'samples': 80,
'children': (3, 4),
'models': (RidgeClassifier(), RidgeClassifier())},
2: {'col': 0,
'th': 2.6051,
'loss': 0.05,
'samples': 80,
'children': (9, 10),
'models': (RidgeClassifier(), RidgeClassifier())},
4: {'col': 0,
'th': -0.0708,
'loss': 0.0364,
'samples': 55,
'children': (5, 6),
'models': (RidgeClassifier(), RidgeClassifier())},
6: {'col': 2,
'th': -0.7986,
'loss': 0.0,
'samples': 32,
'children': (7, 8),
'models': (RidgeClassifier(), RidgeClassifier())},
9: {'col': 2,
'th': -0.0865,
'loss': 0.0,
'samples': 59,
'children': (11, 12),
'models': (RidgeClassifier(), RidgeClassifier())},
3: {'loss': 0.08,
'samples': 25,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
5: {'loss': 0.0,
'samples': 23,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
7: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1])},
8: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
11: {'loss': 0.0,
'samples': 32,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
12: {'loss': 0.0,
'samples': 27,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
10: {'loss': 0.0476,
'samples': 21,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])}}
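Not part of the original question: given the summary() structure above, a depth-first walk makes the assumed orientation explicit. A sketch, for a fitted model reg, under the assumption that children = (left, right) and the left child holds the samples with X[:, col] <= th:

def walk(summary, node_id=0, depth=0):
    # Depth-first traversal of the dict returned by reg.summary()
    node = summary[node_id]
    indent = '  ' * depth
    if 'children' in node:  # internal node: carries the split rule
        print(f"{indent}node {node_id}: X[:, {node['col']}] <= {node['th']}")
        left, right = node['children']
        walk(summary, left, depth + 1)   # assumed: samples <= threshold
        walk(summary, right, depth + 1)  # assumed: samples > threshold
    else:                   # leaf: carries the fitted linear model
        print(f"{indent}leaf {node_id}: {node['samples']} samples")

walk(reg.summary())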

Maximum Slope limiter

Hi @cerlymarco , thanks for developing this method into a good library.

I'm thinking that in some (perhaps most) cases we need a maximum slope for each regressor. The main idea is to prevent over-optimistic extrapolation in the prediction output.

If slope > max_slope, split that node further. A rough sketch of such a check follows.
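A rough sketch of the check, hypothetical and outside the library: after fitting, inspect each leaf model in summary() and flag slopes above a cap. For a fitted linear-tree model reg:

import numpy as np

MAX_SLOPE = 10.0  # hypothetical cap

# Leaves carry the fitted base estimator; its coefficients are the slopes
for node_id, node in reg.summary().items():
    if 'children' not in node:  # leaf
        slopes = np.abs(node['models'].coef_)
        if slopes.max() > MAX_SLOPE:
            print(f"leaf {node_id} exceeds the cap: {slopes.max():.3f}")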

Finding breakpoints

Hello,

Thank you for your nice tool. I am using LinearTreeRegressor to fit a continuous piecewise-linear function. It works well. I am wondering: is it possible to show the locations (the coordinates) of the breakpoints?

thank you
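Not an official answer: based on the summary() structure shown in an earlier issue, the split thresholds of the internal nodes are exactly the breakpoint coordinates. A sketch, for a fitted LinearTreeRegressor reg:

# Internal nodes carry the splits; each (col, th) pair is a breakpoint
breakpoints = [
    (node['col'], node['th'])        # (feature index, threshold value)
    for node in reg.summary().values()
    if 'children' in node
]
print(breakpoints)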

Performance and possibility to split only on subset of features

Hey, I have been playing around a lot with your linear trees and like them very much. Thanks!

Nevertheless, I am somewhat disappointed by the runtime performance. Compared to XGBoost regressors (I know it's not a fair comparison) or linear regressions (also not fair), the linear tree is really slow:
50k observations, 80 features: 2s for linear regression, 27s for XGBoost, and 300s for the linear tree.
Have you seen similar runtimes or might I be using it wrong?

Another aspect that's interesting to me is whether it is possible to limit the features that are used for splits. I haven't found it in the code. Any chance of seeing it in the future?
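For what it's worth, the project docs describe a split_features parameter that does exactly this. A sketch, assuming that parameter, with n_jobs added as a possible runtime lever:

from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    split_features=[0, 3, 7],  # only these columns are candidates for splits
    n_jobs=-1,                 # parallel split search may also cut runtime
)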

LinearTree does not fit well

Hi There!

I was working with the library when I noticed that the LinearTree does not fit the data even when I try to overfit it, as in the following image:

[plot: model predictions against the data]

Is there a reason for that?

Thanks

Use of categorical text attributes

Hello there!

This is a great package that I just found. I'm still experimenting with it, but it's working nicely.

I was trying to use categorical text features, but it seems the package only accepts numerical attributes and bins them internally to get the categories. Am I doing something wrong?
I’d love to give this project 5 stars.

Thanks!
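A sketch of a possible workaround, not an official answer: encode the text categories to integer codes first, then mark those columns with the categorical_features parameter from the docs.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor

X_text = np.array([['red'], ['green'], ['blue'], ['red']])
X_enc = OrdinalEncoder().fit_transform(X_text)  # text -> numeric codes

reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),
    categorical_features=[0],  # column 0 holds (encoded) categories
)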

How to gridsearch tree and regression parameters?

Hi, I am wondering how to perform a GridSearchCV to find the best parameters for both the tree and the regression model.
For now I am able to tune the tree component of my model:

from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.linear_model import ElasticNet
from lineartree import LinearForestRegressor

param_grid = {
    'n_estimators': [50, 100, 500, 700],
    'max_depth': [10, 20, 30, 50],
    'min_samples_split': [2, 4, 8, 16, 32],
    'max_features': ['sqrt', 'log2', None]
}
cv = RepeatedKFold(n_repeats=3,
                   n_splits=3,
                   random_state=1)

model = GridSearchCV(
    LinearForestRegressor(ElasticNet(random_state=0), random_state=42),
    param_grid=param_grid,
    n_jobs=-1,
    cv=cv,
    scoring='neg_root_mean_squared_error'
)
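A sketch of the missing piece, relying on the standard scikit-learn convention rather than anything library-specific: nested estimator parameters are addressed with double-underscore names, so the inner ElasticNet can be tuned in the same grid.

param_grid = {
    'max_depth': [10, 20, 30, 50],
    'base_estimator__alpha': [0.1, 1.0, 10.0],    # ElasticNet parameter
    'base_estimator__l1_ratio': [0.2, 0.5, 0.8],  # ElasticNet parameter
}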

Allow the hyperparameter "max_depth = 0".

Thanks for the good library.

When using LinearTreeRegressor, I think that max_depth is often optimized by cross-validation.

This library allows max_depth in the range 1-20. However, depending on the dataset, a simple linear regression may be the best fit. Even on such a dataset, max_depth is forced to be 1 or more, so a simple linear regression cannot be expressed with LinearTreeRegressor.

  • Of course, it is appropriate to use sklearn.linear_model.LinearRegression for such datasets.

My suggestion is to fall back to fitting base_estimator alone when "max_depth = 0".
With this change, LinearTreeRegressor could flexibly cover both segmented regression and simple regression by changing a single hyperparameter. A rough sketch follows.
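A rough sketch of the requested fallback, hypothetical and outside the library:

from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor

def make_model(max_depth, base_estimator=None):
    # max_depth == 0: plain base_estimator fitted on all the data
    base_estimator = base_estimator or LinearRegression()
    if max_depth == 0:
        return base_estimator
    return LinearTreeRegressor(base_estimator=base_estimator,
                               max_depth=max_depth)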

Make the threshold precision customizable

Currently, the threshold precision is hard-coded to 5 decimal places here. Making it customizable would allow the linear tree model to be used when the numbers involved are generally smaller. My suggestion is to add another parameter that defaults to 5; people who want to use the model with smaller numbers could then set it to the precision they need.

Let me know if this change sounds good and I can create a PR for it. I'm open to discussion.

numpy deprecation warning

/lineartree/_classes.py:338: DeprecationWarning:

the interpolation= argument to quantile was renamed to method=, which has additional options.
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they used. (Deprecated NumPy 1.22)

Seems like a quick update here would get this warning to stop showing up, right? I can always ignore it, but figured I would mention it in case it is actually an error on my side.

Also, sorry, I don't actually know what the best open-source etiquette is. If I'm supposed to create a pull request with a proposed fix instead of just mentioning it, feel free to correct me.
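For reference, the rename behind the warning (the method= spelling is the current NumPy API, 1.22+):

import numpy as np

data = np.arange(10.0)
# old, deprecated spelling:
#   np.quantile(data, 0.5, interpolation='midpoint')
print(np.quantile(data, 0.5, method='midpoint'))  # current spelling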

Reference

Hi,

I really like your work with Linear Trees. I would like to ask whether there are any references, such as papers, that accurately describe the split procedure in the form of equations.

Thank you in advance,
George Moiragias

learning_rate in boosting

Hi, is there a way to set the learning_rate in the boosting regressors and classifiers?

EDIT:
Also, does LinearBoostRegressor fit a linear regression first and then boost the residuals via regression trees, or does it boost via a series of linear model trees?

Performing Split on Node with Perfect Results

I have an example where a split is performed on a node with a loss of 0. Take a look at the example below: it performs a split on node 1 (where the loss = 0). This split adds no value to the results, since the parent node (node 1) already gives perfect results.

Is this the intended behavior? Or should it not perform splits when the results are already perfect?

[plot: tree showing a split under a zero-loss node]

Non-coherent splitting results

Hello,
I have a dataframe with a column X >= 0. I added its index to the split_features parameter of LinearTreeRegressor.
I set max_depth to 1 and used LinearRegression() as the base estimator.
When I count the number of samples at node 1 (i.e. those assumed to be <= the threshold indicated at node 0), I realize that it doesn't correspond to my data for column X.
When I increase max_depth, some negative split thresholds appear, even though column X is >= 0 as noted above.
Do you normalize or scale the data somehow before training?
Thanks in advance !

Extract Coefficients

How can someone extract the coefficients of the linear model fitted in each leaf?
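A sketch, based on the summary() structure shown in an earlier issue: leaf nodes carry the fitted base estimator, whose coefficients live in coef_. For a fitted model reg:

for node_id, node in reg.summary().items():
    if 'children' not in node:   # leaf node
        model = node['models']   # fitted linear model for this leaf
        print(node_id, model.coef_, model.intercept_)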

LinearForestRegressor may give biased coefficients for base estimator

Hi There!

I am very interested in the linear-tree package and found it inspiring for my research. But when I used LinearForestRegressor in my study, I found that its base estimator gave biased coefficients (with absolute values that are too small), so that the prediction was essentially fitted by the forest estimator alone. The structure of the linear forest therefore ends up very similar to a plain random forest regressor. It may be due to round-off error: the source code calls self._validate_data with dtype "float32".

I generated a synthetic dataset to compare scikit-learn's LinearRegression with LinearForestRegressor. By the way, how can we deal with data whose features span multiple orders of magnitude? Will the base_estimator parameter support an sklearn Pipeline (for preprocessing like StandardScaler) in a future release?

Thank you for your excellent work!

import numpy as np
from lineartree import LinearForestRegressor
from sklearn.linear_model import LinearRegression

SEED = 1234


# Generate a synthetic dataset
X1 = np.random.randn(1000, 1) * 1 + 10
X2 = np.random.randn(1000, 1) * 1e7 + 3e7
X3 = np.random.randn(1000, 1) * 100 + 200
X4 = np.random.randn(1000, 1) + 500
X5 = np.random.randn(1000, 1) + 1000
X6 = np.random.randn(1000, 1)
X7 = np.random.randn(1000, 1)
X8 = np.random.rand(1000, 1)

X = np.concatenate([X1, X2, X3, X4, X5, X6, X7, X8], axis=1)
y = X1 + np.sin(X2 * X6) + (X3 / 1e6) ** 2 + X4 / 1e3 + X2 / 1e7 + \
    X7 * X8 + np.random.randn(1000, 1) * 0.1
y = np.log(y)

# Fit a linear regression model
lr = LinearRegression()
lr.fit(X, y)
lr_coef = lr.coef_
print(lr_coef) 

# this will give [[ 7.49327164e-02  7.59350553e-09 -5.17630150e-06 -1.67616079e-05
#  -1.73796325e-03  3.13294480e-04  4.07092831e-02 -7.15923013e-03]]

# Fit a linear forest model
lf = LinearForestRegressor(base_estimator=LinearRegression(),
                           n_estimators=100, max_depth=5,
                           max_features=1.0, random_state=SEED)
lf.fit(X, y)
lf_coef = lf.coef_
print(lf_coef)

# this will give [ 1.3074668e-09  7.2390938e-09 -2.1693744e-05  9.1071959e-09
# -6.6003052e-09 -7.7589535e-09  7.1229582e-09  5.3837756e-09]

Error when running with multiple jobs: unexpected keyword argument 'target_offload'

I have been using your library for quite a while and am super happy with it. So first, thanks a lot!

Lately, I used my framework (which also uses your library) on a modern many-core server with many jobs. It worked fine. Then I updated everything via pip, and with 8 jobs on my MacBook I got the following error.

This error does not occur when using only a single job (I pass the number of jobs to n_jobs).

I cannot nail down the actual problem, but since it occurred right after the upgrade, I assume the update might be the cause?

Am I doing something wrong here?

"""
Traceback (most recent call last):
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 56, in __call__
    with config_context(**self.config):
  File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 239, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 82, in __init__
    self.gen = func(*args, **kwds)
TypeError: config_context() got an unexpected keyword argument 'target_offload'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "compression_selection_pipeline.py", line 41, in <module>
    model_pipeline.learn_runtime_models(calibration_result_dir)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 670, in learn_runtime_models
    non_splitting_models("table_scan", table_scans)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 590, in non_splitting_models
    fitted_model = model_dict["model"].fit(X_train, y_train)
  File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 209, in fit
    return self.regression.fit(X, y)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/lineartree.py", line 187, in fit
    self._fit(X, y, sample_weight)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 576, in _fit
    self._grow(X, y, sample_weight)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 387, in _grow
    loss=loss)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 285, in _split
    for feat in split_feat)
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: config_context() got an unexpected keyword argument 'target_offload'

PS: I have already left a star. :D

Rationale for rounding during _parallel_binning_fit and _grow

I noticed that the implementations of _parallel_binning_fit and _grow internally round loss values to 5 decimal places. This makes the regression results dependent on the scale of the labels, as data with a lower natural loss value will result in many different splits of the data having the same loss when rounded to 5 decimal places. Is there a reason why this is the case?

This behavior can be observed by fitting a LinearTreeRegressor using the default loss function and multiplying the scale of the labels by a small number (like 1e-9). This will result in the regressor no longer learning any splits.
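A sketch reproducing the reported behaviour, under the rounding behaviour described above:

import numpy as np
from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.where(X[:, 0] < 0, -X[:, 0], 2 * X[:, 0])  # piecewise-linear target

reg = LinearTreeRegressor(base_estimator=LinearRegression())
reg.fit(X, y)
print(len(reg.summary()))  # several nodes: splits were learned

reg.fit(X, y * 1e-9)       # same data, tiny label scale
print(len(reg.summary()))  # reportedly collapses: all losses round alike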

TypeError min_impurity_split

Hi,

I am trying the usage-LinearBoost notebook in Colab.

In cell 3:
regr = LinearBoostRegressor(Ridge(), loss='linear')
regr.fit(X, y)

I have a problem:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 regr = LinearBoostRegressor(Ridge(), loss='linear')
----> 2 regr.fit(X, y)

1 frames
/usr/local/lib/python3.7/dist-packages/lineartree/_classes.py in _fit(self, X, y, sample_weight)
    943     min_impurity_decrease=self.min_impurity_decrease,
    944     min_impurity_split=self.min_impurity_split,
--> 945     ccp_alpha=self.ccp_alpha
    946 )
    947

TypeError: __init__() got an unexpected keyword argument 'min_impurity_split'

I installed linear-tree with:
%pip install linear-tree
