scikit-learn-contrib / lightning

Large-scale linear classification, regression and ranking in Python

Home Page: https://contrib.scikit-learn.org/lightning/
Hello,

I am using the `Penalty` class in sag_fast.pyx and I need to propagate an exception which occurs inside the `projection` method. Since the return type of the function is `void`, Cython is not able to propagate this error to the layers above, as explained here. The solution they propose is to change the return type to `int` and add `except -1` to the declaration.
I wonder if it would be OK to change

```cython
cdef void projection(self,
                     double* w,
                     int* indices,
                     double stepsize,
                     int n_nz):
```

to

```cython
cdef int projection(self,
                    double* w,
                    int* indices,
                    double stepsize,
                    int n_nz) except -1:
```
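For context, a minimal sketch (a hypothetical module, not lightning code) of why the `except -1` clause matters:

```cython
# Without `except -1`, Cython cannot propagate a Python exception out of
# a function with a C return type; the error would be printed and
# swallowed. With the clause, raising works as expected:
cdef int might_fail(double x) except -1:
    if x < 0:
        raise ValueError("x must be non-negative")  # reaches the caller
    return 0
```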
At the moment I am not able to test this change. This is due to the fact that I am working on a Windows environment, and as setup.py compiles the Cython-generated `.cpp` files, that is something I am not able to do on my computer (even though it works on AppVeyor). The best way would be to remove the cpp files and force compilation of the extensions by Cython at setup time, as you already proposed.

If you are OK with the type change, I will send a pull request whenever I am able to use Linux, where the package compiles correctly.
It is stated that the CDClassifier object supports a group lasso ("l1/l2") penalty, yet it is not clear to me how the groups in the group penalty are specified.
Hey,
What do you think about making a new release? Is there anything blocking it?
We need to add a changelog to track the changes in each release.
Cython 0.15.1 is OK, while Cython 0.17.1 crashes:

```
lightning/kernel_fast.pyx:202:23: Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(kernel_fast.pyx:9:0)
StatListNode.stats[9] = StatListNode(kernel_fast.pyx:138:5)
StatListNode.stats[0] = CClassDefNode(kernel_fast.pyx:138:5,
    as_name = u'KernelCache',
    base_class_module = u'',
    base_class_name = u'Kernel',
    class_name = u'KernelCache',
    module_name = u'',
    visibility = u'private')
CClassDefNode.body = StatListNode(kernel_fast.pyx:140:4)
StatListNode.stats[5] = CFuncDefNode(kernel_fast.pyx:189:9,
    args = [...]/2,
    modifiers = [...]/0,
    visibility = u'private')

File 'Nodes.py', line 343, in analyse_expressions: StatListNode(kernel_fast.pyx:190:8)
File 'Nodes.py', line 4283, in analyse_expressions: SingleAssignmentNode(kernel_fast.pyx:202:29)
File 'Nodes.py', line 4389, in analyse_types: SingleAssignmentNode(kernel_fast.pyx:202:29)
File 'ExprNodes.py', line 2601, in analyse_target_types: IndexNode(kernel_fast.pyx:202:23,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 3017, in is_lvalue: IndexNode(kernel_fast.pyx:202:23,
    result_is_used = True,
    use_managed_ref = True)

Compiler crash traceback from this point on:
  File "/usr/local/lib/python2.6/dist-packages/Cython/Compiler/ExprNodes.py", line 3017, in is_lvalue
    return not base_type.base_type.is_array
AttributeError: 'CppClassType' object has no attribute 'base_type'
make: *** [lightning/kernel_fast.cpp] Error 1
```
Hello,

Are there any plans to include an FTRL solver in this library? I guess it's fairly straightforward (a sketch of the update follows below). I'm sorry, I might not be able to contribute it myself because I am not fluent in Cython etc.

Reference:

Thanks!
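For readers unfamiliar with it, here is a minimal NumPy sketch of the per-coordinate FTRL-Proximal update (following McMahan et al., 2013); the names and defaults are illustrative, not a lightning API:

```python
import numpy as np

def ftrl_update(z, n, w, g, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
    """One FTRL-Proximal step; z and n are per-coordinate accumulators,
    w the current weights, g the current gradient. Updates z and n in
    place and returns the new weights."""
    sigma = (np.sqrt(n + g ** 2) - np.sqrt(n)) / alpha
    z += g - sigma * w
    n += g ** 2
    # Closed-form solution of the per-coordinate proximal problem.
    return np.where(
        np.abs(z) <= l1,
        0.0,
        -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2),
    )
```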
The theory gives step sizes for which the algorithms are guaranteed to converge (see the papers). We need to add an `eta="auto"` option.
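As a rough sketch of what `eta="auto"` could compute, one common choice for SAG-type solvers is `1 / L`, with `L` the Lipschitz constant of the per-sample gradient; this mirrors scikit-learn's `get_auto_step_size`, and the function below is illustrative (dense `X` assumed for simplicity), not lightning's API:

```python
import numpy as np

def auto_step_size(X, alpha, loss="log"):
    # Lipschitz constant of an individual gradient:
    # log loss:     L = 0.25 * max_i ||x_i||^2 + alpha
    # squared loss: L =        max_i ||x_i||^2 + alpha
    max_squared_norm = np.max(np.sum(np.asarray(X) ** 2, axis=1))
    factor = {"log": 0.25, "squared": 1.0}[loss]
    return 1.0 / (factor * max_squared_norm + alpha)
```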
I've observed that SAG increases the objective function in the first epoch. This would be OK occasionally, except that I'm seeing this behaviour consistently across different datasets, which leads me to think that there might be a bug in the implementation. I'm not seeing this behaviour with SAGA (see also http://fa.bianp.net/blog/2016/saga-algorithm-in-the-lightning-library/).
As far as I know, classifiers in scikit-learn now support non-integer class labels such as strings, so we should handle non-integer classes as well.
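The usual scikit-learn mechanism for this, shown here as a minimal sketch, is to map arbitrary labels to integers with `LabelEncoder` before fitting:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_int = le.fit_transform(["spam", "ham", "spam"])
print(y_int)        # [1 0 1]
print(le.classes_)  # ['ham' 'spam']
```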
Love the fact that this package offers group lasso! You're probably aware, but with the lasso one often combines cross-validation with a one-standard-error (1SE) rule, where one chooses the model with the fewest coefficients whose error is less than 1SE away from that of the sub-model with the lowest error.

Is there an example anywhere combining your group lasso with cross-validation, or any thoughts on implementing the 1SE functionality? Thanks again for this wonderful resource!
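For concreteness, a minimal sketch of the 1SE rule on top of a CV sweep; the variable names are assumptions, not an existing lightning or scikit-learn API:

```python
import numpy as np

def one_se_rule(alphas, mean_err, std_err):
    """Pick the most parsimonious model (strongest regularization) whose
    CV error is within one standard error of the best model's error.
    Assumes `alphas` is sorted in increasing regularization strength."""
    best = np.argmin(mean_err)
    threshold = mean_err[best] + std_err[best]
    admissible = np.where(mean_err <= threshold)[0]
    return alphas[admissible.max()]
```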
The crash seems to depend on the particular data and the value of `alpha`, but not on the `random_state`. Changing `alpha` or turning off debiasing works. Maybe the particular sparsity pattern that is found is problematic?

Code and data are available here.

Running Mac OS X 10.8, Python 2.7.3, Cython 0.17.4, numpy 1.6.2, scipy 0.11.0.

```
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000001006d0000
0x000000010260eb54 in __pyx_f_9lightning_14primal_cd_fast_12LossFunction__lipschitz_constant (__pyx_v_self=0x1022bad10, __pyx_v_X=0x1022bad10, __pyx_v_scale=<value temporarily unavailable, due to optimizations>, __pyx_v_out=0x100652000) at lightning/primal_cd_fast.cpp:2140
2140    (__pyx_v_out[__pyx_t_5]) = ((__pyx_v_out[__pyx_t_5]) + ((__pyx_v_scale * (__pyx_v_data[__pyx_v_ii])) * (__pyx_v_data[__pyx_v_ii])));
```
The `_get_random_state` implementation in base.py forces all estimators to take only integer seed values for the `random_state` attribute. I'd like the ability to pass an existing RandomState object. This is useful when using lightning objects as part of other estimators.

Unfortunately I'm not sure if this is compatible with the randomkit backport here.
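A minimal sketch of the accepted-input logic, mirroring scikit-learn's `check_random_state` (whether this can feed the randomkit backport is exactly the open question):

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Accept None, an int seed, or an existing RandomState instance."""
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(int(seed))
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)
```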
Following scikit-learn, and now that we have a release, we should probably remove the CPP files from the repo.
Hey Mathieu,

Is there a reason that predict_proba is not implemented for multi-class classification? The multi-class log loss is the proper log loss, so predict_proba should just be the exponentiated and normalized decision function, right? Or am I overlooking something?

I'm trying to get a more or less calibrated logistic model.
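In other words, something like this minimal sketch (illustrative only, not an existing lightning method):

```python
import numpy as np

def predict_proba_softmax(decision):
    """Softmax of the decision function, shape (n_samples, n_classes)."""
    z = decision - decision.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```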
Running predict_proba on a `SAGClassifier`-derived object gives the following error:

```
  File "C:\Users\M.casotto\AppData\Local\Continuum\Anaconda2\lib\site-packages\lightning\impl\base.py", line 42, in predict_proba
    if len(self.classes_) != 2:
AttributeError: 'StructuredSparsitySAGA' object has no attribute 'classes_'
```

The `self.classes_` attribute is defined inside the BaseClassifier method `_set_label_transformers`, which encodes the response vector as a vector of 1/-1 values (`neg_label=-1` by default):
```python
def _set_label_transformers(self, y, reencode=False, neg_label=-1):
    if reencode:
        self.label_encoder_ = LabelEncoder()
        y = self.label_encoder_.fit_transform(y).astype(np.int32)
    else:
        y = y.astype(np.int32)
    self.label_binarizer_ = LabelBinarizer(neg_label=neg_label,
                                           pos_label=1)
    self.label_binarizer_.fit(y)
    self.classes_ = self.label_binarizer_.classes_.astype(np.int32)
    n_classes = len(self.label_binarizer_.classes_)
    n_vectors = 1 if n_classes <= 2 else n_classes
    return y, n_classes, n_vectors
```
Unfortunately, in the derived class `SAGClassifier`, when the `fit` method is called, the response vector is cast to 1/-1 using `LabelBinarizer` directly instead of `_set_label_transformers`; see for example:

```python
class SAGClassifier(BaseClassifier, _BaseSAG):
    def fit(self, X, y):
        if not self.is_saga and self.penalty is not None:
            raise ValueError('Penalties in SAGClassifier. Please use '
                             'SAGAClassifier instead.'
                             '.')
        self.label_binarizer_ = LabelBinarizer(neg_label=-1, pos_label=1)
        Y = np.asfortranarray(self.label_binarizer_.fit_transform(y),
                              dtype=np.float64)
        return self._fit(X, Y)
```
As I am not able to compile the lightning package I cannot test this, but changing the above code to

```python
class SAGClassifier(BaseClassifier, _BaseSAG):
    def fit(self, X, y):
        if not self.is_saga and self.penalty is not None:
            raise ValueError('Penalties in SAGClassifier. Please use '
                             'SAGAClassifier instead.'
                             '.')
        # _set_label_transformers fits label_binarizer_ and sets classes_.
        y_binned, _, _ = self._set_label_transformers(y, neg_label=-1)
        Y = np.asfortranarray(self.label_binarizer_.transform(y_binned),
                              dtype=np.float64)
        return self._fit(X, Y)
```

should work fine.
Also compute objective value at the same time.
Choose best parameter combination
```
Traceback (most recent call last):
  File "bench_linear.py", line 76, in <module>
    X_test, y_test)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/memory.py", line 171, in __call__
    return self.call(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/memory.py", line 323, in call
    output = self.func(*args, **kwargs)
  File "bench_linear.py", line 30, in fit
    gs.fit(X_tr, y_tr)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 354, in fit
    return self._fit(X, y)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 392, in _fit
    for clf_params in grid for train, test in cv)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 473, in __call__
    self.dispatch(function, args, kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 296, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 124, in __init__
    self.results = func(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 110, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python2.6/dist-packages/lightning-0.1_git-py2.6-linux-i686.egg/lightning/primal_cd.py", line 93, in fit
    self.callback, verbose=self.verbose)
  File "primal_cd_fast.pyx", line 780, in lightning.primal_cd_fast._primal_cd_l2r (lightning/primal_cd_fast.cpp:6018)
ValueError: Buffer dtype mismatch, expected 'double' but got 'long'
```
```
None
fit the model
Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/lib/python2.6/lib-tk/Tkinter.py", line 1413, in __call__
    return self.func(*args)
  File "svm_gui.py", line 71, in fit
    X = train[:, 0:2]
IndexError: too many indices
```

Python 2.6.5, Cython 0.15.1, numpy 1.3.0, scipy 0.7.0, scikit-learn 0.13-git.
It seems to me that lightning never fits the intercept. So is it necessary to have an `intercept_` attribute?
`PrimalLinearSVC(penalty="l2")` gives poor accuracy in the multiclass case. Still, it is a very solid baseline.
The name `eta` means different things depending on the classifier: on SAG* classifiers it means the step size, for Fista it means the decrease factor for line search, while on SGD* it is called `eta0`. I propose to homogenize by renaming the step size to `step_size` on all classifiers and deprecating the other names.
Hi @mblondel. Some of the recent additions (such as SAGA) don't show up on the webpage. Would you mind pushing a new version of the docs? (I wouldn't mind doing it myself if it were on GitHub Pages.)
SAG doesn't give the same results with dense and sparse data:

```python
from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import SAGClassifier

bunch = fetch_20newsgroups_vectorized(subset="test")
X = bunch.data
y = bunch.target

# Restrict to the first two classes.
X = X[y <= 1]
y = y[y <= 1]
X = X.toarray()  # remove this line to compare against the sparse version

clf = SAGClassifier(loss="squared_hinge",
                    max_iter=1,
                    alpha=1e-2,
                    tol=1e-3,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```
This is a reminder for me to implement a while loop in the adaptive strategy so that the Lipschitz condition is always verified.
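A minimal sketch of the idea (generic backtracking, not lightning's adaptive strategy): inflate the Lipschitz estimate `L` until the quadratic upper bound holds at the candidate point.

```python
import numpy as np

def backtracking_step(f, grad, x, L=1.0, eta=2.0):
    """Increase L until f(x_new) <= f(x) + g.(x_new - x) + L/2 ||x_new - x||^2."""
    g = grad(x)
    while True:
        x_new = x - g / L
        d = x_new - x
        if f(x_new) <= f(x) + g @ d + 0.5 * L * (d @ d):
            return x_new, L
        L *= eta  # Lipschitz condition violated: try a smaller step
```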
It doesn't look like there's much that's not Python 3 compatible. We could use 2to3 as the path of least resistance.
SGD has very fast early convergence. SDCA, SAG and SVRG can benefit from an SGD-based warm start.
It would be nice to add SPDC:
http://arxiv.org/abs/1409.3257
Select the coordinate whose first derivative is largest in L1 norm. See, e.g., "Iteration Complexity of Randomized Block-Coordinate Descent Methods for Minimizing a Composite Function", Section 4.
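For the single-coordinate case this is the Gauss-Southwell rule; a minimal sketch:

```python
import numpy as np

def select_coordinate(grad):
    """Greedy selection: coordinate with the largest |partial derivative|."""
    return int(np.argmax(np.abs(grad)))
```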
We should check the estimators in lightning with `check_estimator` from scikit-learn. I expect that this will raise a number of issues.
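A minimal sketch of what's proposed (recent scikit-learn versions expect an instance; older ones took the class):

```python
from sklearn.utils.estimator_checks import check_estimator
from lightning.classification import CDClassifier

check_estimator(CDClassifier())  # raises on the first failing check
```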
lasso => l1
ridge => l2
group_lasso => l1/l2
This is useful because people often confuse l1/l2 with elastic-net.
Hi @mblondel. There is a huge performance difference in terms of speed when I turn on `warm_start=True` for the `CDClassifier`, for all penalties. I find it to be much faster than the sklearn implementation when `warm_start` is on, and much slower when it is off.
```
In [1]: from sklearn.datasets import fetch_20newsgroups_vectorized
In [2]: from lightning.classification import CDClassifier
In [3]: datasets = fetch_20newsgroups_vectorized(subset='train')
In [4]: X, y = datasets.data, datasets.target
In [5]: from sklearn.linear_model import LogisticRegression
In [6]: clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=50, tol=1e-3)
In [11]: %timeit clf.fit(X, y)
1 loops, best of 3: 25.9 s per loop
In [7]: test_data = fetch_20newsgroups_vectorized(subset='test')
In [8]: X_test, y_test = test_data.data, test_data.target
In [13]: clf.score(X_test, y_test)
Out[13]: 0.72729686670207117
In [14]: cd_cold = CDClassifier(warm_start=False, multiclass=True, max_iter=50, tol=1e-3, loss='log')
In [15]: cd_warm = CDClassifier(warm_start=True, multiclass=True, max_iter=50, tol=1e-3, loss='log')
In [16]: %timeit cd_warm.fit(X, y)
1 loops, best of 3: 4.9 s per loop
In [17]: cd_warm.score(X_test, y_test)
Out[17]: 0.73274030801911838
In [18]: %timeit cd_cold.fit(X, y)
1 loops, best of 3: 4min 4s per loop
In [20]: cd_cold.score(X_test, y_test)
Out[20]: 0.73274030801911838
```
Any reason not to set it as the default?
I'm working on a package that uses lightning Cython code as a dependency via `from lightning.impl.dataset_fast cimport ColumnDataset`. When installing lightning via conda or pip, generating the Cython file fails, but if I distribute the generated cpp files the code runs fine. Should the .pxd files be distributed with lightning to allow this use case?
Using Travis.
Following scikit-learn-contrib/scikit-learn-contrib#3, we need to move lightning's documentation to the gh-pages branch of the lightning repo.
This popped up when I ran `CDRegressor(warm_start=True, permute=True, penalty="l1")`:

```
_lasso_fx_selection
    est.fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/lightning/impl/primal_cd.py", line 430, in fit
    if self.kernel:
AttributeError: 'CDRegressor' object has no attribute 'kernel'
```
I'd like to do a 0.1 release and upload binary packages to pypi and conda. TODO:
What do you think @mblondel ?
Hey,
What does it take to implement partial_fit in lightning? Is there a reason it is not implemented?
I would like to have the possibility to fit an intercept for (at least) SAGAClassifier and SAGARegressor.
@mblondel I'm not sure if this is expected, but I ran a few quick benchmarks:

```python
from time import time

import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.linear_model import LogisticRegression
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Binarize, to remove the effect of parallelization.
y[y != 1] = -1

time_logistic = []
time_lightning = []
Cs = np.logspace(-4, 4, 10)

for C in Cs:
    print(C)

    clf = LogisticRegression(penalty='l1', tol=0.0001, fit_intercept=False,
                             C=C)
    t = time()
    clf.fit(X, y)
    time_logistic.append(time() - t)
    print(time_logistic)

    cl = CDClassifier(loss='log', tol=0.0001, max_iter=100, max_steps=0, C=C,
                      penalty='l1')
    t = time()
    cl.fit(X, y)
    time_lightning.append(time() - t)
    print(time_lightning)
```
I get times like these for a grid of 10 Cs from `np.logspace(-4, 4, 10)`:

```
time_lightning
[0.20100116729736328, 0.6052899360656738, 0.7211019992828369,
 2.470484972000122, 4.043258190155029, 7.791965007781982,
 10.92172908782959, 13.969007968902588, 12.534989833831787,
 5.275091886520386]

time_logistic
[0.08612680435180664, 0.22542500495910645, 0.5105628967285156,
 0.5970029830932617, 0.642221212387085, 0.8863811492919922,
 1.241279125213623, 1.1004469394683838, 0.9302711486816406,
 0.8940119743347168]
```
Can this be installed using pip in a virtualenv? I have scikit-learn and cython successfully installed this way, but when I try to install lightning via

```
pip install https://github.com/mblondel/lightning/archive/master.zip
```

I get the error:

```
g++: error: dataset_fast.cpp: No such file or directory
g++: fatal error: no input files
compilation terminated.
g++: error: dataset_fast.cpp: No such file or directory
g++: fatal error: no input files
compilation terminated.
error: Command "g++ -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -I/tmp/myproject/.env/local/lib/python2.7/site-packages/numpy/core/include -I/tmp/pip-lqLKbu-build/lightning/random -I/tmp/myproject/.env/local/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c dataset_fast.cpp -o build/temp.linux-x86_64-2.7/dataset_fast.o" failed with exit status 4
```
https://github.com/mblondel/lightning/blob/master/doc/intro.rst

I can't use the `currentmodule` directive since classifiers and regressors are not in the same module. Any ideas, @ogrisel or @amueller?
I have the option to install via pip:

```
pip install https://github.com/mblondel/lightning/archive/master.zip
```

Or from source:

```
git clone https://github.com/mblondel/lightning.git
cd lightning
python setup.py build
sudo python setup.py install
```

What I wanted to know was:
Hi,

I've just found this library and it seems pretty good, congratulations! I see that you support group lasso, and I was wondering if there is a plan to support sparse group lasso, since that shouldn't be much trouble (with all due respect). I didn't dig into the code too much, so it may even be there already, but at first sight I didn't see it.

I actually saw @fabianp's implementation of sparse group lasso in a personal repo, from which you can recover the standard group lasso solution just by setting the parameter alpha = 0, and I thought it may be useful to have a fast and good implementation here.
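For reference, a minimal sketch of the penalty under discussion; with `alpha = 0` it reduces to the standard group lasso (notation assumed, not lightning's API):

```python
import numpy as np

def sparse_group_lasso_penalty(w, groups, alpha, lam=1.0):
    """lam * (alpha * ||w||_1 + (1 - alpha) * sum_g ||w_g||_2)."""
    l1 = np.abs(w).sum()
    group_l2 = sum(np.linalg.norm(w[g]) for g in groups)
    return lam * (alpha * l1 + (1 - alpha) * group_l2)
```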
Either way, thank you for this great work!
Just a personal reminder :)
This loss seems to segfault for the case of 2 classes. Here is a minimal reproducing example:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import svd, diagsvd
from sklearn.utils.testing import assert_almost_equal
from sklearn.datasets import load_digits
from lightning.impl.datasets.samples_generator import make_classification
from lightning.classification import FistaClassifier
from lightning.regression import FistaRegressor

bin_dense, bin_target = make_classification(n_samples=200, n_features=100,
                                            n_informative=5,
                                            n_classes=2, random_state=0)
bin_target = bin_target * 2 - 1

mult_dense, mult_target = make_classification(n_samples=300, n_features=100,
                                              n_informative=5,
                                              n_classes=3, random_state=0)
bin_csr = sp.csr_matrix(bin_dense)
mult_csr = sp.csr_matrix(mult_dense)
digit = load_digits(2)


def test_fista_multiclass_l1l2_log():
    for data in (mult_dense, mult_csr):
        clf = FistaClassifier(max_iter=200, penalty="l1/l2", loss="log",
                              multiclass=True)
        clf.fit(data, bin_target)

test_fista_multiclass_l1l2_log()
```
Alternatively, removing boundscheck=False from loss_fast.pyx makes a more meaningful traceback appear:

```
Traceback (most recent call last):
  File "t.py", line 33, in <module>
    test_fista_multiclass_l1l2_log()
  File "t.py", line 31, in test_fista_multiclass_l1l2_log
    clf.fit(data, bin_target)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 216, in fit
    return self._fit(X, y, n_vectors)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 60, in _fit
    obj = self._get_regularized_objective(df, y, loss, penalty, coef)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 37, in _get_regularized_objective
    obj = self._get_objective(df, y, loss)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 34, in _get_objective
    return self.C * loss.objective(df, y)
  File "lightning/impl/loss_fast.pyx", line 244, in lightning.impl.loss_fast.MulticlassLog.objective (lightning/impl/loss_fast.cpp:6262)
    cpdef objective(self,
  File "lightning/impl/loss_fast.pyx", line 259, in lightning.impl.loss_fast.MulticlassLog.objective (lightning/impl/loss_fast.cpp:6041)
    tmp = df[i, k] - df[i, y[i]]
IndexError: Out of bounds on buffer access (axis 1)
```

The problem seems to be that `df` is of size (n_samples, 1), so when `y[i] = 1` the access is out of bounds.
Example using synthetic data from one of the unit tests. CD fails to converge and starts oscillating after iteration 25. Here is the example: https://gist.github.com/pprett/44d8bb3cbfe84c06a158

@mblondel, is this to be expected, or a poor choice of hyperparameters?