scikit-learn-contrib / lightning

Large-scale linear classification, regression and ranking in Python

Home Page: https://contrib.scikit-learn.org/lightning/
Hello,

I am using the `Penalty` class in sag_fast.pyx and I need to propagate an exception which occurs inside the `projection` method. Since the return type of the function is `void`, Cython is not able to propagate this error to the layers above, as explained here. The solution they propose is to change the return type to `int` and add `except -1` to the declaration.
I wonder if it would be OK to change

```cython
cdef void projection(self,
                     double* w,
                     int* indices,
                     double stepsize,
                     int n_nz):
```

to

```cython
cdef int projection(self,
                    double* w,
                    int* indices,
                    double stepsize,
                    int n_nz) except -1:
```
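For context, a minimal sketch (a hypothetical module, not lightning code) of why the `except -1` clause matters:

```cython
# Without `except -1`, Cython cannot propagate a Python exception out of
# a function with a C return type; the error would be printed and
# swallowed. With the clause, raising works as expected:
cdef int might_fail(double x) except -1:
    if x < 0:
        raise ValueError("x must be non-negative")  # reaches the caller
    return 0
```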
At the moment I am not able to test this change. This is due to the fact that I am working on a Windows environment, and as setup.py compiles the Cython-generated `.cpp` files, that is something I am not able to do on my computer (even though it works on AppVeyor). The best way would be to remove the cpp files and force compilation of the extensions by Cython at setup time, as you already proposed.

If you are OK with the type change, I will send a pull request whenever I am able to use Linux, where the package compiles correctly.
It is stated that the CDClassifier object supports a group lasso ("l1/l2") penalty, yet it is not clear to me how the groups in the group penalty are specified.
Hey,
What do you think about making a new release? Is there anything blocking it?
We need to add a changelog to track the changes in each release.
Cython 0.15.1 is OK, while Cython 0.17.1 crashes:

```
lightning/kernel_fast.pyx:202:23: Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(kernel_fast.pyx:9:0)
StatListNode.stats[9] = StatListNode(kernel_fast.pyx:138:5)
StatListNode.stats[0] = CClassDefNode(kernel_fast.pyx:138:5,
    as_name = u'KernelCache',
    base_class_module = u'',
    base_class_name = u'Kernel',
    class_name = u'KernelCache',
    module_name = u'',
    visibility = u'private')
CClassDefNode.body = StatListNode(kernel_fast.pyx:140:4)
StatListNode.stats[5] = CFuncDefNode(kernel_fast.pyx:189:9,
    args = [...]/2,
    modifiers = [...]/0,
    visibility = u'private')

File 'Nodes.py', line 343, in analyse_expressions: StatListNode(kernel_fast.pyx:190:8)
File 'Nodes.py', line 4283, in analyse_expressions: SingleAssignmentNode(kernel_fast.pyx:202:29)
File 'Nodes.py', line 4389, in analyse_types: SingleAssignmentNode(kernel_fast.pyx:202:29)
File 'ExprNodes.py', line 2601, in analyse_target_types: IndexNode(kernel_fast.pyx:202:23,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 3017, in is_lvalue: IndexNode(kernel_fast.pyx:202:23,
    result_is_used = True,
    use_managed_ref = True)

Compiler crash traceback from this point on:
  File "/usr/local/lib/python2.6/dist-packages/Cython/Compiler/ExprNodes.py", line 3017, in is_lvalue
    return not base_type.base_type.is_array
AttributeError: 'CppClassType' object has no attribute 'base_type'
make: *** [lightning/kernel_fast.cpp] Error 1
```
Hello,

Are there any plans to include an FTRL solver in this library? I guess it's fairly straightforward (a sketch of the update follows below). I'm sorry, I might not be able to contribute it myself because I am not fluent in Cython etc.

Reference:

Thanks!
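For readers unfamiliar with it, here is a minimal NumPy sketch of the per-coordinate FTRL-Proximal update (following McMahan et al., 2013); the names and defaults are illustrative, not a lightning API:

```python
import numpy as np

def ftrl_update(z, n, w, g, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
    """One FTRL-Proximal step; z and n are per-coordinate accumulators,
    w the current weights, g the current gradient. Updates z and n in
    place and returns the new weights."""
    sigma = (np.sqrt(n + g ** 2) - np.sqrt(n)) / alpha
    z += g - sigma * w
    n += g ** 2
    # Closed-form solution of the per-coordinate proximal problem.
    return np.where(
        np.abs(z) <= l1,
        0.0,
        -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2),
    )
```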
The theory gives step sizes for which the algorithms are guaranteed to converge (see the papers). We need to add an `eta="auto"` option.
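As a rough sketch of what `eta="auto"` could compute, one common choice for SAG-type solvers is `1 / L`, with `L` the Lipschitz constant of the per-sample gradient; this mirrors scikit-learn's `get_auto_step_size`, and the function below is illustrative (dense `X` assumed for simplicity), not lightning's API:

```python
import numpy as np

def auto_step_size(X, alpha, loss="log"):
    # Lipschitz constant of an individual gradient:
    # log loss:     L = 0.25 * max_i ||x_i||^2 + alpha
    # squared loss: L =        max_i ||x_i||^2 + alpha
    max_squared_norm = np.max(np.sum(np.asarray(X) ** 2, axis=1))
    factor = {"log": 0.25, "squared": 1.0}[loss]
    return 1.0 / (factor * max_squared_norm + alpha)
```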
I've observed that SAG increases the objective function in the first epoch. This would be OK occasionally, except that I'm seeing this behaviour consistently across different datasets, which leads me to think that there might be a bug in the implementation. I'm not seeing this behaviour with SAGA (see also http://fa.bianp.net/blog/2016/saga-algorithm-in-the-lightning-library/).
As far as I know, classifiers in scikit-learn now support non-integer class labels such as strings, so we should handle non-integer classes as well.
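The usual scikit-learn mechanism for this, shown here as a minimal sketch, is to map arbitrary labels to integers with `LabelEncoder` before fitting:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_int = le.fit_transform(["spam", "ham", "spam"])
print(y_int)        # [1 0 1]
print(le.classes_)  # ['ham' 'spam']
```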
Love the fact that this package offers group lasso! You're probably aware, but with the lasso one often combines cross-validation with a one-standard-error (1SE) rule, where one chooses the model with the fewest coefficients whose error is less than 1SE away from that of the sub-model with the lowest error.

Is there an example anywhere combining your group lasso with cross-validation, or any thoughts on implementing the 1SE functionality? Thanks again for this wonderful resource!
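For concreteness, a minimal sketch of the 1SE rule on top of a CV sweep; the variable names are assumptions, not an existing lightning or scikit-learn API:

```python
import numpy as np

def one_se_rule(alphas, mean_err, std_err):
    """Pick the most parsimonious model (strongest regularization) whose
    CV error is within one standard error of the best model's error.
    Assumes `alphas` is sorted in increasing regularization strength."""
    best = np.argmin(mean_err)
    threshold = mean_err[best] + std_err[best]
    admissible = np.where(mean_err <= threshold)[0]
    return alphas[admissible.max()]
```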
The crash seems to depend on the particular data and the value of `alpha`, but not on the `random_state`. Changing `alpha` or turning off debiasing works. Maybe the particular sparsity pattern that is found is problematic?

Code and data are available here.

Running Mac OS X 10.8, Python 2.7.3, Cython 0.17.4, numpy 1.6.2, scipy 0.11.0.

```
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000001006d0000
0x000000010260eb54 in __pyx_f_9lightning_14primal_cd_fast_12LossFunction__lipschitz_constant (__pyx_v_self=0x1022bad10, __pyx_v_X=0x1022bad10, __pyx_v_scale=<value temporarily unavailable, due to optimizations>, __pyx_v_out=0x100652000) at lightning/primal_cd_fast.cpp:2140
2140    (__pyx_v_out[__pyx_t_5]) = ((__pyx_v_out[__pyx_t_5]) + ((__pyx_v_scale * (__pyx_v_data[__pyx_v_ii])) * (__pyx_v_data[__pyx_v_ii])));
```
The `_get_random_state` implementation in base.py forces all estimators to take only integer seed values for the `random_state` attribute. I'd like the ability to pass an existing RandomState object. This is useful when using lightning objects as part of other estimators.

Unfortunately I'm not sure if this is compatible with the randomkit backport here.
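A minimal sketch of the accepted-input logic, mirroring scikit-learn's `check_random_state` (whether this can feed the randomkit backport is exactly the open question):

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Accept None, an int seed, or an existing RandomState instance."""
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(int(seed))
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)
```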
Following scikit-learn, and now that we have a release, we should probably remove the CPP files from the repo.
Hey Mathieu,

Is there a reason that predict_proba is not implemented for multi-class classification? The multi-class log loss is the proper log loss, so predict_proba should just be the exponentiated and normalized decision function, right? Or am I overlooking something?

I'm trying to get a more or less calibrated logistic model.
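In other words, something like this minimal sketch (illustrative only, not an existing lightning method):

```python
import numpy as np

def predict_proba_softmax(decision):
    """Softmax of the decision function, shape (n_samples, n_classes)."""
    z = decision - decision.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```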
Running predict_proba on a `SAGClassifier`-derived object gives the following error:

```
  File "C:\Users\M.casotto\AppData\Local\Continuum\Anaconda2\lib\site-packages\lightning\impl\base.py", line 42, in predict_proba
    if len(self.classes_) != 2:
AttributeError: 'StructuredSparsitySAGA' object has no attribute 'classes_'
```

The `self.classes_` attribute is defined inside the BaseClassifier method `_set_label_transformers`, which encodes the response vector as a vector of 1/-1 values (`neg_label=-1` by default):
```python
def _set_label_transformers(self, y, reencode=False, neg_label=-1):
    if reencode:
        self.label_encoder_ = LabelEncoder()
        y = self.label_encoder_.fit_transform(y).astype(np.int32)
    else:
        y = y.astype(np.int32)
    self.label_binarizer_ = LabelBinarizer(neg_label=neg_label,
                                           pos_label=1)
    self.label_binarizer_.fit(y)
    self.classes_ = self.label_binarizer_.classes_.astype(np.int32)
    n_classes = len(self.label_binarizer_.classes_)
    n_vectors = 1 if n_classes <= 2 else n_classes
    return y, n_classes, n_vectors
```
Unfortunately, in the derived class `SAGClassifier`, when the `fit` method is called, the response vector is cast to 1/-1 using `LabelBinarizer` directly instead of `_set_label_transformers`; see for example:

```python
class SAGClassifier(BaseClassifier, _BaseSAG):
    def fit(self, X, y):
        if not self.is_saga and self.penalty is not None:
            raise ValueError('Penalties in SAGClassifier. Please use '
                             'SAGAClassifier instead.'
                             '.')
        self.label_binarizer_ = LabelBinarizer(neg_label=-1, pos_label=1)
        Y = np.asfortranarray(self.label_binarizer_.fit_transform(y),
                              dtype=np.float64)
        return self._fit(X, Y)
```
As I am not able to compile the lightning package I cannot test this, but changing the above code to

```python
class SAGClassifier(BaseClassifier, _BaseSAG):
    def fit(self, X, y):
        if not self.is_saga and self.penalty is not None:
            raise ValueError('Penalties in SAGClassifier. Please use '
                             'SAGAClassifier instead.'
                             '.')
        # _set_label_transformers fits label_binarizer_ and sets classes_.
        y_binned, _, _ = self._set_label_transformers(y, neg_label=-1)
        Y = np.asfortranarray(self.label_binarizer_.transform(y_binned),
                              dtype=np.float64)
        return self._fit(X, Y)
```

should work fine.
Also compute objective value at the same time.
Choose best parameter combination
```
Traceback (most recent call last):
  File "bench_linear.py", line 76, in <module>
    X_test, y_test)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/memory.py", line 171, in __call__
    return self.call(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/memory.py", line 323, in call
    output = self.func(*args, **kwargs)
  File "bench_linear.py", line 30, in fit
    gs.fit(X_tr, y_tr)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 354, in fit
    return self._fit(X, y)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 392, in _fit
    for clf_params in grid for train, test in cv)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 473, in __call__
    self.dispatch(function, args, kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 296, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/externals/joblib/parallel.py", line 124, in __init__
    self.results = func(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/grid_search.py", line 110, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python2.6/dist-packages/lightning-0.1_git-py2.6-linux-i686.egg/lightning/primal_cd.py", line 93, in fit
    self.callback, verbose=self.verbose)
  File "primal_cd_fast.pyx", line 780, in lightning.primal_cd_fast._primal_cd_l2r (lightning/primal_cd_fast.cpp:6018)
ValueError: Buffer dtype mismatch, expected 'double' but got 'long'
```
```
None
fit the model
Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/lib/python2.6/lib-tk/Tkinter.py", line 1413, in __call__
    return self.func(*args)
  File "svm_gui.py", line 71, in fit
    X = train[:, 0:2]
IndexError: too many indices
```

Python 2.6.5, Cython 0.15.1, numpy 1.3.0, scipy 0.7.0, scikit-learn 0.13-git.
It seems to me that lightning never fits the intercept. So is it necessary to have an `intercept_` attribute?
`PrimalLinearSVC(penalty="l2")` gives poor accuracy in the multiclass case. Still, it is a very solid baseline.
The name `eta` means different things depending on the classifier: on SAG* classifiers it means the step size, for Fista it means the decrease factor for line search, while on SGD* it is called `eta0`. I propose to homogenize by renaming the step size to `step_size` on all classifiers and deprecating the other names.
Hi @mblondel. Some of the recent additions (such as SAGA) don't show up on the webpage. Would you mind pushing a new version of the docs? (I wouldn't mind doing it myself if it were on GitHub Pages.)
SAG doesn't give the same results with dense and sparse data:

```python
from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import SAGClassifier

bunch = fetch_20newsgroups_vectorized(subset="test")
X = bunch.data
y = bunch.target

# Restrict to the first two classes.
X = X[y <= 1]
y = y[y <= 1]
X = X.toarray()  # remove this line to compare against the sparse version

clf = SAGClassifier(loss="squared_hinge",
                    max_iter=1,
                    alpha=1e-2,
                    tol=1e-3,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```
This is a reminder for me to implement a while loop in the adaptive strategy so that the Lipschitz condition is always verified.
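A minimal sketch of the idea (generic backtracking, not lightning's adaptive strategy): inflate the Lipschitz estimate `L` until the quadratic upper bound holds at the candidate point.

```python
import numpy as np

def backtracking_step(f, grad, x, L=1.0, eta=2.0):
    """Increase L until f(x_new) <= f(x) + g.(x_new - x) + L/2 ||x_new - x||^2."""
    g = grad(x)
    while True:
        x_new = x - g / L
        d = x_new - x
        if f(x_new) <= f(x) + g @ d + 0.5 * L * (d @ d):
            return x_new, L
        L *= eta  # Lipschitz condition violated: try a smaller step
```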
It doesn't look like there's much that's not Python 3 compatible. We could use 2to3 as the path of least resistance.
SGD has very fast early convergence. SDCA, SAG and SVRG can benefit from an SGD-based warm start.
It would be nice to add SPDC:
http://arxiv.org/abs/1409.3257
Select the coordinate whose first derivative is largest in L1 norm. See, e.g., "Iteration Complexity of Randomized Block-Coordinate Descent Methods for Minimizing a Composite Function", Section 4.
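For the single-coordinate case this is the Gauss-Southwell rule; a minimal sketch:

```python
import numpy as np

def select_coordinate(grad):
    """Greedy selection: coordinate with the largest |partial derivative|."""
    return int(np.argmax(np.abs(grad)))
```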
We should check the estimators in lightning with `check_estimator` from scikit-learn. I expect that this will raise a number of issues.
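A minimal sketch of what's proposed (recent scikit-learn versions expect an instance; older ones took the class):

```python
from sklearn.utils.estimator_checks import check_estimator
from lightning.classification import CDClassifier

check_estimator(CDClassifier())  # raises on the first failing check
```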
lasso => l1
ridge => l2
group_lasso => l1/l2
This is useful because people often confuse l1/l2 with elastic-net.
Hi @mblondel. There is a huge performance difference in terms of speed when I turn on `warm_start=True` for the `CDClassifier`, for all penalties. I find it to be much faster than the sklearn implementation when `warm_start` is on, and much slower when it is off.
```
In [1]: from sklearn.datasets import fetch_20newsgroups_vectorized
In [2]: from lightning.classification import CDClassifier
In [3]: datasets = fetch_20newsgroups_vectorized(subset='train')
In [4]: X, y = datasets.data, datasets.target
In [5]: from sklearn.linear_model import LogisticRegression
In [6]: clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=50, tol=1e-3)
In [11]: %timeit clf.fit(X, y)
1 loops, best of 3: 25.9 s per loop
In [7]: test_data = fetch_20newsgroups_vectorized(subset='test')
In [8]: X_test, y_test = test_data.data, test_data.target
In [13]: clf.score(X_test, y_test)
Out[13]: 0.72729686670207117
In [14]: cd_cold = CDClassifier(warm_start=False, multiclass=True, max_iter=50, tol=1e-3, loss='log')
In [15]: cd_warm = CDClassifier(warm_start=True, multiclass=True, max_iter=50, tol=1e-3, loss='log')
In [16]: %timeit cd_warm.fit(X, y)
1 loops, best of 3: 4.9 s per loop
In [17]: cd_warm.score(X_test, y_test)
Out[17]: 0.73274030801911838
In [18]: %timeit cd_cold.fit(X, y)
1 loops, best of 3: 4min 4s per loop
In [20]: cd_cold.score(X_test, y_test)
Out[20]: 0.73274030801911838
```
Any reason not to set it as the default?
I'm working on a package that uses lightning Cython code as a dependency via `from lightning.impl.dataset_fast cimport ColumnDataset`. When installing lightning via conda or pip, generating the Cython file fails, but if I distribute the generated cpp files the code runs fine. Should the .pxd files be distributed with lightning to allow this use case?
Using Travis.
Following scikit-learn-contrib/scikit-learn-contrib#3, we need to move lightning's documentation to the gh-pages branch of the lightning repo.
This popped up when I ran `CDRegressor(warm_start=True, permute=True, penalty="l1")`:

```
_lasso_fx_selection
    est.fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/lightning/impl/primal_cd.py", line 430, in fit
    if self.kernel:
AttributeError: 'CDRegressor' object has no attribute 'kernel'
```
I'd like to do a 0.1 release and upload binary packages to pypi and conda. TODO:
What do you think @mblondel ?
Hey,
What does it take to implement partial_fit in lightning? Is there a reason it is not implemented?
I would like to have the possibility to fit an intercept for (at least) SAGAClassifier and SAGARegressor.
@mblondel I'm not sure if this is expected, but I ran a few quick benchmarks:

```python
from time import time

import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.linear_model import LogisticRegression
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Binarize, to remove the effect of parallelization.
y[y != 1] = -1

time_logistic = []
time_lightning = []
Cs = np.logspace(-4, 4, 10)

for C in Cs:
    print(C)

    clf = LogisticRegression(penalty='l1', tol=0.0001, fit_intercept=False,
                             C=C)
    t = time()
    clf.fit(X, y)
    time_logistic.append(time() - t)
    print(time_logistic)

    cl = CDClassifier(loss='log', tol=0.0001, max_iter=100, max_steps=0, C=C,
                      penalty='l1')
    t = time()
    cl.fit(X, y)
    time_lightning.append(time() - t)
    print(time_lightning)
```
I get times like these for a grid of 10 Cs from `np.logspace(-4, 4, 10)`:

```
time_lightning
[0.20100116729736328, 0.6052899360656738, 0.7211019992828369,
 2.470484972000122, 4.043258190155029, 7.791965007781982,
 10.92172908782959, 13.969007968902588, 12.534989833831787,
 5.275091886520386]

time_logistic
[0.08612680435180664, 0.22542500495910645, 0.5105628967285156,
 0.5970029830932617, 0.642221212387085, 0.8863811492919922,
 1.241279125213623, 1.1004469394683838, 0.9302711486816406,
 0.8940119743347168]
```
Can this be installed using pip in a virtualenv? I have scikit-learn and cython successfully installed this way, but when I try to install lightning via

```
pip install https://github.com/mblondel/lightning/archive/master.zip
```

I get the error:

```
g++: error: dataset_fast.cpp: No such file or directory
g++: fatal error: no input files
compilation terminated.
g++: error: dataset_fast.cpp: No such file or directory
g++: fatal error: no input files
compilation terminated.
error: Command "g++ -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -I/tmp/myproject/.env/local/lib/python2.7/site-packages/numpy/core/include -I/tmp/pip-lqLKbu-build/lightning/random -I/tmp/myproject/.env/local/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c dataset_fast.cpp -o build/temp.linux-x86_64-2.7/dataset_fast.o" failed with exit status 4
```
https://github.com/mblondel/lightning/blob/master/doc/intro.rst

I can't use the `currentmodule` directive since classifiers and regressors are not in the same module. Any ideas, @ogrisel or @amueller?
I have the option to install via pip:

```
pip install https://github.com/mblondel/lightning/archive/master.zip
```

Or from source:

```
git clone https://github.com/mblondel/lightning.git
cd lightning
python setup.py build
sudo python setup.py install
```

What I wanted to know was:
Hi,

I've just found this library and it seems pretty good, congratulations! I see that you support group lasso, and I was wondering if there is a plan to support sparse group lasso, since that shouldn't be much trouble (with all due respect). I didn't dig into the code too much, so it may even be there already, but at first sight I didn't see it.

I actually saw @fabianp's implementation of sparse group lasso in a personal repo, from which you can recover the standard group lasso solution just by setting the parameter alpha = 0, and I thought it may be useful to have a fast and good implementation here.
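For reference, a minimal sketch of the penalty under discussion; with `alpha = 0` it reduces to the standard group lasso (notation assumed, not lightning's API):

```python
import numpy as np

def sparse_group_lasso_penalty(w, groups, alpha, lam=1.0):
    """lam * (alpha * ||w||_1 + (1 - alpha) * sum_g ||w_g||_2)."""
    l1 = np.abs(w).sum()
    group_l2 = sum(np.linalg.norm(w[g]) for g in groups)
    return lam * (alpha * l1 + (1 - alpha) * group_l2)
```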
Either way, thank you for this great work!
Just a personal reminder :)
This loss seems to segfault for the case of 2 classes. Here is a minimal reproducing example:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import svd, diagsvd
from sklearn.utils.testing import assert_almost_equal
from sklearn.datasets import load_digits
from lightning.impl.datasets.samples_generator import make_classification
from lightning.classification import FistaClassifier
from lightning.regression import FistaRegressor

bin_dense, bin_target = make_classification(n_samples=200, n_features=100,
                                            n_informative=5,
                                            n_classes=2, random_state=0)
bin_target = bin_target * 2 - 1

mult_dense, mult_target = make_classification(n_samples=300, n_features=100,
                                              n_informative=5,
                                              n_classes=3, random_state=0)
bin_csr = sp.csr_matrix(bin_dense)
mult_csr = sp.csr_matrix(mult_dense)
digit = load_digits(2)


def test_fista_multiclass_l1l2_log():
    for data in (mult_dense, mult_csr):
        clf = FistaClassifier(max_iter=200, penalty="l1/l2", loss="log",
                              multiclass=True)
        clf.fit(data, bin_target)

test_fista_multiclass_l1l2_log()
```
Alternatively, removing boundscheck=False from loss_fast.pyx makes a more meaningful traceback appear:

```
Traceback (most recent call last):
  File "t.py", line 33, in <module>
    test_fista_multiclass_l1l2_log()
  File "t.py", line 31, in test_fista_multiclass_l1l2_log
    clf.fit(data, bin_target)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 216, in fit
    return self._fit(X, y, n_vectors)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 60, in _fit
    obj = self._get_regularized_objective(df, y, loss, penalty, coef)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 37, in _get_regularized_objective
    obj = self._get_objective(df, y, loss)
  File "/Users/fabian/dev/lightning/lightning/impl/fista.py", line 34, in _get_objective
    return self.C * loss.objective(df, y)
  File "lightning/impl/loss_fast.pyx", line 244, in lightning.impl.loss_fast.MulticlassLog.objective (lightning/impl/loss_fast.cpp:6262)
    cpdef objective(self,
  File "lightning/impl/loss_fast.pyx", line 259, in lightning.impl.loss_fast.MulticlassLog.objective (lightning/impl/loss_fast.cpp:6041)
    tmp = df[i, k] - df[i, y[i]]
IndexError: Out of bounds on buffer access (axis 1)
```

The problem seems to be that `df` is of size (n_samples, 1), so when `y[i] = 1` the access is out of bounds.
Example using synthetic data from one of the unit tests. CD fails to converge and starts oscillating after iteration 25. Here is the example: https://gist.github.com/pprett/44d8bb3cbfe84c06a158

@mblondel, is this to be expected, or a poor choice of hyperparameters?