scikit-learn: machine learning in Python
Home Page: https://scikit-learn.org
License: BSD 3-Clause "New" or "Revised" License
When n_classes > 2, the precision / recall / f1-score need to be averaged in some way.
Currently the code in precision_recall_fscore_support does:
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
Since true_pos, false_pos and false_neg are arrays of size n_classes, precision and recall are also arrays of the same size. Then to obtain a single average, the weighted sum is taken.
In the literature, the macro-average and micro-average are usually used but as far as I understand the current code does neither one. The macro is the unweighted average of the precision/recall taken separately for each class. Therefore it is an average over classes. The micro average on the contrary is an average over instances: therefore classes which have many instances are given more importance. However, AFAIK it's not the same as taking the weighted average as currently done in the code.
I think the code should be:
micro_avg_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
micro_avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
macro_avg_precision = np.mean(true_pos / (true_pos + false_pos))
macro_avg_recall = np.mean(true_pos / (true_pos + false_neg))
It's easy to fix (add a micro=True|False option) but the tests may be a pain to update :-/
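To make the two conventions concrete, here is a small self-contained sketch of the proposed micro and macro averages; the count arrays below are made-up toy values, not taken from any real dataset.

```python
import numpy as np

# Toy per-class counts for a 3-class problem (illustrative values only)
true_pos = np.array([8, 1, 3])
false_pos = np.array([2, 1, 1])
false_neg = np.array([1, 2, 2])

# Micro-average: pool the counts over all classes, then divide once.
# Classes with many instances therefore dominate the result.
micro_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
micro_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())

# Macro-average: compute per-class scores first, then take the
# unweighted mean, so every class counts equally.
macro_precision = np.mean(true_pos / (true_pos + false_pos))
macro_recall = np.mean(true_pos / (true_pos + false_neg))
```

Note how the two disagree on the same counts, which is exactly why the averaging mode needs to be an explicit option.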
The beta parameter is extensively used but never explained.
Given that building the various images in the user guide requires downloading several large datasets, would it be possible to distribute a tarball containing a prebuilt copy of the docs (e.g., in HTML)? This would be helpful for scikit-learn package maintainers for various distributions, because it would obviate the need to include large datasets in the source packages just to build the docs properly.
title says it all.
The keyword econ=True has been removed from scipy.linalg.qr; write a compatibility layer for this.
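A minimal sketch of what such a layer could look like, assuming the replacement spelling is mode='economic' (the helper name qr_economic is hypothetical, not an existing API):

```python
import numpy as np
import scipy.linalg


def qr_economic(A, **kwargs):
    """Economy-size QR that works across SciPy versions (hypothetical helper).

    Tries the newer mode='economic' keyword first and falls back to the
    older econ=True spelling on SciPy releases that still use it.
    """
    try:
        return scipy.linalg.qr(A, mode='economic', **kwargs)
    except TypeError:
        # Older SciPy: mode='economic' not recognized
        return scipy.linalg.qr(A, econ=True)


A = np.arange(15, dtype=float).reshape(5, 3)
Q, R = qr_economic(A)  # Q is 5x3, R is 3x3 in economy mode
```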
What does the output represent? What do the different columns represent? Etc.
covariance.py assumes that the data is centered.
There should be an option to enable centering.
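A sketch of what such an option could do; the function and parameter names below are assumptions for illustration, not the actual covariance.py API:

```python
import numpy as np


def empirical_covariance(X, center=True):
    """Empirical covariance with an optional centering step (sketch).

    With center=False the data is assumed to be pre-centered, which is
    what the current code silently requires.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)  # remove the per-feature mean
    return X.T @ X / X.shape[0]
```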
The logistic regression should have an entry under 'Generalized Linear Models' and the intro of the GLM section should point to it to do regression.
Right now it is not clear from the documentation that the scikit even does logistic regression. It should appear in the table of contents.
The following sentence:
SVMs perform classification as a function of some subset of the training data, called the support vectors. These vectors can be accessed in member support_:
is intended to describe how to obtain the support vectors, so the term "support_" should be "support_vectors_".
The subsequent example should likewise be:
>>> clf.support_vectors_
array([[ 0., 0.],
[ 1., 1.]])
Here is a report I received by anonymous private mail:
https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py
Line 176:
175 elif hasattr(init, '__array__'):
176 centers = np.asanyarray(init).copy()
177 elif callable(init):
You take predefined centers as an optional initialization method. You copy them directly into k-means, but you don't account for the fact that the X data has already been centered on line 167:
167 X -= Xmean
Also, when you return the centers, you make sure to add Xmean back:
208 return best_centers + Xmean, best_labels, best_inertia
This seems like a bug, but I could be wrong in some very subtle way.
The obvious fix for this would be to replace line 176 with this:
centers = np.asanyarray(init).copy() - Xmean
The following code creates a segfault:
from scikits.learn import svm, datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
svr = svm.SVR(probability=True)
svr.fit(X, y)
Without any surprise, inspecting the gdb traceback tells us that the segfault is in the call to libsvm_train on line 145 in svm/base.py. The first lines of the gdb backtrace are:
Program received signal SIGSEGV, Segmentation fault.
__memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
1360 ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S: No such file or directory.
in ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S
(gdb) bt
#0 __memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
#1 0x015ff843 in copy_probB (data=0x8a594b0 "h\344G", model=0x89e0f68, dims=0x89d1268) at /usr/include/bits/string3.h:52
#2 0x016147ac in __pyx_pf_7_libsvm_libsvm_train (__pyx_self=0x0, __pyx_args=(...), __pyx_kwds=0x0) at scikits/learn/svm/src/libsvm/_libsvm.c:2111
#3 0x080ddd23 in call_function (f=Frame 0x8a43d9c, for file /home/varoquau/dev/scikit-learn/scikits/learn/svm/base.py, line 150, in fit (...), throwflag=0) at ../Python/ceval.c:3750
Rename Gallery -> Example Gallery
User guide should be more nested
h2, h3 should be padded to one side
See comments in 1927389
Needs to explain what criterion is used to select the optimal parameter.
On my Windows box I get:
arn\datasets\lfw.py", line 32, in <module>
from scipy.misc import imread
ImportError: cannot import name imread
See comments on commit 269ab1a
Currently, we have three levels of objects for text feature extraction:
In this proposal, we would like to merge Preprocessor into Analyzer and to introduce new methods. This should give the user more flexibility for supporting different (natural) languages.
An analyzer should implement 4 methods:
The class hierarchy could look like this:
Furthermore, we can have an EnglishWordAnalyzer to handle things like stop words removal and more elaborate processing for English syntax.
ChineseWordAnalyzer and JapaneseWordAnalyzer will likely require external dependencies (library, dictionary/probabilistic model). Thus they are out of the scope of the project but we may want to provide them in a gist.
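Since the issue does not spell out the four methods, here is a purely illustrative sketch of what the merged hierarchy could look like; every class and method name below (preprocess, tokenize, filter_tokens, analyze) is an assumption, not the proposed API:

```python
class WordAnalyzer(object):
    """Hypothetical merged Analyzer: preprocessing folded into analysis."""

    stop_words = frozenset()

    def preprocess(self, text):
        # normalization that used to live in Preprocessor
        return text.lower()

    def tokenize(self, text):
        return text.split()

    def filter_tokens(self, tokens):
        # language-specific filtering, e.g. stop word removal
        return [t for t in tokens if t not in self.stop_words]

    def analyze(self, text):
        return self.filter_tokens(self.tokenize(self.preprocess(text)))


class EnglishWordAnalyzer(WordAnalyzer):
    # a tiny stand-in stop word list for illustration
    stop_words = frozenset(['the', 'a', 'of'])
```

Language-specific subclasses then only override the pieces they need, which is the flexibility the proposal is after.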
>>> clf = LassoLARS()
>>> clf.fit([[0, 0], [1, 1]], [0, 1], alpha=0.0).coef_
array([ NaN, NaN])
https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cross_val.py#L177
Shouldn't this allow k=n? Isn't that the definition of leave-one-out?
Bind the functions cross_validation (liblinear) and svm_cross_validation (libsvm).
When I do a search on the webpage, I seem to get answers from many different versions of the Sphinx documentation (0.4, 0.5, 0.6).
/software/python/nipype0.3/lib/python2.6/site-packages/scikits.learn-0.6_git-py2.6-linux-x86_64.egg/scikits/learn/glm/base.pyc in predict(self, X)
40 """
41 X = np.asanyarray(X)
---> 42 return np.dot(X, self.coef_) + self.intercept_
assigned to me
see
http://en.wikipedia.org/wiki/Mallows'_Cp
It's a way to select the regularization parameter of LARS / Lasso without using cross-validation.
R implements it.
It should output an array of coefficients, as lars_path does.
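For reference, a small sketch of the criterion itself, using the common form Cp = SSE_p / sigma² - n + 2p (the helper name is illustrative; sigma² is usually estimated from the full model):

```python
import numpy as np


def mallows_cp(y, y_pred, sigma2, p):
    """Mallows' Cp for a model with p parameters (illustrative sketch).

    sigma2 is an estimate of the noise variance, typically from the
    largest model considered; models with Cp close to p are favored.
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)  # residual sum of squares
    return sse / sigma2 - n + 2 * p
```

Computing this along the LARS path for each active-set size gives the coefficient array mentioned above, without any cross-validation.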
I am using the scikits.learn on very large sparse data. I have problems when using the sparse SVM with the 'poly' kernel. I attach a simple test case based on the iris example from the scikits.learn website. In this example I use 'linear' and 'poly' kernels both using the dense and sparse implementations. As the graphs show the 'linear' kernel gives similar results (sparse vs dense) but the sparse implementation of 'poly' gives wrong results.
I am using scikits.learn version 0.7.1, and I have tested it on both Windows 32-bit and Windows 64-bit. I am using scipy version 0.8 on the win32 platform and scipy 0.9rc3 on the win64 platform.
"""
==================================================
Plot different SVM classifiers in the iris dataset
==================================================
Comparison of different linear SVM classifiers on the iris dataset. It
will plot the decision surface for four different SVM classifiers.
"""
print __doc__
import numpy as np
import pylab as pl
from scikits.learn import svm, datasets
import scipy as sp
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
Xs = sp.sparse.lil_matrix( X ).tocsr()
Y = iris.target
h=.02 # step size in the mesh
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
svc = svm.SVC(kernel='linear').fit(X, Y)
rbf_svc = svm.SVC(kernel='poly').fit(X, Y)
ssvc = svm.sparse.SVC(kernel='linear').fit(Xs, Y)
srbf_svc = svm.sparse.SVC(kernel='poly').fit(Xs, Y)
# create a mesh to plot in
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'SVC with polynomial (degree 3) kernel',
'Sparse SVC with linear kernel',
'Sparse SVC with polynomial (degree 3) kernel']
pl.set_cmap(pl.cm.Paired)
for i, clf in enumerate((svc, rbf_svc, ssvc, srbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    pl.subplot(2, 2, i + 1)
    Xp = np.c_[xx.ravel(), yy.ravel()]
    if i > 1:
        Xp = sp.sparse.lil_matrix(Xp).tocsr()
    Z = clf.predict(Xp)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.contourf(xx, yy, Z)
    pl.axis('tight')
    # Plot also the training points
    pl.scatter(X[:, 0], X[:, 1], c=Y)
    pl.title(titles[i])
    pl.axis('tight')
pl.show()
Not high priority but would be nice to have a command line interface. Some possible features:
Examples:
$ skl fit --format svmlight --model model.pickle preprocessing.Scaling pca.PCA svm.LinearSVC --input training_data.txt
$ skl predict --format svmlight --model model.pickle --input test_data.txt --output predictions.txt
When --input is not provided, the input is read from stdin.
I think the decision_function output should be sorted just as we do for predict_proba, where we sort by class label as in probas[:, np.argsort(self.label_)].
See comments on 16dc776.
Some functions are missing in the sparse SVMs: probability prediction and the decision function.
This is a proposal to use python's logging module instead of using stdout and verbose flags in the models API.
Using the logging module would make it easier for the user to control the verbosity of the scikit using a single and well documented configuration interface and logging API.
Multinomial Naive Bayes is a simple algorithm which scales well and has probabilistic output.
use_prior=True/False would be a nice option in the constructor.
In case loops are slow, use Cython.
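The algorithm fits in a few lines of NumPy; here is a minimal sketch with the proposed use_prior option (the class name and attribute layout are assumptions, not the eventual scikit implementation):

```python
import numpy as np


class SimpleMultinomialNB:
    """Minimal multinomial Naive Bayes sketch with Laplace smoothing."""

    def fit(self, X, y, use_prior=True, alpha=1.0):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        log_probs, log_priors = [], []
        for c in self.classes_:
            Xc = X[y == c]
            counts = Xc.sum(axis=0) + alpha  # smoothed feature counts
            log_probs.append(np.log(counts / counts.sum()))
            # use_prior=False corresponds to a uniform class prior
            prior = len(Xc) / len(X) if use_prior else 1.0 / len(self.classes_)
            log_priors.append(np.log(prior))
        self.feature_log_prob_ = np.array(log_probs)
        self.class_log_prior_ = np.array(log_priors)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # joint log-likelihood; also the basis for probabilistic output
        jll = X @ self.feature_log_prob_.T + self.class_log_prior_
        return self.classes_[np.argmax(jll, axis=1)]
```

Everything is vectorized over classes and features, so plain NumPy should already scale well before reaching for Cython.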
Most classifier models have a parameter named class_weights for the fit method. It would be more user-friendly to also have it as a constructor parameter, to be able to grid-search over it and potentially to have class_weights='auto' enabled by default.
Reference link is broken
For precomputed kernels, a square matrix is not an efficient way to store the kernel matrix (since the kernel matrix is symmetric).
We should create a kernel object interface instead. Advantages:
The object could be numpy-compatible:
kernel = GaussianKernel(X_train, sigma=0.5)
print kernel[i, j]  # recompute only if not cached
print kernel.compute(X_test)
Question: shall we create our own LRU object or shall we just bind libsvm's?
Since there's plan to bind libsvm's cross-validation code, this also means that the cache will be used more efficiently for cross-validation even when kernel="precomputed".
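A rough sketch of the proposed interface with a naive per-entry cache (the class is hypothetical; a real version would use an LRU policy, whether home-grown or libsvm's):

```python
import numpy as np


class GaussianKernel:
    """Sketch of a caching kernel object with the proposed interface."""

    def __init__(self, X, sigma=0.5):
        self.X = np.asarray(X, dtype=float)
        self.sigma = sigma
        self._cache = {}  # naive dict cache; stand-in for a real LRU

    def __getitem__(self, idx):
        i, j = idx
        if (i, j) not in self._cache:
            d2 = np.sum((self.X[i] - self.X[j]) ** 2)
            self._cache[(i, j)] = np.exp(-d2 / (2 * self.sigma ** 2))
        return self._cache[(i, j)]

    def compute(self, Y):
        """Full kernel matrix between the training data and Y."""
        Y = np.asarray(Y, dtype=float)
        d2 = ((self.X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.sigma ** 2))
```

Since K[i, j] == K[j, i], the cache could also normalize the index pair to exploit symmetry and halve the storage, which a square precomputed matrix cannot do.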
(Sorry for opening many tickets lately: I actually intend to help close them when I get more time ;-)
Affinity propagation does not handle the identity matrix correctly:
In [81]: s = np.array([[1, 0], [0, 1]])
In [82]: affinity_propagation(s, verbose=True)
Did not converged
Out[82]:
(None, array([[ nan],
[ nan]]))
When s has a float dtype, affinity propagation converges, so the cause might be a numerical consistency issue with integer input.
BTW, it seems there are two ways to report bugs, SourceForge and GitHub. Do you have any preference?
Add priors on the mean instead of assuming that the prior means are zero.
Add more references in the doc.
Modify ARD in order to use a vector of hyperparameters for the precision instead of a single value.
Spelling: defaut -> default (unless you insist on using French). Also, spelling out ARD (Automatic Relevance Determination) regression in the docstring, as in the comment line, would be useful.
Thanks to Josef for the remarks.
BallTree provides distances other than Euclidean, e.g. l1/Manhattan. Use it!
It would be nice to have loaders for common file formats such as libsvm's or weka's.
The loaders should have two modes: batch and online. In the latter case, we could have an iterator that spits X matrices of a given chunk size (suitable for partial_fit)
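A sketch of what the online mode could look like for the svmlight format, which stores one sample per line as "label index:value index:value ..." with 1-based indices (the function name and signature are assumptions for illustration):

```python
import numpy as np


def iter_svmlight_chunks(lines, n_features, chunk_size=1000):
    """Hypothetical online loader: yield (X, y) chunks from svmlight lines.

    Each yielded chunk has at most chunk_size rows, which makes it
    suitable for feeding partial_fit one chunk at a time.
    """
    X_rows, ys = [], []
    for line in lines:
        parts = line.split()
        ys.append(float(parts[0]))       # first token is the label
        row = np.zeros(n_features)
        for tok in parts[1:]:
            idx, val = tok.split(":")
            row[int(idx) - 1] = float(val)  # svmlight indices are 1-based
        X_rows.append(row)
        if len(X_rows) == chunk_size:
            yield np.array(X_rows), np.array(ys)
            X_rows, ys = [], []
    if X_rows:  # flush the last, possibly smaller, chunk
        yield np.array(X_rows), np.array(ys)
```

The batch mode then falls out for free: concatenate all chunks into one (X, y) pair.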
Assigned to me
Put the slides somewhere so that they can be reused:
http://www.scribd.com/doc/36583672/Statistical-Learning-and-Text-Classification-with-NLTK-and-scikit-learn
For consistency and to be able to write an efficient pipeline, all transform methods should honor a copy=True|False parameter.
I'm not on my working environment so I cannot check right now but we should make a list of the methods that need to be fixed in this ticket.
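To illustrate the convention being proposed, here is a toy transform honoring copy=True|False; the function is a made-up example, not one of the methods to be fixed:

```python
import numpy as np


def scale_demo(X, copy=True):
    """Center X; with copy=False, modify the caller's array in place
    (illustrative sketch of the proposed copy=True|False convention)."""
    # copy=False avoids allocation when X is already a float ndarray
    X = np.array(X, dtype=float, copy=copy)
    X -= X.mean(axis=0)  # in place: mutates the input when copy=False
    return X
```

In a pipeline, copy=False on intermediate steps avoids allocating a fresh array at every stage, which is the efficiency argument above.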
The k-means center initialization in https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py, function k_init(), is not k-means++. The paper authors' implementation (http://www.stanford.edu/~darthur/kMeansppTest.zip, Utils.cpp, chooseSmartCenters()) does the following:
scikit-learn's implementation (taken from pybrain, which in turn took it from Yong Sun's blog at http://blogs.sun.com/yongsun/entry/k_means_and_k_means) does this instead:
The authors' implementation samples numLocalTries points with D^2 weighting and chooses the best among those (repeated for each of the k-1 centers to find). For all the results in their paper, the authors used numLocalTries==1. (Only in their "Conclusion and future work" section they state that "experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased \phi as much as possible", and in their code you can see they tried numLocalTries==2+log(k).)
scikit-learn completely omits the sampling step (authors' Utils.cpp, lines 299-305) and instead greedily chooses the center that minimizes the potential. While this is not necessarily bad, it is not k-means++.
The easy way to fix this would be changing the documentation to not refer to k-means++ any longer (and find out how this greedy scheme is called in the literature; I assume somebody described it already), the better way would be fixing the implementation. I will do the latter (unless I decide I don't need it) and post back here; until then just take this as a warning.
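For reference, a sketch of the authors' scheme as described above: sample each new center with D² weighting, and with numLocalTries > 1 greedily keep the sampled candidate that lowers the potential the most (the function name and use of NumPy's Generator API are my own choices, not the scikit code):

```python
import numpy as np


def k_init_pp(X, k, n_local_trials=1, seed=0):
    """Sketch of k-means++ seeding (Arthur & Vassilvitskii)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    # d2[i] = squared distance from X[i] to its closest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # sample candidates with probability proportional to D^2
        probs = d2 / d2.sum()
        candidates = rng.choice(n, size=n_local_trials, p=probs)
        # greedily keep the candidate that minimizes the potential
        pots = [np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1)).sum()
                for c in candidates]
        best = candidates[int(np.argmin(pots))]
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(axis=1))
        centers.append(X[best])
    return np.array(centers)
```

With n_local_trials=1 this is exactly the scheme used for all the results in the paper; the current scikit code instead skips the sampling step entirely.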