scikit-learn: machine learning in Python
Home Page: https://scikit-learn.org
License: BSD 3-Clause "New" or "Revised" License
When n_classes > 2, the precision / recall / f1-score need to be averaged in some way.
Currently the code in precision_recall_fscore_support does:
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
Since true_pos, false_pos and false_neg are arrays of size n_classes, precision and recall are also arrays of the same size. Then to obtain a single average, the weighted sum is taken.
In the literature, the macro-average and micro-average are usually used but as far as I understand the current code does neither one. The macro is the unweighted average of the precision/recall taken separately for each class. Therefore it is an average over classes. The micro average on the contrary is an average over instances: therefore classes which have many instances are given more importance. However, AFAIK it's not the same as taking the weighted average as currently done in the code.
I think the code should be:
micro_avg_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
micro_avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
macro_avg_precision = np.mean(true_pos / (true_pos + false_pos))
macro_avg_recall = np.mean(true_pos / (true_pos + false_neg))
It's easy to fix (add a micro=True|False option) but the tests may be a pain to update :-/
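To make the two conventions concrete, here is a small self-contained sketch of the proposed micro and macro averages; the count arrays below are made-up toy values, not taken from any real dataset.

```python
import numpy as np

# Toy per-class counts for a 3-class problem (illustrative values only)
true_pos = np.array([8, 1, 3])
false_pos = np.array([2, 1, 1])
false_neg = np.array([1, 2, 2])

# Micro-average: pool the counts over all classes, then divide once.
# Classes with many instances therefore dominate the result.
micro_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
micro_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())

# Macro-average: compute per-class scores first, then take the
# unweighted mean, so every class counts equally.
macro_precision = np.mean(true_pos / (true_pos + false_pos))
macro_recall = np.mean(true_pos / (true_pos + false_neg))
```

Note how the two disagree on the same counts, which is exactly why the averaging mode needs to be an explicit option.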
The beta parameter is extensively used but never explained.
Given that building the various images in the user guide requires downloading several large datasets, would it be possible to distribute a tarball containing a prebuilt copy of the docs (e.g., in HTML)? This would be helpful for scikit-learn package maintainers for various distributions, because it would obviate the need to include large datasets in the source packages just to build the docs properly.
title says it all.
The keyword econ=True has been removed from scipy.linalg.qr; write a compatibility layer for this.
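A minimal sketch of what such a layer could look like, assuming the replacement spelling is mode='economic' (the helper name qr_economic is hypothetical, not an existing API):

```python
import numpy as np
import scipy.linalg


def qr_economic(A, **kwargs):
    """Economy-size QR that works across SciPy versions (hypothetical helper).

    Tries the newer mode='economic' keyword first and falls back to the
    older econ=True spelling on SciPy releases that still use it.
    """
    try:
        return scipy.linalg.qr(A, mode='economic', **kwargs)
    except TypeError:
        # Older SciPy: mode='economic' not recognized
        return scipy.linalg.qr(A, econ=True)


A = np.arange(15, dtype=float).reshape(5, 3)
Q, R = qr_economic(A)  # Q is 5x3, R is 3x3 in economy mode
```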
What does the output represent? What do the different columns represent? Etc.
covariance.py assumes that the data is centered.
There should be an option to enable centering.
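A sketch of what such an option could do; the function and parameter names below are assumptions for illustration, not the actual covariance.py API:

```python
import numpy as np


def empirical_covariance(X, center=True):
    """Empirical covariance with an optional centering step (sketch).

    With center=False the data is assumed to be pre-centered, which is
    what the current code silently requires.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)  # remove the per-feature mean
    return X.T @ X / X.shape[0]
```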
The logistic regression should have an entry under 'Generalized Linear Models' and the intro of the GLM section should point to it to do regression.
Right now it is not clear from the documentation that the scikit even does logistic regression. It should appear in the table of contents.
The following sentence:
SVMs perform classification as a function of some subset of the training data, called the support vectors. These vectors can be accessed in member support_:
is intended to describe how to obtain the support vectors, so the term "support_" should be "support_vectors_".
The subsequent example should likewise be:
>>> clf.support_vectors_
array([[ 0., 0.],
[ 1., 1.]])
Here is a report I received by anonymous private mail:
https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py
Line 176:
175 elif hasattr(init, '__array__'):
176 centers = np.asanyarray(init).copy()
177 elif callable(init):
You take predefined centers as an optional initialization method. You copy them directly into k-means, but you don't account for the fact that the X data has already been centered on line 167:
167 X -= Xmean
Also, when you return the centers, you make sure to add Xmean back:
208 return best_centers + Xmean, best_labels, best_inertia
This seems like a bug, but I could be wrong in some very subtle way.
The obvious fix for this would be to replace line 176 with this:
centers = np.asanyarray(init).copy() - Xmean
The following code creates a segfault:
from scikits.learn import svm, datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
svr = svm.SVR(probability=True)
svr.fit(X, y)
Without any surprise, inspecting the gdb traceback tells us that the segfault is in the call to libsvm_train on line 145 in svm/base.py. The first lines of the gdb backtrace are:
Program received signal SIGSEGV, Segmentation fault.
__memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
1360 ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S: No such file or directory.
in ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S
(gdb) bt
#0 __memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
#1 0x015ff843 in copy_probB (data=0x8a594b0 "h\344G", model=0x89e0f68, dims=0x89d1268) at /usr/include/bits/string3.h:52
#2 0x016147ac in __pyx_pf_7_libsvm_libsvm_train (__pyx_self=0x0, __pyx_args=(...), __pyx_kwds=0x0) at scikits/learn/svm/src/libsvm/_libsvm.c:2111
#3 0x080ddd23 in call_function (f=Frame 0x8a43d9c, for file /home/varoquau/dev/scikit-learn/scikits/learn/svm/base.py, line 150, in fit (...), throwflag=0) at ../Python/ceval.c:3750
Rename Gallery -> Example Gallery
User guide should be more nested
h2, h3 should be padded to one side
See comments in 1927389
Needs to explain what criterion is used to select the optimal parameter.
On my Windows box I get:
arn\datasets\lfw.py", line 32, in <module>
from scipy.misc import imread
ImportError: cannot import name imread
See comments on commit 269ab1a
Currently, we have three levels of objects for text feature extraction:
In this proposal, we would like to merge Preprocessor into Analyzer and to introduce new methods. This should give the user more flexibility for supporting different (natural) languages.
An analyzer should implement 4 methods:
The class hierarchy could look like this:
Furthermore, we can have an EnglishWordAnalyzer to handle things like stop words removal and more elaborate processing for English syntax.
ChineseWordAnalyzer and JapaneseWordAnalyzer will likely require external dependencies (library, dictionary/probabilistic model). Thus they are out of the scope of the project but we may want to provide them in a gist.
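Since the issue does not spell out the four methods, here is a purely illustrative sketch of what the merged hierarchy could look like; every class and method name below (preprocess, tokenize, filter_tokens, analyze) is an assumption, not the proposed API:

```python
class WordAnalyzer(object):
    """Hypothetical merged Analyzer: preprocessing folded into analysis."""

    stop_words = frozenset()

    def preprocess(self, text):
        # normalization that used to live in Preprocessor
        return text.lower()

    def tokenize(self, text):
        return text.split()

    def filter_tokens(self, tokens):
        # language-specific filtering, e.g. stop word removal
        return [t for t in tokens if t not in self.stop_words]

    def analyze(self, text):
        return self.filter_tokens(self.tokenize(self.preprocess(text)))


class EnglishWordAnalyzer(WordAnalyzer):
    # a tiny stand-in stop word list for illustration
    stop_words = frozenset(['the', 'a', 'of'])
```

Language-specific subclasses then only override the pieces they need, which is the flexibility the proposal is after.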
>>> clf = LassoLARS()
>>> clf.fit([[0, 0], [1, 1]], [0, 1], alpha=0.0).coef_
array([ NaN, NaN])
https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cross_val.py#L177
Shouldn't this allow k=n? Isn't that the definition of leave-one-out?
Bind the functions cross_validation (liblinear) and svm_cross_validation (libsvm).
When I do a search on the webpage, I seem to get answers from many different versions of the Sphinx documentation (0.4, 0.5, 0.6).
/software/python/nipype0.3/lib/python2.6/site-packages/scikits.learn-0.6_git-py2.6-linux-x86_64.egg/scikits/learn/glm/base.pyc in predict(self, X)
40 """
41 X = np.asanyarray(X)
---> 42 return np.dot(X, self.coef_) + self.intercept_
assigned to me
see
http://en.wikipedia.org/wiki/Mallows'_Cp
It's a way to select the regularization parameter of LARS / Lasso without using cross-validation.
R implements it.
It should output an array of coefficients, as lars_path does.
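For reference, a small sketch of the criterion itself, using the common form Cp = SSE_p / sigma² - n + 2p (the helper name is illustrative; sigma² is usually estimated from the full model):

```python
import numpy as np


def mallows_cp(y, y_pred, sigma2, p):
    """Mallows' Cp for a model with p parameters (illustrative sketch).

    sigma2 is an estimate of the noise variance, typically from the
    largest model considered; models with Cp close to p are favored.
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)  # residual sum of squares
    return sse / sigma2 - n + 2 * p
```

Computing this along the LARS path for each active-set size gives the coefficient array mentioned above, without any cross-validation.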
I am using the scikits.learn on very large sparse data. I have problems when using the sparse SVM with the 'poly' kernel. I attach a simple test case based on the iris example from the scikits.learn website. In this example I use 'linear' and 'poly' kernels both using the dense and sparse implementations. As the graphs show the 'linear' kernel gives similar results (sparse vs dense) but the sparse implementation of 'poly' gives wrong results.
I am using scikits.learn version 0.7.1, and I have tested it on both Windows 32-bit and Windows 64-bit. I am using scipy version 0.8 on the win32 platform and scipy 0.9rc3 on the win64 platform.
"""
==================================================
Plot different SVM classifiers in the iris dataset
==================================================
Comparison of different linear SVM classifiers on the iris dataset. It
will plot the decision surface for four different SVM classifiers.
"""
print __doc__
import numpy as np
import pylab as pl
from scikits.learn import svm, datasets
import scipy as sp
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
Xs = sp.sparse.lil_matrix( X ).tocsr()
Y = iris.target
h=.02 # step size in the mesh
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
svc = svm.SVC(kernel='linear').fit(X, Y)
rbf_svc = svm.SVC(kernel='poly').fit(X, Y)
ssvc = svm.sparse.SVC(kernel='linear').fit(Xs, Y)
srbf_svc = svm.sparse.SVC(kernel='poly').fit(Xs, Y)
# create a mesh to plot in
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'SVC with polynomial (degree 3) kernel',
'Sparse SVC with linear kernel',
'Sparse SVC with polynomial (degree 3) kernel']
pl.set_cmap(pl.cm.Paired)
for i, clf in enumerate((svc, rbf_svc, ssvc, srbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    pl.subplot(2, 2, i + 1)
    Xp = np.c_[xx.ravel(), yy.ravel()]
    if i > 1:
        Xp = sp.sparse.lil_matrix(Xp).tocsr()
    Z = clf.predict(Xp)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.contourf(xx, yy, Z)
    pl.axis('tight')
    # Plot also the training points
    pl.scatter(X[:, 0], X[:, 1], c=Y)
    pl.title(titles[i])
    pl.axis('tight')
pl.show()
Not high priority but would be nice to have a command line interface. Some possible features:
Examples:
$ skl fit --format svmlight --model model.pickle preprocessing.Scaling pca.PCA svm.LinearSVC --input training_data.txt
$ skl predict --format svmlight --model model.pickle --input test_data.txt --output predictions.txt
When --input is not provided, the input is read from stdin.
I think the decision_function output should be sorted just as we do for predict_proba, where we sort by class label as in probas[:, np.argsort(self.label_)].
See comments on 16dc776.
Some functions are missing in the sparse SVMs: probability prediction and the decision function.
This is a proposal to use python's logging module instead of using stdout and verbose flags in the models API.
Using the logging module would make it easier for the user to control the verbosity of the scikit using a single and well documented configuration interface and logging API.
Multinomial Naive Bayes is a simple algorithm which scales well and has probabilistic output.
use_prior=True/False would be a nice option in the constructor.
In case loops are slow, use Cython.
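The algorithm fits in a few lines of NumPy; here is a minimal sketch with the proposed use_prior option (the class name and attribute layout are assumptions, not the eventual scikit implementation):

```python
import numpy as np


class SimpleMultinomialNB:
    """Minimal multinomial Naive Bayes sketch with Laplace smoothing."""

    def fit(self, X, y, use_prior=True, alpha=1.0):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        log_probs, log_priors = [], []
        for c in self.classes_:
            Xc = X[y == c]
            counts = Xc.sum(axis=0) + alpha  # smoothed feature counts
            log_probs.append(np.log(counts / counts.sum()))
            # use_prior=False corresponds to a uniform class prior
            prior = len(Xc) / len(X) if use_prior else 1.0 / len(self.classes_)
            log_priors.append(np.log(prior))
        self.feature_log_prob_ = np.array(log_probs)
        self.class_log_prior_ = np.array(log_priors)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # joint log-likelihood; also the basis for probabilistic output
        jll = X @ self.feature_log_prob_.T + self.class_log_prior_
        return self.classes_[np.argmax(jll, axis=1)]
```

Everything is vectorized over classes and features, so plain NumPy should already scale well before reaching for Cython.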
Most classifier models have a parameter named class_weights for the fit method. It would be more user-friendly to also have it as a constructor parameter, to be able to grid-search over it and potentially to have class_weights='auto' enabled by default.
Reference link is broken
For precomputed kernels, a square matrix is not an efficient way to store the kernel matrix (since the kernel matrix is symmetric).
We should create a kernel object interface instead. Advantages:
The object could be numpy-compatible:
kernel = GaussianKernel(X_train, sigma=0.5)
print kernel[i, j]  # recompute only if not cached
print kernel.compute(X_test)
Question: shall we create our own LRU object or shall we just bind libsvm's?
Since there's plan to bind libsvm's cross-validation code, this also means that the cache will be used more efficiently for cross-validation even when kernel="precomputed".
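A rough sketch of the proposed interface with a naive per-entry cache (the class is hypothetical; a real version would use an LRU policy, whether home-grown or libsvm's):

```python
import numpy as np


class GaussianKernel:
    """Sketch of a caching kernel object with the proposed interface."""

    def __init__(self, X, sigma=0.5):
        self.X = np.asarray(X, dtype=float)
        self.sigma = sigma
        self._cache = {}  # naive dict cache; stand-in for a real LRU

    def __getitem__(self, idx):
        i, j = idx
        if (i, j) not in self._cache:
            d2 = np.sum((self.X[i] - self.X[j]) ** 2)
            self._cache[(i, j)] = np.exp(-d2 / (2 * self.sigma ** 2))
        return self._cache[(i, j)]

    def compute(self, Y):
        """Full kernel matrix between the training data and Y."""
        Y = np.asarray(Y, dtype=float)
        d2 = ((self.X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.sigma ** 2))
```

Since K[i, j] == K[j, i], the cache could also normalize the index pair to exploit symmetry and halve the storage, which a square precomputed matrix cannot do.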
(Sorry for opening many tickets lately: I actually intend to help close them when I get more time ;-)
Affinity propagation does not handle the identity matrix correctly:
In [81]: s = np.array([[1, 0], [0, 1]])
In [82]: affinity_propagation(s, verbose=True)
Did not converged
Out[82]:
(None, array([[ nan],
[ nan]]))
When s has a float dtype, affinity propagation converges, so the cause might be a numerical consistency issue with integer input.
BTW, it seems there are two ways to report bugs, SourceForge and GitHub. Do you have any preference?
Add priors on the mean instead of assuming that the prior means are zero.
Add more references in the doc.
Modify ARD in order to use a vector of hyperparameters for the precision instead of a single value.
Spelling: defaut -> default (unless you insist on using French). Also, spelling out ARD (Automatic Relevance Determination) regression in the docstring, as in the comment line, would be useful.
Thanks to Josef for the remarks.
BallTree provides distances other than Euclidean, e.g. l1/Manhattan. Use it!
It would be nice to have loaders for common file formats such as libsvm's or weka's.
The loaders should have two modes: batch and online. In the latter case, we could have an iterator that spits X matrices of a given chunk size (suitable for partial_fit)
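A sketch of what the online mode could look like for the svmlight format, which stores one sample per line as "label index:value index:value ..." with 1-based indices (the function name and signature are assumptions for illustration):

```python
import numpy as np


def iter_svmlight_chunks(lines, n_features, chunk_size=1000):
    """Hypothetical online loader: yield (X, y) chunks from svmlight lines.

    Each yielded chunk has at most chunk_size rows, which makes it
    suitable for feeding partial_fit one chunk at a time.
    """
    X_rows, ys = [], []
    for line in lines:
        parts = line.split()
        ys.append(float(parts[0]))       # first token is the label
        row = np.zeros(n_features)
        for tok in parts[1:]:
            idx, val = tok.split(":")
            row[int(idx) - 1] = float(val)  # svmlight indices are 1-based
        X_rows.append(row)
        if len(X_rows) == chunk_size:
            yield np.array(X_rows), np.array(ys)
            X_rows, ys = [], []
    if X_rows:  # flush the last, possibly smaller, chunk
        yield np.array(X_rows), np.array(ys)
```

The batch mode then falls out for free: concatenate all chunks into one (X, y) pair.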
Assigned to me
Put the slides somewhere so that they can be reused:
http://www.scribd.com/doc/36583672/Statistical-Learning-and-Text-Classification-with-NLTK-and-scikit-learn
For consistency and to be able to write an efficient pipeline, all transform methods should honor a copy=True|False parameter.
I'm not on my working environment so I cannot check right now but we should make a list of the methods that need to be fixed in this ticket.
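To illustrate the convention being proposed, here is a toy transform honoring copy=True|False; the function is a made-up example, not one of the methods to be fixed:

```python
import numpy as np


def scale_demo(X, copy=True):
    """Center X; with copy=False, modify the caller's array in place
    (illustrative sketch of the proposed copy=True|False convention)."""
    # copy=False avoids allocation when X is already a float ndarray
    X = np.array(X, dtype=float, copy=copy)
    X -= X.mean(axis=0)  # in place: mutates the input when copy=False
    return X
```

In a pipeline, copy=False on intermediate steps avoids allocating a fresh array at every stage, which is the efficiency argument above.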
The k-means center initialization in https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py, function k_init(), is not k-means++. The paper authors' implementation (http://www.stanford.edu/~darthur/kMeansppTest.zip, Utils.cpp, chooseSmartCenters()) does the following:
scikit-learn's implementation (taken from pybrain, which in turn took it from Yong Sun's blog at http://blogs.sun.com/yongsun/entry/k_means_and_k_means) does this instead:
The authors' implementation samples numLocalTries points with D^2 weighting and chooses the best among those (repeated for each of the k-1 centers to find). For all the results in their paper, the authors used numLocalTries==1. (Only in their "Conclusion and future work" section they state that "experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased \phi as much as possible", and in their code you can see they tried numLocalTries==2+log(k).)
scikit-learn completely omits the sampling step (authors' Utils.cpp, lines 299-305) and instead greedily chooses the center that minimizes the potential. While this is not necessarily bad, it is not k-means++.
The easy way to fix this would be changing the documentation to not refer to k-means++ any longer (and find out how this greedy scheme is called in the literature; I assume somebody described it already), the better way would be fixing the implementation. I will do the latter (unless I decide I don't need it) and post back here; until then just take this as a warning.
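For reference, a sketch of the authors' scheme as described above: sample each new center with D² weighting, and with numLocalTries > 1 greedily keep the sampled candidate that lowers the potential the most (the function name and use of NumPy's Generator API are my own choices, not the scikit code):

```python
import numpy as np


def k_init_pp(X, k, n_local_trials=1, seed=0):
    """Sketch of k-means++ seeding (Arthur & Vassilvitskii)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    # d2[i] = squared distance from X[i] to its closest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # sample candidates with probability proportional to D^2
        probs = d2 / d2.sum()
        candidates = rng.choice(n, size=n_local_trials, p=probs)
        # greedily keep the candidate that minimizes the potential
        pots = [np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1)).sum()
                for c in candidates]
        best = candidates[int(np.argmin(pots))]
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(axis=1))
        centers.append(X[best])
    return np.array(centers)
```

With n_local_trials=1 this is exactly the scheme used for all the results in the paper; the current scikit code instead skips the sampling step entirely.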