This is a work in progress.
Auto SVM is a Least Squares Support Vector Machine (LS-SVM) that is improved in several ways:
- A next-gen regularization term that penalizes model complexity directly:
  - Regression: the objective is a tradeoff between the fit on the training data and a direct estimate of model complexity (instead of a proxy like the L2 norm).
  - Classification: the objective is a mix of the fit on the training data, model complexity (the next-gen term), and maximizing the margin (L2 norm). While maximizing the margin is important, minimizing the complexity of the decision boundary may significantly improve generalization performance.
- Fully normalized, so that:
  - Hyperparameters are easily interpreted.
  - Default hyperparameter values of $1.0$ will give good results.
  - Adding or removing rows (observations) or columns (features) has minimal influence on the choice of hyperparameter values.
- The least squares formulation allows for a cheap closed-form expression for the leave-one-out error. This enables a powerful way to search for the optimal hyperparameters (a sketch follows this list).
- At a later stage, perhaps a large-scale implementation as well.
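To make the leave-one-out shortcut concrete, here is a minimal sketch. It assumes the bias-free least-squares formulation $\alpha = (K + \mu I)^{-1} y$ (i.e., kernel ridge regression); the standard LS-SVM system with a bias term admits a very similar identity. The data and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                   # toy data; rows are observations (sklearn convention)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
gamma, mu = 1.0, 1.0                           # the default values discussed above

Kmat = rbf_kernel(X, X, gamma=gamma)           # kernel matrix K
A = np.linalg.inv(Kmat + mu * np.eye(len(y)))  # A = (K + mu*I)^{-1}
alpha = A @ y

# Closed form: the i-th leave-one-out residual equals alpha_i / A_ii.
loo_closed = alpha / np.diag(A)

# Brute-force check: refit without observation i, then predict it.
loo_brute = np.empty(len(y))
for i in range(len(y)):
    m = np.arange(len(y)) != i
    a = np.linalg.solve(Kmat[np.ix_(m, m)] + mu * np.eye(m.sum()), y[m])
    loo_brute[i] = y[i] - Kmat[i, m] @ a
assert np.allclose(loo_closed, loo_brute)
```

Since $A$ is already available from a single fit, minimizing (say) the sum of squared leave-one-out residuals over $\gamma$ and $\mu$ costs little more than one fit per candidate.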
We begin with an observation $x \in \mathbb{R}^d$ and choose an RBF kernel in the same format as sklearn's:
$$
\begin{align}
k(x,y) &:= \exp(-\gamma |x-y|^2) \in \mathbb{R} \\
K(x) &:= \left[k(x, x_1) ~ \ldots ~ k(x, x_n)\right] \in \mathbb{R}^{1 \times n}
\end{align}
$$
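As a quick sanity check, this definition can be evaluated directly and compared against sklearn's `rbf_kernel` (a minimal sketch; the data `X` and query point `x` are illustrative, with rows of `X` as observations, as in sklearn):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # n = 5 observations, d = 3 features
x = rng.normal(size=(1, 3))   # a single query point

def K(x, X, gamma=gamma):
    """Row vector K(x) = [k(x, x_1) ... k(x, x_n)], shape (1, n)."""
    return np.exp(-gamma * np.sum((x - X) ** 2, axis=1))[None, :]

# Same convention as sklearn for the same gamma.
assert np.allclose(K(x, X), rbf_kernel(x, X, gamma=gamma))
```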
The gradient of the kernel is given by:

$$
\begin{align}
\frac{dk}{dx} &= -2\gamma \, k(x,y) \cdot (x-y) \\
\nabla K := \frac{dK}{dx} &= -2\gamma \left( x \cdot 1_{1\times n} - X \right) \cdot \mathrm{diag}(K(x))
\end{align}
$$

where $X \in \mathbb{R}^{d \times n}$ collects the training observations as columns.
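The gradient formula can be verified with central finite differences (a sketch reusing `K`, `X`, `x`, and `gamma` from the previous snippet; since that code stores observations as rows, the document's $X$ corresponds to `X.T` here):

```python
def grad_K(x, X, gamma=gamma):
    """d x n matrix whose column j is dk(x, x_j)/dx = -2*gamma*k(x, x_j)*(x - x_j)."""
    return -2.0 * gamma * (x.T - X.T) * K(x, X, gamma)  # (d, n) via broadcasting

eps = 1e-6
num = np.zeros((x.shape[1], X.shape[0]))
for p in range(x.shape[1]):          # central difference along each dimension p
    dx = np.zeros_like(x)
    dx[0, p] = eps
    num[p, :] = (K(x + dx, X) - K(x - dx, X))[0] / (2 * eps)
assert np.allclose(grad_K(x, X), num, atol=1e-6)
```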
The SVM model is of the form:

$$
f(x) := K(x)\cdot \alpha
$$
The normal on the prediction surface is:

$$
n := \begin{bmatrix}\nabla K \cdot \alpha \\ -1\end{bmatrix}
$$
And so the squared norm of the normal vector is:

$$
\begin{align}
|n|^2 &= \alpha^T\cdot (\nabla K)^T (\nabla K) \cdot \alpha + 1 \\
&= 1 + 4\gamma^2 \cdot \alpha^T \cdot \mathrm{diag}(K(x)) \cdot \left(|x|^2 \cdot 1_{n \times n} - 1_{n\times 1} \cdot x^T \cdot X - X^T \cdot x \cdot 1_{1 \times n} + X^T X\right) \cdot \mathrm{diag}(K(x)) \cdot \alpha \\
&= 1 + 4\gamma^2 \cdot \alpha^T \cdot \left[ k(x,x_i)\,k(x,x_j)\left(|x|^2 - x_i^T\cdot x - x_j^T\cdot x + x_i^T\cdot x_j \right) \right]_{ij} \cdot \alpha
\end{align}
$$
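The matrix form and the elementwise bracket form can be checked against each other numerically (a sketch reusing the snippets above; `alpha` is an arbitrary illustrative coefficient vector):

```python
alpha = rng.normal(size=(X.shape[0], 1))

# Direct form: 1 + alpha^T (grad K)^T (grad K) alpha.
G = grad_K(x, X)                              # d x n
norm_sq_direct = 1.0 + (alpha.T @ G.T @ G @ alpha).item()

# Bracket form: entries k(x,x_i) k(x,x_j) (|x|^2 - x_i.x - x_j.x + x_i.x_j).
Kx = K(x, X)[0]                               # (n,)
xi_dot_x = X @ x[0]                           # (n,): x_i . x
inner = np.sum(x ** 2) - xi_dot_x[:, None] - xi_dot_x[None, :] + X @ X.T
M = np.outer(Kx, Kx) * inner
norm_sq_bracket = 1.0 + 4 * gamma ** 2 * (alpha.T @ M @ alpha).item()

assert np.isclose(norm_sq_direct, norm_sq_bracket)
```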
We can integrate this expression over the entire input space to obtain a direct measure of model complexity. Let's integrate each of the three types of terms inside the brackets: the quadratic term $|x|^2$, the linear terms $x_i^T\cdot x$ and $x_j^T\cdot x$, and the constant term $x_i^T\cdot x_j$.

First, let's integrate out dimension $p$ of the constant term (writing $x^{(/p)}$ for $x$ with dimension $p$ removed, and $C_{ij}$ for the factors that do not depend on $x^{(p)}$):
$$
\begin{align}
I^{(3,p)}_{ij} &= \int_{-\infty}^\infty k(x,x_i)\,k(x,x_j)\, x_i^T\cdot x_j \, dx^{(p)} \\
&= k(x^{(/p)}, x_i^{(/p)})\, k(x^{(/p)}, x_j^{(/p)})\, x_i^T\cdot x_j \int_{-\infty}^\infty \exp\left(-\gamma (x^{(p)} - x_i^{(p)})^2 - \gamma (x^{(p)} - x_j^{(p)})^2\right) dx^{(p)} \\
&= C_{ij} \int_{-\infty}^\infty \exp\left(-\gamma (x^{(p)} - x_i^{(p)})^2 - \gamma (x^{(p)} - x_j^{(p)})^2\right) dx^{(p)} \\
&= C_{ij} \int_{-\infty}^\infty \exp\left(-2\gamma\left(x^{(p)} - \frac{x_i^{(p)}+x_j^{(p)}}{2}\right)^2 - \frac{\gamma}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\right) dx^{(p)} \\
&= C_{ij} \sqrt{\frac{\pi}{2 \gamma}} \exp\left(- \frac{\gamma}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\right)
\end{align}
$$
This means that after integrating out all $d$ dimensions, we are left with:

$$
I^{(3)}_{ij} = \left(\frac{\pi}{2\gamma}\right)^{d/2} \exp\left(-\frac{\gamma}{2}|x_i - x_j|^2\right) x_i^T\cdot x_j
$$
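The one-dimensional Gaussian integral used above is easy to check numerically (a sketch; `a` and `b` play the roles of $x_i^{(p)}$ and $x_j^{(p)}$):

```python
import numpy as np
from scipy.integrate import quad

gamma, a, b = 1.0, 0.3, -1.2
numeric, _ = quad(lambda t: np.exp(-gamma * (t - a) ** 2 - gamma * (t - b) ** 2),
                  -np.inf, np.inf)
closed = np.sqrt(np.pi / (2 * gamma)) * np.exp(-gamma / 2 * (a - b) ** 2)
assert np.isclose(numeric, closed)
```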
Next, let's integrate out dimension $p$ of the linear term $x_i^T\cdot x$:
$$
\begin{align}
I^{(2,p,i)}_{ij} &= \int_{-\infty}^\infty k(x,x_i)\,k(x,x_j)\, x_i^T\cdot x \, dx^{(p)} \\
&= \sum_{q=1}^d \int_{-\infty}^\infty k(x,x_i)\,k(x,x_j)\, x_i^{(q)} x^{(q)} \, dx^{(p)} \\
&= \sum_{q=1}^d k(x^{(/p)}, x_i^{(/p)})\, k(x^{(/p)}, x_j^{(/p)})\, x_i^{(q)} x^{(q)} \sqrt{\frac{\pi}{2 \gamma}} \exp\left(- \frac{\gamma}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\right)
\end{align}
$$

where in the $q = p$ term the factor $x^{(p)}$ is no longer free after the integration: it integrates against the Gaussian to its mean, so it must be read as $\frac{x_i^{(p)} + x_j^{(p)}}{2}$.
Next, we integrate out all other dimensions. Every factor $x^{(q)}$ integrates to the Gaussian mean $\frac{x_i^{(q)}+x_j^{(q)}}{2}$, so that:

$$
I^{(2,i)}_{ij} = \left(\frac{\pi}{2\gamma}\right)^{d/2} \exp\left(-\frac{\gamma}{2}|x_i - x_j|^2\right) x_i^T\cdot \frac{x_i + x_j}{2}
$$
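The $q = p$ subtlety can be checked the same way: the extra factor $x^{(p)}$ integrates to the Gaussian mean $\frac{a+b}{2}$ (a sketch reusing `gamma`, `a`, `b`, and `quad` from the previous snippet):

```python
numeric, _ = quad(lambda t: t * np.exp(-gamma * (t - a) ** 2 - gamma * (t - b) ** 2),
                  -np.inf, np.inf)
closed = np.sqrt(np.pi / (2 * gamma)) * np.exp(-gamma / 2 * (a - b) ** 2) * (a + b) / 2
assert np.isclose(numeric, closed)
```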
Finally, let's integrate out dimension $p$ of the quadratic term $|x|^2$:
$$
\begin{align}
I^{(1,p)}_{ij} &= \int_{-\infty}^\infty k(x,x_i)\,k(x,x_j)\, |x|^2 \, dx^{(p)} \\
&= \sum_{q=1}^d \int_{-\infty}^\infty k(x,x_i)\,k(x,x_j)\, x^{(q)2} \, dx^{(p)} \\
&= \sum_{q=1}^d k(x^{(/p)}, x_i^{(/p)})\, k(x^{(/p)}, x_j^{(/p)})\, x^{(q)2} \sqrt{\frac{\pi}{2 \gamma}} \exp\left(- \frac{\gamma}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\right)
\end{align}
$$

where in the $q = p$ term the factor $x^{(p)2}$ must be read as the Gaussian second moment $\left(\frac{x_i^{(p)}+x_j^{(p)}}{2}\right)^2 + \frac{1}{4\gamma}$.
Next, we integrate out all other dimensions. Each factor $x^{(q)2}$ integrates to the second moment $\left(\frac{x_i^{(q)}+x_j^{(q)}}{2}\right)^2 + \frac{1}{4\gamma}$, so that:

$$
I^{(1)}_{ij} = \left(\frac{\pi}{2\gamma}\right)^{d/2} \exp\left(-\frac{\gamma}{2}|x_i - x_j|^2\right) \left(\left|\frac{x_i + x_j}{2}\right|^2 + \frac{d}{4\gamma}\right)
$$
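And the quadratic case: $x^{(p)2}$ integrates to the Gaussian second moment $m^2 + \frac{1}{4\gamma}$ with $m = \frac{a+b}{2}$ (same setup as the previous snippets):

```python
m = (a + b) / 2
numeric, _ = quad(lambda t: t ** 2 * np.exp(-gamma * (t - a) ** 2 - gamma * (t - b) ** 2),
                  -np.inf, np.inf)
closed = (np.sqrt(np.pi / (2 * gamma)) * np.exp(-gamma / 2 * (a - b) ** 2)
          * (m ** 2 + 1 / (4 * gamma)))
assert np.isclose(numeric, closed)
```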
We want to achieve a number of things:
- The solution should be close to invariant under addition or removal of columns.
- The solution should be close to invariant under addition or removal of rows.
- A default value of $\gamma = 1.0$ should be a good starting value for data sets of any size.
- A default value of $\mu = 1.0$ should be a good starting value for data sets of any size.
First, we fix
Second, we replace
A counterargument to this is that adding an all-zero feature would require a different value of $\gamma$.