
XuSShuai commented on May 28, 2024

Improving Performance on the Test Set

1 - Early Stopping


Use a validation set for early stopping: monitor the validation error during training and stop at the point where it is lowest, rather than where the training error is lowest.
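A minimal sketch of early stopping, not from the original notes: gradient-descent linear regression in NumPy with a held-out validation set and a "patience" counter. All names and values (data sizes, `eta`, `patience`) are illustrative.

```python
# Early stopping: keep the parameters with the lowest validation error,
# and stop once validation error has not improved for `patience` epochs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

X_tr, y_tr = X[:150], y[:150]   # training split
X_va, y_va = X[150:], y[150:]   # validation split

w = np.zeros(5)
eta, patience, best_va, wait = 0.01, 5, np.inf, 0
best_w = w.copy()

for epoch in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # dL/dw for squared error
    w -= eta * grad
    va_err = np.mean((X_va @ w - y_va) ** 2)
    if va_err < best_va:                 # validation error still improving
        best_va, best_w, wait = va_err, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:             # stopped improving: early stop
            break

w = best_w  # parameters at the epoch with the lowest validation error
```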

2 - Regularization

The best parameters are not only those that minimize the cost function; their norm should also be small (i.e., the resulting model is smooth).

L2 regularization:

$$J(\theta) = L(\theta) + \lambda \frac{1}{2}||\theta||_2^2, \text{ where } ||\theta||_2^2 = (w_1)^2 +(w_2)^2 +(w_3)^2 +\cdots$$

Gradient:

$$\frac{\partial J(\theta)}{\partial w} = \frac{\partial L(\theta)}{\partial w} + \lambda w$$

Update Rule:

$$w^{t+1} = w^{t} - \eta \frac{\partial J(\theta)}{\partial w^t} = w^{t} - \eta (\frac{\partial L(\theta)}{\partial w^t} + \lambda w^t) = (1 - \lambda\eta)w^{t} - \eta \frac{\partial L(\theta)}{\partial w^t}$$

Compared with the update without a regularizer, the formula gains the factor $(1 - \lambda\eta)$, which is usually slightly smaller than 1, so every parameter is shrunk before each update. Although each parameter is multiplied by a number smaller than 1 at every step, not every parameter ends up at 0, because the term $\eta \frac{\partial L(\theta)}{\partial w^t}$ is still there. A parameter only shrinks to 0 when $\frac{\partial L(\theta)}{\partial w^t}$ is also 0, which is reasonable: $\frac{\partial L(\theta)}{\partial w^t} = 0$ means that parameter has no effect on the loss function.
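The shrink-then-step behaviour can be seen in a small sketch, assuming NumPy and an illustrative squared-error loss; `eta` and `lam` are arbitrary values, not taken from the notes.

```python
# L2 ("weight decay") update derived above: w ← (1 − λη)·w − η·dL/dw
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.0]) + 0.05 * rng.normal(size=100)

w = np.zeros(3)
eta, lam = 0.05, 0.1
for t in range(500):
    grad_L = 2 * X.T @ (X @ w - y) / len(y)    # dL/dw (squared error)
    w = (1 - lam * eta) * w - eta * grad_L     # shrink first, then take the gradient step

print(w)  # weights are pulled toward 0 compared with the unregularized solution
```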

L1 regularization:

$$J(\theta) = L(\theta) + \lambda ||\theta||_1, \text{ where }||\theta||_1=|w_1| + |w_2| + \cdots$$

Gradient:

$$\frac{\partial J(\theta)}{\partial w} = \frac{\partial L(\theta)}{\partial w} + \lambda\, \mathrm{sgn}(w)$$

Update Rule:

$$w^{t+1} = w^{t} - \eta \frac{\partial J(\theta)}{\partial w^t} = w^{t} - \eta \left(\frac{\partial L(\theta)}{\partial w^t} + \lambda\, \mathrm{sgn}(w^t)\right) = w^{t} - \eta \lambda\, \mathrm{sgn}(w^t) - \eta \frac{\partial L(\theta)}{\partial w^t}$$

The update rule above adds a constant when a parameter is negative and subtracts a constant when it is positive, so parameters are quite likely to be driven all the way to 0. L2, by contrast, penalizes large parameters strongly and small parameters only weakly during the update, so parameters are generally not driven exactly to 0; they end up clustered near 0.
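A companion sketch of the L1 subgradient update exactly as written above, under the same illustrative setup; again `eta` and `lam` are arbitrary values.

```python
# L1 update: w ← w − η·λ·sgn(w) − η·dL/dw
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, 0.0, 0.0, -1.0]) + 0.05 * rng.normal(size=100)

w = rng.normal(size=4)
eta, lam = 0.05, 0.1
for t in range(500):
    grad_L = 2 * X.T @ (X @ w - y) / len(y)        # dL/dw (squared error)
    w = w - eta * lam * np.sign(w) - eta * grad_L  # constant-size pull toward 0

print(w)  # coefficients of the irrelevant features end up very close to 0
```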

3 - Dropout

  • Training
    For every mini-batch, a new set of neurons to drop is re-sampled.
  • Testing
    • No dropout is applied.
    • If the dropout probability during training is p%, then at test time every weight is multiplied by (1 - p%). (A minimal code sketch of this convention follows the list.)
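A minimal sketch of this train/test convention for a single layer, assuming NumPy; the layer shapes and the value of `p` are illustrative.

```python
# Dropout as described above: drop units with probability p during training
# (fresh mask per mini-batch); at test time keep all units and scale weights by (1 − p).
import numpy as np

rng = np.random.default_rng(3)
p = 0.5                      # dropout probability
W = rng.normal(size=(4, 3))  # weights of one layer

def forward_train(x):
    mask = (rng.random(x.shape) >= p).astype(x.dtype)  # re-sampled every mini-batch
    return (x * mask) @ W

def forward_test(x):
    return x @ ((1 - p) * W)   # no dropout; scale the weights by 1 − p instead

x = rng.normal(size=(8, 4))    # one mini-batch of 8 examples
print(forward_train(x).shape, forward_test(x).shape)
```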

Why does dropout help? Dropout is a kind of ensemble.

  • Ensembling many models that individually have large variance (but small bias) and averaging their outputs gives a model with small variance, and therefore good accuracy.

During dropout training, each mini-batch is in effect trained on a different (thinned) network.


At test time, dropout uses an approximation to this ensemble: instead of actually averaging the outputs of all the thinned networks, the full network is used once with every weight multiplied by (1 - p).

  • A simple example with a single linear unit shows why this approximation is reasonable (see the numerical check after the next paragraph).

In that example the activation function is linear, but empirical results show that dropout also works well for networks with non-linear activation functions. Some papers report that a maxout network (which can be viewed as a piecewise-linear network) combined with dropout gives even better results.
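For the linear case the approximation can be checked numerically: averaging the outputs of all 2^n thinned versions of one linear unit equals the full unit with weights scaled by (1 - p). A small check, assuming NumPy and p = 0.5:

```python
# Enumerate every drop mask of a single linear unit and compare the ensemble
# average with the (1 − p)-scaled full network.
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 4
w = rng.normal(size=n)
x = rng.normal(size=n)

outputs = [np.dot(w, x * np.array(mask)) for mask in itertools.product([0, 1], repeat=n)]
ensemble_avg = np.mean(outputs)          # average over all 2^n thinned networks
scaled = np.dot(0.5 * w, x)              # test-time approximation: weights × (1 − p)

print(np.isclose(ensemble_avg, scaled))  # True: exact when the activation is linear
```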

from machine-learning-2017-fall.
