Introduction
In the previous lesson, we learned the mathematical definition of a gradient. We saw that the gradient of a function is the combination of its partial derivatives with respect to each variable of that function, and that the direction of gradient descent is simply the negative direction of the gradient. So if the direction of steepest ascent of a function is up and to the right, the direction of descent is down and to the left. In this lesson we will apply gradient descent to our cost function to see how we can move towards a best fit regression line by changing the variables of that line: the slope $m$ and the y-intercept $b$.
Representing RSS as a multivariable function
Think about why gradient descent applies so well to a cost function. Initially, we said that the cost of our regression line, meaning the difference between what our regression line predicts and the actual data, changes as we alter the y-intercept or the slope of the line.
Remember that mathematically, when we say cost function, we use the residual sum of squares, where $$ RSS = \sum_{i=1}^n(actual_i - expected_i)^2 = \sum_{i=1}^n(y_i - \overline{y}_i)^2 = \sum_{i=1}^n(y_i - (mx_i + b))^2$$ for all $n$ points in our dataset.
So RSS takes the difference between each actual value $y_i$ and the value our regression line expects, $mx_i + b$, squares that difference, and adds up these squared errors across the dataset.
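To make this concrete, here is a minimal sketch of how RSS could be computed in Python for a candidate line. The rss helper and the data points below are hypothetical illustrations and are not the dataset used in this lesson.

```python
# A minimal sketch: compute RSS for a candidate line y = mx + b.
# The helper name and the data points below are hypothetical illustrations.

def rss(m, b, x_values, y_values):
    """Sum of squared differences between actual and expected y values."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values))

x_values = [0, 30, 60, 90]       # hypothetical x values
y_values = [100, 150, 230, 280]  # hypothetical actual y values

print(rss(2.0, 100, x_values, y_values))  # RSS for the line y = 2x + 100 -> 200
```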
And when we just plotted how RSS changes as we change one variable of our regression line, the y-intercept $b$, we got a two-dimensional cost curve:
import plotly
from plotly.offline import init_notebook_mode, iplot
from graph import m_b_trace, trace_values, plot
init_notebook_mode(connected=True)
b_values = list(range(70, 150, 10))
rss = [10852, 9690, 9128, 9166, 9804, 11042, 12880, 15318]
cost_curve_trace = trace_values(b_values, rss, mode="line", name = 'RSS with changes to y-intercept')
plot([cost_curve_trace])
In two dimensions, we decrease our RSS simply by moving forwards or backwards along the cost curve, which is the equivalent of changing our one variable, in this case the y-intercept. So the cost curve above indicates that changing the regression line's y-intercept from 70 to 80 decreases our cost, RSS.
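The rss list above was given to us, but as a sketch of where numbers like that come from, we can hold the slope fixed and recompute RSS for a range of candidate y-intercepts, reusing the hypothetical rss helper and data from the earlier sketch. (The resulting numbers won't match the lesson's, which come from its own dataset.)

```python
# Sketch: hold the slope fixed and recompute RSS for each candidate y-intercept.
fixed_m = 2.0  # hypothetical fixed slope
b_values = list(range(70, 150, 10))
rss_values = [rss(fixed_m, b, x_values, y_values) for b in b_values]
print(list(zip(b_values, rss_values)))  # (y-intercept, RSS) pairs along the cost curve
```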
Now let's allow ourselves to change both variables, the slope $m$ and the y-intercept $b$. Because our RSS is a function of how we change our values of $m$ and $b$, we can write our cost function as $J(m, b) = \sum_{i=1}^n(y_i - (mx_i + b))^2$. In the function above, the $x_i$ and $y_i$ values come from our dataset, so they are constants; the only variables of our cost function are the slope $m$ and the y-intercept $b$.
Just like our other multivariable functions we have seen thus far, we can display it in three dimensions. The three-dimensional graph below shows how the cost associated with our regression line changes as the slope and y-intercept values are changed.
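That surface isn't reproduced here, but as a sketch, a plot like it could be built directly with plotly by evaluating RSS over a grid of slope and y-intercept values, again reusing the hypothetical rss helper and data from above.

```python
# Sketch: evaluate RSS over a grid of (m, b) values and draw the cost surface.
import plotly.graph_objs as go
from plotly.offline import iplot

m_values = [m / 10 for m in range(10, 31)]  # candidate slopes from 1.0 to 3.0
b_candidates = list(range(50, 151, 5))      # candidate y-intercepts from 50 to 150

# One row of z per y-intercept, one column per slope; the height is the RSS there.
z = [[rss(m, b, x_values, y_values) for m in m_values] for b in b_candidates]

iplot(go.Figure(data=[go.Surface(x=m_values, y=b_candidates, z=z)]))
```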
Calculating the gradient of our cost function
Now let's explore how to use gradient descent to determine which direction to move when changing both the slope $m$ and the y-intercept $b$.
As we know, the gradient of a function is simply the combination of its partial derivatives with respect to each of its variables. So the gradient of our cost function consists of $\frac{dJ}{dm}J(m, b)$ and $\frac{dJ}{db}J(m, b)$.
In calculating these partial derivatives of our function $J(m, b) = \sum_{i=1}^n(y_i - (mx_i + b))^2$, we won't worry about the summation to start. Instead, we'll take the derivative of the squared error for a single point, $(y - (mx + b))^2$, and add the summation back in at the end.
Ok, so let's take our partial derivatives.
Taking our first partial derivative
Let's start with taking the partial derivative with respect to a change in the slope, $m$.
Now this is a tricky function to take the derivative of, so we can use functional composition followed by the chain rule to make it easier. Using functional composition, we can rewrite our function $(y - (mx + b))^2$ as two functions: $$J(g) = g^2$$ $$g(m, b) = y - (mx + b)$$
Now using the chain rule, the partial derivative with respect to a change in the slope is: $$\frac{dJ}{dm}J(m, b) = \frac{dJ}{dg}J(g(m,b))*\frac{dg}{dm}g(m,b)$$
Our next step is to solve these derivatives individually: $$\frac{dJ}{dg}J(g(m,b)) = \frac{dJ}{dg}g(m,b)^2 = 2g(m,b)$$ $$\frac{dg}{dm}g(m,b) = \frac{dg}{dm}(y - (mx + b)) = -x$$
Now plugging these back into our chain rule we have:
$$\frac{dJ}{dg}J(g(m,b))*\frac{dg}{dm}g(m,b) = 2g(m,b)*(-x) = 2(y - (mx + b))*(-x)$$
So $$\frac{dJ}{dm}J(m, b) = 2(y - (mx + b))*(-x) = -2x(y - (mx + b))$$
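If you'd like to double-check this result, one way (assuming the sympy library is available) is to differentiate the squared error term symbolically and compare it against our hand-derived answer.

```python
# Sketch: confirm dJ/dm symbolically with sympy.
from sympy import symbols, diff, simplify

x, y, m, b = symbols('x y m b')
J = (y - (m * x + b)) ** 2          # squared error for a single point

hand_derived = -2 * x * (y - (m * x + b))
print(simplify(diff(J, m) - hand_derived))  # prints 0 if the two expressions agree
```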
Our second partial derivative
Ok, now let's calculate the partial derivative with respect to a change in the y-intercept, $b$. We express this mathematically as $\frac{dJ}{db}J(m, b)$.
Then once again, we use functional composition followed by the chain rule. We view our cost function as the same two functions $J(g) = g^2$ and $g(m, b) = y - (mx + b)$.
So applying the chain rule to this same function composition, we get: $$\frac{dJ}{db}J(m, b) = \frac{dJ}{dg}J(g(m,b))*\frac{dg}{db}g(m,b)$$
Now, our next step is to calculate these partial derivatives individually.
From our earlier calculation of the partial derivative, we know that $\frac{dJ}{dg}J(g(m,b)) = 2g(m,b)$. The only new derivative we need is $$\frac{dg}{db}g(m,b) = \frac{dg}{db}(y - (mx + b)) = -1$$
Now we plug our terms into our chain rule and get:
$$ \frac{dJ}{dg}J(g(m,b))*\frac{dg}{db}g(m,b) = 2g(m,b)*(-1) = -2(y - (mx + b)) $$
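As a quick sanity check on this one, we can compare the formula against a numerical estimate of the derivative at a single made-up data point (the numbers below are hypothetical, not from the lesson's data).

```python
# Sketch: check dJ/db numerically for one made-up data point.
x, y = 3.0, 10.0   # a hypothetical data point
m, b = 1.5, 2.0    # a hypothetical current regression line

def squared_error(m, b):
    return (y - (m * x + b)) ** 2

h = 1e-6
numerical = (squared_error(m, b + h) - squared_error(m, b - h)) / (2 * h)
formula = -2 * (y - (m * x + b))
print(numerical, formula)  # the two values should be nearly identical (about -7.0)
```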
Using both of our partial derivatives for gradient descent
Ok, so now we have our two partial derivatives for our cost function $J(m, b)$:
$$ \frac{dJ}{dm}J(m,b) = -2x(y - (mx + b)) $$ $$ \frac{dJ}{db}J(m,b) = -2(y - (mx + b)) $$
And as our regression line's prediction $mx + b$ is the expected value, $\overline{y}$, we can rewrite these as:
$$ \frac{dJ}{dm}J(m,b) = -2x(y - \overline{y}) = -2x\epsilon$$ $$ \frac{dJ}{db}J(m,b) = -2(y - \overline{y}) = -2\epsilon$$
Remember, the error $\epsilon$ = actual - expected, so we can replace $y - \overline{y}$ with $\epsilon$ above. Finally, to account for every point in our dataset, we add the summation back in, giving us:
$$ \frac{dJ}{dm}J(m,b) = -2\sum_{i=1}^n x_i(y_i - \overline{y}_i) = -2\sum_{i=1}^n x_i\epsilon_i$$ $$ \frac{dJ}{db}J(m,b) = -2\sum_{i=1}^n (y_i - \overline{y}_i) = -2\sum_{i=1}^n \epsilon_i$$
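Putting these summed formulas into code, a sketch might look like the following, again reusing the hypothetical x_values and y_values from the earlier sketches.

```python
# Sketch: the two components of the gradient, summed over every data point.
def dj_dm(m, b, x_values, y_values):
    """-2 times the sum of x_i * error_i."""
    return -2 * sum(x * (y - (m * x + b)) for x, y in zip(x_values, y_values))

def dj_db(m, b, x_values, y_values):
    """-2 times the sum of error_i."""
    return -2 * sum(y - (m * x + b) for x, y in zip(x_values, y_values))

print(dj_dm(2.0, 100, x_values, y_values))  # partial derivative with respect to m
print(dj_db(2.0, 100, x_values, y_values))  # partial derivative with respect to b
```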
So that is what we'll do to find our "best fit regression line." We'll start with an initial regression line with guesses for the values of $m$ and $b$, and then use these partial derivatives to repeatedly update those values, descending along our cost function.
So in the context of gradient descent, we use these partial derivatives to determine our step. And remember that our step should be in the opposite direction of our partial derivatives, as we are descending towards the minimum. So to take a step of gradient descent we use the general formula:
current_m = old_m $ - \frac{dJ}{dm}J(m,b)$

current_b = old_b $ - \frac{dJ}{db}J(m,b) $
or, substituting in the derivatives we just calculated:
current_m = old_m $ - \left(-2\sum_{i=1}^n x_i\epsilon_i \right)$

current_b = old_b $ - \left(-2\sum_{i=1}^n \epsilon_i \right)$
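Here is a sketch of what repeating that update looks like in code, using the hypothetical dj_dm, dj_db, and rss helpers from the earlier sketches. Note that in practice the raw partial derivatives are usually scaled down by a small learning rate so each step stays small; that scaling is an addition for illustration and isn't part of the formulas above.

```python
# Sketch: repeatedly update m and b in the direction opposite the gradient.
learning_rate = 0.00001  # hypothetical scaling factor to keep each step small

def step(m, b, x_values, y_values, learning_rate):
    new_m = m - learning_rate * dj_dm(m, b, x_values, y_values)
    new_b = b - learning_rate * dj_db(m, b, x_values, y_values)
    return new_m, new_b

m, b = 0, 0  # hypothetical initial guesses for the regression line
for i in range(5):
    m, b = step(m, b, x_values, y_values, learning_rate)
    print(m, b, rss(m, b, x_values, y_values))  # RSS should fall with each step
```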
In the next lesson, we'll work through translating this technique into code.
Summary
In this section we developed some intuition for why the gradient of a function is the direction of steepest ascent and the negative gradient of a function is the direction of steepest descent. Essentially, the gradient uses the partial derivatives to see what change will result from a movement along any of the function's dimensions, and then moves in a direction weighted towards the partial derivative with the larger magnitude.
We also practiced calculating some gradients, and ultimately calculated the gradient for our cost function. This gave us two formulas which tell us how to update our regression line so that it descends along our cost function and approaches a "best fit line".