In statistics, linear regression is an approach for modeling the relationship between a response variable $y$ and one or more explanatory variables $x$. The model rests on the following assumptions:
- Linearity: The response and explanatory variables are linearly related.
- Normality: The residuals of the model should follow a normal distribution.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of the residuals is the same for any value of $x$.
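As a rough illustration of the homoscedasticity assumption, the sketch below compares the residual variance in the lower and upper halves of $x$. This is an informal check and not a formal statistical test; the function name `residual_variance_ratio` and the toy residuals are made up for this example and do not come from the notebooks.

```python
from statistics import pvariance


def residual_variance_ratio(x, resid):
    """Ratio of residual variance in the upper half of x to the lower half.

    A ratio far from 1 hints that the spread of the residuals changes
    with x, i.e. that the homoscedasticity assumption may be violated.
    """
    pairs = sorted(zip(x, resid))          # order residuals by x
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]   # residuals for small x
    upper = [r for _, r in pairs[-half:]]  # residuals for large x
    return pvariance(upper) / pvariance(lower)


# Hypothetical residuals for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
constant_spread = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
growing_spread = [1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0]

ratio_constant = residual_variance_ratio(x, constant_spread)
ratio_growing = residual_variance_ratio(x, growing_spread)
print(ratio_constant, ratio_growing)
```

In practice a plot of residuals against fitted values is usually the first thing to look at; the ratio above only condenses that picture into a single number.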
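The formula for a simple linear regression is given by

$$y = b_0 + b_1 x + e$$

where $y$ is the response variable, $x$ is the explanatory variable, $b_0$ is the intercept, $b_1$ is the slope, and $e$ is the residual.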
Linear regression is essentially the formal term for finding the line of best fit. In other words, we are looking for the intercept and slope that define the line fitting the data as well as possible. "As well as possible" usually means minimizing the sum of squared residuals, a method known as Ordinary Least Squares (OLS).
- Intercept: The value of our response variable $y$ when our explanatory variable $x=0$ ($b_0$ in the formula above).
- Slope: For each unit increase in $x$, the expected increase or decrease in $y$ ($b_1$ in the formula above). This definition is sufficient only in the case of a simple linear regression. In the case of multiple linear regression, we have to add "holding all other explanatory variables constant" to our definition, because there is more than one explanatory variable in the model.
- Residual: The difference between an observed value and the fitted value provided by a model: $e_i = y_i - \hat{y}_i$
- Ordinary Least Squares (OLS): OLS is one of the simplest and most common methods used to find the parameters in a linear regression model. It tries to minimize the sum of squared residuals: $\sum_{i=1}^n e_i^2$.
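The definitions above can be tied together in a minimal sketch of OLS for a simple linear regression. It assumes the standard closed-form solution $b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$. The helper names `fit_simple_ols` and `compute_residuals` and the toy data are invented for illustration and are not taken from the notebooks.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Covariation of x and y, and variation of x, around their means
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1


def compute_residuals(x, y, b0, b1):
    """Residuals e_i = y_i - y_hat_i for the fitted line."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Made-up toy data for illustration only (not from the dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
e = compute_residuals(x, y, b0, b1)
ssr = sum(ei ** 2 for ei in e)   # sum of squared residuals
print(f"intercept={b0:.3f}, slope={b1:.3f}, SSR={ssr:.3f}")
```

Because the model includes an intercept, the residuals of the fitted line sum to zero (up to floating-point error), which is a quick sanity check on any OLS fit.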
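We can calculate the coefficient of determination $R^2$ to measure how well the model fits the data.
Mathematically speaking we can define $R^2$ as

$$R^2 = 1 - \frac{SSR}{SST}$$

where SSR is the sum of squared residuals ($\sum_{i=1}^n e_i^2$) and SST is the total sum of squares ($\sum_{i=1}^n (y_i - \bar{y})^2$).
The value for $R^2$ typically lies between 0 and 1; the closer it is to 1, the larger the share of the variance in $y$ that the model explains.
Whenever we deal with multiple linear regression, $R^2$ never decreases when another explanatory variable is added, even if that variable has no real relationship with the response.
This is where adjusted $R^2$ comes in: it penalizes the number of explanatory variables in the model.
The formula of the adjusted $R^2$ is

$$R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

where $n$ is the number of observations and $p$ is the number of explanatory variables.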
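To make the two metrics concrete, here is a minimal sketch assuming the usual definitions $R^2 = 1 - SSR/SST$ and $R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, with $n$ observations and $p$ explanatory variables. The function names and the toy values are illustrative only and are not the notebook's code.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSR / SST."""
    y_mean = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # squared residuals
    sst = sum((yi - y_mean) ** 2 for yi in y)              # total sum of squares
    return 1 - ssr / sst


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up values for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]    # predictions equal the observations
mean_only = [2.5, 2.5, 2.5, 2.5]      # always predict the mean of y

r2_perfect = r_squared(y, perfect_fit)
r2_mean = r_squared(y, mean_only)
r2_adj = adjusted_r_squared(0.5, n=11, p=2)
print(r2_perfect, r2_mean, r2_adj)
```

The two edge cases show the range of $R^2$: a perfect fit gives 1, while a model that always predicts the mean gives 0. The adjusted value is lower than the plain $R^2$ whenever $p \geq 1$ and $R^2 < 1$.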
Results from the notebook. The baseline model is a simple linear regression model with only one feature. Multiple Reg-1 includes 4 features with high correlation, and Multiple Reg-2 includes only 3 features.
Baseline Model | Multiple Reg-1 | Multiple Reg-2 |
---|---|---|
60.5 % | 70.5 % | 70.6 % |
This dataset is a slightly modified version of the dataset provided in the StatLib library. The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
Please make sure you have forked the repo and set up a new virtual environment. For this purpose you can use the following commands:

    pyenv local 3.11.3
    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt

The added requirements file contains all libraries and dependencies we need to execute the linear regression notebooks.