stat-learning's People

Contributors

asadoughi, clarkfitzg, eignatenkov, jamesowers, jdavis, jsterkel, justmarkham, rzijp, sjuvekar, skhal, sudar

stat-learning's Issues

Chapter-2 Exercise-4

In part (b) of the question, examples 1 and 2 are 'prediction' and not 'inference' as stated in the solution.

Chapter 7 Exercise 2

I believe the solutions are all shifted by one degree of freedom.
For (a), the function should be 0 everywhere, otherwise the regularization term is infinite; for (b), the first derivative has to be 0, so the function needs to be constant; and so on.
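
For reference, the objective from the exercise, in the book's notation (a sketch, with $m$ the order of the penalized derivative):

$$\hat{g} = \arg\min_g \left( \sum_{i=1}^{n} \big(y_i - g(x_i)\big)^2 + \lambda \int \big[g^{(m)}(x)\big]^2 \, dx \right)$$

As $\lambda \to \infty$ the penalty dominates, so $\hat{g}$ must satisfy $g^{(m)} = 0$: for $m = 0$ that forces $g = 0$, for $m = 1$ a constant, for $m = 2$ a line, and so on.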

Question 9e (applied.html) Chapter 03

Q: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Your answer:
From the correlation matrix, I obtained the two most highly correlated pairs and used them in picking my interaction effects. From the p-values, we can see that the interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.

Interaction is not the same thing as correlation. An interaction captures the influence of one predictor on the effect of another predictor on the response, which is different from correlation between the predictors.
There are cases where there is no correlation but there is an interaction. For example, in the Advertising dataset:

a = read.csv("data/Advertising.csv")
pairs(a)

No strong correlation between TV and Radio, but there is an interaction!
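
A quick check of that claim (a sketch; the column names sales, TV, and radio are assumed to match the book's CSV):

fit_ad = lm(sales ~ TV * radio, data = a)  # main effects plus the TV:radio interaction
summary(fit_ad)                            # the interaction term is highly significant in the book's data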

In addition, a better way to evaluate interaction effects is to use .*. in the lm() function; it will include all the predictors plus all possible pairwise interactions:

sset = subset(Auto, select = -name)
fit = lm(mpg ~ .*., data = sset)
summary(fit)

Note that the decision to include any interaction term in the final model falls under the topic of model selection.

Thanks

Exercise 5a Chapter 10

Why is (3,4,6,8) clustered rather than (3,4,5,6), or (2) versus (1,3,4,5,6,7,8)? Is it because these observations are closest to each other and have the fewest socks and computers?

Quick check for answer to Chapter 2, question 6

A parametric approach also allows us to make inferences about how the predictors affect the underlying function.

A non-parametric approach has the disadvantage that it is more difficult to draw inferences about how the predictors affect the function.

Thought it might be a good addition to that answer. Great repository! Thank you for uploading. I am using it to check my answers.

9.7.d SVM plots

The SVM pair plots for polynomial and radial kernels are homogeneous and therefore not helpful for assessing the fits.

The slice argument for plot.svm specifies the constant values at which the other dimensions should be held, that is, onto which 2-D hyperplane the data are projected for visualization. The default is zero, which may give a hyperplane that does not intersect the decision boundary. Setting the slice argument to the medians of the other dimensions ensures the projection hyperplane is near the data and intersects the decision boundary. The intersection of the boundary and the hyperplane may not perfectly partition data projected from higher dimensions, but it can indicate, for example, appropriate curvature, or suggest ways to improve the fit.
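
A minimal sketch of the idea (assumes an e1071 svm fit on the Auto data with a hypothetical binary response mpglevel; the variable names are illustrative):

library(e1071)
library(ISLR)   # Auto data
Auto$mpglevel = as.factor(Auto$mpg > median(Auto$mpg))
fit = svm(mpglevel ~ displacement + horsepower + weight, data = Auto, kernel = "radial")
# Project onto the displacement-horsepower plane, holding weight at its median
plot(fit, Auto, displacement ~ horsepower, slice = list(weight = median(Auto$weight)))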

Ch3 13i

The answer is wrong: eps2 = rnorm(100, 0, 0.5) is the same as before, while the problem requires you to increase the variance.

Chapter 3: Exercise 13

As reported by @shayan-mj in #3

Exercise 13 of chapter 3 says to "create a vector, eps, from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25". So the standard deviation is 0.5. The solution has "eps = rnorm(100, 0, 0.25)", but it should be "eps = rnorm(100, 0, 0.5)", because the rnorm function takes the standard deviation, not the variance, as its argument.

Chapter 2 Exercise 10 d-h: Boston suburbs data set

On page 14 of ISLR (29 of the PDF), the Boston data set is described as "Housing values and other information about Boston suburbs." I didn't know this when I originally answered these questions, so I made a new suburbs variable based on the towns in the data set. Instead, the data set should have been used as-is.

Ch 8 Ex 2

I agree with the intuition of the argument, but it is not necessarily true that each iteration will produce a split on an unused variable. That is, if variable $X_1$ is used for $\hat{f}^1$ in the first iteration, then $X_1$ will not be used in $\hat{f}^2$; however, it could be used again in $\hat{f}^3$. In the end, multiple $\hat{f}^b$'s can split on the same $X_j$.

The final summation statement is true if each $f_j$ is taken to be the sum, over all $B$ boosting iterations, of the fitted stumps that split on $X_j$:

$$f_j(X_j) = \sum_{b=1}^{B} I(\text{tree } b \text{ splits on } X_j)\, \hat{f}^b(X_j),$$

and $B$ can be (much) greater than $p$.

Ch5 lab 7

I've also got the answer to this exercise close to 0.45.
But if cv.glm is used, I get an answer around 0.25, which is confusing:

glm.fit.all <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
LOOCV_error <- cv.glm(Weekly, glm.fit.all)$delta[1]  # around 0.25 instead of 0.45

Can anyone clarify this?
Thank you!
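
One possible explanation (my own suggestion, stated as an assumption): cv.glm's default cost is the average squared error between the 0/1 response and the fitted probabilities, not the misclassification rate. Passing a threshold-based cost function should bring the result in line:

library(ISLR)   # Weekly data
library(boot)   # cv.glm
glm.fit.all <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
cost <- function(r, pi) mean(abs(r - pi) > 0.5)    # 1 when the 0.5-threshold call misses
cv.glm(Weekly, glm.fit.all, cost = cost)$delta[1]  # should land much closer to 0.45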

Re-format Chapter 2 and 3 exercises

For the Chapter 4 and later solutions, I followed a format of one Rmd file per problem, using the knitr package to generate Markdown and HTML files. It would be nice to have a consistent format across chapters.

It would be especially nice to have the photos of handwritten notes typed up.

Chapter 4 - Exercise 3

The answer you gave is rather misleading, and the $\delta_{k}(x)$ is not correct compared to (4.23). The answer should be simplified.


Since we have

$$f_{k}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{k}} \exp\left(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2}\right)$$

we get, for $\delta_{k}(x)$,

\begin{align}
\delta_{k}(x) &= \log \big( f_{k}(x)\,\pi_{k} \big) \\
&= -\log \sqrt{2\pi} - \log \sigma_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + \log \pi_{k} \\
&= -\log \sigma_{k} + \log \pi_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + c
\end{align}

The constant $c = -\log \sqrt{2\pi}$ can be dropped because it is the same for all $K$ discriminant functions. Expanding $(x-\mu_{k})^{2}$ shows that the formula retains a quadratic term in $x^{2}$, which is the point of the exercise.

Chapter 9 Question 7

It is not mentioned explicitly in the problem statement, but I think it is implied that mpg should be excluded as a predictor.
The solution proposed here uses mpg as a predictor for mpglevel, which makes the model completely useless: if we knew mpg, there would be a much simpler way to get mpglevel.
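
A minimal sketch of the fix (variable names follow the solution's convention; the kernel and cost here are placeholders):

library(ISLR)
library(e1071)
Auto$mpglevel = as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
# Drop mpg (and the name factor) from the predictors
fit = svm(mpglevel ~ . - mpg - name, data = Auto, kernel = "linear", cost = 1)
summary(fit)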

p2 ch7

Hey, I think your answer for problem 2 of chapter 7 is wrong.
(a) Since we are minimizing the area under the curve of g(x)^2, g(x) would be just 0.
(b) Since we are minimizing the area under the curve of g'(x)^2, g'(x) would be just 0, so g(x) = k, where k is some constant.
(c) Since we are minimizing the area under the curve of g''(x)^2, g''(x) would be just 0, so g(x) = ax + b, the straight line that best fits the points.
(d) g'''(x) = 0, so g(x) = ax^2 + bx + c, the quadratic curve that best fits the points.
(e) The penalty term is dropped, so g(x) is an interpolant that goes through all the points, making RSS = 0.

CH4. Ex 3

Could you expand on the process of getting the final result?

I think there might be something wrong with the log operation.

7.2.a

Isn't the answer to this exercise g(x)=0 instead of g(x)=k?

Suggestion for chapter 7 exercise 12

# Backfitting sketch: repeatedly refit each coefficient against the partial residuals
set.seed(1)
p = 100
n = 1000
max.rep = 1000
x = matrix(ncol = p, nrow = n)
coefi = rep(NA, p)
for (i in 1:p) {
  x[, i] = rnorm(n)
  coefi[i] = rnorm(1) * 100
}
y = x %*% coefi + rnorm(n)
beta = rep(0, p)
error = rep(0, max.rep)
for (j in 1:max.rep) {
  for (i in 1:p) {
    # Partial residual: add back the contribution of x[, i]
    a = y - x %*% beta + beta[i] * x[, i]
    beta[i] = lm(a ~ x[, i])$coef[2]
  }
  error[j] = sum((y - x %*% beta)^2)
}
error
plot(1:max.rep, error, ylim = c(0, 1))
plot(1:10, error[1:10])

Chapter 7 Ex 3

The R formula is not correct. It should be y = 1 + x + -2 * (x - 1)^2 * I(x > 1)
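
A quick way to eyeball the corrected curve (a sketch; (x > 1) plays the role of the indicator):

x = seq(-2, 2, by = 0.01)
y = 1 + x - 2 * (x - 1)^2 * (x > 1)
plot(x, y, type = "l")   # linear up to x = 1, then bends downward quadratically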

3(b)v.

https://github.com/asadoughi/stat-learning/blob/master/ch2/answers#L45

Line 45 says: "When the training error is lower than the irreducible error, overfitting has taken place." This is not true, since most of the time the training error will be lower than the irreducible error. I don't think training error alone says much about overfitting. Only when training error keeps decreasing while test error keeps increasing can we say that we're overfitting.
Thoughts?

Ch2 Q8 C

pairs(college[,1:10])
Error in pairs.default(college[, 1:10]) : non-numeric argument to 'pairs'
plot(college$Private, college$Outstate)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
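
A likely fix (a sketch; it assumes College.csv was read with the college names in the first column, as in the book's lab, and that the failure comes from Private being a character column, since R 4.x no longer converts strings to factors automatically):

college = read.csv("College.csv")
rownames(college) = college[, 1]
college = college[, -1]
college$Private = as.factor(college$Private)
pairs(college[, 1:10])                     # works once Private is a factor
plot(college$Private, college$Outstate)    # boxplots of Outstate by Private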

Ch3 Ex9c

The question requires the use of multiple linear regression.
The predictor "origin" is stored as a number, but it is qualitative.
Shouldn't it be changed to a factor using something like the following code before running the regression?

origin_fac = factor(Auto$origin, levels = c(1, 2, 3), labels = c("American", "European", "Japanese"))
new_data = subset(Auto, select = -c(name, origin))
new_data = data.frame(new_data, origin_fac)
lm.fit = lm(mpg ~ ., data = new_data)

Suggestion for Chapter 2 question 8,c,iii

With R version 4.0.2, it seems I must apply factor() to categorize the character column "Private". Thus, the solution becomes
plot(as.factor(college$Private), college$Outstate)

Ch3, 11c

Both results in (a) and (b) reflect the same line created in 11a. In other words, y = 2x + ε could also be written x = 0.5(y − ε).

I think they are not the same regression line, because they have different slopes. Or am I missing something?

Ch4-Exercise4

Hi, I have a question about 4.4(c).
The solution is 100 * (0.1)^100; why do we multiply by 100 here?
In 4.4(b), the answer did not multiply by 2.

Thanks

Ch10 Ex7

The question requires us to scale each observation to have mean 0 and SD 1, not each predictor, in order to produce the desired proportionality.

Suggested answer:

data(USArrests)

scale_data = scale(t(USArrests))   # scale() standardizes columns, so transpose first

correlation = cor(scale_data)
corr.1.minus.r = 1 - as.dist(correlation)

distance.squared = dist(t(scale_data), method = "euclidean")^2

summary(corr.1.minus.r / distance.squared)

Sub-Section 3.6.2 Simple Linear Regression, page 112: missing $ in code

In chapter 3, Sub-Section 3.6.2, Simple Linear Regression, on page 112, there is a paragraph that states, to quote,

we will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(lstat, medv)

This code returns the error Error in plot(lstat, medv) : object 'lstat' not found.
There is a missing $ in the code plot(lstat, medv).
The correct syntax should be plot(Boston$lstat, Boston$medv).
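
For what it's worth (an aside; this follows my recollection of the book's lab): the lab attaches the data frame earlier in the section, which also makes the short form work:

library(MASS)    # Boston data
attach(Boston)
plot(lstat, medv)
abline(lm(medv ~ lstat))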

7.5b)

The solutions say g1 is expected to have a smaller test RSS because of one fewer degree of freedom, but this is not necessarily true. For example, if the true data-generating process is cubic, then g2 will fit better.
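
For reference, the two estimators from the exercise (in the book's notation):

$$\hat{g}_1 = \arg\min_g \Big( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \big[g^{(3)}(x)\big]^2 dx \Big), \qquad \hat{g}_2 = \arg\min_g \Big( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \big[g^{(4)}(x)\big]^2 dx \Big)$$

Since $\hat{g}_2$ penalizes a higher-order derivative, it is the more flexible of the two, so which one wins on test RSS depends on the true function.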

Ch7 9(d) bs(4) has *ONE* knot

  1. First, bs(x, df = 4) should have only ONE interior knot, since for a cubic spline the number of knots is df minus the degree, 4 - 3 = 1 (see the sketch below);
  2. Second, it seems that when both df and knots are supplied to bs(), the result is not simply that knots is ignored.
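
A quick way to verify the first point (a sketch using the splines package):

library(splines)
x = seq(0, 10, length.out = 100)
b = bs(x, df = 4)    # cubic by default, so df - degree = 4 - 3 = 1 interior knot
attr(b, "knots")     # a single knot, placed at the median of x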

Chapter 4 - Exercise 2

I think the answer you gave is confusing and doesn't explain the purpose of the transformation.
The reason why you should find the class $k$ that maximizes $\delta_{k}(x)$ is Bayes' theorem.
From Bayes' theorem (4.12),
$$p_{k}(x) = \frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)},$$
we know that the denominator $\sum_{l=1}^{K} \pi_{l} f_{l}(x)$ is the same for every class $k$.
However, the prior probability $\pi_{k}$ and the density $f_{k}(x)$ differ with $k$.
So the objective is to find the largest $\pi_{k}f_{k}(x)$ among $(\pi_{1}f_{1}(x), \dots, \pi_{K}f_{K}(x))$, which leads us to the largest $p_{k}(x)$.
With the logarithm transformation we get $\delta_{k}(x) = \log \big(\pi_{k}f_{k}(x)\big)$.
In the end, finding the largest $\delta_{k}(x)$ among the $K$ classes is equivalent to finding the largest $p_{k}(x)$, but the computation for $\delta_{k}(x)$ is much easier than for $p_{k}(x)$.

Chapter 4 Exercise 11

Point (g):
The selected training data (train = (year %% 2 == 0)) leads to an optimal KNN solution with k = 3. The test error in this case amounts to 13.7%:

library(class)   # knn
knn.pred = knn(train.X, test.X, train.mpg01, k = 3)
mean(knn.pred != mpg01.test)
[1] 0.1373626

Question 10.7

Thank you for writing up these solutions! They were very helpful to me.
The answer to 10.7 is incorrect: the observations should be normalized by row rather than by column, which will yield a fixed ratio of 1/6. All that needs to change is your first line:

dsc = t(scale(t(USArrests)))
a = dist(dsc)^2
b = as.dist(1 - cor(t(dsc)))
summary(b/a)
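
The fixed 1/6 follows from a short calculation (my sketch, with $p = 4$ variables and R's scale() using the $n - 1$ denominator): for standardized rows $x$ and $y$, $\sum_i x_i^2 = \sum_i y_i^2 = p - 1 = 3$ and $\sum_i x_i y_i = 3r$, so

$$d^2 = \sum_i (x_i - y_i)^2 = 3 + 3 - 6r = 6(1 - r), \qquad \frac{1 - r}{d^2} = \frac{1}{6}.$$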

Chapter 10 Question 7

The solution is obviously wrong because the check at the end fails (there should be no variability in this ratio).
The reason is that scale in R standardizes columns, not rows (which is counter-intuitive here, but the book does tell us to make sure that the standard deviation of each observation, not each variable, is 1).
This can be fixed by running dsc = t(scale(t(USArrests))) instead of dsc = scale(USArrests).

Ch2 Q9 Quantiles

This does not remove the 10th through 85th quantiles; it removes elements 10 through 85: newAuto = Auto[-(10:85),]. If Auto had 100 entries, all in the same order, it would work. I think you need to sort each quantitative variable and then remove elements 0.10 * length(mpg) through 0.85 * length(mpg), for instance.

Better transformation for Ch3 Q9-(f)

Log-transforming mpg produced a better R-squared and better residual diagnostics.

Here is the R code and output:

lm.fit2<-lm(log(mpg)~cylinders+displacement+horsepower+weight+sqrt(acceleration)+year+origin,data=Auto)
summary(lm.fit2)
par(mfrow=c(2,2))
plot(lm.fit2)
plot(predict(lm.fit2),rstudent(lm.fit2))

Call:
lm(formula = log(mpg) ~ cylinders + displacement + horsepower +
weight + sqrt(acceleration) + year + origin, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41288 -0.06546  0.00002  0.06837  0.34063

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.829e+00  1.986e-01   9.210  < 2e-16 ***
cylinders          -2.806e-02  1.156e-02  -2.428  0.01562 *
displacement        6.171e-04  2.699e-04   2.287  0.02276 *
horsepower         -1.615e-03  5.008e-04  -3.225  0.00137 **
weight             -2.497e-04  2.373e-05 -10.520  < 2e-16 ***
sqrt(acceleration) -2.344e-02  2.896e-02  -0.810  0.41865
year                2.952e-02  1.823e-03  16.193  < 2e-16 ***
origin              4.068e-02  9.948e-03   4.090 5.26e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.119 on 384 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8775
F-statistic: 401 on 7 and 384 DF, p-value: < 2.2e-16

Chapter 3 exercise 1

"The high p-value of newspaper suggests that the null hypothesis is true for newspaper." is inaccurate, it should be "The high p-value of newspaper suggests that we can't reject the null hypothesis for newspaper."
