stat-learning's People

Contributors

asadoughi, clarkfitzg, eignatenkov, jamesowers, jdavis, jsterkel, justmarkham, rzijp, sjuvekar, skhal, sudar

stat-learning's Issues

Chapter-2 Exercise-4

In part (b) of the question, examples 1 and 2 are 'prediction' and not 'inference' as stated in the solution.

Chapter 7 Exercise 2

I believe the solutions are all shifted by one degree of freedom.
For (a), the function should be 0 everywhere, otherwise the regularization term is infinite; for (b), the first derivative has to be 0, so the function needs to be constant; and so on.
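
For reference, the objective from the exercise, in the book's notation (a sketch, with $m$ the order of the penalized derivative):

$$\hat{g} = \arg\min_g \left( \sum_{i=1}^{n} \big(y_i - g(x_i)\big)^2 + \lambda \int \big[g^{(m)}(x)\big]^2 \, dx \right)$$

As $\lambda \to \infty$ the penalty dominates, so $\hat{g}$ must satisfy $g^{(m)} = 0$: for $m = 0$ that forces $g = 0$, for $m = 1$ a constant, for $m = 2$ a line, and so on.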

Question 9e (applied.html) Chapter 03

Q: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Your answer:
From the correlation matrix, I obtained the two most highly correlated pairs and used them in picking my interaction effects. From the p-values, we can see that the interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.

Interaction is not the same thing as correlation. An interaction captures the influence of one predictor on the effect of another predictor on the response, which is different from correlation between the predictors.
There are cases where there is no correlation but there is an interaction. For example, in the Advertising dataset:

a = read.csv("data/Advertising.csv")
pairs(a)

No strong correlation between TV and Radio, but there is an interaction!
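
A quick check of that claim (a sketch; the column names sales, TV, and radio are assumed to match the book's CSV):

fit_ad = lm(sales ~ TV * radio, data = a)  # main effects plus the TV:radio interaction
summary(fit_ad)                            # the interaction term is highly significant in the book's data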

In addition, a better way to evaluate interaction effects is to use .*. in the lm() function; it will include all the predictors plus all possible pairwise interactions:

sset = subset(Auto, select = -name)
fit = lm(mpg ~ .*., data = sset)
summary(fit)

Note that the decision to include any interaction term in the final model falls under the topic of model selection.

Thanks

Exercise 5a Chapter 10

Why is (3,4,6,8) clustered rather than (3,4,5,6), or (2) versus (1,3,4,5,6,7,8)? Is it because these observations are closest to each other and have the fewest socks and computers?

Quick check for answer to Chapter 2, question 6

A parametric approach also allows us to make inferences about how the predictors affect the underlying function.

A non-parametric approach has the disadvantage that it is more difficult to draw inferences about how the predictors affect the function.

Thought it might be a good addition to that answer. Great repository! Thank you for uploading. I am using it to check my answers.

9.7.d SVM plots

The SVM pair plots for polynomial and radial kernels are homogeneous and therefore not helpful for assessing the fits.

The slice argument for plot.svm specifies the constant values at which the other dimensions should be held, that is, onto which 2-D hyperplane the data are projected for visualization. The default is zero, which may give a hyperplane that does not intersect the decision boundary. Setting the slice argument to the medians of the other dimensions ensures the projection hyperplane is near the data and intersects the decision boundary. The intersection of the boundary and the hyperplane may not perfectly partition data projected from higher dimensions, but it can indicate, for example, appropriate curvature, or suggest ways to improve the fit.
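
A minimal sketch of the idea (assumes an e1071 svm fit on the Auto data with a hypothetical binary response mpglevel; the variable names are illustrative):

library(e1071)
library(ISLR)   # Auto data
Auto$mpglevel = as.factor(Auto$mpg > median(Auto$mpg))
fit = svm(mpglevel ~ displacement + horsepower + weight, data = Auto, kernel = "radial")
# Project onto the displacement-horsepower plane, holding weight at its median
plot(fit, Auto, displacement ~ horsepower, slice = list(weight = median(Auto$weight)))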

Ch3 13i

The answer is wrong: eps2 = rnorm(100, 0, 0.5) is the same as before, while the problem requires you to increase the variance.

Chapter 3: Exercise 13

As reported by @shayan-mj in #3

Exercise 13 of chapter 3 says to "create a vector, eps, from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25". So the standard deviation is 0.5. The solution has "eps = rnorm(100, 0, 0.25)", but it should be "eps = rnorm(100, 0, 0.5)", because the rnorm function takes the standard deviation, not the variance, as its argument.

Chapter 2 Exercise 10 d-h: Boston suburbs data set

On page 14 of ISLR (29 of the PDF), the Boston data set is described as "Housing values and other information about Boston suburbs." I didn't know this when I originally answered these questions, so I made a new suburbs variable based on the towns in the data set. Instead, the data set should have been used as-is.

Ch 8 Ex 2

I agree with the intuition of the argument, but it is not necessarily true that each iteration will produce a split on an unused variable. That is, if variable $X_1$ is used for $\hat{f}^1$ in the first iteration, then $X_1$ will not be used in $\hat{f}^2$; however, it could be used again in $\hat{f}^3$. In the end, multiple $\hat{f}^b$'s can split on the same $X_j$.

The final summation statement is true if each $f_j$ is taken to be the sum, over all $B$ boosting iterations, of the fitted stumps that split on $X_j$:

$$f_j(X_j) = \sum_{b=1}^{B} I(\text{tree } b \text{ splits on } X_j)\, \hat{f}^b(X_j),$$

and $B$ can be (much) greater than $p$.

Ch5 lab 7

I've also got the answer to this exercise close to 0.45.
But if cv.glm is used, I get an answer around 0.25, which is confusing:

glm.fit.all <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
LOOCV_error <- cv.glm(Weekly, glm.fit.all)$delta[1]  # around 0.25 instead of 0.45

Can anyone clarify this?
Thank you!
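
One possible explanation (my own suggestion, stated as an assumption): cv.glm's default cost is the average squared error between the 0/1 response and the fitted probabilities, not the misclassification rate. Passing a threshold-based cost function should bring the result in line:

library(ISLR)   # Weekly data
library(boot)   # cv.glm
glm.fit.all <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
cost <- function(r, pi) mean(abs(r - pi) > 0.5)    # 1 when the 0.5-threshold call misses
cv.glm(Weekly, glm.fit.all, cost = cost)$delta[1]  # should land much closer to 0.45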

Re-format Chapter 2 and 3 exercises

For the Chapter 4 and later solutions, I followed a format of one Rmd file per problem, using the knitr package to generate Markdown and HTML files. It would be nice to have a consistent format across chapters.

It would be especially nice to have the photos of handwritten notes typed up.

Chapter 4 - Exercise 3

The answer you gave is rather misleading, and the $\delta_{k}(x)$ is not correct compared to (4.23). The answer should be simplified.


Since we have

$$f_{k}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{k}} \exp\left(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2}\right)$$

we get, for $\delta_{k}(x)$,

\begin{align}
\delta_{k}(x) &= \log \big( f_{k}(x)\,\pi_{k} \big) \\
&= -\log \sqrt{2\pi} - \log \sigma_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + \log \pi_{k} \\
&= -\log \sigma_{k} + \log \pi_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + c
\end{align}

The constant $c = -\log \sqrt{2\pi}$ can be dropped because it is the same for all $K$ discriminant functions. Expanding $(x-\mu_{k})^{2}$ shows that the formula retains a quadratic term in $x^{2}$, which is the point of the exercise.

Chapter 9 Question 7

It is not mentioned explicitly in the problem statement, but I think it is implied that mpg should be excluded as a predictor.
The solution proposed here uses mpg as a predictor for mpglevel, which makes the model completely useless: if we knew mpg, there would be a much simpler way to get mpglevel.
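
A minimal sketch of the fix (variable names follow the solution's convention; the kernel and cost here are placeholders):

library(ISLR)
library(e1071)
Auto$mpglevel = as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
# Drop mpg (and the name factor) from the predictors
fit = svm(mpglevel ~ . - mpg - name, data = Auto, kernel = "linear", cost = 1)
summary(fit)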

p2 ch7

Hey, I think your answer for problem 2 of chapter 7 is wrong.
(a) Since we are minimizing the area under the curve of g(x)^2, g(x) would be just 0.
(b) Since we are minimizing the area under the curve of g'(x)^2, g'(x) would be just 0, so g(x) = k, where k is some constant.
(c) Since we are minimizing the area under the curve of g''(x)^2, g''(x) would be just 0, so g(x) = ax + b, the straight line that best fits the points.
(d) g'''(x) = 0, so g(x) = ax^2 + bx + c, the quadratic curve that best fits the points.
(e) The penalty term is dropped, so g(x) is an interpolant that goes through all the points, making RSS = 0.

CH4. Ex 3

Could you expand on the process of getting the final result?

I think there might be something wrong with the log operation.

7.2.a

Isn't the answer to this exercise g(x)=0 instead of g(x)=k?

Suggestion for chapter 7 exercise 12

# Backfitting sketch: repeatedly refit each coefficient against the partial residuals
set.seed(1)
p = 100
n = 1000
max.rep = 1000
x = matrix(ncol = p, nrow = n)
coefi = rep(NA, p)
for (i in 1:p) {
  x[, i] = rnorm(n)
  coefi[i] = rnorm(1) * 100
}
y = x %*% coefi + rnorm(n)
beta = rep(0, p)
error = rep(0, max.rep)
for (j in 1:max.rep) {
  for (i in 1:p) {
    # Partial residual: add back the contribution of x[, i]
    a = y - x %*% beta + beta[i] * x[, i]
    beta[i] = lm(a ~ x[, i])$coef[2]
  }
  error[j] = sum((y - x %*% beta)^2)
}
error
plot(1:max.rep, error, ylim = c(0, 1))
plot(1:10, error[1:10])

Chapter 7 Ex 3

The R formula is not correct. It should be y = 1 + x + -2 * (x - 1)^2 * I(x > 1)
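
A quick way to eyeball the corrected curve (a sketch; (x > 1) plays the role of the indicator):

x = seq(-2, 2, by = 0.01)
y = 1 + x - 2 * (x - 1)^2 * (x > 1)
plot(x, y, type = "l")   # linear up to x = 1, then bends downward quadratically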

3(b)v.

https://github.com/asadoughi/stat-learning/blob/master/ch2/answers#L45

Line 45 says: "When the training error is lower than the irreducible error, overfitting has taken place." This is not true, since most of the time the training error will be lower than the irreducible error. I don't think training error alone says much about overfitting. Only when training error keeps decreasing while test error keeps increasing can we say that we're overfitting.
Thoughts?

Ch2 Q8 C

pairs(college[,1:10])
Error in pairs.default(college[, 1:10]) : non-numeric argument to 'pairs'
plot(college$Private, college$Outstate)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
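
A likely fix (a sketch; it assumes College.csv was read with the college names in the first column, as in the book's lab, and that the failure comes from Private being a character column, since R 4.x no longer converts strings to factors automatically):

college = read.csv("College.csv")
rownames(college) = college[, 1]
college = college[, -1]
college$Private = as.factor(college$Private)
pairs(college[, 1:10])                     # works once Private is a factor
plot(college$Private, college$Outstate)    # boxplots of Outstate by Private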

Ch3 Ex9c

The question requires the use of multiple linear regression.
The predictor "origin" is stored as a number, but it is qualitative.
Shouldn't it be changed to a factor using something like the following code before running the regression?

origin_fac = factor(Auto$origin, levels = c(1, 2, 3), labels = c("American", "European", "Japanese"))
new_data = subset(Auto, select = -c(name, origin))
new_data = data.frame(new_data, origin_fac)
lm.fit = lm(mpg ~ ., data = new_data)

Suggestion for Chapter 2 question 8,c,iii

With R version 4.0.2, it seems I must apply factor() to categorize the character column "Private". Thus, the solution becomes
plot(as.factor(college$Private), college$Outstate)

Ch3, 11c

Both results in (a) and (b) reflect the same line created in 11a. In other words, y = 2x + ε could also be written x = 0.5(y − ε).

I think they are not the same regression line, because they have different slopes. Or am I missing something?

Ch4-Exercise4

Hi, I have a question about 4.4(c).
The solution is 100 * (0.1)^100; why do we multiply by 100 here?
In 4.4(b), the answer did not multiply by 2.

Thanks

Ch10 Ex7

The question requires us to scale each observation to have mean 0 and SD 1, not each predictor, in order to produce the desired proportionality.

Suggested answer:

data(USArrests)

scale_data = scale(t(USArrests))   # scale() standardizes columns, so transpose first

correlation = cor(scale_data)
corr.1.minus.r = 1 - as.dist(correlation)

distance.squared = dist(t(scale_data), method = "euclidean")^2

summary(corr.1.minus.r / distance.squared)

Sub-Section 3.6.2 Simple Linear Regression, page 112: missing $ in code

In chapter 3, Sub-Section 3.6.2, Simple Linear Regression, on page 112, there is a paragraph that states, to quote,

we will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(lstat, medv)

This code returns the error Error in plot(lstat, medv) : object 'lstat' not found.
There is a missing $ in the code plot(lstat, medv).
The correct syntax should be plot(Boston$lstat, Boston$medv).
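
For what it's worth (an aside; this follows my recollection of the book's lab): the lab attaches the data frame earlier in the section, which also makes the short form work:

library(MASS)    # Boston data
attach(Boston)
plot(lstat, medv)
abline(lm(medv ~ lstat))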

7.5b)

The solutions say g1 is expected to have a smaller test RSS because of one fewer degree of freedom, but this is not necessarily true. For example, if the true data-generating process is cubic, then g2 will fit better.
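
For reference, the two estimators from the exercise (in the book's notation):

$$\hat{g}_1 = \arg\min_g \Big( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \big[g^{(3)}(x)\big]^2 dx \Big), \qquad \hat{g}_2 = \arg\min_g \Big( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \big[g^{(4)}(x)\big]^2 dx \Big)$$

Since $\hat{g}_2$ penalizes a higher-order derivative, it is the more flexible of the two, so which one wins on test RSS depends on the true function.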

Ch7 9(d) bs(4) has *ONE* knot

  1. First, bs(x, df = 4) should have only ONE interior knot, since for a cubic spline the number of knots is df minus the degree, 4 - 3 = 1 (see the sketch below);
  2. Second, it seems that when both df and knots are supplied to bs(), the result is not simply that knots is ignored.
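
A quick way to verify the first point (a sketch using the splines package):

library(splines)
x = seq(0, 10, length.out = 100)
b = bs(x, df = 4)    # cubic by default, so df - degree = 4 - 3 = 1 interior knot
attr(b, "knots")     # a single knot, placed at the median of x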

Chapter 4 - Exercise 2

I think the answer you gave is confusing and doesn't explain the purpose of the transformation.
The reason why you should find the class $k$ that maximizes $\delta_{k}(x)$ is Bayes' theorem.
From Bayes' theorem (4.12),
$$p_{k}(x) = \frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)},$$
we know that the denominator $\sum_{l=1}^{K} \pi_{l} f_{l}(x)$ is the same for every class $k$.
However, the prior probability $\pi_{k}$ and the density $f_{k}(x)$ differ with $k$.
So the objective is to find the largest $\pi_{k}f_{k}(x)$ among $(\pi_{1}f_{1}(x), \dots, \pi_{K}f_{K}(x))$, which leads us to the largest $p_{k}(x)$.
With the logarithm transformation we get $\delta_{k}(x) = \log \big(\pi_{k}f_{k}(x)\big)$.
In the end, finding the largest $\delta_{k}(x)$ among the $K$ classes is equivalent to finding the largest $p_{k}(x)$, but the computation for $\delta_{k}(x)$ is much easier than for $p_{k}(x)$.

Chapter 4 Exercise 11

Point (g):
The selected training data (train = (year %% 2 == 0)) leads to an optimal KNN solution with k = 3. The test error in this case amounts to 13.7%:

library(class)   # knn
knn.pred = knn(train.X, test.X, train.mpg01, k = 3)
mean(knn.pred != mpg01.test)
[1] 0.1373626

Question 10.7

Thank you for writing up these solutions! They were very helpful to me.
The answer to 10.7 is incorrect: the observations should be normalized by row rather than by column, which will yield a fixed ratio of 1/6. All that needs to change is your first line:

dsc = t(scale(t(USArrests)))
a = dist(dsc)^2
b = as.dist(1 - cor(t(dsc)))
summary(b/a)
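
The fixed 1/6 follows from a short calculation (my sketch, with $p = 4$ variables and R's scale() using the $n - 1$ denominator): for standardized rows $x$ and $y$, $\sum_i x_i^2 = \sum_i y_i^2 = p - 1 = 3$ and $\sum_i x_i y_i = 3r$, so

$$d^2 = \sum_i (x_i - y_i)^2 = 3 + 3 - 6r = 6(1 - r), \qquad \frac{1 - r}{d^2} = \frac{1}{6}.$$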

Chapter 10 Question 7

The solution is obviously wrong because the check at the end fails (there should be no variability in this ratio).
The reason is that scale in R standardizes columns, not rows (which is counter-intuitive here, but the book does tell us to make sure that the standard deviation of each observation, not each variable, is 1).
This can be fixed by running dsc = t(scale(t(USArrests))) instead of dsc = scale(USArrests).

Ch2 Q9 Quantiles

This does not remove the 10th through 85th quantiles; it removes elements 10 through 85: newAuto = Auto[-(10:85),]. If Auto had 100 entries, all in the same order, it would work. I think you need to sort each quantitative variable and then remove elements 0.10 * length(mpg) through 0.85 * length(mpg), for instance.

Better transformation for Ch3 Q9-(f)

Log-transforming mpg produced a better R-squared and better residual diagnostics.

Here is the R code and output:

lm.fit2<-lm(log(mpg)~cylinders+displacement+horsepower+weight+sqrt(acceleration)+year+origin,data=Auto)
summary(lm.fit2)
par(mfrow=c(2,2))
plot(lm.fit2)
plot(predict(lm.fit2),rstudent(lm.fit2))

Call:
lm(formula = log(mpg) ~ cylinders + displacement + horsepower +
weight + sqrt(acceleration) + year + origin, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41288 -0.06546  0.00002  0.06837  0.34063

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.829e+00  1.986e-01   9.210  < 2e-16 ***
cylinders          -2.806e-02  1.156e-02  -2.428  0.01562 *
displacement        6.171e-04  2.699e-04   2.287  0.02276 *
horsepower         -1.615e-03  5.008e-04  -3.225  0.00137 **
weight             -2.497e-04  2.373e-05 -10.520  < 2e-16 ***
sqrt(acceleration) -2.344e-02  2.896e-02  -0.810  0.41865
year                2.952e-02  1.823e-03  16.193  < 2e-16 ***
origin              4.068e-02  9.948e-03   4.090 5.26e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.119 on 384 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8775
F-statistic: 401 on 7 and 384 DF, p-value: < 2.2e-16

Chapter 3 exercise 1

"The high p-value of newspaper suggests that the null hypothesis is true for newspaper." is inaccurate, it should be "The high p-value of newspaper suggests that we can't reject the null hypothesis for newspaper."
