
bds's People

Contributors

itamarcaspi, laszlosandor, mataddy, simonfrisch1


bds's Issues

timespace.R - crash on line 77

When running line 77 of timespace.R, summary(ARdj <- glm(dja[2:n] ~ d…, RStudio crashes and reports a "Segmentation fault" in the Bash terminal. I read on Stack Overflow that this could be due to limited memory. I'm running Crostini/Debian on a Pixelbook with 8 GB of RAM and approximately 80 GB of free disk space. Can anyone suggest a possible solution or troubleshooting tips?

further typos

Some small corrections (as of the first printing):

  • p. 13, second code insert has weird spacing after matrix name trucks, also missing a $ in penultimate example
  • p. 14 last line of code insert: legend is "topleft" in the figure
  • p. 46, line 1: even thRough
  • p. 82, line 1: "sequence of 100 \lambda_T" uses wrong subscript
  • p. 86, line -2: "data data"
  • p. 112, line 9: false negative rate should be rounded to 8%, if anything
  • p. 116, line 1: text says to cv.gamlr but insert calls cv.glmnet
  • p. 120, line -2 before first code insert: missing full stop.
  • p. 132, second insert: tapply is missing a closing parenthesis for ybar_w
  • p. 166 mid-page: "Equation 105" is Equation (6.2)
  • p. 184, paragraph title should read "Turning back to" (and not the)
  • p. 186, line 12: "relevant to consumerS."
  • p. 191, line -1 before last insert: "Equation 120" is actually Eq. (6.17)
  • p. 193, line -6: "in aN simple"
  • p. 199, line 2: Spain joined the European Community in 1986 (not the Commission)
  • p. 242, Algo 22: rotations are the eigenvectors, not the values (as the code has it)

BDS/examples/oj.R

Lines 75 through 84 reference oj$logmove, which is not part of the oj.csv data and is not created in the preceding code.
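One plausible one-line fix, sketched on stand-in data: define logmove before it is used. This assumes oj.csv has a sales column for units moved, as in the book's other oj examples; the column name is an assumption, not confirmed by the repo.

```r
# Hypothetical fix: build logmove from a sales/units column before line 75.
# The `sales` column name is an assumption; substitute the actual units column.
oj <- data.frame(sales = c(100, 250, 80))  # stand-in rows for illustration
oj$logmove <- log(oj$sales)
oj$logmove
```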

Figure 5.3 appears to show total log revenue rather than average, which conflicts with the text; the model uses mean(log(revenue)) rather than log(mean(revenue)).

Pg 143/144 & https://github.com/TaddyLab/BDS/blob/master/examples/paidsearch.R

Text:

Figure 5.3 shows the log difference between average revenues in each group.

Caption:

The log-scale average revenue difference ..

However, in the code, both plots use totalrev and are created before semavg is defined.

The total vs. average log differences produce the same pattern on different scales, but this initially confused me as I walked through the code/example.


Relatedly, suppose the graphs plotted the mean instead of the total, so that they matched the model.

The graphs first take the average (or total, in the current code) and then take the log of the average, i.e., log(mean(revenue)).

The model uses y from semavg, which takes the log and then the mean. In the code, y is defined as y=mean(log(revenue)).

Whether we use sum or mean in the model, it seems like we would want to take the log after the mean. This seems especially true if we were going to use sum rather than mean.


Original Code (mean(log(revenue)))

library(data.table)
sem <- as.data.table(sem)
sem_avg_log <- sem[, 
			list(d=mean(1-search.stays.on), y=mean(log(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_avg_log, "treatment_period", "t") # names to match slides
sem_avg_log <- as.data.frame(sem_avg_log)
coef(glm(y ~ d*t, data=sem_avg_log))['d:t']

gives -0.006586852


log(mean(revenue)):

sem_log_avg <- sem[, 
			list(d=mean(1-search.stays.on), y=log(mean(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_avg, "treatment_period", "t") # names to match slides
sem_log_avg <- as.data.frame(sem_log_avg)
coef(glm(y ~ d*t, data=sem_log_avg))['d:t']

gives -0.005775498


If we were to use sum rather than mean and then take the log, i.e. log(sum(revenue)):

sem_log_sum <- sem[, 
			list(d=mean(1-search.stays.on), y=log(sum(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_sum, "treatment_period", "t") # names to match slides
sem_log_sum <- as.data.frame(sem_log_sum)
coef(glm(y ~ d*t, data=sem_log_sum))['d:t']

gives -0.005775498, which is the same as log(mean(revenue))


If we were to do sum(log(revenue)), which would clearly be wrong because the control is a larger group, then we'd get -0.2534986...


Is there a reason we should specifically use mean(log(revenue)) rather than log(mean(revenue))?
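A small numeric sketch (toy numbers, not the paidsearch data) of why log(sum(revenue)) and log(mean(revenue)) gave the identical d:t coefficient above: within each group, log(sum(x)) = log(mean(x)) + log(n), so when every dma has the same number of rows per period the two outcomes differ only by a constant absorbed into the intercept and the t main effect.

```r
# log(sum(x)) = log(mean(x)) + log(n): the two group-level outcomes differ
# only by log(n), which is constant across dmas within a treatment period.
x <- c(2, 4, 6, 8)
all.equal(log(sum(x)), log(mean(x)) + log(length(x)))  # TRUE
```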

ALGORITHM 4 K-Fold OOS (semiconductor)

Error when running the semiconductor.R code in the out-of-sample prediction experiment.

In the loop that runs the experiment, line 72 defines the rcut object using cutvar, but cutvar is never specified to define the data.

Any clarifications on this?

Thanks

Typos eq: 1.10 and Marginal Likelihood

Hi,

Bought the digital edition of your book and, while not too far into it as for right now, I enjoy it a lot.

I noticed some typos in the Bayesian Inference subsection, specifically for equation 1.10 and the Marginal Likelihood equation.

Given the definition of P(X|Θ) as the probability of X given Θ:

For equation 1.10, in the book (digital) we have:

 P(Θ|X) = P(Θ|X)π(Θ)/P(X) ∝ P(Θ|X)π(Θ)

I believe it should instead be:

 P(Θ|X) = P(X|Θ)π(Θ)/P(X) ∝ P(X|Θ)π(Θ)

Similarly, for the Marginal Likelihood equation:

 P(X) = ∫P(Θ|X)π(Θ)dΘ

I believe it should instead be:

 P(X) = ∫P(X|Θ)π(Θ)dΘ
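A quick numeric check of the corrected formulas on a discrete two-point parameter space (the probabilities below are made up for illustration):

```r
prior <- c(0.4, 0.6)          # pi(Theta)
lik   <- c(0.9, 0.2)          # P(X | Theta), the likelihood
marg  <- sum(lik * prior)     # marginal likelihood P(X) = sum P(X|Theta) pi(Theta)
post  <- lik * prior / marg   # posterior P(Theta | X), proportional to lik * prior
sum(post)                     # 1: a proper posterior distribution
```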

BDS Errata

First, thanks for your work which really provides a useful knowledge source for the data science community. Is there a list of errata already available? Thanks.

Typos

Great book - got both the digital and print version!

Just a couple of typos I found:

  1. fixed affects (vs. fixed effects) - Footnote 14, Chapter 5

  2. In the code comments under semi-conductors there is a typo in the word deviance.

## get null devaince too, and return R2

betas[,1:5] -> betas[1:5,]

New to R :), but I think on pg. 25, line ~6 of the browser example, it should be betas[1:5,] instead of betas[,1:5] to print the first 5 rows of betas.

Chapter 1 errata (as of the O'Reilly online edition, March 14, 2020)

page 33, line -2: The simple proof for this assumes independence between tests and (see Figure 1.12) ## and?

page 34, fig 1.12: FDP was confusing in print (only clarified in the caption), but online it's completely different: it looks truncated and also mentions FDF (so neither FDR nor FDP)

(Bayes' Rule still wrong in 1.10, by the way)

oj.R line 8 levels(oj$brand) returns NULL

Original code
## read in the data
oj <- read.csv("oj.csv")
head(oj)
levels(oj$brand)

levels(oj$brand) returns NULL.

Modified code

## read in the data
oj <- read.csv("oj.csv", stringsAsFactors=TRUE)
head(oj)
levels(oj$brand)

This works.
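For context, R 4.0 changed the default of stringsAsFactors in read.csv/data.frame from TRUE to FALSE, which is why levels() now returns NULL on the character column. An equivalent fix is to convert the column explicitly; the sketch below uses stand-in rows, not the real oj.csv:

```r
# Equivalent fix: convert the character column to a factor after reading.
oj <- data.frame(brand = c("tropicana", "minute.maid", "dominicks"),
                 stringsAsFactors = FALSE)  # stand-in for read.csv("oj.csv")
oj$brand <- factor(oj$brand)
levels(oj$brand)  # "dominicks" "minute.maid" "tropicana" (alphabetical)
```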

BDS/examples/stocks.R - Dropped Records

First off, thanks for your work and I'm excited for the second edition!

I have been reproducing some examples in the book with the Julia language and came across something that threw me for a loop in the introduction. This is the first time I'd seen the response variable as a matrix with the lm function. When the regression is done with a matrix as the response variable the lm documentation notes:

If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.

This all made sense but I was getting different coefficients than in the stocks.R script provided in this repo. It turns out that lm will drop records in the response matrix if any of the variables have missing values. Since the GOOGL ticker has missing values from 2010-01-01 until 2014-03-01 it drops those records for all the other tickers as well before fitting the models.
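A minimal demonstration of that behavior on toy data (nothing here uses the stocks files): with a matrix response, lm()'s default na.omit drops any row in which any response column is NA, so one column's missing values shrink the sample for every column.

```r
set.seed(2)
x <- rnorm(10)
Y <- cbind(a = 2 * x + rnorm(10), b = -x + rnorm(10))
Y[1:3, "b"] <- NA                 # missing values in one response column only
joint <- lm(Y ~ x)                # matrix response: drops every row with an NA
alone <- lm(Y[, "a"] ~ x)         # single response: keeps all 10 rows
c(nobs(joint), nobs(alone))       # 7 10
```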

So the original plot of coefficients was this:
2022-01-01_some-data

If all the available data is used then the plot becomes:
2022-01-01_all-data

This changes the interpretation of the plot slightly for most stocks, but the change for Facebook (FB) is likely the most notable.

Here's the code I used to create the plots:

library(tidyverse)

# get all stocks data 
url_stocks <- "https://raw.githubusercontent.com/TaddyLab/BDS/master/examples/stocks.csv"

stocks <- read.csv(url_stocks)
stocks$RET <- as.numeric(as.character(stocks$RET))
stocks$date <- as.Date(as.character(stocks$date), format="%Y%m%d")
stocks <- stocks %>% filter(TICKER!="" & RET!="")
dups <- which(duplicated(stocks[,c("TICKER","date")]))
stocks <- stocks[-dups,]

stocks$month <- paste(format(stocks$date, "%Y-%m"),"-01",sep="")
stocks$month <- as.Date(stocks$month)

agg <- function(r) prod(1+r, na.rm=TRUE) - 1
mnthly <- stocks %>%
  group_by(TICKER, month) %>%
  summarize(RET = agg(RET), SNP = agg(sprtrn))

RET <- as.data.frame(mnthly[,-4]) %>% spread(TICKER, RET)
SNP <- as.data.frame(mnthly[,c("month","SNP")])
SNP <- SNP[match(unique(SNP$month),SNP$month),]

RET <- RET %>% select(-MPET)

# get three-month U.S. treasury bills data
url_tbill <- "https://raw.githubusercontent.com/TaddyLab/BDS/master/examples/tbills.csv"
tbills <- read.csv(url_tbill)
tbills$date <- as.Date(tbills$date)

# get big company market cap data
url_bigs <- "https://raw.githubusercontent.com/TaddyLab/BDS/master/examples/bigstocks.csv"
bigs <- read.csv(url_bigs, header = FALSE, as.is = TRUE)
exr <- (as.matrix(RET[,bigs[,1]]) - tbills[,2])
mkt <- (SNP[,2] - tbills[,2])

# regression models from book
capm <- lm(exr ~ mkt)
(ab <- t(coef(capm))[,2:1])

ab <- ab[-9,]

par(mai=c(.8,.8,0,0), xpd=FALSE)
plot(ab, type="n", bty="n", xlab="beta", ylab="alpha")
abline(v=1, lty=2, col=8)
abline(h=0, lty=2, col=8)
text(ab, labels=rownames(ab), cex=bigs[,2]/350, col="navy")

# create regression per variable
exrdf <- as.data.frame(exr)
exrdf <- mutate(exrdf, mkt = mkt)

allmods <- exrdf %>% 
  pivot_longer(-mkt, names_to = "ticker", values_to = "exr") %>% 
  group_by(ticker) %>% 
  nest() %>% 
  mutate(
    regmods = map(data, ~ lm(exr ~ mkt, data = .)),
    coefs = map(regmods, broom::tidy)
    ) %>%
  unnest(coefs) %>% 
  select(ticker, term, estimate) %>% 
  pivot_wider(names_from = term, values_from = estimate) %>% 
  filter(ticker != "WMT") %>% 
  ungroup() %>% 
  select(ticker, mkt, `(Intercept)`) %>% 
  column_to_rownames("ticker")

# plot new results
par(mai=c(.8,.8,0,0), xpd=FALSE)
plot(allmods, type="n", bty="n", xlab="beta", ylab="alpha")
abline(v=1, lty=2, col=8)
abline(h=0, lty=2, col=8)
text(allmods, labels=rownames(allmods), cex=bigs[,2]/350, col="navy") 

examples/semiconductors.R

Hi!

There seems to be an issue with the cross-validation code (starting from line 59). The problem lies in line 72, where the subsetting results in a NULL variable (it uses cutvar, which is not declared anywhere in the code).

I resolved the problem by correcting the subsetting: I changed data=cutvar to data=SC[,c("FAIL",names(signif))]. It's rather crude, but it works like magic.
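A self-contained sketch of that fix, with toy stand-ins for SC, signif, and the fold's training rows (the names follow semiconductors.R, but the data below is made up):

```r
set.seed(1)
SC <- data.frame(FAIL = rbinom(60, 1, 0.3),
                 SIG1 = rnorm(60), SIG2 = rnorm(60), OTHER = rnorm(60))
signif <- c(SIG1 = 0.5, SIG2 = -0.3)  # stand-in for the FDR-selected signals
train <- 1:45                         # stand-in for one CV fold's training rows
# refit on the cut-down design instead of the undefined `cutvar`
rcut <- glm(FAIL ~ ., family = "binomial",
            data = SC[, c("FAIL", names(signif))], subset = train)
length(coef(rcut))  # 3: intercept plus the two selected signals
```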

code update for R 4.0

Some of the example code in the repo (and the book) does not work on R 4.0.

Instead of separate PRs for individual fixes, maybe I can point you to my forked repo with Jupyter notebooks reproducing the examples chapter by chapter (with minimal comments, mostly reformatting the original repo's inline comments into Markdown cells).
