tidy-finance / website

This repository hosts the source code for the website tidy-finance.org.
Home Page: https://tidy-finance.org
License: Other
I was reproducing the code below and it did not work properly.

returns <- prices |>
  arrange(date) |>
  mutate(ret = adjusted / lag(adjusted) - 1) |>
  select(symbol, date, ret)
returns

I think the formula needs parentheses:

returns <- prices |>
  arrange(date) |>
  mutate(ret = (adjusted / lag(adjusted)) - 1) |>
  select(symbol, date, ret)
returns
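For what it's worth, a quick sanity check (my addition, with a made-up price vector) suggests the two pipelines return identical numbers, because `/` binds tighter than `-` in R; the parentheses change readability rather than results:

# Toy check: division binds tighter than subtraction in R
adjusted <- c(100, 110, 121)
identical(adjusted / dplyr::lag(adjusted) - 1,
          (adjusted / dplyr::lag(adjusted)) - 1)
#> [1] TRUE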
We want to highlight that important people think that Tidy Finance is great by displaying quotes in some form of carousel.
I'm following this guide to download and prepare daily CRSP data, and I think there is room for improvement.

The "big data" issue indeed exists, and splitting the data is a good solution. However, the chunk size should be a tunable parameter, depending on how much RAM is available. A bigger chunk_size means more parallel processing but higher memory consumption.

I'm only downloading data from 1996 to 2020. The current code in the book downloads and processes one PERMNO at a time (chunk_size = 1). The estimated time is about 4 hours (according to the progress package). However, if I process 100 at a time (chunk_size = 100), this will only take about 50 minutes.

Below is my modified code:
# `wrds` is the WRDS connection, `db` the local database connection
dsf_db <- tbl(wrds, in_schema("crsp", "dsf"))

factors_ff3_daily <- tbl(db, "factors_ff3_daily") |>
  collect()

permnos <- tbl(db, "crsp_monthly") |>
  distinct(permno) |>
  pull()

# Determine the number of chunks
chunk_size <- 100
num_chunks <- ceiling(length(permnos) / chunk_size)

# Progress bar using the progress package
pb <- progress_bar$new(
  format = "[:bar] :percent eta: :eta",
  total = num_chunks
)

for (j in 1:num_chunks) {
  # Select the permnos for this chunk
  permno_chunk <- permnos[
    ((j - 1) * chunk_size + 1):min(j * chunk_size, length(permnos))
  ]

  # Process all permnos in the chunk at once
  crsp_daily_sub <- dsf_db |>
    filter(permno %in% permno_chunk &
             date >= start_date & date <= end_date) |>
    select(permno, date, ret) |>
    collect() |>
    drop_na()

  if (nrow(crsp_daily_sub) > 0) {
    crsp_daily_sub <- crsp_daily_sub |>
      mutate(month = floor_date(date, "month")) |>
      left_join(factors_ff3_daily |>
                  select(date, rf), by = "date") |>
      mutate(
        ret_excess = ret - rf,
        ret_excess = pmax(ret_excess, -1)
      ) |>
      select(permno, date, month, ret, ret_excess)

    dbWriteTable(db,
      "crsp_daily",
      value = crsp_daily_sub,
      overwrite = (j == 1),
      append = (j != 1)
    )
  }
  pb$tick()
}
Updating to new dbplyr version with I() instead of in_schema() syntax: https://www.tidyverse.org/blog/2024/04/dbplyr-2-5-0/
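For illustration, a minimal sketch of what that migration looks like (using the crsp.dsf table from the chunking code above as the example):

# dbplyr < 2.5.0
dsf_db <- tbl(wrds, in_schema("crsp", "dsf"))

# dbplyr >= 2.5.0
dsf_db <- tbl(wrds, I("crsp.dsf"))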
I get an error message when using drive_download() in Chapter 2.

If the directory data is not present, then running this code:

drive_download(
  macro_predictors_link,
  path = "data/macro_predictors.xlsx"
)

gives an annoyingly cryptic error message:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file data/macro_predictors.xlsx.
One option would be to include:

if (!dir.exists("data")) dir.create("data")

Another would be to remove the data/ prefix and simply use:

drive_download(
  macro_predictors_link,
  path = "macro_predictors.xlsx"
)

Given the file.remove("data/macro_predictors.xlsx") command given later, the second option would seem to work. A third option would be to use tempfile().
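A minimal sketch of the tempfile() option (my addition; readxl::read_xlsx as the downstream reader is an assumption):

# Download to a session-local temporary file instead of data/
path <- tempfile(fileext = ".xlsx")
drive_download(macro_predictors_link, path = path)
macro_predictors_raw <- readxl::read_xlsx(path)
# No explicit file.remove() needed: the temporary file
# disappears when the R session ends.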
tidy_finance <- dbConnect(
Error: Could not connect to database:
unable to open database file
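This error typically surfaces when SQLite cannot create the database file because the target directory is missing. A minimal sketch of a guard, assuming the book's SQLite setup (RSQLite with data/tidy_finance.sqlite):

# Create the data directory first, so SQLite can create the file inside it
if (!dir.exists("data")) dir.create("data")

tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)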
We currently use the siccd from msf_db (i.e., in_schema("crsp", "msf_v2")). However, the more accurate way is siccd from stksecurityinfohist_db (i.e., in_schema("crsp", "stksecurityinfohist")). Yes, this variable has the same name but a different meaning depending on the table.

Reasoning: siccd is back-filled in msf_v2. A simple check to reveal this back-fill (with an active WRDS connection):

siccd_permno_pairs <- msf_db |>
  select(permno, siccd) |>
  collect() |>
  distinct()

nrow(siccd_permno_pairs) == length(unique(siccd_permno_pairs$permno))

This shows that every permno has only a single industry, which would imply that no firm ever changed its industry - but firms do change industries: contradiction.
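To illustrate the point, a hypothetical counterpart check on the historical table, mirroring the snippet above:

siccd_history <- stksecurityinfohist_db |>
  select(permno, siccd) |>
  distinct() |>
  collect()

# More distinct pairs than unique permnos would show that
# industries do change over time in the historical records
nrow(siccd_history) > length(unique(siccd_history$permno))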
Relevant for r-tidyfinance
It should be Lasso for rho = 0 and Ridge for rho = 1.
The code from the book (see here) takes 12 minutes and maxes out at about 37GB of RAM. The code here takes 25 seconds and peaks at about 7GB.

The latter code uses DuckDB (I'm not sure it would work well with SQLite) and is a little inelegant because I tried to conform to the book code (e.g., using months to partition daily data). Using, say, 90 days for the daily betas would allow the code to be much simpler (more "tidyverse", less SQL).

I get exactly the same numbers with beta_daily, but a small number of differences with beta_monthly … when I spot-checked these numbers, I couldn't work out why the book code gives different answers.

I'm still working through the book, but there are other places where keeping data in the database (rather than using collect()) dramatically improves performance, in some cases with minimal changes to the code. But in some cases, one needs to use DuckDB, as SQLite does strange things to dates and the like (because you seem to always use collect(), the book uses SQLite pretty much as a dumb data store, and these issues don't crop up with that approach).
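As a minimal sketch (my addition), swapping SQLite for DuckDB only changes the connection; the dplyr pipelines can stay the same:

library(duckdb)

# DuckDB keeps native date types, so computations can stay in the database
tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)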
Add it to preface or getting started?
Why not have the WRDS PostgreSQL server do all the work?
Steps:
- collect() deferred until the end.
- An nrow() function to handle remote data frames.
- window_order() in place of arrange() when used for window functions.
- trd_exctn_tm (PostgreSQL wants a time zone … this could be sorted out later).

Based on the output in the comment below, the code seems to work fine. (Not sure if it delivers much performance benefit, but it saves on readers' electricity bills.)

Note that the code in the comment below does data aggregation in the database and is about 10 times faster than the code using collect().
clean_enhanced_trace <- function(cusips,
                                 connection,
                                 start_date = as.Date("2002-01-01"),
                                 end_date = today()) {

  # nrow() replacement that works on remote data frames
  nrow <- function(df) {
    df |> count() |> pull(n)
  }

  # Packages (required)
  library(tidyverse)
  library(lubridate)
  library(dbplyr)
  library(RPostgres)

  # Function checks ---------------------------------------------------------
  # Input parameters
  ## Cusips
  if (length(cusips) == 0 | any(is.na(cusips))) stop("Check cusips.")

  ## Dates
  if (!is.Date(start_date) | !is.Date(end_date)) stop("Dates needed")
  if (start_date < as.Date("2002-01-01")) stop("TRACE starts later.")
  if (end_date > today()) stop("TRACE does not predict the future.")
  if (start_date >= end_date) stop("Date conflict.")

  ## Connection
  if (!dbIsValid(connection)) stop("Connection issue.")

  # Enhanced TRACE ----------------------------------------------------------
  # Main file
  trace_all <- tbl(connection,
                   in_schema("trace", "trace_enhanced")) |>
    filter(cusip_id %in% cusips) |>
    filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>
    select(cusip_id, msg_seq_nb, orig_msg_seq_nb,
           entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,
           trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,
           pr_trd_dt, trc_st, asof_cd, wis_fl,
           days_to_sttl_ct, stlmnt_dt, spcl_trd_fl)

  # Enhanced TRACE: Post 06-02-2012 -----------------------------------------
  # Trades (trc_st = T) and corrections (trc_st = R)
  trace_post_TR <- trace_all |>
    filter((trc_st == "T" | trc_st == "R"),
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Cancellations (trc_st = X) and correction cancellations (trc_st = C)
  trace_post_XC <- trace_all |>
    filter((trc_st == "X" | trc_st == "C"),
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Cleaning corrected and cancelled trades
  trace_post_TR <- trace_post_TR |>
    anti_join(trace_post_XC,
              by = c("cusip_id", "msg_seq_nb", "entrd_vol_qt",
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id",
                     "trd_exctn_dt", "trd_exctn_tm"))

  # Reversals (trc_st = Y)
  trace_post_Y <- trace_all |>
    filter(trc_st == "Y",
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Clean reversals
  ## Match the orig_msg_seq_nb of the Y-message to
  ## the msg_seq_nb of the main message
  trace_post <- trace_post_TR |>
    anti_join(trace_post_Y,
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb",
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd",
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))

  # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------
  # Cancellations (trc_st = C)
  trace_pre_C <- trace_all |>
    filter(trc_st == "C",
           trd_rpt_dt < as.Date("2012-02-06"))

  # Trades without cancellations
  ## Match the orig_msg_seq_nb of the C-message
  ## to the msg_seq_nb of the main message
  trace_pre_T <- trace_all |>
    filter(trc_st == "T",
           trd_rpt_dt < as.Date("2012-02-06")) |>
    anti_join(trace_pre_C,
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb",
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd",
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))

  # Corrections (trc_st = W) - a W can also correct a previous W
  trace_pre_W <- trace_all |>
    filter(trc_st == "W",
           trd_rpt_dt < as.Date("2012-02-06"))

  # Implement corrections in a loop
  ## Correction control
  correction_control <- nrow(trace_pre_W)
  correction_control_last <- nrow(trace_pre_W)

  ## Correction loop
  while (correction_control > 0) {
    # Corrections that correct some message
    trace_pre_W_correcting <- trace_pre_W |>
      semi_join(trace_pre_T,
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))

    # Corrections that do not correct some message
    trace_pre_W <- trace_pre_W |>
      anti_join(trace_pre_T,
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))

    # Delete messages that are corrected and add correction messages
    trace_pre_T <- trace_pre_T |>
      anti_join(trace_pre_W_correcting,
                by = c("cusip_id", "trd_exctn_dt",
                       "msg_seq_nb" = "orig_msg_seq_nb")) |>
      union_all(trace_pre_W_correcting)

    # Escape if no corrections remain or they cannot be matched
    correction_control <- nrow(trace_pre_W)
    if (correction_control == correction_control_last) {
      correction_control <- 0
    }
    correction_control_last <- nrow(trace_pre_W)
  }

  # Clean reversals
  ## Record reversals
  trace_pre_R <- trace_pre_T |>
    filter(asof_cd == "R") |>
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt,
             rptd_pr, rpt_side_cd, cntra_mp_id) |>
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>
    mutate(seq = row_number()) |>
    ungroup()

  ## Remove reversals and the reversed trades
  trace_pre <- trace_pre_T |>
    filter(is.na(asof_cd) | !(asof_cd %in% c("R", "X", "D"))) |>
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt,
             rptd_pr, rpt_side_cd, cntra_mp_id) |>
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>
    mutate(seq = row_number()) |>
    ungroup() |>
    anti_join(trace_pre_R,
              by = c("cusip_id", "trd_exctn_dt", "entrd_vol_qt",
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id", "seq")) |>
    select(-seq)

  # Agency trades -----------------------------------------------------------
  # Combine pre and post trades
  trace_clean <- trace_post |>
    union_all(trace_pre)

  # Keep agency sells and unmatched agency buys
  ## Agency sells
  trace_agency_sells <- trace_clean |>
    filter(cntra_mp_id == "D",
           rpt_side_cd == "S")

  ## Agency buys that are unmatched
  trace_agency_buys_filtered <- trace_clean |>
    filter(cntra_mp_id == "D",
           rpt_side_cd == "B") |>
    anti_join(trace_agency_sells,
              by = c("cusip_id", "trd_exctn_dt",
                     "entrd_vol_qt", "rptd_pr"))

  ## Agency clean
  trace_clean <- trace_clean |>
    filter(cntra_mp_id == "C") |>
    union_all(trace_agency_sells) |>
    union_all(trace_agency_buys_filtered)

  # Additional filters ------------------------------------------------------
  trace_add_filters <- trace_clean |>
    mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>
    filter(is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,
           is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,
           wis_fl == "N",
           is.na(spcl_trd_fl) | spcl_trd_fl == "",
           is.na(asof_cd) | asof_cd == "")

  # Output ------------------------------------------------------------------
  # Only keep necessary columns
  trace_final <- trace_add_filters |>
    select(cusip_id, trd_exctn_dt, trd_exctn_tm,
           rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id)

  # Return
  trace_final |>
    arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>
    collect()
}
wrds <- dbConnect(
  Postgres(),
  host = "wrds-pgdata.wharton.upenn.edu",
  dbname = "wrds",
  port = 9737,
  sslmode = "require",
  bigint = "integer"
)

tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)

mergent_cusips <- tbl(tidy_finance, "mergent") |>
  pull(complete_cusip)

# Split the cusips into 100 roughly equal parts
mergent_parts <- split(
  mergent_cusips,
  rep(1:100, length.out = length(mergent_cusips))
)

for (j in 1:length(mergent_parts)) {
  print(paste("Starting", j))
  trace_enhanced <- clean_enhanced_trace(
    cusips = mergent_parts[[j]],
    connection = wrds,
    start_date = ymd("2014-01-01"),
    end_date = ymd("2016-11-30")
  )
  dbWriteTable(
    conn = tidy_finance,
    name = "trace_enhanced",
    value = trace_enhanced,
    overwrite = (j == 1),
    append = (j != 1)
  )
  print(paste("Finishing", j))
}

dbDisconnect(wrds)
dbDisconnect(tidy_finance, shutdown = TRUE)
Brulee now has "sigmoid" implemented as an activation function.
Replicating Accessing and Managing Financial Data and WRDS Dummy Data to apply Parametric Portfolio Policies creates an error when computing evaluate_portfolio(weights_crsp).

In evaluate_portfolio(), at the evaluation_capm line:

ValueError: zero-size array to reduction operation maximum which has no identity

It likely comes from a misalignment of dates between evaluation and factors_ff_monthly: evaluation.merge(factors_ff_monthly, how="left", on="month") appears to have no common month with factors_ff_monthly. evaluation is constructed from crsp_monthly, which I created following WRDS Dummy Data.

From WRDS Dummy Data:

crsp_monthly_dummy = (stock_panel_monthly
  .assign(
    date=lambda x: x["month"]+pd.offsets.MonthEnd(-1),
    ret=lambda x: np.fmax(np.random.normal(size=len(x)), -1),

I am unsure at this point, but it may be x["month"]+pd.offsets.MonthEnd(-1) that is incorrect here. I don't have access to WRDS data to compare with the dummy data.

Pandas version: 2.2.2
Database: running fine
Full replication, no changes to the code; the only difference is a freshly installed environment.
In the introduction, under the heading Scaling up the analysis, you have this part of the code, starting at line 130:

ticker <- tq_index("DOW")
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30"
) %>%
  filter(symbol != "DOW") # Exclude the index itself

It is wrong to exclude the symbol "DOW", as this is not the index but the company Dow Inc.

Your code should be:

ticker <- tq_index("DOW")
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30"
)

Of course, this means the subsequent graphs and results need to be rechecked.
Sir,

In this part of the code:

risk_premiums <- data_fama_macbeth |>
  nest(data = c(ret_excess_lead, beta, log_mktcap, bm, permno)) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates)

I see that the results obtained are general; we don't get the results for each id or permno. If we need results for each permno, what modifications are needed?
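A hedged sketch of one possible modification (my addition, not from the book): nest by permno and run a time-series regression per stock, which yields one set of coefficients per permno instead of one per month. Note this changes the interpretation away from Fama-MacBeth cross-sectional estimates:

# One regression per permno across its time series of observations
risk_premiums_by_permno <- data_fama_macbeth |>
  nest(data = -permno) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates) |>
  select(-data)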
Because we received a request, we need to add:
I feel like exercise 3 in the Fama-MacBeth chapter would benefit from a little more explanation. Maybe it would make sense to mention in the chapter that one can use the characteristics or factor exposures derived from time-series regressions as input for the cross-sectional regressions, to clarify where one would use the rolling-window regressions.
When running the melt() function on index_prices:

.melt(ignore_index=False, var_name=["variable", "symbol"])

it returns a ValueError:

ValueError: var_name=['variable', 'symbol'] must be a scalar.
I have not evaluated the code in detail yet, but I see that the input to the figure is:

tibble(
  `No short-sale` = w_no_short_sale$solution,
  `Minimum Variance` = w_mvp,
  `Efficient portfolio` = compute_efficient_weight(Sigma, mu),
  `Regulation-T` = w_reg_t$par,
  Industry = colnames(industry_returns)
)

At the same time, the resulting figure omits the "Regulation-T" column.
Niklas alerted me that a missing na.rm = TRUE rendered the computation of the effective spread NA.
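For illustration only, a minimal sketch of the kind of fix implied (the data frame and column names here are hypothetical, not from the book):

# Without na.rm = TRUE, a single missing trade turns the whole mean into NA
trades |>
  group_by(symbol) |>
  summarize(effective_spread = mean(effective_spread, na.rm = TRUE))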
Is the calculation of dy correct? Or should it be:

dy=lambda x: np.log(x["D12"])-np.log(x["Index"].shift(1)),

macro_predictors = (
  pd.read_csv(macro_predictors_link, thousands=",")
  .assign(
    month=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
    dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
    dy=lambda x: np.log(x["D12"])-np.log(x["D12"].shift(1)),
    ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
    de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
    tms=lambda x: x["lty"]-x["tbl"],
    dfy=lambda x: x["BAA"]-x["AAA"]
  )
  .rename(columns={"b/m": "bm"})
  .get(["month", "dp", "dy", "ep", "de", "svar", "bm",
        "ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
  .query("month >= @start_date and month <= @end_date")
  .dropna()
)
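For context, the Welch and Goyal (2008) definitions that motivate the proposed correction, with D12_t the twelve-month dividend sum and Index_t the index level:

dp_t = \log(D12_t) - \log(\text{Index}_t), \qquad dy_t = \log(D12_t) - \log(\text{Index}_{t-1})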
The equation for the efficient portfolio weight in the book is written as:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min …

When it should rather be:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min …
Workflow:
The idea is to use the portfolio snapshots of wikifolio traders and discover which stocks are often held together.
Include reference to WRDS package.
Function yields a warning (without impacting the expected behavior): "tz(): Don't know how to compute timezone for object of class hms/difftime; returning 'UTC'."
Proposed by Hermann Elendner via LinkedIn.
https://developers.google.com/search/docs/appearance/favicon-in-search
Let's also get a nicer icon like the other website :)