
website's People

Contributors

christophscheuch, derkevinriehl, iangow, jestover, niklaslandsberg, patrick-weiss, voigtstefan


website's Issues

Possible code Bug

I was reproducing the code below and it did not work properly.

returns <- prices |>
arrange(date) |>
mutate(ret = adjusted / lag(adjusted) - 1) |>
select(symbol, date, ret)
returns

I think the formula needs parentheses:

returns <- prices |>
arrange(date) |>
mutate(ret = (adjusted / lag(adjusted)) - 1) |>
select(symbol, date, ret)
returns

Global: speed up preparing daily CRSP data

I'm following this guide to download and prepare daily CRSP data, and I think there is room for improvement.

The 'big data' issue indeed exists, and splitting the data is a good solution. However, the chunk size should be a tunable parameter, depending on how much RAM is available: a bigger chunk_size means more PERMNOs are processed at once but also higher memory consumption.

I'm only downloading data from 1996 to 2020. The current code in the book downloads and processes only one PERMNO at a time (chunk_size = 1). The estimated time is about 4 hours (according to the progress package). However, if I process 100 at a time (chunk_size = 100), it takes only about 50 minutes.

Below is my modified code:

dsf_db <- tbl(wrds, in_schema("crsp", "dsf"))

factors_ff3_daily <- tbl(db, "factors_ff3_daily") |>
  collect()

permnos <- tbl(db, "crsp_monthly") |>
  distinct(permno) |>
  pull()

# Determine the number of chunks
chunk_size <- 100
num_chunks <- ceiling(length(permnos) / chunk_size)

# Progress bar using progress package
pb <- progress_bar$new(
  format = "[:bar] :percent eta: :eta",
  total = num_chunks
)

for (j in 1:num_chunks) {
  # Select the permnos for this chunk
  permno_chunk <- permnos[((j - 1) * chunk_size + 1):min(j * chunk_size, length(permnos))]

  # Process all permnos in the chunk at once
  crsp_daily_sub <- dsf_db |>
    filter(permno %in% permno_chunk &
      date >= start_date & date <= end_date) |>
    select(permno, date, ret) |>
    collect() |>
    drop_na()

  if (nrow(crsp_daily_sub) > 0) {
    crsp_daily_sub <- crsp_daily_sub |>
      mutate(month = floor_date(date, "month")) |>
      left_join(factors_ff3_daily |>
        select(date, rf), by = "date") |>
      mutate(
        ret_excess = ret - rf,
        ret_excess = pmax(ret_excess, -1)
      ) |>
      select(permno, date, month, ret, ret_excess)

    dbWriteTable(db,
      "crsp_daily",
      value = crsp_daily_sub,
      overwrite = ifelse(j == 1, TRUE, FALSE),
      append = ifelse(j != 1, TRUE, FALSE)
    )
  }

  pb$tick()
}

Issue with "data/macro_predictors.xlsx"

I get an error message when using drive_download() in Chapter 2.


If the data directory is not present, then running this code:

drive_download(
  macro_predictors_link,
  path = "data/macro_predictors.xlsx"
)

gives an annoyingly cryptic error message:

Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Failed to open file data/macro_predictors.xlsx.

One option would be to include:

if (!dir.exists("data")) dir.create("data")

Another would be to remove data/ and simply use:

drive_download(
  macro_predictors_link,
  path = "macro_predictors.xlsx",
)

Given the file.remove("data/macro_predictors.xlsx") command given later, the second option would seem to work.

A third option would be to use tempfile().
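For reference, a minimal sketch of that third option (untested; temp_path is my name, and it assumes readxl is used to read the sheet as elsewhere in the chapter):

library(googledrive)
library(readxl)

# Download to a temporary file so no data/ directory is required
temp_path <- tempfile(fileext = ".xlsx")

drive_download(
  macro_predictors_link,
  path = temp_path
)

macro_predictors_raw <- read_xlsx(temp_path)
file.remove(temp_path)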

Error in connecting to tidyfinance

tidy_finance <- dbConnect(
  SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)

Error: Could not connect to database:
unable to open database file
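A likely cause is that the data directory does not exist (or the working directory is not the project root), since SQLite cannot create a database file in a missing folder. A minimal sketch of a check, assuming the code runs from the project root and RSQLite is loaded:

# Hypothetical fix: make sure the data/ directory exists before connecting
if (!dir.exists("data")) dir.create("data")

tidy_finance <- dbConnect(
  SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)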

Reconsider industry selection in CRSP v2

We currently use the siccd from msf_db (i.e., in_schema("crsp", "msf_v2")). However, the more accurate way is to take siccd from stksecurityinfohist_db (i.e., in_schema("crsp", "stksecurityinfohist")). The variable has the same name, but its meaning differs between the two tables.

Reasoning: siccd in msf_v2 is back-filled. A simple check that reveals the back-fill (with an active WRDS connection):

siccd_permno_pairs <- msf_db |>
  select(permno, siccd) |>
  collect() |>
  distinct()

nrow(siccd_permno_pairs) == length(unique(siccd_permno_pairs$permno))

This shows that every permno has only a single siccd, which would imply that no firm ever changed its industry. Firms do change industries, so the back-fill gives a contradiction.

Relevant for r-tidyfinance
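For illustration, a hedged sketch of pulling siccd from stksecurityinfohist_db instead (the secinfostartdt and secinfoenddt column names are assumptions and would need to be checked against the WRDS table; the exact filters should follow the book's CRSP v2 setup):

stksecurityinfohist_db <- tbl(wrds, in_schema("crsp", "stksecurityinfohist"))

# Keep the validity range so that industry changes over time are preserved
siccd_history <- stksecurityinfohist_db |>
  select(permno, secinfostartdt, secinfoenddt, siccd) |>
  distinct() |>
  collect()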

Use dbplyr to calculate betas

The code from the book (see here) takes 12 minutes and maxes out at about 37GB of RAM. Code here takes 25 seconds and peaks at about 7GB.

The latter code uses DuckDB (I'm not sure it would work well with SQLite) and is a little inelegant because I tried to conform to the book's code (e.g., using months to partition daily data). Using, say, 90 days for the daily betas would allow the code to be much simpler (more "tidyverse", less SQL).

I get exactly the same numbers with beta_daily, but a small number of differences with beta_monthly … when I spot-checked these numbers, I couldn't work out why the book code is giving different answers.

I'm still working through the book, but there are other places where keeping data in the database (rather than using collect()) dramatically improves performance, in some cases with minimal changes to the code. In some cases, though, one needs to use DuckDB, as SQLite does strange things to dates and the like (because you seem to always use collect(), the book uses SQLite pretty much as a dumb data store and these issues don't crop up with that approach).
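To illustrate the pattern (not the book's rolling-window estimator), here is a hedged sketch of a full-sample market beta per permno computed entirely inside DuckDB via dbplyr. Table and column names follow the book's conventions but are assumptions here, and the OLS slope is built from sums so that everything translates to SQL:

library(DBI)
library(tidyverse)
library(dbplyr)
library(duckdb)

tidy_finance <- dbConnect(duckdb::duckdb(), "data/tidy_finance.duckdb")

# Join returns and factors in the database and reduce to per-permno sums;
# only the small result set is collected at the end
beta_monthly_db <- tbl(tidy_finance, "crsp_monthly") |>
  inner_join(tbl(tidy_finance, "factors_ff3_monthly"), by = "month") |>
  group_by(permno) |>
  summarize(
    n = n(),
    sum_x = sum(mkt_excess, na.rm = TRUE),
    sum_y = sum(ret_excess, na.rm = TRUE),
    sum_xy = sum(mkt_excess * ret_excess, na.rm = TRUE),
    sum_xx = sum(mkt_excess^2, na.rm = TRUE)
  ) |>
  mutate(beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x^2)) |>
  select(permno, n, beta)

beta_monthly <- collect(beta_monthly_db)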

Have WRDS server do work for TRACE data

Why not have the WRDS PostgreSQL server do all the work?

Steps:

  • Don't collect() until the end
  • Define an nrow() function to handle remote data frames
  • Use window_order() in place of arrange() when used for window functions
  • Remove code to convert trd_exctn_tm (PostgreSQL wants a time zone … this could be sorted out later)

Based on the output in the comment below, the code seems to work fine. (Not sure whether it delivers much of a performance benefit, but it saves on readers' electricity bills.)

Note that the code in the comment below does the data aggregation in the database and is about 10 times faster than the code using collect().

clean_enhanced_trace <- function(cusips, 
                                 connection, 
                                 start_date = as.Date("2002-01-01"), 
                                 end_date = today()) {

  nrow <- function(df) {
    df |> count() |> pull(n)
  }
    
  # Packages (required)
  library(tidyverse)
  library(lubridate)
  library(dbplyr)
  library(RPostgres)
  
  # Function checks ---------------------------------------------------------
  # Input parameters
  ## Cusips
  if(length(cusips) == 0 | any(is.na(cusips))) stop("Check cusips.")
  
  ## Dates
  if(!is.Date(start_date) | !is.Date(end_date)) stop("Dates needed")
  if(start_date < as.Date("2002-01-01")) stop("TRACE starts later.")
  if(end_date > today()) stop("TRACE does not predict the future.")
  if(start_date >= end_date) stop("Date conflict.")
  
  ## Connection
  if(!dbIsValid(connection)) stop("Connection issue.")
  
  # Enhanced Trace ----------------------------------------------------------
  # Main file
  trace_all <- tbl(connection, 
                   in_schema("trace", "trace_enhanced")) |> 
    filter(cusip_id %in% cusips) |>
    filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |> 
    select(cusip_id, msg_seq_nb, orig_msg_seq_nb,
           entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,
           trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm, 
           pr_trd_dt, trc_st, asof_cd, wis_fl, 
           days_to_sttl_ct, stlmnt_dt, spcl_trd_fl)
  
  # Enhanced TRACE: Post 06-02-2012 -----------------------------------------
  # Trades (trc_st = T) and corrections (trc_st = R)
  trace_post_TR <- trace_all |> 
    filter((trc_st == "T" | trc_st == "R"),
           trd_rpt_dt >= as.Date("2012-02-06"))
  
  # Cancellations (trc_st = X) and correction cancellations (trc_st = C)
  trace_post_XC <- trace_all |>
    filter((trc_st == "X" | trc_st == "C"),
           trd_rpt_dt >= as.Date("2012-02-06"))
  
  # Cleaning corrected and cancelled trades
  trace_post_TR <- trace_post_TR |>
    anti_join(trace_post_XC,
              by = c("cusip_id", "msg_seq_nb", "entrd_vol_qt", 
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id", 
                     "trd_exctn_dt", "trd_exctn_tm"))
  
  # Reversals (trc_st = Y)
  trace_post_Y <- trace_all |>
    filter(trc_st == "Y",
           trd_rpt_dt >= as.Date("2012-02-06"))
  
  # Clean reversals
  ## match the orig_msg_seq_nb of the Y-message to 
  ## the msg_seq_nb of the main message
  trace_post <- trace_post_TR |>
    anti_join(trace_post_Y,
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb", 
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd", 
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))
  
  
  # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------
  # Cancellations (trc_st = C)
  trace_pre_C <- trace_all |>
    filter(trc_st == "C",
           trd_rpt_dt < as.Date("2012-02-06"))
  
  # Trades w/o cancellations
  ## match the orig_msg_seq_nb of the C-message 
  ## to the msg_seq_nb of the main message
  trace_pre_T <- trace_all |>
    filter(trc_st == "T",
           trd_rpt_dt < as.Date("2012-02-06")) |>
    anti_join(trace_pre_C, 
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb", 
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd", 
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))
  
  # Corrections (trc_st = W) - W can also correct a previous W
  trace_pre_W <- trace_all |>
    filter(trc_st == "W",
           trd_rpt_dt < as.Date("2012-02-06"))
  
  # Implement corrections in a loop
  ## Correction control
  correction_control <- nrow(trace_pre_W)
  correction_control_last <- nrow(trace_pre_W)
  
  ## Correction loop
  while(correction_control > 0) {
    # Corrections that correct some msg
    trace_pre_W_correcting <- trace_pre_W |>
      semi_join(trace_pre_T, 
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))
    
    # Corrections that do not correct some msg
    trace_pre_W <- trace_pre_W |>
      anti_join(trace_pre_T, 
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))
    
    # Delete msgs that are corrected and add correction msgs
    trace_pre_T <- trace_pre_T |>
      anti_join(trace_pre_W_correcting, 
                by = c("cusip_id", "trd_exctn_dt",
                       "msg_seq_nb" = "orig_msg_seq_nb")) |>
      union_all(trace_pre_W_correcting) 
    
    # Escape if no corrections remain or they cannot be matched
    correction_control <- nrow(trace_pre_W)
    if(correction_control == correction_control_last) {
      correction_control <- 0 
    }
    correction_control_last <- nrow(trace_pre_W)
  }
  
  
  # Clean reversals
  ## Record reversals
  trace_pre_R <- trace_pre_T |>
    filter(asof_cd == 'R') |>
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt, 
             rptd_pr, rpt_side_cd, cntra_mp_id) |>
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>
    mutate(seq = row_number()) |>
    ungroup()
  
  ## Remove reversals and the reversed trade
  trace_pre <- trace_pre_T |> 
    filter(is.na(asof_cd) | !(asof_cd %in% c('R', 'X', 'D'))) |> 
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt, 
             rptd_pr, rpt_side_cd, cntra_mp_id) |> 
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |> 
    mutate(seq = row_number()) |> 
    ungroup() |> 
    anti_join(trace_pre_R,
              by = c("cusip_id", "trd_exctn_dt", "entrd_vol_qt", 
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id", "seq")) |> 
    select(-seq)
  
  
  # Agency trades -----------------------------------------------------------
  # Combine pre and post trades
  trace_clean <- trace_post |> 
    union_all(trace_pre)
  
  # Keep agency sells and unmatched agency buys
  ## Agency sells
  trace_agency_sells <- trace_clean |> 
    filter(cntra_mp_id == "D",
           rpt_side_cd == "S")
  
  # Agency buys that are unmatched
  trace_agency_buys_filtered <- trace_clean |> 
    filter(cntra_mp_id == "D",
           rpt_side_cd == "B") |> 
    anti_join(trace_agency_sells, 
              by = c("cusip_id", "trd_exctn_dt", 
                     "entrd_vol_qt", "rptd_pr"))
  
  # Agency clean
  trace_clean <- trace_clean |> 
    filter(cntra_mp_id == "C")  |> 
    union_all(trace_agency_sells) |> 
    union_all(trace_agency_buys_filtered) 
  
  
  # Additional Filters ------------------------------------------------------
  trace_add_filters <- trace_clean |> 
    mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |> 
    filter(is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,
           is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,
           wis_fl == "N",
           is.na(spcl_trd_fl) | spcl_trd_fl == "",
           is.na(asof_cd) | asof_cd == "")
  
  
  # Output ------------------------------------------------------------------
  # Only keep necessary columns
  trace_final <- 
    trace_add_filters |> 
    select(cusip_id, trd_exctn_dt, trd_exctn_tm, 
           rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id) 
  
  # Return
  return(trace_final |> 
           arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>
           collect())
}

wrds <- dbConnect(
  Postgres(),
  host = "wrds-pgdata.wharton.upenn.edu",
  dbname = "wrds",
  port = 9737,
  sslmode = "require",
  bigint = "integer")

tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)

mergent_cusips <- 
  tbl(tidy_finance, "mergent") |>
  pull(complete_cusip)

mergent_parts <- split(
  mergent_cusips,
  rep(1:100, 
      length.out = length(mergent_cusips))
)

for (j in 1:length(mergent_parts)) {
  print(paste("Starting", j))
  trace_enhanced <- clean_enhanced_trace(
    cusips = mergent_parts[[j]],
    connection = wrds,
    start_date = ymd("2014-01-01"),
    end_date = ymd("2016-11-30")
  )
  dbWriteTable(
    conn = tidy_finance,
    name = "trace_enhanced",
    value = trace_enhanced,
    overwrite = ifelse(j == 1, TRUE, FALSE),
    append = ifelse(j != 1, TRUE, FALSE)
  )
  print(paste("Finishing", j))
}

dbDisconnect(wrds)
dbDisconnect(tidy_finance, shutdown = TRUE)

Wrong month values for CRSP dummy data

Replicating Accessing and Managing Financial Data and WRDS Dummy Data to apply Parametric Portfolio Policies creates an error when computing evaluate_portfolio(weights_crsp).

In evaluate_portfolio(), at the evaluation_capm line:
ValueError: zero-size array to reduction operation maximum which has no identity

It likely comes from a misalignment of the month values between evaluation and factors_ff_monthly.

  • evaluation['month'] is end of month
  • factors_ff_monthly['month'] is start of month

This leads evaluation.merge(factors_ff_monthly, how="left", on="month") to have no months in common with factors_ff_monthly.

evaluation is constructed from crsp_monthly, which I created following WRDS Dummy Data.

From WRDS Dummy Data:

crsp_monthly_dummy = (stock_panel_monthly
  .assign(
    date=lambda x: x["month"]+pd.offsets.MonthEnd(-1),
    ret=lambda x: np.fmax(np.random.normal(size=len(x)), -1),

I am unsure at this point, but it may be x["month"]+pd.offsets.MonthEnd(-1) that is incorrect here.
I don't have access to WRDS data to compare against the dummy data.

Pandas version: 2.2.2
Database: running fine
Full replication, no changes to the code; the only difference is a freshly installed environment.

Incorrect exclusion of Dow when getting the index_prices

In the introduction, under the heading Scaling up the analysis, you have this part of the code, starting at line 130:

ticker <- tq_index("DOW") 
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30"
) %>%
  filter(symbol != "DOW") # Exclude the index itself

It is wrong to exclude the symbol "DOW", as this is not the index but the company Dow Inc.

Your code should be:

ticker <- tq_index("DOW") 
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30")

Of course this means the subsequent graphs and results need to be rechecked.

Fama-MacBeth Regression

Sir,
In this part of the code:

risk_premiums <- data_fama_macbeth |>
  nest(data = c(ret_excess_lead, beta, log_mktcap, bm, permno)) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates)

I see that the results obtained are aggregate; we don't get results for each id or permno. If we need results for each permno, what modifications are needed?
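One possible modification, if estimates per permno are wanted (time-series regressions for each stock rather than the monthly cross-sectional Fama-MacBeth regressions), is to nest by permno instead. A hedged sketch, assuming data_fama_macbeth has enough observations per permno for the regression to run:

risk_premiums_by_permno <- data_fama_macbeth |>
  nest(data = -permno) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates) |>
  select(permno, term, estimate, std.error)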

Add information about the exclusion of financial firms

Because we received a request, we need to add:

  • Clarification in */wrds-crsp-and-compustat that industry format INDL excludes financial firms to some extent.
  • Footnote (?) in */replicating-fama-and-french-factors that states that the original Fama & French (1992) paper claims to exclude financial firms. To some extent this happens through using industry format "INDL". Neither the Fama & French (1994) paper, Ken French's website, nor the WRDS replication contains any indication that financial companies are excluded using additional filters such as industry codes.

Fama-MacBeth Exercise 3

I feel like exercise 3 in the Fama-MacBeth chapter would benefit from a little more explanation. Maybe it would make sense to mention in the chapter that one can use the characteristics or the factor exposures derived from time-series regressions as input for the cross-sectional regressions, to clarify where one would use the rolling-window regressions.

.melt() returns ValueError in pandas 2.2.2

When running the melt() step on index_prices:

.melt(ignore_index=False, var_name=["variable", "symbol"])

it returns a ValueError:

ValueError: var_name=['variable', 'symbol'] must be a scalar.

Figure 2 (Optimal allocation for different strategies) omits weights for reg-t constraints

I have not evaluated the code in detail yet, but I see that the input to the figure is:

tibble(
  `No short-sale` = w_no_short_sale$solution,
  `Minimum Variance` = w_mvp,
  `Efficient portfolio` = compute_efficient_weight(Sigma, mu),
  `Regulation-T` = w_reg_t$par,
  Industry = colnames(industry_returns)
)

At the same time, the resulting figure omits the "Regulation-T" column.
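A quick way to verify whether the Regulation-T weights survive into the plotted data is to pivot the tibble and count the strategies. A hedged sketch (the object name weights_figure and the pivoted column names are mine; all other names come from the code above):

weights_figure <- tibble(
  `No short-sale` = w_no_short_sale$solution,
  `Minimum Variance` = w_mvp,
  `Efficient portfolio` = compute_efficient_weight(Sigma, mu),
  `Regulation-T` = w_reg_t$par,
  Industry = colnames(industry_returns)
)

weights_figure |>
  pivot_longer(-Industry, names_to = "Strategy", values_to = "Weight") |>
  count(Strategy)
# All four strategies should show up here with one weight per industry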

macro_predictors_dy

Is the calculation of dy correct? Or should it be:

dy=lambda x: np.log(x["D12"])-np.log(x["Index"].shift(1)),

macro_predictors = (
  pd.read_csv(macro_predictors_link, thousands=",")
  .assign(
    month=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
    dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
    dy=lambda x: np.log(x["D12"])-np.log(x["D12"].shift(1)),
    ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
    de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
    tms=lambda x: x["lty"]-x["tbl"],
    dfy=lambda x: x["BAA"]-x["AAA"]
  )
  .rename(columns={"b/m": "bm"})
  .get(["month", "dp", "dy", "ep", "de", "svar", "bm",
        "ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
  .query("month >= @start_date and month <= @end_date")
  .dropna()
)

Incorrect efficient portfolio weight (small mistake)

The equation for the efficient portfolio weight in the book is written as:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min \; w'\Sigma w \quad \text{s.t.} \quad \omega'\iota = 1

When it should rather be:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min \; \omega'\Sigma\omega \quad \text{s.t.} \quad \omega'\iota = 1

Update to tidyverse 2.0.0

Workflow:

  1. Update R version + packages in renv
  2. Delete _freeze folder
  3. Render website and publish to quarto-pub for testing
  4. Check whether there are warnings in the rendered version
  5. Push changes to branch
  6. Code review and merge
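A hedged sketch of the commands behind steps 1-3 (exact calls may differ depending on the renv setup):

# Step 1: update packages and record them in the lockfile
renv::update()
renv::snapshot()

# Step 2: delete the _freeze folder so everything re-renders
unlink("_freeze", recursive = TRUE)

# Step 3: render the website; publish to Quarto Pub afterwards,
# e.g., with `quarto publish quarto-pub` from the terminal
quarto::quarto_render()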

r/clean_enhanced_trace

The function yields a warning (without impacting the expected behavior): tz(): Don't know how to compute timezone for object of class hms/difftime; returning "UTC".
