tidy-finance / website

This repository hosts the source code for the website tidy-finance.org.
Home Page: https://tidy-finance.org
License: Other
I was reproducing the code below and it did not work properly.

returns <- prices |>
  arrange(date) |>
  mutate(ret = adjusted / lag(adjusted) - 1) |>
  select(symbol, date, ret)
returns

I think the formula needs parentheses:

returns <- prices |>
  arrange(date) |>
  mutate(ret = (adjusted / lag(adjusted)) - 1) |>
  select(symbol, date, ret)
returns
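For what it's worth, a quick sanity check (my addition, with a made-up price vector) suggests the two pipelines return identical numbers, because `/` binds tighter than `-` in R; the parentheses change readability rather than results:

# Toy check: division binds tighter than subtraction in R
adjusted <- c(100, 110, 121)
identical(adjusted / dplyr::lag(adjusted) - 1,
          (adjusted / dplyr::lag(adjusted)) - 1)
#> [1] TRUE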
We want to highlight that important people think that Tidy Finance is great by displaying quotes in some form of carousel.
I'm following this guide to download and prepare daily CRSP data, and I think there is room for improvement.

The "big data" issue indeed exists, and splitting the data is a good solution. However, the chunk size should be a tunable parameter, depending on how much RAM is available. A bigger chunk_size means more parallel processing but higher memory consumption.

I'm only downloading data from 1996 to 2020. The current code in the book downloads and processes one PERMNO at a time (chunk_size = 1). The estimated time is about 4 hours (according to the progress package). However, if I process 100 at a time (chunk_size = 100), this will only take about 50 minutes.

Below is my modified code:
# `wrds` is the WRDS connection, `db` the local database connection
dsf_db <- tbl(wrds, in_schema("crsp", "dsf"))

factors_ff3_daily <- tbl(db, "factors_ff3_daily") |>
  collect()

permnos <- tbl(db, "crsp_monthly") |>
  distinct(permno) |>
  pull()

# Determine the number of chunks
chunk_size <- 100
num_chunks <- ceiling(length(permnos) / chunk_size)

# Progress bar using the progress package
pb <- progress_bar$new(
  format = "[:bar] :percent eta: :eta",
  total = num_chunks
)

for (j in 1:num_chunks) {
  # Select the permnos for this chunk
  permno_chunk <- permnos[
    ((j - 1) * chunk_size + 1):min(j * chunk_size, length(permnos))
  ]

  # Process all permnos in the chunk at once
  crsp_daily_sub <- dsf_db |>
    filter(permno %in% permno_chunk &
             date >= start_date & date <= end_date) |>
    select(permno, date, ret) |>
    collect() |>
    drop_na()

  if (nrow(crsp_daily_sub) > 0) {
    crsp_daily_sub <- crsp_daily_sub |>
      mutate(month = floor_date(date, "month")) |>
      left_join(factors_ff3_daily |>
                  select(date, rf), by = "date") |>
      mutate(
        ret_excess = ret - rf,
        ret_excess = pmax(ret_excess, -1)
      ) |>
      select(permno, date, month, ret, ret_excess)

    dbWriteTable(db,
      "crsp_daily",
      value = crsp_daily_sub,
      overwrite = (j == 1),
      append = (j != 1)
    )
  }
  pb$tick()
}
Updating to new dbplyr version with I() instead of in_schema() syntax: https://www.tidyverse.org/blog/2024/04/dbplyr-2-5-0/
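For illustration, a minimal sketch of what that migration looks like (using the crsp.dsf table from the chunking code above as the example):

# dbplyr < 2.5.0
dsf_db <- tbl(wrds, in_schema("crsp", "dsf"))

# dbplyr >= 2.5.0
dsf_db <- tbl(wrds, I("crsp.dsf"))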
I get an error message when using drive_download() in Chapter 2.

If the directory data is not present, then running this code:

drive_download(
  macro_predictors_link,
  path = "data/macro_predictors.xlsx"
)

gives an annoyingly cryptic error message:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file data/macro_predictors.xlsx.
One option would be to include:

if (!dir.exists("data")) dir.create("data")

Another would be to remove the data/ prefix and simply use:

drive_download(
  macro_predictors_link,
  path = "macro_predictors.xlsx"
)

Given the file.remove("data/macro_predictors.xlsx") command given later, the second option would seem to work. A third option would be to use tempfile().
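A minimal sketch of the tempfile() option (my addition; readxl::read_xlsx as the downstream reader is an assumption):

# Download to a session-local temporary file instead of data/
path <- tempfile(fileext = ".xlsx")
drive_download(macro_predictors_link, path = path)
macro_predictors_raw <- readxl::read_xlsx(path)
# No explicit file.remove() needed: the temporary file
# disappears when the R session ends.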
tidy_finance <- dbConnect(
Error: Could not connect to database:
unable to open database file
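This error typically surfaces when SQLite cannot create the database file because the target directory is missing. A minimal sketch of a guard, assuming the book's SQLite setup (RSQLite with data/tidy_finance.sqlite):

# Create the data directory first, so SQLite can create the file inside it
if (!dir.exists("data")) dir.create("data")

tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)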
We currently use the siccd from msf_db (i.e., in_schema("crsp", "msf_v2")). However, the more accurate way is siccd from stksecurityinfohist_db (i.e., in_schema("crsp", "stksecurityinfohist")). Yes, this variable has the same name but a different meaning depending on the table.

Reasoning: siccd is back-filled in msf_v2. A simple check to reveal this back-fill (with an active WRDS connection):

siccd_permno_pairs <- msf_db |>
  select(permno, siccd) |>
  collect() |>
  distinct()

nrow(siccd_permno_pairs) == length(unique(siccd_permno_pairs$permno))

This shows that every permno has only a single industry, which would imply that no firm ever changed its industry - but firms do change industries: contradiction.
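To illustrate the point, a hypothetical counterpart check on the historical table, mirroring the snippet above:

siccd_history <- stksecurityinfohist_db |>
  select(permno, siccd) |>
  distinct() |>
  collect()

# More distinct pairs than unique permnos would show that
# industries do change over time in the historical records
nrow(siccd_history) > length(unique(siccd_history$permno))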
Relevant for r-tidyfinance
It should be Lasso for rho = 0 and Ridge for rho = 1.
The code from the book (see here) takes 12 minutes and maxes out at about 37GB of RAM. The code here takes 25 seconds and peaks at about 7GB.

The latter code uses DuckDB (I'm not sure it would work well with SQLite) and is a little inelegant because I tried to conform to the book code (e.g., using months to partition daily data). Using, say, 90 days for the daily betas would allow the code to be much simpler (more "tidyverse", less SQL).

I get exactly the same numbers with beta_daily, but a small number of differences with beta_monthly … when I spot-checked these numbers, I couldn't work out why the book code gives different answers.

I'm still working through the book, but there are other places where keeping data in the database (rather than using collect()) dramatically improves performance, in some cases with minimal changes to the code. But in some cases, one needs to use DuckDB, as SQLite does strange things to dates and the like (because you seem to always use collect(), the book uses SQLite pretty much as a dumb data store, and these issues don't crop up with that approach).
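As a minimal sketch (my addition), swapping SQLite for DuckDB only changes the connection; the dplyr pipelines can stay the same:

library(duckdb)

# DuckDB keeps native date types, so computations can stay in the database
tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)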
Add it to preface or getting started?
Why not have the WRDS PostgreSQL server do all the work?
Steps:
- collect() deferred until the end.
- An nrow() function to handle remote data frames.
- window_order() in place of arrange() when used for window functions.
- trd_exctn_tm (PostgreSQL wants a time zone … this could be sorted out later).

Based on the output in the comment below, the code seems to work fine. (Not sure if it delivers much performance benefit, but it saves on readers' electricity bills.)

Note that the code in the comment below does data aggregation in the database and is about 10 times faster than the code using collect().
clean_enhanced_trace <- function(cusips,
                                 connection,
                                 start_date = as.Date("2002-01-01"),
                                 end_date = today()) {

  # nrow() replacement that works on remote data frames
  nrow <- function(df) {
    df |> count() |> pull(n)
  }

  # Packages (required)
  library(tidyverse)
  library(lubridate)
  library(dbplyr)
  library(RPostgres)

  # Function checks ---------------------------------------------------------
  # Input parameters
  ## Cusips
  if (length(cusips) == 0 | any(is.na(cusips))) stop("Check cusips.")

  ## Dates
  if (!is.Date(start_date) | !is.Date(end_date)) stop("Dates needed")
  if (start_date < as.Date("2002-01-01")) stop("TRACE starts later.")
  if (end_date > today()) stop("TRACE does not predict the future.")
  if (start_date >= end_date) stop("Date conflict.")

  ## Connection
  if (!dbIsValid(connection)) stop("Connection issue.")

  # Enhanced TRACE ----------------------------------------------------------
  # Main file
  trace_all <- tbl(connection,
                   in_schema("trace", "trace_enhanced")) |>
    filter(cusip_id %in% cusips) |>
    filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>
    select(cusip_id, msg_seq_nb, orig_msg_seq_nb,
           entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,
           trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,
           pr_trd_dt, trc_st, asof_cd, wis_fl,
           days_to_sttl_ct, stlmnt_dt, spcl_trd_fl)

  # Enhanced TRACE: Post 06-02-2012 -----------------------------------------
  # Trades (trc_st = T) and corrections (trc_st = R)
  trace_post_TR <- trace_all |>
    filter((trc_st == "T" | trc_st == "R"),
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Cancellations (trc_st = X) and correction cancellations (trc_st = C)
  trace_post_XC <- trace_all |>
    filter((trc_st == "X" | trc_st == "C"),
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Cleaning corrected and cancelled trades
  trace_post_TR <- trace_post_TR |>
    anti_join(trace_post_XC,
              by = c("cusip_id", "msg_seq_nb", "entrd_vol_qt",
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id",
                     "trd_exctn_dt", "trd_exctn_tm"))

  # Reversals (trc_st = Y)
  trace_post_Y <- trace_all |>
    filter(trc_st == "Y",
           trd_rpt_dt >= as.Date("2012-02-06"))

  # Clean reversals
  ## Match the orig_msg_seq_nb of the Y-message to
  ## the msg_seq_nb of the main message
  trace_post <- trace_post_TR |>
    anti_join(trace_post_Y,
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb",
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd",
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))

  # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------
  # Cancellations (trc_st = C)
  trace_pre_C <- trace_all |>
    filter(trc_st == "C",
           trd_rpt_dt < as.Date("2012-02-06"))

  # Trades without cancellations
  ## Match the orig_msg_seq_nb of the C-message
  ## to the msg_seq_nb of the main message
  trace_pre_T <- trace_all |>
    filter(trc_st == "T",
           trd_rpt_dt < as.Date("2012-02-06")) |>
    anti_join(trace_pre_C,
              by = c("cusip_id", "msg_seq_nb" = "orig_msg_seq_nb",
                     "entrd_vol_qt", "rptd_pr", "rpt_side_cd",
                     "cntra_mp_id", "trd_exctn_dt", "trd_exctn_tm"))

  # Corrections (trc_st = W) - a W can also correct a previous W
  trace_pre_W <- trace_all |>
    filter(trc_st == "W",
           trd_rpt_dt < as.Date("2012-02-06"))

  # Implement corrections in a loop
  ## Correction control
  correction_control <- nrow(trace_pre_W)
  correction_control_last <- nrow(trace_pre_W)

  ## Correction loop
  while (correction_control > 0) {
    # Corrections that correct some message
    trace_pre_W_correcting <- trace_pre_W |>
      semi_join(trace_pre_T,
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))

    # Corrections that do not correct some message
    trace_pre_W <- trace_pre_W |>
      anti_join(trace_pre_T,
                by = c("cusip_id", "trd_exctn_dt",
                       "orig_msg_seq_nb" = "msg_seq_nb"))

    # Delete messages that are corrected and add correction messages
    trace_pre_T <- trace_pre_T |>
      anti_join(trace_pre_W_correcting,
                by = c("cusip_id", "trd_exctn_dt",
                       "msg_seq_nb" = "orig_msg_seq_nb")) |>
      union_all(trace_pre_W_correcting)

    # Escape if no corrections remain or they cannot be matched
    correction_control <- nrow(trace_pre_W)
    if (correction_control == correction_control_last) {
      correction_control <- 0
    }
    correction_control_last <- nrow(trace_pre_W)
  }

  # Clean reversals
  ## Record reversals
  trace_pre_R <- trace_pre_T |>
    filter(asof_cd == "R") |>
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt,
             rptd_pr, rpt_side_cd, cntra_mp_id) |>
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>
    mutate(seq = row_number()) |>
    ungroup()

  ## Remove reversals and the reversed trades
  trace_pre <- trace_pre_T |>
    filter(is.na(asof_cd) | !(asof_cd %in% c("R", "X", "D"))) |>
    group_by(cusip_id, trd_exctn_dt, entrd_vol_qt,
             rptd_pr, rpt_side_cd, cntra_mp_id) |>
    window_order(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>
    mutate(seq = row_number()) |>
    ungroup() |>
    anti_join(trace_pre_R,
              by = c("cusip_id", "trd_exctn_dt", "entrd_vol_qt",
                     "rptd_pr", "rpt_side_cd", "cntra_mp_id", "seq")) |>
    select(-seq)

  # Agency trades -----------------------------------------------------------
  # Combine pre and post trades
  trace_clean <- trace_post |>
    union_all(trace_pre)

  # Keep agency sells and unmatched agency buys
  ## Agency sells
  trace_agency_sells <- trace_clean |>
    filter(cntra_mp_id == "D",
           rpt_side_cd == "S")

  ## Agency buys that are unmatched
  trace_agency_buys_filtered <- trace_clean |>
    filter(cntra_mp_id == "D",
           rpt_side_cd == "B") |>
    anti_join(trace_agency_sells,
              by = c("cusip_id", "trd_exctn_dt",
                     "entrd_vol_qt", "rptd_pr"))

  ## Agency clean
  trace_clean <- trace_clean |>
    filter(cntra_mp_id == "C") |>
    union_all(trace_agency_sells) |>
    union_all(trace_agency_buys_filtered)

  # Additional filters ------------------------------------------------------
  trace_add_filters <- trace_clean |>
    mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>
    filter(is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,
           is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,
           wis_fl == "N",
           is.na(spcl_trd_fl) | spcl_trd_fl == "",
           is.na(asof_cd) | asof_cd == "")

  # Output ------------------------------------------------------------------
  # Only keep necessary columns
  trace_final <- trace_add_filters |>
    select(cusip_id, trd_exctn_dt, trd_exctn_tm,
           rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id)

  # Return
  trace_final |>
    arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>
    collect()
}
wrds <- dbConnect(
  Postgres(),
  host = "wrds-pgdata.wharton.upenn.edu",
  dbname = "wrds",
  port = 9737,
  sslmode = "require",
  bigint = "integer"
)

tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)

mergent_cusips <- tbl(tidy_finance, "mergent") |>
  pull(complete_cusip)

# Split the cusips into 100 roughly equal parts
mergent_parts <- split(
  mergent_cusips,
  rep(1:100, length.out = length(mergent_cusips))
)

for (j in 1:length(mergent_parts)) {
  print(paste("Starting", j))
  trace_enhanced <- clean_enhanced_trace(
    cusips = mergent_parts[[j]],
    connection = wrds,
    start_date = ymd("2014-01-01"),
    end_date = ymd("2016-11-30")
  )
  dbWriteTable(
    conn = tidy_finance,
    name = "trace_enhanced",
    value = trace_enhanced,
    overwrite = (j == 1),
    append = (j != 1)
  )
  print(paste("Finishing", j))
}

dbDisconnect(wrds)
dbDisconnect(tidy_finance, shutdown = TRUE)
Brulee now has "sigmoid" implemented as an activation function.
Replicating Accessing and Managing Financial Data and WRDS Dummy Data to apply Parametric Portfolio Policies creates an error when computing evaluate_portfolio(weights_crsp).

In evaluate_portfolio(), at the evaluation_capm line:

ValueError: zero-size array to reduction operation maximum which has no identity

It likely comes from a misalignment of dates between evaluation and factors_ff_monthly: evaluation.merge(factors_ff_monthly, how="left", on="month") appears to have no common month with factors_ff_monthly. evaluation is constructed from crsp_monthly, which I created following WRDS Dummy Data.

From WRDS Dummy Data:

crsp_monthly_dummy = (stock_panel_monthly
  .assign(
    date=lambda x: x["month"]+pd.offsets.MonthEnd(-1),
    ret=lambda x: np.fmax(np.random.normal(size=len(x)), -1),

I am unsure at this point, but it may be x["month"]+pd.offsets.MonthEnd(-1) that is incorrect here. I don't have access to WRDS data to compare with the dummy data.

Pandas version: 2.2.2
Database: running fine
Full replication, no changes to the code; the only difference is a freshly installed environment.
In the introduction, under the heading Scaling up the analysis, you have this part of the code, starting at line 130:

ticker <- tq_index("DOW")
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30"
) %>%
  filter(symbol != "DOW") # Exclude the index itself

It is wrong to exclude the symbol "DOW", as this is not the index but the company Dow Inc.

Your code should be:

ticker <- tq_index("DOW")
index_prices <- tq_get(ticker,
  get = "stock.prices",
  from = "2000-01-01",
  to = "2022-03-30"
)

Of course, this means the subsequent graphs and results need to be rechecked.
Sir,

In this part of the code:

risk_premiums <- data_fama_macbeth |>
  nest(data = c(ret_excess_lead, beta, log_mktcap, bm, permno)) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates)

I see that the results obtained are general; we don't get the results for each id or permno. If we need results for each permno, what modifications are needed?
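A hedged sketch of one possible modification (my addition, not from the book): nest by permno and run a time-series regression per stock, which yields one set of coefficients per permno instead of one per month. Note this changes the interpretation away from Fama-MacBeth cross-sectional estimates:

# One regression per permno across its time series of observations
risk_premiums_by_permno <- data_fama_macbeth |>
  nest(data = -permno) |>
  mutate(estimates = map(
    data,
    ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))
  )) |>
  unnest(estimates) |>
  select(-data)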
Because we received a request, we need to add:
I feel like exercise 3 in the Fama-MacBeth chapter would benefit from a little more explanation. Maybe it would make sense to mention in the chapter that one can use the characteristics or factor exposures derived from time-series regressions as input for the cross-sectional regressions, to clarify where one would use the rolling-window regressions.
When running the melt() function on index_prices:

.melt(ignore_index=False, var_name=["variable", "symbol"])

it returns a ValueError:

ValueError: var_name=['variable', 'symbol'] must be a scalar.
I have not evaluated the code in detail yet, but I see that the input to the figure is:

tibble(
  `No short-sale` = w_no_short_sale$solution,
  `Minimum Variance` = w_mvp,
  `Efficient portfolio` = compute_efficient_weight(Sigma, mu),
  `Regulation-T` = w_reg_t$par,
  Industry = colnames(industry_returns)
)

At the same time, the resulting figure omits the "Regulation-T" column.
Niklas alerted me that a missing na.rm = TRUE rendered the computation of the effective spread NA.
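For illustration only, a minimal sketch of the kind of fix implied (the data frame and column names here are hypothetical, not from the book):

# Without na.rm = TRUE, a single missing trade turns the whole mean into NA
trades |>
  group_by(symbol) |>
  summarize(effective_spread = mean(effective_spread, na.rm = TRUE))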
Is the calculation of dy correct? Or should it be:

dy=lambda x: np.log(x["D12"])-np.log(x["Index"].shift(1)),

macro_predictors = (
  pd.read_csv(macro_predictors_link, thousands=",")
  .assign(
    month=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
    dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
    dy=lambda x: np.log(x["D12"])-np.log(x["D12"].shift(1)),
    ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
    de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
    tms=lambda x: x["lty"]-x["tbl"],
    dfy=lambda x: x["BAA"]-x["AAA"]
  )
  .rename(columns={"b/m": "bm"})
  .get(["month", "dp", "dy", "ep", "de", "svar", "bm",
        "ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
  .query("month >= @start_date and month <= @end_date")
  .dropna()
)
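For context, the Welch and Goyal (2008) definitions that motivate the proposed correction, with D12_t the twelve-month dividend sum and Index_t the index level:

dp_t = \log(D12_t) - \log(\text{Index}_t), \qquad dy_t = \log(D12_t) - \log(\text{Index}_{t-1})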
The equation for the efficient portfolio weight in the book is written as:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min …

When it should rather be:

\omega_{\text{eff}}(\bar{\mu}) = \arg\min …
Workflow:
The idea is to use the portfolio snapshots of wikifolio traders and discover which stocks are often held together.
Include reference to WRDS package.
Function yields a warning (without impacting the expected behavior): "tz(): Don't know how to compute timezone for object of class hms/difftime; returning 'UTC'."
Proposed by Hermann Elendner via LinkedIn.
https://developers.google.com/search/docs/appearance/favicon-in-search
Let's also get a nicer icon like the other website :)