lrberge / fixest Goto Github PK
View Code? Open in Web Editor NEWFixed-effects estimations
Home Page: https://lrberge.github.io/fixest/
Fixed-effects estimations
Home Page: https://lrberge.github.io/fixest/
Thank you for writing this fantastic package. This is great contribution to empirical economist community.
While I was working with etable(), I found that the order option in etable() does not work when dict option is specified. Could you look into this please?
library(fixest)
model <- feols(mpg ~ cyl + disp, data= mtcars)
etable(model, order = c("disp", "cyl"))
# order does not work with dict, whether it uses variable name or label.
etable(model, order = c("disp", "cyl"), dict = c(disp = "$DISP$"))
etable(model, order = c("$DISP$", "cyl"), dict = c(disp = "$DISP$"))
The fixest package is great and has improved my workflow and saved me a lot of computation time. But I was wondering if there is any way to estimate a probit binary response model? At the moment I am using the R package bife. But it would be nice to be able to do this using fixest, as I can do with logit models.
Furthermore, I was wondering if it was possible to change the default standard errors when estimating a model. For example
set.seed(12)
Bond = runif(100, 0, 0.55)
Day = runif(100, 0, 1.45)
Intervention = round(Day)
Type = round(Bond)
Agency = round(Bond + Day)
Data <- data.table(Agency, Bond, Day, Intervention, Type)
feols(Agency ~ Bond + Day | Intervention + Type , Data)
Gives the following output
OLS estimation, Dep. Var.: Agency
Observations: 100
Fixed-effects: Intervention: 2, Type: 2
Standard-errors: Clustered (Intervention)
Estimate Std. Error t value Pr(>|t|)
Bond 1.339600 0.324546 4.1276 0.000079 ***
Day 0.956392 0.269115 3.5538 0.000593 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-likelihood: -13.21 Adj. R2: 0.65434
R2-Within: 0.52428
Where the standard errors are clustered at the intervention level . However, with this is not what I necessarily want. In large datasets it also would seem to have a large effect on the computation time relative to the usual standard errors. Could you provide a way of overriding this behavior so I can calculate only the usual standard errors?
Thanks for the great package!
It would be convenient (for me, hopefully for others) if fixest had tidy.fixest
, glance.fixest
, and augment.fixest
methods. The broom docs provide recommendations for adding these methods: https://broom.tidyverse.org/articles/adding-tidiers.html
library(broom)
library(fixest)
tidy(lm(mpg ~ wt, data=mtcars))
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 37.3 1.88 19.9 8.24e-19
#> 2 wt -5.34 0.559 -9.56 1.29e-10
tidy(feols(mpg ~ wt, data=mtcars))
#> Error: No tidy method for objects of class fixest
Created on 2019-11-20 by the reprex package (v0.3.0)
What do you think?
If anyone else wants to tackle this, please feel free. If not, I can submit a PR sometime in the next few weeks.
Love this package. The performance is an absolute game changer. However I noticed some large differences between lfe and fixest when clustering. Here’s some sample code I wrote generated from a thread examining how lfe was too conservative relative to Stata, which is now 'fixed'. sgaure/lfe#1
I wonder if it's worth aligning the standard errors here.
Here's sample code that replicates the issue
# Generate 500 FE levels, with 5 obs each, nested within 50 clusters
n_clusters <- 50
n_obs_per_fe_level <- 5
n_fe_levels <- 500
n_obs <- n_obs_per_fe_level * n_fe_levels
n_fe_per_cluster <- n_fe_levels / n_clusters
n_obs_per_cluster <- n_obs / n_clusters
# Make 500 FE levels
fe <- sort(rep(1:n_fe_levels, length.out = n_obs))
# Make clusters that nest FE levels
cl <- ceiling(fe / n_fe_per_cluster)
draw_one_cluster <- function(N) {
# Each cluster has its own mean and se for epsilon.
cluster_eps_mean <- rnorm(1)
cluster_eps_sd <- 10 * runif(1)
rnorm(N, mean=cluster_eps_mean, sd = cluster_eps_sd)
}
eps <- unlist(
replicate(n_clusters, draw_one_cluster(n_obs_per_cluster), simplify = FALSE)
)
x <- rnorm(n_obs, 0, 10)
y <- 0.02 * x + fe / 100 + eps
DF <- data.frame(cl, y, x, fe)
# haven::write_dta(DF, "~/test_data.dta")
mod = lfe::felm(y ~ x | fe | 0 | cl, data=DF)
summary(mod)
mod = fixest::feols(y ~ x | fe , data=DF)
summary(mod,cluster=DF$cl)
Hi and thanks for such an amazing package. There seems to be an issue with escaping special characters in the etable function when the TEX option is set to true.
#Load the libraries
library(data.table)
library(fixest)
#Set the seed
set.seed(1)
X <- rep(c(1,1,2,3), 10) + rnorm(10*4,0,1)
a = rep(c("b","b","a", "c"), 10)
Y = rep(c(3,3,2,1.1), 10) + 0.5*X + rnorm(10*4,0,0.01)
#Create the data
DT = data.table(
ID_var = a,
Y,
X
)
#Some regressions
Reg_1 <- feols(Y~X, DT)
Reg_2 <- feols(Y~X|ID_var, DT)
#Export the regressions
etable(Reg_1, Reg_2, tex = TRUE, fixef_sizes = T, se = "cluster", cluster = "ID_var")
I receive the following output.
\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lcc}
& & \tabularnewline
\hline
\hline
Dependent Variable:&\multicolumn{2}{c}{Y}\\
Model:&(1)&(2)\\
\hline
\emph{Variables}\tabularnewline
(Intercept)&3.1110$^{***}$& \\
&(0.2996)& \\
X&0.0468&0.5024$^{***}$\\
&(0.0975)&(0.0023)\\
\hline
\emph{Fixed-Effects}& & \\
ID\_var&No&Yes\\
\hline
\emph{Fit statistics}& & \\
Observations& 40&40\\
# ID\_var&--&3\\
R$^2$ & 0.01063&0.99975\\
Within R$^2$ & --&0.99958\\
\hline
\hline
\multicolumn{3}{l}{\emph{One-way (ID_var) standard-errors in parentheses.}}\\
\multicolumn{3}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}
Notice that ID_var
in the clustered error variable in the fourth last line of the table should be escaped as ID\_var
. More importantly the #
in the fit statistics section of the table, under the Observations
variable, also not escaped. This bug persists even if one supplies a file instead of setting tex = TRUE
.
Thanks again.
As promised, a second esttex
issue...
Is it possible for esttex
output to respect common (i.e. unique) dictionary aliases?
I often want to output similar variables on the same row in a regression table. For instance, to make them immediately comparable even if they are named differently in my actual R model calls. The way this works for other regression outputting packages in R (e.g. huxtable or stargazer) is to give them the same alias. However, that strategy does not appear to work here.
Consider the following example, where I want to compare full vs partial marginal effects, using different model syntax. The coefficient on Wind in mod_2
is actually the same as Month5:Wind in mod_1
. Yet, even if I explicitly give it the same alias, the two coefficients are still printed on separate rows.
library(fixest)
aq = airquality
aq$Month = factor(aq$Month)
## Full marginal effects
mod_1 = feols(Ozone ~ Month / Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).
## Partial marginal effects
mod_2 = feols(Ozone ~ Month * Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).
## "Wind" in mod_2 is actual "Month5:Wind"...
myDict = c("Wind" = "Month5:Wind")
## Doesn't work as expected: Get separate "Month5:Wind" rows instead of one.
esttex(
mod_1, mod_2, titles = c("Full MEs", "Partial MEs"),
dict = myDict,
drop = c("Int", paste0("Month", 6:9)) ## Optional: Just to highlight rows of interest
)
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lcc}
#> & & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&\multicolumn{2}{c}{Ozone}\\
#> Model:&(1)&(2)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> Month5:Wind&-2.3681$^{*}$& \\
#> &(1.3162)& \\
#> Month5:Wind& &-2.3681$^{*}$\\
#> & &(1.3162)\\
#> \hline
#> \emph{Fit statistics}& & \\
#> Observations& 116&116\\
#> R$^2$ & 0.54732&0.54732\\
#> Adjusted R$^2$ & 0.50889&0.50889\\
#> \hline
#> \hline
#> \multicolumn{3}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}
Created on 2019-12-29 by the reprex package (v0.3.0)
I noticed that fixest.predict
drops observations if newdata
contains NAs. Could It instead have NA values for the output vector?
Demo:
new_data <- mtcars
new_data[1, ] <- NA
library(fixest)
x <- feols(wt ~ cyl, mtcars)
pred <- predict(x, newdata=new_data)
nrow(new_data)
#> [1] 32
length(pred)
#> [1] 31
Created on 2019-11-22 by the reprex package (v0.3.0)
Howdy,
I have two specifications, A and B, that use different identification strategies on slightly different cuts of data. For A I want to cluster on C_a and for B I want to cluster on C_b. Is there a way using etable
and esttex
that can have specifications {A,B} side-by-side with my desired clustering?
Currently, I do the following example (or esttex
) and combine later.
Example:
etable( feols(yvar ~ xvar | FE1A + FE2A), df[df$spec_a==1,]) ,cluster = "C_a", digits = 3)
etable( feols(yvar ~ xvar | FE1B + FE2B), df[df$spec_b==1,]) ,cluster = "C_b", digits = 3)
Looking for (?):
etable( list(
feols(yvar ~ xvar | FE1A + FE2A), df[df$spec_a==1,]),
feols(yvar ~ xvar | FE1B + FE2B), df[df$spec_b==1,])
),
cluster = c("C_a", "C_b"), digits = 3)
Thanks for all help!
Best,
Luke
First thanks for the interest in the package and for wanting to submit a bug report. This feedback is invaluable since it significantly improves software quality, benefiting to all users.
Here's a few guidelines before submitting a bug:
Bugs are corrected at almost every release. So ensure you have the latest version of the software running. Currently it's .
Every time a bug is corrected, it's reported in the NEWS file. Please have a look at it to see whether it isn't already fixed in the development version.
I know it's some work, but please provide a minimal working example in your report with reproducible code and data. If your bug only happens for some large data set and you're not able to make it appear for data of smaller size, you can DM me the data. A minimal working example ensures that:
There are many very good examples of bug reports in past issues (see #4, #10, or #17 to name a few).
Hi,
First of all, thanks a lot for your incredibly useful package. I have a minor issue with it that I would like to report. Any help is much appreciated.
I am using the function feglm and I need bootstrap standard errors (BSE). I have programmed up the bootstrap routine myself. My problem is that I don't manage to display the BSE with etable(). Below my code:
n.Boot <- 1000
frm <- y ~ x | i + j
out <- feglm(frm, data = df, notes = FALSE)
B.se <- Run_bootstrap(n.Boot,frm) # My bootstrap routine
where df
contains my data. I have then tried to substitute my own BSE with those provided by feglm as follows:
out[["se"]] <- B.se
out[["coeftable"]][["Std. Error"]] <- B.se
and then do
etable(out)
However, this is not producing the desired result. I would like to ask you if there is any way to address my point (maybe another function for displaying results besides etable or stargazer?)
I thank you in advance for your help.
Below my system and package info:
package: fixest 0.4.0 2020-03-29 [1] CRAN (R 3.6.3)
version: R version 3.6.3 (2020-02-29)
os : Windows 10 x64
system: x86_64, mingw32
ui : RStudio
Hi! Thanks for the great package! I'm using demean
function for my analysis and I think it works fine but throws some random error message that changes every time (or sometimes not even showing up). Practically, when I implement this within Rmarkdown
it stops the compiling so as a get-around I save it to a separate file using R script and load the data into the rmd
but obviously this is not the most efficient way. Here's a reproducible example, which is a subset of my actual data.
test <- structure(list(ln_dmg = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8.59600720786391,
0, 9.71410729937478, 0, 0, 12.1270302478998, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10.2449188370322,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 7.60692111895402, 0, 0, 0, 0, 0, 0, 0, 0,
0), comm_id = c("060001", "060001", "060001", "060001", "060001",
"060001", "060001", "060001", "060001", "060001", "060001", "060001",
"060001", "060001", "060001", "060001", "060001", "060001", "060001",
"060001", "060001", "060001", "060001", "060001", "060001", "060002",
"060002", "060002", "060002", "060002", "060002", "060002", "060002",
"060002", "060002", "060002", "060002", "060002", "060002", "060002",
"060002", "060002", "060002", "060002", "060002", "060002", "060002",
"060002", "060002", "060002", "060003", "060003", "060003", "060003",
"060003", "060003", "060003", "060003", "060003", "060003", "060003",
"060003", "060003", "060003", "060003", "060003", "060003", "060003",
"060003", "060003", "060003", "060003", "060003", "060003", "060003",
"060004", "060004", "060004", "060004", "060004", "060004", "060004",
"060004", "060004", "060004", "060004", "060004", "060004", "060004",
"060004", "060004", "060004", "060004", "060004", "060004", "060004",
"060004", "060004", "060004", "060004"), trans_year = c(1983,
1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002,
2003, 2004, 2005, 2006, 2007, 1983, 1984, 1985, 1986, 1987, 1988,
1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 1983, 1984, 1985,
1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007
), event_id = c(9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 949, 9999, 968, 9999, 9999, 611, 9999, 9999, 9999,
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 611,
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 611, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999,
949, 9999, 9999, 9999, 9999, 611, 9999, 9999, 9999, 9999, 9999,
9999, 9999, 9999, 9999)), row.names = c(NA, -100L), class = "data.frame")
Y_demean = fixest::demean(X = test[, "ln_dmg"],
fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then -9
Error in (function (env, objName, computeSize = TRUE) :
R_Reprotect: only 2 protected items, can't reprotect index -3
As mentioned earlier, the error message changes each time I run this code. Here are a few of them. As you can see, sometimes it runs without any error message. Oh and importantly, they produce the same results though.
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+ fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then -8
Error in (function (env, objName, computeSize = TRUE) :
R_Reprotect: only 3 protected items, can't reprotect index -2
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+ fe = test[, c("comm_id", "trans_year", "event_id")])
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+ fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then 5
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+ fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then 1
I would have ignored this if it wasn't for rmd
so this is not as important issue for others (esp who don't use rmd
) but thought that it wouldn't hurt to report. Please let me know any other info is needed. Thanks!!
Hi I have noticed another (new?) bug with the esttex function. Namely that it no longer displays the title or fixed_sizes when set to true. I have cannot in fact get it to display this for any combinations that I have tried. The following is an example that I ran using R 3.6.2 and fixest 0.3.0.
#This will clear all objects includes hidden objects
rm(list = ls(all.names = TRUE))
gc()
#Load the libraries
library(ggplot2)
library(tsibble)#yearweek function.
library(data.table)
library(zoo)
library(Hmisc) #For weighted medians functions.
library(fixest)#For the fixed effects regressions
setFixest_nthreads(6)#Set the number of threads to use
#Load back in the data.
Reg_data <- fread(paste0(data_analysis_path, "Trade_cost_reg.csv"))
Reg_data[, `:=`(TRD_EXCTN_DT = as.IDate(TRD_EXCTN_DT), Year_month = as.yearmon(Year_month), Year_week = yearweek(Year_week))]
#Want to set the dictionary for exporting tables nicely to latex
setFixest_dict(c("CUSIP_ID" = "Bond", "Year_month" = "Time (monthly)", "Year_week" = "Time (weekly)", "log(ENTRD_VOL_QT)" = "log(Volume)", "Actual_trade_party" = "Dealer",
"Contra_S" = "Selling party", "Contra_B" = "Buying party", "Investment_gradeYes" = "Investment grade", "log(Contra_trade_volume)" = "log(Mean contra volume)",
"Last_month_portion_trds" = "Relationship measure", "Last_month_portion_trds_S" = "Relationship measure_S", "Last_month_portion_trds_B" = "Relationship measure_B", "Trade_costs" = "Trade costs",
"CUSIP_ID^Year_month" = "Bond x Time (monthly)", "CUSIP_ID^Year_week" = "Bond x Time (weekly)", "Actual_trade_party^CUSIP_ID" = "Dealer x Bond",
"Actual_trade_party^Year_month" = "Dealer x Time (monthly)", "Actual_trade_party^Year_week" = "Dealer x Time (weekly)", "Actual_trade_party^Year_month^CUSIP_ID" =
"Dealer x Time (monthly) x Bond", "Actual_trade_party^Year_week^CUSIP_ID" = "Dealer x Time (weekly) x Bond", "Contra_S^Year_month" = "Contra_S x Time (monthly)",
"Contra_B^Year_month" = "Contra_B x Time (monthly)", "Contra_S^Year_month^CUSIP_ID" = "Contra_S x Time (monthly) x Bond", "Contra_B^Year_month^CUSIP_ID" = "Contra_B x Time (monthly) x Bond",
"Contra_S^Year_week^CUSIP_ID" = "Seller x Time (weekly) x Bond", "Contra_B^Year_week^CUSIP_ID" = "Buyer party x Time (weekly) x Bond", "Contra_S^CUSIP_ID^Actual_trade_party^Year_month" =
"Seller x Bond x Dealer x Time (monthly)", "Contra_B^CUSIP_ID^Actual_trade_party^Year_month" = "Buyer x Bond x Dealer x Time (monthly)", "Contra_B^CUSIP_ID^Actual_trade_party^Year_week" =
"Buyer x Bond x Dealer x Time (weekly)", "Contra_S^CUSIP_ID^Actual_trade_party^Year_week" = "Seller x Bond x Dealer x Time (weekly)", "log(Contra_trade_volume_B)" = "log(Buyer volume)",
"log(Contra_trade_volume_S)" = "log(Seller volume)", "CUSIP_ID & Year_month" = "Bond + Time (monthly)"))
#Pre-crisis regressions.
Pre_crisis_data <- Reg_data[Year_month >= "Jan 2006"][Year_month <= "Jun 2007"]
#Here we just do bond*dealer*Year_month
dealer_times_bond_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume) | Actual_trade_party^Year_month^CUSIP_ID, data = Pre_crisis_data)
deal_times_bond_times_week_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) +log(Contra_trade_volume) |
Actual_trade_party^Year_week^CUSIP_ID, data = Pre_crisis_data)
Contra_S_times_bond_times_dealer_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume)| Contra_S^CUSIP_ID^Actual_trade_party^Year_month, data = Pre_crisis_data)
Contra_B_times_bond_times_dealer_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume)| Contra_B^CUSIP_ID^Actual_trade_party^Year_month, data = Pre_crisis_data)
esttex(dealer_times_bond_times_month_fe_Pre_crisis, deal_times_bond_times_week_fe_Pre_crisis, Contra_S_times_bond_times_dealer_times_month_fe_Pre_crisis,
Contra_B_times_bond_times_dealer_times_month_fe_Pre_crisis,
se = "twoway", cluster = c("CUSIP_ID", "Year_month"),
title = "A regression of inter-dealer matched trade costs on our relationship measure for the pre-crisis period as defined in the text.
Variables with B or S at the end indicates that the variable is for the buying or selling counterparty only.
Investment grade is an indicator variable that equals one if the bond is investment grade and zero if not.
Contra volume is the total volume of the contra parties from all trades they undertake in the same calendar month.",
fixef_sizes = TRUE )
The following output is returned to the console
\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lcccc}
& & & & \tabularnewline
\hline
\hline
Dependent Variable:&\multicolumn{4}{c}{Trade costs}\\
Model:&(1)&(2)&(3)&(4)\\
\hline
\emph{Variables}\tabularnewline
Relationship measure&-0.2605$^{***}$&-0.2110$^{*}$&-0.4044$^{**}$&-0.6253$^{***}$\\
&(0.0810)&(0.1098)&(0.2001)&(0.1113)\\
log(Volume)&-0.0392$^{***}$&-0.0282$^{***}$&-0.0322$^{***}$&-0.0224$^{***}$\\
&(0.0044)&(0.0046)&(0.0057)&(0.0049)\\
log(Mean contra volume)&0.0056$^{***}$&0.0050$^{**}$&0.0081$^{*}$&0.0138$^{***}$\\
&(0.0016)&(0.0022)&(0.0044)&(0.0044)\\
\hline
\emph{Fixed-Effects}& & & & \\
Dealer x Time (monthly) x Bond&Yes&No&No&No\\
Dealer x Time (weekly) x Bond&No&Yes&No&No\\
Seller x Bond x Dealer x Time (monthly)&No&No&Yes&No\\
Buyer x Bond x Dealer x Time (monthly)&No&No&No&Yes\\
\hline
\emph{Fit statistics}& & & & \\
Observations& 397,891&397,891&397,891&397,891\\
R$^2$ & 0.91507&0.96558&0.96396&0.97514\\
Within R$^2$ & 0.03447&0.02522&0.02639&0.02169\\
\hline
\hline
\multicolumn{5}{l}{\emph{Two-way (Bond \& Time (monthly)) standard-errors in parentheses.}}\\
\multicolumn{5}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}
Notice that the table does not contain the title or the number of "individuals" per fixed-effect dimension.
Thank you for giving us such a fantastic package!
I think I found a minor bug in l function: it does not take a number (rather than a vector) for its lag argument.
library(fixest)
data("base_did")
pdat = panel(base_did, ~id+period)
The following does not work.
feols(y ~ l(x1, 2), pdat)
But this works.
feols(y ~ l(x1, 2:2), pdat)
Hi (bonjour),
fixef returns an error when the varying slope variable is an integer. See the code below :
##your example
base_vs = iris
names(base_vs) = c(paste0("x", 1:4), "species")
est_vs = feols(x1 ~ x2 | species[x3], base_vs)
fixef(est_vs)
##the example which leads to an error
base_vs$x4=as.integer(base_vs$x3)
est_vs = feols(x1 ~ x2 | species[x4], base_vs)
fixef(est_vs)
##one way to solve it
base_vs$x5=as.numeric(as.integer(base_vs$x3))
est_vs = feols(x1 ~ x2 | species[x5], base_vs)
fixef(est_vs)
Bonne journée,
Having NA
in fixed effects normally produces a warning and drops the NA observations, but this does not happen when the fixed effects are specified with ^
. Here is an example:
library(fixest)
data(base_did)
base_did <- as_tibble(base_did) %>%
mutate(
# create a category with some NAs
icat = ifelse(id < 20, 0, ifelse(id < 30, NA, 1)),
# create another "fixed effect"
period2 = period %% 2
)
table(base_did %>% select(icat, period2), useNA = 'always')
# period2
# icat 0 1 <NA>
# 0 95 95 0
# 1 395 395 0
# <NA> 50 50 0
mdl <- feols(y ~ treat:post | id + period, data = base_did)
summary(mdl)
mdl$nobs # 1,080
# the following produces a nice warning due to missing values of icat
# and drops the 100 rows with missing data
mdl <- feols(y ~ treat:post | id + period + icat, data = base_did)
# NOTE: 100 observations removed because of NA values (Breakup: LHS: 0, RHS: 0, Fixed-effects: 100).
summary(mdl)
mdl$nobs # 980
# this surprised me; I expected the 100 observations with NAs to be dropped again
mdl <- feols(y ~ treat:post | id + period + icat^period2, data = base_did)
summary(mdl)
mdl$nobs # 1,080 again
I'm running version 0.6.0 with R 4.0.1 on Ubuntu 20.04.
When you run the following in R:
options("warnPartialMatchDollar" = TRUE)
And the you use the feglm() function in this package to fit a model which you assign to, say, m1.
Then when you call:
summary(m1)
you get the following warning message:
Warning message:
partial match of 'score' to 'scores'
Hello,
Thanks for the great package; the speed gains and well-documented vignettes have been a boon.
I was wondering if there are any plans to / there be any interest in an implementation of the Ferndandez-Val - Weidner split-sample/debiasing approach to two-way FEs in probit and logit models in the package.
Hi Laurent,
One nice thing that the reghdfe package in Stata does is display the number of clusters. For example, here's what you get if you run the model from your "standard errors" vignette. (Note that I'm clustering twoways here just to demonstrate a slightly more complicated case.)
So we actually see quite a bit of information about how things were clustered and the associated DoF adjustments (including the bottom panel, which is pretty detailed). For the purposes of this issue, though I'm mostly interested in the line that tells us:
(Std. Err. adjusted for **4** clusters in id time)
.
Sticking with the formal definition, I guess we would refer to the number 4 as min(G).
Is it possible to retrieve this same information with fixest? I know that we can get quite a bit of information with, say, attributes(vcov(est))
, but I can't see how to get this G number. (Apologies if I missed it!)
FWIW, I'm mostly asking because I frequently see this information presented at the bottom of regression tables in econ journals... and even more frequently get asked "how big my effective sample is" during presentations. It would be very helpful have it available in the return model object... or even added to etable()
.
Thanks, as ever.
Hi,
I am trying to plot a model with coefplot and I get this error:
Error in coefplot_prms(object = object, ..., sd = sd, ci_low = ci_low, :
Internal error regarding the lengths of vectors of coefficients. Could you report to the author of the fixest package?
Do you have any idea what is going on?
This is not an issue. But I am wondering do you have any plan to offer a python package for fixest. I feel that most econometric estimation packages are provided in R.
But when it comes to large companies that manage big data analytics, support for python is much greater than that for R.
Just a thought.
I noticed that there is a bug with the esttex function when there are multiple clustered standard errors. The table it creates does not compile without modification in Latex.
rm(list = ls(all.names = TRUE))
gc()
#Load the libraries
library(ggplot2)
library(tsibble)#yearweek function.
library(data.table)
library(zoo)
library(Hmisc) #For weighted medians functions.
library(fixest)#For the fixed effects regressions
setFixest_nthreads(6)#Set the number of threads to use
#Load in the data.
Reg_data <- fread(paste0(data_analysis_path, "Trade_cost_reg.csv"))
Reg_data[, `:=`(TRD_EXCTN_DT = as.IDate(TRD_EXCTN_DT), Year_month = as.yearmon(Year_month), Year_week = yearweek(Year_week))]
#Want to set the dictionary for exporting tables nicely to latex
setFixest_dict(c( "CUSIP_ID & Year_month & Actual_trade_party" = "Bond + Period (Monthly) + Intermediary", CUSIP_ID = "Bond", "Year_month" = "Monthly time", "Year_week" = "Weekly time", "log(ENTRD_VOL_QT)" = "log(Volume)", "Actual_trade_party" = "Intermediary",
"Contra_S" = "Selling party", "Contra_B" = "Buying party", "Investment_gradeYes" = "Investment grade", "log(Contra_trade_volume)" = "log(Mean contra volume)",
"Last_month_portion_trds" = "Relationship measure", "Last_month_portion_trds_S" = "Relationship measure_S", "Last_month_portion_trds_B" = "Relationship measure_B", "Trade_costs" = "Trade costs", "CUSIP_ID^Year_month" = "Bond x period (monthly)"))
Bond_times_mon_fe <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume) | CUSIP_ID^Year_month, data = Reg_data)
esttex(Bond_times_mon_fe, se = "threeway", cluster = c("CUSIP_ID", "Year_month", "Actual_trade_party"))
The output I get from this is
\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lc}
& \tabularnewline
\hline
\hline
Dependent Variable:&Trade costs\\
Model:&(1)\\
\hline
\emph{Variables}\tabularnewline
Relationship measure&-0.2777$^{**}$\\
&(0.1136)\\
log(Volume)&-0.0157$^{**}$\\
&(0.0069)\\
log(Mean contra volume)&0.0023\\
&(0.0031)\\
\hline
\emph{Fixed-Effects}& \\
Bond x period (monthly)&Yes\\
\hline
\emph{Fit statistics}& \\
Observations& 6,933,998\\
R$^2$ & 0.40135\\
Within R$^2$ & 0.00819\\
\hline
\hline
\multicolumn{2}{l}{\emph{Three-way (Bond & Monthly time & Intermediary) standard-errors in parentheses.}}\\
\multicolumn{2}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}
The issue is the
\multicolumn{2}{l}{\emph{Three-way (Bond & Monthly time & Intermediary) standard-errors in parentheses.}}\\
Here Latex uses the &
as a column separator. Ideally, I would like to change this to a +
or \&
but haven't been able to find a way to do this, other than changing the output by hand.
Furthermore, if I don't specify the "CUSIP_ID^Year_month" = "Bond x period (monthly)"
in the dictionary the default table will contain CUSIP_ID^Year_month
. This must be modified as ^
is math symbol in Latex. It is preferable that the default table the function creates complies without error. Could you make is so that this is changed to x
automatically?
R version: 3.6.2
Thanks.
Hi Mr Berge !
Loving your package, thanks for releasing it into the wild.
There are times when it useful to do a dummy encoding in the formula itself, such as I have outlined in the title of this post.
(This approach works with lm(), glm(), model.matrix(), etc)
When I do that with feglm(), we get this error message:
ClassType + FaceAmountBand + :
Error in parse(text = x, keep.source = FALSE) :
:1:115: unexpected '=='
1: lhs ~ X + Y + X == 1
^
This error was unforeseen by the author of the function feglm. If you think your call to the function is legitimate, could you report?
Merci Laurent pour ce super paquet!!
It seems fixef()
has a trouble accommodating integer variables when using varying slopes, i.e. if x3
is integer in species[x3]
?
See:
library(fixest)
base_vs = iris
names(base_vs) = c(paste0("x", 1:4), "species")
base_vs$x5 <- as.integer(round(base_vs$x3))
out_double <- feols(x1 ~ x2 | species[x3], base_vs)
out_integer <- feols(x1 ~ x2 | species[x5], base_vs)
out <- fixef(out_double)
fixef(out_integer)
#> Error in cpp_demean(y = S, X_raw = 0, r_weights = 0, iterMax = 1000L, : REAL() can only be applied to a 'numeric', not a 'integer'
Created on 2020-07-19 by the reprex package (v0.3.0)
Minor issue.
The r2
function returns an error when applied to a regression that only has fixed effects but no other covariates. I haven't checked this for all the objects returned, but adjusted r-squared is returned in the summary function in that case, so I assume it could also be returned when the r2
function is called.
library(fixest)
library(magrittr)
# create data
df <- data.frame(x = runif(1000))
df$id <- sample(20, 1000, TRUE)
df$y <- runif(1000) + 0.2 * df$x + 0.05 * df$id
# show that r2 works well in the normal case
fixest::feols(y ~ x | id, data = df) %>%
summary()
#> OLS estimation, Dep. Var.: y
#> Observations: 1,000
#> Fixed-effects: id: 20
#> Standard-errors: Clustered (id)
#> Estimate Std. Error t value Pr(>|t|)
#> x 0.204396 0.020233 10.102 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-likelihood: -151.67 Adj. R2: 0.485
#> R2-Within: 0.04161
fixest::feols(y ~ x | id, data = df) %>%
fixest::r2()
#> sq.cor r2 ar2 pr2 apr2 wr2 war2
#> 0.49531459 0.49531459 0.48500437 0.69270986 0.65218985 0.04161231 0.04063337
#> wpr2 wapr2
#> 0.12289423 0.11711136
# show that there is an adjusted r-squared measure even for cases where only fixed-effects are used
fixest::feols(y ~ 1 | id, data = df) %>%
summary()
#> OLS estimation, Dep. Var.: y
#> Observations: 1,000
#> Fixed-effects: id: 20
#> Log-likelihood: -172.92 Adj. R2: 0.46319
# error when calling r2 on this estimation
fixest::feols(y ~ 1 | id, data = df) %>%
fixest::r2()
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1000
Created on 2020-07-06 by the reprex package (v0.3.0)
Thank you for the great work in the package.
Hi Laurent,
This package has made everything (e.g. clustering, generating latex codes) much easier and faster! Thanks for making it.
I was wondering, if the function esttex has an option that allows me to report t-stats (or z-stats) in the parentheses below the coefficients, instead of sd/se, in the regression output.
And, if not, is there an easier way to customize the latex output so that it contains the t-stats , than manually copying the t-stats and replacing the standard deviations with them in the latex output?
I ask because some journals prefer t-statistics to standard errors. Thanks!
Likely not an issue, rather a question: Do you happen to know a reference for the adjusted R2 of the within model? I have this line in mind:
res[i] = 1 - cpp_ssq(resid(x)) / ssr_fe_only * (n - nb_fe) / (n - nb_fe - df_k)
Stata seems to calculate the adj. R2 in the within case as per the "standard" formula, see this post where someone reworked the Stata formula used: Stata forum
For the one-way fixed effect on Grunfeld data (200 obs) I get with fixest
:
data("Grunfeld", package = "plm")
fe <- feols(inv ~ value + capital | firm, data = Grunfeld)
r2(fe)
# war2 = 0.7642763
With package lfe
I get a different adjusted R2:
summary(felm(inv ~ value + capital | firm, data = Grunfeld))
# Adjusted R-squared: 0.7531104211
felm
seems to use the usual formula for the adjusted R2, without any adjustment in the numerator for the (missing) intercept or absorbed effects.
Thanks for a really useful package.
One error I've encountered is FeNmlm crashing R with a fatal error. I can reproduce this behavior with the example below. Thanks in advance for any help.
library(dplyr)
library(magrittr)
library(fixest)
n_id = 800
n_year = 8
b_true = 1
a_true = 0.5
set.seed(1001)
test_df = data.frame(id = rep(1:n_id, n_year),
year = rep(1:n_year, each = n_id)) %>%
group_by(id) %>%
mutate(mu = rnorm(1, 0, 1)**2) %>% # generate id fixed effect
ungroup %>%
mutate(X = rnorm(n_yearn_id, 0, 1)**2, # random X
Z = rnorm(n_yearn_id, 0, 1)**2, # random Z
Y = rpois(n_yearn_id, mu(1 + b_true*X)^a_true)*Z) # outcome Y
feNmlm(Y ~ log(Z) + log(1 + X) | id,
data = test_df,
verbose = 1) # works perfectly
feNmlm(Y ~ log(Z) | id,
data = test_df,
NL.fml = ~ a*log(1 + X),
NL.start = list(a = 0.5),
verbose = 2) # reports that R encountered a fatal error and restarts`
Hi, I like the package. Is there a way to get the marginal effects for the probit/logit models and their standard errors? I couldn't find it.
Can you add functions to more easily recover the marginal effects? I'd suggest either 1) at the mean for each variable or 2) calculate the marginal effect for every observation and then take the mean. And for the standard errors I guess use the delta method? Thanks!
Hi,
thanks for setting up the FIXEST package, it's really a huge time saver and intuitive to use!
I just wanted to update to the latest github version using devtools::install_github(), unfortunately I get the following errors:
RcppExports.o:RcppExports.cpp:(.text+0x1b26): undefined reference to `get_nb_threads()'
RcppExports.o:RcppExports.cpp:(.text+0x1d49): undefined reference to `setup_fork_presence()'
RcppExports.o:RcppExports.cpp:(.text+0x1e89): undefined reference to `is_in_fork()'
...
ERROR: compilation failed for package 'fixest'
I suspect the errors may be related to the recent changes in "RcppExports.R"?
I tried to update to the latest version on two different devices one running on windows and the other on linux, resulting in the same error.
I noticed that update.fixest
has issues when the data used to estimate is not in scope. If there are no other objects with the same name, you get a helpful error message, but if there are, it's more confusing.
Here's a demo where df
stops referring to the trade data and starts referring to the stats::df
function, causing trouble.
library(fixest)
fn <- function() {
df <- trade
feglm(Euros ~ log(dist_km)|Origin+Destination+Product+Year, data=df)
}
x <- fn()
r2(x)
#> Error in if (n_old != n_new) {: argument is of length zero
Created on 2019-11-22 by the reprex package (v0.3.0)
If I hadn't called the data df
but rather trade2
, I would have gotten the error:
Error in update.fixest(x, . ~ 1 | ., glm.tol = 0.01, fixef.tol = 0.001, :
To apply 'update.fixest', we fetch the original database in the parent.frame -- but it doesn't seem to be there anymore (btw it was trade2).
The issue is here:
Lines 6537 to 6542 in 9abf86b
This would also be a problem if people had multiple datasets with the same name, and were using function boundaries to keep them straight. One solution is to have the model object hold onto a copy of the data, which felm
and lm
do, but that can be its own headache for memory use.
Hello,
Thanks for writing such a fantastic package! My coauthor and I use it everyday!
We would like to report a bug (I think). The following two regressions differ only in that "factor(year)" term is included as a regressor or absorbed. They should generate the same results, but unfortunately they don't.
Std. Errors suggest that the first one ("t0") generates correct outcomes. To confirm this conjecture, I ran the same regression using "reghdfe" package in STATA and got the identical results to "t0".
It seems that the bug is in deciding which variables to drop. The results below show that specification "t1" has one more term than "t0". STATA also drops the term just like "t0".
We also noted that even "t0" specification generates incorrect result without adding fixef.tol argument ("fixef.tol = 10000*.Machine$double.eps"). We wonder whether it may be better to raise the default precision level.
We attach the data we used. Could you please look into this when you have time?
Thank you!
library(fixest)
df <- readRDS("data.rds")
t0 <- feols((log(depsum_cz) - log(l(depsum_cz))) ~ cb:ntr_gap:factor(year) + ntr_gap:factor(year) + cb:factor(year) + factor(year) | bank_id + cz90, data=panel(df, panel.id = c('bank_id_cz', 'year')), fixef.tol = 10000*.Machine$double.eps)
t1 <- feols((log(depsum_cz) - log(l(depsum_cz))) ~ cb:ntr_gap:factor(year) + ntr_gap:factor(year) + cb:factor(year) | factor(year) + bank_id + cz90, data=panel(df, panel.id = c('bank_id_cz', 'year')), fixef.tol = 10000*.Machine$double.eps)
summary(t0, cluster="bank_id")
#> OLS estimation, Dep. Var.: (log(depsum_cz) - log(l(depsum_cz, 1)))
#> Observations: 60,816
#> Fixed-effects: bank_id: 10481, cz90: 731
#> Standard-errors: Clustered (bank_id)
#> Estimate Std. Error t value Pr(>|t|)
#> factor(year)1998:cb 0.044212 0.045045 0.981521 0.326341
#> factor(year)1999:cb -0.069912 0.063748 -1.096700 0.272777
#> factor(year)2000:cb 0.006705 0.051751 0.129565 0.896911
#> factor(year)2001:cb 0.024838 0.034328 0.723532 0.469356
summary(t1, cluster="bank_id")
#> OLS estimation, Dep. Var.: (log(depsum_cz) - log(l(depsum_cz, 1)))
#> Observations: 60,816
#> Fixed-effects: factor(year): 5, bank_id: 10481, cz90: 731
#> Standard-errors: Clustered (bank_id)
#> Estimate Std. Error t value Pr(>|t|)
#> factor(year)1998:cb -0.024328 14.778000 -0.001646 0.998687
#> factor(year)1999:cb -0.138486 14.795000 -0.009360 0.992532
#> factor(year)2000:cb -0.061817 14.764000 -0.004187 0.996659
#> factor(year)2001:cb -0.043704 14.774000 -0.002958 0.997640
#> factor(year)2002:cb -0.068527 14.769000 -0.004640 0.996298
Created on 2020-10-12 by the reprex package (v0.3.0)
Hi, this package is great. I'm wondering if it would be possible to add standard errors/confidence intervals to the output of predict.fixest? I realize that might be quite a bit of work, though.
Related issue in lfe repo. This would also add to fixest's incorporation into the margins package, per this issue.
I'm not sure if I'm being very dense since I don't know how the document would have successfully knit without it, but I don't see the definition of the est_slopes
model in the exporting table vignette. As written it doesn't run for me since it can't find est_slopes
. I assume it was there at some point and is now missing, since the table displays 5 models and only 4 are estimated. However, I didn't see it show up in the commit history.
fixest/vignettes/exporting_tables.Rmd
Lines 36 to 50 in 40c1fe5
Hey Mr Berge !
Found another little bug. Below is a minimal reproduceable example.
Peace
Jordi
+++++++
library(data.table)
library(fixest)
tmp <- data.table(x1 = 1:100,x2 = runif(100)*100)
tmp[, y := x1 + x2]
model <- feglm(y ~ x1 + x2, data = tmp) # works fine
model <- feglm(y ~ bs(x1) + bs(x2), data = tmp) # does not work
I noticed that using ^
can cause issues with formula parsing. Here's the case I tripped over, with stats::terms.formula
called indirectly by fixest::r2
. I think this is a bug, though let me know if I'm misinterpreting.
library(fixest)
gravity_results <- feglm(
Euros ~ log(dist_km) | Origin ^ Destination + Product ^ Year,
data=trade
)
r2(gravity_results)
#> Error in terms.formula(tmp, simplify = TRUE): invalid power in formula
Created on 2019-11-22 by the reprex package (v0.3.0)
When I tried including one fix effect, I found that feols and xtreg from Stata gave totally different results (the sign of the coefficient is opposite). Can you explain why? Are they using different optimization methods?
predict.fixest returns a warning and preditcs all NAs when fixed-effects/cluster_vars are used with a "^" (observed for a feglm):
I tinkered with line 5627 of MiscFuns.R as parse did not handle the "^" by doing a gsub of ^ to , and wrapping it in "paste(...,sep='_')" to make the fixed effects "match" those produced by model fit. This caused cluster_current_unik to correctly return an empty set on line 5628. But there's obviously more to this problem as various other parses involving "^" happen down the line.
I'm rambling (hopefully helpfully). But for now, the only workaround I can think of is to create a new factor variable with paste0(factor1,"",factor2,""...).
Really enjoying this package!
It would be nice to directly include notes under the table, ala the "notes" option in stargazer.
"tidiers" are a set of methods defined by the generics
and broom
packages to extract information from model objects in a standardized fashion.
A tidy.fixest
method would return a data.frame with one row per term, and columns with standard names: term, estimate, std.error, statistic, p.value, conf.low, conf.high
A glance.method
method would return a data.frame with a single row, and columns for the most important goodness-of-fit statistics and model characteristics.
There are two main benefits of supplying tidiers in a package:
tidy
and glance
, such as my modelsummary
package, would support fixest
automatically.As a stopgap measure, I have implemented tidy
and a glance
methods inside modelsummary
: https://github.com/vincentarelbundock/modelsummary/blob/master/R/tidiers.R#L106
Ideally, these methods would be ported over to fixest
itself, as is recommended by the broom
and tidymodels
conventions.
Would you be interested in a Pull Request to add these methods to your package?
This would allow stuff like this:
remotes::install_github('vincentarelbundock/modelsummary')
library(fixest)
library(modelsummary)
url <- 'https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv'
dat <- read.csv(url)
mod <- list()
mod[[1]] <- lm(wage ~ emp + capital, dat)
mod[[2]] <- feols(wage ~ emp + capital | firm, dat)
mod[[3]] <- feols(wage ~ emp + capital | year, dat)
mod[[4]] <- feols(wage ~ emp + capital | firm + year, dat)
msummary(mod, title = "tidy and glance functions allow easy programmatic manipulation of fixest objects. In turn, this allows packages like modelsummary to support fixest out-of-the-box.")
Which produces an HTML or a LaTeX table like this one:
Howdy,
FEOLS appears not to default issue a warning about non-convergence. I only realized this was an issue when I used verbose and adjusted the iteration limit. Even this does not directly tell me it did not converge, but I can see in the output that it hit its iteration limit.
I believe making non-convergence more salient would help most users.
Thank you for your efforts,
Best,
Luke
Subtitle: Ignore variables with dictionary alias
Hi Laurent,
I've been trying to export a subset of my regression results where (a) I'm only really interested in some interaction terms, and (b) I further wish to rename these interaction terms with a dictionary alias. My problem is that in trying to drop the parent terms using the "drop" argument of esttex()
, I invariably end of excluding the interactions variables of interest too... even when I have assigned them a dictionary alias.
An example may help to better illustrate:
library(fixest)
aq = airquality
aq$Month = factor(aq$Month)
mod = feols(Ozone ~ Month / Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).
myDict = c("Month5:Wind"="May * Wind", "Month6:Wind"="Jun * Wind",
"Month7:Wind"="Jul * Wind", "Month8:Wind"="Aug * Wind",
"Month9:Wind"="Sep * Wind")
## Keep everything, including parent terms (but rename interaction terms)
esttex(mod, dict = myDict)
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lc}
#> & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&Ozone\\
#> Model:&(1)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> (Intercept)&50.748$^{***}$\\
#> &(15.748)\\
#> Month6&-41.793\\
#> &(31.148)\\
#> Month7&68.296$^{***}$\\
#> &(20.995)\\
#> Month8&82.211$^{***}$\\
#> &(20.314)\\
#> Month9&23.439\\
#> &(20.663)\\
#> May * Wind&-2.3681$^{*}$\\
#> &(1.3162)\\
#> Jun * Wind&1.6825\\
#> &(2.1141)\\
#> Jul * Wind&-7.0314$^{***}$\\
#> &(1.5399)\\
#> Aug * Wind&-8.5224$^{***}$\\
#> &(1.4015)\\
#> Sep * Wind&-4.2418$^{***}$\\
#> &(1.2575)\\
#> \hline
#> \emph{Fit statistics}& \\
#> Observations& 116\\
#> R$^2$ & 0.54732\\
#> Adjusted R$^2$ & 0.50889\\
#> \hline
#> \hline
#> \multicolumn{2}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}
## Try to drop only parent terms, but ends up excluding everything (including
## interactions renamed with a dictionary alias)
esttex(mod, dict = myDict, drop = paste0("Month", 5:9))
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lc}
#> & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&Ozone\\
#> Model:&(1)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> (Intercept)&50.748$^{***}$\\
#> &(15.748)\\
#> \hline
#> \emph{Fit statistics}& \\
#> Observations& 116\\
#> R$^2$ & 0.54732\\
#> Adjusted R$^2$ & 0.50889\\
#> \hline
#> \hline
#> \multicolumn{2}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}
Created on 2019-12-29 by the reprex package (v0.3.0)
I know that in the vignette you write: "Note that the actions performed by the arguments drop or order are performed before the renaming takes place with the argument dict."
However, I'm wondering if it is possible to switch things around so that drop respects the assigned aliases? Failing that, is there a way to adjust drop's regexp so that it isn't so aggressive (such that it would allow me to keep the child interaction terms in the above example)?
PS — I have a second question about the dict argument, but I'll file that as a separate issue shortly.
Hi Laurent,
In my regression tables, I'd like to have a row at the bottom indicating the mean of the dependent variable (along with the currently supported number of observations and r squared). Is there a way etable could support this feature?
When specifying weights with an feols
model are these weights similar to aweights
in Stata where each observation is treated as a group mean?
I looked at the estimation code here and it looks like that is the case, but I wanted to double check.
The package is great. Thanks for sharing.
I have one question about the data that is used for the estimation. Suppose I have 140K data, and during the estimation it removed a lot of data due to NA or other reasons. It ended up with around 40K data. Is there a way I can know what this 40K data is?
I just came across this package yesterday. Really impressed with both the range of models covered and the implementation speed.
On the subject of speed, I was wondering if you could add some benchmark comparisons for the FixedEffectModels.jl package in Julia. To the best of my knowledge, this is currently the cutting edge for HDFE implementations... at least as far as linear models go. My own quick benchmarks suggest that fixest::feols
compares very favourably for medium sized data sets.
In case you're not familiar with Julia code, here's a quick and dirty example for running one iteration of the largest model in your benchmark simulation. I'll cc the package owner @matthieugomez, who probably has better ideas for benchmarking and may also want to add fixest to his own benchmark comparisons.
### PRELIMINARIES ###
## Download files here: https://drive.google.com/drive/folders/1-1M_vLGduByk5P3qHl7AMBBl85RzHpv7?usp=sharing
## Then run `Data generation.R`
### END PRELIMINARIES ###
cd("/path/to/downloaded/files")
using RData
base_all = load("DATA/base_all_simulations_rect.RData", convert=true)
base_all = base_all["base_all"]
using FixedEffectModels
## Limit to the very last replication data frame (1,000,000 x 5)
base = base_all[40]
## Single thread
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)))
## Multi-threaded
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)), method = :lsmr_threads)
## GPU (requires working CUDA installation)
base.X1 = convert(Vector{Float32}, base[!, :X1]) ## Optional for performance boost
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)), method = :lsmr_gpu)
The interact
function is very handy for differences in differences setups and allows to quickly plot the estimated coefficients with coefplot
. However, by default the function interacts every values of the fe
parameter. This is problematic when one wants to have only some leads and lags.
The basic diff in diff setup presented in the vignette is
data(base_did)
feols(y ~ x1 + treat::period(5) | id + period, base_did)
However, one often wants to set custom number of leads and lags. For instance one would want to have only two pre-treatment coefficients (ie 2 leads) and two post-treatment coefficients (ie 2 lags).
I tried to create a third dummy (hereby called window
) that is equal to 1 if the observation is treated AND in the focus window such that the interaction between the window
and period
gives the right set of dummies. However it creates interaction even when window
is equal to 0 and I end up with collinearity error.
Code to reproduce :
library(dplyr)
df <- mutate(base_did, window = ifelse(period > 2 & period < 8 & treat, 1, 0))
est <- feols(y ~ window::period | id + period, df)
#> ...
#> "Presence of collinearity, covariance not defined. Use function collinearity() to pinpoint the problems."
collinearity(est)
# [1] "Variables 'window:period::1', 'window:period::2', 'window:period::8', 'window:period::9' and 'window:period::10' are constant, thus collinear with the fixed-effects."
The solution is simply to filter out the null dummies when I interact window
and period
.
library(dplyr)
data("base_did")
df <- base_did %>%
mutate(window = ifelse(period > 2 & period < 8 & treat, 1, 0)) %>%
mutate(long_term = ifelse(period >= 8 & treat, 1, 0))
coefs <- df %$% interact(window, period) %>% {.[,colSums(.) !=0]}
colnames(coefs) <- gsub("\\:", "_", colnames(coefs))
dummies <- colnames(coefs) %>%
paste(collapse = "+")
df <- cbind(df, coefs)
frmla <- as.formula(paste("y ~ x1+", dummies, " + long_term | id + period"))
feols(frmla, df)
result:
OLS estimation, Dep. Var.: y
Observations: 1,080
Fixed-effects: id: 108, period: 10
Standard-errors: Clustered (id)
Estimate Std. Error z value Pr(>|z|)
x1 0.975347 0.046274 21.078000 < 2.2e-16 ***
window_period__3 1.049600 0.935761 1.121700 0.262288
window_period__4 -0.473162 0.969770 -0.487912 0.625724
window_period__5 1.324300 0.947332 1.397900 0.162461
window_period__6 2.108000 0.874269 2.411200 0.016088 *
window_period__7 4.921800 0.790574 6.225600 7.17e-10 ***
long_term 6.373900 0.696438 9.152200 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-likelihood: -2,988.30 Adj. R2: 0.48591
R2-Within: 0.38541
lead
and lag
are too economics-centered and using pre
and post
is clearer.feols(y ~ treat::period(ref=5, lead=2, lag=2) | id + period, df)
# Non numeric variable
all_months = c("aug", "sept", "oct", "nov", "dec", "jan",
"feb", "mar", "apr", "may", "jun", "jul")
base_did$period_month = all_months[base_did$period]
feols(y ~ x1 + i(treat, period_month, "oct", lead=c("aug", "sep"), lag=c("nov", "dec")) | id+period, base_inter)
range
parameter to the interact
function.feols(y ~ treat::period(range=3:7) | id + period, df)
feols(y ~ window::period | id + period, df)
I can try implementing the function with a little help from the developer.
Thanks for the great package.
Is there a way to specify which objects to return from a fixest model?
For example can I choose to return the fitted values, sum of the fixed effects or matrix of the scores or not?
I am running a feols model with about 40 million observations that looks like this
est_did_emp_500_col1 = feols(log(employment) ~ small_business_500:post | small_business_500 + post, data=dfr_emp_ind_sample)
I am finding that the returned model objects take up a lot of space.
object.size(est_did_emp_500_col1)
1962264560 bytes
I looked at the size of each component:
lapply(est_did_emp_500_col1,object.size)
$nobs
56 bytes
$fml
1176 bytes
$call
1736 bytes
$method
112 bytes
$fml_full
1736 bytes
$nparams
56 bytes
$fixef_vars
200 bytes
$fixef_id
392379704 bytes
$fixef_sizes
368 bytes
$obsRemoved
358296 bytes
$iterations
56 bytes
$coefficients
304 bytes
$residuals
392378704 bytes
$multicol
56 bytes
$sumFE
392378704 bytes
$fitted.values
392378704 bytes
$scores
392378872 bytes
$hessian
224 bytes
$sigma2
56 bytes
$cov.unscaled
672 bytes
$coeftable
1456 bytes
$se
536 bytes
$sq.cor
56 bytes
$ssr_null
56 bytes
$ll_null
56 bytes
$ssr_fe_only
56 bytes
$ll_fe_only
56 bytes
Are there other good ways to reduce the size of these models?
Thanks.
Thanks for the wonderful package.
I am getting an unpredictable core dump when trying to run lots of poisson feglm
regressions.
Basically I run lots of felm
from lfe
regressions and feglm
regressions with the poisson distribution, using high dimensional fixed effects.
Sometimes my scrips runs all the way through, but after a few successful runs I get
double free or corruption (out)
[1] 10004 abort (core dumped) radian
(I'm running R using radian in the terminal).
An example of a regression is this:
target_poisson = feglm(prob ~ as.factor(counter_month):ratio | factor(stay_counter_month_dummy) + factor(home_counter_month_dummy), data = panel_work)
The data is a balanced panel of 10 months, where each observation in the panel is a "home" and a "stay" location with 275 locations. So the data set is 756,000 observations in total. As you can see, the fixed effects for stay location by month dummies and the home location by month dummies.
I think there are a lot of observations with no variation across the panel and perfectly collinear FEs. There are also a lot of NaN
s so clearly there are a lot of sources of error for this regression.
I think I'm doing a good job making sure I'm putting things in functions and not allocating extra memory. But it's hard to debug to see if that is the real issue. See this thread on discourse.
Can I email you the data set as an MWE?
Here is the full session info:
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.7.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.utils_2.9.2 R.oo_1.23.0 R.methodsS3_1.8.0 lfe_2.8-5.1 Matrix_1.2-18
[6] fixest_0.6.0 data.table_1.12.8 lubridate_1.7.9 forcats_0.5.0 stringr_1.4.0
[11] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3
[16] ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] zoo_1.8-8 tidyselect_1.1.0 haven_2.3.1 lattice_0.20-41 colorspace_1.4-1
[6] vctrs_0.3.2 generics_0.0.2 blob_1.2.1 rlang_0.4.7 pillar_1.4.6
[11] glue_1.4.1 withr_2.2.0 DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8
[16] readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
[21] rvest_0.3.5 parallel_3.6.2 fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5
[26] xtable_1.8-4 scales_1.1.1 backports_1.1.8 jsonlite_1.7.0 fs_1.4.2
[31] hms_0.5.3 stringi_1.4.6 numDeriv_2016.8-1.1 grid_3.6.2 cli_2.0.2
[36] tools_3.6.2 sandwich_2.5-1 magrittr_1.5 Formula_1.2-3 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1 dreamerr_1.2.0 xml2_1.3.2 reprex_0.3.0
[46] assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11 R6_2.4.1 nlme_3.1-148
[51] compiler_3.6.2
The details of clustering and degrees-of-freedom corrections are perennially issues, and challenging for all the issues Sergio pointed out in this thread.
When I run @grantmcdermott's example from that same discussion, feols
gives the same results as lfe::felm
or Stata's cgmreg
, but different than Stata's reghdfe
or Grant's proposed felm(..., cmethod="reghdfe")
.
I'm not sure you want to do anything differently, but it's helpful to document the differences across packages.
Ref: DeclareDesign/estimatr#321, sgaure/lfe#19, FixedEffects/FixedEffectModels.jl#49
Here is some texreg code (written by my PhD student). Not sure 100% compatible going forward, but works for me.
extract.feols <- function(model, include.nobs = TRUE, include.rsquared = TRUE,
include.adjrs = TRUE, include.fstatistic = FALSE, ...) {
s <- summary(model, ...)
nam <- rownames(s$coeftable)
co <- s$coeftable[, 1]
se <- s$coeftable[, 2]
pval <- s$coeftable[, 4]
gof <- numeric()
gof.names <- character()
gof.decimal <- logical()
if (include.nobs == TRUE) {
gof <- c(gof, s$nobs)
gof.names <- c(gof.names, "Num.\ obs.")
gof.decimal <- c(gof.decimal, FALSE)
}
if (include.rsquared == TRUE) {
gof <- c(gof, fixest::r2(s, "r2"), fixest::r2(s, "wr2"))
gof.names <- c(gof.names, "R$^2$ (full model)", "R$^2$ (proj model)")
gof.decimal <- c(gof.decimal, TRUE, TRUE)
}
if (include.adjrs == TRUE) {
gof <- c(gof, fixest::r2(s, "ar2"), fixest::r2(s, "war2"))
gof.names <- c(gof.names, "Adj.\ R$^2$ (full model)",
"Adj.\ R$^2$ (proj model)")
gof.decimal <- c(gof.decimal, TRUE, TRUE)
}
# TODO: fixest does not have builtin F-stat. Will have to calculate ourselves
#
# if (include.fstatistic == TRUE) {
# gof <- c(gof, s$F.fstat[1], s$F.fstat[4],
# s$P.fstat[length(s$P.fstat) - 1], s$P.fstat[1])
# gof.names <- c(gof.names, "F statistic (full model)",
# "F (full model): p-value", "F statistic (proj model)",
# "F (proj model): p-value")
# gof.decimal <- c(gof.decimal, TRUE, TRUE, TRUE, TRUE)
# }
tr <- texreg::createTexreg(
coef.names = nam,
coef = co,
se = se,
pvalues = pval,
gof.names = gof.names,
gof = gof,
gof.decimal = gof.decimal
)
return(tr)
}
``
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.