
FixedEffectModels.jl's Introduction


This package estimates linear models with high dimensional categorical variables and/or instrumental variables.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add FixedEffectModels.

Benchmarks

The objective of the package is similar to the Stata command reghdfe and the R packages lfe and fixest. The package is much faster than reghdfe or lfe, and tends to be a bit faster than the more recent fixest (depending on the exact command). For complicated models, FixedEffectModels can also run on Nvidia GPUs for even faster performance (see below).

(benchmark figure)

Syntax

using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)), Vcov.cluster(:State), weights = :Pop)
#                             FixedEffectModel                            
# =========================================================================
# Number of obs:                 1380   Converged:                     true
# dof (model):                      1   dof (residuals):                 45
# R²:                           0.803   R² adjusted:                  0.798
# F-statistic:                13.3382   P-value:                      0.001
# R² within:                    0.139   Iterations:                       5
# =========================================================================
#         Estimate  Std. Error    t-stat  Pr(>|t|)   Lower 95%    Upper 95%
# ─────────────────────────────────────────────────────────────────────────
# NDI  -0.00526264  0.00144097  -3.65216    0.0007  -0.0081649  -0.00236038
# =========================================================================
  • A typical formula is composed of one dependent variable, exogenous variables, endogenous variables, instrumental variables, and a set of high-dimensional fixed effects.

    dependent variable ~ exogenous variables + (endogenous variables ~ instrumental variables) + fe(fixedeffect variable)

    High-dimensional fixed effect variables are indicated with the function fe. You can add an arbitrary number of high dimensional fixed effects, separated with +. You can also interact fixed effects using & or *.

    For instance, to add state fixed effects use fe(State). To add both state and year fixed effects, use fe(State) + fe(Year). To add state-year fixed effects, use fe(State)&fe(Year). To add state-specific slopes for year, use fe(State)&Year. To add both state fixed effects and state-specific slopes for year, use fe(State)*Year.

    reg(df, @formula(Sales ~ Price + fe(State) + fe(Year)))
    reg(df, @formula(Sales ~ NDI + fe(State) + fe(State)&Year))
    reg(df, @formula(Sales ~ NDI + fe(State)&fe(Year)))
    reg(df, @formula(Sales ~ (Price ~ Pimin)))

    To construct formulas programmatically, use

    reg(df, term(:Sales) ~ term(:NDI) + fe(:State) + fe(:Year))
  • The option contrasts specifies that a column should be understood as a set of dummy variables:

     reg(df, @formula(Sales ~ Price + Year); contrasts = Dict(:Year => DummyCoding()))

    You can specify a different base level:

     reg(df, @formula(Sales ~ Price + Year); contrasts = Dict(:Year => DummyCoding(base = 80)))
  • The option weights specifies a weighting variable:

     weights = :Pop
  • Standard errors are indicated with the prefix Vcov (from the package Vcov.jl):

     Vcov.robust()
     Vcov.cluster(:State)
     Vcov.cluster(:State, :Year)
  • The option save can be set to one of the following: :none (default) to save nothing, :residuals to save residuals, :fe to save fixed effects, and :all to save both. Once saved, they can then be accessed using residuals(m) or fe(m) where m is the estimated model (the object returned by the function reg). Both residuals and fixed effects are aligned with the original dataframe used to estimate the model.

  • The option method can be set to one of the following: :cpu, :CUDA, or :Metal (see Performances below).
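For instance, the save option can be combined with the accessors above (a sketch following the Cigar example; save = :all stores both residuals and fixed effects):

```julia
using DataFrames, RDatasets, FixedEffectModels

df = dataset("plm", "Cigar")
m = reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)); save = :all)
res = residuals(m)   # residual vector, aligned with the rows of df
fes = fe(m)          # DataFrame of estimated fixed effects, also aligned with df
```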

Output

reg returns a light object, composed of:

  • the vector of coefficients & the covariance matrix (use coef, coefnames, vcov on the output of reg)
  • a boolean vector reporting rows used in the estimation
  • a set of scalars (number of observations, degrees of freedom, R², etc.)

Methods such as predict and residuals are still defined but require a dataframe as a second argument. The problematic size of lm and glm models in R or Julia is discussed here, here, here, and here (and for absurd consequences, here and there).
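The accessors mentioned above can be used as follows (a sketch, continuing the Cigar example):

```julia
using DataFrames, RDatasets, FixedEffectModels

df = dataset("plm", "Cigar")
m = reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)))
coef(m)          # vector of coefficients (here, only the one on NDI)
coefnames(m)     # their names
vcov(m)          # covariance matrix of the coefficients
predict(m, df)   # note the dataframe as second argument
```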

You may use RegressionTables.jl to get publication-quality regression tables.

Performance

Multithreading

FixedEffectModels is multi-threaded. Use the option nthreads to select the number of threads to use in the estimation (defaults to Threads.nthreads()).

GPUs

The package has experimental support for GPUs. This can make the package an order of magnitude faster for complicated problems.

If you have an Nvidia GPU, run using CUDA before using FixedEffectModels. Then estimate a model with method = :CUDA.

using CUDA, FixedEffectModels
@assert CUDA.functional()
df = dataset("plm", "Cigar")
reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)), method = :CUDA)

The package also supports Apple GPUs with Metal.jl, although I could not find a way to get better performance with it.

using Metal, FixedEffectModels
@assert Metal.functional()
df = dataset("plm", "Cigar")
reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)), method = :Metal)

Solution Method

Denote the model y = X β + D θ + e where X is a matrix with few columns and D is the design matrix from categorical variables. Estimates for β, along with their standard errors, are obtained in two steps:

  1. y, X are regressed on D using the package FixedEffects.jl
  2. Estimates for β, along with their standard errors, are obtained by regressing the projected y on the projected X (an application of the Frisch-Waugh-Lovell theorem)
  3. With the option save set to :fe or :all, estimates for the high-dimensional fixed effects are obtained by regressing the residuals of the full model minus the residuals of the partialed-out models on D using the package FixedEffects.jl
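The two-step logic can be checked numerically with plain linear algebra (a sketch; the dense dummy matrix D is for illustration only, the package instead partials out D iteratively via FixedEffects.jl without materializing it):

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, g = 300, 10
ids = rand(1:g, n)
D = zeros(n, g)                      # dummy matrix for a categorical variable with g levels
for i in 1:n
    D[i, ids[i]] = 1.0
end
X = randn(n, 2)
y = X * [2.0, -1.0] + D * randn(g) + 0.1 * randn(n)

beta_full = ([X D] \ y)[1:2]         # coefficients on X from the full regression

M = I - D * ((D'D) \ D')             # annihilator of the column space of D
beta_fwl = (M * X) \ (M * y)         # projected y regressed on projected X

# Frisch-Waugh-Lovell: both routes give the same β
@assert isapprox(beta_full, beta_fwl; atol = 1e-6)
```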

References

Baum, C. and Schaffer, M. (2013) AVAR: Stata module to perform asymptotic covariance estimation for iid and non-iid data robust to heteroskedasticity, autocorrelation, 1- and 2-way clustering, and common cross-panel autocorrelated disturbances. Statistical Software Components, Boston College Department of Economics.

Correia, S. (2014) REGHDFE: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects. Statistical Software Components, Boston College Department of Economics.

Fong, DC. and Saunders, M. (2011) LSMR: An Iterative Algorithm for Sparse Least-Squares Problems. SIAM Journal on Scientific Computing.

Gaure, S. (2013) OLS with Multiple High Dimensional Category Variables. Computational Statistics and Data Analysis.

FixedEffectModels.jl's People

Contributors

bjornerstedt, droodman, duncanhobbs, eloualiche, github-actions[bot], greimel, jlperla, jmboehm, juliatagbot, junder873, junyuan-chen, kleinschmidt, matthieugomez, maxnorton, menggai, moritzdrechselgrau, nilshg, nosferican, sboysel, tappek, yuyichao


FixedEffectModels.jl's Issues

@model not defined

Using Julia 0.5.2, I get the following when running

using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StatePooled] = pool(df[:State])
df[:YearPooled] = pool(df[:Year])
reg(df, @model(Sales ~ NDI, fe = StatePooled + YearPooled, weights = Pop, vcov = cluster(StatePooled)))

ERROR: UndefVarError: @model not defined

Instrumental Variable Interactions

I am trying to run a regression where the instrumental variable has an interaction (which I need when replicating Chang et al. (2015) (link to paper)). When I try to run

reg(dropmissing(df2, :Jun), @model(Jun ~ 1 + rc + (D~tau)*(1+rc)))

it gives the error

MethodError: no method matching length(::Expr)

I can share the data with you if needed. The data I have does not perfectly match the paper, but running the regression in R yields results that are close to the paper's results.

Overall, thank you for this package, it really is incredible.

MethodError when LHS is not a Float

I get a MethodError when the LHS is not a float. Error is as follows:
ERROR: MethodError: convert has no method matching convert(::Array{Int64,1}, ::Type{Float64})
 in reg at C:\Users\pbaylis\.julia\v0.4\FixedEffectModels\src\reg.jl:159
 in reg at C:\Users\pbaylis\.julia\v0.4\FixedEffectModels\src\reg.jl:64

MWE to reproduce:

using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:SalesInt] = round(Int64, df[:Sales])
reg(SalesInt ~ Price, df) # Error
df[:SalesIntFloat] = float(df[:SalesInt])
reg(SalesIntFloat ~ Price, df) # Works

Probably has to do with some syntax change in convert(). I think this could be fixed by replacing convert() in reg.jl:159 with float()? My workaround is to do that before calling reg.

(Hopefully this is an actual issue. I pulled the most recent version of the package this time.)

Understanding error involving pinvertible

Running a regression on a subset of my data (which I cannot share unfortunately), Julia produces the following error:

ERROR: UndefVarError: warn not defined
Stacktrace:
 [1] pinvertible(::Array{Float64,2}, ::Float64) at /home/me/.julia/packages/FixedEffectModels/RySyg/src/vcov/vcovcluster.jl:74
 [2] pinvertible at /home/me/.julia/packages/FixedEffectModels/RySyg/src/vcov/vcovcluster.jl:71 [inlined]
 [3] vcov!(::VcovClusterMethod, ::VcovData{LinearAlgebra.Cholesky{Float64,Array{Float64,2}},1}) at /home/me/.julia/packages/FixedEffectModels/RySyg/src/vcov/vcovcluster.jl:33
 [4] #reg#30(::Expr, ::Expr, ::Nothing, ::Nothing, ::Int64, ::Dict{Any,Any}, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::typeof(reg), ::DataFrame, ::Formula) at /home/me/.julia/packages/FixedEffectModels/RySyg/src/reg.jl:342
 [5] (::getfield(FixedEffectModels, Symbol("#kw##reg")))(::NamedTuple{(:fe, :vcov, :method),Tuple{Expr,Expr,Symbol}}, ::typeof(reg), ::DataFrame, ::Formula) at ./none:0
 [6] #reg#29(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:method,),Tuple{Symbol}}}, ::Function, ::DataFrame, ::Model) at /home/me/.julia/packages/FixedEffectModels/RySyg/src/reg.jl:36
 [7] (::getfield(FixedEffectModels, Symbol("#kw##reg")))(::NamedTuple{(:method,),Tuple{Symbol}}, ::typeof(reg), ::DataFrame, ::Model) at ./none:0
 [8] top-level scope at none:0

It happens under the recent master version as well as the latest release. When I drop one (very-)high dimensional FE from the regression, everything goes through. Do you have a hint how to interpret this error message?

updates to internal representation and parsing of formulae in StatsModels v0.2.4

Version 0.2.4 of StatsModels changes when formula parsing happens. Syntactic transformations happen at "compile time", when @formula is called, rather than when constructing Terms. There are a few places here where an un-parsed formula expression could be passed to the Terms constructor, which will fail on 0.2.4. To get around this, you could do one of three things:

  1. Use @eval to call the formula macro: Terms(@eval @formula $nothing ~ $rhs)
  2. Call StatsModels.parse! on the expression, before constructing a Formula, although you may need to wrap it in a call to ~ to ensure that all the syntactic rules are applied correctly, a la StatsModels.parse!( :(~($rhs)) ).
  3. To avoid the deprecated 2-argument constructor, use the 4-argument Formula constructor, a la
    ex_orig = :($nothing ~ $rhs)
    ex = StatsModels.parse!(deepcopy(ex_orig))
    lhs, rhs = ex.args[2:3]
    f = Formula(ex_orig, ex, lhs, rhs)
    

2 and 3 will work for now, but in the future the macro step will do more radical transformations of the formula (but the internal structure of the formula and terms structs will also change at that point).

Segfaults when clustering with large dataset

Hi,

I'm getting a segmentation fault when trying to use clustered standard errors. The dataset is fairly large (46m obs); it could be that I'm just running out of memory. What do you think?

The regression command I'm using is

reg( C ~ interaction |> outputinput , df, VcovCluster([:outputinput]))

and it yields

signal (11): Segmentation fault: 11
while loading no file, in expression starting on line 0
macro expansion at /Users/johannes.boehm/.julia/v0.5/FixedEffectModels/src/vcov/vcovcluster.jl:74 [inlined]
macro expansion at ./simdloop.jl:73 [inlined]
helper_cluster at /Users/johannes.boehm/.julia/v0.5/FixedEffectModels/src/vcov/vcovcluster.jl:73
unknown function (ip: 0x316729720)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
shat! at /Users/johannes.boehm/.julia/v0.5/FixedEffectModels/src/vcov/vcovcluster.jl:55
unknown function (ip: 0x316728f66)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
vcov! at /Users/johannes.boehm/.julia/v0.5/FixedEffectModels/src/vcov/vcovcluster.jl:31 [inlined]
#reg#48 at /Users/johannes.boehm/.julia/v0.5/FixedEffectModels/src/reg.jl:332
unknown function (ip: 0x3166e0f30)
runRegressions at /Users/johannes.boehm/Documents/IndiaCourts/verticalDistance/verticalDistanceModule.jl:256
unknown function (ip: 0x3166baf26)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
do_call at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/interpreter.c:66
eval at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/interpreter.c:190
jl_toplevel_eval_flex at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/toplevel.c:558
jl_toplevel_eval_in_warn at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/builtins.c:590
eval at ./boot.jl:234
jlcall_eval_19752 at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
eval_user_input at ./REPL.jl:64
unknown function (ip: 0x31667b8e6)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
macro expansion at ./REPL.jl:95 [inlined]
#3 at ./event.jl:68
unknown function (ip: 0x3166780df)
jl_call_method_internal at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia_internal.h:189 [inlined]
jl_apply_generic at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1942
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1392 [inlined]
start_task at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/task.c:253
Allocations: 1077545854 (Pool: 1077544121; Big: 1733); GC: 866
Segmentation fault: 11

Running it without VcovCluster() works fine. Any help/thoughts are appreciated.

Best, Johannes

upgrade to dataframes v0.11.4

salut @matthieugomez

is there anything that's holding you back from upgrading to dataframes v0.11 (other than the job market :-) )?
As you know, v0.11 is a major step forward, and I have FixedEffectModels.jl as a blocking dependency in a couple of projects.

cheers
f

FixedEffectModels.jl not handling more than one fixed effect

We're receiving the following error when absorbing more than one fixed effect:

ERROR: PosDefException(1)
in cholfact! at linalg/factorization.jl:38
in cholfact! at linalg/factorization.jl:32
in reg at /user/.julia/v0.3/FixedEffectModels/src/reg.jl:165
in reg at /user/.julia/v0.3/FixedEffectModels/src/reg.jl:15

To replicate this error, we use the following Stata dataset:

use http://www.stata-press.com/data/r9/nlswork.dta
export delim nlswork.csv

Sample Julia code is below. Note that a single fixed effect model works fine. You will see that in modout1 and modout2. However, the combination of fixed effects produces the error.

using SparseFactorModels, DataFrames, FixedEffectModels
df = readtable("nlswork.csv")
df[:pidcode] = pool(df[:idcode])
df[:pyear] = pool(df[:year])

testfactor = SparseFactorModel(:pidcode, :pyear, 1)

modout1 = reg(ln_wage ~ ttl_exp |> pidcode, df)
modout2 = reg(ln_wage ~ ttl_exp |> pyear, df)
modout3 = reg(ln_wage ~ ttl_exp |> pidcode + pyear, df)

No method matching setindex_shape_check

Using the update as presented in #55 , the following error is produced when a non-invertible matrix is encountered:

nohup: ignoring input
┌ Warning: estimated covariance matrix of moment conditions not of full rank.
│                  model tests should be interpreted with caution.
└ @ FixedEffectModels ~/.julia/packages/FixedEffectModels/jRwKa/src/vcov/vcovcluster.jl:74
ERROR: LoadError: MethodError: no method matching setindex_shape_check(::Int64, ::Int64)
Closest candidates are:
  setindex_shape_check(!Matched::AbstractArray{#s72,1} where #s72, ::Integer) at indices.jl:218
  setindex_shape_check(!Matched::AbstractArray{#s72,1} where #s72, ::Integer, !Matched::Integer) at indices.jl:221
  setindex_shape_check(!Matched::AbstractArray{#s72,2} where #s72, ::Integer, !Matched::Integer) at indices.jl:225
  ...
Stacktrace:
 [1] macro expansion at ./multidimensional.jl:694 [inlined]
 [2] _unsafe_setindex!(::IndexLinear, ::Array{Float64,1}, ::Int64, ::Base.LogicalIndex{Int64,BitArray{1}}) at ./multidimensional.jl:689
 [3] _setindex! at ./multidimensional.jl:684 [inlined]
 [4] setindex! at ./abstractarray.jl:1020 [inlined]
 [5] pinvertible(::Array{Float64,2}, ::Float64) at /home/me/.julia/packages/FixedEffectModels/jRwKa/src/vcov/vcovcluster.jl:76
 [6] pinvertible at /home/me/.julia/packages/FixedEffectModels/jRwKa/src/vcov/vcovcluster.jl:71 [inlined]
 [7] vcov!(::VcovClusterMethod, ::VcovData{LinearAlgebra.Cholesky{Float64,Array{Float64,2}},1}) at /home/me/.julia/packages/FixedEffectModels/jRwKa/src/vcov/vcovcluster.jl:33
 [8] #reg#30(::Expr, ::Expr, ::Nothing, ::Expr, ::Int64, ::Dict{Any,Any}, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::typeof(reg), ::DataFrame, ::Formula) at /home/me/.julia/packages/FixedEffectModels/jRwKa/src/reg.jl:342
 [9] (::getfield(FixedEffectModels, Symbol("#kw##reg")))(::NamedTuple{(:subset, :fe, :vcov, :method),Tuple{Expr,Expr,Expr,Symbol}}, ::typeof(reg), ::DataFrame, ::Formula) at ./none:0
 [10] #reg#29(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:method,),Tuple{Symbol}}}, ::Function, ::DataFrame, ::Model) at /home/me/.julia/packages/FixedEffectModels/jRwKa/src/reg.jl:36
 [11] (::getfield(FixedEffectModels, Symbol("#kw##reg")))(::NamedTuple{(:method,),Tuple{Symbol}}, ::typeof(reg), ::DataFrame, ::Model) at ./none:0
 [12] top-level scope at none:0
 [13] include at ./boot.jl:326 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1038
 [15] include(::Module, ::String) at ./sysimg.jl:29
 [16] exec_options(::Base.JLOptions) at ./client.jl:267
 [17] _start() at ./client.jl:436
in expression starting at /home/me/myregression.jl:59

Add a REQUIRE file

The package seems to depend on GLM, so that should be in the REQUIRE file. Nice work!

Benchmarks should use the same data to eliminate potential confounding due to pseudorandom number generator differences

You generate three different datasets and try to "benchmark" performance using different data in each. Instead, you should generate a single dataset and run the estimation commands from the three platforms on the same data. Right now it isn't clear whether the differences are due to differences in estimation performance or to underlying data that allowed convergence to occur more easily.
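A minimal way to do this is to write one simulated dataset to disk and load that same file in each platform (a sketch; the column names, sizes, and file name are illustrative):

```julia
using DataFrames, CSV, Random

Random.seed!(42)                      # fix the RNG so the data is reproducible
n = 1_000_000
df = DataFrame(id1 = rand(1:div(n, 100), n),    # high-dimensional fixed effect
               id2 = rand(1:div(n, 1000), n),   # second fixed effect
               x = randn(n),
               y = randn(n))
CSV.write("benchmark.csv", df)        # load this same file in Stata, R, and Julia
```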

Changed meaning of .+= in Julia 0.5

Your code uses x .+= y, so you should know that in Julia 0.5 this has changed meaning to be equivalent to broadcast!(identity, x, x .+ y), so that it mutates the x array (see JuliaLang/julia#17510 … in Julia 0.6 the whole operation will occur in-place without temporaries). So .+ should only be used if the left-hand side is a mutable array, and you don't mind mutating it.

At first glance, this looks like it is okay for you, because you use it in widths .+= 1, where widths is an array that it looks okay to mutate, but if it is a problem to mutate widths you can always change it to +=.

small speed improvements

Hi,
I'm working on a somewhat larger model (~100m lines, 10 columns and IV) and I've found a few speed improvements that I use myself. Maybe you can use it for inspiration. Let me know if you think any of these could use a PR and I'll make one:

  • I preprocess my data so missing data are removed, but on v0.6 checking completecases sometimes still goes over the full dataframe (fixed in Julia v0.7) here.
  • With a few 10Gb's of data, I've found that copying it takes quite some time. Even in the case of no missing values, a full copy is made here.
  • I've found I can speed up a bit the fixed effect multiplication by adding Threads.@threads in front of some of the for loops like here

On 16 threads these adaptions reduce the processing time by over half on my use case.
Cheers

Error when regressing only on a constant + fe

I'm trying to run a regression of y ~ 1 + fe but I get an error. Any ideas for a fix?

julia> r = reg(dfi, @model(clogi ~ 1, fe = age))
Error showing value of type FixedEffectModels.RegressionResultFE:
ERROR: ArgumentError: reducing over an empty collection is not allowed
Stacktrace:
 [1] _mapreduce(::Base.#identity, ::Base.#scalarmax, ::IndexLinear, ::Array{Any,1}) at .\reduce.jl:265
 [2] maximum(::Array{Any,1}) at .\reduce.jl:454
 [3] show(::IOContext{Base.Terminals.TTYTerminal}, ::FixedEffectModels.CoefTable2) at C:\Users\tbeason\.julia\v0.6\FixedEffectModels\src\RegressionResult.jl:125
 [4] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::FixedEffectModels.RegressionResultFE) at .\REPL.jl:122
 [5] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::FixedEffectModels.RegressionResultFE) at .\REPL.jl:125
 [6] display(::FixedEffectModels.RegressionResultFE) at .\multimedia.jl:194
 [7] hookless(::Media.##7#8{FixedEffectModels.RegressionResultFE}) at C:\Users\tbeason\.julia\v0.6\Media\src\compat.jl:14
 [8] render(::Media.NoDisplay, ::FixedEffectModels.RegressionResultFE) at C:\Users\tbeason\.julia\v0.6\Media\src\compat.jl:27
 [9] display(::Media.DisplayHook, ::FixedEffectModels.RegressionResultFE) at C:\Users\tbeason\.julia\v0.6\Media\src\compat.jl:9
 [10] display(::FixedEffectModels.RegressionResultFE) at .\multimedia.jl:194
 [11] eval(::Module, ::Any) at .\boot.jl:235
 [12] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at .\REPL.jl:144
 [13] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at .\REPL.jl:129
 [14] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at .\REPL.jl:646

StackOverflowError

I am running a fixed effects regression:

N = 30000
M = 20000
O = 5000
data = DataFrame(pid = vcat(collect(1:N), rand(1:N, M)), 
    fid = vcat(repeat(1:O, outer = floor(Int64, N/O)), rand(1:O, floor(Int64, M))), 
    income = randn(N+M))

data[:pid] = categorical(data[:pid])
data[:fid] = categorical(data[:fid])

result = reg(data, @model(income ~ 1, fe = pid + fid, save = true));

and it goes through. But when I add a zero to each of N, M, and O, I get:

StackOverflowError:

and nothing else.

I am on:

  • FixedEffectModels 0.5.2+ master

and

Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Any thoughts?

Possible bug handling variables with very large numbers

Hi,
I've been running regressions while controlling for high-order polynomials in wage. When the order was too high (from the fourth), some of the coefficients were set to zero or not estimated. I reran the regressions after scaling down the polynomial controls by a constant factor, and the regression goes through.

To see the issue, you can run the code:

using DataFrames, FixedEffectModels
df = DataFrame(randn(Float64, (10000,2))*1000)
df[:x3] = map( x-> x^2,df[:x2] )
df[:x4] = map( x-> x^3,df[:x2] )
df[:x5] = map( x-> x^4,df[:x2] )

## Not all coefficients are estimated
reg(df, @model(x1 ~ x2+x3+x4+x5 ))

## Scaling down variables with very large numbers solve the issue
df[:x4] = df[:x4] * .0001
df[:x5] = df[:x5] * .0000001

reg(df, @model(x1 ~ x2+x3+x4+x5 ))

error saving residuals

I run into this issue when using save=true. The regression has fixed effects, IV and clustered errors.

DimensionMismatch("A has size (27404765,5), B has size (6,1), C has size (27404765,)")

Stacktrace:
 [1] gemm!(::Char, ::Char, ::Float64, ::Array{Float64,2}, ::Array{Float64,1}, ::Float64, ::Array{Float64,1}) at ./linalg/blas.jl:1025
 [2] #reg#63(::Expr, ::Expr, ::Void, ::Void, ::Int64, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::FixedEffectModels.#reg, ::DataFrames.DataFrame, ::StatsModels.Formula) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/reg.jl:299
 [3] (::FixedEffectModels.#kw##reg)(::Array{Any,1}, ::FixedEffectModels.#reg, ::DataFrames.DataFrame, ::StatsModels.Formula) at ./<missing>:0
 [4] reg(::DataFrames.DataFrame, ::FixedEffectModels.Model) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/reg.jl:50
 [5] include_string(::String, ::String) at ./loading.jl:522

Related question: do I need to save all fixed effects to obtain the average fixed effect, or is there another way? (I'm running regressions on age subgroups for which I want to compare these averages.)

Cheers, Ken.

Get number of clusters in model metadata

Apologies if I missed it, but is there a way to get/view the number of clusters from a model that's been run?

This would be a nice feature to add to the model metadata, especially given the "smallest-length" adjustment for multiway clusters (https://github.com/matthieugomez/FixedEffectModels.jl/pull/50).

Ideally, this information could then be referenced by RegressionTables.jl too, since most journals like to see the number of clusters reported in your regression tables.

Use a larger data set for benchmark

For the comparison of performance between FixedEffectModels, reghdfe, and felm, could you use a larger dataset? I use felm a lot on data with more than 1 million rows and 10 columns, and I am curious how FixedEffectModels performs on larger data.

Covariates collinear with fixed effects not correctly dropped

Using a standard unbalanced panel data set such as:

qui webuse nlswork, clear
xtset idcode year
encode race, g(Race)

The correct output for a two-way fixed effects is given by:

reghdfe ln_wage hours i.Race, a(idcode year)

or

xtreg ln_wage hours i.Race i.year

The same estimated coefficients can be obtained through R's plm package

model = plm(ln_wage ~ hours + race, data, model = 'within', effect = 'twoways')

Currently the package computes the wrong estimates. I believe the issue lies in using a quasi-demeaning transformation for multi-way fixed effects. I have an implementation of the two-ways fixed effects in Econometrics.jl which uses a slightly more efficient way of the original Wansbeek and Kapteyn (1989) estimator. However, I would love to have Correia 2016 for a more efficient multi-way fixed effects implementation.

As a feature request, could the implementation have an API:

fun(fe::AbstractMatrix, X::AbstractMatrix, y::AbstractVector)

so it may be used more flexibly for other packages as well? For example, for error component models of unobserved effects models an efficient way for one-way and two-way fixed effects (panel and temporal).

bug when clustering vcov with IV

I'm getting a weird bug when I use an IV and try to cluster my errors. Robust errors work fine, but not with person and/or period.

Unfortunately, I've been unable to recreate this bug with a synthetic MWE and I cannot share the data, so I'll post the error in case it rings a bell:

BoundsError: attempt to access 766034×1 Array{Float64,2} at index [766036, 1]

Stacktrace:
 [1] macro expansion at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/vcov/vcovcluster.jl:141 [inlined]
 [2] macro expansion at ./simdloop.jl:73 [inlined]
 [3] helper_cluster(::Array{Float64,2}, ::Array{Float64,2}, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Int64) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/vcov/vcovcluster.jl:140
 [4] shat!(::FixedEffectModels.VcovClusterMethod, ::FixedEffectModels.VcovData{Array{Float64,2},2}) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/vcov/vcovcluster.jl:111
 [5] ranktest!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2}, ::FixedEffectModels.VcovClusterMethod, ::Int64, ::Int64) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/vcov/ranktest.jl:56
 [6] #reg#67(::Expr, ::Expr, ::Void, ::Void, ::Int64, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::FixedEffectModels.#reg, ::DataFrames.DataFrame, ::StatsModels.Formula) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/reg.jl:352
 [7] (::FixedEffectModels.#kw##reg)(::Array{Any,1}, ::FixedEffectModels.#reg, ::DataFrames.DataFrame, ::StatsModels.Formula) at ./<missing>:0
 [8] reg(::DataFrames.DataFrame, ::FixedEffectModels.Model) at /appl/almd01/install/julia/v0.6/FixedEffectModels/src/reg.jl:50

The formula looks something like

@model(sav_flow_log ~ 1 + inc_log + (rate~shock_cum), fe=catpers+catymonth, vcov=cluster(catpers))

Any pointers for this? Thanks!

Benchmark of FixedEffectModels versus lfe in MRO

I was wondering if you could add a benchmark of FixedEffectModels versus lfe in Microsoft R Open instead of GNU R. Microsoft R Open uses MKL and I would like to see if FixedEffectModels is still faster than lfe with MKL support.

missing type leads to errors

On Julia 1.2.0 with FixedEffectModels v0.8.2 ([9d5cd8c9]), I get the following error messages when including a variable which allows for missing values. Take the following example:

using DataFrames, FixedEffectModels

df1 = DataFrame(a=[1.0, 2.0, 3.0, 4.0], b=[5.0, 7.0, 11.0, 13.0])
df2 = DataFrame(a=[1.0, missing, 3.0, 4.0], b=[5.0, 7.0, 11.0, 13.0])

rr1a = reg(df1,@model(a ~ b))
rr1b = reg(df1,@model(b ~ a))
rr2a = reg(df2,@model(a ~ b))
rr2b = reg(df2,@model(b ~ a))

The first two regressions work well, while the third produces

julia> rr2a = reg(df2,@model(a ~ b))

ERROR: MethodError: no method matching Array{Float64,1}(::Array{Float64,2})
Closest candidates are:
  Array{Float64,1}(::AbstractArray{S,N}) where {T, N, S} at array.jl:482
  Array{Float64,1}() where T at boot.jl:423
  Array{Float64,1}(::UndefInitializer, ::Int64) where T at boot.jl:404
  ...
rr2a = reg(df2,@model(a ~ b))

ERROR: MethodError: no method matching Array{Float64,1}(::Array{Float64,2})
Closest candidates are:
  Array{Float64,1}(::AbstractArray{S,N}) where {T, N, S} at array.jl:482
  Array{Float64,1}() where T at boot.jl:423
  Array{Float64,1}(::UndefInitializer, ::Int64) where T at boot.jl:404
  ...
Stacktrace:
 [1] convert(::Type{Array{Float64,1}}, ::Array{Float64,2}) at ./array.jl:474
 [2] #reg#38(::Nothing, ::Expr, ::Nothing, ::Nothing, ::Int64, ::Dict{Symbol,Any}, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::typeof(reg), ::DataFrame, ::StatsModels.FormulaTerm{StatsModels.Term,StatsModels.Term}) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:158
 [3] reg(::DataFrame, ::StatsModels.FormulaTerm{StatsModels.Term,StatsModels.Term}) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:49
 [4] #reg#37(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(reg), ::DataFrame, ::Model) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:36
 [5] reg(::DataFrame, ::Model) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:36
 [6] top-level scope at none:0

and the fourth produces

rr2b = reg(df2,@model(b ~ a))

ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
 [1] macro expansion at /home/holub/.julia/packages/Distributions/tfkz4/src/utils.jl:6 [inlined]
 [2] Type at /home/holub/.julia/packages/Distributions/tfkz4/src/univariate/continuous/fdist.jl:30 [inlined]
 [3] Type at /home/holub/.julia/packages/Distributions/tfkz4/src/univariate/continuous/fdist.jl:35 [inlined]
 [4] Type at /home/holub/.julia/packages/Distributions/tfkz4/src/univariate/continuous/fdist.jl:36 [inlined]
 [5] compute_Fstat(::Array{Float64,1}, ::LinearAlgebra.Symmetric{Float64,Array{Float64,2}}, ::Int64, ::Bool, ::VcovSimpleMethod, ::VcovData{LinearAlgebra.Cholesky{Float64,Array{Float64,2}},1}) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/vcov/utils.jl:14
 [6] #reg#38(::Nothing, ::Expr, ::Nothing, ::Nothing, ::Int64, ::Dict{Symbol,Any}, ::Float64, ::Int64, ::Bool, ::Symbol, ::Bool, ::typeof(reg), ::DataFrame, ::StatsModels.FormulaTerm{StatsModels.Term,StatsModels.Term}) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:363
 [7] reg(::DataFrame, ::StatsModels.FormulaTerm{StatsModels.Term,StatsModels.Term}) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:49
 [8] #reg#37(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(reg), ::DataFrame, ::Model) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:36
 [9] reg(::DataFrame, ::Model) at /home/holub/.julia/packages/FixedEffectModels/kANnE/src/reg.jl:36
 [10] top-level scope at none:0

Previously, observations with missing values were simply dropped for the regression.

Dropping observations concerning singleton category?

Why do you drop observations where one fixed effect is a singleton category, here?

I am running a regression of wage on worker and firm fixed effects. Dropping all singletons is overkill here; making sure the sample is a connected set is enough. Maybe introduce an argument to reg that can switch this off?
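
For reference, a minimal sketch of what a singleton is (plain Julia with made-up data; not package code):

```julia
# A singleton is a fixed-effect level that appears in exactly one observation.
worker = [1, 1, 2, 3, 3]            # worker 2 is observed only once → singleton
counts = Dict{Int,Int}()
for w in worker
    counts[w] = get(counts, w, 0) + 1
end
singletons = sort([w for (w, n) in counts if n == 1])
# singletons == [2]
```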

Displaying RegressionResultIV Throws Error on 0.7

For MWE, see below (in fresh environment, on OSX):

(test3) pkg> st
    Status `Project.toml`
  [a93c6f00] DataFrames v0.13.0
  [9d5cd8c9] FixedEffectModels v0.6.0
julia> using FixedEffectModels, DataFrames
[ Info: Recompiling stale cache file /Users/jacob/.julia/compiled/v0.7/FixedEffectModels/XFTup.ji for FixedEffectModels [9d5cd8c9-2029-5cab-9928-427838db53e3]

julia> ep = randn(1000); Z = randn(1000); X = 3*ep + 5*Z; y = 2 .+ .3*X + 3*ep;

julia> sdf = DataFrame(y=y, X=X, Z=Z, ep=ep);

julia> reg(sdf, @model(y ~ 1 + (X~Z)))
Error showing value of type RegressionResultIV:
ERROR: MethodError: no method matching sprint(::typeof(show), ::Int64; contect=:compact => true)
Closest candidates are:
  sprint(::Function, ::Any...; context, sizehint) at strings/io.jl:97 got unsupported keyword argument "contect"
  sprint(::typeof(showcompact), ::Any...) at deprecated.jl:53 got unsupported keyword argument "contect"
Stacktrace:
 [1] kwerr(::NamedTuple{(:contect,),Tuple{Pair{Symbol,Bool}}}, ::Function, ::Function, ::Int64) at ./error.jl:97
 [2] (::getfield(Base, Symbol("#kw##sprint")))(::NamedTuple{(:contect,),Tuple{Pair{Symbol,Bool}}}, ::typeof(sprint), ::Function, ::Int64) at ./none:0
 [3] top(::RegressionResultIV) at /Users/jacob/.julia/packages/FixedEffectModels/gqD6w/src/RegressionResult.jl:251
 [4] coeftable(::RegressionResultIV) at /Users/jacob/.julia/packages/FixedEffectModels/gqD6w/src/RegressionResult.jl:71
 [5] show(::IOContext{REPL.Terminals.TTYTerminal}, ::RegressionResultIV) at /Users/jacob/.julia/packages/FixedEffectModels/gqD6w/src/RegressionResult.jl:92
 [6] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::RegressionResultIV) at ./sysimg.jl:195
 [7] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:131
 [8] display(::REPL.REPLDisplay, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:135
 [9] display(::RegressionResultIV) at ./multimedia.jl:287
 [10] #invokelatest#1 at ./essentials.jl:691 [inlined]
 [11] invokelatest at ./essentials.jl:690 [inlined]
 [12] print_response(::IO, ::Any, ::Any, ::Bool, ::Bool, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:154
 [13] print_response(::REPL.AbstractREPL, ::Any, ::Any, ::Bool, ::Bool) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:139
 [14] (::getfield(REPL, Symbol("#do_respond#40")){Bool,getfield(REPL, Symbol("##50#59")){REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:708
 [15] (::getfield(REPL, Symbol("##55#64")))(::Any, ::Any, ::Vararg{Any,N} where N) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:971
 [16] #invokelatest#1 at ./essentials.jl:691 [inlined]
 [17] invokelatest at ./essentials.jl:690 [inlined]
 [18] (::getfield(REPL.LineEdit, Symbol("##27#28")){getfield(REPL, Symbol("##55#64")),String})(::Any, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/LineEdit.jl:1319
 [19] prompt!(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/LineEdit.jl:2353
 [20] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/LineEdit.jl:2256
 [21] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:1029
 [22] run_repl(::REPL.AbstractREPL, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:191
 [23] (::getfield(Base, Symbol("##831#833")){Bool,Bool,Bool,Bool})(::Module) at ./logging.jl:311
 [24] #invokelatest#1 at ./essentials.jl:691 [inlined]
 [25] invokelatest at ./essentials.jl:690 [inlined]
 [26] macro expansion at ./logging.jl:308 [inlined]
 [27] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:340
 [28] exec_options(::Base.JLOptions) at ./client.jl:252
 [29] _start() at ./client.jl:432

DOF adjustment for nested FE's in cluster

Hi,

I guess this is more a feature request than a bug report: could we adjust the degrees of freedom when fixed effects are nested within clusters?
reghdfe is doing that, see this section in the FAQ.

I guess this is what is accounting for the different SE's in here:

reg(dfPosSales, @model(logSales ~ IS_NN + ISDR_NN + IST_NN , 
	fe = NUMIDPooled&YEARPooled + CATEGORYPooled&YEARPooled, vcov = cluster(NUMIDPooled)))
                        Fixed Effect Model                        
==================================================================
Number of obs:            459909  Degrees of freedom:       309761
R2:                        0.846  R2 within:                 0.059
F-Statistic:             943.846  p-value:                   0.000
Iterations:                  172  Converged:                  true
         Estimate Std.Error  t value Pr(>|t|) Lower 95%  Upper 95%
IS_NN     3.07387  0.102905   29.871    0.000   2.87218    3.27556
ISDR_NN  0.856638  0.383811  2.23193    0.026  0.104376     1.6089
IST_NN   -1.97663   0.97583 -2.02559    0.043  -3.88924 -0.0640214

With reghdfe in Stata:

. reghdfe logsales is_nn isdr_nn ist_nn, absorb(numidyear catyear) vce(cluster numid)
(dropped 208946 singleton observations)
(converged in 49 iterations)

HDFE Linear regression                            Number of obs   =    250,963
Absorbing 2 HDFE groups                           F(   3,  46962) =    1007.81
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.7619
                                                  Adj R-squared   =     0.6021
                                                  Within R-sq.    =     0.0589
Number of clusters (numid)   =     46,963         Root MSE        =     1.8340

                              (Std. Err. adjusted for 46963 clusters in numid)
------------------------------------------------------------------------------
             |               Robust
    logsales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       is_nn |   3.073868   .0590806    52.03   0.000      2.95807    3.189667
     isdr_nn |   .8566382   .2203572     3.89   0.000     .4247349    1.288541
      ist_nn |  -1.976629   .5602523    -3.53   0.000    -3.074732   -.8785263
------------------------------------------------------------------------------

Absorbed degrees of freedom:
---------------------------------------------------------------+
 Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     | 
-------------+-------------------------------------------------|
   numidyear |            0           98400          98400 *   | 
     catyear |         2393            2394              1     | 
---------------------------------------------------------------+
* = fixed effect nested within cluster; treated as redundant for DoF computation
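
A minimal sketch of the nesting check that reghdfe's rule implies (my own helper, not package code; it assumes a fixed effect counts as nested when each of its levels appears with a single cluster value):

```julia
# Returns true when every level of `fe` maps to exactly one cluster in `cl`,
# i.e. the fixed effect is nested within the cluster variable and its levels
# can be treated as redundant for the DoF correction.
function nested_in_cluster(fe::AbstractVector, cl::AbstractVector)
    seen = Dict{eltype(fe),eltype(cl)}()
    for (f, c) in zip(fe, cl)
        if haskey(seen, f)
            seen[f] == c || return false
        else
            seen[f] = c
        end
    end
    return true
end

nested_in_cluster([1, 1, 2, 2], ["a", "a", "b", "b"])  # true: nested
nested_in_cluster([1, 1, 2, 2], ["a", "b", "b", "b"])  # false: level 1 spans two clusters
```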

Storing via save = true

I'm failing to access the stored fixed effects, and predictions, for example, do not work either. Am I applying save = true incorrectly?

using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StatePooled] =  pool(df[:State])
df[:YearPooled] =  pool(df[:Year])
model =  @model(Sales ~ NDI, fe = StatePooled + YearPooled, save = true)
result = reg(df,model)
show(result)
predict(result,df)

yields
predict is not defined by default for fixed effect models. Run reg with the the option save = true

Deviation from reghdfe std errors with multiway & interacted clustering

@yywingliang and I identified the cause of the divergence between the standard errors generated for these multiway-clustered SE test models and the results given for the same models by Stata's reghdfe. I thought I would detail it here because the current algorithm seems deliberately chosen, so I wanted to check whether it would be most helpful for us to document the difference but leave the code as is, submit a PR replacing the current algorithm with the reghdfe approach, or add an optional argument switching between the two.

The current algorithm in src/vcov/vcovcluster.jl estimates the variance-covariance matrix as the sum of several component variance estimates, each individually weighted by C/(C − 1), where C is the number of clusters in that component's dimension, only if the number of clusters in that dimension is less than the number of observations.

In contrast, reghdfe weights the sum of the component variances by C′/(C′ − 1), where C′ is the number of clusters in the smallest cluster dimension.
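
To make the two corrections concrete, a small sketch (my own helper names and made-up cluster counts, not the package's API):

```julia
# Per-component correction (current FixedEffectModels behavior, as described):
# each component's variance estimate is scaled by its own C/(C-1).
weight_per_component(C::Int) = C / (C - 1)

# reghdfe-style correction: scale every component by C'/(C'-1), where C' is
# the number of clusters in the smallest cluster dimension.
weight_smallest(Cs) = minimum(Cs) / (minimum(Cs) - 1)

Cs = (50, 2400)                # e.g. 50 firms × 2400 months
weight_per_component.(Cs)      # (50/49, 2400/2399): tiny correction on the big dimension
weight_smallest(Cs)            # 50/49 applied to every component
```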

Would it be preferable to document the difference, replace with the reghdfe algorithm, or implement the reghdfe algorithm as an alternative option (and if the last, is there an approach you'd like to see us follow)?

Thanks!

Estimate constant in case of FixedEffects

Like Stata's areg or reghdfe does. More generally, make sure the same coefficients are estimated with fixed effects.

Actually, ivreghdfe does not give a constant, so I'm not sure it's worth copying this.

error with partial_out in 0.7

partial_out returns an error on julia 0.7, but works on 0.6:

using DataFrames, RDatasets, FixedEffectModels, StatsModels
iris = dataset("datasets", "iris")
iris[:SpeciesCategorical] =  categorical(iris[:Species])
partial_out(iris, @formula(SepalLength + SepalWidth ~ 1), fe = :SpeciesCategorical)
ERROR: LoadError: MethodError: no method matching iterate(::FixedEffect{UInt8,Ones{Float64},Ones{Float64}})
Closest candidates are:
  iterate(::Core.SimpleVector) at essentials.jl:578
  iterate(::Core.SimpleVector, ::Any) at essentials.jl:578
  iterate(::ExponentialBackOff) at error.jl:171
  ...
Stacktrace:
 [1] iterate at ./generator.jl:44 [inlined]
 [2] collect(::Base.Generator{FixedEffect{UInt8,Ones{Float64},Ones{Float64}},getfield(FixedEffectModels, Symbol("##72#73"))}) at ./array.jl:619
 [3] #partial_out#71(::Symbol, ::Nothing, ::Bool, ::Int64, ::Float64, ::Symbol, ::Function, ::DataFrame, ::Formula) at /Users/pascalgolec/.julia/packages/FixedEffectModels/J5qW4/src/partial_out.jl:87
 [4] (::getfield(FixedEffectModels, Symbol("#kw##partial_out")))(::NamedTuple{(:fe,),Tuple{Symbol}}, ::typeof(partial_out), ::DataFrame, ::Formula) at ./none:0

F-Statistic Calculation

When running a regression, I get very different F-stats with FixedEffectModels vs. other programs. Running a simple OLS (on one dataset), I get an F-stat of about 2000, while in R or with the Julia GLM package the result is about 900. I looked into this issue, and I think the calculation of the F-stat is wrong. In vcov/utils.jl currently:

    F = (Diagonal(coefF) * (matrix_vcov \ Diagonal(coefF)))[1]
    df_ans = df_FStat(vcov_method_data, vcov_data, hasintercept)
    dist = FDist(nobs - hasintercept, max(df_ans, 1))

But I have never seen F-stats calculated this way. Most methods estimate a new model (with only the intercept) and compare the standard error. This is not ideal, so based on this stackoverflow question and answer, I think these three lines should be

    F = ((coefF' * matrix_vcov^(-1) * coefF)/length(coefF))
    df_ans = df_FStat(vcov_method_data, vcov_data, hasintercept)
    dist = FDist(length(coefF), df_ans)

The first part is essentially


F = β′ (σ̂² (X′X)⁻¹)⁻¹ β / k,   with k = length(β)

Doing this makes it match the GLM package and regressions in R (including for felm).

Changing the distribution is also needed. I think what I have now is correct, but I was testing this on unclustered data which might change the results.
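
As a worked illustration of the proposed formula (made-up numbers and plain LinearAlgebra; not package code):

```julia
using LinearAlgebra

# Hypothetical estimates for two tested coefficients (constant excluded)
β = [0.5, -0.2]                    # coefficient estimates
V = [0.01  0.002;                  # their estimated covariance matrix
     0.002 0.02]
k = length(β)

F = (β' * inv(V) * β) / k          # proposed Wald F-statistic
# the p-value would then come from FDist(k, residual dof), as in the proposed code
```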

Sparse exogeneous model matrix support

Now that DataFrames.jl supports generating sparse model matrices, it would be great to put them to use in this package, for cases when you have very high-dimensional fixed effects but also moderately high-dimensional exogenous categorical variables of interest (where the resulting dense covariance matrix would be large but still fit in memory) when working with a large number of observations.

After starting to work on this, most of the issues I've run into so far have just been a matter of generalizing method signatures, but my lack of experience working with Cholesky factorizations is tripping me up when it comes to reproducing the right pivots and permutations in some of the more numerical code. Specifically, basecol is implemented as

function basecol(X::Matrix{Float64}...)
    chol = cholfact!(crossprod(X...), :U, Val{true})
    ipermute!(diag(chol.factors) .> 0, chol.piv)
end

so chol is the upper part of the pivoted factorization I think? The sparse equivalent (I believe) is just chol = cholfact(crossprod(X...)), since the sparse factorization is pivoted by default. From there though, there are options to extract the factorization with or without the effects of the permutation matrix (chol[:U] vs chol[:UP]) - is this choice related to the inverse permutation on the second line of the current basecol implementation? What would a version of this function with support for sparse X look like (given crossprod were sparse-compatible)?

My apologies for the fact that I don't really know what I'm talking about! I hope this is at least somewhat coherent, and a reasonable thing to be asking about.

Update to StatsModels 0.6.0

When I run using FixedEffectModels I get the following warnings:

WARNING: could not import StatsModels.Formula into FixedEffectModels
WARNING: could not import StatsModels.Terms into FixedEffectModels
WARNING: could not import StatsModels.evalcontrasts into FixedEffectModels
WARNING: could not import StatsModels.check_non_redundancy! into FixedEffectModels

It happens just after the last update of StatsModels

Please help

Implementation in JuliaDB for Parallelism

I need to run many heavy FE regressions on exactly the same large dataset (5GB+). I've written code to run each regression on different workers. Essentially, I'm copying the dataframe into each worker and running a parallel loop. However, this process consumes a lot of memory and sometimes Julia runs out of memory.

I've checked on the possibility of loading the dataframe on a single worker and allowing other workers to read it, in order to avoid loading the data in each worker. Apparently DataFrames cannot handle that yet, but JuliaDB has such functionality.

Do you see any simple way to implement this code on JuliaDB instead of DataFrames?

Perhaps this could be an easy way to exploit simple parallelism efficiently when needing to run many regressions with different specifications on the very same data.

Thanks
Diogo

NaN estimates?

On parts of my sample (~80 million observations) I get NaN estimates, on others not; the full sample throws an OutOfMemoryError().

The model is like @model(float ~ cat1 + cat2&cat3); the LHS is Float64 and the RHS is full of CategoricalValue variables.

Is FixedEffectModels capable of handling samples that big?
I guess one of the factorizations fails, but what could be the reason?

Can we choose the 'Leave out' category for FixedEffectModels.jl?

Hi Matthieu and FixedEffectModels dev team,

As a resource-constrained graduate student, I think being able to run fixed effects models in Julia is a wonderful idea. Thank you so much for your time and effort putting this together.

My question is: is there a way to choose which value of a categorical variable FixedEffectModels.jl uses as the 'leave out' category? I am running an event study regression and I would like to leave out the -1 time period, but FixedEffectModels.jl always chooses the earliest value (in my case -79).

Sorry for the StackOverflow-like question, I just wanted to know if there was functionality for choosing the leave out category.

Thanks so much for your help.

Best,

Dave

Add more tests

  • Add more tests that residuals and fixed effects are correct in the case of IV, weights, and subsets in RegressionResult.

  • Check that standard errors and F-stat are similar to Stata's ivreg2, small and reg.

test fail randomly when using threads

My suggestion to use threads in #26 was apparently not such a good idea. When I run julia with more than one thread, the tests always fail, each time at a different coefficient test (I haven't seen it get beyond the simple tests). Looking at the code I don't see any race conditions, but I'm no expert.

I don't know if Travis allows it, but perhaps you can run the tests with JULIA_NUM_THREADS=2 in the Travis configuration here?

I will run single threaded for now.
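
A hedged sketch of what that might look like in .travis.yml (assuming Travis's env matrix syntax; untested):

```yaml
# Run the test suite once single-threaded and once with two threads
env:
  - JULIA_NUM_THREADS=1
  - JULIA_NUM_THREADS=2
```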

Coefficient p-values and CIs with nested clusters

Over at the R lfe repo, there's an ongoing discussion about differences between lfe::felm() and Stata's reghdfe. See: sgaure/lfe#1 (comment)

(Summary version: The differences appear to be related to the same nested-clusters-degrees-of-freedom issue that @maxnorton appears to have addressed in various PRs for this Julia implementation back in December. https://github.com/matthieugomez/FixedEffectModels.jl/pull/48 and https://github.com/matthieugomez/FixedEffectModels.jl/pull/50)

I mention this here, because in comparing the results across different software packages, I noticed that FixedEffectModels.jl still gives different p-values and confidence intervals to Stata. That is, despite the fact that the standard errors are in agreement.

Do the DOF for the model coefficients still need to be adjusted for cluster nesting?

For convenience, I'll repost the MWE from the other thread below. Again, the coefficient point estimates and SEs are in agreement. But the p-vals and CIs differ.


Julia

using CSV, DataFrames, FixedEffectModels

DF = DataFrame(load("https://gist.githubusercontent.com/grantmcdermott/d56142f2a1cdab3ad3466e642c6ceedc/raw/ffba2f7872ced7c0f3853c349251aa7d62e160a4/fe-cluster-sim.csv"))

DF.fe =  categorical(DF.fe)
DF.cl =  categorical(DF.cl)

reg(DF, @model(y ~ x, fe = fe, vcov = cluster(cl)))
#=
                     Fixed Effect Model                     
=============================================================
Number of obs:           2500   Degrees of freedom:         2
R2:                     0.237   R2 within:              0.002
F-Statistic:          4.56065   p-value:                0.038
Iterations:                 1   Converged:               true
      Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
x    0.0283565 0.0132782 2.13557    0.033 0.0023191 0.0543939
=#

Stata (version 13)

import delimited https://gist.githubusercontent.com/grantmcdermott/d56142f2a1cdab3ad3466e642c6ceedc/raw/ffba2f7872ced7c0f3853c349251aa7d62e160a4/fe-cluster-sim.csv

reghdfe y x, absorb(fe) vce(cluster cl)
/*
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =      2,500
Absorbing 1 HDFE group                            F(   1,     49) =       4.56
Statistics robust to heteroskedasticity           Prob > F        =     0.0377
                                                  R-squared       =     0.2373
                                                  Adj R-squared   =     0.0466
                                                  Within R-sq.    =     0.0018
Number of clusters (cl)      =         50         Root MSE        =     6.4882

                                    (Std. Err. adjusted for 50 clusters in cl)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0283565   .0132782     2.14   0.038      .001673    .0550401
       _cons |   2.372491   .0022073  1074.83   0.000     2.368056    2.376927
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          fe |       500         500           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
*/

Horizontal lines in REPL prompt in Juno

FixedEffectModels changes the prompt in the REPL in Juno. I get the prompt with three horizontal lines:

julia> ======================================================================
----------------------------------------------------------------------
======================================================================

When run in a terminal this does not happen. I believe this might be due to lines 175, 178 and 186 in RegressionResult.jl, where println("=" ^totalwidth) should be println(io, "=" ^totalwidth).
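
A minimal illustration of the suggested fix (a generic sketch, not the package's actual show method):

```julia
# show methods should write to the IO handle they receive, not to stdout;
# a bare println leaks into the REPL prompt in IDEs like Juno.
function show_rule(io::IO, totalwidth::Int)
    println(io, "=" ^ totalwidth)   # captured by whatever display called us
end

buf = IOBuffer()
show_rule(buf, 10)
String(take!(buf))   # "==========\n"
```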

Two-way clustering error when negative elements on vcov diagonal?

Just discovered this project today. It's fantastic. Unbelievably fast.

I'm getting the following error (full error attached as errortext.txt) when doing two-way clustering.
DOMAINERROR: sqrt will only return a complex result if called with a complex argument.

I suspect it's due to the vcov matrix not being positive semi-definite, as sometimes occurs with two way clustering. CGM (2011) mention this issue and describe a fix (pp. 241-242). I think reghdfe implements the fix but includes a warning message? Can't quite remember. I would give it a shot but I just started using Julia so I'm pretty sure I'd muck it up.

Here's a link to a gist that has the MWE and dataset, plus the full error.
https://gist.github.com/pbaylis/66a8ef7869bb23ca5b90

Sorry, this is a silly way to attach these files.

Reliance on SuiteSparse

FixedEffectModels currently relies on SuiteSparse (via Base.SparseArrays.CHOLMOD.Factor{Float64} in src/fixedeffect/FixedEffectProblem_Factorization.jl). This means that when JuliaPro with MKL is installed, Julia is compiled without SuiteSparse and the FixedEffectModels.jl tests fail.

I guess a proper solution would be to use MKLSparse.jl in case MKL is available. In the meantime, the only workaround is to use non-MKL builds of Julia.

This isn't really high priority, but I wanted to document it in case someone runs into the same problem.
