edunford / tidysynth Goto Github PK

View Code? Open in Web Editor NEW

97.0 97.0 14.0 8.81 MB

A tidy implementation of the synthetic control method in R

License: Other

R 100.00%

tidysynth's People

Contributors

Stargazers

Watchers

Forkers

arlionn wzhao0619 jmzarco arjeneve alex-deafonseca etiennebacher qmzheng09work unme621 sachi-fernando fluellen shizelong1985

tidysynth's Issues

Are V weights obtained through cross-validation?

In the outer optimization of the V weights, are you minimizing the MSE within-sample, or doing some sort of train-test split, and optimizing V using the test set?

grab_signficance time range error

When specifying an incorrect time frame,grab_signficance() throws the following error. In the example below, 1993 is the intervention time, thus the range does not include post-intervention data.

This can be replicated using the example in the readme using the following date range: smoking_out %>% grab_signficance(1970:1988).

Thanks Dr. Dunford!

Small but inconvenient bug in tidysynth when using only 1 pre-treatment period

Thanks for writing this great package. I started to change my teaching material from Synth to tidysynth but then encountered a small issue. For demonstration I use a small selection of the smoking data (with Colorado and Utah as donors and 1988 cigsale and age15to24 as predictors). Here are the v and w weights when using Synth. (I attach the full replicable R syntax in this file):

$tab.v
v.weights
cigsale 1
age15to24 0

$tab.w
w.weights unit.names unit.numbers
2 0.886 2 2
3 0.114 3 3

This only replicates with tidysynth if I drop all data from before 1988 from the data.frame.

grab_unit_weights(df_out)
A tibble: 2 x 2
unit weight

1 Colorado 0.886
2 Utah 0.114

grab_predictor_weights(df_out)
A tibble: 2 x 2
variable weight

1 age15to24_1988 1.30e-25
2 cigsale_1988 1 e+ 0

That does not allow me to plot the data from 1970 onward, only from 1988 onward.

Using the full data (to be able to draw the graph across the entire range) yields a (wrong) solution that ignores pre-treatment cigsale:

grab_unit_weights(df_out)
A tibble: 2 x 2
unit weight

1 Colorado 0.668
2 Utah 0.332

grab_predictor_weights(df_out)
A tibble: 2 x 2
variable weight

1 age15to24_1988 1.00e+ 0
2 cigsale_1988 6.83e-14

This feels like a minor glitch and I hope you can fix it.
Kind regards
Hendrik

Not getting correct unit of intervention

I am using pretty much the exact same code as provided in the example for tidysynth, however, the graph I am receiving has the incorrect unit where the intervention occurred. The unit of intervention I assigned is "Colombia", however, it is returning "Argentina" for all graphs.

`code:

all_data <- correct_missing %>% synthetic_control(outcome = loghomicides, # outcome
          unit = Country, # unit index in the panel data
          time = Year, # time index in the panel data
          i_unit = "Colombia", # unit where the intervention occurred
          i_time = 2015, # time period when the intervention occurred
          generate_placebos=T # generate placebo synthetic controls (for inference)
                ) %>%

generate_predictor(time_window = 2005:2015,
          prison_pop = mean(logprison_population, na.rm = T),
          gini_ind = mean(loggini_index, na.rm = T),
          unemployment_log = mean(logunemployment, na.rm = T),
          serious_assault = mean(logserious_assault_rate, na.rm = T)
          ) %>%

generate_predictor(time_window = 2010,
                  homicides_2010 = loghomicides) %>%
generate_predictor(time_window = 2006,
                 homicides_2006 = loghomicides) %>%
generate_predictor(time_window = 2015,
                 homicides_2015 = loghomicides) %>%

generate_weights(optimization_window = 2005:2015, 
         margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6) %>% generate_control()`

Error: Input must be a vector, not a `tbl_df/tbl/data.frame/synth_tbl` object.

Error when attempting to run the test code in the README. Specifically, the generate_predictor() function throws an error: "Error: Input must be a vector, not a tbl_df/tbl/data.frame/synth_tbl object."

Placebo

require(tidysynth)
data("final")
final %>% dplyr::glimpse()
is.data.frame(final)

violent_crimes_out <-

final%>%

initial the synthetic control object

synthetic_control(outcome = final$Violent crime, # outcome
unit = final$Area, # unit index in the panel data
time = final$Year, # time index in the panel data
i_unit = "Colorado", # unit where the intervention occurred
i_time = 2013, # time period when the intervention occurred
generate_placebos=TRUE # generate placebo synthetic controls (for inference)
) %>%

Generate the aggregate predictors used to fit the weights

average log income, retail price of cigarettes, and proportion of the

population between 15 and 24 years of age from 1980 - 1988

generate_predictor(time_window = 2013:2019,
tx_desemprego = mean(Taxa de desemprego, na.rm = T),
renda_percapita = mean(Renda per Capita, na.rm = T),
tx_conservadorismo = mean(Conservadorismo, na.rm = T)) %>%

Generate the fitted weights for the synthetic control

generate_weights(optimization_window = 2007:2019, # time to use in the optimization task
margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
) %>%

Generate the synthetic control

generate_control()

Error in dplyr::mutate():
ℹ In argument: placebo = ifelse(final$Area != i_unit, 1, 0).
Caused by error:
! placebo must be size 7 or 1, not 91.
Run rlang::last_error() to see where the error occurred.

This error is happening and i've runned out of ideas on how to fix it. Can someone help me, i don't know what else to do.

Dynamically adding predictors

I am trying to create a shiny dashboard for an SCM implementation and this requires dynamically adding lag variables. I am running into the problem that I can't dynamically specify the name of the predictor, ie the cigsale_1975 in the example, since it is not a string. Is there a way of doing this, even just under the hood for the time being? It would also be really useful to pass in an array of values.

Edit: For clarification, I can't pass in a variable with a string, ie,

lagVarName = "cigsale_1975"

generate_predictor(time_window = 1975, lagVarName  = 1975)

Package was removed from CRAN

Hi,

thank you for developing this wonderful package, it really makes it very easy and fast to implement the synthetic control method. :) In case you did not yet notice, I just wanted to highlight that today the package was archived/removed from CRAN as it: "requires archived package 'LowRankQP' ". I noticed that the same problem occurred with Synth package. I hope it will be possible to get the packages back to CRAN.

Thanks again for all your hard work with the package

Memoise/cache option

This package looks super, thanks for putting it together.

Quick request as per our Twitter convo: I think it could useful to add a memoisation/caching option to synthetic_control (or generate_weights).

There are several potential advantages to this. But the particular use-case that I'm thinking of is running a large placebo trial where, say, a single placebo fails because a tolerance issue. Instead of having to rerun the entire ensemble from scratch, we could then just tweak the tolerance for that single placebo and load the remaining results from the cache.

The one tricky thing is how to pass a bespoke argument (or set of arguments) that only maps to that individual placebo and not everything else. E.g. As it it stands, tweaking the general sigf_ipopargument would propagate to the entire ensemble and thus invalidate the cache.

na.rm issue

The tidysynth package works smoothly for me, except for one issue. For one particular variable, I get this error:

Error in aux_gen_pred(data = ., time_window = time_window, ...) :
NA generated in specified predictors. Consider using rm.na= TRUE in aggregation function or specifying a larger/different time window

What is meant is na.rm=T, I think. I did specify this though. The thing is, it works for all the other variables. Just not for 'imports'. I have no idea why.

For a reproducible example:

The data set cannot be attached due to the format. It can be donwloaded here though
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/I42PPJ

load("afripanel_wdk_final.RData")
subsah_out <-

afripanel %>%

initial the synthetic control object

synthetic_control(outcome = lngdpmad, # outcome
unit = WBCode, # unit index in the panel data
time = year, # time index in the panel data
i_unit = "MLI", # unit where the intervention occurred
i_time = 1991, # time period when the intervention occurred
generate_placebos=T # generate placebo synthetic controls (for inference)
) %>%

Generate the aggregate predictors used to fit the weights

average log income, retail price of cigarettes, and proportion of the

population between 15 and 24 years of age from 1980 - 1988

generate_predictor(time_window = 1975:1990,
eximd = mean(imports, na.rm=T)

) %>%

Generate the fitted weights for the synthetic control

generate_weights(optimization_window = 1975:1991, # time to use in the optimization task
margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
) %>%

Generate the synthetic control

generate_control()

how to use year and month as time index

Hello Eric,

I am struggling how to use year and month as time index in the synthetic_control(), since (as it seems to me) it just takes numeric values.
Thank you

tidysynth 0.1.0: function spelling error grab_signficance

Probably a typo?

Most of `grab_` and `plot_` functions don't work if `generate_placebos = FALSE`

Hello, I think I found a bug that is present on the Github version but not on CRAN version. When I use generate_placebos = FALSE in synthetic_control(), then most of the grab_ and plot_ functions don't work:

library(tidysynth)

smoking_out <- smoking %>%
  synthetic_control(
    outcome = cigsale, 
    unit = state, 
    time = year,
    i_unit = "California", 
    i_time = 1988, 
    generate_placebos = FALSE 
  ) %>%
  
  generate_predictor(
    time_window = 1980:1988,
    ln_income = mean(lnincome, na.rm = T)
  ) %>%
  generate_weights(optimization_window = 1970:1988) %>%
  generate_control()

smoking_out %>% grab_outcome()
#> # A tibble: 0 × 1
#> # … with 1 variable: .outcome <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_predictors()
#> # A tibble: 0 × 1
#> # … with 1 variable: .predictors <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_unit_weights()
#> # A tibble: 0 × 1
#> # … with 1 variable: .unit_weights <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_predictor_weights()
#> # A tibble: 0 × 1
#> # … with 1 variable: .predictor_weights <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_loss()
#> # A tibble: 1 × 4
#>   .id     .placebo variable_mspe control_unit_mspe
#>   <chr>      <dbl>         <dbl>             <dbl>
#> 1 Alabama        1          149.         0.0000257

smoking_out %>% grab_significance()
#> # A tibble: 1 × 8
#>   unit_name type  pre_mspe post_mspe mspe_ratio  rank fishers_exact_pv…¹ z_score
#>   <chr>     <chr>    <dbl>     <dbl>      <dbl> <int>              <dbl>   <dbl>
#> 1 Alabama   Donor     149.      11.1     0.0744     1                  1      NA
#> # … with abbreviated variable name ¹fishers_exact_pvalue

smoking_out %>% grab_balance_table()
#> Error in `chr_as_locations()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `variable` doesn't exist.

smoking_out %>% grab_synthetic_control()
#> # A tibble: 0 × 1
#> # … with 1 variable: .synthetic_control <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% plot_trends()
#> Error in `dplyr::filter()`:
#> ! Problem while computing `..1 = time_unit %in% time_window`.
#> Caused by error in `time_unit %in% time_window`:
#> ! object 'time_unit' not found

smoking_out %>% plot_differences()
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `diff = real_y - synth_y`.
#> Caused by error in `mask$eval_all_mutate()`:
#> ! object 'real_y' not found

smoking_out %>% plot_weights()
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `variable` doesn't exist.

^{Created on 2022-08-31 by the reprex package (v2.0.1)}

Everything works correctly if generate_placebos = TRUE but it makes the function slower and placebos shouldn't be needed for these functions.

Besides this bug, even the output of generate_control() seems weird when generate_placebos = FALSE:

library(tidysynth)

smoking_out <- smoking %>%
  synthetic_control(
    outcome = cigsale, 
    unit = state, 
    time = year,
    i_unit = "California", 
    i_time = 1988, 
    generate_placebos = FALSE 
  ) %>%
  
  generate_predictor(
    time_window = 1980:1988,
    ln_income = mean(lnincome, na.rm = T)
  ) %>%
  generate_weights(optimization_window = 1970:1988) %>%
  generate_control()

smoking_out
#> # A tibble: 2 × 11
#>   .id     .placebo .type   .outcome .predi…¹ .synth…² .unit_…³ .predi…⁴ .origi…⁵
#>   <chr>      <dbl> <chr>   <list>   <list>   <list>   <list>   <list>   <list>  
#> 1 Alabama        1 treated <tibble> <tibble> <tibble> <tibble> <tibble> <tibble>
#> 2 Alabama        1 contro… <tibble> <tibble> <tibble> <tibble> <tibble> <tibble>
#> # … with 2 more variables: .meta <list>, .loss <list>, and abbreviated variable
#> #   names ¹.predictors, ².synthetic_control, ³.unit_weights,
#> #   ⁴.predictor_weights, ⁵.original_data
#> # ℹ Use `colnames()` to see all variable names

^{Created on 2022-08-31 by the reprex package (v2.0.1)}
Can you confirm that it works as expected?

Incorrect "plot_trends()" coming from "real_y": observations correspond to reversed years in "time_unit"

Hello, thank you for your work on this package, it provides great aid. The issue raised here is serious, and thus ought to be corrected urgently. The issue makes the comparison between synthetic and observed completely wrong for some operations including the very important plot_trends(). See reproduction at the bottom for quick understanding.

When

The issue occurs whenever the time variable in the supplied data is in descending order for the corresponding unit and outcome (see reproduction code at the end).

Where

The issue is created in generate_control(), specifically aux_gen_control():

As a result of tidyr::spread below, the data will be in ascending order of the time_unit column.

    # coutcome time series for the control units
    outcome_controls <-
      data %>%
      dplyr::filter(.placebo == is_placebo,.type=="controls") %>%
      dplyr::select(.original_data)  %>%
      tibble::as_tibble() %>%
      tidyr::unnest(cols = c(.original_data)) %>%
      dplyr::select(.data[[unit_index]],.data[[time_index]],.data[[outcome_name]]) %>%
      tidyr::spread(.data[[unit_index]],.data[[outcome_name]])

However , outcome treatment will be in descending order of time if the original data was in descending order:

outcome_treatment <-
      data %>%
      dplyr::filter(.placebo == is_placebo,.type=="treated") %>%
      dplyr::select(.original_data)  %>%
      tibble::as_tibble() %>%
      tidyr::unnest(cols = c(.original_data)) %>%
      dplyr::select(y=.data[[outcome_name]])

This makes synthetic_output force the real_y into the reversed time units from outcome_controls's ascending order.

synthetic_output <-
      (Y_U  %*%  W) %>%
      tibble::as_tibble() %>%
      dplyr::transmute(time_unit = outcome_controls[[time_index]],
                       real_y = outcome_treatment$y,
                       synth_y = weight)

Effect

grab_synthetic_control() and plot_trends are incorrect, and likely any object that uses .synthetic_control.

Reproduction

Using the package example and data

First sort units in descending order of the time unit:

smoking.rev <- smoking %>% group_by(state) %>% arrange(desc(year)) %>% ungroup()

Then using new data below, the the observed outcome correspond to reversed years evidenced by plot_trends()
grab_synthetic_control()

smoking_out <-
  
  smoking.rev %>%
  
  # initial the synthetic control object
  synthetic_control(outcome = cigsale, # outcome
                    unit = state, # unit index in the panel data
                    time = year, # time index in the panel data
                    i_unit = "California", # unit where the intervention occurred
                    i_time = 1988, # time period when the intervention occurred
                    generate_placebos=T # generate placebo synthetic controls (for inference)
                    ) %>%
  
  # Generate the aggregate predictors used to fit the weights
  
  # average log income, retail price of cigarettes, and proportion of the
  # population between 15 and 24 years of age from 1980 - 1988
  generate_predictor(time_window = 1980:1988,
                     ln_income = mean(lnincome, na.rm = T),
                     ret_price = mean(retprice, na.rm = T),
                     youth = mean(age15to24, na.rm = T)) %>%
  
  # average beer consumption in the donor pool from 1984 - 1988
  generate_predictor(time_window = 1984:1988,
                     beer_sales = mean(beer, na.rm = T)) %>%
  
  # Lagged cigarette sales 
  generate_predictor(time_window = 1975,
                     cigsale_1975 = cigsale) %>%
  generate_predictor(time_window = 1980,
                     cigsale_1980 = cigsale) %>%
  generate_predictor(time_window = 1988,
                     cigsale_1988 = cigsale) %>%
  
  
  # Generate the fitted weights for the synthetic control
  generate_weights(optimization_window = 1970:1988, # time to use in the optimization task
                   margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
  ) %>%
  
  # Generate the synthetic control
  generate_control()

smoking_out %>% plot_trends()
smoking_out %>% grab_synthetic_control()

Issues in R with Tidy Synth

Hi ,
I am trying to use Tidy synth to do some work and i keep getting errors and i am unable to get the final results. I am attaching my code . Any help would be much appreciated

Restriction on weights

I'd like to restrict the # of control units acting as components of the synthetic control. This could be done through regularization, either in spirit or to T (e.g., lasso). Does this package allow any regularization on weights?

Forming synthetic control without predictors

Hello and thank you for your time on making this amazing library. Just a quick question: Is it possible to generate the weights and get the synthetic control without using predictor variables? I've seen that this is not an uncommon practice, to only employ the outcome variable in the optimization problem, with the idea that this takes into account all unobserved predictors that cannot be included in the model. Thanks again!

`generate_placebos = TRUE` does not match manual permutation

Hello,

When we manually permute the tidysynth command chain over units we find that we get different results - particularly for the pre_mspe, post_mspe and mspe_ratio - than when we use generate_placebos = TRUE. It appears as though the designated treated unit is processed differently than the donor units. We see that the output for California matches exactly across both versions.

The following code uses the example from your vignette.

This code can take a while to run, i've also attached compare.csv with the relevant output so you can inspect it more easily.

library(tidysynth)
library(tidyverse)

set.seed(4810)

# Example implementation

smoking_out <-
  smoking %>%
  synthetic_control(outcome = cigsale,
                    unit = state,
                    time = year,
                    i_unit = "California",
                    i_time = 1988,
                    generate_placebos = TRUE) %>%
  generate_predictor(time_window=1980:1988,
                     lnincome = mean(lnincome, na.rm = TRUE),
                     retprice = mean(retprice, na.rm = TRUE),
                     age15to24 = mean(age15to24, na.rm = TRUE)) %>%
  generate_predictor(time_window=1984:1988,
                     beer = mean(beer, na.rm = TRUE)) %>%
  generate_predictor(time_window=1975,
                     cigsale_1975 = cigsale) %>%
  generate_predictor(time_window=1980,
                     cigsale_1980 = cigsale) %>%
  generate_predictor(time_window=1988,
                     cigsale_1988 = cigsale) %>%
  generate_weights(optimization_window =1970:1988,
                   Margin.ipop=.02,Sigf.ipop=7,Bound.ipop=6) %>%
  generate_control()

smoking_mspe <- smoking_out %>% 
  grab_signficance() 

# Manually run the permutation test ---------------------------------------

smoking_wrap <- function(i_unit, generate_placebos = F) {
  smoking %>%
    synthetic_control(outcome = cigsale,
                      unit = state,
                      time = year,
                      i_unit = i_unit, # iterate the treated unit
                      i_time = 1988,
                      generate_placebos= generate_placebos) %>%
    generate_predictor(time_window=1980:1988,
                       lnincome = mean(lnincome, na.rm = TRUE),
                       retprice = mean(retprice, na.rm = TRUE),
                       age15to24 = mean(age15to24, na.rm = TRUE)) %>%
    generate_predictor(time_window=1984:1988,
                       beer = mean(beer, na.rm = TRUE)) %>%
    generate_predictor(time_window=1975,
                       cigsale_1975 = cigsale) %>%
    generate_predictor(time_window=1980,
                       cigsale_1980 = cigsale) %>%
    generate_predictor(time_window=1988,
                       cigsale_1988 = cigsale) %>%
    generate_weights(optimization_window =1970:1988,
                     Margin.ipop=.02,Sigf.ipop=7,Bound.ipop=6) %>%
    generate_control()
}
 
units <- unique(smoking$state)

manual_permute <- units %>% 
  map(., smoking_wrap,
      generate_placebos = FALSE) # for faster processing - we are highlighting that 
                                 # the when processed as the treated unit the mspe 
                                 # values are different than when it is permutted

manual_mspe <- manual_permute %>% 
  map_dfr(grab_signficance)


# -------------------------------------------------------------------------

compare <- manual_mspe %>% 
  left_join(smoking_mspe, by = "unit_name", 
            suffix = c(".manual", ".gen_pla")) 

# Big differences in pre_mspe
compare %>% mutate(diff = pre_mspe.manual - pre_mspe.gen_pla) %>% 
  pull(diff) %>% summary()

# California matches
compare %>% 
  filter(unit_name == "California") %>% 
  select(pre_mspe.gen_pla, pre_mspe.manual)
  
write_csv(compare, "compare.csv")

cc @davidnathanlang

grab_significance() - fewer controls question

Dear @edunford,
I'm sorry if this question is trivial, but I have struggled to find an answer elsewhere. In your R documentation, when talking about grab_significance() you mention:
One needs at least 20 control case to use the conventional .05 level. With fewer cases, significance levels need to be adjusted to accommodate the low total rank. This is a bug of rank based significance metrics.

Could you please point out to a resource that explains how this adjustment is done (to e.g., 9 instead of 20 units)?

Thank you in advance!

Spelling error in grab_significance

The function for pulling significance is currently named grab_signficance() where it should be grab_significance()

This is true in both the README and the package code

Multiple treatment units

Hello,

My apologies if this is not the right place to ask this, but I was wondering if we are able to use this package for multiple treatment units yet? I know it's something that was in the works but wanted to check to see if it is implementable yet.

Thank you!

How to use `tidysynth` functions in a custom function?

Hello, thank you for this package!

I have a question about embedding tidysynth functions in a custom function. For example, suppose that I define custom_scm() that runs the tidysynth workflow but allows the user to provide a specific outcome:

custom_scm <- function(outcome) {
  
  smoking %>%
    synthetic_control(outcome = outcome, 
                      unit = state, 
                      time = year,
                      i_unit = "California", 
                      i_time = 1988,
                      generate_placebos=T 
    ) %>% 
    generate_predictor(time_window = 1980:1988,
                       ln_income = mean(lnincome, na.rm = T)) %>% 
    generate_weights(optimization_window = 1970:1988) %>% 
    generate_control()
} 

custom_scm("cigsale")

#> Error in `dplyr::select()`:
#> ! Column `outcome` not found in `.data`.
#> Run `rlang::last_error()` to see where the error occurred.

I tried with dplyr's {{}} but it also doesn't work:

custom_scm <- function(outcome) {
  
  smoking %>%
    synthetic_control(outcome = {{outcome}}, 
                      unit = state, 
                      time = year,
                      i_unit = "California", 
                      i_time = 1988,
                      generate_placebos=T 
    ) %>% 
    generate_predictor(time_window = 1980:1988,
                       ln_income = mean(lnincome, na.rm = T)) %>% 
    generate_weights(optimization_window = 1970:1988) %>% 
    generate_control()
} 

custom_scm("cigsale")

#> Error in `dplyr::select()`:
#> ! Column `"cigsale"` not found in `.data`.
#> Run `rlang::last_error()` to see where the error occurred.

Same thing if I use !!outcome. Do you know how to do something like that?

Edit: the problem comes from synthetic_control(), and more particularly from the meta information where the name of the outcome is "outcome" instead of "cigsale":

x <- smoking %>%
    synthetic_control(outcome = outcome, 
                      unit = state, 
                      time = year,
                      i_unit = "California", 
                      i_time = 1988,
                      generate_placebos=T 
    )

x$.meta[[1]]

#> # A tibble: 1 x 5
#>   unit_index time_index treatment_unit treatment_time outcome
#>   <chr>      <chr>      <chr>                   <dbl> <chr>  
#> 1 state      year       California               1988 outcome

plot_trend() keeps crashing

plot_trend() and other plot_ functions keep crashing: "R encountered a fatal error"...

Other functions seem to be fine. I updated all the libraries, but it is still crashing.

Here is sessionInfo()

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5 tidyverse_1.3.1 dplyr_1.0.7
[10] tidysynth_0.1.0

loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.25 kernlab_0.9-29 haven_2.4.3 colorspace_2.0-2 vctrs_0.3.8 generics_0.1.0
[8] htmltools_0.5.2 yaml_2.2.1 utf8_1.2.2 rlang_0.4.11 pillar_1.6.2 glue_1.4.2 withr_2.4.2
[15] DBI_1.1.1 dbplyr_2.1.1 optimx_2021-6.12 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0
[22] gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1 evaluate_0.14 knitr_1.34 tzdb_0.1.2 fastmap_1.1.0
[29] fansi_0.5.0 broom_0.7.9 Rcpp_1.0.7 scales_1.1.1 backports_1.2.1 jsonlite_1.7.2 fs_1.5.0
[36] hms_1.1.0 digest_0.6.27 stringi_1.7.4 numDeriv_2016.8-1.1 grid_4.1.1 cli_3.0.1 tools_4.1.1
[43] magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10
[50] rstudioapi_0.13 assertthat_0.2.1 rmarkdown_2.10 httr_1.4.2 R6_2.5.1 compiler_4.1.1

Improve speed?

Hello, this is a super useful package, I find it much more convenient to use than {Synth}. However, it is also slower. Below is a small benchmark comparing the performances of Synth and tidysynth on the basque data from Abadie & Gardeazabal (2003). I tried to adapt the code in Synth documentation to work with tidysynth (there are some steps from Synth::dataprep() that I didn't adapt but they shouldn't have a large effect, and even if they did it would reduce the gap between Synth and tidysynth):

library(bench)
library(Synth)
#> ##
#> ## Synth Package: Implements Synthetic Control Methods.
#> ## See https://web.stanford.edu/~jhain/synthpage.html for additional information.
library(tidysynth)

data(basque)

##############################
## Synth ##
##############################

synth_ <- function() {
  
  dataprep.out <-
    dataprep(
      foo = basque
      ,predictors= c("school.illit",
                     "school.prim",
                     "school.med",
                     "school.high",
                     "school.post.high"
                     ,"invest"
      )
      ,predictors.op = c("mean")
      ,dependent     = c("gdpcap")
      ,unit.variable = c("regionno")
      ,time.variable = c("year")
      ,special.predictors = list(
        list("gdpcap",1960:1969,c("mean")),                            
        list("sec.agriculture",seq(1961,1969,2),c("mean")),
        list("sec.energy",seq(1961,1969,2),c("mean")),
        list("sec.industry",seq(1961,1969,2),c("mean")),
        list("sec.construction",seq(1961,1969,2),c("mean")),
        list("sec.services.venta",seq(1961,1969,2),c("mean")),
        list("sec.services.nonventa",seq(1961,1969,2),c("mean")),
        list("popdens",1969,c("mean")))
      ,treatment.identifier  = 17
      ,controls.identifier   = c(2:16,18)
      ,time.predictors.prior = c(1964:1969)
      ,time.optimize.ssr     = c(1960:1969)
      ,unit.names.variable   = c("regionname")
      ,time.plot            = c(1955:1997) 
    )
  
  
  dataprep.out$X1["school.high",] <- 
    dataprep.out$X1["school.high",] + 
    dataprep.out$X1["school.post.high",]
  dataprep.out$X1                 <- 
    as.matrix(dataprep.out$X1[
      -which(rownames(dataprep.out$X1)=="school.post.high"),])
  dataprep.out$X0["school.high",] <- 
    dataprep.out$X0["school.high",] + 
    dataprep.out$X0["school.post.high",]
  dataprep.out$X0                 <- 
    dataprep.out$X0[
      -which(rownames(dataprep.out$X0)=="school.post.high"),]
  
  # 2. make total and compute shares for the schooling catgeories
  lowest  <- which(rownames(dataprep.out$X0)=="school.illit")
  highest <- which(rownames(dataprep.out$X0)=="school.high")
  
  dataprep.out$X1[lowest:highest,] <- 
    (100 * dataprep.out$X1[lowest:highest,]) /
    sum(dataprep.out$X1[lowest:highest,])
  dataprep.out$X0[lowest:highest,] <-  
    100 * scale(dataprep.out$X0[lowest:highest,],
                center=FALSE,
                scale=colSums(dataprep.out$X0[lowest:highest,])
    )
  
  # run synth
  synth(data.prep.obj = dataprep.out)
  
}


##############################
## Tidysynth ##
##############################

tidysynth_ <- function() {
  basque %>% 
    synthetic_control(
      outcome = gdpcap,
      unit = regionno,
      time = year,
      i_unit = 17,
      i_time = 1970
    ) %>% 
    generate_predictor(
      time_window = 1964:1969,
      across(
        all_of(c("school.illit", "school.prim", "school.med", "school.high",
                 "school.post.high", "invest")),
        ~ {
          mean(.x, na.rm = TRUE)
        }
      )
    ) %>% 
    generate_predictor(
      time_window = seq(1961, 1969, by = 2),
      across(
        all_of(c("gdpcap", "sec.agriculture", "sec.energy", "sec.industry",
                 "sec.construction", "sec.services.venta", "sec.services.nonventa",
                 "popdens")),
        ~ {
          mean(.x, na.rm = TRUE)
        }
      )
    ) %>% 
    generate_weights(
      optimization_window = 1960:1969
    ) %>%
    generate_control()
}

bench::mark(
  synth = synth_(),
  tidysynth = tidysynth_(),
  iterations = 10,
  check = FALSE
)
#> [...]
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 synth         3.02s    3.43s    0.285      135MB    0.485
#> 2 tidysynth    14.79s   15.87s    0.0630     598MB    0.687

^{Created on 2022-06-21 by the reprex package (v2.0.1)}

tidysynth is about 4-5x slower than Synth and uses more than 4x more memory. I don't know if it's possible to improve the speed while keeping the same output, but I think it would be useful to make tidysynth faster.

Anyway, thanks again for this package

One-tailed permutation tests

Thanks for this wonderful tool! I have a quick question: I'm running a synthetic control in which the plausible treatment effects are almost certainly negative. With that in mind, I'd like to modify the permutation test to be one-tailed rather than two-tailed. Do you know of a good way to do this, either via tidysynth or another SCM package? I can't seem to find anything, but it seems like a need that would come up frequently enough to be worth making available, if not already.

grab_significance function not found

Hello-- thanks for the great package. I'm getting the following error message:

Error in grab_significance(.) : could not find function "grab_significance"

Thanks for any help.

edunford / tidysynth Goto Github PK

tidysynth's People

Contributors

Stargazers

Watchers

Forkers

tidysynth's Issues

initial the synthetic control object

Generate the aggregate predictors used to fit the weights

average log income, retail price of cigarettes, and proportion of the

population between 15 and 24 years of age from 1980 - 1988

Generate the fitted weights for the synthetic control

Generate the synthetic control

initial the synthetic control object

Generate the aggregate predictors used to fit the weights

average log income, retail price of cigarettes, and proportion of the

population between 15 and 24 years of age from 1980 - 1988

Generate the fitted weights for the synthetic control

Generate the synthetic control

When

Where

Effect

Reproduction

Recommend Projects

Recommend Topics

Recommend Org