ngreifer / cobalt

Covariate Balance Tables and Plots - An R package for assessing covariate balance

Home Page: https://ngreifer.github.io/cobalt/

R 94.24% TeX 5.76%

causal-inference propensity-scores r

cobalt's Issues

bal.tab after weightitMSM shows Max.Corr.Adj values > 1

After creating weights with weightitMSM, bal.tab yields values of Max.Corr.Adj that are > 1, which does not make any sense.

What could be the cause? (just reporting a subset of covariates in the table)

Many thanks!

psDRYEXTRA<-weightitMSM(formula.list =FormulaListDRYEXTRA,
                        data=FBS,
                        method = "ps",verbose = T)

bal.tab(psDRYEXTRA, r.threshold = .05, disp.ks = TRUE, which.time = .none)

Balance summary across all time points
                                Times    Type Max.Corr.Adj         R.Threshold Max.KS.Adj
pop                        1, 2, 3, 4 Contin.       0.0500     Balanced, <0.05     0.2977
city_tt                    1, 2, 3, 4 Contin.       1.5959 Not Balanced, >0.05     0.5807
capdist                    1, 2, 3, 4 Contin.       1.1841 Not Balanced, >0.05     0.6695
distnearestcountry         1, 2, 3, 4 Contin.       1.7098 Not Balanced, >0.05     0.6100
distownborders             1, 2, 3, 4 Contin.       3.0464 Not Balanced, >0.05     0.2960

Balance tally for treatment correlations
                    count
Balanced, <0.05         6
Not Balanced, >0.05    36

Variable with the greatest treatment correlation
       Variable Max.Corr.Adj         R.Threshold
 distownborders       3.0464 Not Balanced, >0.05

Effective sample sizes
 - Time 1
             Total
Unadjusted 5000.  
Adjusted      6.57
 - Time 2
             Total
Unadjusted 5000.  
Adjusted      6.57
 - Time 3
             Total
Unadjusted 5000.  
Adjusted      6.57
 - Time 4
             Total
Unadjusted 5000.  
Adjusted      6.57
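
A note on reading the output above: an adjusted effective sample size of 6.57 out of 5000 suggests that a handful of units carry almost all of the weight, which is worth checking before interpreting any balance statistic. The sketch below uses the common Kish approximation, (sum of w)^2 / sum of w^2. The weight vectors are made up for illustration and are not the reporter's data.

```r
# Kish's approximate effective sample size for a vector of weights.
ess <- function(w) sum(w)^2 / sum(w^2)

# Hypothetical weights: 5000 units, equal weights vs. a few dominating units.
w_equal   <- rep(1, 5000)
w_extreme <- c(rep(1e-4, 4994), rep(100, 6))

ess(w_equal)    # equals the number of units
ess(w_extreme)  # just above 6: six units dominate
```

Inspecting `summary()` of the estimated weights at each time point is one way to see whether this is what happened.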

Make sure color defaults are consistent with docs in `love.plot`

NAs in covariates

Sorry, everything is Ok for version 4.0.0 with R 3.6.2, the errors appeared for v. 3.9.0 with R 3.5.2.

I have installed cobalt 3.9.0 (for R 3.5.2).
NAs are mistreated in the covariates by bal.tab.formula().
When I have some NAs in a covariate, I get the error message from bal.tab.formula():

Error in `[.data.frame`(C, !vapply(C, all_the_same, logical(1L))) : 
  undefined columns selected

Besides, when a covariate contains only one value, I also get the error message:

Error: All variables in formula must be variables in data or objects in the global environment.

Data not of class 'mids' error when running cobalt::bal.tab with mimids output

I tried to run the following code using the test dataset within the cobalt package:

library("mice"); library("MatchThem")
data("lalonde_mis", package = "cobalt")

#Generate imputed data sets
m <- 10 #number of imputed data sets
imp.out <- mice(lalonde_mis, m = m, print = FALSE) 

#Matching for balance on covariates
mt.out <- matchthem(treat ~ age + educ + married +
                       race + re74 + re75, 
                     datasets = imp.out,
                     approach = "within", 
                     method = "nearest",
                     link = "logit",
                     estimand = "ATT")

bal.tab(mt.out)

However, I get the following error:

Error in imp.complete(mimids$others$source) : 'data' not of class 'mids'

Is there something I'm missing? I tried tracing the error but couldn't exactly find where the check happens for imp.complete.

Access mean SMD value on longitudinal treatments

Hello,

As far as I know, there is no easy way to access the mean covariate balance across times, only the max is available. I'm guessing that could be an easy thing to add to the function?

Minimal example:

library(cobalt)
data("iptwExWide", package = "twang")
library(WeightIt)
Wmsm <- weightitMSM(list(tx1 ~ use0 + gender + age,
                         tx2 ~ use0 + gender + age + use1 + tx1,
                         tx3 ~ use0 + gender + age + use1 + tx1 + use2 + tx2),
                    data = iptwExWide,
                    method = "ps")
baltab <- bal.tab(Wmsm, un = T)

baltab$Balance.Across.Times
             Times     Type Max.Diff.Un Max.Diff.Adj
prop.score 1, 2, 3 Distance   0.7862446  0.025135867
use0       1, 2, 3  Contin.   0.2667626  0.055835400
gender     1, 2, 3   Binary   0.2944634  0.026293838
age        1, 2, 3  Contin.   0.3798713  0.070253208
use1          2, 3  Contin.   0.1662348  0.031572818
tx1           2, 3   Binary   0.1694514  0.017114709
use2             3  Contin.   0.1086601  0.031463385
tx2              3   Binary   0.2422819  0.008532322

So we get the Max here, but not the mean value. Are you aware of a way to compute those values? It seems it is possible to plot them with love.plot but not to get them directly from bal.tab.

Cheers!
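
One way to compute such a summary by hand is sketched below. It assumes the adjusted differences at each time point have already been extracted into a named list of named numeric vectors; the list `diffs_by_time` and its values are hypothetical and are not output from the example above.

```r
# Hypothetical adjusted mean differences at each time point, named by covariate.
# Covariates measured later appear only at later time points.
diffs_by_time <- list(
  t1 = c(use0 = 0.05, gender = 0.02, age = 0.07),
  t2 = c(use0 = 0.03, gender = 0.01, age = 0.04, use1 = 0.03),
  t3 = c(use0 = 0.01, gender = 0.03, age = 0.02, use1 = 0.01, use2 = 0.03)
)

# Stack into long form, then summarise the absolute differences per covariate.
long <- data.frame(
  variable = unlist(lapply(diffs_by_time, names), use.names = FALSE),
  diff     = unlist(diffs_by_time, use.names = FALSE)
)
mean_abs <- tapply(abs(long$diff), long$variable, mean)
max_abs  <- tapply(abs(long$diff), long$variable, max)
```

Averaging absolute values avoids positive and negative differences cancelling each other out across time points.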

How to know who adjusted samples are?

Hello, Noah Greifer

I am learning how to use 'cobalt' package for balancing samples from tutorials (https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt.html#using-cobalt-with-multi-category-treatments) and I have a question.

With the following command, I assigned an ID to each of the 614 samples in the 'lalonde' example data.

lalonde$ID <- paste0("ID_", c(1:nrow(lalonde)))

According to the tutorial, we can check "Effective sample sizes" by using bal.tab() function.
The result was as follows :

Effective sample sizes
            black hispan  white
Unadjusted 243.    72.   299.
Adjusted   138.38  54.99 259.59

I want to get the ID of each of the Adjusted samples.
Could you tell me how to get the IDs of these roughly 451 (138 + 54 + 259) people?

Yours sincerely,
QANGFQ
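
A point that may help here: an effective sample size summarises how unequal the weights are, so it need not be a whole number and does not correspond to a particular set of individuals. What can be listed is which units received a nonzero weight. A minimal sketch with made-up IDs and weights (the data frame `dat` is hypothetical):

```r
# Hypothetical data: IDs built as in the question, plus made-up weights.
dat <- data.frame(ID = paste0("ID_", 1:10))
dat$w <- c(0, 0.2, 1.5, 0, 3.1, 0.7, 0, 1.0, 0.4, 2.2)

# Units that contribute to the adjusted sample (nonzero weight).
kept_ids <- dat$ID[dat$w > 0]

# The effective sample size is smaller than the number of contributing units
# whenever their weights are unequal.
n_kept <- length(kept_ids)
ess    <- sum(dat$w)^2 / sum(dat$w^2)
```

Here seven units have nonzero weight, while the effective sample size is about 4.5.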

Error: could not find function "str2expression"

Hi! Excited to make use of this tool, but running into some basic issues.

Cobalt version 4.2.0
MatchIt version 3.0.2
R version 3.5.3 (2019-03-11)
OS linux-gnu

The code below produces this error:
Error: could not find function "str2expression"

xdata <- data.frame(treat = 1 * (runif(100) <= 0.5),
                    x1 = rnorm(100, 2, 4),
                    x2 = rnorm(100, 5, 2))
matching <- MatchIt::matchit(treat ~ x1 + x2, data = xdata, distance = "mahalanobis", replace = TRUE)
cobalt::bal.plot(matching)

Let me know if I can clarify anything.

Thanks.
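
For context, `str2expression()` was added to base R in version 3.6.0, so it is absent from R 3.5.3; upgrading R is the straightforward fix. As a stopgap, the sketch below defines a stand-in built on `parse()`, which exists in older R versions. Whether this is enough to make the installed cobalt version work end to end is an assumption that has not been checked here.

```r
# Define a fallback only when the running R lacks str2expression() (R < 3.6.0).
if (!exists("str2expression", mode = "function")) {
  str2expression <- function(text) parse(text = text, keep.source = FALSE)
}

# Usage: turn a string into an expression object and evaluate its first element.
e <- str2expression("1 + 2")
result <- eval(e[[1]])
```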

Saving love.plot() plots as images - wrong picture size

When I try to save love.plot() plots as WMF or EMF pictures with a small picture size, e.g., 3x5 inches, the resulting WMF/EMF picture has the wrong size. The plot itself sits in the upper left corner of the picture.

I tried

win.metafile("loveplot1.wmf" ,height=3,width=5)
love.plot(b)
dev.off()

And I tried to save the plot as EMF file from RStudio. The result was the same.

Select variables to include in the love plot

I am trying to find a way of selecting the variables to display in the love.plot().

Following the example in the vignette on love plots, I was hoping to display only 3 variables: age, educ, and married. Is there any way to do that?

In practice, this is useful when including factor variables in the matching procedures (for example industries), and not wanting to display all these dummy variables in the love plot.

formula bal.tab for binary variables?

I was wondering about the specific formula you use to calculate balance diagnostics for binary variables? I have read and understood your explanation in the function documentation (https://www.rdocumentation.org/packages/cobalt/versions/3.7.0/topics/bal.tab). However, when I check the standardised solution of the function, it does not seem to be consistent with the solution by Austin, 2009 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472075/), which is often used.

So are you calculating a different standardised solution? If so, how?

bal.plot alpha

In the bal.plot() function I can't pass alpha argument through to geom_bar() to change the transparency of the colors. I can't pass position argument either.

Also love.plot() for sub-classification/stratification doesn't seem to work?

b_s1 <- bal.tab(f_lin, data = Algebra_dat, subclass = "quintile", 
         method = "subclassification", disp.subclass = TRUE,
         estimand = "ATT", 
         disp.v.ratio = TRUE, un = TRUE)

love.plot(b_s1)

I get this error. I tried adding facet argument to love.plot() but that doesn't work.

Error in is_not_null(facet) : object 'facet' not found

In addition, when I create a love.plot I get this warning

Standardized mean differences and raw mean differences are present in the same plot. 
Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

What is the stars argument? I can't seem to find it in vignettes.

Incorrect handling of longitudinal treatments with MI data

Either provide errors or do MI balance at each time point with no summary

Density plot using bal.plot() from CBPS object

Thanks for providing the package cobalt. I'm trying to use it with the CBPS package, but I have a problem plotting. I can't get "bal.plot()" to generate density or histogram plots of the object generated by CBPS. Even if I specify the "type" and variable name properly, it returns a scatterplot. All of the examples provided online have tried objects generated by other packages such as MatchIt, but none with CBPS. However, other functions like love.plot() are working fine with CBPS. I was wondering if you've tried density plots for objects generated by CBPS?

Getting environment error after updating R to 4.1.1

I started my issue somewhere else cran/cobalt#2 (comment)

Using the example in CRAN
https://www.rdocumentation.org/packages/cobalt/versions/4.3.0/topics/love.plot

Here is my code and error

library(cobalt)
library(WeightIt); data("lalonde", package = "cobalt")
w.out1 <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde)
love.plot(w.out1, thresholds = c(m = .1), var.order = "unadjusted")

Error in as.environment(pos) :
no item called "get(".S3MethodsTable.", envir = asNamespace(i))" on the search list
In addition: Warning messages:
1: In get(".S3MethodsTable.", envir = asNamespace(i)) :
restarting interrupted promise evaluation
2: In get(".S3MethodsTable.", envir = asNamespace(i)) :
internal error -3 in R_decompress1
3: In ls(get(".S3MethodsTable.", envir = asNamespace(i)), pattern = name) :
‘get(".S3MethodsTable.", envir = asNamespace(i))’ converted to character string

I wonder if there is a quick fix. Thank you!

Using love.plot() with a list of `matchit` objects

I have a list of matchit objects and want to create love plots from them. I'm getting an error using love.plot() with purrr::map().

library(purrr)
library(cobalt)
library(MatchIt)  # provides matchit()

data("lalonde")

matchits <- vector(mode = "list")
ps_form <- formula(treat ~ age + educ + black + hispan + married)
matchits$nn.wo <- matchit(ps_form, lalonde, method = "nearest", replace = FALSE)
matchits$nn.wr <- matchit(ps_form, lalonde, method = "nearest", replace = TRUE)
matchits$opt.r1 <- matchit(ps_form, lalonde, method = "optimal", ratio = 1)

# These work
love.plot(matchits$nn.wr)
love.plot(matchits[[1]])

# These do not work
map(matchits, love.plot)
# Error in .f(x = bal.tab(.x[[i]])) : could not find function ".f"

map(matchits, ~love.plot(.))
# Error in mc[["x"]][[1]] : object of type 'symbol' is not subsettable

map(matchits, function(x) love.plot(x))
#  Error: covs must be a data.frame of covariates.

Add a cran badge to your readme

If you type

    devtools::use_cran_badge()

then you should see some text about how to add a line to the top of your README document that adds a little badge for the CRAN version.

e.g.: https://github.com/bsaul/geex

problem with bal.tab(): "All weights are zero when treat = TRUE"

Hi. I try to get a bal.tab with preprocessed output from weightit.

I receive the error message: "All weights are zero when treat = TRUE".

However, this is not the case, as all weights are above 1 and none are NA or NULL or whatever.

I have traced the problem to some odd behaviour of the apply function in combination with the check_if_zero function: The check whether all is zero yields "FALSE" if called outside the apply function and "TRUE" (incorrectly) if called via the apply function.

I use latest versions of weightit, cobalt (installed today) and R.

This is from the debugger which puts me in check_if_zero_weights().

Thanks for your help!

Martin

Browse[1]> version
               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray      


Browse[1]> error
[1] "All weights are zero when treat = TRUE."
Browse[1]> problems
[1]  TRUE FALSE
Browse[1]> w.t.mat
     Var1  Var2
1 weights  TRUE
2 weights FALSE
Browse[1]> problems <- apply(w.t.mat, 1, function(x) all(check_if_zero(weights.df[treat == 
+     x[2], x[1]])))
Browse[1]> problems
[1]  TRUE FALSE
Browse[1]> check_if_zero(weights.df[treat == TRUE, "weights"])
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [65] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[113] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[129] FALSE FALSE FALSE FALSE FALSE
Browse[1]> all(check_if_zero(weights.df[treat == TRUE, "weights"]))
[1] FALSE
Browse[1]> all(check_if_zero(weights.df[w.t.mat[1,2], w.t.mat[1,1]]))
[1] FALSE
Browse[1]> weights.df[treat == TRUE,]
  [1]  67.307627  79.826155   1.885578 158.026877  45.196684  56.164825  48.752937  30.879905  11.674147
 [10]  26.692058  13.295941  82.674590 196.897668 149.248587  52.762289  39.684289  33.170495  39.308733
 [19]  53.171570  41.240477 160.372561 194.005269  60.413108  43.963563  15.230456  42.829538  27.463982
 [28]  19.865331  50.847341 186.199997   9.753240  68.585247  36.408196  38.549354  38.755218  40.884345
 [37]  29.620660  52.601934 111.332990  56.297451  27.031934 101.012349  34.349574 107.525278  78.727894
 [46]  11.777890  70.914950  55.277485  51.883375  71.255899  32.319254  42.992511  72.273144  28.642228
 [55] 137.954277  34.807268  60.276977  63.115426  68.655834 133.662160  17.536231  94.708934  65.562180
 [64]  60.241740 109.878914  79.942162  28.324305  74.347703  66.622288  26.406760  20.897160  69.370021
 [73]  52.737898  46.644920  96.783245  47.111526  35.341429  77.041636  77.046557  30.057204  91.398045
 [82]  46.837280  94.873180  37.793427 104.106985  21.611831  18.633768 140.601745  21.072106  84.664917
 [91] 171.780325  23.068098  65.262950  45.945273  65.830478  13.585935  14.353937  36.560600  77.410477
[100]  42.240395  11.444596  67.281186  25.100079 117.032776  66.714564 190.680325  27.129495  69.194680
[109]  74.293695  28.874397  32.587939  95.918416  27.744732  94.771610  11.792023  83.279133  31.746677
[118]  36.733866  13.132560  66.008024  40.119701  78.070225  16.603842  35.215006  57.132454  44.612056
[127]  20.949391  81.514315  47.458328  15.125913  70.443032  33.938332  54.767617
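
A possible explanation, offered as a reading of the debugger output rather than a confirmed diagnosis: `apply()` converts a data frame to a matrix before iterating, and a data frame with a character column and a logical column becomes a character matrix. Inside the function, `x[2]` is then a string rather than a logical value; if the comparison with `treat` matches no rows, the subset is empty, and `all()` of an empty logical vector is `TRUE`. The sketch below reproduces those two ingredients on a small stand-in for `w.t.mat`.

```r
# A small stand-in for w.t.mat: one character column, one logical column.
w.t.mat <- data.frame(Var1 = "weights", Var2 = c(TRUE, FALSE),
                      stringsAsFactors = FALSE)

# apply() works on a matrix; with mixed column types that matrix is character,
# so the logical column arrives inside the function as strings.
types <- apply(w.t.mat, 1, function(x) class(x[2]))

# all() of a zero-length logical vector is TRUE.
empty_check <- all(logical(0))

# Iterating over rows by index keeps each column's original type.
types_ok <- vapply(seq_len(nrow(w.t.mat)),
                   function(i) class(w.t.mat$Var2[i]), character(1))
```

Looping over row indices, as in the last statement, is one way to avoid the coercion.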

Incorrect {gridExtra} version number in DESCRIPTION file

Problem

Under imports in the package DESCRIPTION file, gridExtra (>= 2.3.0) is listed

gridExtra 2.3.0 does not actually exist, 2.3 does.

Consequences

install.packages() sees these as the same, which is why this has gone unnoticed. However, some package managers (yum, for example) do not, which causes issues on installation.

Solution

replace gridExtra (>= 2.3.0) with gridExtra (>= 2.3) in the DESCRIPTION file.

Issue installing on OSX 10.14.5

Hi: I need help trying to install cobalt on my mac:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace ‘grid’ 3.5.0 is already loaded, but >= 3.6.1 is required

I have just reinstalled Rstudio. What is needed to resolve this namespace 'grid' issue?

All help is appreciated!

Unusual results with CBPS and sample weights

I have constructed CBPS object using CBPS(.... , sample.weights=mydata$myweight). Afterward, cobalt's bal.tab() function gives odd results. For example, the means from bal.tab() don't match the means from balance() function from the CBPS package. I don't think bal.tab() is incorporating the sample.weights appropriately. Any suggestions?

Here's some example code to reproduce the problem:

library(CBPS)
data(LaLonde)
LaLonde$wgt <- rnorm(rep(1,nrow(LaLonde)), mean = (rep(1,nrow(LaLonde))+LaLonde$treat*.5), sd = .05)
fit <- CBPS(treat ~ age + educ + re75 + re74 + I(re75==0) + I(re74==0), data = LaLonde, ATT = TRUE, sample.weights = LaLonde$wgt)
balance(fit)
library(cobalt)
bal.tab(fit, disp.means = TRUE)

I have CBPS version 0.17 and cobalt version 3.2.0.

Interactions with binary variables

I'd suggest that, for a binary variable, interactions be calculated for both values of the variable, not just for 1 (for 0-1 variables). Say we have a continuous variable a and a binary variable b. Currently only the distribution of a*b is assessed. It would be more correct to assess both distributions: a*b and a*(1-b).
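
To make the suggestion concrete, here is a minimal sketch with made-up data showing the two interaction columns that would be assessed. The variable names follow the description above; the values are hypothetical.

```r
# Hypothetical data: continuous a, binary (0/1) b.
set.seed(42)
a <- rnorm(8)
b <- rep(c(0, 1), 4)

# Interaction with b = 1 and its complement with b = 0.
a_b1 <- a * b
a_b0 <- a * (1 - b)

# The two columns partition a: each unit contributes to exactly one of them.
stopifnot(all.equal(a_b1 + a_b0, a))
```

Balance on `a_b1` alone says nothing about how `a` is distributed among units with `b = 0`, which is the gap the suggestion points at.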

bal.tab slow on large dataset (MatchIt)

Hi there. I'm running bal.tab on the results of a MatchIt run on a dataset of about 120,000 rows. The MatchIt process took about 2 hours to run, producing a matchit object about 229 Mb in size. I tried running bal.tab as follows:

baltab <- bal.tab(m.out1, m.threshold=0.1, binary="std")

and it's taking a long time (still running, currently at over an hour). I was able to run a practice example from the documentation on the lalonde dataset, and that worked fine. I also was able to run matchit on a sample of 2000 rows and ran bal.tab on that (which took about 5 seconds). So I'm confused about why this is taking so long.

I am using R 3.5.1, MatchIt version 3.0.2, and cobalt version 3.6.1.

Thank you!

Edit: I killed the R session after > 4 hours of running, it never seemed to finish.

Two rows of distance in bal.tab()$Balance when matching on a single binary variable

Hi, the balance table has two distance rows

library(MatchIt)
library(cobalt)
data("lalonde")
m_out <- matchit(treat ~ married, data = lalonde, method = "nearest", distance = "glm")
m_summary <- bal.tab(m_out, un = TRUE)
m_summary$Balance

And output

                               Type    Diff.Un Diff.Adj
distance_0.13725490196086  Distance -0.8240730        0
distance_0.417827298050138 Distance  0.8240730        0
married                      Binary -0.3236313        0

R version 4.1.0 (2021-05-18), cobalt_4.3.1, MatchIt_4.2.0

Fix errors in README

love.plot() adds "_1" to the names of 0-1 variables

When we try to draw a Love plot with no interactions and abs=TRUE, the "_1" are added to the names of 0-1 variables. It has no meaning, especially for abs=TRUE. May be it would be better to remove such "_1" for the variables themselves leaving "_1" for interactions only?

Unadjusted standardized difference

I am checking the unadjusted standardized difference.

I have the following table:
      user h02      N         tp
1: control   0 131071 0.98487421
2: control   1   2013 0.01512579
3:    user   0  13904 0.97929286
4:    user   1    294 0.02070714

I calculated the standardized difference following Austin:
(0.02070714 - 0.01512579)/sqrt((0.01512579*0.98487421 + 0.02070714*0.97929286)/2) = 0.0420858

But bal.tab returns 0.0056

bal.tab(formula = user ~ h02, data = data)

Balance Measures
      Type Diff.Un
h02 Binary  0.0056

Sample sizes
    Control Treated
All  133084   14198
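
Working through the arithmetic may help locate the discrepancy. The sketch below recomputes both quantities from the proportions in the table. The raw difference in proportions rounds to 0.0056, the value shown by bal.tab, which would be consistent with binary variables being reported as raw rather than standardized differences unless standardization is requested (for example with the `binary = "std"` argument used in other reports on this page). That reading is an inference from the numbers, not a statement about the package internals.

```r
# Proportions of h02 = 1 from the table above.
p_treated <- 0.02070714   # user
p_control <- 0.01512579   # control

# Raw difference in proportions.
raw_diff <- p_treated - p_control

# Standardized difference as computed in the post: the raw difference divided by
# the square root of the average of the two group variances p * (1 - p).
std_diff <- raw_diff /
  sqrt((p_treated * (1 - p_treated) + p_control * (1 - p_control)) / 2)

round(raw_diff, 4)  # 0.0056
round(std_diff, 4)  # 0.0421
```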

bal.plot() doesn't plot binary categorical variables correctly

When plotting binary categorical data (eg. sex with the values "male" & "female"), bal.plot() gives a message The dropped category for [variable] will be set to NA. leading all bars to be plotted as 100%. This doesn't occur if the variable is recoded as 0/1 or if there are 3 or more possible values. I'm using MatchIt to match, I don't know if this behaviour occurs with other packages.

I'm using cobalt v4.2.4 & MatchIt v3.0.2.

binary categorical variable (male/female)

df <- tibble(sex = sample(c("male", "female"), 100, replace = T),
             group = sample(c(0, 1), prob = c(0.7, 0.3), 100, replace = T))

m.out <- matchit(group ~ sex, data = df)
bal.plot(m.out, "sex")

The dropped category for sex will be set to NA.

binary numeric variable (0/1)

df2 <- df %>%
        mutate(sex = recode(sex, "male" = 0, "female" = 1))

m.out <- matchit(group ~ sex, data = df2)
bal.plot(m.out, "sex")

categorical variable with 3 values (male/female/unknown)

df3 <- tibble(sex = sample(c("male", "female", "unknown"), 100, replace = T),
             group = sample(c(0, 1), prob = c(0.7, 0.3), 100, replace = T))

m.out <- matchit(group ~ sex, data = df3)
bal.plot(m.out, "sex")

add github repo links to your description

When you do this then your github repo will be linked directly from your CRAN page.

I think this function would do it all for you (perhaps you need to follow a step or two, but it should explain)

devtools::use_github_links()

Add log(sd1/sd0) as balance measure

Used in Imbens & Rubin, similar scale to SMD so easy to interpret

Incorrect Processing of Sample Size (and probably others) with factor treatment

Factor treatment makes the sample sizes of the groups switch (i.e., labels are not correct), probably due to using 0 and 1 somewhere or an incorrect binarize() call.

bal.tab with mnps accepts only one stop.method

From the twang example:

library(twang)
data(AOD)
mnps.AOD <- mnps(treat ~ illact + crimjust + subprob + subdep + white,
                 data = AOD, 
                 estimand = "ATE", 
                 verbose = FALSE, 
                 stop.method = c("es.mean", "ks.mean"), 
                 n.trees = 3000)
bal.tab(mnps.AOD)

Error in rep("all", length(errors) - 1) : argument 'times' incorrect

Bug with subset when all FALSE

When subset is all FALSE, the error it provides is not informative.

Bal.tab does not work with tibbles

For some strange reason, bal.tab does not work with tibbles. Minimal reproducible example:

df <- tbl_df(lalonde)

treat <- "treat"
outcome <- "re78"
covs <- setdiff(names(df), c(treat, outcome))
covs_df <- dplyr::select(df, -treat, -re78, -nodegree, -married)
bal.tab(covs_df, treat = df[[treat]], method = "weighting")

This results in:

Note: estimand and s.d.denom not specified; assuming ATT and treated.
Error: No names in var.name are names of factor variables in data.

But this works (note that the only difference is that I'm not using tibbles):

data("lalonde", package = "cobalt")
df <- lalonde

treat <- "treat"
outcome <- "re78"
covs <- setdiff(names(df), c(treat, outcome))
covs_df <- dplyr::select(df, -treat, -re78, -nodegree, -married)
bal.tab(covs_df, treat = df[[treat]], method = "weighting")

This gives the desired result:

Note: estimand and s.d.denom not specified; assuming ATT and treated.
Balance Measures:
               Type Diff.Un
age         Contin. -0.3094
educ        Contin.  0.0550
race_black   Binary  0.6404
race_hispan  Binary -0.0827
race_white   Binary -0.5577
re74        Contin. -0.7211
re75        Contin. -0.2903

Sample sizes:
    Control Treated
All     429     185

cobalt splitfactor

Error in deparse1(substitute(x)) :
could not find function "deparse1"
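
For context, `deparse1()` was added to base R in version 4.0.0, so this error points to an older R installation; upgrading R is the straightforward fix. Where that is not possible, a stand-in along the following lines could be defined. It is a sketch that pastes the pieces returned by `deparse()` into one string, and whether it is sufficient for `splitfactor()` on an old R version has not been checked here.

```r
# Define a fallback only when the running R lacks deparse1() (R < 4.0.0).
if (!exists("deparse1", mode = "function")) {
  deparse1 <- function(expr, collapse = " ", width.cutoff = 500L, ...) {
    paste(deparse(expr, width.cutoff, ...), collapse = collapse)
  }
}

# Usage: always returns a single string, even for long expressions.
s <- deparse1(quote(x + y))
```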

love.plot with aggregation for multiple matched samples from the same data

Hello,
I want to produce a love plot of the mean covariate balance adjustments from 4 matched samples of the same data (the number of observations is too large to match in their entirety).

I have tried this two ways so far. The first was to create a single match object (using the Matching package) that has all four samples included and matched separately as clusters (Matchby).

However when I try to use this object in a bal.tab or love.plot call I get an error:
Error in names(object) <- nm :
'names' attribute [400000] must be the same length as the vector [1]
In addition: Warning messages:
1: Deprecated
2: Deprecated

This is my script:

#Matching call with exact matching on sample no.
SP_2010_GLM <- SP_2010_samples_combined %>% glm(formula = FCL_out ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil, family = binomial())
SP_2010_covs <- subset(SP_2010_samples_combined, select = -c(T_C, Temp, FCL_out, sample_no.))

X1 <- SP_2010_GLM$fitted #the propensity score
Y1 <- SP_2010_samples_combined$FCL_out #the outcome
Tr1 <- SP_2010_samples_combined$T_C #a vector of the treatment

SP_2010_combined_match <- Matchby(Y=Y1, Tr=Tr1, X=X1, by= SP_2010_samples_combined$sample_no., M=1, replace= TRUE, caliper = 0.5, Weight=1, ties = FALSE)
summary(SP_2010_combined_match)

#Call to bal.tab
SP_2010_combined_balance <- bal.tab(SP_2010_combined_match, treat = SP_2010_samples_combined$T_C, cluster = "sample_no.",
distance = X1, covs = SP_2010_covs, un = TRUE, stats= c("mean.diffs", "ks.statistics"))

Alternatively, I have tried creating a vector of the separate match objects after performing matching for each of the samples separately, and then introducing this through the 'weights' argument of love.plot, as you suggested in another issue:

library(cobalt)
library(purrr)

match_objects <- vector(mode = "list")
match_objects$sample1 <- Sample1_match
match_objects$sample2 <- Sample2_match
match_objects$sample3 <- Sample3_match
match_objects$sample4 <- Sample4_match

match_formula <- SP_2010_sample1 %>%formula(FCL_out ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil)

love.plot(match_formula, data = SP_2010_sample1, weights = map(match_objects, get.w))

However, this call runs for a long time without producing a result. Where am I going wrong?

Apologies if my explanation is unclear; I am still relatively new to R.
Many thanks, Ben.

Standard Deviation used in SMDs

Hello,

Not really a bug/issue with cobalt, rather a question about SMDs I'd be grateful if you could help me with.

Following a 1:1 NNM matching, some of the treated subjects are left unmatched. When computing the SMD, cobalt (with the option s.d.denom = "treated") uses the SD in all treated subjects, ie including those unmatched. This is consistent with MatchIt's behaviour.

In a similar fashion, cobalt with the option s.d.denom = "pooled" computes the denominator of the SMDs using the SDs in all treated and untreated subjects (matched and unmatched).

I understand that the denominator of a SMD is –at the end of the day– arbitrary: it's just a value used to standardise the MD (duh!) and we could use –in principle– the SD of any population.

However, I wonder if you have any reference that supports the use of just those SDs as opposed to the SDs in the subjects (treated and untreated) who are successfully matched.

> m <- bal.tab(trt ~ x, 
+         data = s, 
+         method = "weighting",
+         s.d.denom = "pooled",
+         weights=  s$w,
+         continuous = "std")

# 1:1 matching, weights are either 0 or 1
> with(s, table(w))
w
   0    1 
6468 1120 

# 560 untreated subjects matched to 560 treated subjects
> m
Balance Measures
     Type Diff.Adj
x Contin.    0.005

Effective sample sizes
           Control Treated
Unadjusted    6997     591
Adjusted       560     560

> m$Balance
     Type   M.0.Un   SD.0.Un  M.1.Un   SD.1.Un   Diff.Un M.Threshold.Un V.Ratio.Un V.Threshold.Un KS.Un KS.Threshold.Un  M.0.Adj
x Contin. 1.362974 0.5868113 2.14297 0.9667456 0.9753969             NA         NA             NA    NA              NA 2.050798
  SD.0.Adj  M.1.Adj  SD.1.Adj    Diff.Adj M.Threshold V.Ratio.Adj V.Threshold KS.Adj KS.Threshold
x 0.862962 2.054828 0.7274142 0.005040633          NA          NA          NA     NA           NA

> smd_pooled <- setNames((m$Balance["M.1.Adj"] - m$Balance["M.0.Adj"]) / sqrt(.5*m$Balance["SD.1.Un"]^2 + .5*m$Balance["SD.0.Un"]^2), nm = "SMD") 
> smd_pooled
          SMD
x 0.005040633
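
To put a number on the question, the sketch below compares the two candidate denominators using the values printed in m$Balance above. The first result reproduces the reported Diff.Adj up to rounding of the displayed means. The second uses the matched-sample SDs instead and is shown only for comparison; it is not something the package output above reports.

```r
# Values copied from m$Balance above.
m1_adj  <- 2.054828;  m0_adj  <- 2.050798   # means in the matched groups
sd1_un  <- 0.9667456; sd0_un  <- 0.5868113  # SDs in the full (unmatched) groups
sd1_adj <- 0.7274142; sd0_adj <- 0.862962   # SDs in the matched groups

# Pooled SD as used in the calculation above.
pooled <- function(s1, s0) sqrt(0.5 * s1^2 + 0.5 * s0^2)

# Denominator from the full sample.
smd_full_sd <- (m1_adj - m0_adj) / pooled(sd1_un, sd0_un)

# Alternative denominator from the matched sample only.
smd_matched_sd <- (m1_adj - m0_adj) / pooled(sd1_adj, sd0_adj)
```

In this example the two pooled SDs are nearly equal (about 0.800 and 0.798), so the choice barely changes the SMD; with other data the gap can be larger.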

Constant independent variables are not allowed

If I take a constant independent variable, I get the error
Error in relevel.factor(C[[i]], levels(C[[i]])[2]) :
'ref' must be an existing level

Code to reproduce the problem:

n=20
a=sample(2,n,replace=T)-1
b=runif(n)
c=rep(0,n)
l=glm(a~b+c)
matched=sample(2,n,replace=T)-1
b1=bal.tab(a~b+c,weights=matched,method="matching",s.d.denom="pooled")

Previous versions of Cobalt just silently removed constant variables from the analysis. The most convenient way may be that constant variables are removed with a warning, so that the script is not stopped.
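
Until the behavior changes, constant covariates can be filtered out before calling bal.tab. The helper below is a hypothetical sketch of the "remove with a warning" behavior described above; `drop_constant` is not a cobalt function.

```r
# Drop columns that take a single value, warning rather than stopping.
drop_constant <- function(covs) {
  is_const <- vapply(covs,
                     function(x) length(unique(x[!is.na(x)])) <= 1,
                     logical(1))
  if (any(is_const)) {
    warning("Dropping constant variable(s): ",
            paste(names(covs)[is_const], collapse = ", "))
  }
  covs[, !is_const, drop = FALSE]
}

# Usage with a constant column c, as in the report.
covs <- data.frame(b = c(0.1, 0.5, 0.9), c = rep(0, 3))
kept <- suppressWarnings(drop_constant(covs))
names(kept)  # "b"
```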

bal.tab not accepting numeric factor levels with mnps

Hi,
from the twang example, if treat has character levels, it works:

library(twang)
data(AOD)
mnps.AOD <- mnps(treat ~ illact + crimjust + subprob + subdep + white,
                 data = AOD, 
                 estimand = "ATE", 
                 verbose = FALSE, 
                 stop.method = c("es.mean"), 
                 n.trees = 3000)
bal.tab(mnps.AOD)

If, however, you add: levels(AOD$treat) <- 1:3 at the beginning, you get

Error in `[.data.frame`(do.call("cbind", unname(lapply(bal.tab.multi.list,  : 
  undefined columns selected

Error: object 'xxx' not found

love.plot and bal.tab are not dealing correctly with the formula object when there are inline operations applied to a variable. The bug occurs only when the raw variable is not included in the formula object.

Replicable code:

library(MatchIt)
data("lalonde", package = "cobalt")

#Works
m.out1 <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75, data = lalonde)
love.plot(m.out1, var.order = "unadjusted") 

#Does not work
m.out1 <- matchit(treat ~ log(age) + educ  + race + married + nodegree + re74 + re75,data = lalonde)
love.plot(m.out1, var.order = "unadjusted") #KO

#Works again
m.out1 <- matchit(treat ~ log(age) + age + educ  + race + married + nodegree + re74 + re75,data = lalonde)
love.plot(m.out1, var.order = "unadjusted")

The behavior occurs in both versions: cobalt_4.2.2 and cobalt_4.2.1

Environment for replication (also works on 4.x).

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cobalt_4.2.1  MatchIt_3.0.2

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11  knitr_1.29       magrittr_1.5     MASS_7.3-51.6    tidyselect_1.1.0 munsell_0.5.0    colorspace_1.4-1 R6_2.4.1         rlang_0.4.6      dplyr_1.0.0     
[11] tools_3.6.1      grid_3.6.1       gtable_0.3.0     xfun_0.15        htmltools_0.5.0  ellipsis_0.3.1   yaml_2.2.1       digest_0.6.25    tibble_3.0.1     lifecycle_0.2.0 
[21] crayon_1.3.4     purrr_0.3.4      ggplot2_3.3.2    vctrs_0.3.1      glue_1.4.1       evaluate_0.14    rmarkdown_2.3    pillar_1.4.4     compiler_3.6.1   backports_1.1.8 
[31] generics_0.0.2   scales_1.1.1     pkgconfig_2.0.3

Names for interactions

I'd suggest that the names for interactions in love.plot() be built from the variable names supplied in the var.names parameter. Currently, interaction names are built from the variable names used in the code.

Difference sign reversal in love.plot()

It seems that love.plot() with abs=FALSE changes the sign of either the adjusted or the unadjusted differences. When the unadjusted and adjusted differences have opposite signs, they are displayed as if they had the same sign.

bal.tab()$Max.Imbalance.Variances$V.Ratio.Adj contains the value with maximal absolute value

bal.tab()$Max.Imbalance.Variances$V.Ratio.Adj contains the variance ratio with the largest absolute value. As a result, the reported value can lie between 0.5 and 2 even when some variables have variance ratios below 0.5.

The summary output has the same issue. The variable with the largest absolute variance ratio is printed, so it can appear balanced even when other variables are unbalanced with respect to the variance ratio.
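A minimal base R sketch of the problem, using made-up variance ratios (the names x1 to x3 and the numbers are hypothetical). It contrasts picking the largest raw ratio with picking the ratio furthest from 1 on a symmetric scale, max(r, 1/r). The symmetric measure is one possible alternative, not necessarily what cobalt does.

```r
# Hypothetical variance ratios for three covariates (made-up numbers)
v_ratio <- c(x1 = 0.3, x2 = 1.2, x3 = 1.8)

# Largest raw value: picks x3, whose ratio lies inside the (0.5, 2) range
largest_raw <- names(which.max(v_ratio))

# Symmetric measure: treat r and 1/r as equally imbalanced
v_sym <- pmax(v_ratio, 1 / v_ratio)
most_imbalanced <- names(which.max(v_sym))

print(largest_raw)      # "x3"
print(most_imbalanced)  # "x1"
```

Here x1 (ratio 0.3) is the only variable outside the 0.5 to 2 range, yet a maximum over raw values reports x3.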

I would also suggest introducing a simple function like feasible.matching(b), where b=bal.tab(...), that checks whether all variables are balanced. It would be helpful, e.g., for finding feasible calipers for matching.
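A rough sketch of what such a check could look like. The helper name all_balanced is hypothetical and not part of cobalt. It assumes the balance table is a data frame whose threshold columns have "Threshold" in their names and hold strings such as "Balanced, <0.1" or "Not Balanced, >0.1", as in the printed bal.tab() output. The mock table below stands in for a real result; for an actual bal.tab object the table would presumably be b$Balance.

```r
# Hypothetical helper (not part of cobalt): TRUE only if no threshold column
# flags any variable as "Not Balanced".
all_balanced <- function(balance_table) {
  thr_cols <- grep("Threshold", names(balance_table), value = TRUE)
  if (length(thr_cols) == 0) stop("no threshold columns found")
  flags <- unlist(lapply(balance_table[thr_cols], as.character))
  !any(startsWith(flags, "Not Balanced"), na.rm = TRUE)
}

# Mock balance table standing in for the table of a bal.tab() result
mock <- data.frame(
  Diff.Adj = c(0.02, 0.15),
  M.Threshold = c("Balanced, <0.1", "Not Balanced, >0.1"),
  row.names = c("age", "educ")
)

all_balanced(mock)       # FALSE: educ is flagged
all_balanced(mock[1, ])  # TRUE: only age remains
```

A caliper search could then loop over candidate calipers and stop at the first one for which the check returns TRUE.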

WeightIt method=gbm problem

I am attempting to assess the covariate balance following propensity score weighting implemented in the WeightIt package. When I use method = "ps" in WeightIt, everything functions normally. However, when I use method = "gbm" and try to assess the balance with bal.tab I get the following love plot:

as well as the warning message, "Warning message:
Large mean differences detected; you may not be using standardized mean differences for continuous variables."

I tried to make sure standardized differences were being used with set.cobalt.options(binary = "std", continuous = "std"), but this did not resolve the problem. The difference in prop.score does seem to be sensitive to the stop.method used, but in all cases it's still way larger than I would expect. I'm not sure what else to try, and would greatly appreciate any advice. My code is below, but it's all standard stuff, and again, it works fine when method = "ps" so I'm not sure what's going on here. Thanks much.

weight.gbm <- weightit(RCTflag ~ Urbanicity + Region + GradRate + StTchRatio +
                         TotEnroll + FARMS + StudentN + Grade10 + Grade11 + Grade12,
                       data = mydata.complete, method = "gbm",
                       estimand = "ATT", stop.method = "es.mean")
weight.balance <- bal.tab(weight.gbm, un = TRUE)
weight.balance
love.plot(weight.balance, thresholds = .25, title="GBM weighting")

MatchIt summary works, but fails in bal.tab: non-numeric argument

I created a matchit object setting distance="mahalanobis" and exact=~specialty + event_month, where specialty is of character type (only two possible values) and event_month of date type.

Calling summary on the matchit object correctly returns the balance statistics:

The event_month variable was implicitly converted to numeric, while specialty seems to be converted into 0 or 1s for each possible value. For both variables, summary.matchit is able to compute a std. mean difference.

However, calling bal.tab on the matchit object results in the error:

Any advice on how to handle this error?

Logical treatment indicator is not allowed

If I take a logical treatment indicator in the corresponding formula for bal.tab, I get the error:
Error: The argument to treat must be a vector of treatment statuses or the (quoted) name of a variable in data that contains treatment status.

Code to reproduce the problem:

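library(cobalt)

n <- 20
a <- as.logical(sample(2, n, replace = TRUE) - 1)  # logical treatment indicator
b <- runif(n)                                      # single covariate
l <- glm(a ~ b)                                    # fitted model (not used below)
matched <- sample(2, n, replace = TRUE) - 1        # 0/1 matching weights
# Works: treatment coerced to integer
b1 <- bal.tab(as.integer(a) ~ b, weights = matched, method = "matching", s.d.denom = "pooled")
# Fails with the error above: treatment left as logical
b2 <- bal.tab(a ~ b, weights = matched, method = "matching", s.d.denom = "pooled")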

It would be convenient to have bal.tab() accepting logical values as well.

add testing framework?

I see that you have a file do_not_include/tests.R. It doesn't look to me like this uses a unit testing framework. If it doesn't, I think a huge improvement would be to adopt one, such as the testthat package.

An easy way to do this is to type in devtools::use_testthat(), and the directory structure will be added. I'd be happy to help you set up your first couple of tests to get you started. If you're interested, please respond below (but don't close the issue)

Make sure code works with new ggplot2 update

Make sure inputs to aes() are correct.

geom_point now supports strings, so update that in love.plot

Update facet_grid with new syntax

bal.tab.default() does not show the results of adjusted correlation in continuous exposure.

Hello, Noah Greifer

I would like to estimate the effect of continuous exposure on binary outcomes.
In my dataset, the exposure is 'ADLScaled'; the covariates are 'Age', 'sex', 'HT', 'DM', 'Stroke', and 'MI'; the outcome is 'sequela'; and the weights are 'swtTrimmed'.
I have run the following R code, but I cannot get adjusted correlation.

library(cobalt)
library(data.table)

dt <- fread('dt_sample.csv')
dt_covs <- dt[, .(Age, sex, HT, DM, Stroke, MI)] 
baltab <- bal.tab(x = dt_covs, 
                  data = dt,               
                  treat = 'ADLScaled',     
                  method = 'weighting',
                  weigths =  'swtTrimmed', 
                  un = T,
                  thresholds = 0.1
        )

The output is as follows, and only 'Corr.Un' is shown (Corr.Adj is not):

Balance Measures
          Type Corr.Un     R.Threshold.Un
Age    Contin. -0.8869 Not Balanced, >0.1
sex_男  Binary  0.0013     Balanced, <0.1
HT      Binary -0.5920 Not Balanced, >0.1
DM      Binary -0.5876 Not Balanced, >0.1
Stroke  Binary -0.5057 Not Balanced, >0.1
MI      Binary -0.5535 Not Balanced, >0.1

Balance tally for treatment correlations
                   count
Balanced, <0.1         1
Not Balanced, >0.1     5

Variable with the greatest treatment correlation
 Variable Corr.Un     R.Threshold.Un
      Age -0.8869 Not Balanced, >0.1

Sample sizes
    Total
All 15000

I would appreciate it if you could tell me how to calculate adjusted (weighted) correlation values for continuous exposure.
The dataset is as follows: dt_sample.csv

Sincerely yours,
yohei-h
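One thing worth checking first: the call above spells the argument as weigths rather than weights, so the weights may simply not be reaching bal.tab(). Independently of that, the sketch below shows how a weighted Pearson correlation between an exposure and a covariate can be computed in base R with stats::cov.wt(). The data are simulated stand-ins (dt_sample.csv is not used), the helper name weighted_cor is hypothetical, and the result is a plain weighted correlation that may differ from what cobalt reports if cobalt standardizes differently.

```r
# Weighted Pearson correlation between two numeric vectors (base R only)
weighted_cor <- function(x, y, w) {
  stats::cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]
}

# Simulated stand-in data; not the reporter's dataset
set.seed(1)
n <- 200
age <- rnorm(n, mean = 70, sd = 10)
exposure <- -0.5 * age + rnorm(n, mean = 0, sd = 5)
w <- runif(n, min = 0.5, max = 2)  # stand-in for swtTrimmed

r_weighted <- weighted_cor(exposure, age, w)
r_equal <- weighted_cor(exposure, age, rep(1, n))  # same as cor(exposure, age)

print(r_weighted)
print(r_equal)
```

With equal weights the function reduces to the ordinary cor(), which makes a convenient sanity check.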

Issue installing version 3.8 from CRAN

Installing version 3.8 of cobalt from CRAN fails for me. The cause seems to be the grid requirement listed on the CRAN page for that version (grid (≥ 3.6.1)):

Warning in install.packages :
dependency ‘grid’ is not available

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace 'grid' 3.5.3 is already loaded, but >= 3.6.1 is required

I am running R 3.5.3 and would prefer not to update to 3.6 yet -- is it possible to fix the dependency issue so that 3.8 can be installed, or is grid 3.6.1 or greater strictly required? If so, could you update the R version requirement from 3.3.0?

Thanks for the excellent package.
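For context, grid is one of the packages bundled with R itself, so its version is expected to track the running R version. That would explain why a requirement of grid (≥ 3.6.1) cannot be met on R 3.5.3 without upgrading R. A quick check, assuming a standard R installation:

```r
# grid ships with R, so its version should match the running R version
grid_version <- packageVersion("grid")
r_version <- getRversion()

print(grid_version)
print(r_version)
print(grid_version == r_version)  # expected to be TRUE on a standard installation
```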

Add # of units discarded in bal.tab()$Observations when using subclassification with discard

Hi, I have a suggestion to add # of units discarded in bal.tab()$Observations when using subclassification with discard.
For example,

library(MatchIt)
library(cobalt)
data("lalonde")
m_out <- matchit(treat ~ age + educ + race + married + nodegree, 
                 data = lalonde, method = "subclass", distance = "glm", discard = "both")
m_summary <- bal.tab(m_out)
m_summary$Observations

Currently it returns

          1  2  3  4  5  6 All
Control 297 21 24 15  9 14 429
Treated  31 31 30 31 31 30 185
Total   328 52 54 46 40 44 614

Add a column of "discarded" so that the numbers add up to "All".

          1  2  3  4  5  6 Discarded All
Control 297 21 24 15  9 14       49  429
Treated  31 31 30 31 31 30        1  185
Total   328 52 54 46 40 44       50  614
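The proposed column can be derived from the existing table alone, which the base R sketch below illustrates. The counts are copied from the output above. The object names are hypothetical, and this is not how cobalt builds the table internally.

```r
# Sample-size table copied from the subclassification output above
obs <- rbind(
  Control = c(297, 21, 24, 15,  9, 14, 429),
  Treated = c( 31, 31, 30, 31, 31, 30, 185),
  Total   = c(328, 52, 54, 46, 40, 44, 614)
)
colnames(obs) <- c(1:6, "All")
subclass_cols <- as.character(1:6)

# Discarded units = overall count minus the units assigned to any subclass
discarded <- obs[, "All"] - rowSums(obs[, subclass_cols])

# Place the new column just before "All"
obs_with_discarded <- cbind(obs[, subclass_cols],
                            Discarded = discarded,
                            All = obs[, "All"])
print(obs_with_discarded)
```

The Discarded column comes out as 49, 1, and 50 for the Control, Treated, and Total rows, matching the table proposed above.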