R package MatchIt

matchit's Introduction

MatchIt: Nonparametric Preprocessing for Parametric Causal Inference

Overview

MatchIt provides a simple and straightforward interface to various methods of matching for covariate balance in observational studies. Matching is one way to reduce confounding and model dependence when estimating treatment effects. Several matching methods are available, including nearest neighbor matching, optimal pair matching, optimal full matching, generalized full matching, genetic matching, exact matching, coarsened exact matching, cardinality matching, and subclassification, some of which rely on functions from other R packages. A variety of methods to estimate propensity scores for propensity score matching are included. Below is an example of the use of MatchIt to perform Mahalanobis distance matching with replacement and assess balance:

library("MatchIt")
data("lalonde", package = "MatchIt")

# 1:1 nearest neighbor matching with replacement on 
# the Mahalanobis distance
m.out <- matchit(treat ~ age + educ + race + married + 
                   nodegree + re74 + re75, 
                 data = lalonde, distance = "mahalanobis",
                 replace = TRUE)

Printing the MatchIt object provides details of the kind of matching performed.

m.out

#> A matchit object
#>  - method: 1:1 nearest neighbor matching with replacement
#>  - distance: Mahalanobis
#>  - number of obs.: 614 (original), 261 (matched)
#>  - target estimand: ATT
#>  - covariates: age, educ, race, married, nodegree, re74, re75

We can check covariate balance for the original and matched samples using summary():

#Checking balance before and after matching:
summary(m.out)

#> 
#> Call:
#> matchit(formula = treat ~ age + educ + race + married + nodegree + 
#>     re74 + re75, data = lalonde, distance = "mahalanobis", replace = TRUE)
#> 
#> Summary of Balance for All Data:
#>            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
#> age              25.8162       28.0303         -0.3094     0.4400    0.0813   0.1577
#> educ             10.3459       10.2354          0.0550     0.4959    0.0347   0.1114
#> raceblack         0.8432        0.2028          1.7615          .    0.6404   0.6404
#> racehispan        0.0595        0.1422         -0.3498          .    0.0827   0.0827
#> racewhite         0.0973        0.6550         -1.8819          .    0.5577   0.5577
#> married           0.1892        0.5128         -0.8263          .    0.3236   0.3236
#> nodegree          0.7081        0.5967          0.2450          .    0.1114   0.1114
#> re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248   0.4470
#> re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342   0.2876
#> 
#> Summary of Balance for Matched Data:
#>            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
#> age              25.8162       25.5405          0.0385     0.6524    0.0466   0.1892          0.4827
#> educ             10.3459       10.4270         -0.0403     1.1636    0.0077   0.0378          0.1963
#> raceblack         0.8432        0.8432          0.0000          .    0.0000   0.0000          0.0000
#> racehispan        0.0595        0.0595          0.0000          .    0.0000   0.0000          0.0000
#> racewhite         0.0973        0.0973          0.0000          .    0.0000   0.0000          0.0000
#> married           0.1892        0.1784          0.0276          .    0.0108   0.0108          0.0276
#> nodegree          0.7081        0.7081          0.0000          .    0.0000   0.0000          0.0000
#> re74           2095.5737     1788.6941          0.0628     1.5690    0.0311   0.1730          0.2494
#> re75           1532.0553     1087.7420          0.1380     2.1221    0.0330   0.0865          0.2360
#> 
#> Sample Sizes:
#>               Control Treated
#> All               429     185
#> Matched (ESS)      33     185
#> Matched            76     185
#> Unmatched         353       0
#> Discarded           0       0

At the top is balance for the original sample. Below that is balance in the matched sample, followed by the sample sizes before and after matching. Smaller values for the balance statistics indicate better balance. (In this case, good balance was not achieved and other matching methods should be tried.) We can plot the standardized mean differences in a Love plot for a clean, visual display of balance across the sample:

#Plot balance
plot(summary(m.out))

Although much has been written about matching theory, most of the theory relied upon in MatchIt is described well in Ho, Imai, King, and Stuart (2007), Stuart (2010), and Greifer and Stuart (2021). The Journal of Statistical Software article for MatchIt can be accessed here, though note that some options have changed, so the MatchIt reference pages and included vignettes should be used for understanding the functions and methods available. Further references for individual methods are present in their respective help pages. The MatchIt website provides access to vignettes and documentation files.

Citing `MatchIt`

Please cite MatchIt when using it for analysis presented in publications, which you can do by citing the Journal of Statistical Software article below:

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 42(8). doi:10.18637/jss.v042.i08

This citation can also be accessed using citation("MatchIt") in R. For reproducibility purposes, it is also important to include the version number for the version used.

Installation

To download and install the latest stable version of MatchIt from CRAN, run the following:

install.packages("MatchIt")

To install a development version, which may have a bug fixed or a new feature, run the following:

install.packages("remotes") #If not yet installed

remotes::install_github("ngreifer/MatchIt")

This will require R to compile C++ code, which might require additional software be installed on your computer. If you need the development version but can’t compile the package, ask the maintainer for a binary version of the package.

matchit's Issues

Reproducibility Error

Hi,

I have lately been using the MatchIt package. After setting a seed, I ran 1:4 matching with a caliper of 0.25, without replacement, and with some variables matched exactly. I then re-ran the same query with exactly the same seed and settings at a later date, roughly a month afterwards. To my surprise, the matched data were not the same. It dropped 4 patients from the controls, and upon re-running it dropped 2 patients from the cases. No error is raised and the query runs smoothly, but the matched sample differs.
Please let me know how I can fix this.

Priyanka

match.data does not function with {purrr}

library(MatchIt)
library(dplyr)
library(tidyr)
library(purrr)
data("lalonde", package = "MatchIt")

nest_data <- lalonde %>%
  group_by(race) %>%
  nest()

match_nested_data <- function(df) {
  matchit(treat ~ age + re74 + educ + re75 + re78,
          data = df,
          method = "cem")
}

cem_matchit <- map(.x = nest_data$data, .f = match_nested_data)

# This is the call that fails
data <- map(cem_matchit, match.data)

Warning for reserved variables - distance

When generating the matched dataset using match.data(m.out), an error is thrown when the data contains a variable with a reserved name (e.g. distance):
m.data <- match.data(m.out)
Error in match.data(m.out) : invalid input for distance. choose a different name.
The error message is not very intuitive; a slightly more verbose message would be useful. Since distance is likely to be a common variable name in matching, it would also be good to avoid failing on it.
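Until the message is improved, one way to sidestep the clash is to rename the offending column before matching. The sketch below is base R only. The helper rename_reserved() is hypothetical (not part of MatchIt), and the list of reserved names is an assumption based on the columns match.data() adds by default.

```r
# Hypothetical helper (not part of MatchIt): rename columns that clash with
# the names match.data() adds by default. The reserved list is an assumption.
rename_reserved <- function(df,
                            reserved = c("distance", "weights", "subclass"),
                            suffix = "_orig") {
  clash <- names(df) %in% reserved
  names(df)[clash] <- paste0(names(df)[clash], suffix)
  df
}

# Toy data with a column called "distance" (e.g. travel distance in km)
dat <- data.frame(treat = c(1, 0, 0),
                  age = c(30, 41, 29),
                  distance = c(12.5, 3.0, 48.2))
dat_safe <- rename_reserved(dat)
names(dat_safe)
```

Recent MatchIt versions also document distance, weights, and subclass arguments to match.data() for renaming the added columns; check ?match.data for the installed version before relying on them.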

MatchIt on Big Data - Horizontal data split on exact variables

I am using MatchIt on big data (2 million records), so it does not run in one go. I need to split my dataset into subsets (based on the values of my exact variables) and run MatchIt on each subset to make it work. It would be really great if MatchIt could do that automatically (or allow users to specify it); that would be an immense improvement in computing efficiency!

Also, I am looking for a way to re-combine all my different MatchIt outputs (each run on a subset of my data) and do not know how to do that - any help would be greatly appreciated!!

Thank you for your great work!
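The split-and-recombine workflow described above might be sketched as follows. This is a base R illustration, not a MatchIt feature. match_by_stratum() and toy_match() are hypothetical names, and toy_match() is only a stand-in for a real matching call. The key step is prefixing each subclass label with its stratum so labels stay unique after stacking.

```r
# Hypothetical helper: run a matching function on each stratum of an exact
# variable and stack the results with globally unique subclass labels.
match_by_stratum <- function(data, by, match_fun) {
  pieces <- split(data, data[[by]], drop = TRUE)
  matched <- lapply(names(pieces), function(s) {
    out <- match_fun(pieces[[s]])
    if (is.null(out) || nrow(out) == 0) return(NULL)
    # Prefix the stratum so subclass 1 in one stratum differs from 1 in another
    out$subclass <- paste(s, out$subclass, sep = "_")
    out
  })
  matched <- Filter(Negate(is.null), matched)
  do.call(rbind, matched)
}

# Stand-in for a real matcher (an assumption for illustration only): pair the
# i-th treated unit with the i-th control unit within the stratum.
toy_match <- function(d) {
  tr <- d[d$treat == 1, , drop = FALSE]
  co <- d[d$treat == 0, , drop = FALSE]
  n <- min(nrow(tr), nrow(co))
  if (n == 0) return(NULL)
  out <- rbind(tr[seq_len(n), , drop = FALSE], co[seq_len(n), , drop = FALSE])
  out$subclass <- rep(seq_len(n), 2)
  out
}

dat <- data.frame(id = 1:8,
                  sex = rep(c("F", "M"), each = 4),
                  treat = c(1, 0, 0, 1, 1, 1, 0, 0))
combined <- match_by_stratum(dat, by = "sex", match_fun = toy_match)
combined
```

With MatchIt, match_fun could plausibly be something like function(d) match.data(matchit(treat ~ x1 + x2, data = d), data = d), where the formula is a placeholder and the data argument to match.data() should be checked against the installed version.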

compatibility with Zelig

The following error has been reported by multiple users:

demo(analysis)

> ate.all <- c(s.out1$qi$att.ev, -s.out2$qi$att.ev)
Error in s.out1$qi$att.ev : object of type 'closure' is not subsettable

QQPlot Error with nearest neighbor, exact, and ratio>1

QQ plotting nearest neighbor matching results using the "exact" option on a factor and ratio>1 produces the error:
Error in jitter(m.xi) : 'x' must be numeric
Code to reproduce the error:

library(MatchIt)
data(lalonde)
lalonde$married<-as.factor(lalonde$married)
m.out <- matchit(treat ~ re74 + re75 + educ + age,
                 data = lalonde, ratio = 2, method = "nearest", exact=c("married"))
plot(m.out, type = "QQ", interact = F)

A modification to line 34 of matchit.qqplot.R seems to fix the issue:
Change:
m.covariates <- x$X[c(t.plot, c.plot),]
to:
m.covariates <- X[c(t.plot, c.plot),]

Travis build failures

Travis builds are failing because the dependency WhatIf is not available. The problem is a broken Amelia build (https://travis-ci.org/kosukeimai/MatchIt/builds/267596888#L3139). The real issue is that RcppArmadillo requires a newer gcc (https://travis-ci.org/kosukeimai/MatchIt/builds/267596888#L2510). I think you can get a newer gcc easily by switching to building against Ubuntu trusty. This is probably safer than trying to upgrade gcc via apt on precise; if you go the latter route, there will probably be errors related to packages being built with different compilers.

To build against trusty you could put the following at the top of your .travis.yml:

matrix:
  include:
    -  os: linux
       dist: trusty

(my package for reference https://github.com/cfhammill/RMINC/blob/master/.travis.yml#L1)

Estimating run time

A suggestion based on a personal experience - I am matching some moderately sized data (670k rows) and using MatchIt for the first time. I had no idea how long things would take so I ran small segments of my data of varying size and timed it. I then fit a 2nd degree polynomial which I used to predict the total run time (ca. 5 hours). I found this information very useful/helpful.

Would it make sense to include a "predict" option that could be used by people running large data sets (e.g. they would run it prior to running the full dataset) to give an idea of run time? If so I'd be happy to look into implementing such a feature.
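The timing approach described above can be sketched in base R. The workload() function here is an invented stand-in with roughly quadratic cost; in practice it would be the matchit() call on a subsample of size n. The subsample sizes and the full size n_full are also assumptions chosen for illustration.

```r
# Stand-in workload with roughly quadratic cost (assumption for illustration);
# in practice this would be the matchit() call on a subsample of size n.
workload <- function(n) {
  x <- runif(n)
  invisible(sum(outer(x, x, function(a, b) abs(a - b))))
}

# Time the workload at several subsample sizes
sizes <- c(200, 400, 800, 1200, 1600)
secs <- vapply(sizes, function(n) {
  system.time(workload(n))[["elapsed"]]
}, numeric(1))

# Fit a second-degree polynomial in n and extrapolate to the full size
timings <- data.frame(n = sizes, secs = secs)
fit <- lm(secs ~ n + I(n^2), data = timings)
n_full <- 20000
predicted <- predict(fit, newdata = data.frame(n = n_full))
predicted
```

Extrapolating far beyond the timed sizes is fragile, so the prediction is best read as an order-of-magnitude estimate.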

Error with ratio > 1 and 'genetic' matching

I have encountered an error and possibly a bug in the MatchIt package.

The error occurs when using genetic matching with a ratio > 2.

The error message reads:

 Error in out[out[, 1] == tind[i], 2:(ratio + 1)] : 
  subscript out of bounds

Using the debugger, I have traced back the problem to the following section of code:

out <- Matching::GenMatch(tt, cbind(dd, xx), M = ratio, 
    ...)$matches
  mm <- matrix(0, nrow = n1, ncol = max(table(out[, 1])), 
    dimnames = list(tlabels, 1:max(table(out[, 1]))))
  for (i in 1:n1) {
    tmp <- labels[c(out[out[, 1] == tind[i], 2:(ratio + 
      1)])]
    if (length(tmp) < ncol(mm)) 
      tmp <- c(tmp, rep(NA, ncol(mm) - length(tmp)))
    mm[i, ] <- tmp
  }

out is an array where the first column is treatment units, the second column is matched control units, and the third column is some weight (?), i.e. 1 divided by the number of matched control units, where the number of matched control units is greater or equal to the ratio parameter.

Note that the data format is long, that is, every treatment unit has as many rows as there are matched control units, at least ratio.

mm, in contrast, is a matrix where again rows are treatment units, but columns are matched control units. The number of columns is thus equal to the largest number of control units matched to a treatment unit.

Since the data format of out is long, and ratio is larger than 2 (e.g. 5), the line

tmp <- labels[c(out[out[, 1] == tind[i], 2:(ratio + 
      1)])]

will result in an error, since it will try to access columns 2:6 of the out array, which do not exist.

I believe that this line should read

tmp <- labels[c(out[out[, 1] == tind[i], 2])]

and in my version it seems to work fine with that change.
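The long-to-wide conversion with the proposed fix can be illustrated on toy data. This sketch does not call Matching::GenMatch(); the out matrix, labels, and indices are invented, following the column layout described above (treated index, control index, weight).

```r
# Toy long-format matches: column 1 = treated index, column 2 = matched
# control index, column 3 = weight (layout as described above; values invented)
out <- cbind(c(1, 1, 1, 2, 2, 3, 3, 3),
             c(11, 12, 13, 14, 15, 11, 16, 17),
             c(1/3, 1/3, 1/3, 1/2, 1/2, 1/3, 1/3, 1/3))
labels <- paste0("unit", 1:20)   # hypothetical unit labels
tind <- c(1, 2, 3)               # treated indices
tlabels <- labels[tind]

# One row per treated unit, one column per matched control, padded with NA
n_cols <- max(table(out[, 1]))
mm <- matrix(NA_character_, nrow = length(tind), ncol = n_cols,
             dimnames = list(tlabels, seq_len(n_cols)))
for (i in seq_along(tind)) {
  # Only column 2 holds control indices, regardless of the ratio
  tmp <- labels[out[out[, 1] == tind[i], 2]]
  mm[i, seq_along(tmp)] <- tmp
}
mm
```

Treated units with fewer matches than the widest row keep NA in the trailing columns, which mirrors the padding step in the original loop.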

Error: The argument to 'distance' must be a string of length 1 when using matchit with default arguments

I want to use the matchit function with the default arguments (that is, I only provide the function with the data and the formula argument):

This is what my data frame site_df looks like (only the first 14 rows are shown):

formula = as.formula("group_boolean_inversed ~ sex_boolean + age")
m.out <- matchit(formula=formula,
                 data=site_df)

site_df_matched <- get_matches(m.out,site_df)

which produces this error:

Error: The argument to 'distance' must be a string of length 1.

I am using the CRAN-downloaded version of MatchIt (‘4.0.0.9000’). The devtools version suggested in #41 didn't work on my machine due to some DLL issues.

lack of rownames in dataframe crashes caliper matching

See demo script, below. matchit() fails in first call. Adding rownames fixes it in second call. This is due to some extraction breaking deep in the guts of matchit()

Small test case of possible bug?

library( MatchIt )

make.dat = function( p, N, beta0 = -1, beta1=1 ) {
  X = matrix( rnorm( p * N ), ncol = p )
  pi = arm::invlogit( beta0 + beta1 * X[,1] )
  Z = 0 +( runif( N ) <= pi )

  colnames(X) = paste( "X", 1:p, sep="" )
  df = as.data.frame( X )
  df$Z = Z
  df
}

mydat = make.dat( 3, 100 )
head( mydat )
str( mydat )

m.out <- matchit(Z ~ X1 + X2 + X3, data = mydat, method = "nearest",
caliper = 0.2, mahvars = c("X1","X2") )

rownames( mydat ) = paste( "R", rownames(mydat), sep="" )
m.out <- matchit(Z ~ X1 + X2 + X3, data = mydat, method = "nearest",
caliper = 0.2, mahvars = c("X1","X2") )

Can I ask a follow-up question?

With the new behavior of caliper, does it mean that the matched controls selected when using caliper are always a subset of those selected without using caliper, given all the other parameters are the same (using method = "nearest")?
Since the caliper seems only to help remove the "outliers", while the controls that match well within the caliper are always the nearest ones to the treated subjects (and this is certain), is there no randomness any more?

Thank you so much!

Originally posted by @youblue in #48 (comment)

MatchIt update to 4.0.0 build fails

I get the following error when updating from 3.0.2 to 4.0.0:

nnm.cpp:213:29: error: expected expression
                            [&match_distance](int k, int j) {return match_distance[k] < match_distance[j];});
                            ^
1 error generated.
make: *** [nnm.o] Error 1

Any clues or possible workarounds would be very welcome.

SessionInfo() is given below in the details.

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3 yaml_2.2.1

install.packages("MatchIt")

There is a binary version available but the source version is later:
binary source needs_compilation
MatchIt 3.0.2 4.0.0 TRUE

Do you want to install from sources the package which needs compilation? (Yes/no/cancel) y
installing the source package ‘MatchIt’

trying URL 'https://cran.rstudio.com/src/contrib/MatchIt_4.0.0.tar.gz'
Content type 'application/x-gzip' length 1374846 bytes (1.3 MB)

downloaded 1.3 MB

installing source package ‘MatchIt’ ...
** package ‘MatchIt’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
clang++ -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppProgress/include' -I/usr/local/include -fPIC -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppProgress/include' -I/usr/local/include -fPIC -Wall -g -O2 -c nnm.cpp -o nnm.o
nnm.cpp:213:29: error: expected expression
[&match_distance](int k, int j) {return match_distance[k] < match_distance[j];});
^
1 error generated.
make: *** [nnm.o] Error 1
ERROR: compilation failed for package ‘MatchIt’
removing ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/MatchIt’
restoring previous ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/MatchIt’
Warning in install.packages :
installation of package ‘MatchIt’ had non-zero exit status

The downloaded source packages are in
‘/private/var/folders/qh/6q39v0755_54rxmbl8m5ttnwy0twd7/T/RtmpcY8fRr/downloaded_packages’


set.seed and latest version of MatchIt

Thank you very much for this great package.

There may be a bug in the latest version of MatchIt.

When I change set.seed() to adjust the matching, there is no change in the matching process afterwards. I tried updating R (4.03) and also downgrading R, with no change. I tried on a Mac and on Windows, with no change.

The problem is solved when I downgraded to MatchIt_3.0.2

All my best

Antoine

PS: sorry if this issue is not correct. You do a great job thank you so much

A typo

In the last page of the manual, "nn" appears twice in the "Value" section.

NAMESPACE Issue When Compiling CBPS

Hi Noah,

When I try to build CBPS, I am now getting an error from MatchIt.

Error: package or namespace load failed for 'MatchIt' in rbind(info, getNamespaceInfo(env, "S3methods")):
number of columns of matrices must match (see arg 2)
Error : package 'MatchIt' could not be loaded

Any idea what could be causing this error?

Thanks!
Christian

CEM with replacement?

Is it possible to use CEM with sampling with replacement? I am aware that there is no argument replace when method = cem is used. I am also aware that setting k2k = TRUE means using nearest neighbor matching without replacement will take place within each stratum. Is it possible to use with replacement here? Also, what does k2k = FALSE mean?

optimal method with exact matching on some variables

I used the exact option to force exact matching on some variables and used other methods for the other variables.

The method = "nearest" approach worked as expected. However, when I switched the argument to method = "optimal", the matched IDs in match.matrix do not show the expected exact matching on the variables specified in the exact option. Why is that?

You can see in the output that subject 2 and subject 25, who are matched, do not have the same age.

Code I used:
match= matchit(A ~ SEX + AGE + V1 + V2, method = "optimal",
exact = c("SEX", "AGE"), data = data_test)
match_id<-match$match.matrix
N<-9
match_id2<-c(1:N, match_id)

match_data<-match.data(match)
#reorder matched control in the same order as treated
match_data2<-match_data[match_id2, ]
match_data2$match_id<-rep(1:N, 2)

Output:

match_data2
   SEX AGE  V1  V2 A   distance weights subclass match_id
1    F   8 5.4 6.0 1 0.19925382       1        1        1
2    F   9 5.7 2.6 1 0.52655669       1        2        2
3    M  10 5.6 1.6 1 0.25929894       1        3        3
4    F   8 5.1 3.7 1 0.19159122       1        4        4
5    M   8 4.6 2.9 1 0.03403255       1        5        5
6    M   7 5.7 4.9 1 0.30804933       1        6        6
7    F   7 5.2 5.2 1 0.21183573       1        7        7
8    F   6 5.3 5.6 1 0.30090303       1        8        8
9    F   6 4.4 4.2 1 0.06211456       1        9        9
30   M   8 5.4 4.1 0 0.15866640       1        1        1
25   F   8 5.4 2.7 0 0.41199904       1        2        2
13   F   6 5.0 3.7 0 0.26556803       1        3        3
28   M   8 5.4 3.9 0 0.16722246       1        4        4
53   F   6 4.3 5.6 0 0.03196558       1        5        5
24   M   7 5.7 4.9 0 0.30804933       1        6        6
51   M   8 5.5 3.4 0 0.23293292       1        7        7
36   F   8 5.2 2.8 0 0.28892715       1        8        8
47   F   6 4.3 3.5 0 0.05998461       1        9        9

Exact matching produces more than one case per subclass

I am using 'MatchIt' package to match cases to controls. I would like to match by age and four other binary variables.

match_exact <- matchit(case_control ~ age + var1 + var2 + var3 + var4, cutpoints = list(age = 10), data = df, method = "exact")
match_exact_data <- match.data(match_exact)

Then I extract the matched data, and I would like to know which control each case was matched to.
In each subclass I see more than one case. I would assume there would be one case and one or more controls. Does that make sense?
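Exact matching groups all units that share the same covariate values into one stratum, so a stratum can hold several cases as well as several controls. The base R sketch below reproduces that grouping on invented toy data. It is an illustration of the idea, not MatchIt's implementation, and the column names are assumptions.

```r
# Toy data (invented): case/control indicator and two matching variables
df <- data.frame(
  case_control = c(1, 1, 0, 0, 0, 1, 0, 0),
  age_group    = c("30s", "30s", "30s", "30s", "40s", "40s", "40s", "50s"),
  var1         = c(1, 1, 1, 1, 0, 0, 0, 1)
)

# A stratum is one unique combination of the matching variables
df$stratum <- interaction(df$age_group, df$var1, drop = TRUE)

# Count cases and controls per stratum
counts <- table(df$stratum, df$case_control)
counts

# Keep only strata that contain at least one case and one control
keep <- rownames(counts)[counts[, "0"] > 0 & counts[, "1"] > 0]
matched <- df[df$stratum %in% keep, ]
matched
```

If one case per matched set is required, a pair-matching method with exact constraints may be closer to what is wanted than stratum-based exact matching.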

The nearest.R demo doesn't generate really matched data

Hi,
I am learning to use the MatchIt package to do PSM. I used nearest.R in the demo folder to generate matched data and then used the table1 package to review the data.

There are still big differences between the groups. Why?
`library(MatchIt)
data(lalonde)
lalonde$treat <- factor(lalonde$treat, levels=c(0, 1), labels=c("Control", "Treatment"))
lalonde$black <- factor(lalonde$black)
lalonde$hispan <- factor(lalonde$hispan)
lalonde$married <- factor(lalonde$married)
lalonde$nodegree <- factor(lalonde$nodegree)
lalonde$black <- as.logical(lalonde$black == 1)
lalonde$hispan <- as.logical(lalonde$hispan == 1)
lalonde$married <- as.logical(lalonde$married == 1)
lalonde$nodegree <- as.logical(lalonde$nodegree == 1)
m.out <- matchit(treat~age+educ+black+hispan+married+nodegree+re74+re75+re78, data = lalonde,method = "nearest")
lalonde<-match.data(m.out)
library(table1)
rndr <- function(x, name, ...) {
if (length(x) == 0) {
y <- lalonde[[name]]
s <- rep("", length(render.default(x=y, name=name, ...)))
if (is.numeric(y)) {
p <- t.test(y ~ lalonde$treat)$p.value
} else {
p <- chisq.test(table(y, droplevels(lalonde$treat)))$p.value
}
s[2] <- sub("<", "&lt;", format.pval(p, digits=3, eps=0.001))
s
} else {
render.default(x=x, name=name, ...)
}
}

rndr.strat <- function(label, n, ...) {
ifelse(n==0, label, render.strat.default(label, n, ...))
}

table1(~ age + black + hispan + married + nodegree + re74 + re75 + re78 | treat,
data=lalonde, droplevels=F, render=rndr, render.strat=rndr.strat, overall=F)`

Issue with exact matching procedure

Hello! In the R MatchIt package using the "exact" method I always get the "Error in weights.subclass(psclass, treat) : No units were matched".

Somewhere I have read that the data would not be passed correctly to the 'optmatch' package to which 'MatchIt' refers.

Take this for a reproducible example:

library(car)
WeightLoss1 <- WeightLoss
WeightLoss1$group <- as.integer(ifelse(WeightLoss1$group == "Control", 0, 1))

library(MatchIt)
matchit(group ~ wl1 + wl2 + wl3 + se1 + se2 + se3, method = "exact", data = WeightLoss1)

I would appreciate if anybody could have a look into it, thanks!

Set unique output labels (rownames in match$matrix)

And how can we set the rownames of the $match.matrix output to uniquely identify records in the data (e.g. using the "ID" or "name" field in the data) please?

I am working on a dataset with 2 million records, and various subsets need to be matched, so indices are terribly error-prone.

Thank you for your great work!

calculating total % reduction in bias

The package allows assessing bias reduction for individual variables before and after matching. However, it does not provide a way to calculate the overall bias reduction from pre-matching to post-matching. Could you help me understand whether this information is accessible or can be computed from the output of the MatchIt package?
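One possible overall summary, offered as a sketch rather than an established MatchIt statistic, is the percent reduction in the mean absolute standardized mean difference. The values below are copied from the summary() output of the README example earlier on this page; with a fitted object they would come from the "Std. Mean Diff." column of the summary tables.

```r
# Standardized mean differences copied from the summary() output in the
# README example (all data vs. matched data)
smd_all <- c(age = -0.3094, educ = 0.0550, raceblack = 1.7615,
             racehispan = -0.3498, racewhite = -1.8819, married = -0.8263,
             nodegree = 0.2450, re74 = -0.7211, re75 = -0.2903)
smd_matched <- c(age = 0.0385, educ = -0.0403, raceblack = 0,
                 racehispan = 0, racewhite = 0, married = 0.0276,
                 nodegree = 0, re74 = 0.0628, re75 = 0.1380)

# Per-covariate percent reduction in absolute SMD
pct_each <- 100 * (1 - abs(smd_matched) / abs(smd_all))

# One possible overall summary: percent reduction in the mean absolute SMD
pct_overall <- 100 * (1 - mean(abs(smd_matched)) / mean(abs(smd_all)))
round(pct_each, 1)
round(pct_overall, 1)
```

A single number hides covariates whose balance got worse, so it is best reported alongside the per-covariate values.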

Summaries of match objects have different columns and column order

The summary output differs between full matching and nearest-neighbor matching. For full matching, the standardized difference is in the 3rd column of the summary output and for nearest-neighbor, it’s in the 4th column.

Let's say that this is my code (took out excessive variables for clarity).

model1<-matchit(formula= y ~ x1 + x2, exact=c("x3", "x4"), mahvars=c("x5", "x6"), data=NS, method="nearest", distance = "logit", caliper=0.25, replace=T, calclosest=T, ratio=3, verbose=T)
df.addl=with(data.frame(x7, x8), data=NS)
m.out=summary(model1, standardize=T, addlvariables=df.addl)
names(m.out$sum.matched)
[1] "Means Treated" "Means Control" "Std. Mean Diff." "eCDF Med" "eCDF Mean"
[6] "eCDF Max"
model2<-matchit(formula= y ~ x1 + x2 +x3 + x4 + x5, data=NS, method="full", verbose=T)
df.addl=with(data.frame(x6, x7, x8), data=NS)
m.out2=summary(model2, standardize=T, addlvariables=df.addl)
names(m.out2$sum.matched)
[1] "Means Treated" "Means Control" "SD Control" "Std. Mean Diff." "eCDF Med"
[6] "eCDF Mean" "eCDF Max"

This problem was replicated many times by my students in the spring term, as many students produced Love plots for their projects or replications of National Supported Work study that looked fishy because they were not plotting the standardized differences.

very slow when using fixed effects

The MatchIt package gets very slow when I add fixed effects. Are there any ways to make this kind of operation faster?

get_matches() returns unmatched treated observations?

It appears that, at least in some cases, get_matches() will return data on unmatched treated observations and assign these observations a weight of 1. This seems undesirable to me, but perhaps I am mistaken.

In the example below, shouldn't get_matches() either return a data.frame with 190 rows, or return > 190 rows but with weights set to zero or NA for unmatched observations?

Thank you for any insight.

library(MatchIt)

# an example with replacement
m <- matchit(treat ~ age + educ + black + hispan + married + 
             nodegree + re74 + re75, 
             data = lalonde,
             method = "nearest", 
             distance = "logit",
             replace = TRUE,
             caliper = 0.005)


m$nn
#           Control Treated
# All           429     185
# Matched        64      95
# Unmatched     365      90
# Discarded       0       0

gm <- get_matches(m, lalonde)
table(gm$treat)
# 0   1 
# 64 185 

aggregate(weight ~ treat, FUN = sum, data = gm)
#  treat weight
#      0     95
#      1    185

table(gm[!(rownames(gm) %in% rownames(match.data(m))) & gm$treat == 1, "weight"])
# 1 
# 90

How to deal with missing values in variables that are used for matching?

Hello, when we have missing values in variables that are used for matching, and these variables strongly influence the imbalance, how can we deal with them? Is there an alternative method?

match.data not working when matchit is called from within a function

Good afternoon,

I'm comparing sets of multiple matching strategies on different data sets, and so I've found it convenient to use a wrapper function for matchit to facilitate looping. Mostly it works fine, but I've run into an error when I try to use match.data() on a matchit object returned from my wrapper. See a minimal reprex below:

`library(MatchIt)
data(lalonde)
match1 = matchit(treat ~ age + educ + race + married + nodegree, data = lalonde)
match.data(match1)

match_with_varying_data = function(fmla, matchData){
matchit(fmla, data = matchData)
}

match2 = match_with_varying_data(fmla = formula("treat ~ age + educ + race + married + nodegree"),
matchData = lalonde)
match.data(match2)
`

The issue seems to be that match.data() is looking in object$call$data for the name of the dataset. For match1$call$data it returns "lalonde," which it then finds globally. However, in match2$call$data it finds matchData, the name of the data in the wrapper's environment. Interestingly, this error does not appear if I use a wrapper that only uses a data argument, and not a formula argument. For instance, this works fine:

`match_with_varying_data_no_fmla = function(matchData){
matchit(treat ~ age + educ + race + married + nodegree, data = matchData)
}

match3 = match_with_varying_data_no_fmla(matchData = lalonde)
match.data(match3)
`

I've managed a workaround, but this doesn't seem like the intended behavior. Thank you, and please see sessionInfo() below:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin18.7.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.10_1/lib/libopenblasp-r0.3.10.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] MatchIt_4.0.1

loaded via a namespace (and not attached):
[1] compiler_4.0.2 backports_1.1.8 tools_4.0.2 Rcpp_1.0.5
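The mechanism suspected above, a stored call whose data argument is looked up by name later, can be reproduced with a toy example in base R. None of this is MatchIt code. fit_toy(), get_data_toy(), and wrapper() are hypothetical stand-ins written only to show why the name matchData cannot be found outside the wrapper.

```r
# Toy stand-in for a fitting function that stores its call (not MatchIt code)
fit_toy <- function(formula, data) {
  structure(list(call = match.call()), class = "toy_fit")
}

# Toy stand-in for an extractor that re-evaluates the stored data argument
# in a given environment
get_data_toy <- function(object, envir) {
  eval(object$call$data, envir = envir)
}

wrapper <- function(fmla, matchData) fit_toy(fmla, data = matchData)

here <- environment()            # the environment where my_df lives
my_df <- data.frame(x = 1:3)
direct  <- fit_toy(x ~ 1, data = my_df)
wrapped <- wrapper(x ~ 1, matchData = my_df)

deparse(direct$call$data)    # the name used at the call site
deparse(wrapped$call$data)   # the wrapper's argument name

# Looking the names up later succeeds for the direct call only
found_direct  <- tryCatch(get_data_toy(direct, here), error = function(e) NULL)
found_wrapped <- tryCatch(get_data_toy(wrapped, here), error = function(e) NULL)
is.null(found_direct)
is.null(found_wrapped)
```

A workaround that avoids the lookup altogether is to hand the data to the extractor explicitly, for example match.data(match2, data = lalonde), assuming the installed MatchIt version documents a data argument for match.data().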

Caliper widths are on propensity score scale rather than logit scale for distance="logit"

Austin (Pharmaceutical Statistics, 2011) recommends using a caliper width of 0.2 of the standard deviation of the logit of the propensity score, because it is more likely to be normally distributed. It appears that the caliper width calculated in matchit(..., distance="logit") is based on the propensity scores themselves and not the logit of these propensity scores. I'm not sure if this is intentional, but if not, it could be remedied by fixing line 3 of distance2glm.R to

return(list(model = res, distance = predict(res, type = "link")))

If this was intentional, I'd appreciate a link to the literature for justification.

Thanks,
Steve
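To make the difference between the two scales concrete, here is a small base R sketch on simulated data. It does not use MatchIt; the data-generating model and variable names are assumptions. It computes a caliper of 0.2 standard deviations on the propensity score scale and on the logit scale.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-1 + 1.5 * x))

fit <- glm(treat ~ x, family = binomial)
ps    <- predict(fit, type = "response")  # propensity score, in (0, 1)
logit <- predict(fit, type = "link")      # logit of the propensity score

# Caliper widths of 0.2 standard deviations on each scale
caliper_ps    <- 0.2 * sd(ps)
caliper_logit <- 0.2 * sd(logit)
c(ps = caliper_ps, logit = caliper_logit)

# The two scales are linked by the logistic transform
all.equal(unname(ps), unname(plogis(logit)))
```

Because the transform is nonlinear, a fixed width on one scale does not correspond to a fixed width on the other, so the choice of scale changes which pairs fall inside the caliper.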

Multicore Parallel Processing for Large Datasets

I am using matchit to calculate propensity scores for large data (approximately 500 thousand records), and it takes about 2 hours to get results. It would be great if this package supported multicore parallel processing, i.e. using all available processor cores, so that computation time could be reduced significantly.

use caliper but get completely different results than last version

Without a caliper, the results stay the same. But when I ran the same program with the caliper I had selected before, the results changed completely.

threshold in propensity score matching

I am wondering for propensity score matching, what is the threshold to get the matched data?

I understand I could use the matchit function with method = 'nearest' and distance = 'glm', then use match.data to get the matched data.

Correct me if I am wrong. From my understanding, we need to specify a threshold for the propensity score to get the matched data. What is the default threshold in the package? And is it possible to choose the threshold ourselves?
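The role of a threshold can be shown with a toy greedy matcher in base R. This is an illustration only and not MatchIt's algorithm; greedy_match() and the scores are invented. Without a caliper every treated unit receives its closest remaining control, however far away; with a caliper, pairs farther apart than the threshold are dropped.

```r
# Toy greedy 1:1 nearest neighbor matching without replacement on a score.
# Illustration only; not MatchIt's implementation.
greedy_match <- function(score, treat, caliper = Inf) {
  treated  <- which(treat == 1)
  controls <- which(treat == 0)
  m_t <- integer(0)
  m_c <- integer(0)
  for (t in treated) {
    if (length(controls) == 0) break
    d <- abs(score[controls] - score[t])
    j <- which.min(d)
    # With caliper = Inf the closest remaining control is always accepted
    if (d[j] <= caliper) {
      m_t <- c(m_t, t)
      m_c <- c(m_c, controls[j])
      controls <- controls[-j]
    }
  }
  data.frame(treated = m_t, control = m_c)
}

score <- c(0.80, 0.55, 0.30, 0.78, 0.50, 0.10, 0.05)
treat <- c(1, 1, 1, 0, 0, 0, 0)

no_caliper   <- greedy_match(score, treat)
with_caliper <- greedy_match(score, treat, caliper = 0.1)
no_caliper
with_caliper
```

In matchit() the analogous control is the caliper argument, which is not applied unless it is specified.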

Built-In Sensitivity Test

It's difficult to conduct a sensitivity test (Rosenbaum 2007) for MatchIt objects. Among the existing "sensitivity*" packages, only a GitHub version (SensitivityR5) can conduct the test, and apparently only for objects with method = "nearest". It would be great if MatchIt had a function similar to the psens function that rbounds provides for "Matching" objects.

`matchit` doesn't pass k2k param to `cem::cem`

To recreate:

matchit(treat ~ age + educ + black + hispan + married + nodegree
        + re74 + re75, data = lalonde, method = "cem", k2k=T)

When debugging from inside matchit:

Browse[3]> list(...)
$k2k
[1] TRUE

This calls matchit2cem:

Browse[5]> ls()
[1] "data"                "discarded"           "distance"            "is.full.mahalanobis" "k2k.method"          "ratio"               "treat"               "verbose"            
[9] "X"                  
Browse[5]> list(...)
list()

As you can see, we've already lost the extra parameters.
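
The general mechanism behind this kind of loss can be reproduced in base R. The sketch below uses hypothetical function names (inner, outer_drops, outer_forwards) and is not MatchIt's code. It only shows that an argument captured by `...` in a wrapper reaches the inner function only if the wrapper forwards `...` explicitly.

```r
# Inner function: reports the names of any extra arguments it receives
inner <- function(x, ...) names(list(...))

# Wrapper that accepts `...` but does not forward it (extras are lost)
outer_drops <- function(x, ...) inner(x)

# Wrapper that forwards `...` (extras survive)
outer_forwards <- function(x, ...) inner(x, ...)

dropped   <- outer_drops(1, k2k = TRUE)
forwarded <- outer_forwards(1, k2k = TRUE)

dropped    # NULL: the inner function saw no extra arguments
forwarded  # "k2k"
```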

Memory errors/segfault/unsorted double list corrupted errors

I'm attempting to match a dataset with roughly 1,000,000 observations. There are roughly 100,000 treated observations and 900,000 control observations. Unfortunately, I cannot share the data due to the data use agreement. I want to match on the "glm" distance using a handful of covariates (I've tested 1 through 35 and get the same results).

When I run

matchit(treatment ~ age + charlson_index, data = mdata)

It estimates the PS model fine, tells me it is doing NN matching, and then almost immediately crashes with the error message malloc(): unsorted double linked list corrupted. Occasionally, it will return error in nn_matchC invalid type NULL for dimnames. In some instances it crashes telling me Error in nn_matchC cannot get ALTVEC DATAPTR during GC, and (many) recursive gc invocations are printed out.

I have tried this code with this data on a server running CentOS Linux and on two other machines, one running Ubuntu Linux and the other macOS. I've tried this in R 3.5.2 and 4.0.4. I've also tried running inside a Docker container (rocker/verse). In all cases, the result is the same. I have uninstalled MatchIt and reinstalled it from source instead of from the binary files. That also did not fix the problem.

There is nothing wrong with the data that I can tell. If I subdivide mdata into 100 small n = 10,000 mutually exclusive and exhaustive bins, I can successfully run matchit() on each of the 100 bins.

Additionally, I am interested in 5 different treatment comparisons (D1 vs D2, D1 vs no treatment, D2 vs no treatment, D1 vs D3, D2 vs D3). All of the data is from the same database but the data for matching for D2 vs no treatment and D2 vs D3 are the only sets that do not work. D1 vs D2, D1 vs no, and D1 vs D3 all work with matchit() exactly as expected.

optmatch

@alexWhitworth CRAN doesn't like a new submission because MatchIt now requires optmatch, which does not have a FOSS license. Is it possible to move optmatch to the "Suggests" category?

suggestions about improving full matching

Optimal full matching is becoming more and more popular for its unique advantages. However, the matchit function offers only limited full matching. I strongly suggest adding ATE estimation and full matching with a caliper in the next package version. These two features are frequently needed and important in practical work.

Allow matching on list variables

When trying to match using a variable that's a list I get the error invalid type (list).

If list variables were supported using some metric like at least 1 exact match of elements between two lists more powerful analysis could be done.

There are some hacky workarounds I can think of, but would appreciate if this can be supported in a first class way or if there are any commonly-used workarounds someone can share!

Interaction terms are automatically excluded (?) when using newer version

Hi there,

I have tried to implement matching based on a specification which includes interaction terms (age:educ in this case) as follows.

data("lalonde")
m.out1 <- matchit(treat ~ age:educ + I(age^2) + age + educ + race + married + nodegree + re74 + re75, data = lalonde,distance = "glm", link = "linear.probit")
summary(m.out1,standardize=TRUE)

However, after running this code, the covariate balance for age:educ is not shown although this did not happen when using an older version of MatchIt.

Ktak

get_matches() selects wrong rows if row names are non-consecutive numbers

Dear MatchIt Team,

When the input data frame has numeric row names that aren’t consecutive (which can arise if you drop rows with missing data), get_matches() picks up the wrong rows without warning. In the example below, get_matches() picks up 2 nonexistent rows and gets the rest of the rows wrong as well.

Thanks for a very helpful package!
Tim McMurry

library(MatchIt)
set.seed(1234)
df <- data.frame(group = rep(c("A", "B"), times = c(10, 20)),
                 x = rnorm(30) + rep(c(0, 1), times = c(10, 20)))
df$x[c(7, 11, 12)] <- NA #make some missing data

matchit(I(group == "A") ~ x, data = df) #doesn't work because of NAs
dfcomp <- model.frame(group ~ x, data = df) #drop missings, keep rownames
dfcomp <- df[!is.na(df$x), ] #alternative to line above, same problem

mm <- matchit(I(group == "A") ~ x, data = dfcomp) #works
summary(mm)
get_matches(mm, dfcomp) #Picks up nonexistent rows, no warnings

#Correct matches
df[c(rownames(mm$match.matrix), mm$match.matrix[,1]),]
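
The underlying distinction can be seen in base R without MatchIt. This is a minimal sketch with made-up data, not a description of how get_matches() is implemented. After rows are dropped, numeric-looking row names no longer agree with row positions, so indexing by integer and indexing by character name select different rows.

```r
df <- data.frame(x = c(10, 20, 30, 40, 50))

# Drop the second row; row names are kept as "1", "3", "4", "5"
dfcomp <- df[-2, , drop = FALSE]
rownames(dfcomp)

# Integer index: the 3rd and 4th rows by position
by_position <- dfcomp[c(3, 4), "x"]

# Character index: the rows whose row names are "3" and "4"
by_name <- dfcomp[c("3", "4"), "x"]

by_position  # 40 50
by_name      # 30 40

# Resetting the row names makes positions and names agree again
dfreset <- dfcomp
rownames(dfreset) <- NULL
```

Resetting row names before matching, as in the last two lines, is one way to sidestep the mismatch.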

What does Matched (ESS) stand for?

Allow for missing values in variables that are not part of the model

I downloaded the MatchIt package from CRAN (version 3.0.2). In this version, the matchit function apparently does not allow missing values in any variable, even those that are not part of the model (see this stackoverflow answer). What is the reason for this? My dataset contains a lot of 'systematic' NaNs in certain variables that are not used for matching. (In short: I am working with neuroimaging data and I have a column of file paths that point to preprocessed files. Some files still need to be preprocessed, so I do not yet have a value for these rows, though I would like to work on the matching script in the meantime.) Does MatchIt 4.0.0 also have this restriction and, if so, what is the reason for it?

Different results after version change

Hi there -

We have been running the same code with small variations over the course of this year and last, and have begun getting different results when we run one iteration of the code across different computers, since some members of the team updated R and reinstalled MatchIt. The baseline characteristics before and after matching are the same across versions, but the numbers of matched patients differ. When we extract a list of matched patients, its length also differs from the number reported in the MatchIt output.

Is there some adjustment being applied to the patient numbers? We wondered about a weighting. Any ideas about why this is happening would be much appreciated. Thank you!

Augusta

handling of missing data

Currently, one cannot have missing values even in variables that are not used for matching. This should be fixed so that the function only checks the existence of missing values for the variables that are used.
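
The check described here can be sketched in base R as follows. The data frame, the formula, and the column names are hypothetical, and this is not MatchIt's code. It restricts the missing-value check to the variables named in the formula.

```r
# Hypothetical data: `filepath` has missing values but is not in the model
dat <- data.frame(
  treat    = c(1, 0, 1, 0),
  age      = c(30, 41, 25, 37),
  educ     = c(12, 16, 10, 14),
  filepath = c("a.nii", NA, NA, "d.nii"),
  stringsAsFactors = FALSE
)

f <- treat ~ age + educ

# Variables that actually appear in the formula
model_vars <- all.vars(f)

# Completeness judged on model variables only versus on every column
ok_model <- complete.cases(dat[, model_vars])
ok_all   <- complete.cases(dat)

all(ok_model)  # TRUE: nothing is missing among the model variables
sum(ok_all)    # 2: two rows look incomplete only because of `filepath`
```

The same idea works as a workaround on the user side: subset the data to model_vars (plus an ID column) before matching, then merge the other columns back afterwards.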

Issue with optimal matching method

Using the "optimal" method with option ratio = 2 we get this warning:

Warning message:
In optmatch::fullmatch(d, min.controls = ratio, max.controls = ratio,  :
  Without 'data' argument the order of the match is not guaranteed
    to be the same as your original data.

In Ho et al. (2011:11) it is stated:

We conduct 2:1 optimal ratio matching based on the propensity score from the logistic regression.

Whereas, in a 2013 mailing-list post, Kosuke states:

I think that the problem is you have "ratio = 2". optimal matching may not work with that option.

The warning appears when copying and pasting the code exactly from the manual mentioned above, which is downloadable from Gary King's MatchIt site:

library(MatchIt)
data("lalonde")
m.out <- matchit(treat ~ re74 + re75 + age + educ, data = lalonde, 
              method = "optimal", ratio = 2)

This is quite contradictory and creates confusion, e.g., here and here.

Could you please clarify this? Thanks a lot!

match.matrix NULL?

Hi all,

Thanks for an excellent package. I am trying to extract the exact control / treatment matches made through matchit, and I thought I could get this by simply running:

m.out1 <- matchit(...)
attr(m.out1, "match.matrix")

Unfortunately, all the attributes in m.out1 are NULL. Is this expected behavior? Am I doing something horrendously silly? I'm new to R, so any pointers are helpful here.

Thanks!
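
One likely source of the NULL is the difference between list components and attributes in R. The sketch below uses a stand-in object built by hand (the class name "demo" and the values are hypothetical) rather than a real matchit object. It shows that a component stored in a list is reached with $ or [[ ]], while attr() looks only at attributes and returns NULL for it. Another issue on this page accesses the match matrix of a fitted object as mm$match.matrix, which is the same pattern.

```r
# Stand-in for a fitted object: a classed list with a match.matrix component
obj <- structure(
  list(match.matrix = matrix(c("5", "9"), ncol = 1,
                             dimnames = list(c("1", "2"), NULL))),
  class = "demo"
)

from_attr <- attr(obj, "match.matrix")  # NULL: it is not an attribute
from_list <- obj$match.matrix           # the matrix itself

names(obj)       # list components: "match.matrix"
attributes(obj)  # attributes: only names and class
```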

Implementing matching in DiD

I am writing to request your help in identifying the matched pairs in the regression.

I have used the MatchIt package for matching after estimating propensity scores with the CBPS package. My code is as follows:

mit.out<- matchit(tre ~ fitted(fit), method = "optimal",ratio=2, data = data2)
summary(mit.out)
final_matched <- match.data(mit.out)

Sample Sizes:
          Control Treated
All         18412    2450
Matched      4900    2450
Unmatched   13512       0
Discarded       0       0

The matched dataset includes distance, weights, and subclass. I am unsure how to use the matched pairs in my DiD regression. Could you please point me to a webpage or a paper? Thank you.

Weird error on exact matching with nearest neighbor

I'm running into an issue with applying exact matching on some covariates and nearest neighbor matching on others. Here is the call:

data = read.csv("trouble_shooting.csv")
m.out = matchit(treated ~ adjusted_basic_pay + age_range,
                data,
                method = 'nearest',
                distance = 'mahalanobis',
                exact = 'age_range',
                replace = FALSE)

This is the result:

Error in Ops.data.frame(exact[itert, k], exact[clabels, k]) : 
   ‘!=’ only defined for equally-sized data frames

Link to file here: https://www.dropbox.com/s/ukkx7rph7mmcpso/trouble_shooting.csv?dl=0

Missing levels on my covariates

I need some help, please. I used MatchIt and my matching object looks like this:

method: 1:1 optimal pair matching
distance: Propensity score
- estimated with logistic regression
number of obs.: 10444 (original), 7774 (matched)
target estimand: ATT
covariates: factor(year), factor(Site), factor(Breeder), Region

I'm struggling to interpret the coefficients from coeftest() because for year, site, and breeder, the computation is dropping more than one level. More precisely, for year, which has 15 different years, it is dropping 2; for site, which has 31 different sites, it is dropping 5. The factor breeder has 3 levels and it's dropping just 1, which I think makes sense.

Thanks in advance for your help.
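
One possible explanation, offered as a guess since the data are not shown, is perfect collinearity between factors (for example, each site belonging to exactly one region). In that case lm() cannot estimate every dummy coefficient and reports NA for the redundant ones, and those terms are typically missing from coefficient tables. The base-R sketch below uses made-up data and variable names to show the effect.

```r
# Hypothetical data: each site lies in exactly one region,
# so the region dummy is a sum of site dummies
d <- data.frame(
  y      = c(1, 2, 3, 4, 5, 6, 7, 8),
  site   = factor(rep(c("s1", "s2", "s3", "s4"), each = 2)),
  region = factor(rep(c("north", "south"), each = 4))
)

fit <- lm(y ~ site + region, data = d)

coefs <- coef(fit)
coefs                          # regionsouth is NA (aliased)
aliased <- names(coefs)[is.na(coefs)]
aliased

# alias() reports which coefficients are linear combinations of others
alias(fit)
```

Running alias() on the outcome model, or cross-tabulating the factors with table(), would confirm or rule out this explanation for the real data.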

Issue with genetic matching method

In my own analysis, and even when running demo("genetic"), the following warning occurs (at the very end, after identifying variables):

Warning messages: ----
#   1: In Matching::GenMatch(tt, cbind(dd, xx), M = ratio, ...) :
#   The key tuning parameters for optimization were are all left at their default
#   values.  The 'pop.size' option in particular should probably be increased for
#   optimal results.  For details please see the help page and
#   http://sekhon.berkeley.edu/papers/MatchingJSS.pdf

The issue is also mentioned in a mailing list, without an answer.

Since the issue also occurs in the demo, it might be some kind of bug.
Could you please check that? Thanks!