
brglm2


brglm2 provides tools for estimation and inference in generalized linear models using various methods for bias reduction. brglm2 supports all generalized linear models available in R, and additionally provides methods for multinomial logistic regression (nominal responses), adjacent category models (ordinal responses), and negative binomial regression (for potentially overdispersed count responses).

Reduction of estimation bias is achieved by solving either the mean-bias reducing adjusted score equations of Firth (1993) and Kosmidis & Firth (2009) or the median-bias reducing adjusted score equations of Kenne Pagui et al. (2017), or through direct subtraction of an estimate of the bias of the maximum likelihood estimator from the maximum likelihood estimates, as prescribed in Cordeiro & McCullagh (1991). Kosmidis et al. (2020) provide a unifying framework and algorithms for mean and median bias reduction in the estimation of generalized linear models.

In the special case of generalized linear models for binomial and multinomial responses (both ordinal and nominal), the adjusted score equations return estimates with improved frequentist properties that are also always finite, even in cases where the maximum likelihood estimates are infinite (e.g. under complete or quasi-complete separation). See Kosmidis & Firth (2021) for a proof of the latter result in the case of mean bias reduction for logistic regression (and, more generally, for binomial-response models where the likelihood is penalized by a power of Jeffreys' invariant prior).

The core model fitters are implemented by the functions brglm_fit() (univariate generalized linear models), brmultinom() (baseline category logit models for nominal multinomial responses), bracl() (adjacent category logit models for ordinal multinomial responses), and brnb() (negative binomial regression).
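For instance, a quick sketch of brnb() on the quine data on school absenteeism shipped with MASS (the argument names follow ?brnb; the choice of covariates here is purely for illustration):

```r
library("brglm2")

## quine: days absent from school, with categorical covariates (from MASS)
data("quine", package = "MASS")

## negative binomial regression with mean bias reduction;
## 'transformation' controls the parameterization of the dispersion
fit_nb <- brnb(Days ~ Sex + Age + Eth + Lrn, data = quine,
               link = "log", transformation = "identity", type = "AS_mean")
coef(fit_nb)
```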

Installation

Install the current version from CRAN:

install.packages("brglm2")

or the development version from GitHub:

# install.packages("remotes")
remotes::install_github("ikosmidis/brglm2", ref = "develop")

Example

Estimation of binomial-response GLMs with separated data

Below we follow the example of Heinze and Schemper (2002) and fit a logistic regression model using maximum likelihood (ML) to analyze data from a study on endometrial cancer (see ?brglm2::endometrial for details and references).

library("brglm2")
data("endometrial", package = "brglm2")
modML <- glm(HG ~ NV + PI + EH, family = binomial("logit"), data = endometrial)
summary(modML)
#> 
#> Call:
#> glm(formula = HG ~ NV + PI + EH, family = binomial("logit"), 
#>     data = endometrial)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -1.50137  -0.64108  -0.29432   0.00016   2.72777  
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)    4.30452    1.63730   2.629 0.008563 ** 
#> NV            18.18556 1715.75089   0.011 0.991543    
#> PI            -0.04218    0.04433  -0.952 0.341333    
#> EH            -2.90261    0.84555  -3.433 0.000597 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 104.903  on 78  degrees of freedom
#> Residual deviance:  55.393  on 75  degrees of freedom
#> AIC: 63.393
#> 
#> Number of Fisher Scoring iterations: 17

The ML estimate of the parameter for NV is in fact infinite, as can be quickly verified using the detectseparation R package:

# install.packages("detectseparation")
library("detectseparation")
update(modML, method = "detect_separation")
#> Implementation: ROI | Solver: lpsolve 
#> Separation: TRUE 
#> Existence of maximum likelihood estimates
#> (Intercept)          NV          PI          EH 
#>           0         Inf           0           0 
#> 0: finite value, Inf: infinity, -Inf: -infinity

The reported, apparently finite estimate 18.186 for NV is merely an artifact of false convergence of the iterative estimation procedure for ML. The same is true for the estimated standard error, and hence the value 0.011 of the z-statistic cannot be trusted for inference on the size of the effect of NV.

As mentioned earlier, many of the estimation methods implemented in brglm2 not only return estimates with improved frequentist properties (e.g. asymptotically smaller mean and median bias than what ML typically delivers), but also estimates and estimated standard errors that are always finite in binomial (e.g. logistic, probit, and complementary log-log regression) and multinomial regression models (e.g. baseline category logit models for nominal responses, and adjacent category logit models for ordinal responses). For example, the code chunk below refits the model on the endometrial cancer study data using mean bias reduction.

summary(update(modML, method = "brglm_fit"))
#> 
#> Call:
#> glm(formula = HG ~ NV + PI + EH, family = binomial("logit"), 
#>     data = endometrial, method = "brglm_fit")
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.4740  -0.6706  -0.3411   0.3252   2.6123  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.77456    1.48869   2.535 0.011229 *  
#> NV           2.92927    1.55076   1.889 0.058902 .  
#> PI          -0.03475    0.03958  -0.878 0.379914    
#> EH          -2.60416    0.77602  -3.356 0.000791 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 104.903  on 78  degrees of freedom
#> Residual deviance:  56.575  on 75  degrees of freedom
#> AIC:  64.575
#> 
#> Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
#> Number of Fisher Scoring iterations: 6

A quick comparison of the output from mean bias reduction to that from ML reveals a dramatic change in the z-statistic for NV, now that estimates and estimated standard errors are finite. In particular, the evidence against the null hypothesis that NV does not contribute to the model, in the presence of the other covariates, is now stronger.

See ?brglm_fit and ?brglm_control for more examples and the other estimation methods for generalized linear models, including median bias reduction and maximum penalized likelihood with Jeffreys’ prior penalty. Also do not forget to take a look at the vignettes (vignette(package = "brglm2")) for details and more case studies.
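As a quick sketch of how the alternative methods are requested, the estimation method is selected through the type argument (the type values below are among those documented in ?brglm_control; this reuses the endometrial fit from the example above):

```r
library("brglm2")
data("endometrial", package = "brglm2")
modML <- glm(HG ~ NV + PI + EH, family = binomial("logit"), data = endometrial)

## median bias reduction
modBRmedian <- update(modML, method = "brglm_fit", type = "AS_median")

## maximum penalized likelihood with Jeffreys'-prior penalty
modJeffreys <- update(modML, method = "brglm_fit", type = "MPL_Jeffreys")

## both return finite estimates for NV, in contrast to ML
coef(modBRmedian)["NV"]
coef(modJeffreys)["NV"]
```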

Improved estimation of the exponential of regression parameters

See also ?expo for a method to estimate the exponential of regression parameters, such as odds ratios from logistic regression models, while controlling for other covariate information. Estimation can be performed using maximum likelihood or various estimators with smaller asymptotic mean and median bias, which are also guaranteed to be finite even if the corresponding maximum likelihood estimates are infinite. For example, modML is a logistic regression fit, so the exponential of each coefficient is an odds ratio that controls for the other covariates. To estimate those odds ratios using the correction* method for mean bias reduction (see ?expo for details) we do:

expoRB <- expo(modML, type = "correction*")
expoRB
#> 
#> Call:
#> expo.glm(object = modML, type = "correction*")
#> 
#>              Estimate Std. Error     2.5 %  97.5 %
#> (Intercept) 20.671826  33.136511  0.893142 478.451
#> NV           8.496974   7.825240  1.397511  51.662
#> PI           0.965089   0.036795  0.895602   1.040
#> EH           0.056848   0.056344  0.008148   0.397
#> 
#> 
#> Type of estimator: correction* (explicit mean bias correction with a multiplicative adjustment)

The odds ratio between the presence of neovasculation and high histology grade (HG) is estimated to be 8.497, while controlling for PI and EH. So, for each value of PI and EH, the estimated odds of high histology grade are about 8.5 times higher when neovasculation is present. An approximate 95% interval for that odds ratio is (1.4, 51.7), providing evidence of an association between NV and HG while controlling for PI and EH. Note that the maximum likelihood estimate of the odds ratio is not as useful as the correction* estimate, because it is +∞ with an infinite standard error (see the previous section).

Solving adjusted score equations using quasi-Fisher scoring

The workhorse function in brglm2 is brglm_fit (or, equivalently, brglmFit if you like camel case), which, as in the example above, can be passed directly to the method argument of the glm function. brglm_fit implements a quasi-Fisher scoring procedure whose special cases result in a range of explicit and implicit bias reduction methods for generalized linear models (see the iteration vignette for details). Bias reduction for multinomial logistic regression (nominal responses) can be performed using the function brmultinom, and for adjacent category models (ordinal responses) using the function bracl. Both brmultinom and bracl rely on brglm_fit.
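As a minimal, self-contained sketch of the two multinomial fitters (the data frame below is synthetic and purely for illustration):

```r
library("brglm2")

## synthetic three-category response with two continuous covariates
set.seed(1)
dat <- data.frame(
  y  = factor(sample(c("a", "b", "c"), 150, replace = TRUE)),
  x1 = rnorm(150),
  x2 = rnorm(150)
)

## baseline category logit model (nominal responses), mean bias reduction
fit_nom <- brmultinom(y ~ x1 + x2, data = dat, ref = 1, type = "AS_mean")

## adjacent category logit model (ordinal responses)
fit_ord <- bracl(y ~ x1 + x2, data = dat, type = "AS_mean")

coef(fit_nom)
```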

The iteration vignette and Kosmidis et al. (2020) describe the iteration and give mathematical details of the bias-reducing adjustments to the score functions for generalized linear models.

The classification of bias reduction methods into explicit and implicit is as given in Kosmidis (2014).

References and resources

brglm2 was presented by Ioannis Kosmidis at the useR! 2016 international conference at Stanford University on 16 June 2016. The presentation was titled “Reduced-bias inference in generalized linear models”.

Motivation, details and discussion on the methods that brglm2 implements are provided in

Kosmidis, I, Kenne Pagui, E C, Sartori N. (2020). Mean and median bias reduction in generalized linear models. Statistics and Computing 30, 43–59.

Code of Conduct

Please note that the brglm2 project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


brglm2's Issues

brmultinom errors

I am attempting to run a multinomial regression that has 3 levels of the categorical DV (reference category = 1) with a correction for data separation. I continue to receive an error that package ‘brglmfit’ is not available (for R version 3.6.1) and have attempted to run the analysis with package brglm2 instead. However, now with the below pasted syntax, I am getting the following error: Error in eval(extras, data, env) : object 'Freq' not found.

When I remove the weights = Freq syntax, the analysis simply doesn't run and does not give an error message.

brmultinomModel <- brmultinom(OCDPTSD3 ~ PCLintrusion+PCLavoidance+PCLcognitions+PCLarousal+DOCScontam+DOCSresponsibility+
DOCSunacceptablethts+DOCSjustright+LECSexAssault+LECCausedHarm, weights = Freq,
data = OCD_PTSD_valid_n_595_1, type = "AS_mean", ref = 1)

OCD-PTSD valid n = 595_1.xlsx

Thank you in advance for your help!

brglm vs. brglm2: brglmFit did not converge

Hello Dr. Kosmidis,

Is the algorithm behind type = "AS_mean" in brglm2 the same as the one in brglm (the pseudo-data method)?
I ask because when I used brglm the results were always finite. However, I get the warning "brglmFit did not converge" when I use brglm2 with "AS_mean".
I hope to understand what is happening. Thank you!

"detect_separation" failed

glm(Histology ~ Cigarette + med1 * med2 * med3 * med4 + Age + Gender + Stage + Race, family = binomial("logit"), data = data2, method = "detect_separation")

gives:

Error in separator(x = x, y = y, linear_program = control$linear_program, : unexpected result from lpSolveAPI for primal test

data2.zip

step() function from the stats package does not pick the model with the lowest AIC

Hello,

I fit a model with the glm function (setting method="brglmFit"). When I implemented the step function from the stats package, the trace showed that the model with the lowest AIC value is not chosen. I do not want to share the data but I can share the trace. Please let me know if there is more information you need.

Bug_AIC

anti-logit of "linear.predictor" does not return probability

Hi. Could you please tell me what is meant by the "linear.predictor" slot in the brmultinom object and its summary. When I reconstruct the linear predictors by hand and take the anti-logit, I get the probabilities I used to construct the data. I cannot tell what the linear predictors in the brmultinom object are. thanks.

Rebecca

brglm2 algorithm does not converge

Hi Dr. Kosmidis,

I get two warning messages when I fit a logistic regression with one covariate and no intercept:
1: brglmFit: algorithm did not converge
2: brglmFit: fitted probabilities numerically 0 or 1 occurred
I have no problem fitting a standard logistic regression, but the problem appears when I use type = "MPL_Jeffreys".

It is just a dataset with 50 samples, which looks like this: [image attachment]

I hope to understand what might be happening here. Thank you.

Support for quasi... families?

I was just wondering if it would also be possible to support quasipoisson & quasibinomial? Or is there any currently supported method to take into account overdispersion in binomial GLMs?

Error when control is specified explicitly

library("brglm2")      
     data("anorexia", package = "MASS")
     anorexML <- glm(Postwt ~ Prewt + Treat + offset(Prewt), family = gaussian, data = anorexia)
     update(anorexML, method = "brglmFit", type = "correction", control = brglm_control(trace = 1))
#> Error in brglmControl(epsilon = 1e-06, maxit = 100, trace = 1, response_adjustment = NULL, : unused arguments (Trans = expression(dispersion), inverseTrans = expression(transformed_dispersion))

But

library("brglm2")      
     data("anorexia", package = "MASS")
     anorexML <- glm(Postwt ~ Prewt + Treat + offset(Prewt), family = gaussian, data = anorexia)
     update(anorexML, method = "brglmFit", type = "correction", trace = 1)
#> Coefficients update: Outer/Inner iteration:  001/001
#> Dispersion update:   Outer iteration:     001 
#> max |step|: 0.141944      max |gradient|: 0.002168 
#> Coefficients update: Outer/Inner iteration:  001/001
#> Dispersion update:   Outer iteration:     001 
#> max |step|: 0.012124      max |gradient|: 0.000107
#> 
#> Call:  glm(formula = Postwt ~ Prewt + Treat + offset(Prewt), family = gaussian, 
#>     data = anorexia, method = "brglmFit", type = "correction", 
#>     trace = 1)
#> 
#> Coefficients:
#> (Intercept)        Prewt    TreatCont      TreatFT  
#>     49.7711      -0.5655      -4.0971       4.5631  
#> 
#> Degrees of Freedom: 71 Total (i.e. Null);  68 Residual
#> Null Deviance:       4525 
#> Residual Deviance: 3311  AIC: 490

The variables responsible for the separation are not shown

I want to test for separation, so I used this code:

glm(MODEL_FORM_GLM, data=train_std, family = binomial("logit"),method = "detect_separation")

As you can see below, this does not show the variables responsible for the separation.

Implementation: | Linear program: | Purpose:
Separation: TRUE
Warning message:
'detect_separation' will be removed from 'brglm2' at version 0.8. A new version of 'detect_separation' is now maintained in the 'detectseparation' package.

"likelihood ratio test" with output from brmultinom

After fitting two models with brmultinom using type = "AS_mixed" to handle quasi-complete separation, is it appropriate for me to test for differences in the same manner in which I would construct a LRT?

For example, I fit models "BRM0" and "BRM", used the anova command and compared the results to a chi-squared distribution as follows:

anova(BRM0,BRM)

Resid. Df Resid. Dev Df Deviance

1 1748 592.95

2 1746 593.67 2 -0.72255

pchisq(-0.72255,df=2,lower.tail = FALSE)

Could you kindly know if this is appropriate, and if so, what to call it?

Thank you!!

Rebecca

random effects in brmultinom?

Hi. I am trying unsuccessfully to add a random effect to a model in brmultinom. I am not sure if the package does not support random effects, or if I am just coding it incorrectly. I am using the following command:

Multi15_250 <- brmultinom(Behavior ~ PrevBehav + (1 | RandWal), data = BfrAftr15_25, ref = "Swm", type = "AS_mixed")

I get the following error:

Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(1, RandWal) : ‘|’ not meaningful for factors

Thank you!

Rebecca

confidence intervals for brmultinom?

I am using brmultinom with a small data set (120 cases). Is there a function or command to find the confidence intervals with the brmultinom function? I tried using the confint function but it gives no real output or error message.

All predictors are binary categorical data, except for "Number of Births per Year". The dependent outcome variable "Financial Sustainability" has 3 categories (best, good, poor)
###Set working directory
setwd("C:/Users/Melanie/OneDrive")

###install and library brglm2 package for analysis
install.packages("brglm2")
library(brglm2)

###install and library readxl package to read in data
install.packages("readxl")
library(readxl)

###Read in the data, make the outcome a factor
birthcenter<-read_excel("birthcenterdata")
birthcenter$Financial Sustainability<-as.factor(birthcenter$Financial Sustainability)
bccomplete<-birthcenter[complete.cases(birthcenter),]

Model D: uses "1" as baseline to compare "good" to "poor" and "excellent" to "poor."

###Model D shows each estimated coefficient and associated standard errors
modelD <- brmultinom(Financial Sustainability ~ Number of Births per Year +
Liability Insurance+ Accreditation +Urban/Suburban +
For Profit + Over 3 years +
Licensure Availability,data=bccomplete,ref=1)

summarizes the model

summary(modelD)

Odds ratios

exp(coefficients(modelD))

confidence intervals

confint(modelD,level = 0.95)

confint(modelD,level = 0.95)
2.5 % 97.5 %

Thank you for any / all thoughts + suggestions.

family = "gaussian" does not seem to work

First of all, thank you for your great work! I have tried your package. It seems to work for glm with family = "binomial". For family = "gaussian", however, almost the same coefficients are generated when comparing a standard glm fit and one with method = "brglmFit", type = "correction"; the difference in the coefficients is less than 10^-13. I guess there is something wrong with the link function and the calculation of the coefficients?

Bias reduced logistic regression model estimation with one parameter produces NA

PZIET-brglm2_bug.txt

When using brglmFit() for a logistic regression with one coefficient, NA is returned. However using brglm() from the brglm package returns a real valued coefficient. The code snippet below illustrates this issue,

# Author: Patrick Zietkiewicz
# File purpose: Investigate possible bug in brglm2 package
library(brglm)
library(brglm2)
dat <- MASS::Pima.tr
X <- as.matrix(dat[,1:7])
Y <- as.matrix(dat[,'type'] == 'Yes', 1, 0)
# fit models
res_brglm <- brglm(Y ~ -1 + X[,'bmi'],
                 family = binomial(link='logit'),
                 pl = FALSE)
res_brglm2 <- brglmFit(x = X[,'bmi'], y = Y, 
        family = binomial(link='logit'),
        intercept = FALSE)
cat('brglm:', res_brglm$coefficients,'\n')
cat('brglm2:', res_brglm2$coefficients,'\n')
cat('done')

Please see the attached txt file for a copy of the code snippet above

Convergence problem

Hi, thanks for your work on this package. I am by no means an expert but I tried to read up on penalized logistic regression and encountered several times the claim that in principle Firth regression has better properties when it comes to convergence compared to standard GLM with logit link. I was therefore surprised to see the exact reverse on my data. I was hoping you could help me understand why this happens.

I narrowed down the problem in my regressions to a single variable. I have a longitudinal dataset where I use the year of the observation as a categorical variable, so I get intercepts for each year.

The data is available here: https://github.com/michalovadek/misc/blob/main/data.RDS


glm(dv ~ year, data = data, family = binomial()) # converges
glm(dv ~ year, data = data, family = binomial(), method = "brglmFit") # does not converge

Any advice would be much appreciated

AIC appears to be -2*loglik without parameter penalty

Is there a chance you forgot to add the parameter penalty when calculating AIC? For example, in a model I fit called Multi15_25.1, that I summarized in the object SmryMulti15_25.1, when I call SmryMulti15_25.1$AIC , it returns 543.0035. When I call SmryMulti15_25.1$logLik, it returns 'log Lik.' -271.5017 (df=6). And -2*(-271.5017) = 543.0035.

Also, can you please tell me what is the number the package calls "aic"? It is different than what the package calls "AIC."

Thank you.

Rebecca

error when trying brnb for negative binomial

Hello,

I was excited to find this package to solve my complete separation problem. My count data is overdispersed, so I need to use the negative binomial family, which I understand is fitted using brnb(). I keep getting this error: NA/NaN/Inf in foreign function call (arg 1), even though I am specifying all the necessary terms:

salmonella_fm <- production ~ Treat_comboSource_Salinity +Rarity+ Region+Treat_comboTurion_length_num+replanted.x

N_T_6_2020_nbn <- brglm2::brnb(salmonella_fm,
data=turion_harvest_2021_all_gen_initial_turion,link = "log",transformation = "inverse", type = "AS_mean")

I've noticed that the error disappears if I use type="ML" but all other types trigger the error.

Could you please advise?

Thanks!

-Carrie

install is failing

devtools::install_github("ikosmidis/brglm2")

Gives:

Error in namespaceExport(ns, exports) : undefined exports: adjustedz.glm
Error: loading failed
Execution halted
ERROR: loading failed for 'i386', 'x64'

brglmFit() and brglm_fit() broken for hello world

data = tibble(x = rnorm(n = 100), y = rbinom(n = 100, size  = 1, prob = 0.3))
brglmFit(y ~ x, data)
# Error in dim(data) <- dim : 
#  invalid first argument, must be vector (list or atomic)
brglm_fit(y ~ x, data)
# Error in dim(data) <- dim : 
#  invalid first argument, must be vector (list or atomic)
dim(data)

brglmFit(y ~ x, data, family = "binomial")
# Error: $ operator is invalid for atomic vectors
brglm_fit(y ~ x, data, family = "binomial")
# Error: $ operator is invalid for atomic vectors

# same call works in v1 even though v2 indicates it's ready to supersede v1
brglm(y ~ x, data, family = "binomial")

Call:  brglm(formula = y ~ x, family = "binomial", data = data) 

Coefficients:
(Intercept)            x  
   -0.87187     -0.08554  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Deviance:	    120.2965 
Penalized Deviance: 114.2566 	AIC: 124.2965
