The logitr from jhelvy

Predicting out-of-sample tasks of known individuals - Mixed logit on panel data

First of all, thanks a lot for this contribution. I usually use Python, but logitr managed to solve some convergence issues I was having with xlogit on panel data.

My current question/issue revolves around the following: I have estimated a mixed logit model on a panel of individuals in a set of tasks/problems. Now suppose I have a separate panel data set containing the same individuals on which I would like to make predictions. Using the (unconditional) estimated distribution over the parameters to make predictions is then not optimal, since we already have additional information on these individuals from their prior choices. To be specific, let $g(\beta|\theta)$ be the population distribution of the parameters $\beta$, and let $L(i,t|\beta)=\frac{e^{\beta'X_{it}}}{\sum_j e^{\beta'X_{jt}}}$ be the probability of choosing $i$ in task $t$ conditional on $\beta$. Then, by Bayes' rule, the distribution over parameters conditional on having observed a sequence of choices $y$ is given by:

$$h(\beta|y,\theta)=\frac{P(y|\beta)g(\beta|\theta)}{P(y|\theta)}$$

where $P(y|\beta)= L(y_1,1|\beta)\times\dots\times L(y_T,T|\beta)$ is the probability of the individual's sequence conditional on $\beta$ and $P(y|\theta)=\int P(y|\beta)g(\beta|\theta)d\beta$ is the unconditional probability. Based on this, an individual's probability of choosing $i$ in out-of-sample task $T+1$ can be estimated by simulation, using draws $\beta^r$ from $g(\beta|\theta)$:

$$\tilde{P}(i, T+1|y,\theta)=\frac{\sum_{r}L(i, T+1|\beta^r)P(y|\beta^r)}{\sum_{r}P(y|\beta^r)}$$

I should note that the above notation is from Revelt & Train (2000): "Customer-Specific Taste Parameters and Mixed Logit: Households' Choice of Electricity Supplier."
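To make the simulated formula above concrete, here is a minimal base R sketch. It does not use logitr at all: the attribute matrices, the past choices, and the values of `mu` and `sigma` (standing in for an estimated $\theta$ with independent normal parameters) are all made up, and the helper names `logit_probs` and `conditional_probs` are hypothetical.

```r
# Sketch of the conditional (posterior-weighted) prediction in base R.
# All data and parameter values below are made up for illustration.
set.seed(42)

# Logit probabilities for one task: X is a J x K attribute matrix.
logit_probs <- function(X, beta) {
  v <- as.vector(X %*% beta)
  ev <- exp(v - max(v))  # subtract the max for numerical stability
  ev / sum(ev)
}

# Conditional probability of each alternative in a new task, given the
# individual's past choices y in the tasks stored in X_list.
conditional_probs <- function(X_list, y, X_new, draws) {
  R <- nrow(draws)
  # P(y | beta^r): product over tasks of the chosen alternative's probability
  p_seq <- vapply(seq_len(R), function(r) {
    prod(mapply(function(X, yt) logit_probs(X, draws[r, ])[yt], X_list, y))
  }, numeric(1))
  # L(i, T+1 | beta^r) for every alternative i: a J x R matrix
  L_new <- vapply(seq_len(R), function(r) logit_probs(X_new, draws[r, ]),
                  numeric(nrow(X_new)))
  as.vector(L_new %*% p_seq) / sum(p_seq)
}

# Hypothetical setup: K = 2 attributes, J = 3 alternatives, 5 past tasks
K <- 2; J <- 3; n_tasks <- 5; R <- 500
mu <- c(-1, 0.5); sigma <- c(0.5, 0.3)  # assumed estimates of theta
draws <- sapply(seq_len(K), function(k) rnorm(R, mu[k], sigma[k]))  # R x K
X_list <- replicate(n_tasks, matrix(rnorm(J * K), J, K), simplify = FALSE)
y <- sample(J, n_tasks, replace = TRUE)  # observed past choices
X_new <- matrix(rnorm(J * K), J, K)

p_cond <- conditional_probs(X_list, y, X_new, draws)
```

The weights `p_seq / sum(p_seq)` play the role of the conditional distribution $h(\beta|y,\theta)$ over the draws, so the same weights could also be used to compute an individual's conditional mean of $\beta$.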

From my (limited) understanding of R, your predict method uses the population distribution over parameters to make predictions and does not allow for a panelID option to use the conditional distribution, is that correct? If so, do you know of any way I could use logitr to (1) derive the conditional distribution for each individual, and (2) make predictions based on this conditional distribution?

On an unrelated note, I think I have spotted two bugs:

If I estimate a multinomial logit using a single parameter, I get the following error when executing the summary method if I specify a clusterID:

Note that it works for two or more parameters. Furthermore, the summary method also works for a single parameter if I leave out clusterID.

If I estimate a mixed logit using a single parameter, I get the following error in the estimation if I specify clusterID:

Note that the estimation works for two or more parameters. The estimation also works for a single parameter if I leave out clusterID.

Many thanks in advance for your time.

TODO

I'm just collecting some todo items here from existing issues:

Write a function to handle encoding data with an outside good (#52).
Revise encoding to handle effects-coded data (#46).
Restrict WTP space scale parameter to only be log-normal or censored-normal to force positivity (see this comment).
Revise how the number of draws is determined based on the model (see this comment and also this SO post).
Integrate xlogit as an option for supporting large N draws and speed (#54).

Improve obsID error messaging

The obsID variable will cause an error if it's not a perfectly sequentially increasing numeric vector. This seems overly restrictive as it just needs to identify unique observations. For example, this works:

library(logitr)

head(yogurt)

model <- logitr(
    data    = yogurt,
    outcome = "choice",
    obsID   = "obsID",
    pars    = c("price", "feat", "brand")
)

But now if I modify a single observation ID to a totally different number (that is not in conflict with others) it errors:

yogurt[which(yogurt$obsID == 2000),]$obsID <- 5000

model <- logitr(
    data    = yogurt,
    outcome = "choice",
    obsID   = "obsID",
    pars    = c("price", "feat", "brand")
)

Error in checkRepeatedIDs("obsID", obsID, reps) : 
  The 'obsID' variable provided has repeated ID values.

This is a pretty misleading error because there actually aren't repeated ID values, and it's also not clear what "repeated" means (it's in a long form data structure, so there are repeated ID numbers across rows in the same observation...but that's what is expected).

This fixes the problem:

yogurt$obsID <- rep(seq(length(unique(yogurt$obsID))), each = max(yogurt$alt))

But automating that kind of over-writing is not so trivial because some data sets may not have symmetry in the number of alternatives per choice observation. It would be better to use the reps vector to create new observation IDs internally and then replace them post-estimation with the original ones. If that is done, then this problem will never occur.

But the error message should still be updated nonetheless to clarify what is meant by "repeated".
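The internal re-coding suggested above could look roughly like the following base R sketch. It is not logitr's actual implementation: the helper `recode_obs_id` is hypothetical, and it assumes that the rows of each observation are adjacent in the long-format data.

```r
# Recode arbitrary observation IDs to 1, 2, 3, ... based on consecutive runs.
# Assumes rows belonging to the same observation are adjacent (long format).
recode_obs_id <- function(obs_id) {
  runs <- rle(as.character(obs_id))
  list(
    new_id   = rep(seq_along(runs$lengths), times = runs$lengths),
    original = runs$values,  # original ID for each new ID, for mapping back
    reps     = runs$lengths  # number of alternatives per observation
  )
}

# Toy example with a non-sequential ID and unequal numbers of alternatives
obs_id <- c(1, 1, 1, 5000, 5000, 3, 3, 3, 3)
recoded <- recode_obs_id(obs_id)
recoded$new_id
#> [1] 1 1 1 2 2 3 3 3 3

# Map back to the original IDs after estimation (returned as character)
restored <- recoded$original[recoded$new_id]
```

With this approach, a genuinely repeated ID (the same value appearing in two non-adjacent runs) would show up as a duplicate in `recoded$original`, which would allow a more specific error message.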

Incorporate proper panel log-likelihood calculations

Integration with xlogit

Following the conversation in #53, it'd be great to be able to call {xlogit} (Python) from {logitr} to access the estimation speed and capabilities that {xlogit} has to offer. In particular, {logitr} (and basically all other packages) struggles to estimate mixed logit models with a large number of draws (more than about 1,000), but {xlogit} can handle orders of magnitude more draws very quickly due to its use of GPUs. I imagine this could be implemented via the {reticulate} package, with an additional argument like xlogit = TRUE to trigger the use of {xlogit} to estimate the model.

Feature enhancement: exporting to latex logitr results

Dear Helvy,
Thanks a lot for your great work and package! It is really helpful!

This may be a trivial question, but is there a way to easily extract coefficients and standard errors from logitr models and export them to LaTeX, for instance to create tables with stargazer? It would be helpful for comparing multiple models.

Thanks a lot in advance for your help, and apologies if this feature is already available!
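One generic way to get at this, sketched below, is to build the table from the `coef()` and `vcov()` methods of the fitted model. The helpers `coef_table` and `to_latex_rows` are hypothetical, and a base R `glm()` fit is used purely as a stand-in so that the example runs without extra packages; the assumption is that a logitr model object could be passed in the same way.

```r
# Build a coefficient table from any fitted model with coef() and vcov() methods.
coef_table <- function(model, digits = 3) {
  est <- coef(model)
  se  <- sqrt(diag(vcov(model)))
  data.frame(
    term      = names(est),
    estimate  = round(unname(est), digits),
    std_error = round(unname(se), digits),
    stringsAsFactors = FALSE
  )
}

# Format the table rows as LaTeX (to be wrapped in a tabular environment).
to_latex_rows <- function(tab) {
  paste0(tab$term, " & ", tab$estimate, " & (", tab$std_error, ") \\\\")
}

# Stand-in model so that the example runs with base R only; with logitr
# one would pass the object returned by logitr() instead (assumption).
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
tab  <- coef_table(fit)
rows <- to_latex_rows(tab)
```

Term names containing characters such as underscores would still need escaping before being pasted into a LaTeX document.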

Error in X_chosen[data$obsID, ] : subscript out of bounds

Hello,
It is me again!
Apologies for all these comments, but I really like your program and I am using it on a number of datasets, which makes me run into a number of errors.

I am estimating a basic model without clustered standard errors on a dataset of more than 3,500 observations.

mnl_pref <- logitr(
  data           = dat_transformed,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("wg", myvars),
  modelSpace     = "pref",
  panelID        = "wrkid",
  numMultiStarts = 100)
# and I get the following error
Error in X_chosen[data$obsID, ] : subscript out of bounds

Again, I have looked at my tskID variable, and it looks correct.

Any idea what may be causing this error?
Thanks a lot for all of your help and support! You are really making a great contribution to me and to the community in general!

Release logitr 0.8.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Allow simulateShares() to make predictions on multiple alternatives

Right now, simulateShares() takes a data frame defining a single choice scenario for which to predict shares, but it could pretty easily be extended to take a data frame defining multiple sets of alternatives. That could make it much easier for users to predict a variety of scenarios all at once without needing to run a loop.

Parallelize mixed logit calculation and/or multistart runs for all models

For us to keep in mind - something I'll definitely be working on over the next month or so.

Release logitr 0.6.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Add count by obsID in summary output

Useful for a quick check that the obsID variable is correctly formatted
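Until such a count is part of the summary output, a similar check can be done by hand. The sketch below uses made-up long-format data with hypothetical column names `obsID` and `choice`, and flags observations that do not contain exactly one chosen alternative.

```r
# Toy long-format data: one row per alternative, `choice` marks the chosen one
df <- data.frame(
  obsID  = c(1, 1, 1, 2, 2, 2, 3, 3),
  choice = c(0, 1, 0, 1, 0, 0, 0, 0)
)

# Number of rows (alternatives) and number of chosen rows per observation
n_alts   <- tapply(df$choice, df$obsID, length)
n_chosen <- tapply(df$choice, df$obsID, sum)
check <- data.frame(
  obsID    = names(n_alts),
  n_alts   = as.vector(n_alts),
  n_chosen = as.vector(n_chosen),
  stringsAsFactors = FALSE
)

# Observations that do not have exactly one chosen alternative
bad <- check$obsID[check$n_chosen != 1]
bad
#> [1] "3"
```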

Inconsistent Estimator?

Hi, I use your package frequently to estimate WTP models and saw this post today: https://stats.stackexchange.com/questions/624305/mlogit-logitr-packages-fail-to-recover-true-estimates-of-mixed-logit-random-co
Can you please comment on that?

Add support for clustered standard errors

missing z-value in logitr of preference space model

Hi,
I'm a researcher in South Korea working on a choice experiment about green infrastructure.
When I found the R package 'logitr', I was very excited to use it for estimating WTP and preference models.
Now I'm stuck on this problem of missing z-values.
How could I figure out the errors in my data and code?
A tibble: 1,498 × 19
      id   idx  Q8tt price choice   alt   age   reg
 1     1  1127     0  5000      0     1     3     7
 2     1  1127     0  2000      1     2     3     7
 3     2   412     0  5000      0     1     4     2
 4     2   412     0  1000      1     2     4     2
 5     3   418     0  5000      0     1     4     2
 6     3   418     0  1000      1     2     4     2
 7     4   410     0  5000      0     1     5     7
 8     4   410     0  1000      1     2     5     7
 9     5   218     0  5000      0     1     3     4
10     5   218     0  2000      1     2     3     4
11 more variables: edu, mar, bud, tran, time, area, satis, fee, gar, gre, act <int>

Using logitr version: 1.1.1

Call:
logitr(data = wtp2.7, outcome = "choice", obsID = "id", pars = c("price",
"age", "edu", "bud", "act", "gar", "gre"))

Frequencies of alternatives:
1 2
0.5514 0.4486

Exit Status: 3, Optimization stopped because ftol_rel or ftol_abs was reached.

Model Type: Multinomial Logit
Model Space: Preference
Model Run: 1 of 1
Iterations: 25
Elapsed Time: 0h:0m:0.02s
Algorithm: NLOPT_LD_LBFGS
Weights Used?: FALSE
Robust? FALSE

Model Coefficients:
Estimate Std. Error z-value Pr(>|z|)
price -0.00051319 NA NA NA
age 0.00000000 NA NA NA
edu 24.90583872 NA NA NA
bud 0.25813579 NA NA NA
act -6.32432679 NA NA NA
gar 0.00000000 NA NA NA
gre 14.52830879 NA NA NA

Log-Likelihood: -261.2535918
Null Log-Likelihood: -519.1672382
AIC: 536.5071836
BIC: 568.8384000
McFadden R2: 0.4967834
Adj McFadden R2: 0.4833002
Number of Observations: 749.0000000
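One thing that can be checked in the data is whether each variable in pars actually differs between the alternatives of the same observation. In a multinomial logit, a variable that is identical across all alternatives of an observation cancels out of the choice probabilities, so its coefficient cannot be identified, and that is one possible reason for missing standard errors. In the rows printed above, age has the same value on both rows of each id. The sketch below is a hedged diagnostic in base R; the helper `varies_within_obs` is hypothetical and the toy data only mimic the first printed rows.

```r
# For each candidate variable, check whether it ever differs between the
# alternatives of the same choice observation.
varies_within_obs <- function(data, obs_id, vars) {
  vapply(vars, function(v) {
    any(tapply(data[[v]], data[[obs_id]], function(x) length(unique(x)) > 1))
  }, logical(1))
}

# Toy data mimicking the first rows printed above (values are illustrative)
toy <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3),
  price  = c(5000, 2000, 5000, 1000, 5000, 1000),
  age    = c(3, 3, 4, 4, 4, 4),
  choice = c(0, 1, 0, 1, 0, 1)
)

# price varies within observations (TRUE), age does not (FALSE)
res <- varies_within_obs(toy, "id", c("price", "age"))
```

Variables flagged FALSE would typically need to be interacted with an alternative-specific variable before they can enter the model.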

accessing intercept value

Hi John, this is a great package! Much faster and stable compared to the other R packages for mixed logits. Is it possible to extract the intercept for the estimated model (or a method to specify it as a parameter)?

Usage help

Hi,

I've been trying to use nnet's multinom for my problem, but logitr seems like a much better approach, and I'd be grateful for some help on usage.

My problem comes from single-cell sequencing in biology. In our experiment we take an organ, such as the spleen, from several animals (samples: biological replicates) with covariates (e.g., a categorical variable such as age with two levels: young and old). We dissociate it into single cells and then profile the gene expression in each of these cells, such that the readout data (after some processing and analysis) is a table with these columns:

cell_ID: a barcode of the cell (an integer equivalent to the id variable in the yogurt data)
cell_type: a label of the biological cell type of the cell (e.g., CD4 T cell, which is equivalent to the choice variable in the yogurt data, and hence all possible cell types would be equivalent to the alt variable in the yogurt data)
sample_ID: the animal's label, which would be a random effect variable
age: the animal's age, which would be a fixed effect variable

So these are clearly compositional data because in each sample we get a distribution of cells across the cell_types, which sum up to 100%.

My goal is to test for age effects in each of the cell_types, or in other words whether the estimated old/young ratio in each cell_type is different from 1.

In my data I have a total of 38168 cells from three young samples and four old samples, each assigned to one of five different cell_types.

I constructed the input data.frame such that id is an integer that encodes the cell, obsID is identical to id because a cell is only observed once and hence assigned to a single cell_type once, alt encodes cell_type, choice has a value of 1 for the cell_type of the cell and 0 for the other 4 cell_types, and age and sample_ID are factors.

Here are the two first ids in the input data.frame:

id  obsID  alt         choice  age    sample
1   1      NKT.cell    1       young  young_2
1   1      CD4.T.cell  0       young  young_2
1   1      CD8.T.cell  0       young  young_2
1   1      Treg        0       young  young_2
1   1      NK.cell     0       young  young_2
2   2      NKT.cell    0       young  young_2
2   2      CD4.T.cell  1       young  young_2
2   2      CD8.T.cell  0       young  young_2
2   2      Treg        0       young  young_2
2   2      NK.cell     0       young  young_2

Then I run this logitr model command:

logitr(data = df, outcome = "choice", obsID = "obsID", pars = c("age", "sample"), randPars = c(sample = 'n'), drawType = 'sobol', numDraws = 200,numMultiStarts = 10)

The output only reports on the age and sample coefficients but not on alt.

So my question is whether it is actually possible to estimate the age effect for each alt. I guess it would be an interaction between alt and age, but I don't see how that can be specified in logitr.
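One way such an interaction could be set up is to create the columns by hand, as in the base R sketch below. The data are a made-up miniature of the layout described above, the column names `asc_*` and `old_x_*` are hypothetical, and passing the resulting `pars` vector to logitr is an assumption rather than something tested here. Age effects are only defined relative to a reference cell type, so the reference gets no columns.

```r
# Toy long-format data in the layout described above (two cells, three types)
df <- data.frame(
  obsID  = rep(1:2, each = 3),
  alt    = rep(c("NKT.cell", "CD4.T.cell", "CD8.T.cell"), times = 2),
  choice = c(1, 0, 0, 0, 1, 0),
  age    = rep(c("young", "old"), each = 3),
  stringsAsFactors = FALSE
)

# Reference alternative: NKT.cell. Reference age: young.
alts <- c("CD4.T.cell", "CD8.T.cell")  # non-reference alternatives
for (a in alts) {
  # Alternative-specific constant
  df[[paste0("asc_", a)]] <- as.numeric(df$alt == a)
  # Age effect for this alternative, relative to the reference alternative
  df[[paste0("old_x_", a)]] <- as.numeric(df$alt == a & df$age == "old")
}

# Names of the new columns, e.g. for use as model parameters
pars <- setdiff(names(df), c("obsID", "alt", "choice", "age"))
```

Each `old_x_*` coefficient would then describe how being old shifts the log-odds of that cell type relative to the reference type.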

Thanks a lot

Function to re-code data with outside good

For experiments with outside goods ("none" options), the data need to be encoded in a particular way. I frequently see people make mistakes with this, so it's probably worth writing a function that handles this encoding for them. It needs to handle the following two conditions:

For continuous variables that don't have a 0 in them already (e.g. price), you should also subtract off the lowest value from all the values. By doing this, the value of 0 now means something (e.g. for price, it would be the lowest price), and everything different from 0 refers to the difference from the lowest value. If you don't do this, then the 0s in attributes like price are essentially saying the alternative had a price of 0, which is not correct.
For categorical variables, it is best to also manually dummy-code them and insert those dummy-coded variables into pars. Then you would also create a dummy-coded "no choice" column that is also separately included in pars. This way you'll get a separate coefficient for the "no choice" option that isn't conflated with the other categorical variables (e.g. brand in the example yogurt data).
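The two conditions above can be sketched in base R as follows. This is not the proposed function itself, only an illustration on made-up data in which the outside good is recorded with NA attributes; the column names `brand_hiland`, `brand_yoplait`, and `no_choice` are hypothetical.

```r
# Toy data: each observation has two products plus a "none" option whose
# attributes were recorded as NA.
df <- data.frame(
  obsID = rep(1:2, each = 3),
  price = c(2.5, 3.0, NA, 2.0, 3.5, NA),
  brand = c("dannon", "yoplait", NA, "hiland", "dannon", NA),
  stringsAsFactors = FALSE
)

is_none <- is.na(df$price)

# 1. Shift continuous variables so that 0 means "lowest observed value"
df$price <- df$price - min(df$price, na.rm = TRUE)

# 2. Manually dummy-code categorical variables (reference level: dannon)
df$brand_hiland  <- as.numeric(!is_none & df$brand == "hiland")
df$brand_yoplait <- as.numeric(!is_none & df$brand == "yoplait")

# 3. Separate dummy for the outside good; its attribute columns are all 0
df$no_choice <- as.numeric(is_none)
df$price[is_none] <- 0
```

The model parameters would then be price, the brand dummies, and no_choice, so that the "no choice" coefficient is estimated separately from the brand effects.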

Add a `predict()` function

simulateShares() just gives probabilities of choosing each alternative, but it could easily be extended to predict choices (e.g. for a hold out set).
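As an illustration of the extension, the base R sketch below turns predicted probabilities into predicted choices by flagging the most likely alternative in each observation. The data frame is made up and only mimics the long-format layout of predicted probabilities.

```r
# Toy predicted probabilities in long format (values are illustrative)
probs <- data.frame(
  obsID = c(13, 13, 13, 42, 42, 42),
  prob  = c(0.5, 0.1, 0.4, 0.2, 0.7, 0.1)
)

# Deterministic rule: flag the alternative with the highest probability
# within each observation (1 = predicted choice, 0 = otherwise)
probs$predicted <- as.numeric(
  ave(probs$prob, probs$obsID, FUN = function(p) seq_along(p) == which.max(p))
)
```

An alternative design would be to simulate choices by sampling one alternative per observation with these probabilities, which preserves the predicted shares in aggregate.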

apollo

Have you seen
http://www.apollochoicemodelling.com/files/Hess_Palma_Apollo.pdf
http://www.apollochoicemodelling.com/files/Apollo.pdf
install.packages("apollo")
?

Predicted names don't match

Hello,

Thanks for providing the library, I found out about it a few days ago and it is truly amazing!

I am trying to run a model that has one feature set as a factor. However, it errors out when the factor levels in the newdata provided to predict differ from those in the training dataset.

I think the error comes from here:

logitr/R/inputChecks.R

Line 156 in e478e02

if (length(setdiff(modelPars, dataNames)) > 0) {

In my case, the values provided in newdata are a subset of the ones in the training dataset. This case should be fine, right? No observation would have undetermined coefficients.
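A possible workaround on the user side, sketched below in base R, is to re-declare the factor in newdata with the levels of the training data, so that dummy coding yields the same columns in both. The data frames `train` and `newdata` are made up, and whether this resolves the check in inputChecks.R is an assumption.

```r
train   <- data.frame(brand = factor(c("dannon", "hiland", "weight", "yoplait")))
newdata <- data.frame(brand = factor(c("dannon", "yoplait")))

# Re-declare the factor with the training levels so that dummy coding
# produces the same columns as during estimation
newdata$brand <- factor(as.character(newdata$brand),
                        levels = levels(train$brand))

colnames(model.matrix(~ brand, data = newdata))
#> [1] "(Intercept)"  "brandhiland"  "brandweight"  "brandyoplait"
```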

Release logitr 0.7.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Praise 🎉

Funders and academic employers are increasingly interested in seeing evidence for the impact academic research generates. For software such as {logitr}, this is very hard to accomplish because the typical metrics for promotion and tenure that matter (publications, grants, and citations) don't really apply. The consequence is that there are increasingly fewer and fewer incentives to develop packages like {logitr} 😞

The good news is you can help! If you have found {logitr} useful in any regard, please leave feedback here for other users, funders, and employers to view. This helps the package authors show how {logitr} is being used by academic and non-academic users to increase their productivity and work quality.

So please contribute some praise 😁! Tell us how cool this package is and how you use it in your work!

And if you write a paper using {logitr}, please cite the JSS article on the package in your publications, see: citation("logitr")

Using logitr to estimate heterogeneity in customers time-preferences

Dear Prof. Helveston,

thank you for this great package, its speed and ease of use are very impressive!
I am trying to use it to fit a (mixed) logit model to analyze how customers trade off upfront investment costs against future savings in renewable energy investments. Specifically, I am following Train (1985) and seek to estimate a random utility model for implicit discount rates with the following specification:

U_i = α_i + β₁ upfrontCost_i + β₂ futureSavings_i + ε_i

I am interested in the ratio β₁/β₂, particularly in its distribution within the sample. Your WTP model space seemed particularly suited for this estimation, as it allows β₂ to be normalized to one and the coefficient on upfrontCost to be interpreted as the parameter of interest.

In a standard logit specification this works very well. However, when I try to estimate a mixed logit model with normal or log-normal parameters, the sigma parameter is estimated to be close or equal to zero. This would indicate that there is no heterogeneity in customers' time preferences within the sample. However, I know that this can't be true. In fact, when I interact β₂ with other socioeconomic variables, I can capture part of this heterogeneity with a standard logit model.

I already tried increasing numMultiStarts to 100, which did not help. I suspect that I might have misspecified my model in a subtle way.
Do you have any thoughts on what I may be doing wrong or have suggestions for further reading/ model implementation advice?

I am looking forward to hearing from you.

Thanks in advance and best wishes,
Moritz

EDIT: I guess it is worth noting that the upfrontCost and futureSavings of the alternatives vary across individuals. Maybe that is violating a model assumption.
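For reference, the normalization described above amounts to the following reparameterization of the utility function. This is plain algebra on the stated specification, with $\lambda$ and $\omega$ as assumed notation; the sign conventions that a particular package applies to the scale parameter are not covered here.

$$U_i = \alpha_i + \lambda\left(\omega\,\text{upfrontCost}_i + \text{futureSavings}_i\right) + \varepsilon_i, \qquad \lambda = \beta_2, \quad \omega = \frac{\beta_1}{\beta_2}$$

In a mixed logit with $\omega \sim N(\mu, \sigma^2)$, heterogeneity in the ratio is captured by $\sigma$, which is the parameter estimated to be close to zero above.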

Mismatch of clustered standard errors between logitr and Stata clogit

Hello,
First of all, thank you very much for developing this package. It is really amazing and of great help.

I have a question regarding the clustering of standard errors. I am comparing the results of a multinomial logit model in preference space with those that I get in Stata, and something appears to go wrong once I cluster standard errors.
If I run a basic MNL model with the yogurt data and cluster standard errors by respondent, the standard errors appear to explode (p-values of roughly 0.8 or higher):

mnl_pref_yog <- logitr(
  data    = yogurt,
  outcome = "choice",
  obsID   = "obsID",
  pars    = c("price", "feat", "brand"),
  clusterID = "id"
)
summary(mnl_pref_yog)

Model Coefficients: 
             Estimate Std. Error z-value Pr(>|z|)
price        -0.36655   10.95012 -0.0335   0.9733
feat          0.49144   10.77771  0.0456   0.9636
brandhiland  -3.71548   30.48871 -0.1219   0.9030
brandweight  -0.64114    2.51804 -0.2546   0.7990
brandyoplait  0.73452   28.27514  0.0260   0.9793

However, if I run the same model in Stata using the clogit command, the effect of clustering standard errors is far less extreme.

use "yogurt.dta", clear
encode brand, gen (en_brand)
clogit choice price feat i.en_brand, group(obsID) robust cluster(id)


Conditional (fixed-effects) logistic regression         Number of obs =  9,648
                                                        Wald chi2(5)  = 181.89
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -2656.8879                       Pseudo R2     = 0.2054

                                   (Std. err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      choice | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.3665845   .0542258    -6.76   0.000    -.4728651   -.2603038
        feat |   .4914334   .1938909     2.53   0.011     .1114143    .8714525
             |
    en_brand |
     hiland  |  -3.715598   .3536449   -10.51   0.000    -4.408729   -3.022466
     weight  |  -.6411843   .4472309    -1.43   0.152    -1.517741    .2353722
    yoplait  |   .7345712   .2772028     2.65   0.008     .1912636    1.277879
------------------------------------------------------------------------------

Note that the coefficients are the same, and that when I do not cluster standard errors, the standard errors are almost identical regardless of whether I use logitr or clogit.

Is there something going on with the clustering, or am I misspecifying something?

Thanks a lot in advance for clarifying this!

Your work is truly helpful!

Is there a requirement on the amount of data for the logitr function to predict?

Hello, I'm using the logitr function to calculate the WTP value of each user. Each user has around 20 observation sessions for both WTP and WTA values. However, the estimated model contains very large numbers and negative numbers, which does not seem right. I compared the results of the preference space model and the WTP space model: some of the users' results match but some do not, which suggests that the WTP space result is not the global solution. But even the preference space result does not look right. How could I solve this problem?

Thanks a lot!!

Individual part-worth estimates

Dear John,

first of all, thanks a lot for this amazing package. I was wondering if there is the chance to extract individual part worths similar to the indpar from the mlogit package?

Thanks a lot

Predicted probability labeling

Hi John! Thanks for this awesome package! I was reviewing Predicting Choice Probabilities from Estimated Models and I noticed that the predicted probabilities don't carry forward the brand assignment:

probs_mnl_pref
#>   obsID  prob_mean   prob_low  prob_high
#> 1    13 0.43684871 0.41564967 0.45746384
#> 2    13 0.03313009 0.02630457 0.04172679
#> 3    13 0.19155738 0.17621753 0.20760195
#> 4    13 0.33846382 0.31891924 0.35883234
#> 5    42 0.60763975 0.57328271 0.64024329
#> 6    42 0.02602079 0.01832816 0.03662558
#> 7    42 0.17803594 0.16244937 0.19462645
#> 8    42 0.18830353 0.16858802 0.20970439

is it safe to assume that those are in the same order of the input new data here?

alts
#>     obsID price feat   brand
#> 49     13   8.1    0  dannon
#> 50     13   5.0    0  hiland
#> 51     13   8.6    0  weight
#> 52     13  10.8    0 yoplait
#> 165    42   6.3    0  dannon
#> 166    42   6.1    1  hiland
#> 167    42   7.9    0  weight
#> 168    42  11.5    0 yoplait

And can the predictProbs function be modified to include brand as a column in the predicted probabilities data frame? This could help protect users against misspecification.

Thank you again for all of your work on this!

Unexpected error when displaying summary

Hello, I am running a standard multinomial model of this kind:

mnl_pref <- logitr(
  data           = data,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("wg","ls20","ls44","hs60","hs75"),
  modelSpace     = "pref",
  clusterID      = "wrkid",
  panelID        = "wrkid",
  numMultiStarts = 100)

The model is estimated without problems, but when I ask for a summary of the results I get the following error:

summary(mnl_pref)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

I am not sure what may be causing it. Any suggestions to work around the issue?

Release logitr 1.0.0

Prepare for release:

Submit to CRAN:

usethis::use_version('major')
devtools::submit_cran()
Approve email

Wait for CRAN...

effects coding

When using logitr with effects-coded categorical variables (contr.sum), the transformation from the categorical variable to the dummy variables gives only one dummy column, irrespective of how many categorical variables are in the model. To solve this, a check should be done on the contrasts of the categorical variables, and an alternative to fastDummies should be used.
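For comparison, base R's `model.matrix()` respects the contrasts of a factor. The sketch below shows the columns that effects coding is expected to produce for a four-level factor; it only illustrates the target encoding and is not a proposed patch for logitr.

```r
brand <- factor(c("dannon", "hiland", "weight", "yoplait"))

# Effects (sum-to-zero) coding: K levels give K - 1 columns, and the last
# level is coded -1 in every column
X <- model.matrix(~ brand, contrasts.arg = list(brand = "contr.sum"))
X <- X[, -1, drop = FALSE]  # drop the intercept column

colnames(X)
#> [1] "brand1" "brand2" "brand3"
```

The contrasts actually in use for a factor can be inspected with `contrasts(brand)`, which could serve as the check mentioned above.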

Error in Parallelization

Fantastic work, John. I have greatly enjoyed exploring the package so far.

I've run into a problem estimating a WTP-space model. The problem appears when I request a larger number of draws. The code below runs fine regardless of the number of cores used when numDraws = 200. However, when I set numDraws = 2000, it produces an error regardless of the number of cores specified. Is there a potential solution?

Running multistart...
Random starting point iterations: 10
Number of cores: 1
Error in serialize(data, node$con) : error writing to connection

wtp <- logitr(
  data = dat,
  outcome = "choice",
  obsID = "cs",
  panelID = "id",
  pars = c("l", "l160", "l320", "prov", "can", "organic",
           "grass", "btest", "bfree", "cn"),
  scalePar = "price",
  randPars = c(l = 'n', l160 = 'n', l320 = 'n', prov = 'n', can = 'n',
               organic = 'n', grass = 'n', btest = 'n', bfree = 'n', cn = 'n'),
  numMultiStarts = 10,
  numDraws = 2000,
  numCores = 1,
  drawType = 'sobol'
)

Include covariance matrix in model output

Useful for computing other tests
