timothyb0912 / pylogit


A python package for estimating conditional logit models.

Home Page: https://pypi.org/project/pylogit/

License: BSD 3-Clause "New" or "Revised" License

Python 96.65% Makefile 0.02% TeX 3.33%

discrete-choice

pylogit's Introduction

PyLogit

PyLogit is a Python package for performing maximum likelihood estimation of conditional logit models and similar discrete choice models.

Main Features

It supports
- Conditional Logit (Type) Models
  - Multinomial Logit Models
  - Multinomial Asymmetric Models
    - Multinomial Clog-log Model
    - Multinomial Scobit Model
    - Multinomial Uneven Logit Model
    - Multinomial Asymmetric Logit Model
- Nested Logit Models
- Mixed Logit Models (with Normal mixing distributions)
It supports datasets where the choice set differs across observations
It supports model specifications where the coefficient for a given variable may be
- completely alternative-specific
  (i.e. one coefficient per alternative, subject to identification of the coefficients),
- subset-specific
  (i.e. one coefficient per subset of alternatives, where each alternative belongs to only one subset, and there are more than 1 but fewer than J subsets, where J is the maximum number of available alternatives in the dataset),
- completely generic
  (i.e. one coefficient across all alternatives).
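As a rough illustration of the three coefficient types above, here is a minimal sketch of a specification dictionary. The `'all_same'` value appears in an issue further down this page; the list-based entries and the `'all_diff'` shorthand reflect my understanding of PyLogit's specification format and should be checked against the example notebooks. The variable names, the three-alternative setup, and the `count_coefficients` helper are hypothetical and are not part of PyLogit.

```python
from collections import OrderedDict

NUM_ALTS = 3  # hypothetical dataset with alternatives 1, 2 and 3

spec = OrderedDict()
# Alternative-specific: one coefficient each for alternatives 1 and 2
# (alternative 3 is left out so the intercepts are identified).
spec["intercept"] = [1, 2]
# Subset-specific: alternatives 1 and 2 share a coefficient, 3 has its own.
spec["travel_time"] = [[1, 2], 3]
# Generic: a single coefficient across all alternatives.
spec["travel_cost"] = "all_same"


def count_coefficients(entry, num_alts):
    """Hypothetical helper: number of coefficients implied by one entry."""
    if entry == "all_same":
        return 1
    if entry == "all_diff":  # assumed shorthand for one per alternative
        return num_alts
    return len(entry)  # one coefficient per listed alternative or subset


total = sum(count_coefficients(v, NUM_ALTS) for v in spec.values())
print(total)
```

The total is the length of the initial-values array that would be passed to `fit_mle` for an MNL model with this specification.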

Installation

Available from PyPI:

pip install pylogit

Available through Anaconda:

conda install -c conda-forge pylogit

conda install -c timothyb0912 pylogit

Usage

For Jupyter notebooks filled with examples, see examples.

For More Information

For more information about the asymmetric models that can be estimated with PyLogit, see the following paper

Brathwaite, T., & Walker, J. L. (2018). Asymmetric, closed-form, finite-parameter models of multinomial choice. Journal of Choice Modelling, 29, 78–112. https://doi.org/10.1016/j.jocm.2018.01.002

A free and better formatted version is available at ArXiv.

Attribution

If PyLogit (or its constituent models) is useful in your research or work, please cite this package by citing the paper above.

License

Modified BSD (3-clause). See here.

Changelog

See here.


pylogit's Issues

Create contributors updating process

Create or find an automated way to update contributors.rst based on the merged pull requests or git commit record of the develop branch.

For instance, I need to add danphan from #72.

Feature Request : Regularization

Not sure if this branches too far from the intended purpose of the package, but is it possible for pylogit to implement L1 or L2 regularization, similar to what scikit-learn does?
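For context, another issue on this page passes a `ridge` argument to `fit_mle`. A minimal sketch of what an L2 (ridge) penalty does to the objective is given below. It assumes the penalty is the ridge value times the sum of squared parameters, subtracted from the log-likelihood; the function name and the numbers are hypothetical, and this is not PyLogit's internal code.

```python
def penalized_log_likelihood(log_likelihood, params, ridge):
    """Log-likelihood minus an L2 penalty on the parameters (assumed form)."""
    return log_likelihood - ridge * sum(b * b for b in params)


# Hypothetical values: an unpenalized log-likelihood of -100 at beta = (1, -2).
value = penalized_log_likelihood(-100.0, [1.0, -2.0], ridge=0.5)
print(value)
```

An L1 penalty would replace `b * b` with `abs(b)`, which is not differentiable at zero and so would likely need a different optimizer.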

What

This package is missing a CONTRIBUTING.md file.
Contributors therefore start without clarity on how to:

- begin
- understand pylogit internals
- set up their development environment
- document their architectural / coding design decisions
- ensure that their code meets minimum package standards
- document their changes in docstrings / the changelog / example-notebooks
- prepare their contribution for review
  - run tests
  - prepare their PR (including what branch to target)

This has already come up here and here.

I should add a CONTRIBUTING.md to address all of the issues above.

relation to softmax

Hi,

Yesterday I reread the paper. I woke up thinking that the logit-type formulation is similar to the softmax normalization. I am a little surprised the comparison was not in the paper.

This is not a call to action, but the start of a conversation. I was going to send you an email, but thought I may as well work in public.

To what degree can the paper be described as:

Probability of choice is softmax of utility.
Utility is S(data).
Where S is a nonlinear function
Justify and test 4 examples for S
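The framing in the list above can be sketched as follows. This is only an illustration of "probability of choice is softmax of utility, and utility is S(data)": the identity transform, the second example transform, and the index values are hypothetical and are not taken from the paper.

```python
import math


def softmax(utilities):
    """Numerically stable softmax over one choice set."""
    shift = max(utilities)
    exps = [math.exp(u - shift) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def s_identity(v):
    """S for the standard MNL: utility equals the linear index."""
    return v


def s_example(v):
    """A made-up nonlinear S, used only to show where S plugs in."""
    return -math.log1p(math.exp(-v))


index = [1.0, 2.0, 0.5]  # hypothetical linear indices x'beta
mnl_probs = softmax([s_identity(v) for v in index])
alt_probs = softmax([s_example(v) for v in index])
print(mnl_probs, alt_probs)
```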

Memory error when fitting model

First, thank you so much for creating this wonderful library!

I'm using pylogit for a multinomial logit model that currently, for test purposes, has just four parameters. I can successfully fit the model with up to about 40,000 rows of training data, but when I use 56,000 rows I get a Python memory error (see below).

I was previously getting a memory error at about 20,000 rows but switched to the develop branch and saw some improvement. My code looks like this:

	model = pl.create_choice_model(trainPD,
		alt_id_col='partid',
		obs_id_col='segmentid',
		choice_col='result',
		specification=spec,
		model_type='MNL')
	model.fit_mle(np.zeros(4))

Does it seem as if I’m bumping against a known limitation, or is there something else going on? Can you recommend any fix or workaround? My model will eventually have 100+ parameters which I’m assuming will make the memory situation worse.

Thanks in advance for your help.

-Barton Listick

Traceback (most recent call last):
  File "model_20.py", line 369, in <module>
    run()
  File "model_20.py", line 256, in run
    model.fit_mle(np.zeros(4))
  File "C:\ProgramData\Anaconda3\lib\site-packages\pylogit\conditional_logit.py", line 401, in fit_mle
    just_point=just_point)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pylogit\estimation.py", line 707, in estimate
    results = calc_and_store_post_estimation_results(results, estimator)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pylogit\estimation.py", line 574, in calc_and_store_post_estimation_results
    estimator.convenience_calc_fisher_approx(final_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pylogit\estimation.py", line 417, in convenience_calc_fisher_approx
    return cc.calc_fisher_info_matrix(*args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pylogit\choice_calcs.py", line 931, in calc_fisher_info_matrix
    np.max(rows_to_obs.toarray() * weights[:, None], axis=0)
MemoryError

Estimation process unstable

Hi,

When I try to run your code with my sample data, the parameter coefficients are not stable, and I'm not sure why. Can you have a look? Sometimes even the sign of a coefficient changes. For example, my "Male" variable can get either a positive or a negative coefficient. Can you explain why? Is there some random input in your estimation process?

Update Paper Reference

The readme file only links to the arxiv version. It should be updated to refer to the Journal of Choice Modelling paper.

Make predict function work for mixed logit with 1 draw

Right now, calling predict when using a mixed logit model with num_draws=1 will cause an error. The following code snippet needs to be included right after L1959:

if len(prob_array.shape) == 1:
    return prob_array

Refactor

Request

The current codebase should be substantially simplified and clarified. There are a number of code smells (e.g. mega classes with no separation of concerns), and the codebase could benefit from following object oriented programming principles (e.g. composition over inheritance, dependency injection, etc.). The internals of PyLogit should undergo a large architectural refactor.

Things to do include:

- reflect on and make use of the most appropriate design patterns whenever possible. (Helpful tools here)
- systematically search for and eliminate code smells
- make use of more classes internally and less reliance on passing tons of arguments around
- implement composition over inheritance
- create custom exceptions
- use python-valid8 for argument validation in functions
- use attrs for class definition and validation of class initialization arguments
- use marshmallow or cattr for serialization
- use pydeps to generate dependency diagrams showing how modules relate to each other
- use jupytext to version control notebooks
- grappa for assertions in tests
- behavioral testing (in Gherkin / English) for user acceptance testing
- hypothesis for property testing and finding limits to the code
- papermill for testing that the example notebooks run successfully

Sparse to Dense

In choice_calcs.py line 930, it appears the library is calling rows_to_obs.toarray() which converts a sparse array to a potentially huge dense array (for my use case, the resulting sparse array is small, and the dense version is about 200 GiB).
Here is the existing code:

weights_per_obs =\
        np.max(rows_to_obs.toarray() * weights[:, None], axis=0)

Is the intended behavior given below?

M = rows_to_obs.multiply(weights.reshape(-1,1))
weights_per_obs = np.max(M, axis=0).toarray().reshape(-1)

If so, I think this is a simple fix.
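For reference, the quantity computed by the existing code appears to be the largest weight among the rows of each observation. Below is a minimal pure-Python sketch of that group-wise maximum, which never forms a rows-by-observations matrix. The function name and the weights are hypothetical, and a real fix would presumably stay in numpy / scipy.sparse as proposed above.

```python
def max_weight_per_obs(obs_ids, weights):
    """Largest weight among the long-format rows of each observation.

    Assumes weights are non-negative; the dense version takes a max over a
    column that also contains zeros, so the two only agree in that case.
    """
    per_obs = {}
    for obs, w in zip(obs_ids, weights):
        if obs not in per_obs or w > per_obs[obs]:
            per_obs[obs] = w
    return per_obs


# Hypothetical long-format data: observation 1 has three rows, observation 2 has two.
obs_ids = [1, 1, 1, 2, 2]
weights = [0.5, 0.5, 0.5, 2.0, 2.0]
result = max_weight_per_obs(obs_ids, weights)
print(result)
```

Memory use here grows with the number of rows plus the number of observations, rather than with their product.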

UI

Request

Create a Web UI for PyLogit so users don't have to write much (if any) code in order to estimate basic (and perhaps not so basic) discrete choice models. One option is to use Streamlit to create the UI. The experience could draw inspiration from the UI of Biogeme, as a starting point.

Some desired features would be that the following workflow all be doable from the UI:

- design matrix specification
- model instantiation
  - utility specification
  - model type selection
  - model argument setting (e.g. intercept reference alternative)
- model estimation
- model viewing (post-estimation results and summary)
- model checking (interactively visualize and save model checking plots from one's model)
- model serialization
- model prediction (new data, batch or online)

Feature Request: Boosted Conditional Logit

Not sure if this branches too far from the intended purpose of the package, but could pylogit support boosted conditional logit models, as described in Shi, H., & Yin, G., "Boosting conditional logit model"? An R package is located here: https://github.com/cran/clogitboost

Scikit-API compliance

Request

Numerous users have noted the desire for PyLogit to conform to the Scikit-Learn API. See for example #23 and #41. Such a change would enable the use of discrete choice models inside scikit learn pipelines.

If #51 is implemented, then using skorch or a wrapper around pytorch-lightning would be one way to achieve sklearn compatibility.

Derive and Implement Analytic Hessian for Mixed Logit

This is on the agenda for the coming months, but pull requests or contributions are always welcome.

At the moment, the sum of the outer products of the gradients is used as an approximation to the actual Hessian.
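A minimal sketch of that approximation, often called the BHHH or outer-product-of-gradients estimator, is shown below. It assumes one gradient vector per observation (or per decision maker); the function name and the gradient values are hypothetical, and this is not PyLogit's implementation.

```python
def outer_product_sum(gradients):
    """Sum of g g' over a list of equal-length gradient vectors."""
    k = len(gradients[0])
    total = [[0.0] * k for _ in range(k)]
    for g in gradients:
        for i in range(k):
            for j in range(k):
                total[i][j] += g[i] * g[j]
    return total


# Hypothetical gradients of the log-likelihood contributions of two observations.
grads = [[1.0, 2.0], [0.5, -1.0]]
approx_info = outer_product_sum(grads)
print(approx_info)
```

The result is symmetric and positive semi-definite by construction. It approximates the information matrix, that is, the negative of the Hessian of the log-likelihood.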

LinAlgError: singular matrix (with covariant features + ridge)

I'm getting a numpy.linalg.linalg.LinAlgError: singular matrix error from line 1216 of base_multinomial_cm_v2

I'm getting these errors seemingly at random (I am setting numpy.random.seed(1) before I train the model). Note I am working with a feature matrix that has covariant features, but I'm setting the ridge parameter of fit_mle.

Here's a small example of what I think is causing the issue.

import numpy as np
import pandas as pd
import pylogit as pl

from collections import OrderedDict

np.random.seed(1)
# x is a valid feature
x = np.array([1,2,3,1.5,3.5])

# x_redundant is a redundant feature
x_redundant = x * -1

fake_df = pd.DataFrame({"obs_id": [1, 1, 1, 2, 2],
                        "alt_id": [1, 2, 3, 1, 3],
                        "choice": [0, 1, 0, 0, 1],
                        "x": x,
                        "x_redundant": x_redundant,
                        "intercept": [1 for i in range(5)]})

specification = OrderedDict()
specification['x'] = 'all_same'
specification['x_redundant'] = 'all_same'


model = pl.create_choice_model(
    data = fake_df,
    obs_id_col='obs_id',
    alt_id_col='alt_id',
    choice_col='choice',
    specification=specification,
    model_type='MNL'
)

model.fit_mle(np.zeros(2))

If I set model.fit_mle(np.zeros(2), ridge=0.8) the error doesn't get raised in this example, yet in my code I'm setting the ridge=0.8 and the error still gets raised.

I read a few articles/cross-validated posts about singular matrices, but I'm still figuring out what it means. I wish I could be more descriptive in this issue.
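To make "singular" concrete for the example above, here is a small sketch using the same x and x_redundant values. It builds the 2x2 matrix of cross-products of the two features, shows that its determinant is exactly zero because one column is a multiple of the other, and shows that adding a positive constant to the diagonal makes it invertible. The constant 1.6 assumes the ridge penalty is 0.8 times the sum of squared coefficients (so it contributes 2 * 0.8 to each diagonal entry), and the cross-product matrix is only a stand-in for the model's actual Hessian. It does not explain why the error persists in the real dataset.

```python
x = [1, 2, 3, 1.5, 3.5]
x_redundant = [-v for v in x]  # perfectly collinear with x


def gram_2x2(a, b):
    """Cross-product matrix [[a.a, a.b], [b.a, b.b]] of two columns."""
    ab = sum(u * v for u, v in zip(a, b))
    return [[sum(u * u for u in a), ab],
            [ab, sum(v * v for v in b)]]


def det_2x2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


gram = gram_2x2(x, x_redundant)
det_plain = det_2x2(gram)  # zero: the matrix cannot be inverted

penalty = 1.6  # assumed diagonal contribution of ridge=0.8
ridged = [[gram[0][0] + penalty, gram[0][1]],
          [gram[1][0], gram[1][1] + penalty]]
det_ridged = det_2x2(ridged)  # strictly positive: invertible
print(det_plain, det_ridged)
```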

Thanks for the help.

EDIT:
Here's a screenshot of my hessian when I'm getting the error:

PyTorch as computational backend

Request

Currently, PyLogit is built atop numpy and scipy.sparse for computation of choice probabilities, gradients, and hessians. This computational backend has at least two problems.

It restricts us to analytical derivatives that must be programmed by hand.
It practically restricts us to batch optimization since stochastic optimization methods are currently only in libraries with automatic differentiation support.

Note, packages such as autograd and jax are of no help here because they don't support sparse matrices.

PyLogit should move to using PyTorch as its computational backend. There are almost no immediately known downsides. Upsides include resolving both problems above, allowing essentially arbitrary choice models to be estimated through PyLogit, and providing access to a large and growing ecosystem of tools that are all designed around PyTorch (e.g. for model serialization, for scikit-learn compatibility, for standardization of model estimation code by end users, etc.).

Question about using Pylogit

I am very new to Python programming so please excuse me if my questions are naive.

I think the features of pylogit fit perfectly with what I want to do so I tried to get familiar with using it for a Random Coefficients model by simulating some data and then using the example notebooks to process the data and estimate the model parameters. So the problem I created is very simple, 300 "consumers" choose, on five occasions, between one of 4 alternative products, whose utilities are composed of Brand specific intercept and a price term (beta*Price_alternative). So I transformed these 1500 observations into the long form and tried to use Pylogit to estimate the parameters.

When I generate data from a simple MNL, no random coefficients, it works fine. But in the actual RCL, the values seem very sensitive to starting values and don't move much from the starting values. The thing that struck me the most though was the reported log-likelihood when all the parameters are zero. With 4 alternatives, and 1500 choices this should be 1500*Ln(0.25), but the weird thing is that the number that pylogit reports is very different (-1500 or so) instead of -2079, which makes me think I may be doing something else wrong.
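The expected value quoted above can be checked directly. With all parameters at zero, every alternative in a four-alternative choice set has probability 0.25, so each of the 1500 choices contributes ln(0.25) to the log-likelihood. The variable names below are mine.

```python
import math

n_consumers = 300
n_occasions = 5
n_alternatives = 4

n_choices = n_consumers * n_occasions                # 1500 choice situations
null_log_likelihood = n_choices * math.log(1.0 / n_alternatives)
print(round(null_log_likelihood, 2))
```

This is the value for a model that treats all 1500 choices as independent with equal probabilities; it says nothing about how the reported figure of roughly -1500 was computed.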

Happy to send you the dataset and my Python notebook. I'd really appreciate any insights you might have about this problem.

Regards,
Siddarth

TypeError: invalid type comparison In[14]: of 'Mlogit Benchmark 2: Kenneth Train's Heating Data'

Hi Timothy,

Thank you so much for sharing this wonderful package! I'm new to mixed logit models. I've run this Heating dataset in R successfully. When I ran the above-mentioned example with Pylogit, an error arose at In[14].
I downloaded the 'mlogit_Benchmark--Heating-checkpoint.ipynb' and ran it in the Jupyter notebook of Anaconda3 for Python 3.5. I installed Pylogit two days ago with 'conda install -c timothyb0912 pylogit'. Here is the error information from Jupyter:

TypeError Traceback (most recent call last)
in ()
6 specification=model_1_spec,
7 model_type="MNL",
----> 8 names=model_1_names)
9
10 # Estimate the given model, starting from a point of all zeros

D:\Program Files\Anaconda3\lib\site-packages\pylogit\pylogit.py in create_choice_model(data, alt_id_col, obs_id_col, choice_col, specification, model_type, intercept_ref_pos, shape_ref_pos, names, intercept_names, shape_names, nest_spec, mixing_id_col, mixing_vars)
223 choice_col,
224 specification,
--> 225 **model_kwargs)

D:\Program Files\Anaconda3\lib\site-packages\pylogit\conditional_logit.py in __init__(self, data, alt_id_col, obs_id_col, choice_col, specification, names, *args, **kwargs)
296 specification,
297 names=names,
--> 298 model_type=model_type_to_display_name["MNL"])
299
300 # Store the utility transform function

D:\Program Files\Anaconda3\lib\site-packages\pylogit\base_multinomial_cm_v2.py in __init__(self, data, alt_id_col, obs_id_col, choice_col, specification, intercept_ref_pos, shape_ref_pos, names, intercept_names, shape_names, nest_spec, mixing_vars, mixing_id_col, model_type)
877 specification,
878 alt_id_col,
--> 879 names=names)
880
881 ##########

D:\Program Files\Anaconda3\lib\site-packages\pylogit\choice_tools.py in create_design_matrix(long_form, specification_dict, alt_id_col, names)
694 else: # the group is an integer
695 # Create the variable column
--> 696 new_col_vals = ((long_form[alt_id_col] == group).values *
697 long_form[variable].values)
698 independent_vars.append(new_col_vals)

D:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
1281
1282 with np.errstate(all='ignore'):
-> 1283 res = na_op(values, other)
1284 if is_scalar(res):
1285 raise TypeError('Could not compare {typ} type with Series'

D:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
1167 result = method(y)
1168 if result is NotImplemented:
-> 1169 raise TypeError("invalid type comparison")
1170 else:
1171 result = op(x, y)

TypeError: invalid type comparison

I also copied and pasted the code to Spyder in Anaconda3. It generated the same error.
Traceback (most recent call last):

File "", line 1, in
runfile('C:/Users/Larry/Google Drive/CS/Programming_languages/Python/Python programs/exer_19_03_ML_mixlogit_Pylogit.py', wdir='C:/Users/Larry/Google Drive/CS/Programming_languages/Python/Python programs')

File "D:\Program Files\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 678, in runfile
execfile(filename, namespace)

File "D:\Program Files\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 106, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/Larry/Google Drive/CS/Programming_languages/Python/Python programs/exer_19_03_ML_mixlogit_Pylogit.py", line 147, in
names=model_1_names)

File "D:\Program Files\Anaconda3\lib\site-packages\pylogit\pylogit.py", line 225, in create_choice_model
**model_kwargs)

File "D:\Program Files\Anaconda3\lib\site-packages\pylogit\conditional_logit.py", line 298, in __init__
model_type=model_type_to_display_name["MNL"])

File "D:\Program Files\Anaconda3\lib\site-packages\pylogit\base_multinomial_cm_v2.py", line 879, in __init__
names=names)

File "D:\Program Files\Anaconda3\lib\site-packages\pylogit\choice_tools.py", line 696, in create_design_matrix
new_col_vals = ((long_form[alt_id_col] == group).values *

File "D:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
res = na_op(values, other)

File "D:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in na_op
raise TypeError("invalid type comparison")

TypeError: invalid type comparison

Traceback (most recent call last):