I think it's worth a description of mode choice from an implementation perspective just so we're all on the same page as to how things work now.
To that end, I took the work mode choice model Excel worksheet, converted it to a CSV, and put it here.
This spreadsheet seems to be doing multiple things at once: defining and computing some variables, defining the nesting structure, and creating both filters and expressions for the individual variables.
My first impression is that this isn't so bad, at least in the sense that I can still pretty much understand what is intended, and how the mode choice model is built out of this. I imagine this is sort of "bending over backwards" to put as much power in the CSVs and the people who want to edit those CSVs as possible. I really don't disagree with that approach, but would definitely do a bit more in Python and would make a few other changes.
For starters, I don't like that there are 18 columns here, one for each of the alternatives, with mostly whitespace in the cells. I would suggest a "stacked" format instead, with a single "alternative" column whose value is filled in on each row.
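To make the stacked idea concrete, here's a minimal sketch of the reshaping in pandas. The column names and expressions are illustrative, not the real spec; the point is just that `melt` turns one-column-per-alternative into one-row-per-(expression, alternative) pair and drops the empty cells.

```python
import pandas as pd

# Hypothetical slice of the wide spec: one column per alternative,
# mostly-empty cells (names here are made up for illustration).
wide = pd.DataFrame({
    "Expression": ["c_ivt * SOV_TIME", "c_ivt * HOV2_TIME"],
    "Drive Alone": [1.0, None],
    "Shared Ride 2": [None, 1.0],
})

# Stack it: one row per (expression, alternative) pair, dropping blanks.
stacked = (
    wide.melt(id_vars="Expression",
              var_name="alternative",
              value_name="coefficient")
        .dropna(subset=["coefficient"])
)
```

The stacked frame is also much friendlier to version control diffs than a wide sheet with sparse cells.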
I'm also going to assume that it's a better idea to define the nesting structure directly as a hierarchical dictionary (probably in YAML) and that our main task is to create the correct variables, and the correct coefficients by which to multiply them.
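As a sketch of what I mean by a hierarchical dictionary: the structure below is a plain nested dict (which serializes naturally to YAML). The nest names, coefficients, and alternatives are placeholders, not the actual work mode choice tree.

```python
# Illustrative nesting structure -- names and values are assumptions.
NESTS = {
    "name": "root",
    "coefficient": 1.0,
    "nests": [
        {"name": "auto", "coefficient": 0.72,
         "alternatives": ["DRIVEALONE", "SHARED2", "SHARED3"]},
        {"name": "transit", "coefficient": 0.72,
         "alternatives": ["WALK_LOC", "WALK_LRF", "DRIVE_LOC"]},
        {"name": "non_motorized", "coefficient": 0.72,
         "alternatives": ["WALK", "BIKE"]},
    ],
}

def leaf_alternatives(node):
    """Collect all leaf alternatives under a nest node, recursively."""
    if "alternatives" in node:
        return list(node["alternatives"])
    out = []
    for sub in node.get("nests", []):
        out.extend(leaf_alternatives(sub))
    return out
```

Defining the tree this way keeps the nesting logic out of the per-variable CSV entirely.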
I think if we can do this for one of the alternatives we can do this for the rest, so we can start by looking at the drive alone alternative. I extracted that and put it up as a gist here so we can see it outside of the rest of the very large file.
I think the important exercise is to look at each of the expressions used here and figure out where they come from. Below are the expressions and my best guess as to what they mean.
- sovAvailable - attribute of the tour, but depends on whether the car was taken on a previous tour
- autos - attribute of households - comes from the auto ownership model
- age - attribute of persons - comes from synthesized population
- tourCategoryJoint - attribute of the tour - whether it's a joint tour
- tourCategorySubtour - attribute of the tour - whether it's a work subtour
- workTourModeIsSOV - attribute of the tour, but depends on whether this is a work-based tour and the mode to work was SOV
- c_ivt - coefficient / constant
- SOV_TIME - skim
- out_period/in_period - attribute of the tour - derived from scheduling model
- c_walkTimeShort - coefficient / constant
- terminalTime - I'm guessing this is the time from car to destination, and is an attribute of the tour because it is a lookup using the tour destination
- c_cost - coefficient / constant
- costPerMile - constant
- SOV_DIST - skim
- dailyParkingCost - is an attribute of the tour, because it is derived from the tour duration and hourly parking cost, which varies by destination zone
- SOV_BTOLL - skim
- c_age1619_da - coefficient / constant
Hopefully I got these somewhat correct. Anyway, there are many similarities with previous models. There are attributes of the households, of persons, and many more of tours, which can be represented as computed variables in the simulation framework - each one of these is likely to be 1-5 lines of Python code and will be available if specified in the CSV.
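To show what I mean by a computed variable being 1-5 lines of Python: below is a sketch using a made-up tours table. The column names (`tour_category`, `parent_tour_mode`) are assumptions for illustration, not the actual schema.

```python
import pandas as pd

# Hypothetical tours table; column names are assumptions.
tours = pd.DataFrame({
    "tour_category": ["mandatory", "joint", "subtour"],
    "parent_tour_mode": [None, None, "SOV"],
})

# Each spreadsheet variable becomes a short computed column, e.g.:
tours["tourCategoryJoint"] = (tours["tour_category"] == "joint").astype(int)
tours["tourCategorySubtour"] = (tours["tour_category"] == "subtour").astype(int)
tours["workTourModeIsSOV"] = (
    (tours["tour_category"] == "subtour")
    & (tours["parent_tour_mode"] == "SOV")
).astype(int)
```

Once computed, these columns can be referenced by name from expressions in the CSV.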
Skim management will be necessary using something like what's beginning to be described here.
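As a rough sketch of name-based skim access (entirely illustrative - the real design is whatever gets settled in that discussion): a dict keyed by skim name and time period, holding zone-by-zone matrices, with a vectorized origin/destination lookup.

```python
import numpy as np

# Toy skims: (name, period) -> zone-by-zone matrix. Values are made up.
skims = {
    ("SOV_TIME", "AM"): np.array([[0.0, 12.5], [11.0, 0.0]]),
    ("SOV_DIST", "AM"): np.array([[0.0, 6.2], [5.8, 0.0]]),
}

def skim_lookup(name, period, orig, dest):
    """Vectorized O-D lookup into the named skim matrix."""
    return skims[(name, period)][np.asarray(orig), np.asarray(dest)]

# e.g. SOV times for two tours: zone 0 -> 1 and zone 1 -> 0
times = skim_lookup("SOV_TIME", "AM", [0, 1], [1, 0])
```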
The one somewhat bizarre part is that all of the coefficients are defined as variables. The reasoning for this is good, though: most of the coefficients are used in multiple places - e.g. the coefficient for in-vehicle travel time is constrained to be equal wherever a person is in a vehicle (maybe even across trip purposes?), which makes a lot of sense. It also looks like some of these are manually specified multipliers - e.g. c_ivtt for light rail is equal to 0.9 * c_ivtt. In most/all previous models each coefficient is only used once, so it can be defined "inline" in a cell in the CSV file.
So in some sense, the coefficients in this case are global constants, which could easily be defined in YAML - or, given that some are simple computations, defined directly in Python.
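Defined in Python, that might look like the sketch below. All of the values are invented placeholders; the point is that derived multipliers can live right next to the estimated constants they are computed from.

```python
# Sketch of coefficients as module-level constants; values are made up.
c_ivt = -0.028           # in-vehicle time, shared wherever a person is in a vehicle
c_walkTimeShort = -0.066
c_cost = -0.003
costPerMile = 18.3       # auto operating cost, cents per mile (assumed units)

# Manually derived multipliers sit next to their parents:
c_ivt_lrt = 0.9 * c_ivt  # e.g. light rail IVT specified as 0.9 * c_ivt
```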
At that point, much of the model is defined in Python, including
- computed attributes of different tables
- the dataframe of choosers - which is a merged set of tables that includes relevant computed attributes defined above
- the dataframe of alternatives
- skims that are managed by giving them names
- constants including some that are estimated (and some that are manually derived from estimated constants?)
The CSV then contains a list of simple expressions that can refer to any of the above, together with the alternative(s) whose utility each expression's term should be added to.
I understand it's a bit of a cliffhanger but I think I'm going to leave it here for now. Any comments or clarifications before we discuss on this end?
I also wonder how much complication is still left out of this first pass in terms of at least 1) coordination among households and 2) coordination of schedules. When decisions are mostly independent, things are very amenable to Python/Pandas/Numpy - when they're interdependent things can become complicated. Do we need to consider coordination at this point or soon?