
activitysim's Introduction

ActivitySim


The mission of the ActivitySim project is to create and maintain advanced, open-source, activity-based travel behavior modeling software based on best software development practices for distribution at no charge to the public.

The ActivitySim project is led by a consortium of Metropolitan Planning Organizations (MPOs) and other transportation planning agencies, which provides technical direction and resources to support project development. New member agencies are welcome to join the consortium. All member agencies help make decisions about development priorities and benefit from contributions of other agency partners.

🔥 The main branch of this repository contains the Consortium's latest in-development codebase. It is not necessarily what you'll get if you install released code from conda-forge or by downloading one of the "release" versions here on GitHub, but it is generally expected that code in the main branch should be usable.

activitysim's People

Contributors

albabnoor, aletzdy, ampo-staff, andrewthetm, asiripanich, billyc, bricegnichols, bstabler, bwentl, chesterharvey, chronial, dhensle, esanchez01, fscottfoti, i-am-sijia, jamiecook, jiffyclub, joecastiglione, joejimflood, jpn--, lachlan-git, mxndrwgrdnr, navsarmajs, nick-fournier-rsg, stefancoe, toliwaga, vivekyadav26, waddell, xyzjayne


activitysim's Issues

only calling toframe with variables used in a given model

@jiffyclub You know how in UrbanSim we only compute the variables that are actually used in the yaml specs? We will need the same thing for the csv specs. Right now it computes all the variables and in a lot of cases variables simply aren't available until later on in the model chain. No hurry as I can work around it, but wanted to make a note.
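One way to get that lazy behavior is to parse each spec expression and collect only the column names it references before computing anything. A minimal sketch, assuming plain Python-syntax expressions (the real CSV spec format, e.g. `@`-prefixed expressions, may need extra handling):

```python
import ast

def variables_in_expressions(expressions):
    """Collect the column names referenced by a list of spec expressions,
    so only those variables need to be computed (a sketch; the real spec
    format may differ)."""
    names = set()
    for expr in expressions:
        tree = ast.parse(expr, mode="eval")
        names.update(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))
    return names

needed = variables_in_expressions(["income > 50000", "autos == 0", "age * c_age"])
```

The model runner could then ask orca to compute only the columns in `needed` when assembling the choosers table.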

mandatory_scheduling tour choice fails if no time window for second tour

mandatory_scheduling fails with "no non-zero probability alternatives" for both work and school tours when the first tour chosen is the last possible tour of the day, so that all remaining alternatives are assigned -999.

For now, I simply assigned tours starting before the previous tour ends a penalty of -100 instead of -999:

# FIXME - Subsequent tour must start after previous tour ends,(tour_num > 1) & (start < end_previous),-999
Subsequent tour must start after previous tour ends,(tour_num > 1) & (start < end_previous),-100

duplicate column definitions?

It seems these orca virtual column definitions

@orca.column("persons_workplace")
def school_taz(persons):
    return pd.Series(1, persons.index)

@orca.column("persons")
def workplace_taz(persons):
    return pd.Series(1, persons.index)

are not needed in persons.py since they are added by school_location.py and workplace_location.py respectively

orca.add_column("persons", "workplace_taz", choices)

orca.add_column("persons", "school_taz", choices)

Discussion topics for 2/20 meeting

It's been almost 2 months since our last meeting regarding technical progress on activitysim, so it's good to get back into it. We've been a bit light on commits this week, but there are good reasons for that which I'll cover below and during our meeting tomorrow. So let's recap recent progress.

  • First, we are moving forward with the CSV configuration and Python coding approach which seems to work well for the models we've encountered thus far. This should come as no surprise, but I am increasingly convinced that this approach is good for this project, and we will start work to document and test this more officially.
  • We have taken steps to integrate with the separate OMX repository. The branch still exists in this repo should we ever need any of those changes again, but we're moving forward with the assumption that OMX will remain a separate repo. It's worth noting that there is an open pull request on the OMX repository with some cosmetic changes, and to date we've gotten no response to either the PR or the open issue that we've posted to that repository. We did however get the updates we needed to install OMX for activitysim.
  • After we got the necessary changes in OMX, we've been able to close some open issues and PRs on activitysim, which is good because we don't want too many outstanding PRs at one time as the changes would eventually begin to conflict with each other.
  • We spent a fair amount of time, while waiting for licensing issues to resolve, trying to parallelize UrbanSim (which would have ramifications for the parallelization of activitysim). We were able to make some simple changes (mainly caching) which increased performance by 20-30% in some cases, but our opinion at this time is that UrbanSim's highly interdependent nature makes parallelization difficult - each model depends on the previous step, and there are few subproblems that can be cleanly segmented (a person can move from one side of the region to the other, for example, so partitioning the problem is hard) and so easily amenable to parallelization.

We looked at the Multiprocessing module, Cython, Continuum's Blaze, and columnar databases, but our opinion is that the amount of work necessary to reframe the problem using these tools would be extensive, and even then the gains are likely to be modest. We've profiled the code extensively at this point, and the great majority of the time is spent in database-like operations - merges, lookups, groupbys, aggregations, and the like. This is where the time should be, and there's not a great deal on our end that we can do to speed these things up. Doing these operations in C with all the data in memory (what we're doing now) is definitely the state-of-the-art for these sorts of operations, but we will continue to monitor computational frameworks for advances in this area.

That is not to say that activitysim will necessarily have the same limitations as urbansim. For instance, it's easy to imagine that if you can parallelize households in activitysim (that they're not co-dependent and don't need to synchronize with other households), then you can split batches up among computing hardware and gather results at the end. We will certainly evaluate this when the time comes - but it is likely true that it will be challenging to parallelize the computation within an individual model. We can talk about this in some detail if you would like and are happy to answer questions about this.
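The household-level parallelism described above might look something like this sketch, where independent households are split into batches and the results reassembled (shown serially for clarity; each batch could be handed to a separate process via, say, multiprocessing.Pool.map; `simulate_batch` is a hypothetical stand-in for the model chain):

```python
import numpy as np
import pandas as pd

def simulate_batch(households):
    # stand-in for running the full model chain on one batch of households
    return households.assign(processed=True)

def run_in_batches(households, n_batches=4):
    # Households are assumed independent, so each batch could run in a
    # separate process and the results be concatenated at the end.
    splits = np.array_split(np.arange(len(households)), n_batches)
    results = [simulate_batch(households.iloc[idx]) for idx in splits]
    return pd.concat(results)
```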

  • The bulk of our time in the last 2 weeks or so has been spent on CDAP - the Coordinated Daily Activity Pattern model. This has been a very interesting problem! Initially I coded up a prototype of CDAP which was going well for one- and two-person households, and we've even converted the csv specs over to the new framework for these households. Through several emails with Dave Ory, I realized that I had interpreted the methodology incompletely and that the real method is a bit more complicated. Basically there are contributions to the utility that come from every 1-person, 2-person, and 3-person combination of a multi-person household, so there are, for instance, 4 passes through the core MNL code for a 2-person household, etc.

At any rate, we were able to quickly get the information we needed from Dave, go back to the drawing board, and develop the best way to tackle the problem. Why was this so challenging, you might ask? What's interesting is that in Python, all numeric computation has to be vectorized - for loops are incredibly slow - and for a problem like this vectorization is not trivial. In other words, we have to build large dataframes in which a given household appears multiple times, once for each of the permutations of the people in that household.
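The permutation table described above could be built along these lines - a hypothetical sketch, not the actual CDAP code (the table itself is assembled with loops over small combination sets, after which the utility computations run vectorized over the long table):

```python
from itertools import combinations

import pandas as pd

def person_combinations(persons, k):
    """For each household, emit one row per k-person combination: the long
    table over which the k-way CDAP interaction utilities can then be
    computed in a single vectorized pass."""
    rows = []
    for hh_id, group in persons.groupby("household_id"):
        for combo in combinations(group.index, k):
            rows.append({"household_id": hh_id,
                         **{f"person_{i + 1}": p for i, p in enumerate(combo)}})
    return pd.DataFrame(rows)
```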

Again, we can go into more depth on the call, but we do have a clear plan to move forward with this. Matt is currently working on it, and it will be completely documented and tested, which requires significantly more attention than the prototyping we were doing before the break. We should also discuss why vectorization is such a problem for CDAP and why this is a challenge for the use of Python, which requires vectorized computation for performance. I think I've said enough about this at this point and we can do the rest at the meeting tomorrow.

  • It's probably not hugely relevant to this group, but if you're interested, in January we put together an OpenStreetMap importer for our accessibility engine which is now ready to go. GTFS (transit) will be next when we get the chance.
  • We should also talk about next steps in the call as I hope to have some cycles freeing up next week to work on this. It is my hope that Matt will continue with CDAP and I will start looking at the next model in the list (which is tour generation).

@jiffyclub feel free to add anything I have overlooked. Thanks and talk to you tomorrow!

Discrete Choice Model Capabilities

UrbanSim has location choice models, but they are somewhat specialized to their application in real estate modeling. (For example, the LCMs always remove choices from the alternatives pool once they have been chosen.) As a first step toward adding models to ActivitySim we're working to generalize the discrete choice classes in UrbanSim so that they can be reused here. You can follow my work in the dcm branch in UrbanSim.

Also related to this, we've added tests of UrbanSim's MNL implementation to make sure it matches R's mlogit package. You can see that work in UDST/urbansim#132.

tracing requirements

we're going to implement tracing in addition to logging (#87) since we need this feature to track down data and expression errors related to #81. What are the requirements for this functionality?

We discussed this on our call today and came up with the following list:

  • easy to parse text format such as CSV
  • easy to import into Excel since this is the tool most users will use for exploring the traced results
  • maybe two outputs - one big log file like CT-RAMP does today so you can see relationships between sub-model data and expressions, and then a CSV file for each sub-model?
  • user will specify one HH ID to trace
  • output information will be similar to what CT-RAMP produces
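A minimal sketch of the per-sub-model CSV output for a single traced household (the file naming convention and `household_id` column are assumptions, not the implemented design):

```python
import pandas as pd

TRACE_HH_ID = 2  # the single household id the user asked to trace (hypothetical)

def trace(df, submodel, hh_col="household_id"):
    """Write the rows belonging to the traced household to a per-sub-model
    CSV, easy to open in Excel per the requirements above."""
    traced = df[df[hh_col] == TRACE_HH_ID]
    traced.to_csv(f"trace_{submodel}.csv")
    return traced
```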

tour_mode_choice throws runtime error for some tours

Runtime error in tour_mode_choice_simulate running the full zone dataset with 12000 HH sample and random seed of 0

settings.yaml:

preload_3d_skims: True
households_sample_size: 12000

simulation.py:

def set_random_seed():
    np.random.seed(0)

orca.add_injectable("set_random_seed", set_random_seed)

Stack trace on error:

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 2411, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1802, in run
    launch(file, globals, locals)  # execute the script
  File "/Users/jeff.doyle/work/activitysim/sandbox/simulation.py", line 104, in <module>
    orca.run(['tour_mode_choice_simulate'])
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py", line 1876, in run
    step()
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py", line 780, in __call__
    return self._func(**kwargs)
  File "/Users/jeff.doyle/work/activitysim/activitysim/defaults/models/mode.py", line 195, in tour_mode_choice_simulate
    cache_skim_key_values=cache_skim_key_values)
  File "/Users/jeff.doyle/work/activitysim/activitysim/defaults/models/mode.py", line 130, in _mode_choice_simulate
    locals_d=locals_d)
  File "/Users/jeff.doyle/work/activitysim/activitysim/activitysim.py", line 216, in simple_simulate
    choices = make_choices(probs)
  File "/Users/jeff.doyle/work/activitysim/activitysim/mnl.py", line 67, in make_choices
    return pd.Series(choices, index=probs.index)
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/series.py", line 228, in __init__
    data = SingleBlockManager(data, index, fastpath=True)
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/internals.py", line 3752, in __init__
    fastpath=True)
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/internals.py", line 2461, in make_block
    return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
  File "/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/internals.py", line 84, in __init__
    len(self.mgr_locs)))
ValueError: Wrong number of items passed 1277, placement implies 1278

Nonvectorized loop over full persons table in process_mandatory_tours

process_mandatory_tours has a loop that iterates over all rows in the persons table in Python (not vectorized).

This is called when mandatory_scheduling processor runs and mandatory_tours table is loaded by orca.

This is liable to create performance problems with a full dataset.
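For reference, the loop could likely be replaced with an Index.repeat expansion along these lines (a sketch with illustrative column names, not the actual process_mandatory_tours code):

```python
import pandas as pd

def make_tours(persons):
    """Expand the persons table to one row per mandatory tour without a
    Python loop, using Index.repeat."""
    tours = persons.loc[persons.index.repeat(persons["num_mandatory_tours"])]
    tours = tours.reset_index().rename(columns={"index": "person_id"})
    # number the tours within each person
    tours["tour_num"] = tours.groupby("person_id").cumcount() + 1
    return tours
```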

skims management

One thing we will eventually need is to pass in some sort of skims management object rather than individual skims like we do here.

We definitely should not have to change the parameters to the model in order to add a new skim. Really that's easy enough to do though - the easiest thing is probably to wrap up a dictionary of key-value pairs where keys are the names of the skims and values are the 2D matrices (in dense form) or 1D matrices (in sparse form). This is already being done here, but we should move this into a Pythonic object or even just an injectable.

It's also possible we could represent skims as a data frame like this and use merge and take directly...

Index    AM_TT  PM_TT
(0, 0)       5      4
(0, 1)      10      8
...        ...    ...
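That representation maps naturally onto a pandas DataFrame with an (origin, destination) MultiIndex, so a batch of trips can be looked up in one vectorized operation; a sketch:

```python
import pandas as pd

# One row per (origin, destination) pair, one column per skim.
skims = pd.DataFrame(
    {"AM_TT": [5, 10], "PM_TT": [4, 8]},
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1)], names=["o", "d"]),
)

# Look up travel times for a batch of trips via reindex on the O-D pairs:
trips = pd.MultiIndex.from_tuples([(0, 1), (0, 0)], names=["o", "d"])
am_times = skims["AM_TT"].reindex(trips)
```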

max residual window

So I don't keep bombarding Dave with these questions, I might start posting them here. I'm working my way through the non-mandatory tours model for MTC for full-time workers and ran into the concept of the maximum residual window and the expression ln(max(window1,max(window2,window3))+1) - are these time windows? Can someone elucidate a bit? Thanks!

mode choice

I think it's worth a description of mode choice from an implementation perspective just so we're all on the same page as to how things work now.

To that end I took the work mode choice model excel worksheet and made a csv out of it and put it here

This spreadsheet seems to be doing multiple things, including defining and computing some variables, defining the nesting structure, and creating both filters and expressions for the individual variables.

My first impression is that this isn't so bad, at least in the sense that I can still pretty much understand what is intended, and how the mode choice model is built out of this. I imagine this is sort of "bending over backwards" to put as much power in the CSVs and the people who want to edit those CSVs as possible. I really don't disagree with that approach, but would definitely do a bit more in Python and would make a few other changes.

For starters, I don't like that there are 18 columns here for each of the alternatives and mostly whitespace in the cells. I would suggest a "stacked" format for this with a column for "alternative" and fill in the name of the alternative with each row.

I'm also going to assume that it's a better idea to define the nesting structure directly as a hierarchical dictionary (probably in YAML) and that our main task is to create the correct variables, and the correct coefficients by which to multiply them.
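For example, the nesting structure might be expressed as a hierarchical YAML mapping along these lines (the alternative and nest names are illustrative, not the actual MTC spec):

```yaml
# Hypothetical nested-logit structure for tour mode choice
root:
  auto:
    - drive_alone
    - shared_ride_2
    - shared_ride_3p
  nonmotorized:
    - walk
    - bike
  transit:
    - walk_transit
    - drive_transit
```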

I think if we can do this for one of the alternatives we can do this for the rest, so we can start by looking at the drive alone alternative. I extracted that and put it up as a gist here so we can see it outside of the rest of the very large file.

I think the important exercise is to look at each of the expressions used here and figure out where they come from. Below are the expressions and my best guess as to what they mean.

  • sovAvailable - attribute of the tour, but depends on whether the car was taken in previous tour
  • autos - attribute of households - comes from the auto ownership model
  • age - attribute of persons - comes from synthesized population
  • tourCategoryJoint - attribute of the tour - whether it's a joint tour
  • tourCategorySubtour - attribute of the tour - whether it's a work subtour
  • workTourModeIsSOV - attribute of tour, but depends on whether this is work-based tour and mode was SOV to work
  • c_ivt - coefficient / constant
  • SOV_TIME - skim
  • out_period/in_period - attribute of the tour - derived from scheduling model
  • c_walkTimeShort - coefficient / constant
  • terminalTime - I'm guessing this is the time from car to destination, and is an attribute of the tour because it is a lookup using the tour destination
  • c_cost - coefficient / constant
  • costPerMile - constant
  • SOV_DIST - skim
  • dailyParkingCost - is an attribute of the tour, because it is derived from the tour duration and hourly parking cost, which varies by destination zone
  • SOV_BTOLL - skim
  • c_age1619_da - coefficient / constant

Hopefully I got these somewhat correct. Anyway, there are many similarities with previous models. There are attributes of the households, of persons, and many more of tours, which can be represented as computed variables in the simulation framework - each one of these is likely to be 1-5 lines of Python code and will be available if specified in the CSV.

Skim management will be necessary using something like what's beginning to be described here.

The real somewhat bizarre part is that all of the coefficients are defined as variables. The reasoning for this is good though, as most of the coefficients are used in multiple places - e.g. the coefficient for in-vehicle travel time is constrained to be equal wherever a person is in a vehicle (maybe even across trip purposes?), which makes a lot of sense. It also looks like some of these are manually specified multipliers - e.g. c_ivtt for light rail is equal to 0.9 * the base c_ivtt. For most/all previous models each coefficient is only used once and so can be defined "inline" in a cell in the csv file.

So in some sense, the coefficients in this case are global constants, which could easily be defined in YAML or given that some are simple computations, I think defining directly in Python is an option.

At that point, much of the model is defined in Python, including

  • computed attributes of different tables
  • the dataframe of choosers - which is a merged set of tables that includes relevant computed attributes defined above
  • the dataframe of alternatives
  • skims that are managed by giving them names
  • constants including some that are estimated (and some that are manually derived from estimated constants?)

In the CSV is then a list of simple expressions that can refer to any of the above, together with the alternatives that the associated utility should be added to.
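A minimal sketch of that evaluation loop, assuming a spec of (alternative, expression, coefficient) rows and using DataFrame.eval against the merged choosers table (all names here are illustrative, not the real spec format):

```python
import pandas as pd

# hypothetical merged choosers table with computed attributes already present
choosers = pd.DataFrame({"autos": [0, 2], "sov_time": [10.0, 20.0]})

# hypothetical spec rows: (alternative, expression, coefficient)
spec = [
    ("drive_alone", "autos > 0", -0.5),
    ("drive_alone", "sov_time", -0.02),
]

# accumulate each expression's contribution to the drive_alone utility
utils = pd.Series(0.0, index=choosers.index)
for alt, expression, coef in spec:
    utils += choosers.eval(expression) * coef
```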

I understand it's a bit of a cliffhanger but I think I'm going to leave it here for now. Any comments or clarifications before we discuss on this end?

I also wonder how much complication is still left out of this first pass in terms of at least 1) coordination among households and 2) coordination of schedules. When decisions are mostly independent, things are very amenable to Python/Pandas/Numpy - when they're interdependent things can become complicated. Do we need to consider coordination at this point or soon?

"other than modeled person" fomulation

@jiffyclub I think this one is for you. I'm working my way through hundreds of variable specifications, many of which are straightforward. I now see about 20 that all have the same formulation that has me a bit stumped. Something along the lines of a [retiree] in the household that is NOT the current person. What do you think is the best way (pandas and/or simulation framework) to formulate a variable like this from the perspective of the chooser?
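One pandas formulation (offered as a sketch, not necessarily the framework's answer) is to compute the household total with a groupby transform and subtract the person's own contribution:

```python
import pandas as pd

persons = pd.DataFrame({
    "household_id": [1, 1, 2, 2, 2],
    "is_retiree":   [1, 0, 0, 1, 1],
})

# "a retiree in the household other than the current person" =
# household total minus the person's own contribution
hh_retirees = persons.groupby("household_id")["is_retiree"].transform("sum")
persons["other_retiree_in_hh"] = (hh_retirees - persons["is_retiree"]) > 0
```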

sampling of alternatives

I'm curious to get a little more detail on the DestinationChoiceAlternativeSample.xls file. Am I correct in assuming this is for biased sampling of alternatives for destination choice? Is the form of these models still logit?

It's interesting because the spec for alternatives sampling is almost the same as the destination choice spec, but not quite - e.g. the mode choice logsums are added. And in the roadmap we decided to leave the mode choice logsums out of our first iteration of this - so should I maybe just start with the alternative sampling spec and use that to do the actual destination choice for now? Thoughts here?

Large Scale Performance Test

This task needs to be fleshed out more in terms of what is expected and what types of benchmarks will be assigned to it. Looking for @bstabler to provide more insight on this over the next couple of weeks.

should implement restartable, regressible data pipeline

This would be useful for a number of reasons

  1. ability to resume jobs at a known point would facilitate debugging of problems with big datasets late in the pipeline
  2. checkpoints would allow regression testing of results at all points in data pipeline
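A sketch of the checkpointing idea, where each step's output table is saved on first run and reloaded on restart (the file layout and CSV serialization are assumptions for illustration):

```python
import os

import pandas as pd

CHECKPOINT_DIR = "checkpoints"  # hypothetical location

def run_step(name, func, tables):
    """Run one model step, or reload its saved output if a checkpoint
    already exists, making the pipeline resumable. The saved tables also
    give fixed points against which regression tests can compare."""
    path = os.path.join(CHECKPOINT_DIR, f"{name}.csv")
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0)
    result = func(tables)
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    result.to_csv(path)
    return result
```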

round trip auto times to school and work should be free flow

MTC TM1 UECs compute round-trip auto time to school and work using free-flow skims. The virtual columns in the persons_work and persons_school tables currently use am/pm and am/md skims respectively.

Until and unless we decide to do otherwise, we should conform to MTC TM1.

agenda for meeting - 3/5/15

This fortnight we made progress on models 4-8 of the roadmap (list below). We have the basic flow between these models mostly operational, and a good portion of the relevant variables are defined. Generally, whenever something looked like it would take longer than a few hours I opened an issue and saved it for later. I wanted to get a pretty good idea of how the whole model system might look to know if there were any major gotchas, and there are some minor issues but nothing major as of yet.

  • Mandatory tour generation
  • Non-mandatory tour generation (no joint tours)
  • Non-mandatory tour location choice (no mode choice logsums)
  • Mandatory tour scheduling
  • Non-mandatory tour scheduling

Although there are a number of intermediate branches that haven't been merged yet, this is the branch that contains reorganized code and the one I will be walking us through tomorrow.

https://github.com/synthicity/activitysim/tree/file-reorg/activitysim/defaults

For reference, a run of the current modeling system is available here to show what it looks like to execute:

http://nbviewer.ipython.org/github/synthicity/activitysim/blob/file-reorg/example/simulation.ipynb

I'm also keeping a document of questions / issues which we can cover as time permits:

https://github.com/synthicity/activitysim/blob/file-reorg/example/README.md

Matt has also made significant progress with CDAP, with full testing, etc, but this is not yet complete.

school_location zone alternative sampling needs improvement

school_location was failing for university because sometimes none of the zone alternatives had any universities in them.

The quick fix was for each school type, to restrict the alternative candidate set to zones deemed to have schools of that type

alternatives_segment = alternatives[alternatives[school_type] > 0]

Ultimately we may wish to implement a more sophisticated destination zone sampling approach.
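One possible refinement, sketched here with illustrative column names, is to sample candidate zones with probability proportional to a size term rather than merely filtering on it, so zones with more universities are drawn more often and zero-size zones are never drawn:

```python
import pandas as pd

def sample_zones(alternatives, size_col, n, seed=0):
    """Sample n candidate destination zones with probability proportional
    to a size term. Zones with zero size get zero weight, which also
    prevents the empty-alternatives failure described above."""
    return alternatives.sample(n=n, weights=alternatives[size_col],
                               random_state=seed)
```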

micro-agenda for meeting on 3-27

Tomorrow we'll be discussing Matt's progress on CDAP and the core functionality and go over some of the mode choice pull request.

user stories

A suggestion was made on Friday to start making a list of "user stories," or perhaps more specifically, those things that users of the modeling system might want to change for different implementations or different scenarios. These are things that we want to be able to change easily, version control those changes, and not have to change any core parts of the model system in order to do so. Perhaps this issue can be the place to collect those stories?

Write protect the master branch

The master branch needs to be write protected to only allow pull requests after automated testing is completed and passed.

generic zone labels other than 'TAZ'

@wusun2 - would like support for generic zone labels other than 'TAZ'. This raises a question about the boundary between what the developer is responsible for and what the modeler is responsible for. TAZ is currently hard wired into many of the activitysim/defaults/tables and activitysim/defaults/models classes, probably because it is in the initial input HDF5 data. Is the goal for the modeler setting up a new model to not need to revise these classes? We'll come back to this issue later in the project.

Progress report and discussion topics for 12-19-14 meeting

Progress since the last meeting

Infrastructure

  • Pull request for (YAML-based) general choice modeling in Urbansim
  • Defaults for some standard data sources and variable definitions that are usable for all regions
  • A notebook for moving data from the directory structure into a single HDF5 file
  • A notebook for browsing the registered tables and running the example models
  • A pull request for integrating OMX Python support
  • A pull request to wrap a 2D matrix with a Skim object

Specific Models

  • We now have auto ownership and workplace location choice basically working on MTC data - this last uses the OMX and Skim objects. There are a few design decisions that can be made at this point. Here is a proposal for how the models could work.
    • An example implementation directory (these are client specific)
    • The configuration is still a csv file with the same basic form as the current UPCs.
    • Note that this is different from the current YAML approach in UrbanSim and is discussed in this issue
    • In short, models are specified in the csv file and configured in Python code like this. Transformations that are more complicated than those allowed by Patsy can be specified in the CSV file - we use DataFrame.eval, or straight Python eval when there is an @ as the first character. This allows more flexibility in the csv than is provided in the current yaml files (UECs).
    • Additionally, there are simulation settings here
    • An example of using a skim object is here where OMX is read here, the specific matrix here, these are injected and configured here
    • A dictionary of skims is passed to "simple simulate" - a specified column (e.g. TAZ) must occur in the choosers, where it is used as the origin id, and in the alternatives, where it is used as the destination id
  • A few design decisions to make
    • YAML vs CSV
    • and related: how much in configuration vs in code - how comfortable are folks with Python at this level?
    • also related: the dependencies on UrbanSim right now are basically the sim framework and the low-level mnl routines. Should these be split out into a 3rd library, or is everyone ok with that dependency?
    • main concern at this point is performance - we're not slow per se, but not fast and a lot of memory is used. Did I hear that folks actually parallelize households on different processors?
  • Next 2 weeks (actually first 2 weeks of Jan)
    • If we're comfortable with this proposal, we can comment, document, and test the activitysim code
    • Size variables in workplace location choice
    • Next model is Coordinated daily activity pattern unless we want to head to something complicated

Thoughts?

workplace location or locations?

Hopefully this is a quick question. Let me see how close I am to some details here... We run the workplace location choice model early in the process for those people who are full or part time workers. But a few models later we run the mandatory tour generation model and the alternatives include "two work trips" or "2 school trips" or "work and school." Do people need to be assigned a second work location at some point or is there always only one location in the model?

Design decisions around low-level vs high-level objects, yaml vs. csv serialization, etc...

The code in #4 already raises some interesting questions. The biggest of these is how much to use the dcm code that Matt is currently working on. As Matt and I discussed, dcm is primarily useful for serializing to YAML, for wrapping low-level models inside larger segmented models, and for a few other small things.

If we're not necessarily wed to YAML, and if we're primarily interested in simulation as opposed to estimation (early feedback from Dave Ory says this might be the case), and if we want to stick with the CSV/XLSX format for storing coefficients (which is nice when doing alternative-specific coefficients because that's naturally 2D), then we're free to address this problem a little more directly using the other underlying utilities we have built.

https://github.com/synthicity/activitysim/blob/adding-defaults/example/models.py does exactly this. The directness and conciseness of this approach is compelling. We could potentially build up a new set of utilities with a slightly different set of design decisions if we want to. At this time, the dependencies used from UrbanSim are essentially 1) the simulation framework for variables, tables, etc 2) low-level choice utilities from the urbanchoice directory and 3) utils.py for really nothing important. We've discussed putting these in a different repo before so it's worth mentioning again in the context of this larger discussion.

pandas 0.18 breaks _check_for_variability

DataFrame .describe() treats bool as categorical data in pandas>=0.18 and drops bool columns (unless include='all' is passed, in which case std is returned as NaN), so _check_for_variability in activitysim.py fails silently for specs with mixed bool and numeric target values and noisily for specs with only bool targets.

There is an easy fix: convert the bool columns to int in _check_for_variability before calling describe around line 149. But it is slightly tricky in that the sample returned by random_rows is a view into underlying data so the sample rows may need to be copied before assignment. (pandas seems to be a bit unpredictable as to whether it is actually a view or a copy depending on the size of the data. )

Come to think of it, this is also presumably broken for string types though there are perhaps no instances of them in specs currently. Certainly not any specs with only string columns so it won't fail noisily.

Alternatively I could handle bool types explicitly and avoid the need to copy data, at the cost of the added complexity of handling different column types explicitly.
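A sketch of the bool-to-int fix described above (the function body and row selection are simplified relative to the real _check_for_variability):

```python
import pandas as pd

def check_for_variability(sample):
    """Return the columns whose values never vary across the sample.
    Casting bool columns to int before .describe() works around pandas
    >= 0.18 treating bool as categorical and dropping those columns."""
    sample = sample.copy()  # the sample may be a view; copy before mutating
    for col in sample.columns:
        if sample[col].dtype == bool:
            sample[col] = sample[col].astype(int)
    stds = sample.describe().loc["std"]
    return stds[stds == 0].index.tolist()
```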

correct CDAP

CDAP is supposed to only consider the interactions between up to 5 HH members and then apply some additional utility terms after that. We will review the MTC TM1 code and UECs and correct this. CDAP also currently loops by HHs but this is inefficient.

We will likely re-implement it as a series of batch vectorized calculations:

  • Select the 5 persons for consideration in HHs with 6+ people; select workers and youngest kids
  • Calculate person level utilities
  • Calculate HH level utilities
  • Calculate Person pairwise utilities for HH size 1
  • Calculate Person pairwise utilities for HH size 2
  • Calculate Person pairwise utilities for HH size 3
  • Calculate Person pairwise utilities for HH size 4
  • Calculate Person pairwise utilities for HH size 5
  • Sum up utilities at the HH level

The key is to organize the problem into a series of batch table operations.
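The household-level summation step above can be sketched as a batch operation, assuming illustrative column names (hh_id, person_id, util) rather than the actual CDAP tables: instead of looping household by household, group households of the same size and reduce each batch in one vectorized pass.

```python
import pandas as pd

def hh_utilities_by_size(persons: pd.DataFrame) -> pd.Series:
    """Sum person-level utilities to the household level, one batched
    groupby per household size instead of a per-household loop.
    Column names are illustrative, not the real CDAP schema."""
    hh_size = persons.groupby('hh_id')['person_id'].transform('count')
    totals = pd.Series(0.0, index=pd.Index(persons['hh_id'].unique(), name='hh_id'))
    for size in sorted(hh_size.unique()):
        batch = persons[hh_size == size]          # all HHs of this size at once
        sums = batch.groupby('hh_id')['util'].sum()
        totals.loc[sums.index] = sums
    return totals
```

The outer loop runs at most 5 times (one pass per household size after the 6+ reduction), so the per-household Python loop disappears.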

pytables 3.3.0 breaks openmatrix breaks activitysim

Version 3.3.0 of pytables removed the deprecated camelCase function names in favor of underscore_delimited names, thus breaking references in openmatrix to getNode (now renamed get_node).

I updated the travis test script to require pytables=3.2.3.1 until openmatrix is updated to the new function names.

I also updated setup.py to require tables >= 3.1.0, <3.3.0, but the current situation may present problems to anyone attempting to install activitysim via conda until openmatrix is updated.

Once that happens, the version restrictions should be removed from setup.py and .travis.yml

change all expression file skim references to something smarter

Instead of using skim for OD and skim_t for DO, let's use some better conventions:

  • skim_od (origin to destination)
  • skim_do (destination to origin)
  • skim_os (tour origin to stop destination)
  • skim_sd (stop origin to tour destination)
  • etc

and this should work well for multiple zone systems as well:

  • skim_omdm (origin microzone to destination microzone)
  • skim_omdt (origin microzone to destination transit access point)
  • etc

let's update all the expression files
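The convention above could be registered along these lines; the register_skims helper is hypothetical, only the key names come from the proposal. For a symmetric pair like OD/DO, the DO skim is just the transpose of the OD skim.

```python
import numpy as np

def register_skims(od_matrix: np.ndarray) -> dict:
    """Register a dense skim under direction-explicit keys.
    Keys follow the proposed convention; this helper is illustrative."""
    return {
        'skim_od': od_matrix,      # origin to destination
        'skim_do': od_matrix.T,    # destination to origin (transpose)
    }
```

Expression files would then reference skim_od / skim_do explicitly, and multi-zone-system keys like skim_omdm would slot into the same dict.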

Fastest way to access skims

@jiffyclub At some point pretty soon we'll want to diagnose the fastest way to access skims. Given that we store the skims in OMX format (we might want to consider packing multiple matrices into a single h5 for convenience?), the big question is how to store/access them in memory.

Given our recent history with the .loc command I'm guessing storing zone_ids directly is basically a non-starter. Fortunately, we're storing a dense matrix so we can make sure every zone_id is in the position 1 greater than its index (i.e. zone 1 is in index 0). That way we can either 1) have a dataframe with a multi-index and call .take or 2) have a 2-D numpy array and access them directly, but only for one column at a time. Do we think that 1) is slower than 2)? Because 1) is definitely more attractive from a code perspective. I guess this is the "stacked" vs "unstacked" format question.

At any rate, we should probably write a small abstraction to hide this from the user. Basically we pass in one of the formats above with dimension N and then pass in two series of "origin" and "destination" zone ids and get back the values.
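A sketch of that abstraction, assuming option 2) (a dense 2-D numpy array with zone k stored at index k - 1); the class and method names are illustrative, not the eventual API:

```python
import numpy as np
import pandas as pd

class SkimWrapper:
    """Wrap a dense N x N skim where zone k lives at row/column k - 1,
    and look up values for paired origin/destination zone-id series."""

    def __init__(self, matrix: np.ndarray):
        self.matrix = matrix

    def lookup(self, orig: pd.Series, dest: pd.Series) -> pd.Series:
        # fancy indexing with both axes at once: one vectorized gather
        values = self.matrix[orig.values - 1, dest.values - 1]
        return pd.Series(values, index=orig.index)
```

The caller passes two aligned series of zone ids and gets back a series of skim values on the same index, so the off-by-one bookkeeping stays hidden.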

destination not defined for mandatory_tours

destination_choice assigns destination taz for non mandatory tours.

destination is implicitly workplace_taz for work tours and school_taz for school tours, but the destination fields are never set, breaking a bunch of computed columns like dest_topology that are used by the mode_choice_simulate spec: they come up as NaN values, resulting in NaN utilities that cause the make_choices call to fail in simple_simulate.

This is not caught in testing because mode_choice_simulate in master only runs tour_type 'eatout', which is a non-mandatory tour type and thus has a valid destination.

I am holding off on fixing this because it depends on several cascading shortcuts in the current implementation that we will want to address at once, and I don't want to add a fix until first writing a failing test...
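For reference, the missing assignment would look roughly like this; the column names and helper are assumptions based on the description above, not the eventual fix:

```python
import numpy as np
import pandas as pd

def fill_mandatory_destinations(tours: pd.DataFrame,
                                persons: pd.DataFrame) -> pd.Series:
    """Set destination to workplace_taz for work tours and school_taz for
    school tours, leaving other tours untouched. Persons is assumed to be
    indexed by person_id; column names are illustrative."""
    dest = tours['destination'].copy()
    work = tours['tour_type'] == 'work'
    school = tours['tour_type'] == 'school'
    dest[work] = tours.loc[work, 'person_id'].map(persons['workplace_taz'])
    dest[school] = tours.loc[school, 'person_id'].map(persons['school_taz'])
    return dest
```

With the destination column populated, downstream computed columns like dest_topology would no longer come up NaN for mandatory tours.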

zone numbers must be sequential

ActivitySim assumes model zones (TAZs) are sequentially numbered from 1 to the max number of zones and that external zone numbers come after internal zone numbers. This is required by Cube and is therefore how MTC TM1 is set up. This will need to be revised since many regions and other modeling software platforms skip zones, use zone labels, and have external zone numbers before internal zone numbers. This is somewhat related to the fact that ActivitySim eventually needs to support multiple zone systems as well.
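One way to lift the sequential-numbering assumption is a label-to-index mapping layer; this helper is hypothetical, shown only to illustrate the idea:

```python
import numpy as np
import pandas as pd

def zone_remap(zone_labels) -> pd.Series:
    """Map arbitrary, possibly non-sequential zone labels to dense 0-based
    indices suitable for dense skim array lookups. Hypothetical helper."""
    labels = pd.Index(zone_labels)
    return pd.Series(np.arange(len(labels)), index=labels)
```

Skim access would then go through this mapping instead of assuming zone k sits at index k - 1, which also accommodates external zones appearing before internal ones.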

mini model test

The current example and tests use only the distance skim for all skim data and run only selected purposes, modes, and other expressions. This was likely done in part to be fast and to make use of free online Travis CI integration. In order to continue to flesh out the implementation, we'd like to create a mini model test, with say 20 zones, so we can always be working with a more comprehensive test bed. We did this for the Oregon statewide integrated model and it proved to be extremely useful. We'll need to select 20 zones that have good coverage for the various input data - land use, mode availability, etc. We'll also create a script to automatically create the mini model inputs from the full set of inputs. What do others think about this?

example broken?

Ben reports that example/simulation.py is failing. It is working fine for me on OSX. It may be package version dependent?

@bstabler - could you run pip freeze and capture the output, and I will see if I can reproduce it.

progress report

As promised, here is my progress report with percent completion for the code/modeling for each model.

There are a few different things that we might prioritize.

  1. near completion of these models
  2. near completion of the entire model set, with partially finished implementations of each model
  3. continued work on the core, including some catch-up work moving some helper functions to the core and unit testing them and maybe getting a draft of documentation out
  4. going straight at whatever we think is the "hardest" problem to make sure we can do it and then doing one of the above. I think the hardest problem is probably the coordination of households (maybe in joint scheduling); others might think it's the computation burden required by running the full mode choice model.

I lean towards attacking the hardest problems (4), then catching up on unit testing (3), then doing a fairly complete implementation of the current model set (1) but am certainly open to alternatives.

Test Re-reading the distance matrix from disk

In the 3/25/2016 project meeting, @toliwaga indicated he made revisions to skim.py to test re-reading the distance matrix from disk as opposed to just getting it from memory. This is a ticket to serve two purposes:

  1. Record the outcome of those tests.
  2. If the tests were useful, check the code into the repository.

The goal of this (as I can deduce from the notes) is to read all the required OMX matrices referenced in tour_mode_choice.csv into memory and document the memory usage and runtime implications.
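A generic harness for this kind of comparison might look as follows; it uses a numpy memmap as a stand-in for OMX/HDF5-backed skims rather than the actual skim.py revision, and all names are illustrative:

```python
import os
import tempfile
import time

import numpy as np

def benchmark_access(n: int = 1454, lookups: int = 100_000) -> dict:
    """Time random O-D lookups against an in-memory array vs. a
    disk-backed memmap of the same data. Stand-in for OMX-backed skims."""
    rng = np.random.default_rng(0)
    data = rng.random((n, n))
    path = os.path.join(tempfile.mkdtemp(), 'skim.dat')
    data.tofile(path)
    on_disk = np.memmap(path, dtype=data.dtype, shape=(n, n), mode='r')

    o = rng.integers(0, n, lookups)
    d = rng.integers(0, n, lookups)

    t0 = time.perf_counter()
    in_mem = data[o, d]                      # lookups against RAM
    t1 = time.perf_counter()
    from_disk = np.array(on_disk[o, d])      # lookups that hit the disk/page cache
    t2 = time.perf_counter()

    return {'memory_s': t1 - t0, 'disk_s': t2 - t1,
            'equal': bool(np.array_equal(in_mem, from_disk))}
```

Running this at the real matrix size (1454 x 1454 per the notes elsewhere) would give a first-order answer on whether re-reading from disk is viable.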

remove NA alternatives before calculating probabilities

In a choice model, if an alternative is not available, then it needs to be removed from the alternative set before calculating the probabilities for each available alternative. An alternative will be deemed to be not available if one of the expressions evaluates to the global NA value. This issue is discussed in more detail in #81.
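The mechanics might look like this, assuming NaN marks the global NA value and a simple multinomial logit over the remaining alternatives (a sketch, not the activitysim implementation):

```python
import numpy as np

def probabilities(utilities: np.ndarray) -> np.ndarray:
    """Treat NaN utilities as unavailable alternatives: remove them from
    the choice set and compute logit probabilities over the rest, so
    unavailable alternatives get probability exactly 0."""
    avail = ~np.isnan(utilities)
    # nan_to_num avoids NaN propagating through exp; the mask then zeroes
    # out the unavailable alternatives entirely
    exp_u = np.where(avail, np.exp(np.nan_to_num(utilities)), 0.0)
    return exp_u / exp_u.sum(axis=1, keepdims=True)
```

Each row's probabilities sum to 1 over the available alternatives only, so make_choices never sees NaN.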

agenda and updates for 5/8

Since we skipped a meeting, this is actually the updates for the last 4 weeks. Progress includes:

  • #50 Switched from simulation framework to "orca" (data pipeline orchestration). No longer have any dependencies on UrbanSim.
  • https://github.com/synthicity/activitysim/tree/mode-choice2 - A PR for a fairly complete spec of the tour mode choice model. A few caveats are described in the PR, and the progress report on the wiki is up-to-date with the remaining to-dos.

We spent the last 1.5 weeks working on getting a benchmark for the current approach and fixing obvious performance issues that came up. Right now we're looking at about 2.5 minutes per 10,000 households, or a little less than 6 CPU-hours for all of the models except CDAP. CDAP is definitely the bottleneck and we'll have new benchmarks on that soon. I don't see any reason why we wouldn't be able to parallelize batches of households as well to get that down further if you throw hardware at the problem.

If you read in all of the skims at the same time it uses about 10GB of memory, but most of those skims only get used once or twice so I'm not sure there's a strong reason to keep them in memory at all times (this is also for a 1454x1454 O-D matrix).

I'll try and follow up on this with a benchmark and profile of the whole model set, and some information on the current state of the performance of CDAP, which is much faster than it was 2 weeks ago but might still be the bottleneck in the simulation.

a lot of PRs

As people may have noticed, there are a lot of pull requests right now (more than I would like). I'd love to start merging the early PRs, but we're waiting on a resolution to the omx issue. Right now all these PRs depend on the omx that we have within activitysim. If we want to depend on omx outside of activitysim then we need to make those changes in omx first. Or alternatively, I suppose we could leave omx in activitysim for now and then remove it once the omx repo is setup so that it can be a dependency of activitysim. Bottom line is I will start merging PRs if people are ok with omx being in activitysim for now. Any takers?

Misleading that Skims object setters and getters are not symmetrical

It is very misleading, confusing, and error prone that the Skims object __getitem__ and __setitem__ methods are NOT symmetrical.

__setitem__(self, key, value) adds a Skim object to the hash of skims

__getitem__(self, key) does NOT return the corresponding Skim object, but calls lookup using the implicit left_key and right_key df values and returns a Series

The fact that __getitem__ is a synonym for lookup at first glance appears convenient for the readability of spec expressions, where skims['DISTANCE'] returns NOT the Skim object but the contextualized skim lookup, but this apparent gain in readability comes at the expense of concealing what is really going on behind the scenes.

This is especially error prone because both Skim and pandas Series support two-arg get() methods, so confusion over whether one is dealing with a Skim or a Series is potentially difficult to detect. The dangers of this asymmetry are underscored by the fact that the skims.py injectables distance_skim, sovam_skim, etc. made this mistake for a while without being detected, and this code was apparently written by the authors of the Skims class.
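The asymmetry can be illustrated with simplified stand-ins for the real Skim and Skims classes (these are not the actual activitysim definitions):

```python
class Skim:
    """Simplified stand-in: a dense matrix with a two-arg get()."""

    def __init__(self, data):
        self.data = data

    def get(self, orig, dest):
        return [self.data[o][d] for o, d in zip(orig, dest)]

class Skims:
    """Simplified stand-in showing the asymmetric item access."""

    def __init__(self):
        self.skims = {}
        self.left_key = None
        self.right_key = None

    def __setitem__(self, key, skim):
        # stores a Skim object...
        self.skims[key] = skim

    def __getitem__(self, key):
        # ...but returns lookup values, not the Skim that was stored
        return self.skims[key].get(self.left_key, self.right_key)
```

What goes in via skims['DIST'] = some_skim is not what comes back out of skims['DIST'], which is exactly the trap the injectables fell into.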
