===============================
=== gps2gtfs README ===
===============================
A suite of Python and SQL tools for matching bus tracking data to GTFS data.
=== License ===
gps2gtfs is released under the MIT license.
See LICENSE in this directory.
=== About ===
At the moment this is a more-or-less scattered collection of tools which you
can use to match GPS tracking data of public transportation vehicles with
their GTFS schedule counterparts.
The instructions below give a straightforward step-by-step process which
will accomplish this, but as yet there is no single coherent interface by
which this is done.
=== Instructions ===
0. Prerequisites
You will need:
- a SQL database in which you can make tables and such
- Python
- a Python module for accessing your database (psycopg2 is recommended for PostgreSQL)
- gtfs_SQL_importer tool available at http://cbick.github.com/gtfs_SQL_importer
- GTFS data for your agency imported into the database using the above
- This software
If you are missing any of the first 3:
- I recommend PostgreSQL. It can be a pain to set up but handles very nicely.
- Get Python from www.python.org. I used version 2.5.x when writing this.
- You can find python modules for your database from:
http://wiki.python.org/moin/DatabaseInterfaces
If you have PostgreSQL I recommend you get psycopg2:
http://initd.org/pub/software/psycopg/
1. Set up python database access
If you're using postgres or other socket-communicating db server:
In the config folder there is a file called db.config.example. Make a copy
of this file called db.config and edit it appropriately.
If you're using anything other than postgres:
Modify common/src/dbutils.py to handle your DBMS (shouldn't be difficult).
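For a quick smoke test of your setup, a minimal sketch along these lines
should work, assuming PostgreSQL and psycopg2 (the credentials here are
placeholders for whatever you put in db.config):
  import psycopg2
  # Placeholder credentials; use the values from your db.config.
  conn = psycopg2.connect(database="mydbname", user="myusername",
                          host="localhost")
  cur = conn.cursor()
  # gtf_routes is created by the gtfs_SQL_importer step above.
  cur.execute("select count(*) from gtf_routes;")
  print "GTFS routes imported:", cur.fetchone()[0]
  conn.close()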
2. Create the schema for bus tracking data.
From the core/sql folder, apply the make_tables.sql file to your database.
Using psql as an example, you would run the command:
psql mydbname myusername -f make_tables.sql
3. Pull the bus tracking data into the DB.
It is up to you to figure out how to do this for your particular
data source. From the perspective of Python, the goal is to get
your data points into the format expected by the update_routes()
method inside the core/sql/dbqueries.py file. Otherwise, you will
need to fit your data into the database schema yourself, in
particular into the vehicle_track table that was created in step 2.
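To make the direct route concrete, here is a hedged sketch of inserting a
single tracking point with psycopg2. Only the dirtag and routetag columns
are confirmed elsewhere in this README; the other column names below are
hypothetical, so check make_tables.sql for the real ones:
  import psycopg2
  conn = psycopg2.connect(database="mydbname", user="myusername")
  cur = conn.cursor()
  # Column names other than dirtag and routetag are hypothetical.
  cur.execute("""insert into vehicle_track
                   (vehicle_id, lat, lon, reported_time, dirtag, routetag)
                 values (%s, %s, %s, %s, %s, %s);""",
              ("1234", 37.7793, -122.4192, 1240000000, "38_OB1", "38"))
  conn.commit()
  conn.close()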
To grab San Francisco Muni tracking data you can just use the code
that is already provided in the nextmuni_import directory. This is
set up to pull the data directly from the nextmuni xml service.
To use it, follow these steps:
3.1. Fix erroneous SFMuni GTFS data:
From the nextmuni_import/sql directory, apply the change_sf_gtfs.sql
file.
3.2. Insert correct routeid/dirtag matchup data:
From the nextmuni_import/sql directory, apply the populate_rid_dirtag.sql
file.
3.3. Set up a job to collect the data:
Essentially, you want the Python script found at
nextmuni_import/src/route_scraper.py to run every minute
(or however often you choose), giving it "ALL" as an
argument:
python route_scraper.py ALL
To get this to run every minute, the easiest thing is to set up
a cron job; an example entry is shown below. Any other method of
running it on a regular basis works just as well.
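For example, a crontab entry along these lines (the path is a
placeholder) runs the scraper once a minute:
  * * * * * cd /path/to/gps2gtfs/nextmuni_import/src && python route_scraper.py ALL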
*** Before you choose how to provide data to the db, see
*** step 4.2 below.
4. Do some pre-matchup setup.
Now that there is tracking data in your database, you probably want to
match that to the GTFS schedule, since you are using this package. To
do this you first need to do some preprocessing.
4.1. From the core/src directory, run the following:
python Preprocessing.py trips
This separates the tracking data into distinct routes and trips.
4.2. Then, you have two options:
a) Populate the routeid_dirtag table manually. (This was already done
in step 3 if you imported data from San Francisco Muni.) The idea
is to match NextBus "dirtag" values up with GTFS route IDs; see the
example insert after this list. The reason to do this manually is that
NextBus and GTFS are both riddled with errors that you may or may not
have caught yet.
*** If you aren't using a NextBus data source, then you must choose
*** what to do depending on what you provided for the dirtag field in
*** step 3.
b) Try to automatically populate the routeid_dirtag table. This will not
always work, since as I said in (a) there are lots of data errors over
which I have no control nor which do I have any method to detect.
You can do this as follows (from core/src directory):
python Preprocessing.py populate_ridtags
Before you do this though you might want to check out the results
from the following SQL queries:
select count(*) as cnt, route_short_name from gtf_routes
group by route_short_name having count(*) > 1;
select count(distinct routetag) from vehicle_track
where dirtag != 'null' and routetag != 'null'
group by dirtag having count(distinct routetag) > 1;
If there are 0 results then you are in good shape. If there are a lot
of results then you are not in so good shape, and should give serious
consideration to (a). The data I currently have for SF Muni returns
121 rows for the second query.
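For reference, the manual population in option (a) amounts to inserts
along these lines. The routeid_dirtag column names used here are guesses,
so verify them against make_tables.sql (or the SF Muni
populate_rid_dirtag.sql):
  -- Hypothetical column names; verify against make_tables.sql.
  insert into routeid_dirtag (route_id, dirtag)
    select route_id, '38_OB1' from gtf_routes
    where route_short_name = '38';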
5. Match GPS data to GTFS schedules (hoorah!)
** NOTE: You might want to selectively apply some of the indexes found in
** the core/sql/make_table_indexes.sql file, before doing the following.
From the core/src directory, run:
python Preprocessing.py match_trips
You have now successfully matched the trips from 4.1 to GTFS scheduled trips.
Now, run:
python Preprocessing.py gps_schedules
This populates the data telling you the actual schedule that each GPS trip
followed (i.e., when each bus actually arrived at each stop). Specifically,
this data is put into the gps_stop_times table (compare to gtf_stop_times).
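As a quick sanity check you can eyeball one matched trip. The column names
here are guesses based on the gtf_stop_times analogy, so adjust them to the
actual schema:
  -- Hypothetical column names, by analogy with gtf_stop_times.
  select stop_id, arrival_time from gps_stop_times
  where trip_id = 1  -- pick any trip id produced in step 4.1
  order by stop_sequence;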
6. Make some corrections
Some situations can result in erroneous assignments after following step 5.
To fix these, run:
python Preprocessing.py fix_earlybirds
This searches for better matches for routes which are abnormally off-schedule.
*** You can visually test the gtfs<->gps matchings using the OSMViz package,
*** available at http://cbick.github.com/osmviz .
*** Once you have OSMViz, go to the core/src/ directory and look at
*** the visualizer.py file. By editing the query at the end of the file
*** (inside the "if __name__ == '__main__'" clause), you can choose what
*** to visualize. Run "python visualizer.py" and enjoy the show.
7. Get ready to analyze
From the core/sql folder, you will want to apply the make_table_indexes.sql
file to your database. This will make all future analyses much, much faster.
8. Analyze
There is some postprocessing/analysis code in the postprocessing directory.
The postprocessing/sql directory holds code for creating a summary table of
bus lateness data, as well as various queries that were useful to me
when exploring the data. The postprocessing/src directory holds the Stats.py
file which performs a bunch of statistical analysis and makes plots of the data,
but it requires the SciPy and Matplotlib libraries.
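If you would rather roll your own plots, a minimal sketch with psycopg2 and
Matplotlib might look like the following. The lateness_summary table and its
lateness column are hypothetical stand-ins for whatever the
postprocessing/sql scripts actually create:
  import psycopg2
  import pylab
  conn = psycopg2.connect(database="mydbname", user="myusername")
  cur = conn.cursor()
  # Hypothetical table/column; see postprocessing/sql for the real names.
  cur.execute("select lateness from lateness_summary;")
  lateness = [row[0] for row in cur.fetchall()]
  conn.close()
  # Histogram of how far behind schedule buses arrive at stops.
  pylab.hist(lateness, bins=50)
  pylab.xlabel("Seconds behind schedule")
  pylab.ylabel("Number of stop arrivals")
  pylab.show()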