
sparkling's Introduction

Transitmap

For the final project in scalable machine learning (id2223) at KTH, we added live delay predictions to an existing public transport information system that we built in a previous course. Our prediction system receives only a stream of realtime vehicle positions and a scheduled timetable. From these it extracts the true arrival times at past stops and predicts the delay for all future stops in the trip.

transitmap.io is an interactive realtime visualisation of all public transport in Sweden (or those parts of it that have realtime geolocations, anyway).

id2223.transitmap.io is a new version of Transitmap that we heavily modified and extended for the final course project in scalable machine learning at KTH.

This work included collecting a large dataset from the continuous stream of position updates and metadata, implementing a fully modular machine learning pipeline for the delay prediction, and integrating that with the existing architecture of Transitmap. In the following we will describe our data, our architecture, and our prediction model.

Prediction Problem

The specific prediction problem that we are solving is to predict the delta between the scheduled arrival time and the real arrival time for all future stops for all running metros in Stockholm. In other words, we are predicting future delays (and early arrivals) for metros that are currently on their way at any given time.

Data

We are working with the public transport data available from Trafiklab for all of Sweden. This includes timetable data with metadata, as well as a stream of realtime vehicle position updates for many transport agencies. The timetable is in the static GTFS data format and is updated once per day. The realtime position updates are a data stream with new events every 3 seconds. On average this stream delivers over 4000 events per second and over 100 million events per day.

Our custom event processing engine combines information from these two data sources in realtime to produce a continuous stream of position updates joined with all relevant metadata. This combined stream serves as the input for our feature pipeline. We collected the whole stream for 3 weeks, totalling over 2.5 billion events (~1 TB of uncompressed JSON). This dataset was the input for our batch feature pipeline. Since then, our continuous feature pipeline has been extracting new training samples from the data stream for future model iterations.
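
For reference, a single snapshot of the realtime position stream can be read with only a few lines of code. The following is a minimal Python sketch (not part of our event engine): it assumes the requests and gtfs-realtime-bindings packages, and the feed URL is a placeholder for the GTFS Sweden Realtime endpoint from your TrafikLab project.

```python
# Minimal sketch: fetch and decode one snapshot of the GTFS-RT vehicle position feed.
import requests
from google.transit import gtfs_realtime_pb2

# Placeholder URL -- use the realtime endpoint and key from your TrafikLab project.
FEED_URL = "https://<gtfs-sweden-realtime-endpoint>?key=<realtime-api-key>"

response = requests.get(FEED_URL, timeout=10)
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)

for entity in feed.entity:
    if entity.HasField("vehicle"):
        v = entity.vehicle
        print(v.trip.trip_id, v.position.latitude, v.position.longitude, v.timestamp)
```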

Architecture

We built a completely modular machine learning architecture for transitmap, following the best practices we learned in the course.

The diagram below shows Transitmap's architecture, including the dataflow through the system. Components colored green are completely new and were added as part of this course project. Components colored yellow existed previously but were changed in a major way for this project.

Transitmap Architecture Dataflow

Feature Engineering

In order to be able to train a deep learning model on our laptops, we decided to scale down the prediction problem. For this iteration the predictions are limited to metros only (instead of all vehicles, as initially planned). To generate training features from the previously collected data, we essentially simulated all public transport traffic from 2023-12-04 to 2023-12-25 by pushing the collected events through the whole system in accelerated time. From this we extracted 3561 data samples.
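
Conceptually, the accelerated replay just scales down the gaps between recorded event timestamps. A minimal Python sketch of the idea follows; the real replay runs inside our event engine, and the line-delimited JSON format, the timestamp field, and the speed-up factor are illustrative assumptions.

```python
import json
import time

SPEEDUP = 60  # illustrative: replay one minute of recorded time per second

def process(event):
    ...  # placeholder for handing the event to the feature pipeline

def replay(path):
    """Push recorded events through the pipeline in accelerated time."""
    last_ts = None
    with open(path) as f:
        for line in f:
            event = json.loads(line)       # assumed: one JSON event per line
            ts = event["timestamp"]        # assumed: unix seconds
            if last_ts is not None:
                time.sleep(max(0.0, (ts - last_ts) / SPEEDUP))
            last_ts = ts
            process(event)
```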

The entire feature pipeline is implemented in Rust, within Transitmap's larger event processing pipeline. Previously this pipeline only attached the pre-aggregated metadata like the route name and trip headsign to each event. Now, the pipeline uses the raw vehicle coordinates, as well as the trip's timetable, to detect when a stop is reached. The feature pipeline collects this information for each trip in memory and transforms it into a sequence of tokens for training or inference.
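
The stop detection itself can be sketched compactly. The following Python sketch illustrates the idea only; the actual implementation is part of the Rust pipeline, and the arrival radius and data layout are illustrative assumptions.

```python
from math import radians, sin, cos, asin, sqrt

ARRIVAL_RADIUS_M = 50  # assumed threshold for "the vehicle has reached the stop"

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def record_delays(position_updates, stops):
    """Match a trip's position updates against its scheduled stops.

    position_updates: iterable of (timestamp, lat, lon), ordered in time
    stops: list of dicts with 'lat', 'lon' and 'scheduled_arrival' (unix seconds)
    Returns one delay in minutes per reached stop.
    """
    delays, next_stop = [], 0
    for ts, lat, lon in position_updates:
        if next_stop >= len(stops):
            break
        stop = stops[next_stop]
        if haversine_m(lat, lon, stop["lat"], stop["lon"]) <= ARRIVAL_RADIUS_M:
            delays.append(round((ts - stop["scheduled_arrival"]) / 60))
            next_stop += 1
    return delays
```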

Training Samples    Validation Samples    Test Samples    Training Tokens
2507                537                   537             125k

LSTM Model

Because we are working with sequence data, our model is a recurrent neural network for sequence-to-sequence prediction. We model our prediction problem similarly to a language model, where the task is to predict the next token. Our vocabulary encodes relevant metadata like route, direction, day-of-week and time-of-day, as well as stop identifiers and stop-to-stop delay deltas in 1-minute increments between -15 and +15. A sequence starts with the metadata for the specific trip and continues with each stop and delay delta interleaved. During inference we provide all the realtime information up to the current stop in the trip and let the model predict the delays for the rest of the sequence.
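
A minimal PyTorch sketch of this setup is shown below, using the dimensions of our best model from the following table (one LSTM layer, hidden size 64, embedding size 64). The vocabulary size and the dummy batch are placeholders for illustration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 128  # placeholder: metadata tokens + stop ids + delay deltas (-15..+15)

class DelayLSTM(nn.Module):
    """Next-token model over interleaved stop / delay-delta sequences."""

    def __init__(self, vocab_size=VOCAB_SIZE, embedding_size=64, hidden_size=64, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer token ids
        x = self.embed(tokens)
        out, state = self.lstm(x, state)
        return self.head(out), state  # logits for the next token at every position

model = DelayLSTM()

# Training objective: predict the sequence shifted by one position, as in a language model.
tokens = torch.randint(0, VOCAB_SIZE, (8, 32))  # dummy batch of token sequences
logits, _ = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
```

During inference, the observed prefix of a trip is fed through the model and the remaining tokens are decoded autoregressively, for example by greedily taking the most likely delay token at each step.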

We trained a number of models with different dimensions; the table below shows the notable examples that performed best at each model size.

Parameters    Tokens / Param    LSTM Layers    Hidden Size    Embedding Size    Test Loss
134k          0.93              1              128            64                0.251
90k           1.39              2              64             64                0.246
56k           2.23              1              64             64                0.241
30k           4.17              1              32             64                0.246
20k           6.25              1              32             32                0.263

The table above shows that model size plays a central role in model performance. We find the tokens-per-parameter metric especially interesting for building an intuition about how much training data this specific model architecture and prediction problem require. We found the best model performance at ~2.2 tokens per parameter, which is interesting for two reasons:

  • It is much lower than the roughly 10 tokens per parameter that is typical for LLMs. We think this is because our sequences are much more consistently structured than natural language.
  • Because of the structure of our sequences, we end up with roughly 1 delay token per parameter, which intuitively makes sense for this delay prediction problem.
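
As a rough sanity check, the parameter count and the tokens-per-parameter ratio can be reproduced directly from a model instance like the sketch above. The exact number depends on the real vocabulary size, so with the placeholder vocabulary it only roughly matches the 56k model.

```python
num_params = sum(p.numel() for p in model.parameters())
tokens_per_param = 125_000 / num_params  # 125k training tokens, from the dataset table
print(f"{num_params} parameters, {tokens_per_param:.2f} tokens / parameter")
```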

The following diagram shows the layers and dimensions of our best model.

LSTM Model Architecture

How to run locally

Transitmap can run locally using docker-compose. This requires a small amount of setup, as follows.

TrafikLab API Keys

To run Transitmap yourself, you need to provide your own API keys from TrafikLab. This is completely free of charge. Follow these steps to set it up:

  1. Login or create an account on TrafikLab.
  2. Create a new project in your TrafikLab account.
  3. Add API keys for the GTFS Sweden Realtime (beta) and GTFS Sweden Static data (beta) APIs to your project.
  4. Create a file named .env in the root directory of this project and add your API keys to it. It should look like the following.
TRAFIKLAB_GTFS_RT_KEY=<realtime-api-key>
TRAFIKLAB_GTFS_STATIC_KEY=<static-data-api-key>

Note that, while the API keys you have just set up are perfectly fine for testing, they are not enough to run Transitmap continuously. For this, the Guld API tier is required on the realtime API. It can also be requested from TrafikLab free of charge, but processing the request typically takes a couple of days.

Hopsworks API Keys

  1. Login or create an account on Hopsworks.
  2. Follow the instructions to finish account creation.
  3. Find your API key in the profile menu.
  4. Add the key to the same .env file:
HOPSWORKS_API_KEY=<hopsworks-api-key>

Note that the free tier should be enough to support the feature pipeline for quite a long time.

Running

Once you have set up your API keys, you can simply run Transitmap with the following command.

docker-compose up --build

The cluster takes a couple minutes to start fully. Once everything is running, you can connect to the application in your browser on localhost.

We recommend running without the feature uploader and the data exporter, since these require private credentials for Google Cloud Storage access. They are not required for just running the application and are commented out in the compose file by default.

Contributors

cetceeve, jonathanarns, pierrelefevre

sparkling's Issues

Aggregation Cloud Function: Filter by timeframe

The cloud function could produce a much smaller file that only contains trips for the next one or two days.

It would then be important, though, to also re-download the file on that schedule where needed (this is not currently done in the event-engine).

Tracking Issue: More real time than real time

The idea here is to predict vehicle positions a few seconds into the future, so that we can display vehicles at the location where they actually are in the real world.

Good features for this are probably the route, the current position, the time of day, and the day of the week. Time and day should probably be binned.
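
A small sketch of how such binned time features might look (the bin size is an illustrative choice):

```python
from datetime import datetime

def time_features(ts: datetime, minutes_per_bin: int = 30):
    """Bin time of day and day of week into categorical features."""
    time_of_day_bin = (ts.hour * 60 + ts.minute) // minutes_per_bin  # 0..47 for 30-minute bins
    day_of_week = ts.weekday()                                       # 0 = Monday .. 6 = Sunday
    return time_of_day_bin, day_of_week
```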

Trafiklab API Mirror

We might consider mirroring the parts of the Trafiklab API that we use, partly to reduce load on Trafiklab, but mainly to avoid being rate-limited when running multiple instances of Transitmap simultaneously.

This has low urgency.

Logo & Favicon

We need to design a logo for transitmap and then create a favicon from that.

We could also make some merch like stickers with the logo :)

Marker Animations v0

Goal:
Animate markers while waiting for new data points

Solution:
Slide markers to a new location.
Benefits of this solution:

  • when markers get no updates they stay at a reasonable "last-seen" location (as long as we don't slide them very far)

Problems:

  • new location needs to be calculated from a best guess.

Approach:
Update Frequency: We can use Spark to keep an aggregate value over a sliding window of the update frequency we have seen.

From the realtime data we get the current position, bearing, and speed.
We can slide the marker to the current location, determining the animation speed from the update frequency.
We can use the bearing to determine the vector along which to place the marker,
and the speed together with the update frequency to determine how far ahead to move it.
Our goal is to place the marker halfway to where we expect the next update.
For objects that get infrequent updates, we should limit how far ahead we place the marker.
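
A rough sketch of this extrapolation (in Python for brevity; in the frontend this would live in JavaScript, and the look-ahead cap is an illustrative value):

```python
from math import radians, sin, cos

MAX_LOOKAHEAD_S = 10  # illustrative cap for objects that get infrequent updates

def extrapolate(lat, lon, bearing_deg, speed_mps, update_interval_s):
    """Place the marker halfway to where we expect the next position update."""
    dt = min(update_interval_s / 2, MAX_LOOKAHEAD_S)
    distance_m = speed_mps * dt
    # Small-distance approximation: move along the bearing on a locally flat earth.
    dlat = distance_m * cos(radians(bearing_deg)) / 111_320
    dlon = distance_m * sin(radians(bearing_deg)) / (111_320 * cos(radians(lat)))
    return lat + dlat, lon + dlon
```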

Library:
https://www.npmjs.com/package/leaflet.marker.slideto
This lib supports sliding markers to a new position, given a new location and an animation time.

Future work:

  • take into account route to correct bearing and speed (stops, line-speeds, etc.)
  • take into account route info to detect start and end points of routes

Explore UI options for Commuters

Our current idea is that Transitmap is primarily useful for frequent public transit users traveling on their usual routes.
For example, to help choose between alternative routing options that depend on precise change times.

We would like to explore UX/UI options for this use case.

This will likely be an iterative process as we use the app and gather experience with what works.
Wire-frame prototypes or similar might be a good idea as well.

Frontend Performance is weak :(

We seem to have some serious issues with frontend performance (tested in Firefox) since recent updates.
The animation noticeably stutters.

Profiling

The red bars are event processing delays, up to almost 1s, occurring roughly every 3s, so likely in line with when events arrive. 1s delay is way too much and obviously noticeable to the user.

The flamegraph shows that JSON.parse is actually taking up a lot of cycles. Likely this is mostly parsing stops, which make up a big part of the data we send repeatedly. The animation itself is also still expensive, maybe we have some optimization potential there as well (lower framerate, not using icons when zoomed out - although I like the icons).

It might also help to spread out the load of the event stream, instead of events arriving at the client essentially in batches.
