Proposal: redesign the machine with a task queue abstraction in which a single task processes one source: cache, download, conform, and extract a single source at a time. This doesn't require a fundamental rewrite, mostly refactoring of existing code. It assumes all code runs in Python, although shelling out to Node or ogr2ogr should be possible.
The current system runs all sources in sequential stages in a single Python process. process.py invokes code in run.py to first run_all_caches, then run_all_conforms, then run_all_excerpts. It then generates an HTML report from all the status objects. This works fine but has some drawbacks. The whole run currently takes 10 hours and will only get longer as more sources are added. Work is lost if the job fails, which is particularly awkward with EC2 spot instances. And the code mixes application logic with thread and process logic.
I think it makes more sense to slice the work the other way and process each source as an independent job. The sources are themselves independent (different servers, different schemas, etc.). This will make it easier to do test runs on a single source, and easier to re-run a conform only when the source has changed. It also presents a nice atomic unit of work to a task scheduling abstraction.
Details below.
Task
A single task processes a single user-contributed data source file. A task consists of several subtasks to perform sequentially:
- Download from source URL / ESRI store. (Optionally cache it; see below.)
- Conform the data, transforming the source to the OpenAddress CSV schema.
- Extract the data, presenting a few lines from the source for user inspection.
- Communicate task execution stats to the reporting system.
I propose each subtask be written as a Python module. Subtasks should also be able to run standalone on a local machine for development, and each subtask should have tests against a synthetic data source.
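The task/subtask shape above could be sketched roughly as follows. This is a hypothetical illustration, not the existing code: `run_task` and the per-step status record are made-up names, and the real subtasks (download, conform, excerpt, report) would be the Python modules proposed above.

```python
# Sketch of a single-source task composed of sequential subtasks.
# All names here are hypothetical placeholders for the proposed modules.

def run_task(source, subtasks):
    """Run each subtask in order, passing its result to the next.

    Stops at the first failure and records a per-step status, so the
    reporting subtask has something to publish even for partial runs.
    """
    status = {"source": source, "steps": {}}
    data = source
    for subtask in subtasks:
        try:
            data = subtask(data)
            status["steps"][subtask.__name__] = "ok"
        except Exception as e:
            status["steps"][subtask.__name__] = "failed: %s" % e
            break
    return status
```

Because each subtask is just a function on plain data, any one of them can also be run standalone against a local file during development.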
Source caching
The current Node code downloads data from the official source and caches it to S3, and is intelligent about only downloading if the data changed. Mike indicated that the S3 dependence makes development challenging. But the cache is also useful in the (frequent) case that the source goes down; see issue #9. I propose retaining an S3 cache in the system, but making the subtask functions able to work from either S3 or local files.
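One way to get "S3 or local files" is to prefer a local copy and fall back to a fetcher callable. This is a minimal sketch under assumptions: `open_cached` and `s3_fetch` are hypothetical names, and the S3 side is abstracted behind a callable rather than tied to a specific boto call.

```python
import os

def open_cached(source_name, local_dir, s3_fetch=None):
    """Return a readable handle for a cached source file.

    Prefers a local copy (handy for development without S3 access);
    falls back to s3_fetch, a hypothetical callable standing in for
    whatever S3 download the real system uses.
    """
    local = os.path.join(local_dir, os.path.basename(source_name))
    if os.path.exists(local):
        return open(local, "rb")
    if s3_fetch is not None:
        return s3_fetch(source_name)
    raise IOError("no local or S3 copy of %s" % source_name)
```

The subtasks never need to know which branch was taken; they just get a file-like object.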
Task Scheduling
The simplest scheduler is none at all: just run a Python loop over all sources with some sort of basic threading or multiprocessing. That will be functionally equivalent to where we are today, a single monolithic Python process.
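That no-scheduler loop could be as small as this sketch. Names are hypothetical; a thread-based Pool is used here for simplicity, though a real run might prefer processes for isolation.

```python
# Simplest possible "scheduler": a pool mapping the per-source task
# over all sources. multiprocessing.dummy gives a thread-backed Pool
# with the same interface as the process-backed one.
from multiprocessing.dummy import Pool

def process_source(source):
    # Placeholder for the full per-source task: cache, conform, excerpt.
    return (source, "ok")

def run_all(sources, workers=4):
    pool = Pool(workers)
    try:
        return dict(pool.map(process_source, sources))
    finally:
        pool.close()
        pool.join()
```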
We can then migrate to a task queue backed by a persistent store. One job wakes up every minute to check whether any sources have changed and, if so, posts a processing task to the queue. Another job wakes up every minute to run tasks from the queue. Is there some AWS-friendly task abstraction we can reuse rather than writing our own?
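The queue semantics we need are small: post a task, take the next undone task. A minimal sketch using sqlite3 as the persistent store (an AWS-native service like SQS would likely replace this in production; all function names here are made up):

```python
import sqlite3

def make_queue(path):
    """Open (or create) a persistent task queue backed by sqlite."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks "
        "(id INTEGER PRIMARY KEY, source TEXT, done INTEGER DEFAULT 0)"
    )
    return db

def post_task(db, source):
    """The change-watcher job posts one task per changed source."""
    db.execute("INSERT INTO tasks (source) VALUES (?)", (source,))
    db.commit()

def take_task(db):
    """The worker job takes the oldest pending task, or None if idle."""
    row = db.execute(
        "SELECT id, source FROM tasks WHERE done = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (row[0],))
    db.commit()
    return row[1]
```

Because the queue survives process death, a spot-instance termination loses at most the task in flight, not the whole run.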
Reporting
Reporting re-centralizes all the tasks when they are complete. The current report is batch-style: "I ran all the sources and the result on Dec 13 is X." I think it'd be better to move to a rolling report where each task updates its own small task-stats record; the reporter then just groups all task statuses together and publishes an HTML report of the latest status of everything. But that is a product change.
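The rolling-report idea could be sketched like this. The record shape and both function names are assumptions for illustration; the real store would be something persistent and shared, not an in-memory dict.

```python
from datetime import datetime

def update_task_record(records, source, status):
    """Each finished task writes only its own small status record."""
    records[source] = {
        "status": status,
        "updated": datetime.utcnow().isoformat(),
    }

def latest_report(records):
    """The reporter just groups the latest per-source statuses;
    it never has to wait for a full batch run to finish."""
    return sorted((src, rec["status"]) for src, rec in records.items())
```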
Migration Plan
I'd start with a simple Python loop scheduler. This means rewriting process.py:process() to loop over sources. Most of jobs.py would be removed. I think even at this first stage we need some parallelism; perhaps process.py can simply fork a subprocess for each task? The current code for tasks and subtasks mostly exists and just needs to be refactored a bit. (conform.py is not yet complete.) The task status reporting requires some thought.
Once that redesign is working as well as the current setup we can then move on to some more ambitious job scheduling system. I think that choice should be driven by what works best with AWS.
I also think it'd be worth re-examining the logic around caching data in S3 to make sure it's doing the right thing.
It's hubris of me to propose a redesign; I wrote this up after some conversations with Mike about the direction he was moving in.