Proposal: redesign the machine with a task queue abstraction in which a single task processes one source: cache, download, conform, and extract a single source at a time. This doesn't require a fundamental rewrite, mostly refactoring of existing code. It assumes all code runs in Python, although shelling out to Node or ogr2ogr should be possible.
The current system runs all sources in sequential stages in a single Python process. process.py invokes code in run.py to first run_all_caches, then run_all_conforms, then run_all_excerpts. It then generates an HTML report from all the status objects. This works fine but has some drawbacks. The whole run currently takes 10 hours and will only get longer as more sources are added. Work is lost if the job fails, which is particularly awkward with EC2 spot instances. And the code mixes application logic with thread and process logic.
I think it makes more sense to slice the work the other way and process each source as an independent job. The sources are themselves independent (different servers, different schemas, etc.). This will make it easier to do test runs on a single source, and easier to re-run a conform only when the source has changed. It also presents a nice atomic unit of work to a task scheduling abstraction.
Details below.
Task
A single task processes a single user-contributed data source file. A task consists of several subtasks to perform sequentially:
- Download from source URL / ESRI store. (Optionally cache it; see below.)
- Conform the data, transforming the source to the OpenAddress CSV schema.
- Extract the data, presenting a few lines from the source for user inspection.
- Communicate task execution stats to the reporting system.
I propose each subtask be written as a Python module. Subtasks should also be able to run standalone on a local machine for development, and each subtask should have tests against a synthetic data source.
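The task/subtask shape above could be sketched roughly as follows. This is a hypothetical illustration, not the existing code: `run_task` and the per-step status record are made-up names, and the real subtasks (download, conform, excerpt, report) would be the Python modules proposed above.

```python
# Sketch of a single-source task composed of sequential subtasks.
# All names here are hypothetical placeholders for the proposed modules.

def run_task(source, subtasks):
    """Run each subtask in order, passing its result to the next.

    Stops at the first failure and records a per-step status, so the
    reporting subtask has something to publish even for partial runs.
    """
    status = {"source": source, "steps": {}}
    data = source
    for subtask in subtasks:
        try:
            data = subtask(data)
            status["steps"][subtask.__name__] = "ok"
        except Exception as e:
            status["steps"][subtask.__name__] = "failed: %s" % e
            break
    return status
```

Because each subtask is just a function on plain data, any one of them can also be run standalone against a local file during development.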
Source caching
The current Node code downloads data from the official source and caches it to S3, and is intelligent about only downloading if the data changed. Mike indicated that the S3 dependence makes development challenging. But the cache is also useful in the (frequent) case that the source goes down; see issue #9. I propose retaining an S3 cache in the system, but making the subtask functions able to work from either S3 or local files.
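One way to get "S3 or local files" is to prefer a local copy and fall back to a fetcher callable. This is a minimal sketch under assumptions: `open_cached` and `s3_fetch` are hypothetical names, and the S3 side is abstracted behind a callable rather than tied to a specific boto call.

```python
import os

def open_cached(source_name, local_dir, s3_fetch=None):
    """Return a readable handle for a cached source file.

    Prefers a local copy (handy for development without S3 access);
    falls back to s3_fetch, a hypothetical callable standing in for
    whatever S3 download the real system uses.
    """
    local = os.path.join(local_dir, os.path.basename(source_name))
    if os.path.exists(local):
        return open(local, "rb")
    if s3_fetch is not None:
        return s3_fetch(source_name)
    raise IOError("no local or S3 copy of %s" % source_name)
```

The subtasks never need to know which branch was taken; they just get a file-like object.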
Task Scheduling
The simplest scheduler is none at all: just run a Python loop over all sources with some sort of basic threading or multiprocessing. That will be functionally equivalent to where we are today, a single monolithic Python process.
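That no-scheduler loop could be as small as this sketch. Names are hypothetical; a thread-based Pool is used here for simplicity, though a real run might prefer processes for isolation.

```python
# Simplest possible "scheduler": a pool mapping the per-source task
# over all sources. multiprocessing.dummy gives a thread-backed Pool
# with the same interface as the process-backed one.
from multiprocessing.dummy import Pool

def process_source(source):
    # Placeholder for the full per-source task: cache, conform, excerpt.
    return (source, "ok")

def run_all(sources, workers=4):
    pool = Pool(workers)
    try:
        return dict(pool.map(process_source, sources))
    finally:
        pool.close()
        pool.join()
```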
We can then migrate to a task queue backed by a persistent store. One job wakes up every minute to check whether any sources have changed and, if so, posts a processing task to the queue. Another job wakes up every minute to run tasks from the queue. Is there some AWS-friendly task abstraction we can reuse rather than writing our own?
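The queue semantics we need are small: post a task, take the next undone task. A minimal sketch using sqlite3 as the persistent store (an AWS-native service like SQS would likely replace this in production; all function names here are made up):

```python
import sqlite3

def make_queue(path):
    """Open (or create) a persistent task queue backed by sqlite."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks "
        "(id INTEGER PRIMARY KEY, source TEXT, done INTEGER DEFAULT 0)"
    )
    return db

def post_task(db, source):
    """The change-watcher job posts one task per changed source."""
    db.execute("INSERT INTO tasks (source) VALUES (?)", (source,))
    db.commit()

def take_task(db):
    """The worker job takes the oldest pending task, or None if idle."""
    row = db.execute(
        "SELECT id, source FROM tasks WHERE done = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (row[0],))
    db.commit()
    return row[1]
```

Because the queue survives process death, a spot-instance termination loses at most the task in flight, not the whole run.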
Reporting
Reporting re-centralizes all the tasks when they are complete. The current report is batch-style: "I ran all the sources and the result on Dec 13 is X." I think it'd be better to move to a rolling report where each task updates its own small task-stats record; the reporter then just groups all task statuses together and publishes an HTML report of the latest status of everything. But that is a product change.
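The rolling-report idea could be sketched like this. The record shape and both function names are assumptions for illustration; the real store would be something persistent and shared, not an in-memory dict.

```python
from datetime import datetime

def update_task_record(records, source, status):
    """Each finished task writes only its own small status record."""
    records[source] = {
        "status": status,
        "updated": datetime.utcnow().isoformat(),
    }

def latest_report(records):
    """The reporter just groups the latest per-source statuses;
    it never has to wait for a full batch run to finish."""
    return sorted((src, rec["status"]) for src, rec in records.items())
```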
Migration Plan
I'd start with a simple Python loop scheduler. This means rewriting process.py:process() to loop over sources. Most of jobs.py would be removed. I think even at this first stage we need some parallelism; perhaps process.py can simply fork a subprocess for each task? The current code for tasks and subtasks mostly exists and just needs to be refactored a bit. (conform.py is not yet complete.) The task status reporting requires some thought.
Once that redesign is working as well as the current setup we can then move on to some more ambitious job scheduling system. I think that choice should be driven by what works best with AWS.
I also think it'd be worth re-examining the logic around caching data in S3 to make sure it's doing the right thing.
It's hubris of me to propose a redesign; I wrote this up after some conversations with Mike about the direction he was moving in.