
Open Data Maker


The goal of this project is to make it easy to turn a collection of potentially large CSV files into open data: served via an API, and downloadable as smaller CSV files containing a subset of the data.

Preliminary research suggests that open data users (journalists and others) actually know how to work with spreadsheets really well, but a lot of the data sets that we have in government are huge.

The first version of this project will allow us to host a website for an agency with a specific set of CSV files, which are deployed with the app. This will allow us to deploy more quickly, since it carries a lower-risk security profile than letting an agency upload the CSV files (which might be a nice longer-term feature).

Install and Run the App (as a developer)

See our Installation Guide

How this works

By default, data will be loaded from /sample-data when you run rake import

  • cities100.csv - dataset of 100 most populous cities in the US
  • data.yaml - configuration for
    • index name city-data
    • api endpoint name cities
    • how columns are mapped to fields in json output
    • data types
    • unique column names

When you run the app, you can query the dataset via a JSON API, for example: /cities?name=Chicago
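
For reference, the sample data.yaml might look roughly like the following sketch. The keys and the choice of name as the unique column are illustrative only; see sample-data/data.yaml in the repository for the real file.

index: city-data        # Elasticsearch index name
api: cities             # API endpoint name
unique:
  - name                # column(s) treated as unique identifiers
dictionary:             # how CSV columns map to fields (and types) in the JSON output
  ...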

To use your own data, you can set a different directory, for example:

export DATA_PATH='./data'
  1. Put csv files into /data
  2. Import files from /data: rake import (or restart the app)
    1. There can be multiple files (must end in .csv)
    2. Optional data.yaml file that specifies index name, API endpoint, file list, and a dictionary of column -> field name mapping and types
      1. Optionally import all the columns, not just ones specified in dictionary (see example: import: all)
      2. If data.yaml is not provided, all columns will be imported, with the folder or bucket name used as the API endpoint (the name is 'slugified', with dashes replacing spaces)
  3. API endpoint to get the data: /<endpoint>?field_or_column_name=value (see the example below)
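
For example, a hypothetical end-to-end run with your own data might look like this. The schools.csv file, the schools endpoint, and the state column are invented for illustration, and the api: key follows the sketch above; port 3000 is the development server port used elsewhere in this document.

export DATA_PATH='./data'    # ./data contains schools.csv and a data.yaml with "api: schools"
rake import
curl 'http://127.0.0.1:3000/schools?state=CA'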

More Configuration Options

Often while you are developing an API and data dictionary, it is helpful to include all the columns in the csv. If you add the following to data.yaml, the field names and types from the dictionary will be used and any unspecified columns will simply use the column name as the field name.

options:
  columns: all

You can use the dictionary to provide helpful errors to developers who use the API. Used in conjunction with columns: all above, this makes columns that are not referenced in the dictionary unsearchable, so that queries against unspecified fields cause errors to be reported.

options:
  search: dictionary_only

Also, for debugging, you can limit the number of files that will be imported. This is helpful when the import process is time-consuming because you have many, many files, but you want to test format changes with a subset of them.

options:
  limit: 4
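
These are independent keys, so (assuming they sit together under a single options mapping, which is standard YAML) they can be combined in one block, for example:

options:
  columns: all
  search: dictionary_only
  limit: 4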

Help Wanted

  1. Try out importing multiple data sets with different endpoints and data.yaml configuration
  2. Take a look at our open issues and our Contribution Guide

More Info

Here's how it might look in the future:

overview of data types, prompt to download data, create a custom data set, or look at API docs

Download all the data or make choices to create a csv with a subset

Acknowledgements

Zipcode latitude and longitude provided by GeoNames under a Creative Commons Attribution 3.0 License.

Public domain

Except as noted above, this project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.


open-data-maker's Issues

when the app starts up, it should only create the index once

When there is a new data.yaml, it successfully creates the index, but then there are multiple additional calls to create_index. It is nice that we make sure the index exists at the various entry points to DataMagic, but maybe we could just keep track that we've already done it. I think this would speed up indexing many files (and also make the logs less confusing for newcomers).

Unlisted CSVs are imported silently

Two unexpected things currently happen when you rename the .csv file listed in data.yaml and then run rake import. For instance, if you do this:

mv sample-data/cities100.csv sample-data/cities.csv
DATA_PATH=sample-data rake import
  1. The cities.csv file is read anyway (and maybe even gets indexed?) but has no API endpoint associated with it, so it's basically inaccessible.
  2. No warning (or error, which I think is appropriate) is thrown for the missing cities100.csv listed in sample-data/data.yaml, and the expected API endpoint (/school) throws a very ambiguous error:

image

What should probably happen is:

  • Only files listed in data.yaml should be indexed.
  • Any file listed in data.yaml that can't be found should throw an error on import.

Make Elasticsearch and S3 service names dynamic.

Background:

Currently, both elasticsearch and s3 use hardcoded names.

Why:

I'd like to see this change for a couple of reasons which are both in support of adding flexibility and resiliency to college-choice immediately and anything else built on open-data-maker going forward.

Benefits:

  1. Support blue-green style service updates. This will enable out-of-band data loading and fast reversion to previous data snapshots. I'd say this is essential and could be handled by defaulting to 'eservice' unless an ESERVICE_NAME environment variable is found.

  2. Enable straightforward horizontal scaling of the backend elasticsearch. This bears more discussion as it would require more logic in the app. The idea would be to have 2 or more elasticsearch instances available to 2 or more API instances.

    Perhaps there's a straightforward-to-implement hash function that could act on the instance_index against a list of bound elasticsearch instances?

    Otherwise, binding a user service that contains a map of instance indexes to service instances would seem a doable MVP.
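
A minimal sketch of the fallback suggested in point 1, assuming a plain environment-variable lookup in the Ruby app; the helper name is hypothetical, and only the ESERVICE_NAME variable and the 'eservice' default come from the text above:

# Hypothetical helper: resolve the Elasticsearch service name at startup,
# falling back to the hardcoded 'eservice' when ESERVICE_NAME is not set.
def elasticsearch_service_name
  ENV.fetch('ESERVICE_NAME', 'eservice')
end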

As an API user, I would like to be able to sort the results by any field

It should be possible to sort the results with an additional query string parameter, e.g. to sort results by name (ascending by default):

/api?name=foo&sort=name

Elasticsearch's sort parameters allow you to specify multiple fields and orders to sort on, but I'm not sure of the most intuitive way to surface these via query string parameters. Here's a proposal:

  • sort=field:order sorts by a field with an explicit order (asc is the default; desc is descending)

The only problem I see with this is if a field contains the : character, but that feels like an edge case to me. I think it's okay to reserve the sort parameter for this too, though that means that we should probably throw an error if a data.yaml (or bare CSV without one) declares a sort field of its own.

moved multiple column sort into its own issue: #107
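
Under that proposal, hypothetical queries against the sample cities endpoint might look like this (the population field is illustrative):

/cities?state=CA&sort=name              # ascending by default
/cities?state=CA&sort=population:desc   # explicit descending order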

As a data consumer, I would like to get aggregate statistics about the entire data set so that I can make more informed comparisons

One need that we have for the Ed project is to get aggregate statistics, e.g. the median graduation rate for all schools. I'm imagining an API endpoint that you could access like so:

/:endpoint/stats?fields=graduation_rate

When you need more than one field or statistic, as we will in our project, you could just comma-separate them:

/:endpoint/stats?fields=graduation_rate,net_cost
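
A hypothetical response shape for such a request; the particular statistics returned and their names are illustrative only:

{
  "graduation_rate": { "min": ..., "max": ..., "median": ... },
  "net_cost": { "min": ..., "max": ..., "median": ... }
}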

Make imports idempotent

Currently, if you run rake import multiple times, you get duplicated records for that API. It would be nice if imports could be safely run multiple times without worrying about this, especially if some data in the original CSV is updated between loads. There are two possible ways this could be accomplished:

  1. Clear all the data for the API index before the load
  2. Allow the config to specify a key column that is used as a primary key for the data records.

The second option would be preferable, especially for large datasets. This could also allow us to support ElasticSearch's versioning for documents. I know this project is still really new, and I'm super impressed by it thus far, but I just wanted to note this as a future enhancement.
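
For option 2, a hypothetical data.yaml addition; the unique key follows the "unique column names" option mentioned earlier in this README, and the id column is invented for illustration:

unique:
  - id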

As a developer or data analyst, I would like to be able to preview the data in the browser

The image from the README illustrates what this might look like:

csv-download

Specifically:

  • there should be custom criteria inputs that translate directly to query string parameters (see #52)
  • there should be a heading that lists the number of matching rows according to the criteria
  • there should be an HTML table that displays the first 10 (or so) rows of data
  • there should be a download area that links to the filtered data in a specific format:
    • CSV (default)
    • JSON (disabled for large result sets?)
    • Excel, maybe?

on startup, padrino WARNING no supported rendering engine found, but it actually works fine

I set this up with liquid for rendering static content.

On startup, it shows the following warning, but actually works fine:

WARNING: no supported rendering engine found. To use Padrino::Helpers and 
Padrino::Rendering properly you should include `gem 'erubis'`, `gem 'haml'` 
or `gem 'slim'` in your Gemfile. If you are confident about using 
Padrino::Helpers without Padrino::Rendering, please define constant 
`Padrino::IGNORE_NO_RENDERING_ENGINE = true` before `require 'padrino-helpers'`.

When I have geo-coded data, I would like to search by distance from US zip code

We have the API implemented: /places?zip=94132&distance=100mi
If you have fields called location.lat and location.long, those are geo-indexed, but the API does not dynamically figure out the zip code location -- it's currently hardcoded to SFO!

With the completion of this story, the API should work with any arbitrary US zip code

Make sure to remember to add attribution to the README -- the GeoNames dataset, which seems like a good one to use, has a CC-BY license.

Handle null values

Different data sets demarcate null values differently. For instance, some CSVs may use the literal NULL, whereas others might specify null numeric values as -9999 (yes, really). There should be a way to specify what values should be interpreted as null in the data.yaml, for instance:

null_value: 'NULL'

files:
  foo.csv:

Eventually this would be useful to provide on a per-column basis, but for now it should address a good portion of use cases.

Security vulnerabilities in current versions of rack and activesupport

I added open-data-maker to our Gemnasium account, and it is reporting 3 security vulnerabilities due to outdated versions of rack and activesupport, which are gems that padrino depends on.

To fix, run bundle update rack activesupport.

I encourage you to sign up for a Gemnasium account so that you can keep track of your project's dependencies and update them when they become out of date, or when a security issue comes up.
If you send me an email with a list of email addresses for the people who should have access to our Gemnasium account, I'll add them.

Integrate New Relic

Server/app errors are not currently being monitored. Let's get NR integrated.

test failure: import_all indexes rows from all the files

this spec passes in isolation, yet fails with rake spec
checked in with 'xit' as pending

  1) DataMagic#import_all indexes rows from all the files
     Failure/Error: expect(result["total"]).to eq(100)

       expected: 100
            got: 101

       (compared using ==)
     # ./spec/lib/data_magic_spec.rb:219:in `block (3 levels) in <top (required)>'

As a developer or data analyst, I should be able to read some text about what all this data is, so I can decide what I want to do with it

We are typically dealing with large datasets, sometimes from different sources, where data analysts and developers need to understand where the data comes from and exactly what it is, which doesn't all fit in a field name.

This story is for the "Data Details" area in the image below.

image

The text would be specified in data.yaml. We also need descriptive text for each category. Check out the data.yaml proposal for more info.

The JSON representation of data.yaml is being worked on in #8.

Pagination support

Pagination sucks, but we may have to do it to prevent serializing and sending very large responses. It can also be overwhelming to work with an API that sends too much data at once, e.g. when you give it overly general search parameters or forget them entirely.

I've mostly seen it done like this:

{
  "total": 100,
  "page": 1,
  "per_page": 50,
  "results": [ {} ]
}

Then if you pass ?page=2, you get the next page of results. I've seen this done a bunch of (sometimes subtly) different ways, so I don't know if there are best practices. Flickr, for instance, does it this way in XML:

<photos page="2" pages="89" perpage="10" total="881">

Hey @gbinal, can you weigh in on this?

Support multiple endpoints

Currently, only one API is defined in the data.yaml file. However, in the future there will probably be multiple endpoints. Support for this is needed in the config and the data.json endpoint (and potentially other files).

When I search for New York, I shouldn't get New Hampshire

It's currently doing a text search where similar "relevant" documents are found, but that doesn't make sense for proper names like these.

If I searched for "New" I would expect New York and New Hampshire, etc., but searching for New York (in a list of states) should just produce New York.
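
One possible direction, sketched here as an assumption rather than a description of current behavior: query proper-name fields with an Elasticsearch match_phrase (or index them unanalyzed), so that all terms must appear together and "New York" no longer matches "New Hampshire":

{
  "query": {
    "match_phrase": { "name": "New York" }
  }
}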

Pagination is wonky

Something very strange is going on with pagination on the test API. For example, say we request all of the schools in California:

https://api.data.gov/TEST/ed/school?state=CA

The second school in the first page of results (total: 585) is the "San Francisco School", id: 1984. That school shows up again in the next page of data, assuming that the page parameter is zero-based:

https://api.data.gov/TEST/ed/school?state=CA&page=1

Strangely enough, in this response the "San Francisco School" is the first result.

We discovered this by attempting to map the California schools and quickly discovered that while iterating over the paginated data we were consistently getting duplicates. In the 30 pages needed to load 585 results, 49 schools were duplicated, some as many as 19 times.

As a developer, I should be able to query the data with boolean operations (greater than, less than, within a range)

Some initial thoughts:

  • numeric fields should support >, >=, <, <=. Range queries could be implemented as repeated query string parameters.
  • (all?) fields should support set comparisons like SQL IN (foo IN ('a', 'b')). These could also be implemented as repeated query string parameters.
  • (all?) fields should support negation (foo != 'bar').

There are lots of different options for string parameter formats. I like the way that Django QuerySets allow you to specify SQL expressions as kwargs, which could translate to query string parameters like this, which would be the equivalent of foo BETWEEN 100 AND 200 in SQL:

foo__gt=100&foo__lt=200
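
For reference, a pair of parameters like foo__gt=100&foo__lt=200 would translate to roughly the following Elasticsearch range query (a sketch, not the project's current implementation):

{
  "query": {
    "range": {
      "foo": { "gt": 100, "lt": 200 }
    }
  }
}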

JSON output: need configuration for types

A couple of notes on JSON output:

  • We should serve responses with the appropriate MIME type: application/json.
  • It should be possible to declare columns as numeric so that they can be rendered as such in JSON, rather than everything being strings. This can get a little funky with floats, so it may be necessary to tweak whatever JSON serializer we're using to limit float precision (to, say, two decimal places).

Proposal for a data dictionary in data.yaml that specifies types (and other things):

data_dictionary:
  year:   # column created via 'add' directive in files section, source is not required
    type: integer
  state: 
    source: USPS
    description: Two character abbreviation for State or US Territory
    type: string
    title: State or US Territory
  name: CITNM      # default to string
  region_id: 
    source: REGID 
    description: Department of Interior standard region identifier
    type: integer
    title: Region ID
    range: 0..8
  population:
    source: POP10
    type: integer
  location:
    source: 
      lat: INTPTLAT
      lon: INTPTLONG
    type: location
  somelistofnumbers:
    source: [NUM25, NUM50, NUM75]
    type: float

Support data.yml and data.yaml

I've noticed (particularly in 18F projects) that the .yml extension gets used more often than .yaml. We should look for both extensions when reading the data schema.

DATA_PATH should not be required to be set

Not sure if I did something wrong, but the installation guide didn't say to set DATA_PATH to anything, and when I ran rake import without having set that variable I got this error:

rake aborted!
Elasticsearch::Transport::Transport::Errors::NotFound: [404] {"error":"IndexMissingException[[development-city-data] missing]","status":404}
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:135:in `__raise_transport_error'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:227:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/client.rb:119:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/namespace/common.rb:21:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/actions/indices/delete.rb:44:in `delete'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/stretchy-0.4.6/lib/stretchy/utils/client_actions.rb:81:in `delete'
/Users/sameer/code/rails/open-data-maker/lib/data_magic/index.rb:76:in `import_with_dictionary'
tasks/import.rake:9:in `block in <top (required)>'
Tasks: TOP => import

When I exported DATA_PATH to point to ./sample_data, everything worked smoothly.

Happy to submit a PR with that sentence added somewhere in the installation guide, if that's the right idea.

Multiple query string parameters produce 400 Bad Request

Providing more than one search parameter in the query string raises a 400:

http://127.0.0.1:3000/cities?state=CA&name=Fairfield

Some text from the long, cryptic error message:

SearchPhaseExecutionException [Failed to execute phase [query], all shards failed; shardFailures
{[dSaEKL4bT3uiZx963QVbEw][xxx][0]: SearchParseException[[xxx][0]: from[-1],size[-1]:
Parse Failure [Failed to parse source [{\"query\":{\"match\":{\"state\":\"CA\",\"name\":\"Fairfield\"}}}]]];
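
For context, an Elasticsearch match clause only accepts a single field, which is likely why the combined body shown in the error fails to parse. A query body that accepts both parameters would need something like a bool query with one match clause per field (a sketch, not necessarily how the fix should be implemented):

{
  "query": {
    "bool": {
      "must": [
        { "match": { "state": "CA" } },
        { "match": { "name": "Fairfield" } }
      ]
    }
  }
}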

Support gzip encoding

I think we could cut down pretty dramatically on response payload sizes with gzip encoding.

Multi-year data use case

Census (or similar) data is a good example of something that would make sense to combine into single rows. For instance, if you had yearly CSVs of populations by city:

# pop2012.csv
state,city,population
CA,San Francisco,854201

# pop2013.csv
state,city,population
CA,San Francisco,865063

In this case, we might want to create an API that returns JSON for San Francisco that looks like:

{
  "state": "CA",
  "city": "San Francisco",
  "population": {
    "2012": 854201,
    "2013": 865063
  }
}

In the case of the US Census, we'd have many more columns than just population that could be broken down year by year, so being able to nest columns from yearly datasets would be handy. I think the best way to do this is to allow each file to specify a prefix/suffix (e.g. .2012) to be added to specific fields, but I don't have a good sense for how this would be specified in data.yaml.
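
One purely hypothetical way to express this in data.yaml; the suffix key does not exist today and only illustrates the idea above:

files:
  pop2012.csv:
    suffix: '2012'   # nest these columns under a "2012" key
  pop2013.csv:
    suffix: '2013'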

store the config in elasticsearch for more resilient version checking

Right now, we're checking the DATA_VERSION environment variable to see if we should re-index when there's a new config. However, since the app can't update the environment variable across restarts, this creates an error-prone manual step.

Let's keep the config in elasticsearch, then we can store version number in the same index that it is related to.
