
Open Data Maker


The goal of this project is to make it easy to turn a collection of potentially large CSV files into open data: served via an API, and downloadable as smaller CSV files containing a subset of the data.

Preliminary research suggests that open data users (journalists and others) actually know how to work with spreadsheets really well, but a lot of the data sets that we have in government are huge.

The first version of this project will allow us to host a website for an agency with a specific set of CSV files, which are deployed with the app. This will allow us to deploy more quickly, since it carries a lower-risk security profile than letting an agency upload the CSV files (which might be a nice longer-term feature).

Install and Run the App (as a developer)

See our Installation Guide

How this works

By default, data will be loaded from /sample-data when you run rake import

  • cities100.csv - dataset of 100 most populous cities in the US
  • data.yaml - configuration for
    • index name city-data
    • api endpoint name cities
    • how columns are mapped to fields in json output
    • data types
    • unique column names

When you run the app, you can query the dataset via a JSON API, for example: /cities?name=Chicago
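
For reference, the sample data.yaml might look roughly like the following sketch. The keys and the choice of name as the unique column are illustrative only; see sample-data/data.yaml in the repository for the real file.

index: city-data        # Elasticsearch index name
api: cities             # API endpoint name
unique:
  - name                # column(s) treated as unique identifiers
dictionary:             # how CSV columns map to fields (and types) in the JSON output
  ...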

To use your own data, you can set a different directory, for example:

export DATA_PATH='./data'
  1. Put csv files into /data
  2. Import files from /data: rake import (or restart the app)
    1. There can be multiple files (must end in .csv)
    2. Optional data.yaml file that specifies index name, API endpoint, file list, and a dictionary of column -> field name mapping and types
      1. Optionally import all the columns, not just ones specified in dictionary (see example: import: all)
      2. If data.yaml is not provided, all columns will be imported, with the folder or bucket name used as the API endpoint (the name is 'slugified', with dashes replacing spaces)
  3. API endpoint to get the data: /<endpoint>?field_or_column_name=value (see the example below)
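
For example, a hypothetical end-to-end run with your own data might look like this. The schools.csv file, the schools endpoint, and the state column are invented for illustration, and the api: key follows the sketch above; port 3000 is the development server port used elsewhere in this document.

export DATA_PATH='./data'    # ./data contains schools.csv and a data.yaml with "api: schools"
rake import
curl 'http://127.0.0.1:3000/schools?state=CA'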

More Configuration Options

Often while you are developing an API and data dictionary, it is helpful to include all the columns in the csv. If you add the following to data.yaml, the field names and types from the dictionary will be used and any unspecified columns will simply use the column name as the field name.

options:
  columns: all

You can use the dictionary to provide helpful errors to developers who use the API. Used in conjunction with columns: all above, this makes columns that are not referenced in the dictionary unsearchable, so that queries against unspecified fields cause errors to be reported.

options:
  search: dictionary_only

Also, for debugging, you can limit the number of files that will be imported. This is helpful when the import process is time-consuming because you have many, many files, but you want to test format changes with a subset of them.

options:
  limit: 4
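
These are independent keys, so (assuming they sit together under a single options mapping, which is standard YAML) they can be combined in one block, for example:

options:
  columns: all
  search: dictionary_only
  limit: 4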

Help Wanted

  1. Try out importing multiple data sets with different endpoints and data.yaml configuration
  2. Take a look at our open issues and our Contribution Guide

More Info

Here's how it might look in the future:

overview of data types, prompt to download data, create a custom data set, or look at API docs

Download all the data or make choices to create a csv with a subset

Acknowledgements

Zipcode latitude and longitude provided by GeoNames under a Creative Commons Attribution 3.0 License.

Public domain

Except as noted above, this project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.


open-data-maker's Issues

when the app starts up, it should only create the index once

When there is a new data.yaml, it successfully creates the index, but then there are multiple additional calls to create_index. It is nice that we make sure the index exists at the various entry points to DataMagic, but maybe we could just keep track that we've already done it. I think this would speed up indexing many files (and also make the logs less confusing for newcomers).

Unlisted CSVs are imported silently

Two unexpected things currently happen when you rename the .csv file listed in data.yaml and then run rake import. For instance, if you do this:

mv sample-data/cities100.csv sample-data/cities.csv
DATA_PATH=sample-data rake import
  1. The cities.csv file is read anyway (and maybe even gets indexed?) but has no API endpoint associated with it, so it's basically inaccessible.
  2. No warning (or error, which I think is appropriate) is thrown for the missing cities100.csv listed in sample-data/data.yaml, and the expected API endpoint (/school) throws a very ambiguous error:

image

What should probably happen is:

  • Only files listed in data.yaml should be indexed.
  • Any file listed in data.yaml that can't be found should throw an error on import.

Make Elasticsearch and S3 service names dynamic.

Background:

Currently, both elasticsearch and s3 use hardcoded names.

Why:

I'd like to see this change for a couple of reasons which are both in support of adding flexibility and resiliency to college-choice immediately and anything else built on open-data-maker going forward.

Benefits:

  1. Support blue-green style service updates. This will enable out-of-band data loading and fast reversion to previous data snapshots. I'd say this is essential and could be handled by defaulting to 'eservice' unless an ESERVICE_NAME environment variable is found.

  2. Enable straightforward horizontal scaling of the backend elasticsearch. This bears more discussion as it would require more logic in the app. The idea would be to have 2 or more elasticsearch instances available to 2 or more API instances.

    Perhaps there's a straightforward-to-implement hash function that could act on the instance_index against a list of bound elasticsearch instances?

    Otherwise, binding a user service that contains a map of instance indexes to service instances would seem a doable MVP.
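
A minimal sketch of the fallback suggested in point 1, assuming a plain environment-variable lookup in the Ruby app; the helper name is hypothetical, and only the ESERVICE_NAME variable and the 'eservice' default come from the text above:

# Hypothetical helper: resolve the Elasticsearch service name at startup,
# falling back to the hardcoded 'eservice' when ESERVICE_NAME is not set.
def elasticsearch_service_name
  ENV.fetch('ESERVICE_NAME', 'eservice')
end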

As an API user, I would like to be able to sort the results by any field

It should be possible to sort the results with an additional query string parameter, e.g. to sort results by name (ascending by default):

/api?name=foo&sort=name

Elasticsearch's sort parameters allow you to specify multiple fields and orders to sort on, but I'm not sure of the most intuitive way to surface these via query string parameters. Here's a proposal:

  • sort=field:order sorts by a field with an explicit order (asc is the default; desc is descending)

The only problem I see with this is if a field contains the : character, but that feels like an edge case to me. I think it's okay to reserve the sort parameter for this too, though that means that we should probably throw an error if a data.yaml (or bare CSV without one) declares a sort field of its own.

moved multiple column sort into its own issue: #107
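
Under that proposal, hypothetical queries against the sample cities endpoint might look like this (the population field is illustrative):

/cities?state=CA&sort=name              # ascending by default
/cities?state=CA&sort=population:desc   # explicit descending order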

As a data consumer, I would like to get aggregate statistics about the entire data set so that I can make more informed comparisons

One need that we have for the Ed project is to get aggregate statistics, e.g. the median graduation rate for all schools. I'm imagining an API endpoint that you could access like so:

/:endpoint/stats?fields=graduation_rate

When you need more than one field or statistic, as we will in our project, you could just comma-separate them:

/:endpoint/stats?fields=graduation_rate,net_cost
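
A hypothetical response shape for such a request; the particular statistics returned and their names are illustrative only:

{
  "graduation_rate": { "min": ..., "max": ..., "median": ... },
  "net_cost": { "min": ..., "max": ..., "median": ... }
}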

Make imports idempotent

Currently, if you run rake import multiple times, you get duplicated records for that API. It would be nice if imports could be safely run multiple times without worrying about this, especially if some data in the original CSV is updated between loads. There are two possible ways this could be accomplished:

  1. Clear all the data for the API index before the load
  2. Allow the config to specify a key column that is used as a primary key for the data records.

The second option would be preferable, especially for large datasets. This could also allow us to support ElasticSearch's versioning for documents. I know this project is still really new, and I'm super impressed by it thus far, but I just wanted to note this as a future enhancement.
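
For option 2, a hypothetical data.yaml addition; the unique key follows the "unique column names" option mentioned earlier in this README, and the id column is invented for illustration:

unique:
  - id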

As a developer or data analyst, I would like to be able to preview the data in the browser

The image from the README illustrates what this might look like:

csv-download

Specifically:

  • there should be custom criteria inputs that translate directly to query string parameters (see #52)
  • there should be a heading that lists the number of matching rows according to the criteria
  • there should be an HTML table that displays the first 10 (or so) rows of data
  • there should be a download area that links to the filtered data in a specific format:
    • CSV (default)
    • JSON (disabled for large result sets?)
    • Excel, maybe?

on startup, padrino WARNING no supported rendering engine found, but it actually works fine

I set this up with liquid for rendering static content.

On startup, it shows the following warning, but actually works fine:

WARNING: no supported rendering engine found. To use Padrino::Helpers and 
Padrino::Rendering properly you should include `gem 'erubis'`, `gem 'haml'` 
or `gem 'slim'` in your Gemfile. If you are confident about using 
Padrino::Helpers without Padrino::Rendering, please define constant 
`Padrino::IGNORE_NO_RENDERING_ENGINE = true` before `require 'padrino-helpers'`.

When I have geo-coded data, I would like to search by distance from US zip code

We have the API implemented: /places?zip=94132&distance=100mi
If you have fields called location.lat and location.long, those are geo-indexed, but the API does not dynamically figure out the zip code location -- it's currently hardcoded to SFO!

With the completion of this story, the API should work with any arbitrary US zip code

Make sure to remember to add attribution to the README -- the GeoNames dataset, which seems like a good one to use, has a CC-BY license.

Handle null values

Different data sets demarcate null values differently. For instance, some CSVs may use the literal NULL, whereas others might specify null numeric values as -9999 (yes, really). There should be a way to specify what values should be interpreted as null in the data.yaml, for instance:

null_value: 'NULL'

files:
  foo.csv:

Eventually this would be useful to provide on a per-column basis, but for now it should address a good portion of use cases.

Security vulnerabilities in current versions of rack and activesupport

I added open-data-maker to our Gemnasium account, and it is reporting 3 security vulnerabilities due to outdated versions of rack and activesupport, which are gems that padrino depends on.

To fix, run bundle update rack activesupport.

I encourage you to sign up for a Gemnasium account so that you can keep track of your project's dependencies and update them when they become out of date, or when a security issue comes up.
If you send me an email with a list of email addresses for the people who should have access to our Gemnasium account, I'll add them.

Integrate New Relic

Server/app errors are not currently being monitored. Let's get NR integrated.

test failure: import_all indexes rows from all the files

this spec passes in isolation, yet fails with rake spec
checked in with 'xit' as pending

  1) DataMagic#import_all indexes rows from all the files
     Failure/Error: expect(result["total"]).to eq(100)

       expected: 100
            got: 101

       (compared using ==)
     # ./spec/lib/data_magic_spec.rb:219:in `block (3 levels) in <top (required)>'

As a developer or data analyst, I should be able to read some text about what all this data is, so I can decide what I want to do with it

We are typically dealing with large datasets, sometimes from different sources, where data analysts and developers need to understand where the data comes from and exactly what it is, which doesn't all fit in a field name.

This story is for the "Data Details" area in the image below.

image

The text would be specified in data.yaml. We also need descriptive text for each category. Check out the data.yaml proposal for more info.

The JSON representation of data.yaml is being worked on in #8.

Pagination support

Pagination sucks, but we may have to do it to prevent serializing and sending very large responses. It can also be overwhelming to work with an API that sends too much data at once, e.g. when you give it overly general search parameters or forget them entirely.

I've mostly seen it done like this:

{
  "total": 100,
  "page": 1,
  "per_page": 50,
  "results": [ {} ]
}

Then if you pass ?page=2, you get the next page of results. I've seen this done a bunch of (sometimes subtly) different ways, so I don't know if there are best practices. Flickr, for instance, does it this way in XML:

<photos page="2" pages="89" perpage="10" total="881">

Hey @gbinal, can you weigh in on this?

Support multiple endpoints

Currently, only one API is defined in the data.yaml file. However, in the future there will probably be multiple endpoints. Support for this is needed in the config and the data.json endpoint (and potentially other files).

When I search for New York, I shouldn't get New Hampshire

It's currently doing a text search where similar "relevant" documents are found, but that doesn't make sense for proper names like these.

If I searched for "New" I would expect New York and New Hampshire, etc., but searching for New York (in a list of states) should just produce New York.
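
One possible direction, sketched here as an assumption rather than a description of current behavior: query proper-name fields with an Elasticsearch match_phrase (or index them unanalyzed), so that all terms must appear together and "New York" no longer matches "New Hampshire":

{
  "query": {
    "match_phrase": { "name": "New York" }
  }
}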

Pagination is wonky

Something very strange is going on with pagination on the test API. For example, say we request all of the schools in California:

https://api.data.gov/TEST/ed/school?state=CA

The second school in the first page of results (total: 585) is the "San Francisco School", id: 1984. That school shows up again in the next page of data, assuming that the page parameter is zero-based:

https://api.data.gov/TEST/ed/school?state=CA&page=1

Strangely enough, in this response the "San Francisco School" is the first result.

We discovered this by attempting to map the California schools and quickly discovered that while iterating over the paginated data we were consistently getting duplicates. In the 30 pages needed to load 585 results, 49 schools were duplicated, some as many as 19 times.

As a developer, I should be able to query the data with boolean operations (greater than, less than, within a range)

Some initial thoughts:

  • numeric fields should support >, >=, <, <=. Range queries could be implemented as repeated query string parameters.
  • (all?) fields should support set comparisons like SQL IN (foo IN ('a', 'b')). These could also be implemented as repeated query string parameters.
  • (all?) fields should support negation (foo != 'bar').

There are lots of different options for string parameter formats. I like the way that Django QuerySets allow you to specify SQL expressions as kwargs, which could translate to query string parameters like this, which would be the equivalent of foo BETWEEN 100 AND 200 in SQL:

foo__gt=100&foo__lt=200
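
For reference, a pair of parameters like foo__gt=100&foo__lt=200 would translate to roughly the following Elasticsearch range query (a sketch, not the project's current implementation):

{
  "query": {
    "range": {
      "foo": { "gt": 100, "lt": 200 }
    }
  }
}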

JSON output: need configuration for types

A couple of notes on JSON output:

  • We should serve responses with the appropriate MIME type: application/json.
  • It should be possible to declare columns as numeric so that they can be rendered as such in JSON, rather than everything being strings. This can get a little funky with floats, so it may be necessary to tweak whatever JSON serializer we're using to limit float precision (to, say, two decimal places).

Proposal for a data dictionary in data.yaml that specifies types (and other things):

data_dictionary:
  year:   # column created via 'add' directive in files section, source is not required
    type: integer
  state: 
    source: USPS
    description: Two character abbreviation for State or US Territory
    type: string
    title: State or US Territory
  name: CITNM      # default to string
  region_id: 
    source: REGID 
    description: Department of Interior standard region identifier
    type: integer
    title: Region ID
    range: 0..8
  population:
    source: POP10
    type: integer
  location:
    source: 
      lat: INTPTLAT
      lon: INTPTLONG
    type: location
  somelistofnumbers:
    source: [NUM25, NUM50, NUM75]
    type: float

Support data.yml and data.yaml

I've noticed (particularly in 18F projects) that the .yml extension gets used more often than .yaml. We should look for both extensions when reading the data schema.

DATA_PATH should not be required to be set

Not sure if I did something wrong, but the installation guide didn't say to set DATA_PATH to anything, and when I ran rake import without having set that variable I got this error:

rake aborted!
Elasticsearch::Transport::Transport::Errors::NotFound: [404] {"error":"IndexMissingException[[development-city-data] missing]","status":404}
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:135:in `__raise_transport_error'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:227:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/client.rb:119:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/namespace/common.rb:21:in `perform_request'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/actions/indices/delete.rb:44:in `delete'
/Users/sameer/.rvm/gems/ruby-2.2.2@global/gems/stretchy-0.4.6/lib/stretchy/utils/client_actions.rb:81:in `delete'
/Users/sameer/code/rails/open-data-maker/lib/data_magic/index.rb:76:in `import_with_dictionary'
tasks/import.rake:9:in `block in <top (required)>'
Tasks: TOP => import

When I exported DATA_PATH to point to ./sample_data, everything worked smoothly.

Happy to submit a PR with that sentence added somewhere in the installation guide, if that's the right idea.

Multiple query string parameters produce 400 Bad Request

Providing more than one search parameter in the query string raises a 400:

http://127.0.0.1:3000/cities?state=CA&name=Fairfield

Some text from the long, cryptic error message:

SearchPhaseExecutionException [Failed to execute phase [query], all shards failed; shardFailures
{[dSaEKL4bT3uiZx963QVbEw][xxx][0]: SearchParseException[[xxx][0]: from[-1],size[-1]:
Parse Failure [Failed to parse source [{\"query\":{\"match\":{\"state\":\"CA\",\"name\":\"Fairfield\"}}}]]];
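
For context, an Elasticsearch match clause only accepts a single field, which is likely why the combined body shown in the error fails to parse. A query body that accepts both parameters would need something like a bool query with one match clause per field (a sketch, not necessarily how the fix should be implemented):

{
  "query": {
    "bool": {
      "must": [
        { "match": { "state": "CA" } },
        { "match": { "name": "Fairfield" } }
      ]
    }
  }
}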

Support gzip encoding

I think we could cut down pretty dramatically on response payload sizes with gzip encoding.

Multi-year data use case

Census (or similar) data is a good example of something that would make sense to combine into single rows. For instance, if you had yearly CSVs of populations by city:

# pop2012.csv
state,city,population
CA,San Francisco,854201

# pop2013.csv
state,city,population
CA,San Francisco,865063

In this case, we might want to create an API that returns JSON for San Francisco that looks like:

{
  "state": "CA",
  "city": "San Francisco",
  "population": {
    "2012": 854201,
    "2013": 865063
  }
}

In the case of the US Census, we'd have many more columns than just population that could be broken down year by year, so being able to nest columns from yearly datasets would be handy. I think the best way to do this is to allow each file to specify a prefix/suffix (e.g. .2012) to be added to specific fields, but I don't have a good sense for how this would be specified in data.yaml.
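
One purely hypothetical way to express this in data.yaml; the suffix key does not exist today and only illustrates the idea above:

files:
  pop2012.csv:
    suffix: '2012'   # nest these columns under a "2012" key
  pop2013.csv:
    suffix: '2013'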

store the config in elasticsearch for more resilient version checking

Right now, we're checking the DATA_VERSION environment variable to see if we should re-index when there's a new config. However, since the app can't update the environment variable across restarts, this creates an error-prone manual step.

Let's keep the config in elasticsearch, then we can store version number in the same index that it is related to.
